mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 01:11:37 +00:00)

update docs

This commit is contained in:
parent ca7055b667
commit b2e1f68ac8

README.md (21 lines changed)

@@ -1,6 +1,11 @@
wordfreq is a Python library for looking up the frequencies of words in many
languages, based on many sources of data.

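For orientation, a minimal sketch of what such a lookup looks like with wordfreq's public `word_frequency` and `zipf_frequency` functions; the printed values depend on the data snapshot and are only illustrative:

```python
# Minimal sketch of wordfreq lookups; exact values depend on the data snapshot.
from wordfreq import word_frequency, zipf_frequency

# Frequency of a word as a proportion of all words in that language's data
print(word_frequency("the", "en"))

# The same information on the human-friendly Zipf scale (roughly 0 to 8)
print(zipf_frequency("the", "en"))

# Lookups work across many languages, keyed by language codes such as "en" or "es"
print(zipf_frequency("palabra", "es"))
```
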
The word frequencies are a snapshot of language usage through about 2021. I may
continue to make packaging updates, but the data is unlikely to be updated again.
The world where I had a reasonable way to collect reliable word frequencies is
not the world we live in now. See [SUNSET.md](./SUNSET.md) for more information.

Author: Robyn Speer

## Installation

@@ -502,6 +507,22 @@ streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software gives statistics about words that are commonly used on
Twitter; it does not display or republish any Twitter content.

## Can I convert wordfreq to a more convenient form for my purposes, like a CSV file?

No. The CSV format does not have any space for attribution or license
information, and therefore does not follow the CC-By-SA license. Even if you
tried to include the proper attribution in a header or in another file, someone
would likely just strip it out.

wordfreq isn't particularly separable from its code, anyway. It depends on its
normalization and word segmentation process, which is implemented in Python
code, to give appropriate results.

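As a rough sketch of that dependency, using wordfreq's public `tokenize` and `word_frequency` functions (outputs omitted because they vary by version): lookups pass through the library's own segmentation and normalization first, so a raw dump of the stored table would not reproduce its results.

```python
# Sketch: lookups go through wordfreq's own segmentation and normalization,
# so results are not a plain table lookup. Outputs vary by data version.
from wordfreq import tokenize, word_frequency

# The same Unicode-aware tokenizer the lookup path relies on
print(tokenize("Déjà vu, New York!", "en"))

# A multi-word phrase is segmented first and the token frequencies combined,
# rather than being read out of a single stored row
print(word_frequency("New York", "en"))
```
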
A reasonable way to transform wordfreq would be to port the library to another
programming language, with all credits included and packaged in the usual way
for that language.

## Citing wordfreq

If you use wordfreq in your research, please cite it! We publish the code

SUNSET.md (new file, 72 lines added)
@@ -0,0 +1,72 @@
# Why wordfreq will not be updated

The wordfreq data is a snapshot of language that could be found in various
online sources up through 2021. There are several reasons why it will not be
updated anymore.

## Generative AI has polluted the data

I don't think anyone has reliable information about post-2021 language usage by
humans.

The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at
large is full of slop generated by large language models, written by no one to
communicate nothing. Including this slop in the data skews the word
frequencies.

Sure, there was spam in the wordfreq data sources, but it was manageable and
often identifiable. Large language models generate text that masquerades as
real language with intention behind it, even though there is none, and their
output crops up everywhere.

As one example, [Philip Shapira
reports](https://pshapira.net/2024/03/31/delving-into-delve/) that ChatGPT
(OpenAI's popular brand of language model circa 2024) is obsessed with the word
"delve" in a way that people never have been, and caused its overall frequency
to increase by an order of magnitude.

## Information that used to be free became expensive

wordfreq is not just concerned with formal printed words. It collected more
conversational language usage from two sources in particular: Twitter and
Reddit.

The Twitter data was always built on sand. Even when Twitter allowed free
access to a portion of their "firehose", the terms of use did not allow me to
distribute that data outside of the company where I collected it (Luminoso).
wordfreq has the frequencies that were built with that data as input, but the
collected data didn't belong to me and I don't have it anymore.

Now Twitter is gone anyway, its public APIs have shut down, and the site has
been replaced with an oligarch's plaything, a spam-infested right-wing cesspool
called X. Even if X made its raw data feed available (which it doesn't), there
would be no valuable information to be found there.

Reddit also stopped providing public data archives, and now they sell their
archives at a price that only OpenAI will pay.

And given what's happening to the field, I don't blame them.

## I don't want to be part of this scene anymore

wordfreq used to be at the intersection of my interests. I was doing corpus
linguistics in a way that could also benefit natural language processing tools.

The field I know as "natural language processing" is hard to find these days.
It's all being devoured by generative AI. Other techniques still exist but
generative AI sucks up all the air in the room and gets all the money. It's
rare to see NLP research that doesn't have a dependency on closed data
controlled by OpenAI and Google, two companies that I already despise.

I don't want to work on anything that could be confused with generative AI,
or that could benefit generative AI.

OpenAI and Google can collect their own damn data. I hope they have to pay a
very high price for it, and I hope they're constantly cursing the mess that
they made themselves.

— Robyn Speer