diff --git a/README.md b/README.md
index 5d4fc0e..46f4565 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,11 @@
 wordfreq is a Python library for looking up the frequencies of words in many
 languages, based on many sources of data.
 
+The word frequencies are a snapshot of language usage through about 2021. I may
+continue to make packaging updates, but the data is unlikely to be updated again.
+The world where I had a reasonable way to collect reliable word frequencies is
+not the world we live in now. See [SUNSET.md](./SUNSET.md) for more information.
+
 Author: Robyn Speer
 
 ## Installation
@@ -502,6 +507,22 @@
 streaming Twitter API, in accordance with Twitter's Developer Agreement &
 Policy. This software gives statistics about words that are commonly used on
 Twitter; it does not display or republish any Twitter content.
+## Can I convert wordfreq to a more convenient form for my purposes, like a CSV file?
+
+No. The CSV format does not have any space for attribution or license
+information, and therefore does not follow the CC-By-SA license. Even if you
+tried to include the proper attribution in a header or in another file, someone
+would likely just strip it out.
+
+wordfreq isn't particularly separable from its code, anyway. It depends on its
+normalization and word segmentation process, which is implemented in Python
+code, to give appropriate results.
+
+A reasonable way to transform wordfreq would be to port the library to another
+programming language, with all credits included and packaged in the usual way
+for that language.
+
+
 ## Citing wordfreq
 
 If you use wordfreq in your research, please cite it! We publish the code
diff --git a/SUNSET.md b/SUNSET.md
new file mode 100644
index 0000000..f926740
--- /dev/null
+++ b/SUNSET.md
@@ -0,0 +1,72 @@
+# Why wordfreq will not be updated
+
+The wordfreq data is a snapshot of language that could be found in various
+online sources up through 2021. There are several reasons why it will not be
+updated anymore.
+
+
+## Generative AI has polluted the data
+
+I don't think anyone has reliable information about post-2021 language usage by
+humans.
+
+The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at
+large is full of slop generated by large language models, written by no one to
+communicate nothing. Including this slop in the data skews the word
+frequencies.
+
+Sure, there was spam in the wordfreq data sources, but it was manageable and
+often identifiable. Large language models generate text that masquerades as
+real language with intention behind it, even though there is none, and their
+output crops up everywhere.
+
+As one example, [Philip Shapira
+reports](https://pshapira.net/2024/03/31/delving-into-delve/) that ChatGPT
+(OpenAI's popular brand of language model circa 2024) is obsessed with the word
+"delve" in a way that people never have been, and caused its overall frequency
+to increase by an order of magnitude.
+
+
+## Information that used to be free became expensive
+
+wordfreq is not just concerned with formal printed words. It collected more
+conversational language usage from two sources in particular: Twitter and
+Reddit.
+
+The Twitter data was always built on sand. Even when Twitter allowed free
+access to a portion of their "firehose", the terms of use did not allow me to
+distribute that data outside of the company where I collected it (Luminoso).
+wordfreq has the frequencies that were built with that data as input, but the
+collected data didn't belong to me and I don't have it anymore.
+
+Now Twitter is gone anyway, its public APIs have shut down, and the site has
+been replaced with an oligarch's plaything, a spam-infested right-wing cesspool
+called X. Even if X made its raw data feed available (which it doesn't), there
+would be no valuable information to be found there.
+
+Reddit also stopped providing public data archives, and now they sell their
+archives at a price that only OpenAI will pay.
+
+And given what's happening to the field, I don't blame them.
+
+
+## I don't want to be part of this scene anymore
+
+wordfreq used to be at the intersection of my interests. I was doing corpus
+linguistics in a way that could also benefit natural language processing tools.
+
+The field I know as "natural language processing" is hard to find these days.
+It's all being devoured by generative AI. Other techniques still exist but
+generative AI sucks up all the air in the room and gets all the money. It's
+rare to see NLP research that doesn't have a dependency on closed data
+controlled by OpenAI and Google, two companies that I already despise.
+
+I don't want to work on anything that could be confused with generative AI,
+or that could benefit generative AI.
+
+OpenAI and Google can collect their own damn data. I hope they have to pay a
+very high price for it, and I hope they're constantly cursing the mess that
+they made themselves.
+
+— Robyn Speer
+