update docs

2025-04-27 07:33:56 +00:00 · 2024-06-24 19:02:22 -04:00 · b2e1f68ac8
commit b2e1f68ac8
parent ca7055b667
2 changed files with 93 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,11 @@
 wordfreq is a Python library for looking up the frequencies of words in many
 languages, based on many sources of data.
 The word frequencies are a snapshot of language usage through about 2021. I may
 continue to make packaging updates, but the data is unlikely to be updated again.
 The world where I had a reasonable way to collect reliable word frequencies is
 not the world we live in now. See [SUNSET.md](./SUNSET.md) for more information.
 Author: Robyn Speer
 ## Installation
@@ -502,6 +507,22 @@ streaming Twitter API, in accordance with Twitter's Developer Agreement &
 Policy. This software gives statistics about words that are commonly used on
 Twitter; it does not display or republish any Twitter content.
 ## Can I convert wordfreq to a more convenient form for my purposes, like a CSV file?
 No. The CSV format does not have any space for attribution or license
 information, and therefore does not follow the CC-By-SA license. Even if you
 tried to include the proper attribution in a header or in another file, someone
 would likely just strip it out.
 wordfreq isn't particularly separable from its code, anyway. It depends on its
 normalization and word segmentation process, which is implemented in Python
 code, to give appropriate results.
 A reasonable way to transform wordfreq would be to port the library to another
 programming language, with all credits included and packaged in the usual way
 for that language.
 ## Citing wordfreq
 If you use wordfreq in your research, please cite it! We publish the code
--- a/SUNSET.md
+++ b/SUNSET.md
@@ -0,0 +1,72 @@
 # Why wordfreq will not be updated
 The wordfreq data is a snapshot of language that could be found in various
 online sources up through 2021. There are several reasons why it will not be
 updated anymore.
 ## Generative AI has polluted the data
 I don't think anyone has reliable information about post-2021 language usage by
 humans.
 The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at
 large is full of slop generated by large language models, written by no one to
 communicate nothing. Including this slop in the data skews the word
 frequencies.
 Sure, there was spam in the wordfreq data sources, but it was manageable and
 often identifiable. Large language models generate text that masquerades as
 real language with intention behind it, even though there is none, and their
 output crops up everywhere.
 As one example, [Philip Shapira
 reports](https://pshapira.net/2024/03/31/delving-into-delve/) that ChatGPT
 (OpenAI's popular brand of language model circa 2024) is obsessed with the word
 "delve" in a way that people never have been, and has caused its overall frequency
 to increase by an order of magnitude.
 ## Information that used to be free became expensive
 wordfreq is not just concerned with formal printed words. It collected more
 conversational language usage from two sources in particular: Twitter and
 Reddit.
 The Twitter data was always built on sand. Even when Twitter allowed free
 access to a portion of their "firehose", the terms of use did not allow me to
 distribute that data outside of the company where I collected it (Luminoso).
 wordfreq has the frequencies that were built with that data as input, but the
 collected data didn't belong to me and I don't have it anymore.
 Now Twitter is gone anyway, its public APIs have shut down, and the site has
 been replaced with an oligarch's plaything, a spam-infested right-wing cesspool
 called X. Even if X made its raw data feed available (which it doesn't), there
 would be no valuable information to be found there.
 Reddit also stopped providing public data archives, and now they sell their
 archives at a price that only OpenAI will pay.
 And given what's happening to the field, I don't blame them.
 ## I don't want to be part of this scene anymore
 wordfreq used to be at the intersection of my interests. I was doing corpus
 linguistics in a way that could also benefit natural language processing tools.
 The field I know as "natural language processing" is hard to find these days.
 It's all being devoured by generative AI. Other techniques still exist but
 generative AI sucks up all the air in the room and gets all the money. It's
 rare to see NLP research that doesn't have a dependency on closed data
 controlled by OpenAI and Google, two companies that I already despise.
 I don't want to work on anything that could be confused with generative AI,
 or that could benefit generative AI.
 OpenAI and Google can collect their own damn data. I hope they have to pay a
 very high price for it, and I hope they're constantly cursing the mess that
 they made themselves.
 — Robyn Speer