mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 09:21:37 +00:00
146fbae1b3
Remove a misinterpretable sentence about Reddit data
79 lines
3.5 KiB
Markdown
# Why wordfreq will not be updated

The wordfreq data is a snapshot of language that could be found in various
online sources up through 2021. There are several reasons why it will not be
updated anymore.

## Generative AI has polluted the data

I don't think anyone has reliable information about post-2021 language usage by
humans.

The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at
large is full of slop generated by large language models, written by no one to
communicate nothing. Including this slop in the data skews the word
frequencies.

Sure, there was spam in the wordfreq data sources, but it was manageable and
often identifiable. Large language models generate text that masquerades as
real language with intention behind it, even though there is none, and their
output crops up everywhere.

As one example, [Philip Shapira
reports](https://pshapira.net/2024/03/31/delving-into-delve/) that ChatGPT
(OpenAI's popular brand of generative language model circa 2024) is obsessed
with the word "delve" in a way that people never have been, and caused its
overall frequency to increase by an order of magnitude.

## Information that used to be free became expensive

wordfreq is not just concerned with formal printed words. It collected more
conversational language usage from two sources in particular: Twitter and
Reddit.

The Twitter data was always built on sand. Even when Twitter allowed free
access to a portion of their "firehose", the terms of use did not allow me to
distribute that data outside of the company where I collected it (Luminoso).
wordfreq has the frequencies that were built with that data as input, but the
collected data didn't belong to me and I don't have it anymore.

Now Twitter is gone anyway, its public APIs have shut down, and the site has
been replaced with an oligarch's plaything, a spam-infested right-wing cesspool
called X. Even if X made its raw data feed available (which it doesn't), there
would be no valuable information to be found there.

Reddit also stopped providing public data archives, and now they sell their
archives at a price that only OpenAI will pay.

## I don't want to be part of this scene anymore

wordfreq used to be at the intersection of my interests. I was doing corpus
linguistics in a way that could also benefit natural language processing tools.

The field I know as "natural language processing" is hard to find these days.
It's all being devoured by generative AI. Other techniques still exist, but
generative AI sucks up all the air in the room and gets all the money. It's
rare to see NLP research that doesn't have a dependency on closed data
controlled by OpenAI and Google, two companies that I already despise.

wordfreq was built by collecting a whole lot of text in a lot of languages.
That used to be a pretty reasonable thing to do, and not the kind of thing
someone would be likely to object to. Now, the text-slurping tools are mostly
used for training generative AI, and people are quite rightly on the defensive.
If someone is collecting all the text from your books, articles, Web site, or
public posts, it's very likely because they are creating a plagiarism machine
that will claim your words as its own.

So I don't want to work on anything that could be confused with generative AI,
or that could benefit generative AI.

OpenAI and Google can collect their own damn data. I hope they have to pay a
very high price for it, and I hope they're constantly cursing the mess that
they made themselves.

— Robyn Speer