# Why wordfreq will not be updated
The wordfreq data is a snapshot of language that could be found in various
online sources up through 2021. There are several reasons why it will not be
updated anymore.
## Generative AI has polluted the data
I don't think anyone has reliable information about post-2021 language usage by
humans.

The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at
large is full of slop generated by large language models, written by no one to
communicate nothing. Including this slop in the data skews the word
frequencies.

Sure, there was spam in the wordfreq data sources, but it was manageable and
often identifiable. Large language models generate text that masquerades as
real language with intention behind it, even though there is none, and their
output crops up everywhere.

As one example, [Philip Shapira reports](https://pshapira.net/2024/03/31/delving-into-delve/)
that ChatGPT (OpenAI's popular brand of language model circa 2024) is obsessed
with the word "delve" in a way that people never have been, and caused its
overall frequency to increase by an order of magnitude.
## Information that used to be free became expensive
wordfreq is not just concerned with formal printed words. It collected more
conversational language usage from two sources in particular: Twitter and
Reddit.

The Twitter data was always built on sand. Even when Twitter allowed free
access to a portion of their "firehose", the terms of use did not allow me to
distribute that data outside of the company where I collected it (Luminoso).
wordfreq has the frequencies that were built with that data as input, but the
collected data didn't belong to me and I don't have it anymore.

Now Twitter is gone anyway, its public APIs have shut down, and the site has
been replaced with an oligarch's plaything, a spam-infested right-wing cesspool
called X. Even if X made its raw data feed available (which it doesn't), there
would be no valuable information to be found there.

Reddit also stopped providing public data archives, and now they sell their
archives at a price that only OpenAI will pay.

And given what's happening to the field, I don't blame them.
## I don't want to be part of this scene anymore
wordfreq used to be at the intersection of my interests. I was doing corpus
linguistics in a way that could also benefit natural language processing tools.

The field I know as "natural language processing" is hard to find these days.
It's all being devoured by generative AI. Other techniques still exist but
generative AI sucks up all the air in the room and gets all the money. It's
rare to see NLP research that doesn't have a dependency on closed data
controlled by OpenAI and Google, two companies that I already despise.

I don't want to work on anything that could be confused with generative AI,
or that could benefit generative AI.

OpenAI and Google can collect their own damn data. I hope they have to pay a
very high price for it, and I hope they're constantly cursing the mess that
they made themselves.

— Robyn Speer