mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 01:11:37 +00:00
update docs
parent ca7055b667
commit b2e1f68ac8
21
README.md
@@ -1,6 +1,11 @@
wordfreq is a Python library for looking up the frequencies of words in many
languages, based on many sources of data.

The word frequencies are a snapshot of language usage through about 2021. I may
continue to make packaging updates, but the data is unlikely to be updated again.
The world where I had a reasonable way to collect reliable word frequencies is
not the world we live in now. See [SUNSET.md](./SUNSET.md) for more information.

Author: Robyn Speer

## Installation
@@ -502,6 +507,22 @@ streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software gives statistics about words that are commonly used on
Twitter; it does not display or republish any Twitter content.

## Can I convert wordfreq to a more convenient form for my purposes, like a CSV file?

No. The CSV format does not have any space for attribution or license
information, and therefore does not follow the CC-BY-SA license. Even if you
tried to include the proper attribution in a header or in another file, someone
would likely just strip it out.

wordfreq isn't particularly separable from its code, anyway. It depends on its
normalization and word segmentation process, which is implemented in Python
code, to give appropriate results.

A reasonable way to transform wordfreq would be to port the library to another
programming language, with all credits included and packaged in the usual way
for that language.
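
To illustrate why the frequencies are hard to separate from the code that normalizes their input, here is a minimal sketch. It does not use wordfreq or its data: the table `TOY_FREQUENCIES`, its numbers, and the helpers `normalize`, `raw_lookup`, and `normalized_lookup` are all hypothetical, and the normalization shown (Unicode NFKC plus case folding) is only a stand-in for whatever preprocessing the library really applies.

```python
import unicodedata

# Hypothetical frequency table keyed by already-normalized word forms.
# The numbers are made up for illustration; they are not wordfreq data.
TOY_FREQUENCIES = {
    "caf\u00e9": 1.2e-05,  # "café" with a precomposed e-acute
    "the": 5.0e-02,
}


def normalize(word: str) -> str:
    """Toy normalization: Unicode NFKC followed by case folding.

    This is an assumption for the sketch, not wordfreq's real preprocessing.
    """
    return unicodedata.normalize("NFKC", word).casefold()


def raw_lookup(word: str) -> float:
    """Exact-match lookup, as a consumer of a bare exported table might do."""
    return TOY_FREQUENCIES.get(word, 0.0)


def normalized_lookup(word: str) -> float:
    """Look the word up after applying the same normalization as the table."""
    return TOY_FREQUENCIES.get(normalize(word), 0.0)


# "Cafe" followed by U+0301 COMBINING ACUTE ACCENT renders like "Café" but
# is a different code point sequence from the precomposed table key.
decomposed = "Cafe\u0301"

print(raw_lookup(decomposed))         # exact match fails
print(normalized_lookup(decomposed))  # succeeds once the input is normalized
```

A bare table only answers correctly for inputs that already match its keys exactly. Any export that drops the normalization step leaves each consumer to guess at it, which is one reason a port of the whole library is a more reasonable transformation than a data dump.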


## Citing wordfreq

If you use wordfreq in your research, please cite it! We publish the code
72
SUNSET.md
Normal file
@@ -0,0 +1,72 @@
# Why wordfreq will not be updated

The wordfreq data is a snapshot of language that could be found in various
online sources up through 2021. There are several reasons why it will not be
updated anymore.

## Generative AI has polluted the data

I don't think anyone has reliable information about post-2021 language usage by
humans.

The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at
large is full of slop generated by large language models, written by no one to
communicate nothing. Including this slop in the data skews the word
frequencies.

Sure, there was spam in the wordfreq data sources, but it was manageable and
often identifiable. Large language models generate text that masquerades as
real language with intention behind it, even though there is none, and their
output crops up everywhere.

As one example, [Philip Shapira
reports](https://pshapira.net/2024/03/31/delving-into-delve/) that ChatGPT
(OpenAI's popular brand of language model circa 2024) is obsessed with the word
"delve" in a way that people never have been, and caused its overall frequency
to increase by an order of magnitude.

## Information that used to be free became expensive

wordfreq is not just concerned with formal printed words. It collected more
conversational language usage from two sources in particular: Twitter and
Reddit.

The Twitter data was always built on sand. Even when Twitter allowed free
access to a portion of their "firehose", the terms of use did not allow me to
distribute that data outside of the company where I collected it (Luminoso).
wordfreq has the frequencies that were built with that data as input, but the
collected data didn't belong to me and I don't have it anymore.

Now Twitter is gone anyway, its public APIs have shut down, and the site has
been replaced with an oligarch's plaything, a spam-infested right-wing cesspool
called X. Even if X made its raw data feed available (which it doesn't), there
would be no valuable information to be found there.

Reddit also stopped providing public data archives, and now they sell their
archives at a price that only OpenAI will pay.

And given what's happening to the field, I don't blame them.

## I don't want to be part of this scene anymore

wordfreq used to be at the intersection of my interests. I was doing corpus
linguistics in a way that could also benefit natural language processing tools.

The field I know as "natural language processing" is hard to find these days.
It's all being devoured by generative AI. Other techniques still exist but
generative AI sucks up all the air in the room and gets all the money. It's
rare to see NLP research that doesn't have a dependency on closed data
controlled by OpenAI and Google, two companies that I already despise.

I don't want to work on anything that could be confused with generative AI,
or that could benefit generative AI.

OpenAI and Google can collect their own damn data. I hope they have to pay a
very high price for it, and I hope they're constantly cursing the mess that
they made themselves.

— Robyn Speer