mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 01:11:37 +00:00)

update docs

This commit is contained in:
parent ca7055b667
commit b2e1f68ac8

README.md (21 lines changed)

@@ -1,6 +1,11 @@
wordfreq is a Python library for looking up the frequencies of words in many
languages, based on many sources of data.

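For orientation, a minimal sketch of what such a lookup looks like with wordfreq's public `word_frequency` and `zipf_frequency` functions; the printed values depend on the data snapshot and are only illustrative:

```python
# Minimal sketch of wordfreq lookups; exact values depend on the data snapshot.
from wordfreq import word_frequency, zipf_frequency

# Frequency of a word as a proportion of all words in that language's data
print(word_frequency("the", "en"))

# The same information on the human-friendly Zipf scale (roughly 0 to 8)
print(zipf_frequency("the", "en"))

# Lookups work across many languages, keyed by language codes such as "en" or "es"
print(zipf_frequency("palabra", "es"))
```
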
The word frequencies are a snapshot of language usage through about 2021. I may
continue to make packaging updates, but the data is unlikely to be updated again.
The world where I had a reasonable way to collect reliable word frequencies is
not the world we live in now. See [SUNSET.md](./SUNSET.md) for more information.

Author: Robyn Speer

## Installation

@@ -502,6 +507,22 @@ streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software gives statistics about words that are commonly used on
Twitter; it does not display or republish any Twitter content.

## Can I convert wordfreq to a more convenient form for my purposes, like a CSV file?

No. The CSV format does not have any space for attribution or license
information, and therefore does not follow the CC-By-SA license. Even if you
tried to include the proper attribution in a header or in another file, someone
would likely just strip it out.

wordfreq isn't particularly separable from its code, anyway. It depends on its
normalization and word segmentation process, which is implemented in Python
code, to give appropriate results.

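As a rough sketch of that dependency, using wordfreq's public `tokenize` and `word_frequency` functions (outputs omitted because they vary by version): lookups pass through the library's own segmentation and normalization first, so a raw dump of the stored table would not reproduce its results.

```python
# Sketch: lookups go through wordfreq's own segmentation and normalization,
# so results are not a plain table lookup. Outputs vary by data version.
from wordfreq import tokenize, word_frequency

# The same Unicode-aware tokenizer the lookup path relies on
print(tokenize("Déjà vu, New York!", "en"))

# A multi-word phrase is segmented first and the token frequencies combined,
# rather than being read out of a single stored row
print(word_frequency("New York", "en"))
```
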
A reasonable way to transform wordfreq would be to port the library to another
programming language, with all credits included and packaged in the usual way
for that language.

## Citing wordfreq

If you use wordfreq in your research, please cite it! We publish the code

SUNSET.md (new file, 72 lines added)
@@ -0,0 +1,72 @@
# Why wordfreq will not be updated

The wordfreq data is a snapshot of language that could be found in various
online sources up through 2021. There are several reasons why it will not be
updated anymore.

## Generative AI has polluted the data

I don't think anyone has reliable information about post-2021 language usage by
humans.

The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at
large is full of slop generated by large language models, written by no one to
communicate nothing. Including this slop in the data skews the word
frequencies.

Sure, there was spam in the wordfreq data sources, but it was manageable and
often identifiable. Large language models generate text that masquerades as
real language with intention behind it, even though there is none, and their
output crops up everywhere.

As one example, [Philip Shapira
reports](https://pshapira.net/2024/03/31/delving-into-delve/) that ChatGPT
(OpenAI's popular brand of language model circa 2024) is obsessed with the word
"delve" in a way that people never have been, and caused its overall frequency
to increase by an order of magnitude.

## Information that used to be free became expensive

wordfreq is not just concerned with formal printed words. It collected more
conversational language usage from two sources in particular: Twitter and
Reddit.

The Twitter data was always built on sand. Even when Twitter allowed free
access to a portion of their "firehose", the terms of use did not allow me to
distribute that data outside of the company where I collected it (Luminoso).
wordfreq has the frequencies that were built with that data as input, but the
collected data didn't belong to me and I don't have it anymore.

Now Twitter is gone anyway, its public APIs have shut down, and the site has
been replaced with an oligarch's plaything, a spam-infested right-wing cesspool
called X. Even if X made its raw data feed available (which it doesn't), there
would be no valuable information to be found there.

Reddit also stopped providing public data archives, and now they sell their
archives at a price that only OpenAI will pay.

And given what's happening to the field, I don't blame them.

## I don't want to be part of this scene anymore

wordfreq used to be at the intersection of my interests. I was doing corpus
linguistics in a way that could also benefit natural language processing tools.

The field I know as "natural language processing" is hard to find these days.
It's all being devoured by generative AI. Other techniques still exist but
generative AI sucks up all the air in the room and gets all the money. It's
rare to see NLP research that doesn't have a dependency on closed data
controlled by OpenAI and Google, two companies that I already despise.

I don't want to work on anything that could be confused with generative AI,
or that could benefit generative AI.

OpenAI and Google can collect their own damn data. I hope they have to pay a
very high price for it, and I hope they're constantly cursing the mess that
they made themselves.

— Robyn Speer