mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 01:11:37 +00:00
Sep 2024 update based on popular coverage
parent 146fbae1b3
commit bafaf71cdd
27
SUNSET.md
@@ -1,10 +1,26 @@
+# Note from September 2024
+
+This documentation page has gotten a lot of attention recently! I
+think most of the people who find it understand where I'm coming from. I'd
+like to highlight a couple of things, now that people are linking to this
+page from all sorts of contexts.
+
+- I still work on open-source libraries. Here's [ftfy](https://github.com/rspeer/python-ftfy),
+the popular multi-purpose Unicode fixer.
+
+- You could see this freezing of wordfreq data as a good thing. Many people
+have found wordfreq useful, and the latest version isn't going away. The
+conclusion that I'm documenting here is that _updating it would make it
+worse_, so instead, I'm not updating it. It'll become outdated over time,
+but it won't get actively worse. That's a pretty okay fate for something
+on the Internet!
+
 # Why wordfreq will not be updated

 The wordfreq data is a snapshot of language that could be found in various
 online sources up through 2021. There are several reasons why it will not be
 updated anymore.


 ## Generative AI has polluted the data

 I don't think anyone has reliable information about post-2021 language usage by
@@ -29,6 +45,9 @@ overall frequency to increase by an order of magnitude.

 ## Information that used to be free became expensive

+Before I wrote this page, I'd been looking at how I would run the tool that
+updates wordfreq's data sources.
+
 wordfreq is not just concerned with formal printed words. It collected more
 conversational language usage from two sources in particular: Twitter and
 Reddit.
@@ -70,9 +89,7 @@ that will claim your words as its own.
 So I don't want to work on anything that could be confused with generative AI,
 or that could benefit generative AI.

-OpenAI and Google can collect their own damn data. I hope they have to pay a
-very high price for it, and I hope they're constantly cursing the mess that
-they made themselves.
+OpenAI and Google can collect their own damn data, and I hope they have to pay a
+very high price for it. They made this mess themselves.

 — Robyn Speer