From bafaf71cdd5ec8273b7f038dbaa7ef82f9c782be Mon Sep 17 00:00:00 2001
From: "Elia Robyn Lake (Robyn Speer)"
Date: Sun, 22 Sep 2024 20:58:30 -0400
Subject: [PATCH] Sep 2024 update based on popular coverage

---
 SUNSET.md | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/SUNSET.md b/SUNSET.md
index e3fed21..473e8c8 100644
--- a/SUNSET.md
+++ b/SUNSET.md
@@ -1,10 +1,26 @@
+# Note from September 2024
+
+This documentation page has gotten a lot of attention recently! I
+think most of the people who find it understand where I'm coming from. I'd
+like to highlight a couple of things, now that people are linking to this
+page from all sorts of contexts.
+
+- I still work on open-source libraries. Here's [ftfy](https://github.com/rspeer/python-ftfy),
+  the popular multi-purpose Unicode fixer.
+
+- You could see this freezing of wordfreq data as a good thing. Many people
+  have found wordfreq useful, and the latest version isn't going away. The
+  conclusion that I'm documenting here is that _updating it would make it
+  worse_, so instead, I'm not updating it. It'll become outdated over time,
+  but it won't get actively worse. That's a pretty okay fate for something
+  on the Internet!
+
 # Why wordfreq will not be updated
 
 The wordfreq data is a snapshot of language that could be found in various
 online sources up through 2021. There are several reasons why it will not be
 updated anymore.
 
-
 ## Generative AI has polluted the data
 
 I don't think anyone has reliable information about post-2021 language usage by
@@ -29,6 +45,9 @@ overall frequency to increase by an order of magnitude.
 
 ## Information that used to be free became expensive
 
+Before I wrote this page, I'd been looking at how I would run the tool that
+updates wordfreq's data sources.
+
 wordfreq is not just concerned with formal printed words. It collected more
 conversational language usage from two sources in particular: Twitter and
 Reddit.
@@ -70,9 +89,7 @@ that will claim your words as its own.
 So I don't want to work on anything that could be confused with generative AI,
 or that could benefit generative AI.
 
-OpenAI and Google can collect their own damn data. I hope they have to pay a
-very high price for it, and I hope they're constantly cursing the mess that
-they made themselves.
+OpenAI and Google can collect their own damn data, and I hope they have to pay a
+very high price for it. They made this mess themselves.
 
 — Robyn Speer
-