Commit Graph

455 Commits

Author SHA1 Message Date
Rob Speer
60a7c4d1ec revise the process of building Wikipedia counts 2015-03-30 18:09:07 -04:00
Rob Speer
58da7797da Fix Dutch lists
- Use surface forms consistently, not stems
- Count all instances of words on Wikipedia, not one per article


Former-commit-id: 3507d8b630
2015-03-12 16:00:03 -04:00
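The two counting schemes named in the Dutch fix above differ in a way that a small sketch makes concrete. This is an illustration only: the tokenized articles are made up, and the function names are hypothetical rather than taken from the wordfreq build code.

```python
from collections import Counter


def count_once_per_article(articles):
    """Count how many articles contain each word (at most one per article)."""
    counts = Counter()
    for tokens in articles:
        counts.update(set(tokens))  # set() collapses repeats within an article
    return counts


def count_all_instances(articles):
    """Count every occurrence of each word across all articles."""
    counts = Counter()
    for tokens in articles:
        counts.update(tokens)
    return counts


# Made-up sample data: two tiny "articles", already tokenized.
articles = [["de", "kat", "en", "de", "hond"], ["de", "hond"]]

once = count_once_per_article(articles)
every = count_all_instances(articles)
# "de" appears in 2 articles but occurs 3 times in total.
```

Words that repeat within an article are the ones whose counts change between the two schemes.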
Andrew Lin
6f1dd0280c Merge pull request #3 from LuminosoInsight/variable_name_fix
Fix a variable name for clarity.

Former-commit-id: cfe58cd899
2015-03-11 14:10:53 -04:00
Rob Speer
9059d6db9e new Dutch data, bump version to 0.6
Former-commit-id: 377336bcdc
2015-03-03 15:54:45 -05:00
Andrew Lin
4b722a1f79 Fix a variable name for clarity.
Former-commit-id: 434c603798
2015-03-03 11:59:46 -05:00
Andrew Lin
a82dfedeba Merge pull request #2 from LuminosoInsight/new-twitter-lists
New twitter lists

Former-commit-id: 5a4d3a87d5
2015-02-17 15:36:13 -05:00
Rob Speer
e6e5f6c9bf add surface forms from Twitter 2014 data
Former-commit-id: ffdaa82b11
2015-02-17 15:06:11 -05:00
Rob Speer
a51e9b062e stop running 'remove_unsafe_private_use' unnecessarily
Former-commit-id: b6f246ecbb
2015-02-17 14:02:36 -05:00
Rob Speer
be2b68b1de enable wordlist balancing, surface form counting 2015-02-17 13:43:22 -05:00
Rob Speer
46b3cdbcd4 add twitter-stems-2014 wordlist data
Former-commit-id: 6ab72201cd
2015-02-11 13:29:32 -05:00
Rob Speer
fcd6044c2d add utility for combining wordlists 2015-02-11 11:45:10 -05:00
Rob Speer
d3374a9fe1 command-line entry points 2015-02-10 12:28:29 -05:00
Rob Speer
693c35476f Initial commit 2015-02-04 20:19:36 -05:00
Rob Speer
3aa6ced5cd Allow multithreaded SQLite on Python 3
Former-commit-id: bf0071fd8b
2014-10-02 18:10:09 -04:00
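The commit above does not say how multithreaded access was enabled. One standard-library way to share a SQLite connection across threads in Python 3 is to pass `check_same_thread=False` to `sqlite3.connect` and serialize access yourself. The sketch below assumes that approach; the table layout and names are hypothetical.

```python
import sqlite3
import threading

# check_same_thread=False lets threads other than the creator use the connection.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, freq REAL)")
conn.execute("INSERT INTO words VALUES (?, ?)", ("the", 0.05))
conn.commit()

lock = threading.Lock()  # the caller is responsible for serializing access
results = []


def lookup(word):
    """Look up a word's frequency from a worker thread."""
    with lock:
        row = conn.execute(
            "SELECT freq FROM words WHERE word = ?", (word,)
        ).fetchone()
    results.append(row[0] if row else None)


worker = threading.Thread(target=lookup, args=("the",))
worker.start()
worker.join()
```

Without the flag, the `execute` call in the worker thread would raise `sqlite3.ProgrammingError`.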
Rob Speer
789659128a construct the download path correctly, even on Windows
Former-commit-id: 6d90cef415
2014-09-08 10:56:48 -04:00
Rob Speer
dd3514a506 remove unused global
Former-commit-id: c55a701885
2014-09-02 14:29:31 -04:00
Rob Speer
8ac65dc644 cleanups to building and uploading, from code review
Former-commit-id: 5dee417302
2014-08-18 14:14:01 -04:00
Rob Speer
33dd311450 Add license text for the whole package
Former-commit-id: cb7b2b76e6
2014-06-02 16:37:32 -04:00
Rob Speer
c7c8078883 A different plan for the top-level word_frequency function.
When, before, I was importing wordfreq.query at the top level, this
created a dependency loop when installing wordfreq.

The new top-level __init__.py provides just a `word_frequency` function,
which imports the real function as needed and calls it. This should
avoid the dependency loop, at the cost of making
`wordfreq.word_frequency` slightly less efficient than
`wordfreq.query.word_frequency`.


Former-commit-id: 44ccf40742
2014-02-24 18:03:31 -05:00
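The deferred-import pattern described in the commit above can be sketched roughly as follows. This is not the package's actual `__init__.py`: the helper `make_lazy` is hypothetical, and `math.sqrt` stands in for `wordfreq.query.word_frequency` so the example runs without wordfreq installed.

```python
import importlib


def make_lazy(module_name, func_name):
    """Return a wrapper that imports module_name only when it is called."""
    def wrapper(*args, **kwargs):
        # Nothing is imported when the wrapper is defined, so importing the
        # package that holds it cannot trigger a dependency loop.
        # After the first call, import_module is a lookup in sys.modules,
        # which accounts for the small per-call overhead.
        module = importlib.import_module(module_name)
        return getattr(module, func_name)(*args, **kwargs)
    return wrapper


# Stand-in target; the commit's wrapper would point at wordfreq.query.word_frequency.
lazy_sqrt = make_lazy("math", "sqrt")
value = lazy_sqrt(16)
```

The trade-off matches the one the commit message names: each call pays for a module lookup and an attribute lookup that a direct import would avoid.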
Rob Speer
e29df8346d version 0.4: minor code changes, debugged database
- The database is built under Python 3.3.2, so it should correctly
  implement Python 3's Unicode tricks, including special handling
  of Greek lowercase letters. (Version 0.3 was supposed to do this
  as well, but apparently, it didn't.)
- `word_frequency` and `iter_wordlist` can be imported from the
  top level.
- The new function `random_words` supplies a string made from
  random words that are sufficiently high in rank order.


Former-commit-id: 3702a7c8d0
2014-02-24 16:29:06 -05:00
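A function like the `random_words` described above might look something like the following sketch. The signature, parameter names, and defaults here are assumptions for illustration; only the idea of joining randomly chosen high-ranked words into a string comes from the commit message.

```python
import random


def random_words(ranked_words, nwords=4, top_n=1000, rng=random):
    """Join nwords random picks from the top_n most frequent words.

    ranked_words is assumed to be sorted from most to least frequent.
    """
    candidates = ranked_words[:top_n]
    return " ".join(rng.choice(candidates) for _ in range(nwords))


# Made-up ranked list; "zyzzyva" falls outside the top 4 and is never chosen.
ranked = ["the", "of", "and", "to", "zyzzyva"]
phrase = random_words(ranked, nwords=3, top_n=4)
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible.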
Rob Speer
dc16996458 Sometimes you need some random words.
Former-commit-id: 3447ae732e
2014-01-06 15:51:10 -05:00
Andrew Lin
3340367519 Remove the tests for metanl_word_frequency too. Doh.
Former-commit-id: 68d262791c
2013-11-11 13:21:25 -05:00
Rob Speer
65f61d8a2e Merge pull request #1 from LuminosoInsight/remove_metanl_wf
Remove metanl_word_frequency(), which we no longer need.

Former-commit-id: 63bebe6ad3
2013-11-11 10:13:25 -08:00
Rob Speer
d63c868ba8 data is now hosted on wordfreq.services.luminoso.com
Former-commit-id: 56f2c606f1
2013-11-07 14:43:15 -05:00
Andrew Lin
a70b3847a6 Remove metanl_word_frequency(), which we no longer need.
Former-commit-id: 76a7267670
2013-11-04 16:51:25 -05:00
Rob Speer
1edee91b05 Clear wordlists before inserting them; yell at Python 2
Former-commit-id: 823b3828cd
2013-11-01 19:29:37 -04:00
Rob Speer
38266b9916 Revert "code review and pep8 fixes"
This reverts commit 8ba4a6660e [formerly b4b8ba8be7].

Conflicts:
	wordfreq/transfer.py

Former-commit-id: 5c8ba34492
2013-11-01 17:33:39 -04:00
Rob Speer
cb3821a304 Merge branch 'master' of github.com:LuminosoInsight/wordfreq
Conflicts:
	wordfreq/transfer.py

Former-commit-id: 90e042f196
2013-11-01 17:05:59 -04:00
Rob Speer
8ba4a6660e code review and pep8 fixes
Former-commit-id: b4b8ba8be7
2013-11-01 17:05:12 -04:00
Lance Nathan
2e41138adb Two small stylistic tweaks
Former-commit-id: ea29469643
2013-10-31 16:00:48 -04:00
Rob Speer
280eca22ce make the tests less picky about numerical exactness
Former-commit-id: 2b2bd943d2
2013-10-31 15:43:19 -04:00
Rob Speer
26852dde89 try to match the wordlist metanl actually uses
Former-commit-id: 90772e33fb
2013-10-31 15:13:22 -04:00
Rob Speer
def8a71b44 The metanl scale is not what I thought it was.
Former-commit-id: 0d2fb21726
2013-10-31 14:38:01 -04:00
Rob Speer
63b465c767 Don't download the DB if the right version is already there
Former-commit-id: e931062b5a
2013-10-31 14:12:04 -04:00
Rob Speer
8c3e8f9eb4 try being really nonspecific about functools32 versions
Former-commit-id: c1564908f2
2013-10-31 14:06:06 -04:00
Rob Speer
676cba640f be less specific about the functools32 version
Former-commit-id: 2542cf9e35
2013-10-31 14:02:40 -04:00
Rob Speer
9fd9028d3c Add wordfreq_data files.
Now the build process is repeatable from scratch, even if something goes
wrong with the download server.


Former-commit-id: 26c0d7dd28
2013-10-31 13:39:02 -04:00
Rob Speer
2cf812a64e When strings are inconsistent between py2 and 3, don't test them on py2. 2013-10-31 13:11:13 -04:00
Rob Speer
10115f3965 add util.py, which provides standardize_word 2013-10-30 18:14:43 -04:00
Rob Speer
2f7572e3fc and of course this changes the metanl constant 2013-10-30 18:14:34 -04:00
Rob Speer
8ef11fd33c Turns out we need to change the metanl constant after normalizing words. 2013-10-30 16:58:10 -04:00
Rob Speer
40102a3f63 Normalize words when storing them or looking them up. 2013-10-30 14:59:57 -04:00
Rob Speer
3063b3915a Revise the build test to compare lengths of wordlists.
The test currently fails on Python 3, for some strange reason.
2013-10-30 13:22:56 -04:00
Lance
ce07c881c5 Another Py3 change, this one for functools32 2013-10-30 12:06:41 -04:00
Lance
357cbb531e Py3 tweak to urllib import 2013-10-30 11:57:50 -04:00
Rob Speer
be183b2564 Change default values to offsets. 2013-10-29 18:06:47 -04:00
Rob Speer
2907f7f077 now this package has tests 2013-10-29 17:21:55 -04:00
Rob Speer
ca5b3e2f5d Implement the data uploady downloady stuff in setup. 2013-10-29 16:44:13 -04:00
Rob Speer
793893e738 Deal with database connections more consistently 2013-10-29 16:43:58 -04:00
Rob Speer
c475415f74 Add a couple of useful statistics about wordlists 2013-10-29 16:42:38 -04:00