Commit Graph

209 Commits

Author SHA1 Message Date
Robyn Speer
f3958d63ae Switch to a more precise centibel scale.
Former-commit-id: 7862a4d2b6
2015-06-22 17:36:30 -04:00
Joshua Chin
7ef897e739 updated data with a clean build
Former-commit-id: 90fc3970c6
2015-06-18 11:38:57 -04:00
Robyn Speer
c980cd883a copy-edit some docstrings
Former-commit-id: 981cc249f1
2015-06-17 14:42:38 -04:00
Joshua Chin
7e5a1476c0 updated dB_to_freq docstring
Former-commit-id: 68b1c121bd
2015-06-17 14:34:03 -04:00
Joshua Chin
ac7c71ae70 clarified the docstrings for random_words and random_ascii_words
Former-commit-id: a289ab7f8b
2015-06-17 14:26:06 -04:00
Joshua Chin
e877e2120b corrected available_languages to return a dict of strs to strs
Former-commit-id: 9f288bac31
2015-06-17 12:43:13 -04:00
Joshua Chin
0e25c91c93 changed yield to yield from in iter_wordlist
Former-commit-id: 6cc962bfea
2015-06-17 12:38:31 -04:00
Joshua Chin
dbb588a1cf updated db_to_freq docstring
Former-commit-id: 9b30da4dec
2015-06-17 12:24:23 -04:00
Joshua Chin
fff550b233 added docstrings
Former-commit-id: 7e808bf7c1
2015-06-17 12:20:50 -04:00
Joshua Chin
76c504bdd2 removed temporary variable
Former-commit-id: 053e4da3e6
2015-06-17 12:16:02 -04:00
Robyn Speer
860e929bf8 update Japanese data; test Japanese and token combining
Former-commit-id: 611a6a35de
2015-05-28 14:01:56 -04:00
Robyn Speer
5db3c4ef9e Work on making Japanese tokenization use MeCab consistently
Former-commit-id: 05cf94d1fd
2015-05-27 18:10:25 -04:00
Robyn Speer
65f6107d36 rebuild data
Former-commit-id: 84e5edcea1
2015-05-21 20:36:15 -04:00
Robyn Speer
8954061a2a allow more language matches; reorder some parameters
Former-commit-id: b42594fa5f
2015-05-21 20:35:02 -04:00
Robyn Speer
aa0e844b81 add new data files from wordfreq_builder
Former-commit-id: 35aec061de
2015-05-11 18:45:47 -04:00
Robyn Speer
f92598b13d WIP: burn stuff down
Former-commit-id: 9b63e54471
2015-05-08 15:28:52 -04:00
Robyn Speer
5ef406fd43 fix reused variable name
Former-commit-id: 506073030a
2015-05-06 17:06:37 -04:00
Robyn Speer
cb6b2a8002 v0.7: make a proper Dutch 'surfaces' list
Former-commit-id: 873ace87db
2015-04-30 13:01:24 -04:00
Robyn Speer
b4dfdaa47c Merge branch 'master' into dutch-201503
Conflicts:
	wordfreq/build.py

Former-commit-id: 6cf46ee5aa
2015-04-29 14:36:24 -04:00
Robyn Speer
d29e8bfddf start a new multilingual wordlist called 'stems'
So far, this wordlist is only in Dutch.


Former-commit-id: af5f65b328
2015-03-31 15:59:30 -04:00
Robyn Speer
ca944e54aa new Dutch data, bump version to 0.6
Former-commit-id: 377336bcdc
2015-03-03 15:54:45 -05:00
Andrew Lin
6882ac9f0e Fix a variable name for clarity.
Former-commit-id: 434c603798
2015-03-03 11:59:46 -05:00
Robyn Speer
ad22387a53 add surface forms from Twitter 2014 data
Former-commit-id: ffdaa82b11
2015-02-17 15:06:11 -05:00
Robyn Speer
8d57b39a7b stop running 'remove_unsafe_private_use' unnecessarily
Former-commit-id: b6f246ecbb
2015-02-17 14:02:36 -05:00
Robyn Speer
f4280dcad0 add twitter-stems-2014 wordlist data
Former-commit-id: 6ab72201cd
2015-02-11 13:29:32 -05:00
Robyn Speer
03fac20b1b Allow multithreaded SQLite on Python 3
Former-commit-id: bf0071fd8b
2014-10-02 18:10:09 -04:00
Robyn Speer
5153faf43e construct the download path correctly, even on Windows
Former-commit-id: 6d90cef415
2014-09-08 10:56:48 -04:00
Robyn Speer
0c61406cdc remove unused global
Former-commit-id: c55a701885
2014-09-02 14:29:31 -04:00
Robyn Speer
b357ffaa09 cleanups to building and uploading, from code review
Former-commit-id: 5dee417302
2014-08-18 14:14:01 -04:00
Robyn Speer
a06c3fc648 A different plan for the top-level word_frequency function.
When, before, I was importing wordfreq.query at the top level, this
created a dependency loop when installing wordfreq.

The new top-level __init__.py provides just a `word_frequency` function,
which imports the real function as needed and calls it. This should
avoid the dependency loop, at the cost of making
`wordfreq.word_frequency` slightly less efficient than
`wordfreq.query.word_frequency`.


Former-commit-id: 44ccf40742
2014-02-24 18:03:31 -05:00
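
The deferred-import approach described in commit a06c3fc648 can be sketched roughly as follows. This is only an illustration, not the code from that commit: `demo_query` is a hypothetical stand-in for a submodule such as `wordfreq.query`, the function signature and the frequency table are invented, and the module is registered in `sys.modules` by hand so that the example runs without the package installed.

```python
import sys
import types

# Hypothetical stand-in for a submodule like `wordfreq.query`.
# The name, signature, and data below are assumptions for illustration only.
_query = types.ModuleType("demo_query")


def _real_word_frequency(word, lang, default=0.0):
    """Look up a word in a tiny, made-up frequency table."""
    table = {("the", "en"): 0.05}
    return table.get((word, lang), default)


_query.word_frequency = _real_word_frequency
sys.modules["demo_query"] = _query


def word_frequency(word, lang, default=0.0):
    """Top-level wrapper that imports the real function only when called.

    Because the import statement runs at call time rather than at module
    import time, importing the top-level package does not pull in the
    submodule (or its dependencies) during installation.
    """
    from demo_query import word_frequency as real_word_frequency
    return real_word_frequency(word, lang, default)
```

The cost is the one the commit message names: each call through the wrapper repeats the import lookup, so it is slightly slower than calling the submodule's function directly.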
Robyn Speer
b6b3a6f5f6 version 0.4: minor code changes, debugged database
- The database is built under Python 3.3.2, so it should correctly
  implement Python 3's Unicode tricks, including special handling
  of Greek lowercase letters. (Version 0.3 was supposed to do this
  as well, but apparently, it didn't.)
- `word_frequency` and `iter_wordlist` can be imported from the
  top level.
- The new function `random_words` supplies a string made from
  random words that are sufficiently high in rank order.


Former-commit-id: 3702a7c8d0
2014-02-24 16:29:06 -05:00
Robyn Speer
207defe6ff Sometimes you need some random words.
Former-commit-id: 3447ae732e
2014-01-06 15:51:10 -05:00
Robyn Speer
f369df3e82 Merge pull request #1 from LuminosoInsight/remove_metanl_wf
Remove metanl_word_frequency(), which we no longer need.

Former-commit-id: 63bebe6ad3
2013-11-11 10:13:25 -08:00
Robyn Speer
634cf6af6d data is now hosted on wordfreq.services.luminoso.com
Former-commit-id: 56f2c606f1
2013-11-07 14:43:15 -05:00
Andrew Lin
cf45720f66 Remove metanl_word_frequency(), which we no longer need.
Former-commit-id: 76a7267670
2013-11-04 16:51:25 -05:00
Robyn Speer
5f7c7e032c Clear wordlists before inserting them; yell at Python 2
Former-commit-id: 823b3828cd
2013-11-01 19:29:37 -04:00
Robyn Speer
5fc933495f Revert "code review and pep8 fixes"
This reverts commit ae6e03fa06 [formerly b4b8ba8be7].

Conflicts:
	wordfreq/transfer.py

Former-commit-id: 5c8ba34492
2013-11-01 17:33:39 -04:00
Robyn Speer
4d904a3bae Merge branch 'master' of github.com:LuminosoInsight/wordfreq
Conflicts:
	wordfreq/transfer.py

Former-commit-id: 90e042f196
2013-11-01 17:05:59 -04:00
Robyn Speer
ae6e03fa06 code review and pep8 fixes
Former-commit-id: b4b8ba8be7
2013-11-01 17:05:12 -04:00
Lance Nathan
cbb3207e4f Two small stylistic tweaks
Former-commit-id: ea29469643
2013-10-31 16:00:48 -04:00
Robyn Speer
313306f12e try to match the wordlist metanl actually uses
Former-commit-id: 90772e33fb
2013-10-31 15:13:22 -04:00
Robyn Speer
773f6b9843 The metanl scale is not what I thought it was.
Former-commit-id: 0d2fb21726
2013-10-31 14:38:01 -04:00
Robyn Speer
101e767ad9 When strings are inconsistent between py2 and 3, don't test them on py2.
2013-10-31 13:11:13 -04:00
Robyn Speer
52bcb99c48 add util.py, which provides standardize_word
2013-10-30 18:14:43 -04:00
Robyn Speer
5b31bd415f and of course this changes the metanl constant
2013-10-30 18:14:34 -04:00
Robyn Speer
4bda3e6b6f Turns out we need to change the metanl constant after normalizing words.
2013-10-30 16:58:10 -04:00
Robyn Speer
8f00846117 Normalize words when storing them or looking them up.
2013-10-30 14:59:57 -04:00
Robyn Speer
ea5de7cb2a Revise the build test to compare lengths of wordlists.
The test currently fails on Python 3, for some strange reason.
2013-10-30 13:22:56 -04:00
Lance
74cfb69f5a Another Py3 change, this one for functools32
2013-10-30 12:06:41 -04:00
Lance
de41143159 Py3 tweak to urllib import
2013-10-30 11:57:50 -04:00
Robyn Speer
68f7b25cf7 Change default values to offsets.
2013-10-29 18:06:47 -04:00
Robyn Speer
8a48e57749 now this package has tests
2013-10-29 17:21:55 -04:00
Robyn Speer
a95d88d1b9 Implement the data uploady downloady stuff in setup.
2013-10-29 16:44:13 -04:00
Robyn Speer
91a62dbee5 Deal with database connections more consistently
2013-10-29 16:43:58 -04:00
Robyn Speer
4fc1971b0f Add a couple of useful statistics about wordlists
2013-10-29 16:42:38 -04:00
Robyn Speer
67fefa5dd5 add query.iter_wordlist, to visit all words in a list
2013-10-29 12:44:16 -04:00
Robyn Speer
c0ed89c015 revise config.py, clarify some of query.py
2013-10-29 12:18:38 -04:00
Robyn Speer
a92fed80cf better default parameters and better log messages in building
2013-10-29 12:04:17 -04:00
Robyn Speer
e8273e47a1 Initial version.
Noticeably missing: data files or any way to get them.
2013-10-28 19:26:44 -04:00