Commit Graph

83 Commits

Author SHA1 Message Date
Rob Speer
b7f125e1d9 copy-edit some docstrings
Former-commit-id: 981cc249f1
2015-06-17 14:42:38 -04:00
Rob Speer
632b6992fe Merge pull request #8 from LuminosoInsight/newbuild-refactor
Refactored the newbuild branch, in response to the preliminary review notes

Former-commit-id: 13988f8e3d
2015-06-17 14:35:37 -04:00
Joshua Chin
d55770dedc updated dB_to_freq docstring
Former-commit-id: 68b1c121bd
2015-06-17 14:34:03 -04:00
Joshua Chin
a10d652685 clarified the docstrings for random_words and random_ascii_words
Former-commit-id: a289ab7f8b
2015-06-17 14:26:06 -04:00
Joshua Chin
cde8ac5366 corrected available_languages to return a dict of strs to strs
Former-commit-id: 9f288bac31
2015-06-17 12:43:13 -04:00
Joshua Chin
baca17c2ef changed yield to yield from in iter_wordlist
Former-commit-id: 6cc962bfea
2015-06-17 12:38:31 -04:00
Joshua Chin
b5a6fcc03b updated db_to_freq docstring
Former-commit-id: 9b30da4dec
2015-06-17 12:24:23 -04:00
Joshua Chin
5f7f661a1f added docstrings
Former-commit-id: 7e808bf7c1
2015-06-17 12:20:50 -04:00
Joshua Chin
c5d8bac7d5 removed temporary variable
Former-commit-id: 053e4da3e6
2015-06-17 12:16:02 -04:00
Rob Speer
9a46b80028 clearer error on py2
Former-commit-id: ed19d79c5a
2015-05-28 14:05:11 -04:00
Rob Speer
51f4e4c826 add installation instructions to the readme
Former-commit-id: 0f4ca80026
2015-05-28 14:02:12 -04:00
Rob Speer
1f41cb083c update Japanese data; test Japanese and token combining
Former-commit-id: 611a6a35de
2015-05-28 14:01:56 -04:00
Rob Speer
d991373c1d Work on making Japanese tokenization use MeCab consistently
Former-commit-id: 05cf94d1fd
2015-05-27 18:10:25 -04:00
Rob Speer
e4e146f22f Merge branch 'master' into newbuild
Conflicts:
	setup.py
	wordfreq/build.py
	wordfreq/config.py

Former-commit-id: 0e5156e162
2015-05-21 20:41:47 -04:00
Rob Speer
b807d01f8f rebuild data
Former-commit-id: 84e5edcea1
2015-05-21 20:36:15 -04:00
Rob Speer
a1c31d3390 remove old tests
Former-commit-id: 410912d8f0
2015-05-21 20:36:09 -04:00
Rob Speer
24a8e5531b allow more language matches; reorder some parameters
Former-commit-id: b42594fa5f
2015-05-21 20:35:02 -04:00
Rob Speer
5b4107bd1d tests for new wordfreq with full coverage
Former-commit-id: df863a5169
2015-05-21 20:34:17 -04:00
Rob Speer
c953fc1626 update README, another setup fix
Former-commit-id: dd41e61c57
2015-05-13 04:09:34 -04:00
Rob Speer
5cbc0d0f94 update dependencies
Former-commit-id: f13cca4d81
2015-05-12 12:30:01 -04:00
Rob Speer
6f61cac4cb restore missing line in setup.py
Former-commit-id: bb18f741e2
2015-05-12 12:24:18 -04:00
Rob Speer
1c65cb9f14 add new data files from wordfreq_builder
Former-commit-id: 35aec061de
2015-05-11 18:45:47 -04:00
Rob Speer
9cd6f7c5c5 WIP: burn stuff down
Former-commit-id: 9b63e54471
2015-05-08 15:28:52 -04:00
Lance Nathan
1bde55d516 Tweak to previous variable name fix
Former-commit-id: e8a1548d93
2015-05-06 17:57:10 -04:00
Lance Nathan
d577c9e9c9 Merge pull request #6 from LuminosoInsight/ftfy4
Clean data with ftfy v4

Former-commit-id: 4632ffb177
2015-05-06 17:32:45 -04:00
Lance Nathan
b82b183c7a Merge pull request #5 from LuminosoInsight/dutch-201504
Better Dutch surface-form data

Former-commit-id: 5f05b52fe5
2015-05-06 17:15:21 -04:00
Rob Speer
f000ac2f1d fix reused variable name
Former-commit-id: 506073030a
2015-05-06 17:06:37 -04:00
Rob Speer
c439d492a5 set version number to 0.8
Former-commit-id: 2f3bb955d1
2015-05-05 12:05:00 -04:00
Rob Speer
d7ea4c420c Merge branch 'dutch-201504' into ftfy4
Conflicts:
	setup.py

Former-commit-id: 24a7c73e6d
2015-05-05 12:04:44 -04:00
Rob Speer
0cc89b1afa require ftfy 4
Former-commit-id: 70b2c678ea
2015-05-05 12:04:13 -04:00
Rob Speer
732c932ac7 v0.7: make a proper Dutch 'surfaces' list
Former-commit-id: 873ace87db
2015-04-30 13:01:24 -04:00
Rob Speer
4c44872d15 Merge branch 'master' into dutch-201503
Conflicts:
	wordfreq/build.py

Former-commit-id: 6cf46ee5aa
2015-04-29 14:36:24 -04:00
Rob Speer
d3c41fd8d8 start a new multilingual wordlist called 'stems'
So far, this wordlist is only in Dutch.

Former-commit-id: af5f65b328
2015-03-31 15:59:30 -04:00
Rob Speer
58da7797da Fix Dutch lists
- Use surface forms consistently, not stems
- Count all instances of words on Wikipedia, not one per article

Former-commit-id: 3507d8b630
2015-03-12 16:00:03 -04:00
Andrew Lin
6f1dd0280c Merge pull request #3 from LuminosoInsight/variable_name_fix
Fix a variable name for clarity.

Former-commit-id: cfe58cd899
2015-03-11 14:10:53 -04:00
Rob Speer
9059d6db9e new Dutch data, bump version to 0.6
Former-commit-id: 377336bcdc
2015-03-03 15:54:45 -05:00
Andrew Lin
4b722a1f79 Fix a variable name for clarity.
Former-commit-id: 434c603798
2015-03-03 11:59:46 -05:00
Andrew Lin
a82dfedeba Merge pull request #2 from LuminosoInsight/new-twitter-lists
New twitter lists

Former-commit-id: 5a4d3a87d5
2015-02-17 15:36:13 -05:00
Rob Speer
e6e5f6c9bf add surface forms from Twitter 2014 data
Former-commit-id: ffdaa82b11
2015-02-17 15:06:11 -05:00
Rob Speer
a51e9b062e stop running 'remove_unsafe_private_use' unnecessarily
Former-commit-id: b6f246ecbb
2015-02-17 14:02:36 -05:00
Rob Speer
46b3cdbcd4 add twitter-stems-2014 wordlist data
Former-commit-id: 6ab72201cd
2015-02-11 13:29:32 -05:00
Rob Speer
3aa6ced5cd Allow multithreaded SQLite on Python 3
Former-commit-id: bf0071fd8b
2014-10-02 18:10:09 -04:00
Rob Speer
789659128a construct the download path correctly, even on Windows
Former-commit-id: 6d90cef415
2014-09-08 10:56:48 -04:00
Rob Speer
dd3514a506 remove unused global
Former-commit-id: c55a701885
2014-09-02 14:29:31 -04:00
Rob Speer
8ac65dc644 cleanups to building and uploading, from code review
Former-commit-id: 5dee417302
2014-08-18 14:14:01 -04:00
Rob Speer
33dd311450 Add license text for the whole package
Former-commit-id: cb7b2b76e6
2014-06-02 16:37:32 -04:00
Rob Speer
c7c8078883 A different plan for the top-level word_frequency function.
When, before, I was importing wordfreq.query at the top level, this
created a dependency loop when installing wordfreq.

The new top-level __init__.py provides just a `word_frequency` function,
which imports the real function as needed and calls it. This should
avoid the dependency loop, at the cost of making
`wordfreq.word_frequency` slightly less efficient than
`wordfreq.query.word_frequency`.

Former-commit-id: 44ccf40742
2014-02-24 18:03:31 -05:00
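Note on c7c8078883: the deferred-import pattern that message describes can be sketched roughly as below. This is only an illustration, not the code from the commit. The module name `demo_query`, the toy lookup table, and the `(word, lang, default)` signature are all assumptions made so that the sketch runs on its own; the real wrapper would import from `wordfreq.query` instead.

```python
import sys
import types

# Stand-in for the submodule that is costly (or unsafe) to import early.
# "demo_query" is a hypothetical name playing the role of wordfreq.query.
_query = types.ModuleType("demo_query")


def _real_word_frequency(word, lang, default=0.0):
    # Toy lookup table, assumed purely for illustration.
    table = {("the", "en"): 0.05}
    return table.get((word, lang), default)


_query.word_frequency = _real_word_frequency
sys.modules["demo_query"] = _query


def word_frequency(word, lang, default=0.0):
    """Top-level wrapper that defers the import until it is called."""
    # The import sits inside the function body, so importing the package
    # itself (for example during installation) never loads the submodule.
    from demo_query import word_frequency as real
    return real(word, lang, default)
```

The wrapper re-runs the import statement on every call. After the first call that is only a cached `sys.modules` lookup, which is consistent with the "slightly less efficient" trade-off the commit message mentions.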
Rob Speer
e29df8346d version 0.4: minor code changes, debugged database
- The database is built under Python 3.3.2, so it should correctly
  implement Python 3's Unicode tricks, including special handling
  of Greek lowercase letters. (Version 0.3 was supposed to do this
  as well, but apparently, it didn't.)
- `word_frequency` and `iter_wordlist` can be imported from the
  top level.
- The new function `random_words` supplies a string made from
  random words that are sufficiently high in rank order.

Former-commit-id: 3702a7c8d0
2014-02-24 16:29:06 -05:00
Rob Speer
dc16996458 Sometimes you need some random words.
Former-commit-id: 3447ae732e
2014-01-06 15:51:10 -05:00
Andrew Lin
3340367519 Remove the tests for metanl_word_frequency too. Doh.
Former-commit-id: 68d262791c
2013-11-11 13:21:25 -05:00