Robyn Speer
a69bd518af
limit final builds to languages with >= 2 sources
2015-05-07 23:59:04 -04:00
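The filter described in the commit above could look roughly like the following. This is a minimal sketch, not the repository's build code: the function name `languages_with_enough_sources`, the `minimum` parameter, and the mapping layout are all assumptions made for illustration.

```python
def languages_with_enough_sources(sources_by_language, minimum=2):
    """Return the language codes backed by at least `minimum` distinct sources.

    `sources_by_language` is a hypothetical mapping from a language code to
    the names of the data sources available for that language.
    """
    return sorted(
        lang
        for lang, sources in sources_by_language.items()
        if len(set(sources)) >= minimum
    )


# Illustrative data only; the real source lists live in the build configuration.
example_sources = {
    "en": ["twitter", "wikipedia", "opensubtitles"],
    "nl": ["twitter", "wikipedia"],
    "xx": ["wikipedia"],
}
selected = languages_with_enough_sources(example_sources)
```

Here `selected` keeps `en` and `nl` and drops `xx`, which has only one source.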
Robyn Speer
a46f1af4b8
fix dependency
2015-05-07 23:55:57 -04:00
Robyn Speer
a5f6113824
a reasonably complete build process
2015-05-07 19:38:33 -04:00
Robyn Speer
4e1006b334
process leeds and opensubtitles
2015-05-07 17:07:33 -04:00
Robyn Speer
40680cad55
abstract how we define build rules a bit
2015-05-07 16:59:28 -04:00
Robyn Speer
04bde8d617
WIP on more build steps
2015-05-07 16:49:53 -04:00
Lance Nathan
60c7f3a7da
Tweak to previous variable name fix
...
Former-commit-id: e8a1548d93
2015-05-06 17:57:10 -04:00
Lance Nathan
8400dee933
Merge pull request #6 from LuminosoInsight/ftfy4
...
Clean data with ftfy v4
Former-commit-id: 4632ffb177
2015-05-06 17:32:45 -04:00
Lance Nathan
bbf9164542
Merge pull request #5 from LuminosoInsight/dutch-201504
...
Better Dutch surface-form data
Former-commit-id: 5f05b52fe5
2015-05-06 17:15:21 -04:00
Robyn Speer
5ef406fd43
fix reused variable name
...
Former-commit-id: 506073030a
2015-05-06 17:06:37 -04:00
Robyn Speer
7c09fec692
add rules to count wikipedia tokens
2015-05-05 15:21:24 -04:00
Robyn Speer
c55e44e486
fix the 'count' ninja rule
2015-05-05 14:06:13 -04:00
Robyn Speer
59409266ca
add and adjust some build steps
...
- more build steps for Wikipedia
- rename 'tokenize_twitter' to 'pretokenize_twitter' to indicate that
the results are preliminary
2015-05-05 13:59:21 -04:00
Robyn Speer
d5d24cb098
set version number to 0.8
...
Former-commit-id: 2f3bb955d1
2015-05-05 12:05:00 -04:00
Robyn Speer
c9ca5b94b0
Merge branch 'dutch-201504' into ftfy4
...
Conflicts:
setup.py
Former-commit-id: 24a7c73e6d
2015-05-05 12:04:44 -04:00
Robyn Speer
922c658b68
require ftfy 4
...
Former-commit-id: 70b2c678ea
2015-05-05 12:04:13 -04:00
Robyn Speer
33c5f78c07
add wiki-parsing process
2015-05-04 13:25:01 -04:00
Robyn Speer
9ac4838eda
not using wordfreq.cfg anymore
2015-04-30 16:25:42 -04:00
Robyn Speer
efcf436112
WIP on new build system
2015-04-30 16:24:28 -04:00
Robyn Speer
98b3e51da1
use script codes for Chinese
2015-04-30 13:02:58 -04:00
Robyn Speer
cb6b2a8002
v0.7: make a proper Dutch 'surfaces' list
...
Former-commit-id: 873ace87db
2015-04-30 13:01:24 -04:00
Robyn Speer
76ea7f1bd5
define some ninja rules
2015-04-29 17:13:58 -04:00
Robyn Speer
524f7c760b
WIP on Ninja build automation
2015-04-29 15:59:06 -04:00
Robyn Speer
f77a61e675
move commands into cli/ directory
2015-04-29 15:22:04 -04:00
Robyn Speer
2bf8870832
always use surface forms
2015-04-29 15:17:00 -04:00
Robyn Speer
b4dfdaa47c
Merge branch 'master' into dutch-201503
...
Conflicts:
wordfreq/build.py
Former-commit-id: 6cf46ee5aa
2015-04-29 14:36:24 -04:00
Robyn Speer
38261a6a0a
handle multi-word stems correctly
2015-04-29 13:45:53 -04:00
Robyn Speer
d29e8bfddf
start a new multilingual wordlist called 'stems'
...
So far, this wordlist is only in Dutch.
Former-commit-id: af5f65b328
2015-03-31 15:59:30 -04:00
Robyn Speer
6b57075275
revise the process of building Wikipedia counts
2015-03-30 18:09:07 -04:00
Robyn Speer
56e811be19
Fix Dutch lists
...
- Use surface forms consistently, not stems
- Count all instances of words on Wikipedia, not one per article
Former-commit-id: 3507d8b630
2015-03-12 16:00:03 -04:00
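The second bullet in the commit above distinguishes two ways of counting. A minimal sketch of the difference, assuming each article has already been tokenized into a list of surface forms (the function names and sample data are hypothetical, not taken from the repository):

```python
from collections import Counter


def count_all_instances(articles):
    """Count every occurrence of every token across all articles."""
    counts = Counter()
    for tokens in articles:
        counts.update(tokens)
    return counts


def count_once_per_article(articles):
    """Count each token at most once per article (a document frequency)."""
    counts = Counter()
    for tokens in articles:
        counts.update(set(tokens))
    return counts


# Two tiny made-up "articles", already split into tokens.
articles = [["de", "kat", "de", "hond"], ["de", "fiets"]]
all_counts = count_all_instances(articles)
per_article_counts = count_once_per_article(articles)
```

With this data, `de` is counted three times by the first function and twice by the second, which is why the choice matters for frequent words.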
Andrew Lin
6e98ca9822
Merge pull request #3 from LuminosoInsight/variable_name_fix
...
Fix a variable name for clarity.
Former-commit-id: cfe58cd899
2015-03-11 14:10:53 -04:00
Robyn Speer
ca944e54aa
new Dutch data, bump version to 0.6
...
Former-commit-id: 377336bcdc
2015-03-03 15:54:45 -05:00
Andrew Lin
6882ac9f0e
Fix a variable name for clarity.
...
Former-commit-id: 434c603798
2015-03-03 11:59:46 -05:00
Andrew Lin
39d914f8e1
Merge pull request #2 from LuminosoInsight/new-twitter-lists
...
New twitter lists
Former-commit-id: 5a4d3a87d5
2015-02-17 15:36:13 -05:00
Robyn Speer
ad22387a53
add surface forms from Twitter 2014 data
...
Former-commit-id: ffdaa82b11
2015-02-17 15:06:11 -05:00
Robyn Speer
8d57b39a7b
stop running 'remove_unsafe_private_use' unnecessarily
...
Former-commit-id: b6f246ecbb
2015-02-17 14:02:36 -05:00
Robyn Speer
bc780c63c8
enable wordlist balancing, surface form counting
2015-02-17 13:43:22 -05:00
Robyn Speer
f4280dcad0
add twitter-stems-2014 wordlist data
...
Former-commit-id: 6ab72201cd
2015-02-11 13:29:32 -05:00
Robyn Speer
07e61be7e3
add utility for combining wordlists
2015-02-11 11:45:10 -05:00
Robyn Speer
23bd5ba76c
command-line entry points
2015-02-10 12:28:29 -05:00
Robyn Speer
8b322ce534
Initial commit
2015-02-04 20:19:36 -05:00
Robyn Speer
03fac20b1b
Allow multithreaded SQLite on Python 3
...
Former-commit-id: bf0071fd8b
2014-10-02 18:10:09 -04:00
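The commit above does not say how multithreaded access was enabled. One common approach with the standard `sqlite3` module is to pass `check_same_thread=False` and serialize access with a lock; the sketch below assumes that approach, and the table layout and `lookup` helper are invented for illustration.

```python
import sqlite3
import threading

# check_same_thread=False lets threads other than the creator use the
# connection. The lock keeps those threads from using it at the same time.
conn = sqlite3.connect(":memory:", check_same_thread=False)
lock = threading.Lock()

# Hypothetical schema, not the one used by the package.
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, freq REAL)")
conn.execute("INSERT INTO words VALUES (?, ?)", ("the", 0.05))
conn.commit()


def lookup(word):
    """Return the stored frequency for `word`, or 0.0 if it is missing."""
    with lock:
        row = conn.execute(
            "SELECT freq FROM words WHERE word = ?", (word,)
        ).fetchone()
    return row[0] if row else 0.0


# Query the shared connection from a second thread.
results = []
worker = threading.Thread(target=lambda: results.append(lookup("the")))
worker.start()
worker.join()
```

Without `check_same_thread=False`, the lookup in the worker thread would raise `sqlite3.ProgrammingError`.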
Robyn Speer
5153faf43e
construct the download path correctly, even on Windows
...
Former-commit-id: 6d90cef415
2014-09-08 10:56:48 -04:00
Robyn Speer
0c61406cdc
remove unused global
...
Former-commit-id: c55a701885
2014-09-02 14:29:31 -04:00
Robyn Speer
b357ffaa09
cleanups to building and uploading, from code review
...
Former-commit-id: 5dee417302
2014-08-18 14:14:01 -04:00
Robyn Speer
759534392f
Add license text for the whole package
...
Former-commit-id: cb7b2b76e6
2014-06-02 16:37:32 -04:00
Robyn Speer
a06c3fc648
A different plan for the top-level word_frequency function.
...
Previously, I imported wordfreq.query at the top level, which
created a dependency loop when installing wordfreq.
The new top-level __init__.py provides just a `word_frequency` function,
which imports the real function as needed and calls it. This should
avoid the dependency loop, at the cost of making
`wordfreq.word_frequency` slightly less efficient than
`wordfreq.query.word_frequency`.
Former-commit-id: 44ccf40742
2014-02-24 18:03:31 -05:00
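The deferred-import pattern described in the commit above can be sketched generically. This is an illustration under assumptions, not the package's actual `__init__.py`: the `make_lazy` helper is hypothetical, and `math.sqrt` stands in for `wordfreq.query.word_frequency` so the example runs without the package installed.

```python
import importlib


def make_lazy(module_name, attr_name):
    """Return a wrapper that imports `module_name` only when it is called.

    Nothing is imported at definition time, so a top-level package can
    expose the wrapper without importing the target module during install.
    """
    def wrapper(*args, **kwargs):
        # import_module returns the cached module after the first call, but
        # the lookup still costs a little on every call, which is the small
        # efficiency loss the commit message mentions.
        module = importlib.import_module(module_name)
        return getattr(module, attr_name)(*args, **kwargs)

    wrapper.__name__ = attr_name
    return wrapper


# Stand-in target: in the package this would be the real query function.
lazy_sqrt = make_lazy("math", "sqrt")
value = lazy_sqrt(9.0)
```

Callers who need the extra speed can still import the real function from its own module directly.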
Robyn Speer
b6b3a6f5f6
version 0.4: minor code changes, debugged database
...
- The database is built under Python 3.3.2, so it should correctly
implement Python 3's Unicode tricks, including special handling
of Greek lowercase letters. (Version 0.3 was supposed to do this
as well, but apparently, it didn't.)
- `word_frequency` and `iter_wordlist` can be imported from the
top level.
- The new function `random_words` supplies a string made from
random words that are sufficiently high in rank order.
Former-commit-id: 3702a7c8d0
2014-02-24 16:29:06 -05:00
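A function like the `random_words` mentioned in the commit above could be sketched as follows. The signature, the `top_n` cutoff, and the sample list are assumptions; the real function's parameters are not given in the log.

```python
import random


def random_words(ranked_words, nwords=4, top_n=1000, rng=random):
    """Join `nwords` randomly chosen words drawn from the top of a ranked list.

    `ranked_words` is assumed to be sorted from most to least frequent, so
    slicing keeps only words that are sufficiently high in rank order.
    """
    candidates = ranked_words[:top_n]
    if not candidates:
        raise ValueError("no candidate words to choose from")
    return " ".join(rng.choice(candidates) for _ in range(nwords))


# Made-up ranked list; only the first four entries are eligible below.
ranked = ["the", "of", "and", "to", "zyzzyva"]
phrase = random_words(ranked, nwords=3, top_n=4)
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible.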
Robyn Speer
207defe6ff
Sometimes you need some random words.
...
Former-commit-id: 3447ae732e
2014-01-06 15:51:10 -05:00
Andrew Lin
181e8e08fa
Remove the tests for metanl_word_frequency too. Doh.
...
Former-commit-id: 68d262791c
2013-11-11 13:21:25 -05:00