Lance Nathan
d577c9e9c9
Merge pull request #6 from LuminosoInsight/ftfy4
Clean data with ftfy v4
Former-commit-id: 4632ffb177
2015-05-06 17:32:45 -04:00
Lance Nathan
b82b183c7a
Merge pull request #5 from LuminosoInsight/dutch-201504
Better Dutch surface-form data
Former-commit-id: 5f05b52fe5
2015-05-06 17:15:21 -04:00
Rob Speer
f000ac2f1d
fix reused variable name
Former-commit-id: 506073030a
2015-05-06 17:06:37 -04:00
Rob Speer
16928ed182
add rules to count wikipedia tokens
2015-05-05 15:21:24 -04:00
Rob Speer
bd579e2319
fix the 'count' ninja rule
2015-05-05 14:06:13 -04:00
Rob Speer
5787b6bb73
add and adjust some build steps
- more build steps for Wikipedia
- rename 'tokenize_twitter' to 'pretokenize_twitter' to indicate that
  the results are preliminary
2015-05-05 13:59:21 -04:00
Rob Speer
c439d492a5
set version number to 0.8
Former-commit-id: 2f3bb955d1
2015-05-05 12:05:00 -04:00
Rob Speer
d7ea4c420c
Merge branch 'dutch-201504' into ftfy4
Conflicts:
setup.py
Former-commit-id: 24a7c73e6d
2015-05-05 12:04:44 -04:00
Rob Speer
0cc89b1afa
require ftfy 4
Former-commit-id: 70b2c678ea
2015-05-05 12:04:13 -04:00
Rob Speer
61b9440e3d
add wiki-parsing process
2015-05-04 13:25:01 -04:00
Rob Speer
34400de35a
not using wordfreq.cfg anymore
2015-04-30 16:25:42 -04:00
Rob Speer
5437bb4e85
WIP on new build system
2015-04-30 16:24:28 -04:00
Rob Speer
2a1b16b55c
use script codes for Chinese
2015-04-30 13:02:58 -04:00
Rob Speer
732c932ac7
v0.7: make a proper Dutch 'surfaces' list
Former-commit-id: 873ace87db
2015-04-30 13:01:24 -04:00
Rob Speer
4dae2f8caf
define some ninja rules
2015-04-29 17:13:58 -04:00
Rob Speer
14e445a937
WIP on Ninja build automation
2015-04-29 15:59:06 -04:00
Rob Speer
815d393b74
move commands into cli/ directory
2015-04-29 15:22:04 -04:00
Rob Speer
5d14d24738
always use surface forms
2015-04-29 15:17:00 -04:00
Rob Speer
4c44872d15
Merge branch 'master' into dutch-201503
Conflicts:
wordfreq/build.py
Former-commit-id: 6cf46ee5aa
2015-04-29 14:36:24 -04:00
Rob Speer
70c9e99ee4
handle multi-word stems correctly
2015-04-29 13:45:53 -04:00
Rob Speer
d3c41fd8d8
start a new multilingual wordlist called 'stems'
So far, this wordlist is only in Dutch.
Former-commit-id: af5f65b328
2015-03-31 15:59:30 -04:00
Rob Speer
60a7c4d1ec
revise the process of building Wikipedia counts
2015-03-30 18:09:07 -04:00
Rob Speer
58da7797da
Fix Dutch lists
- Use surface forms consistently, not stems
- Count all instances of words on Wikipedia, not one per article
Former-commit-id: 3507d8b630
2015-03-12 16:00:03 -04:00
Andrew Lin
6f1dd0280c
Merge pull request #3 from LuminosoInsight/variable_name_fix
Fix a variable name for clarity.
Former-commit-id: cfe58cd899
2015-03-11 14:10:53 -04:00
Rob Speer
9059d6db9e
new Dutch data, bump version to 0.6
Former-commit-id: 377336bcdc
2015-03-03 15:54:45 -05:00
Andrew Lin
4b722a1f79
Fix a variable name for clarity.
Former-commit-id: 434c603798
2015-03-03 11:59:46 -05:00
Andrew Lin
a82dfedeba
Merge pull request #2 from LuminosoInsight/new-twitter-lists
New twitter lists
Former-commit-id: 5a4d3a87d5
2015-02-17 15:36:13 -05:00
Rob Speer
e6e5f6c9bf
add surface forms from Twitter 2014 data
Former-commit-id: ffdaa82b11
2015-02-17 15:06:11 -05:00
Rob Speer
a51e9b062e
stop running 'remove_unsafe_private_use' unnecessarily
Former-commit-id: b6f246ecbb
2015-02-17 14:02:36 -05:00
Rob Speer
be2b68b1de
enable wordlist balancing, surface form counting
2015-02-17 13:43:22 -05:00
Rob Speer
46b3cdbcd4
add twitter-stems-2014 wordlist data
Former-commit-id: 6ab72201cd
2015-02-11 13:29:32 -05:00
Rob Speer
fcd6044c2d
add utility for combining wordlists
2015-02-11 11:45:10 -05:00
Rob Speer
d3374a9fe1
command-line entry points
2015-02-10 12:28:29 -05:00
Rob Speer
693c35476f
Initial commit
2015-02-04 20:19:36 -05:00
Rob Speer
3aa6ced5cd
Allow multithreaded SQLite on Python 3
Former-commit-id: bf0071fd8b
2014-10-02 18:10:09 -04:00
Rob Speer
789659128a
construct the download path correctly, even on Windows
Former-commit-id: 6d90cef415
2014-09-08 10:56:48 -04:00
Rob Speer
dd3514a506
remove unused global
Former-commit-id: c55a701885
2014-09-02 14:29:31 -04:00
Rob Speer
8ac65dc644
cleanups to building and uploading, from code review
Former-commit-id: 5dee417302
2014-08-18 14:14:01 -04:00
Rob Speer
33dd311450
Add license text for the whole package
Former-commit-id: cb7b2b76e6
2014-06-02 16:37:32 -04:00
Rob Speer
c7c8078883
A different plan for the top-level word_frequency function.
Before, when I imported wordfreq.query at the top level, it
created a dependency loop when installing wordfreq.
The new top-level __init__.py provides just a `word_frequency` function,
which imports the real function as needed and calls it. This should
avoid the dependency loop, at the cost of making
`wordfreq.word_frequency` slightly less efficient than
`wordfreq.query.word_frequency`.
Former-commit-id: 44ccf40742
2014-02-24 18:03:31 -05:00
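The deferred-import approach described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper name `make_lazy` is hypothetical, and a standard-library function stands in for `wordfreq.query.word_frequency` so that the sketch runs without wordfreq installed.

```python
import importlib


def make_lazy(module_name, attr):
    """Return a wrapper that imports module_name only when it is called.

    `make_lazy` is a hypothetical helper for illustration; it is not part
    of wordfreq.
    """
    def wrapper(*args, **kwargs):
        # The import happens at call time, not at definition time, so
        # importing the package that defines this wrapper does not pull
        # in module_name (or that module's own dependencies).
        real = getattr(importlib.import_module(module_name), attr)
        return real(*args, **kwargs)

    wrapper.__name__ = attr
    return wrapper


# In a package's __init__.py this might look something like (assumption):
#     word_frequency = make_lazy("wordfreq.query", "word_frequency")
# Here math.sqrt stands in for the real function.
lazy_sqrt = make_lazy("math", "sqrt")
print(lazy_sqrt(9.0))
```

Each call repeats the module lookup, which is the small efficiency cost the commit message mentions; once the module has been imported, that lookup is served from Python's module cache rather than re-executing the module.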
Rob Speer
e29df8346d
version 0.4: minor code changes, debugged database
- The database is built under Python 3.3.2, so it should correctly
  implement Python 3's Unicode tricks, including special handling
  of Greek lowercase letters. (Version 0.3 was supposed to do this
  as well, but apparently, it didn't.)
- `word_frequency` and `iter_wordlist` can be imported from the
  top level.
- The new function `random_words` supplies a string made from
  random words that are sufficiently high in rank order.
Former-commit-id: 3702a7c8d0
2014-02-24 16:29:06 -05:00
Rob Speer
dc16996458
Sometimes you need some random words.
Former-commit-id: 3447ae732e
2014-01-06 15:51:10 -05:00
Andrew Lin
3340367519
Remove the tests for metanl_word_frequency too. Doh.
Former-commit-id: 68d262791c
2013-11-11 13:21:25 -05:00
Rob Speer
65f61d8a2e
Merge pull request #1 from LuminosoInsight/remove_metanl_wf
Remove metanl_word_frequency(), which we no longer need.
Former-commit-id: 63bebe6ad3
2013-11-11 10:13:25 -08:00
Rob Speer
d63c868ba8
data is now hosted on wordfreq.services.luminoso.com
Former-commit-id: 56f2c606f1
2013-11-07 14:43:15 -05:00
Andrew Lin
a70b3847a6
Remove metanl_word_frequency(), which we no longer need.
Former-commit-id: 76a7267670
2013-11-04 16:51:25 -05:00
Rob Speer
1edee91b05
Clear wordlists before inserting them; yell at Python 2
Former-commit-id: 823b3828cd
2013-11-01 19:29:37 -04:00
Rob Speer
38266b9916
Revert "code review and pep8 fixes"
This reverts commit 8ba4a6660e [formerly b4b8ba8be7].
Conflicts:
wordfreq/transfer.py
Former-commit-id: 5c8ba34492
2013-11-01 17:33:39 -04:00
Rob Speer
cb3821a304
Merge branch 'master' of github.com:LuminosoInsight/wordfreq
Conflicts:
wordfreq/transfer.py
Former-commit-id: 90e042f196
2013-11-01 17:05:59 -04:00
Rob Speer
8ba4a6660e
code review and pep8 fixes
Former-commit-id: b4b8ba8be7
2013-11-01 17:05:12 -04:00