Rob Speer
d6cc90792f
Makefile should only be needed for bootstrapping Ninja
2015-05-08 12:39:31 -04:00
Rob Speer
b541fe68e1
Merge branch 'ninja-build'
Conflicts:
wordfreq_builder/cmd_count_twitter.py
wordfreq_builder/cmd_count_wikipedia.py
2015-05-08 00:01:01 -04:00
Rob Speer
2f14417bcf
limit final builds to languages with >= 2 sources
2015-05-07 23:59:04 -04:00
Rob Speer
1b7a2b9d0b
fix dependency
2015-05-07 23:55:57 -04:00
Rob Speer
abb0e059c8
a reasonably complete build process
2015-05-07 19:38:33 -04:00
Rob Speer
02d8b32119
process leeds and opensubtitles
2015-05-07 17:07:33 -04:00
Rob Speer
7e238cf547
abstract how we define build rules a bit
2015-05-07 16:59:28 -04:00
Rob Speer
d2f9c60776
WIP on more build steps
2015-05-07 16:49:53 -04:00
Lance Nathan
e8a1548d93
Tweak to previous variable name fix
2015-05-06 17:57:10 -04:00
Lance Nathan
4632ffb177
Merge pull request #6 from LuminosoInsight/ftfy4
Clean data with ftfy v4
2015-05-06 17:32:45 -04:00
Lance Nathan
5f05b52fe5
Merge pull request #5 from LuminosoInsight/dutch-201504
Better Dutch surface-form data
2015-05-06 17:15:21 -04:00
Rob Speer
506073030a
fix reused variable name
2015-05-06 17:06:37 -04:00
Rob Speer
16928ed182
add rules to count wikipedia tokens
2015-05-05 15:21:24 -04:00
Rob Speer
bd579e2319
fix the 'count' ninja rule
2015-05-05 14:06:13 -04:00
Rob Speer
5787b6bb73
add and adjust some build steps
- more build steps for Wikipedia
- rename 'tokenize_twitter' to 'pretokenize_twitter' to indicate that
  the results are preliminary
2015-05-05 13:59:21 -04:00
Rob Speer
2f3bb955d1
set version number to 0.8
2015-05-05 12:05:00 -04:00
Rob Speer
24a7c73e6d
Merge branch 'dutch-201504' into ftfy4
Conflicts:
setup.py
2015-05-05 12:04:44 -04:00
Rob Speer
70b2c678ea
require ftfy 4
2015-05-05 12:04:13 -04:00
Rob Speer
61b9440e3d
add wiki-parsing process
2015-05-04 13:25:01 -04:00
Rob Speer
34400de35a
not using wordfreq.cfg anymore
2015-04-30 16:25:42 -04:00
Rob Speer
5437bb4e85
WIP on new build system
2015-04-30 16:24:28 -04:00
Rob Speer
2a1b16b55c
use script codes for Chinese
2015-04-30 13:02:58 -04:00
Rob Speer
873ace87db
v0.7: make a proper Dutch 'surfaces' list
2015-04-30 13:01:24 -04:00
Rob Speer
4dae2f8caf
define some ninja rules
2015-04-29 17:13:58 -04:00
Rob Speer
14e445a937
WIP on Ninja build automation
2015-04-29 15:59:06 -04:00
Rob Speer
815d393b74
move commands into cli/ directory
2015-04-29 15:22:04 -04:00
Rob Speer
5d14d24738
always use surface forms
2015-04-29 15:17:00 -04:00
Rob Speer
6cf46ee5aa
Merge branch 'master' into dutch-201503
Conflicts:
wordfreq/build.py
2015-04-29 14:36:24 -04:00
Rob Speer
70c9e99ee4
handle multi-word stems correctly
2015-04-29 13:45:53 -04:00
Rob Speer
af5f65b328
start a new multilingual wordlist called 'stems'
So far, this wordlist is only in Dutch.
2015-03-31 15:59:30 -04:00
Rob Speer
60a7c4d1ec
revise the process of building Wikipedia counts
2015-03-30 18:09:07 -04:00
Rob Speer
3507d8b630
Fix Dutch lists
- Use surface forms consistently, not stems
- Count all instances of words on Wikipedia, not one per article
2015-03-12 16:00:03 -04:00
Andrew Lin
cfe58cd899
Merge pull request #3 from LuminosoInsight/variable_name_fix
Fix a variable name for clarity.
2015-03-11 14:10:53 -04:00
Rob Speer
377336bcdc
new Dutch data, bump version to 0.6
2015-03-03 15:54:45 -05:00
Andrew Lin
434c603798
Fix a variable name for clarity.
2015-03-03 11:59:46 -05:00
Andrew Lin
5a4d3a87d5
Merge pull request #2 from LuminosoInsight/new-twitter-lists
New twitter lists
2015-02-17 15:36:13 -05:00
Rob Speer
ffdaa82b11
add surface forms from Twitter 2014 data
2015-02-17 15:06:11 -05:00
Rob Speer
b6f246ecbb
stop running 'remove_unsafe_private_use' unnecessarily
2015-02-17 14:02:36 -05:00
Rob Speer
be2b68b1de
enable wordlist balancing, surface form counting
2015-02-17 13:43:22 -05:00
Rob Speer
6ab72201cd
add twitter-stems-2014 wordlist data
2015-02-11 13:29:32 -05:00
Rob Speer
fcd6044c2d
add utility for combining wordlists
2015-02-11 11:45:10 -05:00
Rob Speer
d3374a9fe1
command-line entry points
2015-02-10 12:28:29 -05:00
Rob Speer
693c35476f
Initial commit
2015-02-04 20:19:36 -05:00
Rob Speer
bf0071fd8b
Allow multithreaded SQLite on Python 3
2014-10-02 18:10:09 -04:00
Rob Speer
6d90cef415
construct the download path correctly, even on Windows
2014-09-08 10:56:48 -04:00
Rob Speer
c55a701885
remove unused global
2014-09-02 14:29:31 -04:00
Rob Speer
5dee417302
cleanups to building and uploading, from code review
2014-08-18 14:14:01 -04:00
Rob Speer
cb7b2b76e6
Add license text for the whole package
2014-06-02 16:37:32 -04:00
Rob Speer
44ccf40742
A different plan for the top-level word_frequency function.
Previously, importing wordfreq.query at the top level created a
dependency loop when installing wordfreq.

The new top-level __init__.py provides just a `word_frequency` function,
which imports the real function as needed and calls it. This should
avoid the dependency loop, at the cost of making
`wordfreq.word_frequency` slightly less efficient than
`wordfreq.query.word_frequency`.
2014-02-24 18:03:31 -05:00
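The deferred-import approach described in commit 44ccf40742 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the package's actual code: the helper name `lazy_function` is hypothetical, and the standard-library `math` module stands in for `wordfreq.query` so that the example runs on its own.

```python
import importlib


def lazy_function(module_name, function_name):
    """Return a wrapper that imports `module_name` only when it is called.

    Hypothetical helper: because the import happens inside the wrapper,
    importing the package that defines the wrapper does not pull in
    `module_name` (or that module's own dependencies).
    """
    def wrapper(*args, **kwargs):
        # Look the module up at call time. After the first call this is a
        # cheap cache lookup in sys.modules, but it still adds a little
        # overhead compared with calling the real function directly.
        module = importlib.import_module(module_name)
        real_function = getattr(module, function_name)
        return real_function(*args, **kwargs)

    wrapper.__name__ = function_name
    return wrapper


# Stand-in for the real use case: in a package's top-level __init__.py this
# might name the query module instead (an assumption, not confirmed here).
lazy_sqrt = lazy_function("math", "sqrt")
```

Calling `lazy_sqrt(9.0)` triggers the import and then forwards the arguments to `math.sqrt`. The per-call lookup is the "slightly less efficient" cost that the commit message mentions.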
Rob Speer
3702a7c8d0
version 0.4: minor code changes, debugged database
- The database is built under Python 3.3.2, so it should correctly
  implement Python 3's Unicode tricks, including special handling
  of Greek lowercase letters. (Version 0.3 was supposed to do this
  as well, but apparently, it didn't.)
- `word_frequency` and `iter_wordlist` can be imported from the
  top level.
- The new function `random_words` supplies a string made from
  random words that are sufficiently high in rank order.
2014-02-24 16:29:06 -05:00