Rob Speer
a893823d6e
un-flake wordfreq_builder.tokenizers, and edit docstrings
2015-08-26 13:03:23 -04:00
Rob Speer
5a1fc00aaa
Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
2015-08-25 12:41:48 -04:00
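A minimal Python sketch (not the actual wordfreq_builder tokenizer code) of the apostrophe-stripping behavior described in the commit above; the helper name strip_edge_apostrophes is hypothetical:

```python
def strip_edge_apostrophes(token):
    # Trim apostrophes from both edges of a token, so that a fragment such
    # as "d'" reduces to "d" and re-tokenizing it gives the same result.
    return token.strip("'")

assert strip_edge_apostrophes("d'") == "d"
assert strip_edge_apostrophes("un") == "un"
# Applying the function twice gives the same result as applying it once,
# which is the idempotence the commit message refers to.
assert strip_edge_apostrophes(strip_edge_apostrophes("d'")) == "d"
```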
Rob Speer
de73888a76
use better regexes in wordfreq_builder tokenizer
2015-08-24 19:05:46 -04:00
Rob Speer
140ca6c050
remove Hangul fillers that confuse cld2
2015-08-24 17:11:18 -04:00
Andrew Lin
6d40912ef9
Stylistic cleanups to word_counts.py.
2015-07-31 19:26:18 -04:00
Andrew Lin
53621c34df
Remove redundant reference to wikipedia in builder README.
2015-07-31 19:12:59 -04:00
Rob Speer
e9f9c94e36
Don't use the file-reading cutoff when writing centibels
2015-07-28 18:45:26 -04:00
Rob Speer
c5708b24e4
put back the freqs_to_cBpack cutoff; prepare for 1.0
2015-07-28 18:01:12 -04:00
Rob Speer
32102ba3c2
Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
Code review fixes 2015 07 17
2015-07-22 15:09:00 -04:00
Joshua Chin
93cd902899
updated read_freqs docs
2015-07-22 10:06:16 -04:00
Joshua Chin
4fe9d110e1
fixed style
2015-07-22 10:05:11 -04:00
Joshua Chin
6453d864c4
reordered command line args
2015-07-22 10:04:14 -04:00
Joshua Chin
8081145922
bugfix
2015-07-21 10:12:56 -04:00
Joshua Chin
c5f82ecac1
fixed rules.ninja
2015-07-20 17:20:29 -04:00
Joshua Chin
643571c69c
fixed build bug
2015-07-20 16:51:25 -04:00
Joshua Chin
173278fdd3
ensure removal of tatweels (hopefully)
2015-07-20 16:48:36 -04:00
Joshua Chin
298d3c1d24
unhoisted if statement
2015-07-20 11:10:41 -04:00
Joshua Chin
accb7e398c
ninja.py is now pep8 compliant
2015-07-20 11:06:58 -04:00
Joshua Chin
221acf7921
fixed build
2015-07-17 17:44:01 -04:00
Rob Speer
2d1020daac
mention the Wikipedia data, and credit Hermit Dave
2015-07-17 17:09:36 -04:00
Joshua Chin
f31f9a1bcd
fixed tokenize_twitter
2015-07-17 16:37:47 -04:00
Joshua Chin
a44927e98e
added cld2 tokenizer comments
2015-07-17 16:03:33 -04:00
Joshua Chin
11a1c51321
fix arabic tokens
2015-07-17 15:52:12 -04:00
Joshua Chin
c75c735d8d
fixed syntax
2015-07-17 15:43:24 -04:00
Joshua Chin
303bd88ba2
renamed tokenize file to tokenize_twitter
2015-07-17 15:27:26 -04:00
Joshua Chin
d6519cf736
created last_tab flag
2015-07-17 15:19:09 -04:00
Joshua Chin
620becb7e8
removed unnecessary if statement
2015-07-17 15:14:06 -04:00
Joshua Chin
d988b1b42e
generated freq dict in place
2015-07-17 15:13:25 -04:00
Joshua Chin
e37c689031
corrected docstring
2015-07-17 15:12:23 -04:00
Joshua Chin
002351bace
removed unnecessary strip
2015-07-17 15:11:28 -04:00
Joshua Chin
7fc23666a9
moved last_tab to tokenize_twitter
2015-07-17 15:10:17 -04:00
Joshua Chin
528285a982
removed unused function
2015-07-17 15:03:14 -04:00
Joshua Chin
59d3c72758
fixed spacing
2015-07-17 15:02:34 -04:00
Joshua Chin
10028be212
removed unnecessary format
2015-07-17 15:01:25 -04:00
Joshua Chin
3b368b66dd
cleaned up BAD_CHAR_RANGE
2015-07-17 15:00:59 -04:00
Joshua Chin
c2d1cdcb31
moved test tokenizers
2015-07-17 14:58:58 -04:00
Joshua Chin
5d26c9f57f
added docstring and moved to scripts
2015-07-17 14:56:18 -04:00
Joshua Chin
bdc791af8f
style changes
2015-07-17 14:54:32 -04:00
Joshua Chin
4d5ec57144
removed bad comment
2015-07-17 14:54:09 -04:00
Joshua Chin
39f01b0485
removed unused scripts
2015-07-17 14:53:18 -04:00
Joshua Chin
98a7a8093b
removed mkdir -p for many cases
2015-07-17 14:45:22 -04:00
Joshua Chin
449a656edd
removed TOKENIZE_TWITTER
2015-07-17 14:43:14 -04:00
Joshua Chin
00e18b7d4b
removed TOKENIZE_TWITTER option
2015-07-17 14:40:49 -04:00
Joshua Chin
772c0cddd1
more README fixes
2015-07-17 14:40:33 -04:00
Joshua Chin
0a085132f4
fixed README
2015-07-17 14:35:43 -04:00
Rob Speer
8633e8c2a9
update the wordfreq_builder README
2015-07-13 11:58:48 -04:00
Rob Speer
41dba74da2
add docstrings and remove some brackets
2015-07-07 18:22:51 -04:00
Joshua Chin
b0f759d322
Removes mention of Rosette from README
2015-07-07 10:32:16 -04:00
Rob Speer
10c04d116f
add 'twitter' as a final build, and a new build dir
The `data/dist` directory is now a convenient place to find the final
built files that can be copied into wordfreq.
2015-07-01 17:45:39 -04:00
Rob Speer
37375383e8
cope with occasional Unicode errors in the input
2015-06-30 17:05:40 -04:00
Rob Speer
4771c12814
remove wiki2tokens and tokenize_wikipedia
These components are no longer necessary. Wikipedia output can and
should be tokenized with the standard tokenizer, instead of the
almost-equivalent one in the Nim code.
2015-06-30 15:28:01 -04:00
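For illustration, a minimal sketch of tokenizing a line of extracted Wikipedia text with wordfreq's standard tokenizer, assuming the wordfreq.tokenize(text, lang) interface; the exact call used in the wordfreq_builder build steps may differ:

```python
from wordfreq import tokenize

# Tokenize one line of extracted French Wikipedia text with the standard
# wordfreq tokenizer rather than a separate Wikipedia-specific one.
line = "L'encyclopédie libre que chacun peut améliorer"
print(tokenize(line, 'fr'))
```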
Rob Speer
9a2855394d
fix comment and whitespace involving tokenize_twitter
2015-06-30 15:18:37 -04:00
Rob Speer
f305679caf
Switch to a centibel scale, add a header to the data
2015-06-22 17:38:13 -04:00
Rob Speer
d16683f2b9
Merge pull request #2 from LuminosoInsight/review-refactor
Adds a number of bugfixes and improvements to wordfreq_builder
2015-06-19 15:29:52 -04:00
Rob Speer
5bc1f0c097
restore missing Russian OpenSubtitles data
2015-06-19 12:36:08 -04:00
Joshua Chin
3746af1350
updated freqs_to_dBpack docstring
2015-06-18 10:32:53 -04:00
Joshua Chin
59ce14cdd0
revised read_freqs docstring
2015-06-18 10:28:22 -04:00
Joshua Chin
04bf6aadcc
updated monolingual_tokenize_file docstring, and removed unused argument
2015-06-18 10:20:54 -04:00
Joshua Chin
91dd73a2b5
tokenize_file should ignore lines with unknown languages
2015-06-18 10:18:57 -04:00
Joshua Chin
ffc01c75a0
Fixed CLD2_BAD_CHAR regex
2015-06-18 10:18:00 -04:00
Joshua Chin
8277de2c7f
changed tokenize_file: cld2 returns 'un' instead of None if it cannot recognize the language
2015-06-17 14:19:28 -04:00
Joshua Chin
b24f31d30a
tokenize_file: don't join tokens if language is None
2015-06-17 14:18:18 -04:00
Joshua Chin
99d97956e6
automatically closes input file in tokenize_file
2015-06-17 11:42:34 -04:00
Joshua Chin
e50c0c6917
updated test to check number parsing
2015-06-17 11:30:25 -04:00
Joshua Chin
c71e93611b
fixed build process
2015-06-17 11:25:07 -04:00
Joshua Chin
8317ea6d51
updated directory of twitter output
2015-06-16 17:32:58 -04:00
Joshua Chin
da93bc89c2
removed intermediate twitter file rules
2015-06-16 17:28:09 -04:00
Joshua Chin
87f08780c8
improved tokenize_file and updated docstring
2015-06-16 17:27:27 -04:00
Joshua Chin
bea8963a79
renamed pretokenize_twitter to tokenize_twitter, and deleted format_twitter
2015-06-16 17:26:52 -04:00
Joshua Chin
aeedb408b7
fixed bugs and removed unused code
2015-06-16 17:25:06 -04:00
Joshua Chin
64644d8ede
changed tokenizer to only strip t.co urls
2015-06-16 16:11:31 -04:00
Joshua Chin
b649d45e61
Added codepoints U+10FFFE and U+10FFFF to CLD2_BAD_CHAR_RANGE
2015-06-16 16:03:58 -04:00
Joshua Chin
a200a0a689
added tests for the tokenizer and language recognizer
2015-06-16 16:00:14 -04:00
Joshua Chin
1cf7e3d2b9
added pycld2 dependency
2015-06-16 15:06:22 -04:00
Joshua Chin
297d981e20
Replaced Rosette with cld2 language recognizer and wordfreq tokenizer
2015-06-16 14:45:49 -04:00
Rob Speer
b78d8ca3ee
ninja2dot: make a graph of the build process
2015-06-15 13:14:32 -04:00
Rob Speer
56d447a825
Reorganize and document some functions
2015-06-15 12:40:31 -04:00
Rob Speer
3d28491f4d
okay, apparently you can't mix code blocks and bullets
2015-06-01 11:39:42 -04:00
Rob Speer
d202474763
is this indented enough for you, markdown
2015-06-01 11:38:10 -04:00
Rob Speer
9927a8c414
add a README
2015-06-01 11:37:19 -04:00
Rob Speer
cbe3513e08
Tokenize Japanese consistently with MeCab
2015-05-27 17:44:58 -04:00
Rob Speer
536c15fbdb
give mecab a larger buffer
2015-05-26 19:34:46 -04:00
Rob Speer
5de81c7111
fix build rules for Japanese Wikipedia
2015-05-26 18:08:57 -04:00
Rob Speer
3d5b3d47e8
fix version in config.py
2015-05-26 18:08:46 -04:00
Rob Speer
ffd352f148
correct a Leeds bug; add some comments to rules.ninja
2015-05-26 18:08:04 -04:00
Rob Speer
50ff85ce19
add Google Books data for English
2015-05-11 18:44:28 -04:00
Rob Speer
c707b32345
move some functions to the wordfreq package
2015-05-11 17:02:52 -04:00
Rob Speer
d0d777ed91
use a more general-purpose tokenizer, not 'retokenize'
2015-05-08 12:40:14 -04:00
Rob Speer
35128a94ca
build.ninja knows about its own dependencies
2015-05-08 12:40:06 -04:00
Rob Speer
d6cc90792f
Makefile should only be needed for bootstrapping Ninja
2015-05-08 12:39:31 -04:00
Rob Speer
2f14417bcf
limit final builds to languages with >= 2 sources
2015-05-07 23:59:04 -04:00
Rob Speer
1b7a2b9d0b
fix dependency
2015-05-07 23:55:57 -04:00
Rob Speer
abb0e059c8
a reasonably complete build process
2015-05-07 19:38:33 -04:00
Rob Speer
02d8b32119
process leeds and opensubtitles
2015-05-07 17:07:33 -04:00
Rob Speer
7e238cf547
abstract how we define build rules a bit
2015-05-07 16:59:28 -04:00
Rob Speer
d2f9c60776
WIP on more build steps
2015-05-07 16:49:53 -04:00
Rob Speer
16928ed182
add rules to count wikipedia tokens
2015-05-05 15:21:24 -04:00
Rob Speer
bd579e2319
fix the 'count' ninja rule
2015-05-05 14:06:13 -04:00
Rob Speer
5787b6bb73
add and adjust some build steps
- more build steps for Wikipedia
- rename 'tokenize_twitter' to 'pretokenize_twitter' to indicate that
the results are preliminary
2015-05-05 13:59:21 -04:00
Rob Speer
61b9440e3d
add wiki-parsing process
2015-05-04 13:25:01 -04:00