421 Commits

Author SHA1 Message Date
Robyn Speer
59409266ca add and adjust some build steps
- more build steps for Wikipedia
- rename 'tokenize_twitter' to 'pretokenize_twitter' to indicate that
  the results are preliminary
2015-05-05 13:59:21 -04:00
Robyn Speer
d5d24cb098 set version number to 0.8
Former-commit-id: 2f3bb955d1
2015-05-05 12:05:00 -04:00
Robyn Speer
c9ca5b94b0 Merge branch 'dutch-201504' into ftfy4
Conflicts:
	setup.py

Former-commit-id: 24a7c73e6d
2015-05-05 12:04:44 -04:00
Robyn Speer
922c658b68 require ftfy 4
Former-commit-id: 70b2c678ea
2015-05-05 12:04:13 -04:00
Robyn Speer
33c5f78c07 add wiki-parsing process 2015-05-04 13:25:01 -04:00
Robyn Speer
9ac4838eda not using wordfreq.cfg anymore 2015-04-30 16:25:42 -04:00
Robyn Speer
efcf436112 WIP on new build system 2015-04-30 16:24:28 -04:00
Robyn Speer
98b3e51da1 use script codes for Chinese 2015-04-30 13:02:58 -04:00
Robyn Speer
cb6b2a8002 v0.7: make a proper Dutch 'surfaces' list
Former-commit-id: 873ace87db
2015-04-30 13:01:24 -04:00
Robyn Speer
76ea7f1bd5 define some ninja rules 2015-04-29 17:13:58 -04:00
Robyn Speer
524f7c760b WIP on Ninja build automation 2015-04-29 15:59:06 -04:00
Robyn Speer
f77a61e675 move commands into cli/ directory 2015-04-29 15:22:04 -04:00
Robyn Speer
2bf8870832 always use surface forms 2015-04-29 15:17:00 -04:00
Robyn Speer
b4dfdaa47c Merge branch 'master' into dutch-201503
Conflicts:
	wordfreq/build.py

Former-commit-id: 6cf46ee5aa
2015-04-29 14:36:24 -04:00
Robyn Speer
38261a6a0a handle multi-word stems correctly 2015-04-29 13:45:53 -04:00
Robyn Speer
d29e8bfddf start a new multilingual wordlist called 'stems'
So far, this wordlist is only in Dutch.

Former-commit-id: af5f65b328
2015-03-31 15:59:30 -04:00
Robyn Speer
6b57075275 revise the process of building Wikipedia counts 2015-03-30 18:09:07 -04:00
Robyn Speer
56e811be19 Fix Dutch lists
- Use surface forms consistently, not stems
- Count all instances of words on Wikipedia, not one per article

Former-commit-id: 3507d8b630
2015-03-12 16:00:03 -04:00
Andrew Lin
6e98ca9822 Merge pull request #3 from LuminosoInsight/variable_name_fix
Fix a variable name for clarity.

Former-commit-id: cfe58cd899
2015-03-11 14:10:53 -04:00
Robyn Speer
ca944e54aa new Dutch data, bump version to 0.6
Former-commit-id: 377336bcdc
2015-03-03 15:54:45 -05:00
Andrew Lin
6882ac9f0e Fix a variable name for clarity.
Former-commit-id: 434c603798
2015-03-03 11:59:46 -05:00
Andrew Lin
39d914f8e1 Merge pull request #2 from LuminosoInsight/new-twitter-lists
New twitter lists

Former-commit-id: 5a4d3a87d5
2015-02-17 15:36:13 -05:00
Robyn Speer
ad22387a53 add surface forms from Twitter 2014 data
Former-commit-id: ffdaa82b11
2015-02-17 15:06:11 -05:00
Robyn Speer
8d57b39a7b stop running 'remove_unsafe_private_use' unnecessarily
Former-commit-id: b6f246ecbb
2015-02-17 14:02:36 -05:00
Robyn Speer
bc780c63c8 enable wordlist balancing, surface form counting 2015-02-17 13:43:22 -05:00
Robyn Speer
f4280dcad0 add twitter-stems-2014 wordlist data
Former-commit-id: 6ab72201cd
2015-02-11 13:29:32 -05:00
Robyn Speer
07e61be7e3 add utility for combining wordlists 2015-02-11 11:45:10 -05:00
Robyn Speer
23bd5ba76c command-line entry points 2015-02-10 12:28:29 -05:00
Robyn Speer
8b322ce534 Initial commit 2015-02-04 20:19:36 -05:00
Robyn Speer
03fac20b1b Allow multithreaded SQLite on Python 3
Former-commit-id: bf0071fd8b
2014-10-02 18:10:09 -04:00
Robyn Speer
5153faf43e construct the download path correctly, even on Windows
Former-commit-id: 6d90cef415
2014-09-08 10:56:48 -04:00
Robyn Speer
0c61406cdc remove unused global
Former-commit-id: c55a701885
2014-09-02 14:29:31 -04:00
Robyn Speer
b357ffaa09 cleanups to building and uploading, from code review
Former-commit-id: 5dee417302
2014-08-18 14:14:01 -04:00
Robyn Speer
759534392f Add license text for the whole package
Former-commit-id: cb7b2b76e6
2014-06-02 16:37:32 -04:00
Robyn Speer
a06c3fc648 A different plan for the top-level word_frequency function.
When, before, I was importing wordfreq.query at the top level, this
created a dependency loop when installing wordfreq.

The new top-level __init__.py provides just a `word_frequency` function,
which imports the real function as needed and calls it. This should
avoid the dependency loop, at the cost of making
`wordfreq.word_frequency` slightly less efficient than
`wordfreq.query.word_frequency`.

Former-commit-id: 44ccf40742
2014-02-24 18:03:31 -05:00
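
The deferred-import approach described in commit a06c3fc648 might look roughly like the sketch below. This is not the code from that commit: `lazy_function` is a hypothetical helper, and `math.sqrt` stands in for `wordfreq.query.word_frequency` so the example runs without wordfreq installed.

```python
import importlib


def lazy_function(module_name, function_name):
    """Return a wrapper that imports `module_name` only when it is called.

    Hypothetical helper for illustration; not part of wordfreq.
    """
    def wrapper(*args, **kwargs):
        # The import happens here, at call time, rather than when the
        # top-level package is loaded, so installing or importing the
        # package does not pull in the heavier module.
        module = importlib.import_module(module_name)
        real_function = getattr(module, function_name)
        return real_function(*args, **kwargs)

    wrapper.__name__ = function_name
    return wrapper


# Stand-in for the top-level name. In wordfreq this would wrap
# `wordfreq.query.word_frequency`, per the commit message above.
lazy_sqrt = lazy_function("math", "sqrt")
```

Each call pays for a module lookup before reaching the real function, which is the small efficiency cost the commit message mentions.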
Robyn Speer
b6b3a6f5f6 version 0.4: minor code changes, debugged database
- The database is built under Python 3.3.2, so it should correctly
  implement Python 3's Unicode tricks, including special handling
  of Greek lowercase letters. (Version 0.3 was supposed to do this
  as well, but apparently, it didn't.)
- `word_frequency` and `iter_wordlist` can be imported from the
  top level.
- The new function `random_words` supplies a string made from
  random words that are sufficiently high in rank order.

Former-commit-id: 3702a7c8d0
2014-02-24 16:29:06 -05:00
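
As a rough illustration of what commit b6b3a6f5f6 describes for `random_words`, the sketch below joins words drawn at random from the top of a ranked list. The signature, the parameter names (`ranked_words`, `nwords`, `max_rank`, `rng`), and the use of an in-memory list are all assumptions; the real function may differ in every one of these.

```python
import random


def random_words(ranked_words, nwords=4, max_rank=1000, rng=random):
    """Join `nwords` words chosen at random from the top `max_rank` entries.

    Assumes `ranked_words` is a list sorted from most to least frequent.
    This signature is hypothetical, not wordfreq's actual API.
    """
    # Keep only words that rank highly enough to be considered.
    candidates = ranked_words[:max_rank]
    # Draw with replacement and join into a single space-separated string.
    return " ".join(rng.choice(candidates) for _ in range(nwords))
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible, which is convenient in tests.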
Robyn Speer
207defe6ff Sometimes you need some random words.
Former-commit-id: 3447ae732e
2014-01-06 15:51:10 -05:00
Andrew Lin
181e8e08fa Remove the tests for metanl_word_frequency too. Doh.
Former-commit-id: 68d262791c
2013-11-11 13:21:25 -05:00
Robyn Speer
f369df3e82 Merge pull request #1 from LuminosoInsight/remove_metanl_wf
Remove metanl_word_frequency(), which we no longer need.

Former-commit-id: 63bebe6ad3
2013-11-11 10:13:25 -08:00
Robyn Speer
634cf6af6d data is now hosted on wordfreq.services.luminoso.com
Former-commit-id: 56f2c606f1
2013-11-07 14:43:15 -05:00
Andrew Lin
cf45720f66 Remove metanl_word_frequency(), which we no longer need.
Former-commit-id: 76a7267670
2013-11-04 16:51:25 -05:00
Robyn Speer
5f7c7e032c Clear wordlists before inserting them; yell at Python 2
Former-commit-id: 823b3828cd
2013-11-01 19:29:37 -04:00
Robyn Speer
5fc933495f Revert "code review and pep8 fixes"
This reverts commit ae6e03fa06 [formerly b4b8ba8be7].

Conflicts:
	wordfreq/transfer.py

Former-commit-id: 5c8ba34492
2013-11-01 17:33:39 -04:00
Robyn Speer
4d904a3bae Merge branch 'master' of github.com:LuminosoInsight/wordfreq
Conflicts:
	wordfreq/transfer.py

Former-commit-id: 90e042f196
2013-11-01 17:05:59 -04:00
Robyn Speer
ae6e03fa06 code review and pep8 fixes
Former-commit-id: b4b8ba8be7
2013-11-01 17:05:12 -04:00
Lance Nathan
cbb3207e4f Two small stylistic tweaks
Former-commit-id: ea29469643
2013-10-31 16:00:48 -04:00
Robyn Speer
5168da105a make the tests less picky about numerical exactness
Former-commit-id: 2b2bd943d2
2013-10-31 15:43:19 -04:00
Robyn Speer
313306f12e try to match the wordlist metanl actually uses
Former-commit-id: 90772e33fb
2013-10-31 15:13:22 -04:00
Robyn Speer
773f6b9843 The metanl scale is not what I thought it was.
Former-commit-id: 0d2fb21726
2013-10-31 14:38:01 -04:00
Robyn Speer
351378e318 Don't download the DB if the right version is already there
Former-commit-id: e931062b5a
2013-10-31 14:12:04 -04:00