Merge pull request #48 from LuminosoInsight/code-review-notes

Code review notes
Robyn Speer 2017-02-15 12:29:25 -08:00 committed by GitHub
commit ae7bc5764b
2 changed files with 2 additions and 2 deletions


@@ -14,7 +14,7 @@
 - Add automatic transliteration of Serbian text
 - Adjust tokenization of apostrophes next to vowel sounds: the French word
   "l'heure" is now tokenized similarly to "l'arc"
-- Numbers longer than a single digit are smashed into the same word frequency,
+- Multi-digit numbers of each length are smashed into the same word frequency,
   to remove meaningless differences and increase compatibility with word2vec.
   (Internally, their digits are replaced by zeroes.)
 - Another new frequency-merging strategy (drop the highest and lowest,
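
Review note: the reworded changelog entry says that numbers are grouped by length, so "1984" and "2017" share one frequency while "123" gets another. A minimal sketch of that behavior, assuming a plain regex over runs of two or more digits; the names `MULTI_DIGIT_RE` and `smash_numbers` are illustrative assumptions, not necessarily the project's actual code:

```python
import re

# Assumption: a "multi-digit number" is any run of two or more digits.
MULTI_DIGIT_RE = re.compile(r"\d{2,}")


def smash_numbers(token: str) -> str:
    """Replace each digit of every multi-digit run with '0'.

    Single digits are left alone, so the length of the number is the
    only information that survives.
    """
    return MULTI_DIGIT_RE.sub(lambda m: "0" * len(m.group()), token)


if __name__ == "__main__":
    for word in ["1984", "2017", "123", "7", "42nd"]:
        print(word, "->", smash_numbers(word))
```

With this sketch, every four-digit number maps to "0000" and every three-digit number to "000", which matches the "of each length" wording introduced by this hunk.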


@@ -39,7 +39,7 @@ SR_CYRL_TO_LATN_DICT = {
     # letters surrounded by Latin.
     # Russian letters
-    ord('Ё'): 'Jo', ord('ё'): 'Jo',
+    ord('Ё'): 'Jo', ord('ё'): 'jo',
     ord('Й'): 'J', ord('й'): 'j',
     ord('Щ'): 'Šč', ord('щ'): 'šč',
     ord('Ъ'): '', ord('ъ'): '',
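
Review note: a table keyed by `ord(...)` is the shape that Python's built-in `str.translate` accepts, so the case fix can be checked in isolation. The sketch below assumes the dictionary is applied that way and copies only the entries visible in this hunk; `RUSSIAN_EXCERPT` and `transliterate` are hypothetical names used for illustration:

```python
# Excerpt only: the entries visible in this hunk, after the fix.
RUSSIAN_EXCERPT = {
    ord('Ё'): 'Jo', ord('ё'): 'jo',
    ord('Й'): 'J', ord('й'): 'j',
    ord('Щ'): 'Šč', ord('щ'): 'šč',
    ord('Ъ'): '', ord('ъ'): '',
}


def transliterate(text: str) -> str:
    """Map each character through the table; unmapped characters pass through."""
    return text.translate(RUSSIAN_EXCERPT)


if __name__ == "__main__":
    # Before the fix, lowercase 'ё' would have produced 'Jo' here.
    print(transliterate('ёй'))
    print(transliterate('Ёй'))
```

Characters without an entry are left unchanged by `str.translate`, and an empty-string value such as the one for 'ъ' removes the character.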