mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 09:21:37 +00:00
Merge pull request #48 from LuminosoInsight/code-review-notes
Code review notes
commit
ae7bc5764b
@@ -14,7 +14,7 @@
 - Add automatic transliteration of Serbian text
 - Adjust tokenization of apostrophes next to vowel sounds: the French word
   "l'heure" is now tokenized similarly to "l'arc"
-- Numbers longer than a single digit are smashed into the same word frequency,
+- Multi-digit numbers of each length are smashed into the same word frequency,
   to remove meaningless differences and increase compatibility with word2vec.
   (Internally, their digits are replaced by zeroes.)
 - Another new frequency-merging strategy (drop the highest and lowest,
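The number-smashing entry in this hunk can be illustrated with a minimal sketch. It assumes that a "multi-digit number" is a run of two or more consecutive digits; the function name `smash_numbers` and the regular expression are illustrative choices here and may differ from what wordfreq does internally.

```python
import re

# Assumption: a multi-digit number is two or more consecutive digits.
MULTI_DIGIT_RE = re.compile(r"\d{2,}")


def smash_numbers(text):
    """Replace every digit of each multi-digit run with '0'.

    Runs of the same length collapse to the same token, so they would
    share one frequency entry. Single digits are left untouched.
    """
    return MULTI_DIGIT_RE.sub(lambda m: "0" * len(m.group(0)), text)


print(smash_numbers("1984"))          # four digits become four zeroes
print(smash_numbers("call 911 now"))  # three digits become three zeroes
print(smash_numbers("7"))             # single digit is unchanged
```

Because the replacement keeps the original length, "1984" and "2024" end up as the same token while "911" stays distinct from both, which matches the "of each length" wording in the revised changelog line.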
@@ -39,7 +39,7 @@ SR_CYRL_TO_LATN_DICT = {
     # letters surrounded by Latin.
 
     # Russian letters
-    ord('Ё'): 'Jo', ord('ё'): 'Jo',
+    ord('Ё'): 'Jo', ord('ё'): 'jo',
     ord('Й'): 'J', ord('й'): 'j',
     ord('Щ'): 'Šč', ord('щ'): 'šč',
     ord('Ъ'): '', ord('ъ'): '',
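A table keyed by `ord(...)` like this one has the shape that Python's built-in `str.translate` accepts. The sketch below shows how such a table can be applied and why the case fix matters. It uses only the entries visible in this hunk; the names `EXCERPT_CYRL_TO_LATN` and `transliterate_excerpt` are hypothetical, and the real `SR_CYRL_TO_LATN_DICT` contains many more letters.

```python
# Excerpt only: the post-fix entries visible in this hunk.
# Hypothetical name; the real table is SR_CYRL_TO_LATN_DICT.
EXCERPT_CYRL_TO_LATN = {
    ord('Ё'): 'Jo', ord('ё'): 'jo',
    ord('Й'): 'J', ord('й'): 'j',
    ord('Щ'): 'Šč', ord('щ'): 'šč',
    ord('Ъ'): '', ord('ъ'): '',
}


def transliterate_excerpt(text):
    """Apply the excerpt table with str.translate.

    str.translate looks up each character's code point in the table;
    characters without an entry pass through unchanged, and an empty
    string as the value deletes the character.
    """
    return text.translate(EXCERPT_CYRL_TO_LATN)


print(transliterate_excerpt('ёй'))  # lowercase input stays lowercase
print(transliterate_excerpt('Ёй'))  # only the capital letter is capitalized
print(transliterate_excerpt('щъ'))  # the hard sign is dropped
```

With the pre-fix mapping, a lowercase 'ё' in the middle of a word would have produced a capitalized 'Jo', which is the inconsistency this hunk corrects.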