mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 09:21:37 +00:00
Merge pull request #42 from LuminosoInsight/mecab-finder
Look for MeCab dictionaries in various places besides this package
Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628
commit e4b32afa18
74
README.md
@@ -1,4 +1,5 @@
|
|||||||
Tools for working with word frequencies from various corpora.
|
wordfreq is a Python library for looking up the frequencies of words in many
|
||||||
|
languages, based on many sources of data.
|
||||||
|
|
||||||
Author: Robyn Speer
|
Author: Robyn Speer
|
||||||
|
|
||||||
@@ -15,31 +16,76 @@ or by getting the repository and running its setup.py:
|
|||||||
|
|
||||||
python3 setup.py install
|
python3 setup.py install
|
||||||
|
|
||||||
Japanese and Chinese have additional external dependencies so that they can be
|
|
||||||
tokenized correctly.
|
|
||||||
|
|
||||||
To be able to look up word frequencies in Japanese, you need to additionally
|
## Additional CJK installation
|
||||||
install mecab-python3, which itself depends on libmecab-dev and its dictionary.
|
|
||||||
These commands will install them on Ubuntu:
|
|
||||||
|
|
||||||
sudo apt-get install mecab-ipadic-utf8 libmecab-dev
|
Chinese, Japanese, and Korean have additional external dependencies so that
|
||||||
pip3 install mecab-python3
|
they can be tokenized correctly. Here we'll explain how to set them up,
|
||||||
|
in increasing order of difficulty.
|
||||||
|
|
||||||
|
|
||||||
|
### Chinese
|
||||||
|
|
||||||
To be able to look up word frequencies in Chinese, you need Jieba, a
|
To be able to look up word frequencies in Chinese, you need Jieba, a
|
||||||
pure-Python Chinese tokenizer:
|
pure-Python Chinese tokenizer:
|
||||||
|
|
||||||
pip3 install jieba
|
pip3 install jieba
|
||||||
|
|
||||||
These dependencies can also be requested as options when installing wordfreq.
|
|
||||||
For example:
|
|
||||||
|
|
||||||
pip3 install wordfreq[mecab,jieba]
|
### Japanese
|
||||||
|
|
||||||
|
We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
|
||||||
|
things need to be installed:
|
||||||
|
|
||||||
|
* The MeCab development library (called `libmecab-dev` on Ubuntu)
|
||||||
|
* The UTF-8 version of the `ipadic` Japanese dictionary
|
||||||
|
(called `mecab-ipadic-utf8` on Ubuntu)
|
||||||
|
* The `mecab-python3` Python interface
|
||||||
|
|
||||||
|
To install these three things on Ubuntu, you can run:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
sudo apt-get install libmecab-dev mecab-ipadic-utf8
|
||||||
|
pip3 install mecab-python3
|
||||||
|
```
|
||||||
|
|
||||||
|
If you choose to install `ipadic` from somewhere else or from its source code,
|
||||||
|
be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
|
||||||
|
give you nonsense results.
|
||||||
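A quick way to sanity-check a dictionary installed from another source is to read the `config-charset` entry in its `dicrc` file. The following is only a minimal sketch, not part of wordfreq: `dicrc_charset` is a hypothetical helper, and it assumes that the charset declared for the configuration files matches the charset the dictionary was compiled with, which is not guaranteed.

```python
import re
from pathlib import Path


def dicrc_charset(dic_dir):
    """Return the config-charset value declared in <dic_dir>/dicrc, or None.

    Hypothetical helper for illustration; wordfreq does not provide it.
    """
    dicrc = Path(dic_dir) / "dicrc"
    if not dicrc.is_file():
        return None
    # dicrc is mostly ASCII; replace undecodable bytes instead of failing.
    text = dicrc.read_text(encoding="utf-8", errors="replace")
    for line in text.splitlines():
        # Comment lines start with ';' or '#', so they never match here.
        match = re.match(r"\s*config-charset\s*=\s*(\S+)", line)
        if match:
            return match.group(1)
    return None
```

Calling `dicrc_charset` on the directory where your dictionary was installed should report `UTF-8`; a value such as `EUC-JP` suggests the dictionary needs to be rebuilt.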
|
|
||||||
|
|
||||||
|
### Korean
|
||||||
|
|
||||||
|
Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
|
||||||
|
Yungho Yu. This dictionary is not available as an Ubuntu package.
|
||||||
|
|
||||||
|
Here's a process you can use to install the Korean dictionary and the other
|
||||||
|
MeCab dependencies:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
sudo apt-get install libmecab-dev mecab-utils
|
||||||
|
pip3 install mecab-python3
|
||||||
|
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
|
||||||
|
tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
|
||||||
|
cd mecab-ko-dic-2.0.1-20150920
|
||||||
|
./autogen.sh
|
||||||
|
make
|
||||||
|
sudo make install
|
||||||
|
```
|
||||||
|
|
||||||
|
If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
|
||||||
|
tokenize those languages, it will raise an error and show you the list of
|
||||||
|
paths it searched.
|
||||||
|
|
||||||
|
Sorry that this is difficult. We tried to just package the data files we need
|
||||||
|
with wordfreq, like we do for Chinese, but PyPI would reject the package for
|
||||||
|
being too large.
|
||||||
|
|
||||||
|
|
||||||
## Usage
|
## Usage
|
||||||
|
|
||||||
wordfreq provides access to estimates of the frequency with which a word is
|
wordfreq provides access to estimates of the frequency with which a word is
|
||||||
used, in 18 languages (see *Supported languages* below).
|
used, in 27 languages (see *Supported languages* below).
|
||||||
|
|
||||||
It provides three kinds of pre-built wordlists:
|
It provides three kinds of pre-built wordlists:
|
||||||
|
|
||||||
@@ -288,10 +334,6 @@ The terms of use of this data are:
|
|||||||
acknowledgement of Google Books Ngram Viewer as the source, and inclusion
|
acknowledgement of Google Books Ngram Viewer as the source, and inclusion
|
||||||
of a link to http://books.google.com/ngrams, would be appreciated.
|
of a link to http://books.google.com/ngrams, would be appreciated.
|
||||||
|
|
||||||
`wordfreq` uses MeCab, by Taku Kudo, plus Korean data files by Yongwoon Lee and
|
|
||||||
Yungho Yu. The Korean data is under an Apache 2 license, a copy of which
|
|
||||||
appears in wordfreq/data/mecab-ko-dic/COPYING.
|
|
||||||
|
|
||||||
`wordfreq` also contains data derived from the following Creative Commons-licensed
|
`wordfreq` also contains data derived from the following Creative Commons-licensed
|
||||||
sources:
|
sources:
|
||||||
|
|
||||||
|
Binary file not shown.
Binary file not shown.
@@ -1,29 +0,0 @@
|
|||||||
;
|
|
||||||
; Configuration file of IPADIC
|
|
||||||
;
|
|
||||||
; $Id: dicrc,v 1.4 2006/04/08 06:41:36 taku-ku Exp $;
|
|
||||||
;
|
|
||||||
cost-factor = 800
|
|
||||||
bos-feature = BOS/EOS,*,*,*,*,*,*,*,*
|
|
||||||
eval-size = 8
|
|
||||||
unk-eval-size = 4
|
|
||||||
config-charset = UTF-8
|
|
||||||
|
|
||||||
; yomi
|
|
||||||
node-format-yomi = %pS%f[7]
|
|
||||||
unk-format-yomi = %M
|
|
||||||
eos-format-yomi = \n
|
|
||||||
|
|
||||||
; simple
|
|
||||||
node-format-simple = %m\t%F-[0,1,2,3]\n
|
|
||||||
eos-format-simple = EOS\n
|
|
||||||
|
|
||||||
; ChaSen
|
|
||||||
node-format-chasen = %m\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
|
|
||||||
unk-format-chasen = %m\t%m\t%m\t%F-[0,1,2,3]\t\t\n
|
|
||||||
eos-format-chasen = EOS\n
|
|
||||||
|
|
||||||
; ChaSen (include spaces)
|
|
||||||
node-format-chasen2 = %M\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
|
|
||||||
unk-format-chasen2 = %M\t%m\t%m\t%F-[0,1,2,3]\t\t\n
|
|
||||||
eos-format-chasen2 = EOS\n
|
|
Binary file not shown.
@@ -1 +0,0 @@
|
|||||||
c926154d533ccaef1515af6883056d69c34ca239
|
|
Binary file not shown.
@@ -1,201 +0,0 @@
|
|||||||
Apache License
|
|
||||||
Version 2.0, January 2004
|
|
||||||
http://www.apache.org/licenses/
|
|
||||||
|
|
||||||
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
|
|
||||||
|
|
||||||
1. Definitions.
|
|
||||||
|
|
||||||
"License" shall mean the terms and conditions for use, reproduction,
|
|
||||||
and distribution as defined by Sections 1 through 9 of this document.
|
|
||||||
|
|
||||||
"Licensor" shall mean the copyright owner or entity authorized by
|
|
||||||
the copyright owner that is granting the License.
|
|
||||||
|
|
||||||
"Legal Entity" shall mean the union of the acting entity and all
|
|
||||||
other entities that control, are controlled by, or are under common
|
|
||||||
control with that entity. For the purposes of this definition,
|
|
||||||
"control" means (i) the power, direct or indirect, to cause the
|
|
||||||
direction or management of such entity, whether by contract or
|
|
||||||
otherwise, or (ii) ownership of fifty percent (50%) or more of the
|
|
||||||
outstanding shares, or (iii) beneficial ownership of such entity.
|
|
||||||
|
|
||||||
"You" (or "Your") shall mean an individual or Legal Entity
|
|
||||||
exercising permissions granted by this License.
|
|
||||||
|
|
||||||
"Source" form shall mean the preferred form for making modifications,
|
|
||||||
including but not limited to software source code, documentation
|
|
||||||
source, and configuration files.
|
|
||||||
|
|
||||||
"Object" form shall mean any form resulting from mechanical
|
|
||||||
transformation or translation of a Source form, including but
|
|
||||||
not limited to compiled object code, generated documentation,
|
|
||||||
and conversions to other media types.
|
|
||||||
|
|
||||||
"Work" shall mean the work of authorship, whether in Source or
|
|
||||||
Object form, made available under the License, as indicated by a
|
|
||||||
copyright notice that is included in or attached to the work
|
|
||||||
(an example is provided in the Appendix below).
|
|
||||||
|
|
||||||
"Derivative Works" shall mean any work, whether in Source or Object
|
|
||||||
form, that is based on (or derived from) the Work and for which the
|
|
||||||
editorial revisions, annotations, elaborations, or other modifications
|
|
||||||
represent, as a whole, an original work of authorship. For the purposes
|
|
||||||
of this License, Derivative Works shall not include works that remain
|
|
||||||
separable from, or merely link (or bind by name) to the interfaces of,
|
|
||||||
the Work and Derivative Works thereof.
|
|
||||||
|
|
||||||
"Contribution" shall mean any work of authorship, including
|
|
||||||
the original version of the Work and any modifications or additions
|
|
||||||
to that Work or Derivative Works thereof, that is intentionally
|
|
||||||
submitted to Licensor for inclusion in the Work by the copyright owner
|
|
||||||
or by an individual or Legal Entity authorized to submit on behalf of
|
|
||||||
the copyright owner. For the purposes of this definition, "submitted"
|
|
||||||
means any form of electronic, verbal, or written communication sent
|
|
||||||
to the Licensor or its representatives, including but not limited to
|
|
||||||
communication on electronic mailing lists, source code control systems,
|
|
||||||
and issue tracking systems that are managed by, or on behalf of, the
|
|
||||||
Licensor for the purpose of discussing and improving the Work, but
|
|
||||||
excluding communication that is conspicuously marked or otherwise
|
|
||||||
designated in writing by the copyright owner as "Not a Contribution."
|
|
||||||
|
|
||||||
"Contributor" shall mean Licensor and any individual or Legal Entity
|
|
||||||
on behalf of whom a Contribution has been received by Licensor and
|
|
||||||
subsequently incorporated within the Work.
|
|
||||||
|
|
||||||
2. Grant of Copyright License. Subject to the terms and conditions of
|
|
||||||
this License, each Contributor hereby grants to You a perpetual,
|
|
||||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
|
||||||
copyright license to reproduce, prepare Derivative Works of,
|
|
||||||
publicly display, publicly perform, sublicense, and distribute the
|
|
||||||
Work and such Derivative Works in Source or Object form.
|
|
||||||
|
|
||||||
3. Grant of Patent License. Subject to the terms and conditions of
|
|
||||||
this License, each Contributor hereby grants to You a perpetual,
|
|
||||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
|
||||||
(except as stated in this section) patent license to make, have made,
|
|
||||||
use, offer to sell, sell, import, and otherwise transfer the Work,
|
|
||||||
where such license applies only to those patent claims licensable
|
|
||||||
by such Contributor that are necessarily infringed by their
|
|
||||||
Contribution(s) alone or by combination of their Contribution(s)
|
|
||||||
with the Work to which such Contribution(s) was submitted. If You
|
|
||||||
institute patent litigation against any entity (including a
|
|
||||||
cross-claim or counterclaim in a lawsuit) alleging that the Work
|
|
||||||
or a Contribution incorporated within the Work constitutes direct
|
|
||||||
or contributory patent infringement, then any patent licenses
|
|
||||||
granted to You under this License for that Work shall terminate
|
|
||||||
as of the date such litigation is filed.
|
|
||||||
|
|
||||||
4. Redistribution. You may reproduce and distribute copies of the
|
|
||||||
Work or Derivative Works thereof in any medium, with or without
|
|
||||||
modifications, and in Source or Object form, provided that You
|
|
||||||
meet the following conditions:
|
|
||||||
|
|
||||||
(a) You must give any other recipients of the Work or
|
|
||||||
Derivative Works a copy of this License; and
|
|
||||||
|
|
||||||
(b) You must cause any modified files to carry prominent notices
|
|
||||||
stating that You changed the files; and
|
|
||||||
|
|
||||||
(c) You must retain, in the Source form of any Derivative Works
|
|
||||||
that You distribute, all copyright, patent, trademark, and
|
|
||||||
attribution notices from the Source form of the Work,
|
|
||||||
excluding those notices that do not pertain to any part of
|
|
||||||
the Derivative Works; and
|
|
||||||
|
|
||||||
(d) If the Work includes a "NOTICE" text file as part of its
|
|
||||||
distribution, then any Derivative Works that You distribute must
|
|
||||||
include a readable copy of the attribution notices contained
|
|
||||||
within such NOTICE file, excluding those notices that do not
|
|
||||||
pertain to any part of the Derivative Works, in at least one
|
|
||||||
of the following places: within a NOTICE text file distributed
|
|
||||||
as part of the Derivative Works; within the Source form or
|
|
||||||
documentation, if provided along with the Derivative Works; or,
|
|
||||||
within a display generated by the Derivative Works, if and
|
|
||||||
wherever such third-party notices normally appear. The contents
|
|
||||||
of the NOTICE file are for informational purposes only and
|
|
||||||
do not modify the License. You may add Your own attribution
|
|
||||||
notices within Derivative Works that You distribute, alongside
|
|
||||||
or as an addendum to the NOTICE text from the Work, provided
|
|
||||||
that such additional attribution notices cannot be construed
|
|
||||||
as modifying the License.
|
|
||||||
|
|
||||||
You may add Your own copyright statement to Your modifications and
|
|
||||||
may provide additional or different license terms and conditions
|
|
||||||
for use, reproduction, or distribution of Your modifications, or
|
|
||||||
for any such Derivative Works as a whole, provided Your use,
|
|
||||||
reproduction, and distribution of the Work otherwise complies with
|
|
||||||
the conditions stated in this License.
|
|
||||||
|
|
||||||
5. Submission of Contributions. Unless You explicitly state otherwise,
|
|
||||||
any Contribution intentionally submitted for inclusion in the Work
|
|
||||||
by You to the Licensor shall be under the terms and conditions of
|
|
||||||
this License, without any additional terms or conditions.
|
|
||||||
Notwithstanding the above, nothing herein shall supersede or modify
|
|
||||||
the terms of any separate license agreement you may have executed
|
|
||||||
with Licensor regarding such Contributions.
|
|
||||||
|
|
||||||
6. Trademarks. This License does not grant permission to use the trade
|
|
||||||
names, trademarks, service marks, or product names of the Licensor,
|
|
||||||
except as required for reasonable and customary use in describing the
|
|
||||||
origin of the Work and reproducing the content of the NOTICE file.
|
|
||||||
|
|
||||||
7. Disclaimer of Warranty. Unless required by applicable law or
|
|
||||||
agreed to in writing, Licensor provides the Work (and each
|
|
||||||
Contributor provides its Contributions) on an "AS IS" BASIS,
|
|
||||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
|
||||||
implied, including, without limitation, any warranties or conditions
|
|
||||||
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
|
|
||||||
PARTICULAR PURPOSE. You are solely responsible for determining the
|
|
||||||
appropriateness of using or redistributing the Work and assume any
|
|
||||||
risks associated with Your exercise of permissions under this License.
|
|
||||||
|
|
||||||
8. Limitation of Liability. In no event and under no legal theory,
|
|
||||||
whether in tort (including negligence), contract, or otherwise,
|
|
||||||
unless required by applicable law (such as deliberate and grossly
|
|
||||||
negligent acts) or agreed to in writing, shall any Contributor be
|
|
||||||
liable to You for damages, including any direct, indirect, special,
|
|
||||||
incidental, or consequential damages of any character arising as a
|
|
||||||
result of this License or out of the use or inability to use the
|
|
||||||
Work (including but not limited to damages for loss of goodwill,
|
|
||||||
work stoppage, computer failure or malfunction, or any and all
|
|
||||||
other commercial damages or losses), even if such Contributor
|
|
||||||
has been advised of the possibility of such damages.
|
|
||||||
|
|
||||||
9. Accepting Warranty or Additional Liability. While redistributing
|
|
||||||
the Work or Derivative Works thereof, You may choose to offer,
|
|
||||||
and charge a fee for, acceptance of support, warranty, indemnity,
|
|
||||||
or other liability obligations and/or rights consistent with this
|
|
||||||
License. However, in accepting such obligations, You may act only
|
|
||||||
on Your own behalf and on Your sole responsibility, not on behalf
|
|
||||||
of any other Contributor, and only if You agree to indemnify,
|
|
||||||
defend, and hold each Contributor harmless for any liability
|
|
||||||
incurred by, or claims asserted against, such Contributor by reason
|
|
||||||
of your accepting any such warranty or additional liability.
|
|
||||||
|
|
||||||
END OF TERMS AND CONDITIONS
|
|
||||||
|
|
||||||
APPENDIX: How to apply the Apache License to your work.
|
|
||||||
|
|
||||||
To apply the Apache License to your work, attach the following
|
|
||||||
boilerplate notice, with the fields enclosed by brackets "[]"
|
|
||||||
replaced with your own identifying information. (Don't include
|
|
||||||
the brackets!) The text should be enclosed in the appropriate
|
|
||||||
comment syntax for the file format. We also recommend that a
|
|
||||||
file or class name and description of purpose be included on the
|
|
||||||
same "printed page" as the copyright notice for easier
|
|
||||||
identification within third-party archives.
|
|
||||||
|
|
||||||
Copyright [yyyy] [name of copyright owner]
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License");
|
|
||||||
you may not use this file except in compliance with the License.
|
|
||||||
You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software
|
|
||||||
distributed under the License is distributed on an "AS IS" BASIS,
|
|
||||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
||||||
See the License for the specific language governing permissions and
|
|
||||||
limitations under the License.
|
|
Binary file not shown.
@@ -1,25 +0,0 @@
|
|||||||
;
|
|
||||||
; Configuration file of mecab-ko-dic
|
|
||||||
;
|
|
||||||
|
|
||||||
# Scaling factor used when converting to cost values. Values from 700 to 800 work fine.
|
|
||||||
cost-factor = 800
|
|
||||||
# Features for the beginning and end of a sentence, written as CSV.
|
|
||||||
bos-feature = BOS/EOS,*,*,*,*,*,*,*,*
|
|
||||||
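# For known words, specifies how many features, counting from the first, must match
|
|
||||||
# for the result to count as correct. Known words usually only need part-of-speech and inflection information
|
|
||||||
# to match, so the "reading" and "pronunciation" features are ignored.
|
|
||||||
# Here, three are evaluated.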
|
|
||||||
eval-size = 4
|
|
||||||
# For unknown words,
|
|
||||||
# specifies how many features, counting from the first, must match for the result to count as correct.
|
|
||||||
unk-eval-size = 2
|
|
||||||
# Character encoding of the dicrc, char.def, unk.def, and pos-id.def files.
|
|
||||||
config-charset = UTF-8
|
|
||||||
# Setting that increases the connection cost of parts of speech preceded by a space.
|
|
||||||
# This setting is used only by mecab-ko. It has the following format:
|
|
||||||
# <posid 1>,<posid 1 penalty cost>,<posid 2>,<posid 2 penalty cost>...
|
|
||||||
#
|
|
||||||
# e.g. 120,6000 => when the part of speech with posid 120 (a particle) is preceded by a space,
|
|
||||||
# increase the connection cost by 6000
|
|
||||||
left-space-penalty-factor = 100,3000,120,6000,172,3000,183,3000,184,3000,185,3000,200,3000,210,6000,220,3000,221,3000,222,3000,230,3000
|
|
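The `left-space-penalty-factor` value above is a flat list of alternating `posid` and penalty-cost integers. As an illustration of that format only, here is a minimal sketch that turns such a value into a mapping; `parse_left_space_penalties` is a hypothetical name and is not part of MeCab, mecab-ko-dic, or wordfreq.

```python
def parse_left_space_penalties(value):
    """Parse '<posid>,<cost>,<posid>,<cost>,...' into a {posid: cost} dict.

    Hypothetical helper for illustration only.
    """
    numbers = [int(field) for field in value.split(",")]
    if len(numbers) % 2 != 0:
        raise ValueError("expected an even number of comma-separated integers")
    # Even positions hold posids, odd positions hold the matching costs.
    return dict(zip(numbers[0::2], numbers[1::2]))
```

For instance, passing the first four fields of the setting, `"100,3000,120,6000"`, pairs posid 100 with a cost of 3000 and posid 120 with a cost of 6000.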
File diff suppressed because it is too large
@@ -1 +0,0 @@
|
|||||||
2dbb57fe707d7dddd2392526aad7cbac77378bb3
|
|
@@ -1 +0,0 @@
|
|||||||
58619494b4f81190218b76d9d2090607830e51ec
|
|
@@ -1,66 +0,0 @@
|
|||||||
UNKNOWN,*,*,*,*,*,*,*,* 0
|
|
||||||
*,*,*,*,Compound,*,*,*,* 1
|
|
||||||
*,*,*,*,Inflect,EC,*,*,* 200
|
|
||||||
*,*,*,*,Inflect,EF,*,*,* 200
|
|
||||||
*,*,*,*,Inflect,EP,*,*,* 200
|
|
||||||
*,*,*,*,Inflect,ETM,*,*,* 200
|
|
||||||
*,*,*,*,Inflect,ETN,*,*,* 200
|
|
||||||
*,*,*,*,Inflect,JC,*,*,* 210
|
|
||||||
*,*,*,*,Inflect,JKB,*,*,* 210
|
|
||||||
*,*,*,*,Inflect,JKC,*,*,* 210
|
|
||||||
*,*,*,*,Inflect,JKG,*,*,* 210
|
|
||||||
*,*,*,*,Inflect,JKO,*,*,* 210
|
|
||||||
*,*,*,*,Inflect,JKQ,*,*,* 210
|
|
||||||
*,*,*,*,Inflect,JKS,*,*,* 210
|
|
||||||
*,*,*,*,Inflect,JKV,*,*,* 210
|
|
||||||
*,*,*,*,Inflect,JX,*,*,* 210
|
|
||||||
*,*,*,*,Inflect,XSA,*,*,* 220
|
|
||||||
*,*,*,*,Inflect,XSN,*,*,* 221
|
|
||||||
*,*,*,*,Inflect,XSV,*,*,* 222
|
|
||||||
*,*,*,*,Inflect,VCP,*,*,* 230
|
|
||||||
*,*,*,*,Inflect,*,*,*,* 2
|
|
||||||
*,*,*,*,Preanalysis,*,*,*,* 3
|
|
||||||
EC,*,*,*,*,*,*,*,* 100
|
|
||||||
EF,*,*,*,*,*,*,*,* 100
|
|
||||||
EP,*,*,*,*,*,*,*,* 100
|
|
||||||
ETM,*,*,*,*,*,*,*,* 100
|
|
||||||
ETN,*,*,*,*,*,*,*,* 100
|
|
||||||
IC,*,*,*,*,*,*,*,* 110
|
|
||||||
JC,*,*,*,*,*,*,*,* 120
|
|
||||||
JKB,*,*,*,*,*,*,*,* 120
|
|
||||||
JKC,*,*,*,*,*,*,*,* 120
|
|
||||||
JKG,*,*,*,*,*,*,*,* 120
|
|
||||||
JKO,*,*,*,*,*,*,*,* 120
|
|
||||||
JKQ,*,*,*,*,*,*,*,* 120
|
|
||||||
JKS,*,*,*,*,*,*,*,* 120
|
|
||||||
JKV,*,*,*,*,*,*,*,* 120
|
|
||||||
JX,*,*,*,*,*,*,*,* 120
|
|
||||||
MAG,*,*,*,*,*,*,*,* 130
|
|
||||||
MAJ,*,*,*,*,*,*,*,* 131
|
|
||||||
MM,*,*,*,*,*,*,*,* 140
|
|
||||||
NNG,*,*,*,*,*,*,*,* 150
|
|
||||||
NNP,*,*,*,*,*,*,*,* 150
|
|
||||||
NNB,*,*,*,*,*,*,*,* 150
|
|
||||||
NNBC,*,*,*,*,*,*,*,* 150
|
|
||||||
NP,*,*,*,*,*,*,*,* 150
|
|
||||||
NR,*,*,*,*,*,*,*,* 150
|
|
||||||
SF,*,*,*,*,*,*,*,* 160
|
|
||||||
SH,*,*,*,*,*,*,*,* 161
|
|
||||||
SL,*,*,*,*,*,*,*,* 162
|
|
||||||
SN,*,*,*,*,*,*,*,* 163
|
|
||||||
SP,*,*,*,*,*,*,*,* 164
|
|
||||||
SSC,*,*,*,*,*,*,*,* 165
|
|
||||||
SSO,*,*,*,*,*,*,*,* 166
|
|
||||||
SC,*,*,*,*,*,*,*,* 167
|
|
||||||
SY,*,*,*,*,*,*,*,* 168
|
|
||||||
SE,*,*,*,*,*,*,*,* 169
|
|
||||||
VA,*,*,*,*,*,*,*,* 170
|
|
||||||
VCN,*,*,*,*,*,*,*,* 171
|
|
||||||
VCP,*,*,*,*,*,*,*,* 172
|
|
||||||
VV,*,*,*,*,*,*,*,* 173
|
|
||||||
VX,*,*,*,*,*,*,*,* 174
|
|
||||||
XPN,*,*,*,*,*,*,*,* 181
|
|
||||||
XR,*,*,*,*,*,*,*,* 182
|
|
||||||
XSA,*,*,*,*,*,*,*,* 183
|
|
||||||
XSN,*,*,*,*,*,*,*,* 184
|
|
||||||
XSV,*,*,*,*,*,*,*,* 185
|
|
@@ -1,51 +0,0 @@
|
|||||||
# Feature(POS) to Internal State mapping
|
|
||||||
#
|
|
||||||
# Defines the mapping that converts a feature sequence into an internal-state feature sequence.
|
|
||||||
#
|
|
||||||
# The CRF computes its statistics from three kinds of information: unigrams,
|
|
||||||
# left-context bigrams, and right-context bigrams.
|
|
||||||
#
|
|
||||||
# Each section header is followed by mapping rules, one per line.
|
|
||||||
#
|
|
||||||
# match pattern target
|
|
||||||
#
|
|
||||||
# Match patterns may use simple regular expressions.
|
|
||||||
#
|
|
||||||
# * : matches any string
|
|
||||||
# (AB|CD|EF) : matches AB, CD, or EF
|
|
||||||
# AB : matches only the exact string AB
|
|
||||||
#
|
|
||||||
# The target can use the macros $1, $2, $3, ... to refer to the contents of
|
|
||||||
# each element of the feature (the elements written as CSV).
|
|
||||||
#
|
|
||||||
# POS tag,final-consonant presence,surface form,type,first POS,last POS,base form
|
|
||||||
#
|
|
||||||
# Mapping for the unigram internal state
|
|
||||||
[unigram rewrite]
|
|
||||||
*,*,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8,$9
|
|
||||||
|
|
||||||
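# Mapping for the left-context bigram internal state
|
|
||||||
# By default, represent up to the semantic class ($2).
|
|
||||||
# Particles and a few other parts of speech are represented as POS, semantic class, reading ($1,$2,*,$4).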
|
|
||||||
# 하늘/NNG,T,*,*,*,*,* + 은/J,*,은,*,*,*,*
|
|
||||||
#
|
|
||||||
[left rewrite]
|
|
||||||
BOS/EOS,*,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8,BOS/EOS
|
|
||||||
SF,*,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8,BOS/EOS
|
|
||||||
*,*,*,*,Inflect,(JC|JKB|JKC|JKG|JKO|JKQ|JKS|JKV|JX|NNB|NNBC|VCP|ETM|XSN),*,*,* $6,$2,*,$4,*,*,*,*,*
|
|
||||||
*,*,*,*,(Inflect|Preanalysis),*,*,*,* $6,$2,*,*,*,*,*,*,*
|
|
||||||
(JC|JKB|JKC|JKG|JKO|JKQ|JKS|JKV|JX|NNB|NNBC|VCP|ETM|XSN),*,*,*,*,*,*,*,* $1,$2,*,$4,*,*,*,*,*
|
|
||||||
*,*,*,*,*,*,*,*,* $1,$2,*,*,*,*,*,*,*
|
|
||||||
|
|
||||||
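# Mapping for the right-context bigram internal state
|
|
||||||
# By default, represent up to the presence of a final consonant ($3).
|
|
||||||
# Particles and a few other parts of speech are represented as POS, semantic class, final-consonant presence, reading ($1,$2,$3,$4).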
|
|
||||||
# ex) 하늘/NN,T,*,*,*,*,* + 은/J,T,은,*,*,*,*
|
|
||||||
[right rewrite]
|
|
||||||
BOS/EOS,*,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8,BOS/EOS
|
|
||||||
SF,*,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8,BOS/EOS
|
|
||||||
SL,*,*,*,*,*,*,*,* NNG,$2,$3,*,*,*,*,*,*
|
|
||||||
*,*,*,*,Inflect,*,(JC|JKB|JKC|JKG|JKO|JKQ|JKS|JKV|JX|NNB|NNBC|XSN),*,* $7,$2,$3,$4,*,*,*,*,*
|
|
||||||
*,*,*,*,(Inflect|Preanalysis),*,*,*,* $7,$2,$3,*,*,*,*,*,*
|
|
||||||
(JC|JKB|JKC|JKG|JKO|JKQ|JKS|JKV|JX|NNB|NNBC|XSN),*,*,*,*,*,*,*,* $1,$2,$3,$4,*,*,*,*,*
|
|
||||||
*,*,*,*,*,*,*,*,* $1,$2,$3,*,*,*,*,*,*
|
|
File diff suppressed because it is too large
@@ -1 +0,0 @@
|
|||||||
9655d23c3a0900764cbcdb8d8395d0f09ec098ed
|
|
Binary file not shown.
@@ -1,12 +1,61 @@
|
|||||||
from pkg_resources import resource_filename
|
from pkg_resources import resource_filename
|
||||||
import MeCab
|
import MeCab
|
||||||
import unicodedata
|
import unicodedata
|
||||||
|
import os
|
||||||
|
|
||||||
|
|
||||||
|
# MeCab has fixed-sized buffers for many things, including the dictionary path
|
||||||
|
MAX_PATH_LENGTH = 58
|
||||||
|
|
||||||
|
|
||||||
|
def find_mecab_dictionary(names):
|
||||||
|
"""
|
||||||
|
Find a MeCab dictionary with a given name. The dictionary has to be
|
||||||
|
installed separately -- see wordfreq's README for instructions.
|
||||||
|
"""
|
||||||
|
suggested_pkg = names[0]
|
||||||
|
paths = [
|
||||||
|
os.path.expanduser('~/.local/lib/mecab/dic'),
|
||||||
|
'/var/lib/mecab/dic',
|
||||||
|
'/var/local/lib/mecab/dic',
|
||||||
|
'/usr/lib/mecab/dic',
|
||||||
|
'/usr/local/lib/mecab/dic',
|
||||||
|
]
|
||||||
|
full_paths = [os.path.join(path, name) for path in paths for name in names]
|
||||||
|
checked_paths = [path for path in full_paths if len(path) <= MAX_PATH_LENGTH]
|
||||||
|
for path in checked_paths:
|
||||||
|
if os.path.exists(path):
|
||||||
|
return path
|
||||||
|
|
||||||
|
error_lines = [
|
||||||
|
"Couldn't find the MeCab dictionary named %r." % suggested_pkg,
|
||||||
|
"You should download or use your system's package manager to install",
|
||||||
|
"the %r package." % suggested_pkg,
|
||||||
|
"",
|
||||||
|
"We looked in the following locations:"
|
||||||
|
] + ["\t%s" % path for path in checked_paths]
|
||||||
|
|
||||||
|
skipped_paths = [path for path in full_paths if len(path) > MAX_PATH_LENGTH]
|
||||||
|
if skipped_paths:
|
||||||
|
error_lines += [
|
||||||
|
"We had to skip these paths that are too long for MeCab to find:",
|
||||||
|
] + ["\t%s" % path for path in skipped_paths]
|
||||||
|
|
||||||
|
raise OSError('\n'.join(error_lines))
|
||||||
|
|
||||||
|
|
||||||
|
def make_mecab_analyzer(names):
|
||||||
|
"""
|
||||||
|
Get a MeCab analyzer object, given a list of names the dictionary might
|
||||||
|
have.
|
||||||
|
"""
|
||||||
|
return MeCab.Tagger('-d %s' % find_mecab_dictionary(names))
|
||||||
|
|
||||||
|
|
||||||
# Instantiate the MeCab analyzers for each language.
|
# Instantiate the MeCab analyzers for each language.
|
||||||
MECAB_ANALYZERS = {
|
MECAB_ANALYZERS = {
|
||||||
'ja': MeCab.Tagger('-d %s' % resource_filename('wordfreq', 'data/mecab-ja-ipadic')),
|
'ja': make_mecab_analyzer(['mecab-ipadic-utf8', 'ipadic-utf8']),
|
||||||
'ko': MeCab.Tagger('-d %s' % resource_filename('wordfreq', 'data/mecab-ko-dic'))
|
'ko': make_mecab_analyzer(['mecab-ko-dic', 'ko-dic'])
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
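The path handling in `find_mecab_dictionary` can be explored without MeCab installed. The following is a minimal sketch of the same idea, not code from this commit: `candidate_paths` is a hypothetical helper that builds every directory/name combination and separates the paths that fit within the length limit from those that would be skipped.

```python
import os

MAX_PATH_LENGTH = 58  # same limit the commit uses for MeCab's path buffer


def candidate_paths(base_dirs, names, max_length=MAX_PATH_LENGTH):
    """Return (usable, too_long) lists of base_dir/name combinations.

    Hypothetical helper for illustration; it mirrors the filtering in
    find_mecab_dictionary but does not check whether the paths exist.
    """
    full_paths = [os.path.join(base, name) for base in base_dirs for name in names]
    usable = [path for path in full_paths if len(path) <= max_length]
    too_long = [path for path in full_paths if len(path) > max_length]
    return usable, too_long
```

Because the user-specific entry is built with `os.path.expanduser`, a long home directory path can push a candidate over the limit, which is why the error message reports skipped paths separately.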