mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 17:31:41 +00:00
stop including MeCab dictionaries in the package
Former-commit-id: b3dd8479ab
parent
410e8c255b
commit
1519df503c
74
README.md
@ -16,42 +16,70 @@ or by getting the repository and running its setup.py:
|
||||
|
||||
python3 setup.py install
|
||||
|
||||
### Additional CJK setup
|
||||
|
||||
Chinese, Japanese, and Korean have additional external dependencies so that they can be
|
||||
tokenized correctly.
|
||||
## Additional CJK installation
|
||||
|
||||
Chinese, Japanese, and Korean have additional external dependencies so that
|
||||
they can be tokenized correctly. Here we'll explain how to set them up,
|
||||
in increasing order of difficulty.
|
||||
|
||||
|
||||
### Chinese
|
||||
|
||||
To be able to look up word frequencies in Chinese, you need Jieba, a
|
||||
pure-Python Chinese tokenizer:
|
||||
|
||||
pip3 install jieba
|
||||
|
||||
To be able to look up word frequencies in Japanese or Korean, you need to additionally
|
||||
install mecab-python3, which itself depends on libmecab-dev.
|
||||
These commands will install them on Ubuntu:
|
||||
|
||||
sudo apt-get install libmecab-dev
|
||||
pip3 install mecab-python3
|
||||
### Japanese
|
||||
|
||||
If you installed wordfreq from Git, this should be all you need, because the
|
||||
dictionary files are included. Otherwise, read on.
|
||||
We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
|
||||
things need to be installed:
|
||||
|
||||
### Getting dictionary files for the PyPI version
|
||||
* The MeCab development library (called `libmecab-dev` on Ubuntu)
|
||||
* The UTF-8 version of the `ipadic` Japanese dictionary
|
||||
(called `mecab-ipadic-utf8` on Ubuntu)
|
||||
* The `mecab-python3` Python interface
|
||||
|
||||
If you installed wordfreq from PyPI (for example, using pip), and you want to
|
||||
handle Japanese and Korean, you need to get their MeCab dictionary files
|
||||
separately. We would prefer to include them in the package, but PyPI has a size
|
||||
limit.
|
||||
To install these three things on Ubuntu, you can run:
|
||||
|
||||
The Japanese dictionary is called 'mecab-ipadic-utf8', and is available as an Ubuntu
|
||||
package by that name:
|
||||
```sh
|
||||
sudo apt-get install libmecab-dev mecab-ipadic-utf8
|
||||
pip3 install mecab-python3
|
||||
```
|
||||
|
||||
sudo apt-get install mecab-ipadic-utf8
|
||||
If you choose to install `ipadic` from somewhere else or from its source code,
|
||||
be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
|
||||
give you nonsense results.
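
To see why the encoding matters, here is a minimal Python sketch using only the standard library. It does not call MeCab; it only shows that the same Japanese text has different bytes in UTF-8 and EUC-JP, so reading one as the other cannot give sensible output. The sample string is an arbitrary choice for illustration.

```python
# The same text encoded two ways produces different byte sequences.
text = "日本語"
utf8_bytes = text.encode("utf-8")
euc_bytes = text.encode("euc_jp")
print(utf8_bytes != euc_bytes)

# Reading EUC-JP bytes as if they were UTF-8 does not recover the text.
# errors="replace" substitutes U+FFFD for each undecodable byte instead of raising.
garbled = euc_bytes.decode("utf-8", errors="replace")
print(garbled == text)

# Decoding with the matching codec round-trips correctly.
print(euc_bytes.decode("euc_jp") == text)
```

A dictionary compiled as EUC-JP and queried with UTF-8 text runs into the same kind of mismatch.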
|
||||
|
||||
The Korean dictionary does not have an Ubuntu package. One option, besides getting it
|
||||
from wordfreq's Git repository, is to install it from source from:
|
||||
|
||||
https://bitbucket.org/eunjeon/mecab-ko-dic
|
||||
### Korean
|
||||
|
||||
Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
|
||||
Yungho Yu. This dictionary is not available as an Ubuntu package.
|
||||
|
||||
Here's a process you can use to install the Korean dictionary and the other
|
||||
MeCab dependencies:
|
||||
|
||||
```sh
|
||||
sudo apt-get install libmecab-dev mecab-utils
|
||||
pip3 install mecab-python3
|
||||
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
|
||||
tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
|
||||
cd mecab-ko-dic-2.0.1-20150920
|
||||
./autogen.sh
|
||||
make
|
||||
sudo make install
|
||||
```
|
||||
|
||||
If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
|
||||
tokenize those languages, it will raise an error and show you the list of
|
||||
paths it searched.
|
||||
|
||||
Sorry that this is difficult. We tried to just package the data files we need
|
||||
with wordfreq, like we do for Chinese, but PyPI would reject the package for
|
||||
being too large.
|
||||
|
||||
|
||||
## Usage
|
||||
@ -306,10 +334,6 @@ The terms of use of this data are:
|
||||
acknowledgement of Google Books Ngram Viewer as the source, and inclusion
|
||||
of a link to http://books.google.com/ngrams, would be appreciated.
|
||||
|
||||
`wordfreq` uses MeCab, by Taku Kudo, plus Korean data files by Yongwoon Lee and
|
||||
Yungho Yu. The Korean data is under an Apache 2 license, a copy of which
|
||||
appears in wordfreq/data/mecab-ko-dic/COPYING.
|
||||
|
||||
`wordfreq` also contains data derived from the following Creative Commons-licensed
|
||||
sources:
|
||||
|
||||
|
Binary file not shown.
@ -1,29 +0,0 @@
|
||||
;
|
||||
; Configuration file of IPADIC
|
||||
;
|
||||
; $Id: dicrc,v 1.4 2006/04/08 06:41:36 taku-ku Exp $;
|
||||
;
|
||||
cost-factor = 800
|
||||
bos-feature = BOS/EOS,*,*,*,*,*,*,*,*
|
||||
eval-size = 8
|
||||
unk-eval-size = 4
|
||||
config-charset = UTF-8
|
||||
|
||||
; yomi
|
||||
node-format-yomi = %pS%f[7]
|
||||
unk-format-yomi = %M
|
||||
eos-format-yomi = \n
|
||||
|
||||
; simple
|
||||
node-format-simple = %m\t%F-[0,1,2,3]\n
|
||||
eos-format-simple = EOS\n
|
||||
|
||||
; ChaSen
|
||||
node-format-chasen = %m\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
|
||||
unk-format-chasen = %m\t%m\t%m\t%F-[0,1,2,3]\t\t\n
|
||||
eos-format-chasen = EOS\n
|
||||
|
||||
; ChaSen (include spaces)
|
||||
node-format-chasen2 = %M\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
|
||||
unk-format-chasen2 = %M\t%m\t%m\t%F-[0,1,2,3]\t\t\n
|
||||
eos-format-chasen2 = EOS\n
|
Binary file not shown.
@ -1 +0,0 @@
|
||||
c926154d533ccaef1515af6883056d69c34ca239
|
Binary file not shown.
@ -1,201 +0,0 @@
|
||||
Apache License
|
||||
Version 2.0, January 2004
|
||||
http://www.apache.org/licenses/
|
||||
|
||||
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
|
||||
|
||||
1. Definitions.
|
||||
|
||||
"License" shall mean the terms and conditions for use, reproduction,
|
||||
and distribution as defined by Sections 1 through 9 of this document.
|
||||
|
||||
"Licensor" shall mean the copyright owner or entity authorized by
|
||||
the copyright owner that is granting the License.
|
||||
|
||||
"Legal Entity" shall mean the union of the acting entity and all
|
||||
other entities that control, are controlled by, or are under common
|
||||
control with that entity. For the purposes of this definition,
|
||||
"control" means (i) the power, direct or indirect, to cause the
|
||||
direction or management of such entity, whether by contract or
|
||||
otherwise, or (ii) ownership of fifty percent (50%) or more of the
|
||||
outstanding shares, or (iii) beneficial ownership of such entity.
|
||||
|
||||
"You" (or "Your") shall mean an individual or Legal Entity
|
||||
exercising permissions granted by this License.
|
||||
|
||||
"Source" form shall mean the preferred form for making modifications,
|
||||
including but not limited to software source code, documentation
|
||||
source, and configuration files.
|
||||
|
||||
"Object" form shall mean any form resulting from mechanical
|
||||
transformation or translation of a Source form, including but
|
||||
not limited to compiled object code, generated documentation,
|
||||
and conversions to other media types.
|
||||
|
||||
"Work" shall mean the work of authorship, whether in Source or
|
||||
Object form, made available under the License, as indicated by a
|
||||
copyright notice that is included in or attached to the work
|
||||
(an example is provided in the Appendix below).
|
||||
|
||||
"Derivative Works" shall mean any work, whether in Source or Object
|
||||
form, that is based on (or derived from) the Work and for which the
|
||||
editorial revisions, annotations, elaborations, or other modifications
|
||||
represent, as a whole, an original work of authorship. For the purposes
|
||||
of this License, Derivative Works shall not include works that remain
|
||||
separable from, or merely link (or bind by name) to the interfaces of,
|
||||
the Work and Derivative Works thereof.
|
||||
|
||||
"Contribution" shall mean any work of authorship, including
|
||||
the original version of the Work and any modifications or additions
|
||||
to that Work or Derivative Works thereof, that is intentionally
|
||||
submitted to Licensor for inclusion in the Work by the copyright owner
|
||||
or by an individual or Legal Entity authorized to submit on behalf of
|
||||
the copyright owner. For the purposes of this definition, "submitted"
|
||||
means any form of electronic, verbal, or written communication sent
|
||||
to the Licensor or its representatives, including but not limited to
|
||||
communication on electronic mailing lists, source code control systems,
|
||||
and issue tracking systems that are managed by, or on behalf of, the
|
||||
Licensor for the purpose of discussing and improving the Work, but
|
||||
excluding communication that is conspicuously marked or otherwise
|
||||
designated in writing by the copyright owner as "Not a Contribution."
|
||||
|
||||
"Contributor" shall mean Licensor and any individual or Legal Entity
|
||||
on behalf of whom a Contribution has been received by Licensor and
|
||||
subsequently incorporated within the Work.
|
||||
|
||||
2. Grant of Copyright License. Subject to the terms and conditions of
|
||||
this License, each Contributor hereby grants to You a perpetual,
|
||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
||||
copyright license to reproduce, prepare Derivative Works of,
|
||||
publicly display, publicly perform, sublicense, and distribute the
|
||||
Work and such Derivative Works in Source or Object form.
|
||||
|
||||
3. Grant of Patent License. Subject to the terms and conditions of
|
||||
this License, each Contributor hereby grants to You a perpetual,
|
||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
||||
(except as stated in this section) patent license to make, have made,
|
||||
use, offer to sell, sell, import, and otherwise transfer the Work,
|
||||
where such license applies only to those patent claims licensable
|
||||
by such Contributor that are necessarily infringed by their
|
||||
Contribution(s) alone or by combination of their Contribution(s)
|
||||
with the Work to which such Contribution(s) was submitted. If You
|
||||
institute patent litigation against any entity (including a
|
||||
cross-claim or counterclaim in a lawsuit) alleging that the Work
|
||||
or a Contribution incorporated within the Work constitutes direct
|
||||
or contributory patent infringement, then any patent licenses
|
||||
granted to You under this License for that Work shall terminate
|
||||
as of the date such litigation is filed.
|
||||
|
||||
4. Redistribution. You may reproduce and distribute copies of the
|
||||
Work or Derivative Works thereof in any medium, with or without
|
||||
modifications, and in Source or Object form, provided that You
|
||||
meet the following conditions:
|
||||
|
||||
(a) You must give any other recipients of the Work or
|
||||
Derivative Works a copy of this License; and
|
||||
|
||||
(b) You must cause any modified files to carry prominent notices
|
||||
stating that You changed the files; and
|
||||
|
||||
(c) You must retain, in the Source form of any Derivative Works
|
||||
that You distribute, all copyright, patent, trademark, and
|
||||
attribution notices from the Source form of the Work,
|
||||
excluding those notices that do not pertain to any part of
|
||||
the Derivative Works; and
|
||||
|
||||
(d) If the Work includes a "NOTICE" text file as part of its
|
||||
distribution, then any Derivative Works that You distribute must
|
||||
include a readable copy of the attribution notices contained
|
||||
within such NOTICE file, excluding those notices that do not
|
||||
pertain to any part of the Derivative Works, in at least one
|
||||
of the following places: within a NOTICE text file distributed
|
||||
as part of the Derivative Works; within the Source form or
|
||||
documentation, if provided along with the Derivative Works; or,
|
||||
within a display generated by the Derivative Works, if and
|
||||
wherever such third-party notices normally appear. The contents
|
||||
of the NOTICE file are for informational purposes only and
|
||||
do not modify the License. You may add Your own attribution
|
||||
notices within Derivative Works that You distribute, alongside
|
||||
or as an addendum to the NOTICE text from the Work, provided
|
||||
that such additional attribution notices cannot be construed
|
||||
as modifying the License.
|
||||
|
||||
You may add Your own copyright statement to Your modifications and
|
||||
may provide additional or different license terms and conditions
|
||||
for use, reproduction, or distribution of Your modifications, or
|
||||
for any such Derivative Works as a whole, provided Your use,
|
||||
reproduction, and distribution of the Work otherwise complies with
|
||||
the conditions stated in this License.
|
||||
|
||||
5. Submission of Contributions. Unless You explicitly state otherwise,
|
||||
any Contribution intentionally submitted for inclusion in the Work
|
||||
by You to the Licensor shall be under the terms and conditions of
|
||||
this License, without any additional terms or conditions.
|
||||
Notwithstanding the above, nothing herein shall supersede or modify
|
||||
the terms of any separate license agreement you may have executed
|
||||
with Licensor regarding such Contributions.
|
||||
|
||||
6. Trademarks. This License does not grant permission to use the trade
|
||||
names, trademarks, service marks, or product names of the Licensor,
|
||||
except as required for reasonable and customary use in describing the
|
||||
origin of the Work and reproducing the content of the NOTICE file.
|
||||
|
||||
7. Disclaimer of Warranty. Unless required by applicable law or
|
||||
agreed to in writing, Licensor provides the Work (and each
|
||||
Contributor provides its Contributions) on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
||||
implied, including, without limitation, any warranties or conditions
|
||||
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
|
||||
PARTICULAR PURPOSE. You are solely responsible for determining the
|
||||
appropriateness of using or redistributing the Work and assume any
|
||||
risks associated with Your exercise of permissions under this License.
|
||||
|
||||
8. Limitation of Liability. In no event and under no legal theory,
|
||||
whether in tort (including negligence), contract, or otherwise,
|
||||
unless required by applicable law (such as deliberate and grossly
|
||||
negligent acts) or agreed to in writing, shall any Contributor be
|
||||
liable to You for damages, including any direct, indirect, special,
|
||||
incidental, or consequential damages of any character arising as a
|
||||
result of this License or out of the use or inability to use the
|
||||
Work (including but not limited to damages for loss of goodwill,
|
||||
work stoppage, computer failure or malfunction, or any and all
|
||||
other commercial damages or losses), even if such Contributor
|
||||
has been advised of the possibility of such damages.
|
||||
|
||||
9. Accepting Warranty or Additional Liability. While redistributing
|
||||
the Work or Derivative Works thereof, You may choose to offer,
|
||||
and charge a fee for, acceptance of support, warranty, indemnity,
|
||||
or other liability obligations and/or rights consistent with this
|
||||
License. However, in accepting such obligations, You may act only
|
||||
on Your own behalf and on Your sole responsibility, not on behalf
|
||||
of any other Contributor, and only if You agree to indemnify,
|
||||
defend, and hold each Contributor harmless for any liability
|
||||
incurred by, or claims asserted against, such Contributor by reason
|
||||
of your accepting any such warranty or additional liability.
|
||||
|
||||
END OF TERMS AND CONDITIONS
|
||||
|
||||
APPENDIX: How to apply the Apache License to your work.
|
||||
|
||||
To apply the Apache License to your work, attach the following
|
||||
boilerplate notice, with the fields enclosed by brackets "[]"
|
||||
replaced with your own identifying information. (Don't include
|
||||
the brackets!) The text should be enclosed in the appropriate
|
||||
comment syntax for the file format. We also recommend that a
|
||||
file or class name and description of purpose be included on the
|
||||
same "printed page" as the copyright notice for easier
|
||||
identification within third-party archives.
|
||||
|
||||
Copyright [yyyy] [name of copyright owner]
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
Binary file not shown.
@ -1,25 +0,0 @@
|
||||
;
|
||||
; Configuration file of mecab-ko-dic
|
||||
;
|
||||
|
||||
# Scaling factor used when converting to cost values. Values from 700 to 800 work without problems.
cost-factor = 800
# Features for the beginning and end of a sentence, expressed as CSV.
bos-feature = BOS/EOS,*,*,*,*,*,*,*,*
# For known words, specifies how many features, counting from the first, must
# match for an answer to be accepted as correct. Known words generally only need
# to match information such as part of speech and conjugation, so the "reading"
# and "pronunciation" features are ignored.
# Here, three are evaluated.
eval-size = 4
# For unknown words,
# specifies how many features, counting from the first, must match for an answer to be accepted as correct.
unk-eval-size = 2
# Character encoding of the dicrc, char.def, unk.def, and pos-id.def files.
config-charset = UTF-8
# Setting that increases the connection cost of parts of speech that have whitespace on their left.
# Used only by mecab-ko. It has the following format:
# <posid 1>,<posid 1 penalty cost>,<posid 2>,<posid 2 penalty cost>...
#
# Example: 120,6000 => when the part of speech with posid 120 (particle) has whitespace on its left,
# the connection cost is increased by 6000
|
||||
left-space-penalty-factor = 100,3000,120,6000,172,3000,183,3000,184,3000,185,3000,200,3000,210,6000,220,3000,221,3000,222,3000,230,3000
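
The `left-space-penalty-factor` value is a flat list of alternating posid and penalty-cost numbers. As a rough illustration of that format, the following Python sketch pairs them up into a dictionary. The function name `parse_penalties` is hypothetical and is not part of MeCab or wordfreq; only the setting string is taken from the file above.

```python
def parse_penalties(value):
    """Turn 'posid,cost,posid,cost,...' into a {posid: cost} dictionary."""
    numbers = [int(item) for item in value.split(",")]
    if len(numbers) % 2 != 0:
        raise ValueError("expected an even number of comma-separated integers")
    # Even positions are posids, odd positions are their penalty costs.
    return dict(zip(numbers[0::2], numbers[1::2]))


# The value of left-space-penalty-factor from the dicrc file above.
setting = (
    "100,3000,120,6000,172,3000,183,3000,184,3000,185,3000,"
    "200,3000,210,6000,220,3000,221,3000,222,3000,230,3000"
)
penalties = parse_penalties(setting)
print(penalties[120])
```

Looking up posid 120 matches the example in the file's comment, where that part of speech receives a penalty of 6000.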
|
File diff suppressed because it is too large
@ -1 +0,0 @@
|
||||
2dbb57fe707d7dddd2392526aad7cbac77378bb3
|
@ -1 +0,0 @@
|
||||
58619494b4f81190218b76d9d2090607830e51ec
|
@ -1,66 +0,0 @@
|
||||
UNKNOWN,*,*,*,*,*,*,*,* 0
|
||||
*,*,*,*,Compound,*,*,*,* 1
|
||||
*,*,*,*,Inflect,EC,*,*,* 200
|
||||
*,*,*,*,Inflect,EF,*,*,* 200
|
||||
*,*,*,*,Inflect,EP,*,*,* 200
|
||||
*,*,*,*,Inflect,ETM,*,*,* 200
|
||||
*,*,*,*,Inflect,ETN,*,*,* 200
|
||||
*,*,*,*,Inflect,JC,*,*,* 210
|
||||
*,*,*,*,Inflect,JKB,*,*,* 210
|
||||
*,*,*,*,Inflect,JKC,*,*,* 210
|
||||
*,*,*,*,Inflect,JKG,*,*,* 210
|
||||
*,*,*,*,Inflect,JKO,*,*,* 210
|
||||
*,*,*,*,Inflect,JKQ,*,*,* 210
|
||||
*,*,*,*,Inflect,JKS,*,*,* 210
|
||||
*,*,*,*,Inflect,JKV,*,*,* 210
|
||||
*,*,*,*,Inflect,JX,*,*,* 210
|
||||
*,*,*,*,Inflect,XSA,*,*,* 220
|
||||
*,*,*,*,Inflect,XSN,*,*,* 221
|
||||
*,*,*,*,Inflect,XSV,*,*,* 222
|
||||
*,*,*,*,Inflect,VCP,*,*,* 230
|
||||
*,*,*,*,Inflect,*,*,*,* 2
|
||||
*,*,*,*,Preanalysis,*,*,*,* 3
|
||||
EC,*,*,*,*,*,*,*,* 100
|
||||
EF,*,*,*,*,*,*,*,* 100
|
||||
EP,*,*,*,*,*,*,*,* 100
|
||||
ETM,*,*,*,*,*,*,*,* 100
|
||||
ETN,*,*,*,*,*,*,*,* 100
|
||||
IC,*,*,*,*,*,*,*,* 110
|
||||
JC,*,*,*,*,*,*,*,* 120
|
||||
JKB,*,*,*,*,*,*,*,* 120
|
||||
JKC,*,*,*,*,*,*,*,* 120
|
||||
JKG,*,*,*,*,*,*,*,* 120
|
||||
JKO,*,*,*,*,*,*,*,* 120
|
||||
JKQ,*,*,*,*,*,*,*,* 120
|
||||
JKS,*,*,*,*,*,*,*,* 120
|
||||
JKV,*,*,*,*,*,*,*,* 120
|
||||
JX,*,*,*,*,*,*,*,* 120
|
||||
MAG,*,*,*,*,*,*,*,* 130
|
||||
MAJ,*,*,*,*,*,*,*,* 131
|
||||
MM,*,*,*,*,*,*,*,* 140
|
||||
NNG,*,*,*,*,*,*,*,* 150
|
||||
NNP,*,*,*,*,*,*,*,* 150
|
||||
NNB,*,*,*,*,*,*,*,* 150
|
||||
NNBC,*,*,*,*,*,*,*,* 150
|
||||
NP,*,*,*,*,*,*,*,* 150
|
||||
NR,*,*,*,*,*,*,*,* 150
|
||||
SF,*,*,*,*,*,*,*,* 160
|
||||
SH,*,*,*,*,*,*,*,* 161
|
||||
SL,*,*,*,*,*,*,*,* 162
|
||||
SN,*,*,*,*,*,*,*,* 163
|
||||
SP,*,*,*,*,*,*,*,* 164
|
||||
SSC,*,*,*,*,*,*,*,* 165
|
||||
SSO,*,*,*,*,*,*,*,* 166
|
||||
SC,*,*,*,*,*,*,*,* 167
|
||||
SY,*,*,*,*,*,*,*,* 168
|
||||
SE,*,*,*,*,*,*,*,* 169
|
||||
VA,*,*,*,*,*,*,*,* 170
|
||||
VCN,*,*,*,*,*,*,*,* 171
|
||||
VCP,*,*,*,*,*,*,*,* 172
|
||||
VV,*,*,*,*,*,*,*,* 173
|
||||
VX,*,*,*,*,*,*,*,* 174
|
||||
XPN,*,*,*,*,*,*,*,* 181
|
||||
XR,*,*,*,*,*,*,*,* 182
|
||||
XSA,*,*,*,*,*,*,*,* 183
|
||||
XSN,*,*,*,*,*,*,*,* 184
|
||||
XSV,*,*,*,*,*,*,*,* 185
|
@ -1,51 +0,0 @@
|
||||
# Feature(POS) to Internal State mapping
|
||||
#
|
||||
# Defines the mapping that converts a feature sequence into an internal-state feature sequence.
#
# The CRF computes its statistics from three kinds of information:
# unigrams, left-context bigrams, and right-context bigrams.
#
# Each section header is followed by mapping rules, one per line.
#
# match-pattern target
#
# Simple regular expressions can be used in the match pattern.
#
# * : matches any string
# (AB|CD|EF) : matches AB or CD or EF
# AB : matches only the exact string AB
#
# The target can use the macros $1, $2, $3, ... to refer to the contents of
# each element of the feature (the elements separated as CSV).
#
# POS tag, final-consonant presence, written form, type, first POS, last POS, base form
#
# Mapping for the unigram internal state
[unigram rewrite]
*,*,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8,$9

# Mapping for the left-context bigram internal state
# By default, features up to the semantic class ($2) are represented.
# For particles and a few other parts of speech, the POS, semantic class, and reading ($1,$2,*,$4) are represented.
# 하늘/NNG,T,*,*,*,*,* + 은/J,*,은,*,*,*,*
#
[left rewrite]
BOS/EOS,*,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8,BOS/EOS
SF,*,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8,BOS/EOS
*,*,*,*,Inflect,(JC|JKB|JKC|JKG|JKO|JKQ|JKS|JKV|JX|NNB|NNBC|VCP|ETM|XSN),*,*,* $6,$2,*,$4,*,*,*,*,*
*,*,*,*,(Inflect|Preanalysis),*,*,*,* $6,$2,*,*,*,*,*,*,*
(JC|JKB|JKC|JKG|JKO|JKQ|JKS|JKV|JX|NNB|NNBC|VCP|ETM|XSN),*,*,*,*,*,*,*,* $1,$2,*,$4,*,*,*,*,*
*,*,*,*,*,*,*,*,* $1,$2,*,*,*,*,*,*,*

# Mapping for the right-context bigram internal state
# By default, features up to the final-consonant flag ($3) are represented.
# For particles and a few other parts of speech, the POS, semantic class, final-consonant flag, and reading ($1,$2,$3,$4) are represented.
# e.g. 하늘/NN,T,*,*,*,*,* + 은/J,T,은,*,*,*,*
|
||||
[right rewrite]
|
||||
BOS/EOS,*,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8,BOS/EOS
|
||||
SF,*,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8,BOS/EOS
|
||||
SL,*,*,*,*,*,*,*,* NNG,$2,$3,*,*,*,*,*,*
|
||||
*,*,*,*,Inflect,*,(JC|JKB|JKC|JKG|JKO|JKQ|JKS|JKV|JX|NNB|NNBC|XSN),*,* $7,$2,$3,$4,*,*,*,*,*
|
||||
*,*,*,*,(Inflect|Preanalysis),*,*,*,* $7,$2,$3,*,*,*,*,*,*
|
||||
(JC|JKB|JKC|JKG|JKO|JKQ|JKS|JKV|JX|NNB|NNBC|XSN),*,*,*,*,*,*,*,* $1,$2,$3,$4,*,*,*,*,*
|
||||
*,*,*,*,*,*,*,*,* $1,$2,$3,*,*,*,*,*,*
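
The rule format described in this file's comments (a CSV match pattern with `*`, `(A|B)` alternatives, or literals, followed by a target that may use `$N` macros) can be sketched in Python as follows. This is an illustration only, not MeCab's implementation: the function names are hypothetical, the input feature strings are made up, and it assumes that the first matching rule wins.

```python
def field_matches(pattern, value):
    """Match one CSV field against '*', '(A|B|C)', or a literal string."""
    if pattern == "*":
        return True
    if pattern.startswith("(") and pattern.endswith(")"):
        return value in pattern[1:-1].split("|")
    return pattern == value


def rewrite(rules, feature):
    """Apply the first rule whose pattern matches `feature` (a CSV string)."""
    fields = feature.split(",")
    for pattern, target in rules:
        patterns = pattern.split(",")
        if len(patterns) != len(fields):
            continue
        if all(field_matches(p, f) for p, f in zip(patterns, fields)):
            # $N in the target refers to the N-th field of the input (1-based).
            return ",".join(
                fields[int(item[1:]) - 1] if item.startswith("$") else item
                for item in target.split(",")
            )
    return None  # no rule matched


# Three of the [left rewrite] rules from the file above.
LEFT_RULES = [
    ("BOS/EOS,*,*,*,*,*,*,*,*", "$1,$2,$3,$4,$5,$6,$7,$8,BOS/EOS"),
    ("(JC|JKB|JKC|JKG|JKO|JKQ|JKS|JKV|JX|NNB|NNBC|VCP|ETM|XSN),*,*,*,*,*,*,*,*",
     "$1,$2,*,$4,*,*,*,*,*"),
    ("*,*,*,*,*,*,*,*,*", "$1,$2,*,*,*,*,*,*,*"),
]

# Hypothetical nine-field feature strings, used only to exercise the rules.
print(rewrite(LEFT_RULES, "JX,*,T,은,*,*,*,*,*"))
print(rewrite(LEFT_RULES, "NNG,*,T,하늘,*,*,*,*,*"))
```

In this sketch the particle keeps its fourth field in the left context, while the noun falls through to the catch-all rule and keeps only its first two fields.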
|
File diff suppressed because it is too large
@ -1 +0,0 @@
|
||||
9655d23c3a0900764cbcdb8d8395d0f09ec098ed
|
Binary file not shown.
@ -4,6 +4,10 @@ import unicodedata
|
||||
import os
|
||||
|
||||
|
||||
# MeCab has fixed-sized buffers for many things, including the dictionary path
|
||||
MAX_PATH_LENGTH = 58
|
||||
|
||||
|
||||
def find_mecab_dictionary(names):
|
||||
"""
|
||||
Find a MeCab dictionary with a given name. The dictionary might come as
|
||||
@ -15,7 +19,6 @@ def find_mecab_dictionary(names):
|
||||
"""
|
||||
suggested_pkg = names[0]
|
||||
paths = [
|
||||
resource_filename('wordfreq', 'data'),
|
||||
os.path.expanduser('~/.local/lib/mecab/dic'),
|
||||
'/var/lib/mecab/dic',
|
||||
'/var/local/lib/mecab/dic',
|
||||
@ -24,7 +27,7 @@ def find_mecab_dictionary(names):
|
||||
]
|
||||
full_paths = [os.path.join(path, name) for path in paths for name in names]
|
||||
for path in full_paths:
|
||||
if os.path.exists(path):
|
||||
if os.path.exists(path) and len(path) <= MAX_PATH_LENGTH:
|
||||
return path
|
||||
|
||||
error_lines = [
|
||||
@ -33,7 +36,13 @@ def find_mecab_dictionary(names):
|
||||
"the %r package." % suggested_pkg,
|
||||
"",
|
||||
"We looked in the following locations:"
|
||||
] + ["\t%s" % path for path in full_paths]
|
||||
] + ["\t%s" % path for path in full_paths if len(path) <= MAX_PATH_LENGTH]
|
||||
|
||||
skipped_paths = [path for path in full_paths if len(path) > MAX_PATH_LENGTH]
|
||||
if skipped_paths:
|
||||
error_lines += [
|
||||
"We had to skip these paths that are too long for MeCab to find:",
|
||||
] + ["\t%s" % path for path in skipped_paths]
|
||||
|
||||
raise OSError('\n'.join(error_lines))
|
||||
|
||||
@ -50,7 +59,7 @@ def make_mecab_analyzer(names):
|
||||
# Instantiate the MeCab analyzers for each language.
|
||||
MECAB_ANALYZERS = {
|
||||
'ja': make_mecab_analyzer(['mecab-ipadic-utf8', 'mecab-ja-ipadic', 'ipadic-utf8']),
|
||||
'ko': make_mecab_analyzer(['mecab-ko-dic', 'ko-dic'])
|
||||
'ko': make_mecab_analyzer(['mecab-ko-dic', 'ko-dic', 'mecab-ko-dic-2.0.1-20150920'])
|
||||
}
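
The lookup logic changed in this diff can be summarized in a standalone sketch: build every directory and name combination, ignore candidates longer than the limit, return the first one that exists, and otherwise raise an error listing both the searched and the skipped paths. This is a simplified illustration and not the code from wordfreq; the function name `find_dictionary`, its `max_length` parameter, and the temporary directory used in the demonstration are assumptions made for the example.

```python
import os
import tempfile

MAX_PATH_LENGTH = 58  # the limit introduced in the diff above


def find_dictionary(names, search_dirs, max_length=MAX_PATH_LENGTH):
    """Return the first existing candidate path that is short enough."""
    candidates = [os.path.join(d, name) for d in search_dirs for name in names]
    usable = [p for p in candidates if len(p) <= max_length]
    skipped = [p for p in candidates if len(p) > max_length]

    for path in usable:
        if os.path.exists(path):
            return path

    lines = ["Couldn't find any of these dictionaries: %r" % (names,),
             "We looked in the following locations:"]
    lines += ["\t%s" % p for p in usable]
    if skipped:
        lines.append("We had to skip these paths that are too long:")
        lines += ["\t%s" % p for p in skipped]
    raise OSError("\n".join(lines))


# Demonstration with a throwaway directory standing in for a dictionary location.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "ko-dic"))
    # Temporary paths can be long, so the limit is widened for this demo only.
    found = find_dictionary(["mecab-ko-dic", "ko-dic"], [tmp],
                            max_length=len(tmp) + 20)
    print(os.path.basename(found))
```

Passing the limit as a parameter keeps the sketch testable; the diff itself uses the module-level constant directly.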
|
||||
|
||||
|
||||
|