yomapi

submit train init

1 +# ML-based Spacing Corrector
2 +This model is an improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing"), using FastText instead of Word2Vec.
3 +
4 +## Performances
5 +| Model | Test Accuracy(%) | Encoding Time Cost |
6 +| :------------: | :------------: | :------------: |
7 +| TrainKoSpacing | 96.6147 | 02m 23s|
8 +| 자모분해 FastText | 98.9915 | 08h 20m 11s |
9 +| 2 Stage FastText | 99.0888 | 03m 23s |
10 +
11 +## Data
12 +#### Corpus
13 +
14 +We mainly focus on the National Institute of Korean Language 모두의 말뭉치 corpus and the National Information Society Agency AI-Hub data. However, due to licensing restrictions, we cannot distribute these datasets ourselves. You should be able to obtain them through the links below.
15 +[National Institute of Korean Language 모두의 말뭉치](https://corpus.korean.go.kr/).
16 +[National Information Society Agency AI-Hub](https://aihub.or.kr/aihub-data/natural-language/about "National Information Society Agency AI-Hub")
17 +
18 +#### Data format
19 +A bzip2-compressed text file with one sentence per line.
20 +
21 +```
22 +~/KoSpacing/data$ bzcat train.txt.bz2 | head
23 +엠마누엘 웅가로 / 의상서 실내 장식품으로… 디자인 세계 넓혀
24 +프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.
25 +웅가로는 침실과 식당, 욕실에서 사용하는 갖가지 직물제품을 디자인해 최근 파리의 갤러리 라파예트백화점에서 '색의 컬렉션'이라는 이름으로 전시회를 열었다.
26 +```
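A quick way to check that a corpus file matches this format can be sketched in Python (the file name `sample.txt.bz2` is illustrative, not a project file):

```python
import bz2

# Write and read back a tiny corpus in the expected format:
# a bz2-compressed text file with one sentence per line.
sentences = [
    '프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.',
    '웅가로는 침실과 식당, 욕실에서 사용하는 갖가지 직물제품을 디자인해 전시회를 열었다.',
]
with bz2.open('sample.txt.bz2', 'wt', encoding='utf-8') as f:
    for s in sentences:
        f.write(s + '\n')

# Reading it back yields the original sentences, one per line.
with bz2.open('sample.txt.bz2', 'rt', encoding='utf-8') as f:
    loaded = [line.rstrip('\n') for line in f]

print(len(loaded))
```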
27 +
28 +
29 +## Architecture
30 +
31 +### Model
32 +![kosapcing_img](img/kosapcing_img.png)
33 +
34 +### Word Embedding
35 +#### 자모분해
36 +To capture shape similarity between Korean characters, we use a 자모분해 (jamo decomposition) FastText word embedding.
37 +ex)
38 +자연어처리
39 +ㅈ ㅏ – ㅇ ㅕ ㄴ ㅇ ㅓ – ㅊ ㅓ – ㄹ ㅣ –
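The decomposition above can be sketched with plain Unicode arithmetic (an illustrative snippet, not the repository's actual preprocessing; `-` marks an empty final consonant):

```python
# Jamo decomposition of precomposed Hangul syllables (U+AC00..U+D7A3).
# '-' is used as a filler when a syllable has no final consonant (종성).
CHO = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")            # 19 initial consonants
JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")      # 21 vowels
JONG = list("-ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals, index 0 = none

def decompose(text):
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:              # precomposed Hangul syllable
            out.append(CHO[code // 588])   # 588 = 21 * 28
            out.append(JUNG[(code % 588) // 28])
            out.append(JONG[code % 28])
        else:                              # pass non-Hangul characters through
            out.append(ch)
    return ''.join(out)

print(decompose('자연어처리'))  # ㅈㅏ-ㅇㅕㄴㅇㅓ-ㅊㅓ-ㄹㅣ-
```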
40 +
41 +#### 2 stage FastText
42 +Because 자모분해 is expensive to compute, the 자모분해 FastText embedding is used only for out-of-vocabulary characters.
43 +![2-stage-FastText_img](img/2-stage-FastText.png)
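The two-stage lookup can be sketched as follows. All names here are illustrative stand-ins (the real project loads gensim FastText models from `model/` and `jamo_model/`); the point is only the control flow: a cheap in-vocabulary lookup first, with the jamo-FastText fallback reserved for OOV characters.

```python
import numpy as np

# Stage 1: toy character-level vector table (stand-in for the word FastText model).
word_wv = {'한': np.array([1.0, 0.0])}

# Stage 2: stand-in for the jamo-decomposed FastText model, which in the
# real pipeline builds a vector from the character's jamo n-grams.
def jamo_fasttext_vector(ch):
    return np.array([0.0, 1.0])

def embed_char(ch):
    vec = word_wv.get(ch)
    if vec is not None:                    # in-vocabulary: cheap direct lookup
        return vec
    return jamo_fasttext_vector(ch)        # OOV: fall back to the jamo model

print(embed_char('한'))   # stage-1 hit
print(embed_char('℃'))    # OOV, handled by the stage-2 fallback
```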
44 +
45 +### Thresholding
46 +The middle part of the output probability distribution is nearly uniform, which makes a naive fixed threshold unreliable.
47 +![probability_distribution_of_output_vector](img/probability_distribution_of_output_vector.png)
48 +
49 +We apply a log transform and use the second derivative to locate the threshold.
50 +Result:
51 +![Thresholding_result](img/Thresholding_result.png)
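One plausible reading of this elbow heuristic, sketched below with a hypothetical `pick_threshold` helper (the actual implementation may differ): sort the predicted probabilities, log-transform them, and take the point of sharpest curvature in the discrete second derivative.

```python
import numpy as np

def pick_threshold(probs):
    # Sort the output probabilities, flatten the plateau with a log
    # transform, then find the elbow via a discrete second derivative.
    p = np.sort(probs)
    logp = np.log(p + 1e-8)        # small epsilon guards log(0)
    d2 = np.diff(logp, n=2)        # discrete second derivative
    elbow = np.argmax(d2) + 1      # index of sharpest curvature change
    return p[elbow]

# Toy distribution with confident low/high clusters and an ambiguous middle.
probs = np.array([0.01, 0.02, 0.03, 0.45, 0.50, 0.55, 0.97, 0.98, 0.99])
print(pick_threshold(probs))  # elbow at the edge of the low cluster: 0.03
```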
52 +
53 +
54 +
55 +## How to Run
56 +
57 +
58 +### Installation
59 +
60 +- For training, a GPU is strongly recommended. CPU is supported, but training may be extremely slow.
61 +- Only Python 3.7 and above are supported.
62 +### Requirements
63 +
64 +- Python (>= 3.7)
65 +- MXNet (>= 1.6.0)
66 +- tqdm (>= 4.19.5)
67 +- Pandas (>= 0.22.0)
68 +- Gensim (>= 3.8.1)
69 +- GluonNLP (>= 0.9.1)
70 +- soynlp (>= 0.0.493)
71 +
72 +### Dependencies
73 +
74 +```bash
75 +pip install -r requirements.txt
76 +```
77 +
78 +### Training
79 +
80 +```bash
81 +python train.py --train --train-samp-ratio 1.0 --num-epoch 50 --train_data data/train.txt.bz2 --test_data data/test.txt.bz2 --outputs train_log_to --model_type kospacing --model-file fasttext
82 +```
83 +
84 +### Evaluation
85 +
86 +```bash
87 +python train.py --model-params model/kospacing.params --model_type kospacing
88 +sent > 중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
89 +중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
90 +spaced sent[0.12sec/sent] > 중국은 2018년 평창동계올림픽의 반환점에 이르기까지 아직 노골드 행진이다.
91 +```
92 +
93 +### Directory
94 +Directory guide for the embedding model files.
95 + Bold entries are required.
96 +
97 +- model
98 + - **fasttext**
99 + - fasttext_vis
100 + - **fasttext.trainables.vectors_ngrams_lockf.npy**
101 + - **fasttext.wv.vectors_ngrams.npy**
102 + - **kospacing_wv.np**
103 + - **w2idx.dic**
104 +
105 +- jamo_model
106 + - **fasttext**
107 + - fasttext_vis
108 + - **fasttext.trainables.vectors_ngrams_lockf.npy**
109 + - **fasttext.wv.vectors_ngrams.npy**
110 + - **kospacing_wv.np**
111 + - **w2idx.dic**
112 +
113 +### Reference
114 +TrainKoSpacing: https://github.com/haven-jeon/TrainKoSpacing
115 +딥 러닝을 이용한 자연어 처리 입문: https://wikidocs.net/book/2155
116 +
......
1 + Apache License
2 + Version 2.0, January 2004
3 + http://www.apache.org/licenses/
4 +
5 + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 +
7 + 1. Definitions.
8 +
9 + "License" shall mean the terms and conditions for use, reproduction,
10 + and distribution as defined by Sections 1 through 9 of this document.
11 +
12 + "Licensor" shall mean the copyright owner or entity authorized by
13 + the copyright owner that is granting the License.
14 +
15 + "Legal Entity" shall mean the union of the acting entity and all
16 + other entities that control, are controlled by, or are under common
17 + control with that entity. For the purposes of this definition,
18 + "control" means (i) the power, direct or indirect, to cause the
19 + direction or management of such entity, whether by contract or
20 + otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 + outstanding shares, or (iii) beneficial ownership of such entity.
22 +
23 + "You" (or "Your") shall mean an individual or Legal Entity
24 + exercising permissions granted by this License.
25 +
26 + "Source" form shall mean the preferred form for making modifications,
27 + including but not limited to software source code, documentation
28 + source, and configuration files.
29 +
30 + "Object" form shall mean any form resulting from mechanical
31 + transformation or translation of a Source form, including but
32 + not limited to compiled object code, generated documentation,
33 + and conversions to other media types.
34 +
35 + "Work" shall mean the work of authorship, whether in Source or
36 + Object form, made available under the License, as indicated by a
37 + copyright notice that is included in or attached to the work
38 + (an example is provided in the Appendix below).
39 +
40 + "Derivative Works" shall mean any work, whether in Source or Object
41 + form, that is based on (or derived from) the Work and for which the
42 + editorial revisions, annotations, elaborations, or other modifications
43 + represent, as a whole, an original work of authorship. For the purposes
44 + of this License, Derivative Works shall not include works that remain
45 + separable from, or merely link (or bind by name) to the interfaces of,
46 + the Work and Derivative Works thereof.
47 +
48 + "Contribution" shall mean any work of authorship, including
49 + the original version of the Work and any modifications or additions
50 + to that Work or Derivative Works thereof, that is intentionally
51 + submitted to Licensor for inclusion in the Work by the copyright owner
52 + or by an individual or Legal Entity authorized to submit on behalf of
53 + the copyright owner. For the purposes of this definition, "submitted"
54 + means any form of electronic, verbal, or written communication sent
55 + to the Licensor or its representatives, including but not limited to
56 + communication on electronic mailing lists, source code control systems,
57 + and issue tracking systems that are managed by, or on behalf of, the
58 + Licensor for the purpose of discussing and improving the Work, but
59 + excluding communication that is conspicuously marked or otherwise
60 + designated in writing by the copyright owner as "Not a Contribution."
61 +
62 + "Contributor" shall mean Licensor and any individual or Legal Entity
63 + on behalf of whom a Contribution has been received by Licensor and
64 + subsequently incorporated within the Work.
65 +
66 + 2. Grant of Copyright License. Subject to the terms and conditions of
67 + this License, each Contributor hereby grants to You a perpetual,
68 + worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 + copyright license to reproduce, prepare Derivative Works of,
70 + publicly display, publicly perform, sublicense, and distribute the
71 + Work and such Derivative Works in Source or Object form.
72 +
73 + 3. Grant of Patent License. Subject to the terms and conditions of
74 + this License, each Contributor hereby grants to You a perpetual,
75 + worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 + (except as stated in this section) patent license to make, have made,
77 + use, offer to sell, sell, import, and otherwise transfer the Work,
78 + where such license applies only to those patent claims licensable
79 + by such Contributor that are necessarily infringed by their
80 + Contribution(s) alone or by combination of their Contribution(s)
81 + with the Work to which such Contribution(s) was submitted. If You
82 + institute patent litigation against any entity (including a
83 + cross-claim or counterclaim in a lawsuit) alleging that the Work
84 + or a Contribution incorporated within the Work constitutes direct
85 + or contributory patent infringement, then any patent licenses
86 + granted to You under this License for that Work shall terminate
87 + as of the date such litigation is filed.
88 +
89 + 4. Redistribution. You may reproduce and distribute copies of the
90 + Work or Derivative Works thereof in any medium, with or without
91 + modifications, and in Source or Object form, provided that You
92 + meet the following conditions:
93 +
94 + (a) You must give any other recipients of the Work or
95 + Derivative Works a copy of this License; and
96 +
97 + (b) You must cause any modified files to carry prominent notices
98 + stating that You changed the files; and
99 +
100 + (c) You must retain, in the Source form of any Derivative Works
101 + that You distribute, all copyright, patent, trademark, and
102 + attribution notices from the Source form of the Work,
103 + excluding those notices that do not pertain to any part of
104 + the Derivative Works; and
105 +
106 + (d) If the Work includes a "NOTICE" text file as part of its
107 + distribution, then any Derivative Works that You distribute must
108 + include a readable copy of the attribution notices contained
109 + within such NOTICE file, excluding those notices that do not
110 + pertain to any part of the Derivative Works, in at least one
111 + of the following places: within a NOTICE text file distributed
112 + as part of the Derivative Works; within the Source form or
113 + documentation, if provided along with the Derivative Works; or,
114 + within a display generated by the Derivative Works, if and
115 + wherever such third-party notices normally appear. The contents
116 + of the NOTICE file are for informational purposes only and
117 + do not modify the License. You may add Your own attribution
118 + notices within Derivative Works that You distribute, alongside
119 + or as an addendum to the NOTICE text from the Work, provided
120 + that such additional attribution notices cannot be construed
121 + as modifying the License.
122 +
123 + You may add Your own copyright statement to Your modifications and
124 + may provide additional or different license terms and conditions
125 + for use, reproduction, or distribution of Your modifications, or
126 + for any such Derivative Works as a whole, provided Your use,
127 + reproduction, and distribution of the Work otherwise complies with
128 + the conditions stated in this License.
129 +
130 + 5. Submission of Contributions. Unless You explicitly state otherwise,
131 + any Contribution intentionally submitted for inclusion in the Work
132 + by You to the Licensor shall be under the terms and conditions of
133 + this License, without any additional terms or conditions.
134 + Notwithstanding the above, nothing herein shall supersede or modify
135 + the terms of any separate license agreement you may have executed
136 + with Licensor regarding such Contributions.
137 +
138 + 6. Trademarks. This License does not grant permission to use the trade
139 + names, trademarks, service marks, or product names of the Licensor,
140 + except as required for reasonable and customary use in describing the
141 + origin of the Work and reproducing the content of the NOTICE file.
142 +
143 + 7. Disclaimer of Warranty. Unless required by applicable law or
144 + agreed to in writing, Licensor provides the Work (and each
145 + Contributor provides its Contributions) on an "AS IS" BASIS,
146 + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 + implied, including, without limitation, any warranties or conditions
148 + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 + PARTICULAR PURPOSE. You are solely responsible for determining the
150 + appropriateness of using or redistributing the Work and assume any
151 + risks associated with Your exercise of permissions under this License.
152 +
153 + 8. Limitation of Liability. In no event and under no legal theory,
154 + whether in tort (including negligence), contract, or otherwise,
155 + unless required by applicable law (such as deliberate and grossly
156 + negligent acts) or agreed to in writing, shall any Contributor be
157 + liable to You for damages, including any direct, indirect, special,
158 + incidental, or consequential damages of any character arising as a
159 + result of this License or out of the use or inability to use the
160 + Work (including but not limited to damages for loss of goodwill,
161 + work stoppage, computer failure or malfunction, or any and all
162 + other commercial damages or losses), even if such Contributor
163 + has been advised of the possibility of such damages.
164 +
165 + 9. Accepting Warranty or Additional Liability. While redistributing
166 + the Work or Derivative Works thereof, You may choose to offer,
167 + and charge a fee for, acceptance of support, warranty, indemnity,
168 + or other liability obligations and/or rights consistent with this
169 + License. However, in accepting such obligations, You may act only
170 + on Your own behalf and on Your sole responsibility, not on behalf
171 + of any other Contributor, and only if You agree to indemnify,
172 + defend, and hold each Contributor harmless for any liability
173 + incurred by, or claims asserted against, such Contributor by reason
174 + of your accepting any such warranty or additional liability.
175 +
176 + END OF TERMS AND CONDITIONS
177 +
178 + APPENDIX: How to apply the Apache License to your work.
179 +
180 + To apply the Apache License to your work, attach the following
181 + boilerplate notice, with the fields enclosed by brackets "[]"
182 + replaced with your own identifying information. (Don't include
183 + the brackets!) The text should be enclosed in the appropriate
184 + comment syntax for the file format. We also recommend that a
185 + file or class name and description of purpose be included on the
186 + same "printed page" as the copyright notice for easier
187 + identification within third-party archives.
188 +
189 + Copyright [yyyy] [name of copyright owner]
190 +
191 + Licensed under the Apache License, Version 2.0 (the "License");
192 + you may not use this file except in compliance with the License.
193 + You may obtain a copy of the License at
194 +
195 + http://www.apache.org/licenses/LICENSE-2.0
196 +
197 + Unless required by applicable law or agreed to in writing, software
198 + distributed under the License is distributed on an "AS IS" BASIS,
199 + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 + See the License for the specific language governing permissions and
201 + limitations under the License.
1 +# coding=utf-8
2 +# Copyright 2020 Heewon Jeon. All rights reserved.
3 +#
4 +# Licensed under the Apache License, Version 2.0 (the "License");
5 +# you may not use this file except in compliance with the License.
6 +# You may obtain a copy of the License at
7 +#
8 +# http://www.apache.org/licenses/LICENSE-2.0
9 +#
10 +# Unless required by applicable law or agreed to in writing, software
11 +# distributed under the License is distributed on an "AS IS" BASIS,
12 +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 +# See the License for the specific language governing permissions and
14 +# limitations under the License.
15 +
16 +import argparse
17 +from utils.embedding_maker import create_embeddings
18 +
19 +
20 +parser = argparse.ArgumentParser(description='Korean Autospacing Embedding Maker')
21 +
22 +parser.add_argument('--num-iters', type=int, default=5,
23 + help='number of iterations to train (default: 5)')
24 +
25 +parser.add_argument('--min-count', type=int, default=100,
26 + help='minimum word count to filter (default: 100)')
27 +
28 +parser.add_argument('--embedding-size', type=int, default=100,
29 + help='embedding dimension size (default: 100)')
30 +
31 +parser.add_argument('--num-worker', type=int, default=16,
32 + help='number of worker threads (default: 16)')
33 +
34 +parser.add_argument('--window-size', type=int, default=8,
35 + help='skip-gram window size (default: 8)')
36 +
37 +parser.add_argument('--corpus_dir', type=str, default='data',
38 + help='training resource dir')
39 +
40 +parser.add_argument('--train', action='store_true', default=True,
41 + help='do embedding training (default: True)')
42 +
43 +parser.add_argument('--model-file', type=str, default='kospacing_wv.mdl',
44 + help='output object from Word2Vec() (default: kospacing_wv.mdl)')
45 +
46 +parser.add_argument('--numpy-wv', type=str, default='kospacing_wv.np',
47 + help='numpy object file path from Word2Vec() (default: kospacing_wv.np)')
48 +
49 +parser.add_argument('--w2idx', type=str, default='w2idx.dic',
50 + help='item to index json dictionary (default: w2idx.dic)')
51 +
52 +parser.add_argument('--model-dir', type=str, default='model',
53 + help='dir to save models (default: model)')
54 +
55 +opt = parser.parse_args()
56 +
57 +if opt.train:
58 + create_embeddings(opt.corpus_dir, opt.model_dir + '/' +
59 + opt.model_file, opt.model_dir + '/' + opt.numpy_wv,
60 + opt.model_dir + '/' + opt.w2idx, min_count=opt.min_count,
61 + iter=opt.num_iters,
62 + size=opt.embedding_size, workers=opt.num_worker, window=opt.window_size)
File mode changed
File mode changed
1 +absl-py==0.11.0
2 +astunparse==1.6.3
3 +cachetools==4.2.1
4 +certifi==2020.12.5
5 +chardet==4.0.0
6 +click==7.1.2
7 +cmake==3.18.4.post1
8 +Cython==0.29.21
9 +Flask==1.1.2
10 +Flask-Cors==3.0.9
11 +flatbuffers==1.12
12 +gast==0.3.3
13 +gensim==3.8.3
14 +gluonnlp==0.10.0
15 +google-auth==1.26.1
16 +google-auth-oauthlib==0.4.2
17 +google-pasta==0.2.0
18 +graphviz==0.8.4
19 +grpcio==1.32.0
20 +h5py==2.10.0
21 +idna==2.10
22 +importlib-metadata==3.4.0
23 +itsdangerous==1.1.0
24 +Jinja2==2.11.2
25 +joblib==1.0.1
26 +Keras==2.4.3
27 +Keras-Preprocessing==1.1.2
28 +Markdown==3.3.3
29 +MarkupSafe==1.1.1
30 +mxnet-cu101==1.7.0
31 +mxnet-cu101mkl==1.6.0.post0
32 +mxnet-mkl==1.6.0
33 +numpy==1.19.5
34 +oauthlib==3.1.0
35 +opt-einsum==3.3.0
36 +packaging==20.9
37 +pandas==1.2.2
38 +protobuf==3.14.0
39 +psutil==5.8.0
40 +pyasn1==0.4.8
41 +pyasn1-modules==0.2.8
42 +pyparsing==2.4.7
43 +python-dateutil==2.8.1
44 +pytz==2020.5
45 +PyYAML==5.3.1
46 +requests==2.25.1
47 +requests-oauthlib==1.3.0
48 +rsa==4.6
49 +scikit-learn==0.24.1
50 +scipy==1.6.0
51 +six==1.15.0
52 +smart-open==4.0.1
53 +soynlp==0.0.493
54 +tensorboard==2.4.0
55 +tensorboard-plugin-wit==1.7.0
56 +tensorflow==2.4.1
57 +tensorflow-estimator==2.4.0
58 +termcolor==1.1.0
59 +threadpoolctl==2.1.0
60 +tqdm==4.56.0
61 +typing-extensions==3.7.4.3
62 +urllib3==1.26.3
63 +Werkzeug==1.0.1
64 +wrapt==1.12.1
65 +zipp==3.4.0
1 +# coding=utf-8
2 +# Copyright 2020 Heewon Jeon. All rights reserved.
3 +#
4 +# Licensed under the Apache License, Version 2.0 (the "License");
5 +# you may not use this file except in compliance with the License.
6 +# You may obtain a copy of the License at
7 +#
8 +# http://www.apache.org/licenses/LICENSE-2.0
9 +#
10 +# Unless required by applicable law or agreed to in writing, software
11 +# distributed under the License is distributed on an "AS IS" BASIS,
12 +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 +# See the License for the specific language governing permissions and
14 +# limitations under the License.
15 +
16 +import argparse
17 +import bz2
18 +import logging
19 +import re
20 +import time
21 +from functools import lru_cache
22 +from timeit import default_timer as timer
23 +
24 +import gluonnlp as nlp
25 +import mxnet as mx
26 +import mxnet.autograd as autograd
27 +import numpy as np
28 +from mxnet import gluon
29 +from mxnet.gluon import nn, rnn
30 +from tqdm import tqdm
31 +import csv
32 +
33 +from utils.embedding_maker import (encoding_and_padding, load_embedding,
34 + load_vocab)
35 +
36 +logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s] %(message)s")
37 +logger = logging.getLogger()
38 +
39 +parser = argparse.ArgumentParser(description='Korean Autospacing Trainer')
40 +parser.add_argument('--num-epoch',
41 + type=int,
42 + default=5,
43 + help='number of iterations to train (default: 5)')
44 +
45 +parser.add_argument('--n-hidden',
46 + type=int,
47 + default=200,
48 + help='GRU hidden size (default: 200)')
49 +
50 +parser.add_argument('--max-seq-len',
51 + type=int,
52 + default=200,
53 + help='max sentence length on input (default: 200)')
54 +
55 +parser.add_argument('--num-gpus',
56 + type=int,
57 + default=1,
58 + help='number of gpus (default: 1)')
59 +
60 +parser.add_argument('--vocab-file',
61 + type=str,
62 + default='model/w2idx.dic',
63 + help='vocabulary file (default: model/w2idx.dic)')
64 +
65 +parser.add_argument(
66 + '--embedding-file',
67 + type=str,
68 + default='model/kospacing_wv.np',
69 + help='embedding matrix file (default: model/kospacing_wv.np)')
70 +
71 +parser.add_argument('--train',
72 + action='store_true',
73 + default=False,
74 + help='do training (default: False)')
75 +
76 +parser.add_argument(
77 + '--model-file',
78 + type=str,
79 + default='kospacing_wv.mdl',
80 + help='output object from Word2Vec() (default: kospacing_wv.mdl)')
81 +
82 +parser.add_argument('--train-samp-ratio',
83 + type=float,
84 + default=0.50,
85 + help='random train sample ratio (default: 0.50)')
86 +
87 +parser.add_argument('--model-prefix',
88 + type=str,
89 + default='kospacing',
90 + help='prefix of output model file (default: kospacing)')
91 +
92 +parser.add_argument('--model-params',
93 + type=str,
94 + default='kospacing_0.params',
95 + help='model params file (default: kospacing_0.params)')
96 +
97 +parser.add_argument('--test',
98 + action='store_true',
99 + default=False,
100 + help='eval train set (default: False)')
101 +
102 +parser.add_argument('--batch_size',
103 + type=int,
104 + default=100,
105 + help='train batch size')
106 +
107 +parser.add_argument('--test_batch_size',
108 + type=int,
109 + default=100,
110 + help='test batch size')
111 +
112 +parser.add_argument('--n_workers',
113 + type=int,
114 + default=10,
115 + help='number of dataloader workers')
116 +
117 +parser.add_argument('--train_data',
118 + type=str,
119 + default='data/UCorpus_spacing_train.txt.bz2',
120 + help='bzip2-compressed train data')
121 +
122 +parser.add_argument('--test_data',
123 + type=str,
124 + default='data/UCorpus_spacing_test.txt.bz2',
125 + help='bzip2-compressed test data')
126 +
127 +parser.add_argument('--model_type',
128 + type=str,
129 + default='kospacing',
130 + help='kospacing or kospacing2')
131 +
132 +parser.add_argument('--outputs',
133 + type=str,
134 + default='outputs',
135 + help='directory to save log and model params')
136 +
137 +opt = parser.parse_args()
138 +
139 +nlp.utils.mkdir(opt.outputs)
140 +
141 +fileHandler = logging.FileHandler(opt.outputs + '/' + 'log.log')
142 +fileHandler.setFormatter(logFormatter)
143 +logger.addHandler(fileHandler)
144 +
145 +consoleHandler = logging.StreamHandler()
146 +consoleHandler.setFormatter(logFormatter)
147 +logger.addHandler(consoleHandler)
148 +
149 +logger.setLevel(logging.DEBUG)
150 +logger.info(opt)
151 +
152 +GPU_COUNT = opt.num_gpus
153 +ctx = [mx.gpu(i) for i in range(GPU_COUNT)]
154 +
155 +
156 +# Model class
157 +class korean_autospacing_base(gluon.HybridBlock):
158 + def __init__(self, n_hidden, vocab_size, embed_dim, max_seq_length,
159 + **kwargs):
160 + super(korean_autospacing_base, self).__init__(**kwargs)
161 + # input sequence length
162 + self.in_seq_len = max_seq_length
163 + # output sequence length
164 + self.out_seq_len = max_seq_length
165 + # number of GRU hidden units
166 + self.n_hidden = n_hidden
167 + # number of unique characters
168 + self.vocab_size = vocab_size
169 + # max_seq_length
170 + self.max_seq_length = max_seq_length
171 + # embedding dimension
172 + self.embed_dim = embed_dim
173 +
174 + with self.name_scope():
175 + self.embedding = nn.Embedding(input_dim=self.vocab_size,
176 + output_dim=self.embed_dim)
177 +
178 + self.conv_unigram = nn.Conv2D(channels=128,
179 + kernel_size=(1, self.embed_dim))
180 +
181 + self.conv_bigram = nn.Conv2D(channels=256,
182 + kernel_size=(2, self.embed_dim),
183 + padding=(1, 0))
184 +
185 + self.conv_trigram = nn.Conv2D(channels=128,
186 + kernel_size=(3, self.embed_dim),
187 + padding=(1, 0))
188 +
189 + self.conv_forthgram = nn.Conv2D(channels=64,
190 + kernel_size=(4, self.embed_dim),
191 + padding=(2, 0))
192 +
193 + self.conv_fifthgram = nn.Conv2D(channels=32,
194 + kernel_size=(5, self.embed_dim),
195 + padding=(2, 0))
196 +
197 + self.bi_gru = rnn.GRU(hidden_size=self.n_hidden, layout='NTC', bidirectional=True)
198 + self.dense_sh = nn.Dense(100, activation='relu', flatten=False)
199 + self.dense = nn.Dense(1, activation='sigmoid', flatten=False)
200 +
201 + def hybrid_forward(self, F, inputs):
202 + embed = self.embedding(inputs)
203 + embed = F.expand_dims(embed, axis=1)
204 + unigram = self.conv_unigram(embed)
205 + bigram = self.conv_bigram(embed)
206 + trigram = self.conv_trigram(embed)
207 + forthgram = self.conv_forthgram(embed)
208 + fifthgram = self.conv_fifthgram(embed)
209 +
210 + grams = F.concat(unigram,
211 + F.slice_axis(bigram,
212 + axis=2,
213 + begin=0,
214 + end=self.max_seq_length),
215 + trigram,
216 + F.slice_axis(forthgram,
217 + axis=2,
218 + begin=0,
219 + end=self.max_seq_length),
220 + F.slice_axis(fifthgram,
221 + axis=2,
222 + begin=0,
223 + end=self.max_seq_length),
224 + dim=1)
225 +
226 + grams = F.transpose(grams, (0, 2, 3, 1))
227 + grams = F.reshape(grams, (-1, self.max_seq_length, -3))
228 + grams = self.bi_gru(grams)
229 + fc1 = self.dense_sh(grams)
230 + return (self.dense(fc1))
231 +
232 +
233 +# https://raw.githubusercontent.com/haven-jeon/Train_KoSpacing/master/img/kosapcing_img.png
234 +class korean_autospacing2(gluon.HybridBlock):
235 + def __init__(self, n_hidden, vocab_size, embed_dim, max_seq_length,
236 + **kwargs):
237 + super(korean_autospacing2, self).__init__(**kwargs)
238 + # input sequence length
239 + self.in_seq_len = max_seq_length
240 + # output sequence length
241 + self.out_seq_len = max_seq_length
242 + # number of GRU hidden units
243 + self.n_hidden = n_hidden
244 + # number of unique characters
245 + self.vocab_size = vocab_size
246 + # max_seq_length
247 + self.max_seq_length = max_seq_length
248 + # embedding dimension
249 + self.embed_dim = embed_dim
250 +
251 + with self.name_scope():
252 + self.embedding = nn.Embedding(input_dim=self.vocab_size,
253 + output_dim=self.embed_dim)
254 +
255 + self.conv_unigram = nn.Conv2D(channels=128,
256 + kernel_size=(1, self.embed_dim))
257 +
258 + self.conv_bigram = nn.Conv2D(channels=128,
259 + kernel_size=(2, self.embed_dim),
260 + padding=(1, 0))
261 +
262 + self.conv_trigram = nn.Conv2D(channels=64,
263 + kernel_size=(3, self.embed_dim),
264 + padding=(2, 0))
265 +
266 + self.conv_forthgram = nn.Conv2D(channels=32,
267 + kernel_size=(4, self.embed_dim),
268 + padding=(3, 0))
269 +
270 + self.conv_fifthgram = nn.Conv2D(channels=16,
271 + kernel_size=(5, self.embed_dim),
272 + padding=(4, 0))
273 + # for reverse convolution
274 + self.conv_rev_bigram = nn.Conv2D(channels=128,
275 + kernel_size=(2, self.embed_dim),
276 + padding=(1, 0))
277 +
278 + self.conv_rev_trigram = nn.Conv2D(channels=64,
279 + kernel_size=(3, self.embed_dim),
280 + padding=(2, 0))
281 +
282 + self.conv_rev_forthgram = nn.Conv2D(channels=32,
283 + kernel_size=(4,
284 + self.embed_dim),
285 + padding=(3, 0))
286 +
287 + self.conv_rev_fifthgram = nn.Conv2D(channels=16,
288 + kernel_size=(5,
289 + self.embed_dim),
290 + padding=(4, 0))
291 + self.bi_gru = rnn.GRU(hidden_size=self.n_hidden, layout='NTC', bidirectional=True)
292 + # self.bi_gru = rnn.BidirectionalCell(
293 + # rnn.GRUCell(hidden_size=self.n_hidden),
294 + # rnn.GRUCell(hidden_size=self.n_hidden))
295 + self.dense_sh = nn.Dense(100, activation='relu', flatten=False)
296 + self.dense = nn.Dense(1, activation='sigmoid', flatten=False)
297 +
298 + def hybrid_forward(self, F, inputs):
299 + embed = self.embedding(inputs)
300 + embed = F.expand_dims(embed, axis=1)
301 + rev_embed = embed.flip(axis=2)
302 +
303 + unigram = self.conv_unigram(embed)
304 + bigram = self.conv_bigram(embed)
305 + trigram = self.conv_trigram(embed)
306 + forthgram = self.conv_forthgram(embed)
307 + fifthgram = self.conv_fifthgram(embed)
308 +
309 + rev_bigram = self.conv_rev_bigram(rev_embed).flip(axis=2)
310 + rev_trigram = self.conv_rev_trigram(rev_embed).flip(axis=2)
311 + rev_forthgram = self.conv_rev_forthgram(rev_embed).flip(axis=2)
312 + rev_fifthgram = self.conv_rev_fifthgram(rev_embed).flip(axis=2)
313 +
314 + grams = F.concat(unigram,
315 + F.slice_axis(bigram,
316 + axis=2,
317 + begin=0,
318 + end=self.max_seq_length),
319 + F.slice_axis(rev_bigram,
320 + axis=2,
321 + begin=0,
322 + end=self.max_seq_length),
323 + F.slice_axis(trigram,
324 + axis=2,
325 + begin=0,
326 + end=self.max_seq_length),
327 + F.slice_axis(rev_trigram,
328 + axis=2,
329 + begin=0,
330 + end=self.max_seq_length),
331 + F.slice_axis(forthgram,
332 + axis=2,
333 + begin=0,
334 + end=self.max_seq_length),
335 + F.slice_axis(rev_forthgram,
336 + axis=2,
337 + begin=0,
338 + end=self.max_seq_length),
339 + F.slice_axis(fifthgram,
340 + axis=2,
341 + begin=0,
342 + end=self.max_seq_length),
343 + F.slice_axis(rev_fifthgram,
344 + axis=2,
345 + begin=0,
346 + end=self.max_seq_length),
347 + dim=1)
348 +
349 + grams = F.transpose(grams, (0, 2, 3, 1))
350 + grams = F.reshape(grams, (-1, self.max_seq_length, -3))
351 + grams = self.bi_gru(grams)
352 + fc1 = self.dense_sh(grams)
353 + return (self.dense(fc1))
354 +
355 +
356 +def y_encoding(n_grams, maxlen=200):
357 + # encode the label matrix from the input sentences
358 + init_mat = np.zeros(shape=(len(n_grams), maxlen), dtype=np.int8)
359 + for i in range(len(n_grams)):
360 + init_mat[i, np.cumsum([len(j) for j in n_grams[i]]) - 1] = 1
361 + return init_mat
362 +
363 +
364 +def split_train_set(x_train, p=0.98):
365 + """
366 + > split_train_set(pd.DataFrame({'a':[1,2,3,4,None], 'b':[5,6,7,8,9]}))
367 + (array([0, 4, 3]), [1, 2])
368 + """
370 + train_idx = np.random.choice(range(x_train.shape[0]),
371 + int(x_train.shape[0] * p),
372 + replace=False)
373 + set_tr_idx = set(train_idx)
374 + test_index = [i for i in range(x_train.shape[0]) if i not in set_tr_idx]
375 + return ((train_idx, np.array(test_index)))
376 +
377 +
378 +def get_generator(x, y, batch_size):
379 + tr_set = gluon.data.ArrayDataset(x, y.astype('float32'))
380 + tr_data_iterator = gluon.data.DataLoader(tr_set,
381 + batch_size=batch_size,
382 + shuffle=True,
383 + num_workers=opt.n_workers)
384 + return (tr_data_iterator)
385 +
386 +
387 +def pick_model(model_nm, n_hidden, vocab_size, embed_dim, max_seq_length):
388 + if model_nm.lower() == 'kospacing':
389 + model = korean_autospacing_base(n_hidden=n_hidden,
390 + vocab_size=vocab_size,
391 + embed_dim=embed_dim,
392 + max_seq_length=max_seq_length)
393 + elif model_nm.lower() == 'kospacing2':
394 + model = korean_autospacing2(n_hidden=n_hidden,
395 + vocab_size=vocab_size,
396 + embed_dim=embed_dim,
397 + max_seq_length=max_seq_length)
398 + else:
399 + raise ValueError('unknown model_type: ' + model_nm)
400 + return model
401 +
402 +
403 +def model_init(n_hidden, vocab_size, embed_dim, max_seq_length, ctx):
404 + # create the model instance and define the trainer and loss
405 + # n_hidden, vocab_size, embed_dim, max_seq_length
406 + model = pick_model(opt.model_type, n_hidden, vocab_size, embed_dim, max_seq_length)
407 + model.collect_params().initialize(mx.init.Xavier(), ctx=ctx)
408 + model.embedding.weight.set_data(weights)
409 + model.hybridize(static_alloc=True)
410 + # freeze the embedding weights
411 + model.embedding.collect_params().setattr('grad_req', 'null')
412 + trainer = gluon.Trainer(model.collect_params(), 'rmsprop')
413 + loss = gluon.loss.SigmoidBinaryCrossEntropyLoss(from_sigmoid=True)
414 + loss.hybridize(static_alloc=True)
415 + return (model, loss, trainer)
416 +
417 +
418 +def evaluate_accuracy(data_iterator, net, pad_idx, ctx, n=5000):
419 +    # iterate over each sequence up to its true length and measure accuracy
420 +    # (not optimized)
421 + acc = mx.metric.Accuracy(axis=0)
422 + num_of_test = 0
423 + for i, (data, label) in enumerate(data_iterator):
424 + data = data.as_in_context(ctx)
425 + label = label.as_in_context(ctx)
426 + # get sentence length
427 + data_np = data.asnumpy()
428 + lengths = np.argmax(np.where(data_np == pad_idx, np.ones_like(data_np),
429 + np.zeros_like(data_np)),
430 + axis=1)
431 + output = net(data)
432 + pred_label = output.squeeze(axis=2) > 0.5
433 +
434 +        for j in range(data.shape[0]):
435 +            acc.update(preds=pred_label[j, :lengths[j]],
436 +                       labels=label[j, :lengths[j]])
437 +        num_of_test += data.shape[0]
438 +        if num_of_test > n:
439 +            break
440 + return acc.get()[1]
441 +
442 +
443 +def train(epochs,
444 + tr_data_iterator,
445 + te_data_iterator,
446 + va_data_iterator,
447 + model,
448 + loss,
449 + trainer,
450 + pad_idx,
451 + ctx,
452 + mdl_desc="spacing_model",
453 + decay=False):
454 +    # training loop
455 + tot_test_acc = []
456 + tot_train_loss = []
457 + for e in range(epochs):
458 + tic = time.time()
459 + # Decay learning rate.
460 + if e > 1 and decay:
461 + trainer.set_learning_rate(trainer.learning_rate * 0.7)
462 + train_loss = []
463 + iter_tqdm = tqdm(tr_data_iterator, 'Batches')
464 + for i, (x_data, y_data) in enumerate(iter_tqdm):
465 + x_data_l = gluon.utils.split_and_load(x_data,
466 + ctx,
467 + even_split=False)
468 + y_data_l = gluon.utils.split_and_load(y_data,
469 + ctx,
470 + even_split=False)
471 +
472 + with autograd.record():
473 + losses = [
474 + loss(model(x), y) for x, y in zip(x_data_l, y_data_l)
475 + ]
476 + for l in losses:
477 + l.backward()
478 + trainer.step(x_data.shape[0])
479 + curr_loss = np.mean([mx.nd.mean(l).asscalar() for l in losses])
480 + train_loss.append(curr_loss)
481 + iter_tqdm.set_description("loss {}".format(curr_loss))
482 + mx.nd.waitall()
483 +
484 +        # calculate test and validation accuracy
485 + test_acc = evaluate_accuracy(
486 + te_data_iterator,
487 + model,
488 + pad_idx,
489 + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
490 + valid_acc = evaluate_accuracy(
491 + va_data_iterator,
492 + model,
493 + pad_idx,
494 + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
495 + logger.info('[Epoch %d] time cost: %f' % (e, time.time() - tic))
496 + logger.info("[Epoch %d] Train Loss: %f, Test acc : %f Valid acc : %f" %
497 + (e, np.mean(train_loss), test_acc, valid_acc))
498 + tot_test_acc.append(test_acc)
499 + tot_train_loss.append(np.mean(train_loss))
500 + model.save_parameters(opt.outputs + '/' + "{}_{}.params".format(mdl_desc, e))
501 + return (tot_test_acc, tot_train_loss)
502 +
503 +
504 +def pre_processing(sentences):
505 +    # spaces become '^'
506 +    char_list = [li.strip().replace(' ', '^') for li in sentences]
507 +    # '«' marks the sentence start, '»' marks the sentence end
508 +    char_list = ["«" + li + "»" for li in char_list]
509 +    return char_list
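The marker convention implemented above, in miniature (spaces become `'^'`, `'«'`/`'»'` delimit the sentence):

```python
sent = '아버지가 방에 들어가신다.'
tagged = '«' + sent.strip().replace(' ', '^') + '»'
print(tagged)  # → «아버지가^방에^들어가신다.»
```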
513 +
514 +
515 +def make_input_data(inputs,
516 + train_ratio,
517 + sampling,
518 + make_lag_set=False,
519 + batch_size=200):
520 + with bz2.open(inputs, 'rt') as f:
521 + line_list = [i.strip() for i in f.readlines() if i.strip() != '']
522 + logger.info('complete loading train file!')
523 +
524 + # 아버지가 방에 들어가신다. -> '«아버지가^방에^들어가신다.»'
525 + processed_seq = pre_processing(line_list)
526 + logger.info(processed_seq[0])
527 + # n percent random sample
528 + logger.info('random sampling on training set!')
529 + samp_idx = np.random.choice(range(len(processed_seq)),
530 + int(len(processed_seq) * sampling),
531 + replace=False)
532 + processed_seq_samp = [processed_seq[i] for i in samp_idx]
533 + sp_sents = [i.split('^') for i in processed_seq_samp]
534 +
535 + sp_sents = list(filter(lambda x: len(x) >= 8, sp_sents))
536 +
537 +    # build training samples: windows of up to 8 eojeol, shifted by one eojeol
538 +    if make_lag_set:
539 + n_gram = [[k, v, z, a, c, d, e, f]
540 + for sent in sp_sents for k, v, z, a, c, d, e, f in zip(
541 + sent, sent[1:], sent[2:], sent[3:], sent[4:], sent[5:],
542 + sent[6:], sent[7:])]
543 + else:
544 + n_gram = sp_sents
545 +    # keep only samples of at most max_seq_len characters
546 + n_gram = [i for i in n_gram if len("^".join(i)) <= opt.max_seq_len]
547 +    # encode the target labels (y)
548 + n_gram_y = y_encoding(n_gram, opt.max_seq_len)
549 + logger.info(n_gram[0])
550 + logger.info(n_gram_y[0])
551 +    # load the vocab file
552 + w2idx, _ = load_vocab(opt.vocab_file)
553 +
554 +    # strip spaces and encode characters as indices to build the training set
555 +    logger.info('index encoding!')
556 + ngram_coding_seq = encoding_and_padding(
557 + word2idx_dic=w2idx,
558 + sequences=[''.join(gram) for gram in n_gram],
559 + maxlen=opt.max_seq_len,
560 + padding='post',
561 + truncating='post')
562 + logger.info(ngram_coding_seq[0])
563 + if train_ratio < 1:
564 +        # split into train and test sets
565 + tr_idx, te_idx = split_train_set(ngram_coding_seq, train_ratio)
566 +
567 + y_train = n_gram_y[tr_idx, ]
568 + x_train = ngram_coding_seq[tr_idx, ]
569 +
570 + y_test = n_gram_y[te_idx, ]
571 + x_test = ngram_coding_seq[te_idx, ]
572 +
573 + # train generator
574 + train_generator = get_generator(x_train, y_train, batch_size)
575 + valid_generator = get_generator(x_test, y_test, 500)
576 + return (train_generator, valid_generator)
577 + else:
578 + train_generator = get_generator(ngram_coding_seq, n_gram_y, batch_size)
579 + return (train_generator)
580 +
581 +
582 +if opt.train:
583 +    # load the vocab file
584 + w2idx, idx2w = load_vocab(opt.vocab_file)
585 +    # load the embedding file
586 + weights = load_embedding(opt.embedding_file)
587 + vocab_size = weights.shape[0]
588 + embed_dim = weights.shape[1]
589 +
590 + train_generator, valid_generator = make_input_data(
591 + opt.train_data,
592 + train_ratio=0.95,
593 + sampling=opt.train_samp_ratio,
594 + make_lag_set=True,
595 + batch_size=opt.batch_size)
596 +
597 + test_generator = make_input_data(opt.test_data,
598 + sampling=1,
599 + train_ratio=1,
600 + make_lag_set=True,
601 + batch_size=opt.test_batch_size)
602 +
603 + model, loss, trainer = model_init(n_hidden=opt.n_hidden,
604 + vocab_size=vocab_size,
605 + embed_dim=embed_dim,
606 + max_seq_length=opt.max_seq_len,
607 + ctx=ctx)
608 + logger.info('start training!')
609 + train(epochs=opt.num_epoch,
610 + tr_data_iterator=train_generator,
611 + te_data_iterator=test_generator,
612 + va_data_iterator=valid_generator,
613 + model=model,
614 + loss=loss,
615 + trainer=trainer,
616 + pad_idx=w2idx['__PAD__'],
617 + ctx=ctx,
618 + mdl_desc=opt.model_prefix)
619 +
620 +
621 +class pred_spacing:
622 + def __init__(self, model, w2idx):
623 + self.model = model
624 + self.w2idx = w2idx
625 + self.pattern = re.compile(r'\s+')
626 +
627 + @lru_cache(maxsize=None)
628 + def get_spaced_sent(self, raw_sent):
629 + raw_sent_ = "«" + raw_sent + "»"
630 + raw_sent_ = raw_sent_.replace(' ', '^')
631 + sents_in = [
632 + raw_sent_,
633 + ]
634 + mat_in = encoding_and_padding(word2idx_dic=self.w2idx,
635 + sequences=sents_in,
636 + maxlen=opt.max_seq_len,
637 + padding='post',
638 + truncating='post')
639 + mat_in = mx.nd.array(mat_in, ctx=mx.cpu(0))
640 + results = self.model(mat_in)
641 + mat_set = results[0, ]
642 +
643 +        # log transform stretches the low/middle probability range
644 +        r = 255
645 +        c = 1 / np.log(1 + r)
646 +        log_scaled = c * mx.nd.log(1 + r * mat_set[:len(raw_sent_)])
647 +        # discrete second derivative; a local peak has d_2 < 0
648 +        d_2 = [1]
649 +        for i in range(1, len(raw_sent_)):
650 +            d_2.append(mat_set[i - 1] - (2 * mat_set[i]) + mat_set[i + 1])
651 +        preds = np.array(
652 +            ['1' if log_scaled[i] > 0.01 and d_2[i] < 0 else '0'
653 +             for i in range(len(raw_sent_))])
659 + return self.make_pred_sents(raw_sent_, preds)
660 +
661 + def make_pred_sents(self, x_sents, y_pred):
662 + res_sent = []
663 + for i, j in zip(x_sents, y_pred):
664 + if j == '1':
665 + res_sent.append(i)
666 + res_sent.append(' ')
667 + else:
668 + res_sent.append(i)
669 + subs = re.sub(self.pattern, ' ', ''.join(res_sent).replace('^', ' '))
670 + subs = subs.replace('«', '')
671 + subs = subs.replace('»', '')
672 + return subs
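The thresholding in `get_spaced_sent` can be sketched on toy probabilities: the log transform stretches the low/middle range of the output distribution, and the discrete second derivative picks out local peaks. A self-contained numpy sketch (the probabilities below are made up, not model output):

```python
import numpy as np

probs = np.array([0.05, 0.2, 0.9, 0.3, 0.1])

# log transform: maps [0, 1] onto [0, 1] while stretching small values
r = 255
c = 1 / np.log(1 + r)
log_scaled = c * np.log(1 + r * probs)

# discrete second derivative; a local peak has d2 < 0
d2 = [1] + [probs[i - 1] - 2 * probs[i] + probs[i + 1]
            for i in range(1, len(probs) - 1)]

preds = ['1' if i < len(d2) and log_scaled[i] > 0.01 and d2[i] < 0 else '0'
         for i in range(len(probs))]
print(preds)  # → ['0', '0', '1', '0', '0'] — only the peak at index 2
```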
673 +
674 +if not opt.train and not opt.test:
675 +    # load the vocab file
676 + w2idx, idx2w = load_vocab(opt.vocab_file)
677 +    # load the embedding file
678 + weights = load_embedding(opt.embedding_file)
679 + vocab_size = weights.shape[0]
680 + embed_dim = weights.shape[1]
681 + model = pick_model(opt.model_type, opt.n_hidden, vocab_size, embed_dim, opt.max_seq_len)
682 +
683 + # model.collect_params().initialize(mx.init.Xavier(), ctx=mx.cpu(0))
684 + # model.embedding.weight.set_data(weights)
685 + model.load_parameters(opt.model_params, ctx=mx.cpu(0))
686 + predictor = pred_spacing(model, w2idx)
687 +
702 +    while True:
703 + sent = input("sent > ")
704 + print(sent)
705 + start = timer()
706 + spaced = predictor.get_spaced_sent(sent)
707 + end = timer()
708 + print("spaced sent[{:03.2f}sec/sent] > {}".format(end - start, spaced))
709 +
710 +if not opt.train and opt.test:
711 + logger.info("calculate accuracy!")
712 +    # load the vocab file
713 + w2idx, idx2w = load_vocab(opt.vocab_file)
714 +    # load the embedding file
715 + weights = load_embedding(opt.embedding_file)
716 + vocab_size = weights.shape[0]
717 + embed_dim = weights.shape[1]
718 +
719 + model = pick_model(opt.model_type, opt.n_hidden, vocab_size, embed_dim, opt.max_seq_len)
720 +
721 + # model.initialize(ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
722 + model.load_parameters(opt.model_params,
723 + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
724 + valid_generator = make_input_data(opt.test_data,
725 + sampling=1,
726 + train_ratio=1,
727 + make_lag_set=True,
728 + batch_size=100)
729 + valid_acc = evaluate_accuracy(
730 + valid_generator,
731 + model,
732 + w2idx['__PAD__'],
733 + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0),
734 + n=30000)
735 + logger.info('valid accuracy : {}'.format(valid_acc))
1 +__all__ = [
2 + 'create_embeddings', 'load_embedding', 'load_vocab',
3 + 'encoding_and_padding', 'get_embedding_model'
4 +]
5 +
6 +import bz2
7 +import json
8 +import os
9 +
10 +import numpy as np
11 +import pkg_resources
12 +from gensim.models import FastText
13 +
14 +from utils.spacing_utils import sent_to_spacing_chars
15 +from tqdm import tqdm
16 +from utils.jamo_utils import jamo_sentence, jamo_to_word
17 +
18 +def pad_sequences(sequences,
19 + maxlen=None,
20 + dtype='int32',
21 + padding='pre',
22 + truncating='pre',
23 + value=0.):
24 +
25 + if not hasattr(sequences, '__len__'):
26 + raise ValueError('`sequences` must be iterable.')
27 + lengths = []
28 + for x in sequences:
29 + if not hasattr(x, '__len__'):
30 + raise ValueError('`sequences` must be a list of iterables. '
31 + 'Found non-iterable: ' + str(x))
32 + lengths.append(len(x))
33 +
34 + num_samples = len(sequences)
35 + if maxlen is None:
36 + maxlen = np.max(lengths)
37 +
38 + # take the sample shape from the first non empty sequence
39 + # checking for consistency in the main loop below.
40 + sample_shape = tuple()
41 + for s in sequences:
42 + if len(s) > 0:
43 + sample_shape = np.asarray(s).shape[1:]
44 + break
45 +
46 + x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
47 + for idx, s in enumerate(sequences):
48 + if not len(s):
49 + continue # empty list/array was found
50 + if truncating == 'pre':
51 + trunc = s[-maxlen:]
52 + elif truncating == 'post':
53 + trunc = s[:maxlen]
54 + else:
55 + raise ValueError('Truncating type "%s" not understood' %
56 + truncating)
57 +
58 + # check `trunc` has expected shape
59 + trunc = np.asarray(trunc, dtype=dtype)
60 + if trunc.shape[1:] != sample_shape:
61 + raise ValueError(
62 + 'Shape of sample %s of sequence at position %s is different from expected shape %s'
63 + % (trunc.shape[1:], idx, sample_shape))
64 +
65 + if padding == 'post':
66 + x[idx, :len(trunc)] = trunc
67 + elif padding == 'pre':
68 + x[idx, -len(trunc):] = trunc
69 + else:
70 + raise ValueError('Padding type "%s" not understood' % padding)
71 + return x
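The `padding='post', truncating='post'` combination used throughout this repo behaves as below; a minimal re-derivation with numpy (not calling `pad_sequences` itself):

```python
import numpy as np

seqs = [[5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8]]
maxlen, value = 5, 0

out = np.full((len(seqs), maxlen), value, dtype='int32')
for idx, s in enumerate(seqs):
    trunc = s[:maxlen]             # truncating='post': keep the head
    out[idx, :len(trunc)] = trunc  # padding='post': pad at the tail

print(out.tolist())  # → [[5, 6, 7, 0, 0], [1, 2, 3, 4, 5]]
```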
72 +
73 +
74 +def create_embeddings(data_dir,
75 + model_file,
76 + embeddings_file,
77 + vocab_file,
78 + splitc=' ',
79 + **params):
80 + """
81 + making embedding from files.
82 + :**params additional Word2Vec() parameters
83 + :splitc char for splitting in data_dir files
84 + :model_file output object from Word2Vec()
85 + :data_dir data dir to be process
86 + :embeddings_file numpy object file path from Word2Vec()
87 + :vocab_file item to index json dictionary
88 + """
89 + class SentenceGenerator(object):
90 + def __init__(self, dirname):
91 + self.dirname = dirname
92 +
93 + def __iter__(self):
94 + for fname in os.listdir(self.dirname):
95 + print("processing~ '{}'".format(fname))
96 + for line in bz2.open(os.path.join(self.dirname, fname), "rt"):
97 + yield sent_to_spacing_chars(line.strip()).split(splitc)
98 +
99 + sentences = SentenceGenerator(data_dir)
100 +
101 +    # train FastText on the streamed sentences, then persist the model
102 +    model = FastText(sentences, **params)
102 + model.save(model_file)
103 + weights = model.wv.syn0
104 + default_vec = np.mean(weights, axis=0, keepdims=True)
105 + padding_vec = np.zeros((1, weights.shape[1]))
106 +
107 + weights_default = np.concatenate([weights, default_vec, padding_vec],
108 + axis=0)
109 +
110 + np.save(open(embeddings_file, 'wb'), weights_default)
111 +
112 + vocab = dict([(k, v.index) for k, v in model.wv.vocab.items()])
113 + vocab['__PAD__'] = weights_default.shape[0] - 1
114 + with open(vocab_file, 'w') as f:
115 + f.write(json.dumps(vocab))
116 +
117 +
118 +def load_embedding(embeddings_file):
119 + return (np.load(embeddings_file))
120 +
121 +
122 +def load_vocab(vocab_path):
123 + with open(vocab_path, 'r') as f:
124 + data = json.loads(f.read())
125 + word2idx = data
126 + idx2word = dict([(v, k) for k, v in data.items()])
127 + return word2idx, idx2word
128 +
129 +def get_similar_char(word2idx_dic, model, jamo_model, text, try_cnt, OOV_CNT, HIT_CNT):
130 + OOV_CNT += 1
131 + jamo_text = jamo_sentence(text)
132 +    similar_list = jamo_model.wv.most_similar(jamo_text)[:try_cnt]
133 +    for char in similar_list:
134 +        result = jamo_to_word(char[0])
135 +
136 +        if result in word2idx_dic:
137 +            HIT_CNT += 1
138 +            return OOV_CNT, HIT_CNT, result
139 +
140 +    # no similar in-vocab character found; fall back to the base model
141 +    return OOV_CNT, HIT_CNT, model.wv.most_similar(text)[0][0]
147 +
148 +
149 +def encoding_and_padding(word2idx_dic, sequences, **params):
150 + """
151 + 1. making item to idx
152 + 2. padding
153 + :word2idx_dic
154 + :sequences: list of lists where each element is a sequence
155 + :maxlen: int, maximum length
156 + :dtype: type to cast the resulting sequence.
157 + :padding: 'pre' or 'post', pad either before or after each sequence.
158 + :truncating: 'pre' or 'post', remove values from sequences larger than
159 + maxlen either in the beginning or in the end of the sequence
160 + :value: float, value to pad the sequences to the desired value.
161 + """
162 + model_file = 'model/fasttext'
163 + jamo_model_path = 'jamo_model/fasttext'
164 + print('seq_idx start')
165 + model = FastText.load(model_file)
166 + jamo_model = FastText.load(jamo_model_path)
167 + seq_idx = []
168 + OOV_CNT = 0
169 + HIT_CNT = 0
170 + TOTAL_CNT = 0
171 +
172 + for word in tqdm(sequences):
173 + temp = []
174 + for char in word:
175 + TOTAL_CNT += 1
176 + if char in word2idx_dic.keys():
177 + temp.append(word2idx_dic[char])
178 + else:
179 + OOV_CNT, HIT_CNT, result = get_similar_char(word2idx_dic, model, jamo_model, char, 3, OOV_CNT, HIT_CNT)
180 + temp.append(word2idx_dic[result])
181 + seq_idx.append(temp)
182 + print('TOTAL CNT: ', TOTAL_CNT, 'OOV CNT: ', OOV_CNT, 'HIT_CNT: ', HIT_CNT)
183 + if OOV_CNT > 0 and HIT_CNT > 0:
184 + print('OOV RATE:', float(OOV_CNT) / TOTAL_CNT * 100, '%' ,'HIT_RATE: ', float(HIT_CNT) / float(OOV_CNT) * 100, '%')
185 +
186 + params['value'] = word2idx_dic['__PAD__']
187 + return (pad_sequences(seq_idx, **params))
188 +
189 +
190 +def get_embedding_model(name='fee_prods', path='data/embedding'):
191 + weights = pkg_resources.resource_filename(
192 + 'dsc', os.path.join(path, name, 'weights.np'))
193 + w2idx = pkg_resources.resource_filename(
194 + 'dsc', os.path.join(path, name, 'idx.json'))
195 + return ((load_embedding(weights), load_vocab(w2idx)[0]))
1 +import re
2 +from soynlp.hangle import compose, decompose, character_is_korean
3 +
4 +
5 +doublespace_pattern = re.compile(r'\s+')
6 +
7 +def jamo_sentence(sent):
8 + def transform(char):
9 + if char == ' ':
10 + return char
11 +
12 + cjj = decompose(char)
13 + if len(cjj) == 1:
14 + return cjj
15 +
16 + cjj_ = ''.join(c if c != ' ' else '-' for c in cjj)
17 + return cjj_
18 +
19 + sent_ = []
20 + for char in sent:
21 + if character_is_korean(char):
22 + sent_.append(transform(char))
23 + else:
24 + sent_.append(char)
25 + sent_ = doublespace_pattern.sub(' ', ''.join(sent_))
26 + return sent_
27 +
28 +def jamo_to_word(jamo):
29 + jamo_list, idx = [], 0
30 +
31 + while idx < len(jamo):
32 + if not character_is_korean(jamo[idx]):
33 + jamo_list.append(jamo[idx])
34 + idx += 1
35 + else:
36 + jamo_list.append(jamo[idx:idx + 3])
37 + idx += 3
38 +
39 + word = ""
40 + for jamo_char in jamo_list:
41 + if len(jamo_char) == 1:
42 + word += jamo_char
43 +        elif jamo_char[2] == "-":
44 +            word += compose(jamo_char[0], jamo_char[1], " ")
45 +        else:
46 +            word += compose(jamo_char[0], jamo_char[1], jamo_char[2])
46 +
47 + return word
48 +
49 +def break_char(jamo_sent):
50 +    idx = 0
51 +    corpus = []
52 +
53 +    while idx < len(jamo_sent):
54 +        if not character_is_korean(jamo_sent[idx]):
55 +            corpus.append(jamo_sent[idx])
56 +            idx += 1
57 +        else:
58 +            corpus.append(jamo_sent[idx:idx + 3])
59 +            idx += 3
60 +    return corpus
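`compose`/`decompose` from soynlp implement the standard Unicode Hangul syllable arithmetic. A self-contained sketch of that arithmetic (jamo are represented by their Unicode indices here, rather than soynlp's character strings):

```python
# Unicode Hangul: syllable = 0xAC00 + (cho * 21 + jung) * 28 + jong
def decompose_syllable(ch):
    code = ord(ch) - 0xAC00
    return code // (21 * 28), (code // 28) % 21, code % 28

def compose_syllable(cho, jung, jong=0):
    return chr(0xAC00 + (cho * 21 + jung) * 28 + jong)

print(decompose_syllable('한'))   # → (18, 0, 4): ㅎ, ㅏ, ㄴ
print(compose_syllable(18, 0, 4))  # → '한'
```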
...\ No newline at end of file ...\ No newline at end of file
1 +# coding=utf-8
2 +# Copyright 2020 Heewon Jeon. All rights reserved.
3 +#
4 +# Licensed under the Apache License, Version 2.0 (the "License");
5 +# you may not use this file except in compliance with the License.
6 +# You may obtain a copy of the License at
7 +#
8 +# http://www.apache.org/licenses/LICENSE-2.0
9 +#
10 +# Unless required by applicable law or agreed to in writing, software
11 +# distributed under the License is distributed on an "AS IS" BASIS,
12 +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 +# See the License for the specific language governing permissions and
14 +# limitations under the License.
15 +
16 +def sent_to_spacing_chars(sent):
17 +    # spaces become '^'
18 +    chars = sent.strip().replace(' ', '^')
19 +    # '«' marks the sentence start, '»' marks the sentence end
20 +    tagged_chars = "«" + chars + "»"
21 +    # sentence -> space-separated character string
22 +    char_list = ' '.join(list(tagged_chars))
23 +    return char_list