Showing 19 changed files with 1463 additions and 0 deletions
1 | +# ML-based Spacing Corrector | ||
2 | +This model is an improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing"), using FastText instead of Word2Vec. | ||
3 | + | ||
4 | +## Performances | ||
5 | +| Model | Test Accuracy(%) | Encoding Time Cost | | ||
6 | +| :------------: | :------------: | :------------: | | ||
7 | +| TrainKoSpacing | 96.6147 | 02m 23s| | ||
8 | +| 자모분해 FastText | 98.9915 | 08h 20m 11s | ||
9 | +| 2 Stage FastText | 99.0888 | 03m 23s | ||
10 | + | ||
11 | +## Data | ||
12 | +#### Corpus | ||
13 | + | ||
14 | +We mainly use the National Institute of Korean Language 모두의 말뭉치 (Everyone's Corpus) and National Information Society Agency AI-Hub data. However, due to licensing restrictions, we cannot redistribute these datasets. You should be able to obtain them through the links below: | ||
15 | +[National Institute of Korean Language 모두의 말뭉치](https://corpus.korean.go.kr/). | ||
16 | +[National Information Society Agency AI-Hub](https://aihub.or.kr/aihub-data/natural-language/about "National Information Society Agency AI-Hub") | ||
17 | + | ||
18 | +#### Data format | ||
19 | +A bzip2-compressed text file with one sentence per line. | ||
20 | + | ||
21 | +``` | ||
22 | +~/KoSpacing/data$ bzcat train.txt.bz2 | head | ||
23 | +엠마누엘 웅가로 / 의상서 실내 장식품으로… 디자인 세계 넓혀 | ||
24 | +프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다. | ||
25 | +웅가로는 침실과 식당, 욕실에서 사용하는 갖가지 직물제품을 디자인해 최근 파리의 갤러리 라파예트백화점에서 '색의 컬렉션'이라는 이름으로 전시회를 열었다. | ||
26 | +``` | ||
27 | + | ||
28 | + | ||
29 | +## Architecture | ||
30 | + | ||
31 | +### Model | ||
32 | + | ||
33 | + | ||
34 | +### Word Embedding | ||
35 | +#### 자모분해 | ||
36 | +To capture the shape similarity of Korean characters, we use a 자모분해 (jamo decomposition) FastText word embedding. | ||
37 | +ex) | ||
38 | +자연어처리 | ||
39 | +ㅈ ㅏ – ㅇ ㅕ ㄴ ㅇ ㅓ – ㅊ ㅓ – ㄹ ㅣ – | ||
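This decomposition can be reproduced with `soynlp`, which is what `train/utils/jamo_utils.py` uses. Below is a minimal sketch of that step (missing 종성 slots are filled with `-`, so every syllable expands to exactly three symbols):

```python
# Minimal sketch of 자모분해, mirroring train/utils/jamo_utils.py.
from soynlp.hangle import decompose, character_is_korean

def jamo_sentence(sent):
    out = []
    for char in sent:
        if not character_is_korean(char):
            out.append(char)
            continue
        cjj = decompose(char)  # (초성, 중성, 종성); 종성 is ' ' when absent
        out.append(''.join(c if c != ' ' else '-' for c in cjj))
    return ''.join(out)

print(jamo_sentence('자연어처리'))  # ㅈㅏ-ㅇㅕㄴㅇㅓ-ㅊㅓ-ㄹㅣ-
```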
40 | + | ||
41 | +#### 2 stage FastText | ||
42 | +Because 자모분해 is slow to encode (see the time costs in the table above), the 자모분해 FastText model is used only for out-of-vocabulary characters. | ||
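A rough sketch of this two-stage lookup is shown below; it mirrors `get_similar_char` and `encoding_and_padding` in `train/utils/embedding_maker.py`, and the model paths and `try_cnt` value are simply the defaults used there:

```python
# Two-stage lookup: known characters hit the syllable-level vocabulary directly;
# OOV characters fall back to the 자모분해 (jamo-level) FastText model.
from gensim.models import FastText
from utils.jamo_utils import jamo_sentence, jamo_to_word

model = FastText.load('model/fasttext')            # syllable-level FastText
jamo_model = FastText.load('jamo_model/fasttext')  # 자모분해 FastText

def char_to_idx(char, w2idx, try_cnt=3):
    if char in w2idx:                               # stage 1: direct vocabulary hit
        return w2idx[char]
    # stage 2: decompose the OOV character and look for a similar in-vocabulary one
    for cand, _ in jamo_model.wv.most_similar(jamo_sentence(char))[:try_cnt]:
        word = jamo_to_word(cand)                   # recompose jamo back into a syllable
        if word in w2idx:
            return w2idx[word]
    # last resort: nearest neighbour from the syllable-level model
    return w2idx[model.wv.most_similar(char)[0][0]]
```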
43 | + | ||
44 | + | ||
45 | +### Thresholding | ||
46 | +The middle range of the model's output distribution is fairly evenly spread, so a simple fixed cutoff on the raw probabilities separates spaces poorly. | ||
47 | + | ||
48 | + | ||
49 | +We therefore apply a log transform to the output probabilities and use their second derivative to pick space positions; a sketch follows. | ||
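The NumPy sketch below follows the constants used in `pred_spacing.get_spaced_sent` in `train/train.py` (`r = 255` for the log scaling and a `0.01` threshold), with boundary handling simplified: a space is inserted only where the log-scaled probability is salient and the discrete second derivative is negative (a local peak).

```python
import numpy as np

def threshold_spaces(probs, r=255, eps=0.01):
    """probs: per-character space probabilities predicted by the model (1-D array)."""
    c = 1.0 / np.log(1 + r)
    log_scaled = c * np.log(1 + r * probs)   # log transform stretches the low/mid range
    d2 = np.zeros_like(probs)                # discrete second derivative
    d2[1:-1] = probs[:-2] - 2 * probs[1:-1] + probs[2:]
    return (log_scaled > eps) & (d2 < 0)     # space only at log-salient local peaks

probs = np.array([0.02, 0.10, 0.85, 0.12, 0.05, 0.70, 0.30])
print(threshold_spaces(probs))  # [False False  True False False  True False]
```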
50 | + | ||
51 | + | ||
52 | + | ||
53 | + | ||
54 | + | ||
55 | +## How to Run | ||
56 | + | ||
57 | + | ||
58 | +### Installation | ||
59 | + | ||
60 | +- For training, a GPU is strongly recommended for speed. CPU is supported but training could be extremely slow. | ||
61 | +- Only Python 3.7 and above is supported. | ||
62 | +### Requirement | ||
63 | + | ||
64 | +- Python (>= 3.7) | ||
65 | +- MXNet (>= 1.6.0) | ||
66 | +- tqdm (>= 4.19.5) | ||
67 | +- Pandas (>= 0.22.0) | ||
68 | +- Gensim (>= 3.8.1) | ||
69 | +- GluonNLP (>= 0.9.1) | ||
70 | +- soynlp (>= 0.0.493) | ||
71 | + | ||
72 | +### Dependencies | ||
73 | + | ||
74 | +```bash | ||
75 | +pip install -r requirements.txt | ||
76 | +``` | ||
77 | + | ||
78 | +### Training | ||
79 | + | ||
80 | +```bash | ||
81 | +python train.py --train --train-samp-ratio 1.0 --num-epoch 50 --train_data data/train.txt.bz2 --test_data data/test.txt.bz2 --outputs train_log_to --model_type kospacing --model-file fasttext | ||
82 | +``` | ||
83 | + | ||
84 | +### Evaluation | ||
85 | + | ||
86 | +```bash | ||
87 | +python train.py --model-params model/kospacing.params --model_type kospacing | ||
88 | +sent > 중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다. | ||
89 | +중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다. | ||
90 | +spaced sent[0.12sec/sent] > 중국은 2018년 평창동계올림픽의 반환점에 이르기까지 아직 노골드 행진이다. | ||
91 | +``` | ||
92 | + | ||
93 | +### Directory | ||
94 | +Directory guide for the embedding model files; bold entries are required. | ||
95 | +A note on how these files are produced follows the listing. | ||
96 | + | ||
97 | +- model | ||
98 | + - **fasttext** | ||
99 | + - fasttext_vis | ||
100 | + - **fasttext.trainables.vectors_ngrams_lockf.npy** | ||
101 | + - **fasttext.wv.vectors_ngrams.npy** | ||
102 | + - **kospacing_wv.np** | ||
103 | + - **w2idx.dic** | ||
104 | + | ||
105 | +- jamo_model | ||
106 | + - **fasttext** | ||
107 | + - fasttext_vis | ||
108 | + - **fasttext.trainables.vectors_ngrams_lockf.npy** | ||
109 | + - **fasttext.wv.vectors_ngrams.npy** | ||
110 | + - **kospacing_wv.np** | ||
111 | + - **w2idx.dic** | ||
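These FastText artifacts are presumably produced with `train/embedding.py` (see its `--corpus_dir`, `--model-dir`, `--model-file`, `--numpy-wv`, and `--w2idx` options); the files under `jamo_model` would be built the same way from a 자모분해-decomposed corpus.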
112 | + | ||
113 | +### Reference | ||
114 | +TrainKoSpacing: https://github.com/haven-jeon/TrainKoSpacing | ||
115 | +딥 러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing with Deep Learning): https://wikidocs.net/book/2155 | ||
116 | + | ... | ... |
img/2-stage-FastText.png
0 → 100644

53.5 KB
img/Thresholding_result.png
0 → 100644

365 KB
img/kosapcing_img.png
0 → 100644

209 KB

32.1 KB
train/LICENSE
0 → 100644
1 | + Apache License | ||
2 | + Version 2.0, January 2004 | ||
3 | + http://www.apache.org/licenses/ | ||
4 | + | ||
5 | + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION | ||
6 | + | ||
7 | + 1. Definitions. | ||
8 | + | ||
9 | + "License" shall mean the terms and conditions for use, reproduction, | ||
10 | + and distribution as defined by Sections 1 through 9 of this document. | ||
11 | + | ||
12 | + "Licensor" shall mean the copyright owner or entity authorized by | ||
13 | + the copyright owner that is granting the License. | ||
14 | + | ||
15 | + "Legal Entity" shall mean the union of the acting entity and all | ||
16 | + other entities that control, are controlled by, or are under common | ||
17 | + control with that entity. For the purposes of this definition, | ||
18 | + "control" means (i) the power, direct or indirect, to cause the | ||
19 | + direction or management of such entity, whether by contract or | ||
20 | + otherwise, or (ii) ownership of fifty percent (50%) or more of the | ||
21 | + outstanding shares, or (iii) beneficial ownership of such entity. | ||
22 | + | ||
23 | + "You" (or "Your") shall mean an individual or Legal Entity | ||
24 | + exercising permissions granted by this License. | ||
25 | + | ||
26 | + "Source" form shall mean the preferred form for making modifications, | ||
27 | + including but not limited to software source code, documentation | ||
28 | + source, and configuration files. | ||
29 | + | ||
30 | + "Object" form shall mean any form resulting from mechanical | ||
31 | + transformation or translation of a Source form, including but | ||
32 | + not limited to compiled object code, generated documentation, | ||
33 | + and conversions to other media types. | ||
34 | + | ||
35 | + "Work" shall mean the work of authorship, whether in Source or | ||
36 | + Object form, made available under the License, as indicated by a | ||
37 | + copyright notice that is included in or attached to the work | ||
38 | + (an example is provided in the Appendix below). | ||
39 | + | ||
40 | + "Derivative Works" shall mean any work, whether in Source or Object | ||
41 | + form, that is based on (or derived from) the Work and for which the | ||
42 | + editorial revisions, annotations, elaborations, or other modifications | ||
43 | + represent, as a whole, an original work of authorship. For the purposes | ||
44 | + of this License, Derivative Works shall not include works that remain | ||
45 | + separable from, or merely link (or bind by name) to the interfaces of, | ||
46 | + the Work and Derivative Works thereof. | ||
47 | + | ||
48 | + "Contribution" shall mean any work of authorship, including | ||
49 | + the original version of the Work and any modifications or additions | ||
50 | + to that Work or Derivative Works thereof, that is intentionally | ||
51 | + submitted to Licensor for inclusion in the Work by the copyright owner | ||
52 | + or by an individual or Legal Entity authorized to submit on behalf of | ||
53 | + the copyright owner. For the purposes of this definition, "submitted" | ||
54 | + means any form of electronic, verbal, or written communication sent | ||
55 | + to the Licensor or its representatives, including but not limited to | ||
56 | + communication on electronic mailing lists, source code control systems, | ||
57 | + and issue tracking systems that are managed by, or on behalf of, the | ||
58 | + Licensor for the purpose of discussing and improving the Work, but | ||
59 | + excluding communication that is conspicuously marked or otherwise | ||
60 | + designated in writing by the copyright owner as "Not a Contribution." | ||
61 | + | ||
62 | + "Contributor" shall mean Licensor and any individual or Legal Entity | ||
63 | + on behalf of whom a Contribution has been received by Licensor and | ||
64 | + subsequently incorporated within the Work. | ||
65 | + | ||
66 | + 2. Grant of Copyright License. Subject to the terms and conditions of | ||
67 | + this License, each Contributor hereby grants to You a perpetual, | ||
68 | + worldwide, non-exclusive, no-charge, royalty-free, irrevocable | ||
69 | + copyright license to reproduce, prepare Derivative Works of, | ||
70 | + publicly display, publicly perform, sublicense, and distribute the | ||
71 | + Work and such Derivative Works in Source or Object form. | ||
72 | + | ||
73 | + 3. Grant of Patent License. Subject to the terms and conditions of | ||
74 | + this License, each Contributor hereby grants to You a perpetual, | ||
75 | + worldwide, non-exclusive, no-charge, royalty-free, irrevocable | ||
76 | + (except as stated in this section) patent license to make, have made, | ||
77 | + use, offer to sell, sell, import, and otherwise transfer the Work, | ||
78 | + where such license applies only to those patent claims licensable | ||
79 | + by such Contributor that are necessarily infringed by their | ||
80 | + Contribution(s) alone or by combination of their Contribution(s) | ||
81 | + with the Work to which such Contribution(s) was submitted. If You | ||
82 | + institute patent litigation against any entity (including a | ||
83 | + cross-claim or counterclaim in a lawsuit) alleging that the Work | ||
84 | + or a Contribution incorporated within the Work constitutes direct | ||
85 | + or contributory patent infringement, then any patent licenses | ||
86 | + granted to You under this License for that Work shall terminate | ||
87 | + as of the date such litigation is filed. | ||
88 | + | ||
89 | + 4. Redistribution. You may reproduce and distribute copies of the | ||
90 | + Work or Derivative Works thereof in any medium, with or without | ||
91 | + modifications, and in Source or Object form, provided that You | ||
92 | + meet the following conditions: | ||
93 | + | ||
94 | + (a) You must give any other recipients of the Work or | ||
95 | + Derivative Works a copy of this License; and | ||
96 | + | ||
97 | + (b) You must cause any modified files to carry prominent notices | ||
98 | + stating that You changed the files; and | ||
99 | + | ||
100 | + (c) You must retain, in the Source form of any Derivative Works | ||
101 | + that You distribute, all copyright, patent, trademark, and | ||
102 | + attribution notices from the Source form of the Work, | ||
103 | + excluding those notices that do not pertain to any part of | ||
104 | + the Derivative Works; and | ||
105 | + | ||
106 | + (d) If the Work includes a "NOTICE" text file as part of its | ||
107 | + distribution, then any Derivative Works that You distribute must | ||
108 | + include a readable copy of the attribution notices contained | ||
109 | + within such NOTICE file, excluding those notices that do not | ||
110 | + pertain to any part of the Derivative Works, in at least one | ||
111 | + of the following places: within a NOTICE text file distributed | ||
112 | + as part of the Derivative Works; within the Source form or | ||
113 | + documentation, if provided along with the Derivative Works; or, | ||
114 | + within a display generated by the Derivative Works, if and | ||
115 | + wherever such third-party notices normally appear. The contents | ||
116 | + of the NOTICE file are for informational purposes only and | ||
117 | + do not modify the License. You may add Your own attribution | ||
118 | + notices within Derivative Works that You distribute, alongside | ||
119 | + or as an addendum to the NOTICE text from the Work, provided | ||
120 | + that such additional attribution notices cannot be construed | ||
121 | + as modifying the License. | ||
122 | + | ||
123 | + You may add Your own copyright statement to Your modifications and | ||
124 | + may provide additional or different license terms and conditions | ||
125 | + for use, reproduction, or distribution of Your modifications, or | ||
126 | + for any such Derivative Works as a whole, provided Your use, | ||
127 | + reproduction, and distribution of the Work otherwise complies with | ||
128 | + the conditions stated in this License. | ||
129 | + | ||
130 | + 5. Submission of Contributions. Unless You explicitly state otherwise, | ||
131 | + any Contribution intentionally submitted for inclusion in the Work | ||
132 | + by You to the Licensor shall be under the terms and conditions of | ||
133 | + this License, without any additional terms or conditions. | ||
134 | + Notwithstanding the above, nothing herein shall supersede or modify | ||
135 | + the terms of any separate license agreement you may have executed | ||
136 | + with Licensor regarding such Contributions. | ||
137 | + | ||
138 | + 6. Trademarks. This License does not grant permission to use the trade | ||
139 | + names, trademarks, service marks, or product names of the Licensor, | ||
140 | + except as required for reasonable and customary use in describing the | ||
141 | + origin of the Work and reproducing the content of the NOTICE file. | ||
142 | + | ||
143 | + 7. Disclaimer of Warranty. Unless required by applicable law or | ||
144 | + agreed to in writing, Licensor provides the Work (and each | ||
145 | + Contributor provides its Contributions) on an "AS IS" BASIS, | ||
146 | + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or | ||
147 | + implied, including, without limitation, any warranties or conditions | ||
148 | + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A | ||
149 | + PARTICULAR PURPOSE. You are solely responsible for determining the | ||
150 | + appropriateness of using or redistributing the Work and assume any | ||
151 | + risks associated with Your exercise of permissions under this License. | ||
152 | + | ||
153 | + 8. Limitation of Liability. In no event and under no legal theory, | ||
154 | + whether in tort (including negligence), contract, or otherwise, | ||
155 | + unless required by applicable law (such as deliberate and grossly | ||
156 | + negligent acts) or agreed to in writing, shall any Contributor be | ||
157 | + liable to You for damages, including any direct, indirect, special, | ||
158 | + incidental, or consequential damages of any character arising as a | ||
159 | + result of this License or out of the use or inability to use the | ||
160 | + Work (including but not limited to damages for loss of goodwill, | ||
161 | + work stoppage, computer failure or malfunction, or any and all | ||
162 | + other commercial damages or losses), even if such Contributor | ||
163 | + has been advised of the possibility of such damages. | ||
164 | + | ||
165 | + 9. Accepting Warranty or Additional Liability. While redistributing | ||
166 | + the Work or Derivative Works thereof, You may choose to offer, | ||
167 | + and charge a fee for, acceptance of support, warranty, indemnity, | ||
168 | + or other liability obligations and/or rights consistent with this | ||
169 | + License. However, in accepting such obligations, You may act only | ||
170 | + on Your own behalf and on Your sole responsibility, not on behalf | ||
171 | + of any other Contributor, and only if You agree to indemnify, | ||
172 | + defend, and hold each Contributor harmless for any liability | ||
173 | + incurred by, or claims asserted against, such Contributor by reason | ||
174 | + of your accepting any such warranty or additional liability. | ||
175 | + | ||
176 | + END OF TERMS AND CONDITIONS | ||
177 | + | ||
178 | + APPENDIX: How to apply the Apache License to your work. | ||
179 | + | ||
180 | + To apply the Apache License to your work, attach the following | ||
181 | + boilerplate notice, with the fields enclosed by brackets "[]" | ||
182 | + replaced with your own identifying information. (Don't include | ||
183 | + the brackets!) The text should be enclosed in the appropriate | ||
184 | + comment syntax for the file format. We also recommend that a | ||
185 | + file or class name and description of purpose be included on the | ||
186 | + same "printed page" as the copyright notice for easier | ||
187 | + identification within third-party archives. | ||
188 | + | ||
189 | + Copyright [yyyy] [name of copyright owner] | ||
190 | + | ||
191 | + Licensed under the Apache License, Version 2.0 (the "License"); | ||
192 | + you may not use this file except in compliance with the License. | ||
193 | + You may obtain a copy of the License at | ||
194 | + | ||
195 | + http://www.apache.org/licenses/LICENSE-2.0 | ||
196 | + | ||
197 | + Unless required by applicable law or agreed to in writing, software | ||
198 | + distributed under the License is distributed on an "AS IS" BASIS, | ||
199 | + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
200 | + See the License for the specific language governing permissions and | ||
201 | + limitations under the License. |
train/data/example.txt.bz2
0 → 100644
No preview for this file type
train/embedding.py
0 → 100644
1 | +# coding=utf-8 | ||
2 | +# Copyright 2020 Heewon Jeon. All rights reserved. | ||
3 | +# | ||
4 | +# Licensed under the Apache License, Version 2.0 (the "License"); | ||
5 | +# you may not use this file except in compliance with the License. | ||
6 | +# You may obtain a copy of the License at | ||
7 | +# | ||
8 | +# http://www.apache.org/licenses/LICENSE-2.0 | ||
9 | +# | ||
10 | +# Unless required by applicable law or agreed to in writing, software | ||
11 | +# distributed under the License is distributed on an "AS IS" BASIS, | ||
12 | +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
13 | +# See the License for the specific language governing permissions and | ||
14 | +# limitations under the License. | ||
15 | + | ||
16 | +import argparse | ||
17 | +from utils.embedding_maker import create_embeddings | ||
18 | + | ||
19 | + | ||
20 | +parser = argparse.ArgumentParser(description='Korean Autospacing Embedding Maker') | ||
21 | + | ||
22 | +parser.add_argument('--num-iters', type=int, default=5, | ||
23 | + help='number of iterations to train (default: 5)') | ||
24 | + | ||
25 | +parser.add_argument('--min-count', type=int, default=100, | ||
26 | +                    help='minimum word count to filter (default: 100)') | ||
27 | + | ||
28 | +parser.add_argument('--embedding-size', type=int, default=100, | ||
29 | +                    help='embedding dimension size (default: 100)') | ||
30 | + | ||
31 | +parser.add_argument('--num-worker', type=int, default=16, | ||
32 | +                    help='number of worker threads (default: 16)') | ||
33 | + | ||
34 | +parser.add_argument('--window-size', type=int, default=8, | ||
35 | + help='skip-gram window size (default: 8)') | ||
36 | + | ||
37 | +parser.add_argument('--corpus_dir', type=str, default='data', | ||
38 | + help='training resource dir') | ||
39 | + | ||
40 | +parser.add_argument('--train', action='store_true', default=True, | ||
41 | +                    help='do embedding training (default: True)') | ||
42 | + | ||
43 | +parser.add_argument('--model-file', type=str, default='kospacing_wv.mdl', | ||
44 | + help='output object from Word2Vec() (default: kospacing_wv.mdl)') | ||
45 | + | ||
46 | +parser.add_argument('--numpy-wv', type=str, default='kospacing_wv.np', | ||
47 | + help='numpy object file path from Word2Vec() (default: kospacing_wv.np)') | ||
48 | + | ||
49 | +parser.add_argument('--w2idx', type=str, default='w2idx.dic', | ||
50 | + help='item to index json dictionary (default: w2idx.dic)') | ||
51 | + | ||
52 | +parser.add_argument('--model-dir', type=str, default='model', | ||
53 | + help='dir to save models (default: model)') | ||
54 | + | ||
55 | +opt = parser.parse_args() | ||
56 | + | ||
57 | +if opt.train: | ||
58 | + create_embeddings(opt.corpus_dir, opt.model_dir + '/' + | ||
59 | + opt.model_file, opt.model_dir + '/' + opt.numpy_wv, | ||
60 | + opt.model_dir + '/' + opt.w2idx, min_count=opt.min_count, | ||
61 | + iter=opt.num_iters, | ||
62 | + size=opt.embedding_size, workers=opt.num_worker, window=opt.window_size) |
train/jamo_model/.gitignore
0 → 100644
File mode changed
train/model/.gitignore
0 → 100644
File mode changed
train/output/.gitignore
0 → 100644
File mode changed
train/requirements.txt
0 → 100644
1 | +absl-py==0.11.0 | ||
2 | +astunparse==1.6.3 | ||
3 | +cachetools==4.2.1 | ||
4 | +certifi==2020.12.5 | ||
5 | +chardet==4.0.0 | ||
6 | +click==7.1.2 | ||
7 | +cmake==3.18.4.post1 | ||
8 | +Cython==0.29.21 | ||
9 | +Flask==1.1.2 | ||
10 | +Flask-Cors==3.0.9 | ||
11 | +flatbuffers==1.12 | ||
12 | +gast==0.3.3 | ||
13 | +gensim==3.8.3 | ||
14 | +gluonnlp==0.10.0 | ||
15 | +google-auth==1.26.1 | ||
16 | +google-auth-oauthlib==0.4.2 | ||
17 | +google-pasta==0.2.0 | ||
18 | +graphviz==0.8.4 | ||
19 | +grpcio==1.32.0 | ||
20 | +h5py==2.10.0 | ||
21 | +idna==2.10 | ||
22 | +importlib-metadata==3.4.0 | ||
23 | +itsdangerous==1.1.0 | ||
24 | +Jinja2==2.11.2 | ||
25 | +joblib==1.0.1 | ||
26 | +Keras==2.4.3 | ||
27 | +Keras-Preprocessing==1.1.2 | ||
28 | +Markdown==3.3.3 | ||
29 | +MarkupSafe==1.1.1 | ||
30 | +mxnet-cu101==1.7.0 | ||
31 | +mxnet-cu101mkl==1.6.0.post0 | ||
32 | +mxnet-mkl==1.6.0 | ||
33 | +numpy==1.19.5 | ||
34 | +oauthlib==3.1.0 | ||
35 | +opt-einsum==3.3.0 | ||
36 | +packaging==20.9 | ||
37 | +pandas==1.2.2 | ||
38 | +protobuf==3.14.0 | ||
39 | +psutil==5.8.0 | ||
40 | +pyasn1==0.4.8 | ||
41 | +pyasn1-modules==0.2.8 | ||
42 | +pyparsing==2.4.7 | ||
43 | +python-dateutil==2.8.1 | ||
44 | +pytz==2020.5 | ||
45 | +PyYAML==5.3.1 | ||
46 | +requests==2.25.1 | ||
47 | +requests-oauthlib==1.3.0 | ||
48 | +rsa==4.6 | ||
49 | +scikit-learn==0.24.1 | ||
50 | +scipy==1.6.0 | ||
51 | +six==1.15.0 | ||
52 | +smart-open==4.0.1 | ||
53 | +soynlp==0.0.493 | ||
54 | +tensorboard==2.4.0 | ||
55 | +tensorboard-plugin-wit==1.7.0 | ||
56 | +tensorflow==2.4.1 | ||
57 | +tensorflow-estimator==2.4.0 | ||
58 | +termcolor==1.1.0 | ||
59 | +threadpoolctl==2.1.0 | ||
60 | +tqdm==4.56.0 | ||
61 | +typing-extensions==3.7.4.3 | ||
62 | +urllib3==1.26.3 | ||
63 | +Werkzeug==1.0.1 | ||
64 | +wrapt==1.12.1 | ||
65 | +zipp==3.4.0 |
train/train.py
0 → 100644
1 | +# coding=utf-8 | ||
2 | +# Copyright 2020 Heewon Jeon. All rights reserved. | ||
3 | +# | ||
4 | +# Licensed under the Apache License, Version 2.0 (the "License"); | ||
5 | +# you may not use this file except in compliance with the License. | ||
6 | +# You may obtain a copy of the License at | ||
7 | +# | ||
8 | +# http://www.apache.org/licenses/LICENSE-2.0 | ||
9 | +# | ||
10 | +# Unless required by applicable law or agreed to in writing, software | ||
11 | +# distributed under the License is distributed on an "AS IS" BASIS, | ||
12 | +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
13 | +# See the License for the specific language governing permissions and | ||
14 | +# limitations under the License. | ||
15 | + | ||
16 | +import argparse | ||
17 | +import bz2 | ||
18 | +import logging | ||
19 | +import re | ||
20 | +import time | ||
21 | +from functools import lru_cache | ||
22 | +from timeit import default_timer as timer | ||
23 | + | ||
24 | +import gluonnlp as nlp | ||
25 | +import mxnet as mx | ||
26 | +import mxnet.autograd as autograd | ||
27 | +import numpy as np | ||
28 | +from mxnet import gluon | ||
29 | +from mxnet.gluon import nn, rnn | ||
30 | +from tqdm import tqdm | ||
31 | +import csv | ||
32 | + | ||
33 | +from utils.embedding_maker import (encoding_and_padding, load_embedding, | ||
34 | + load_vocab) | ||
35 | + | ||
36 | +logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s] %(message)s") | ||
37 | +logger = logging.getLogger() | ||
38 | + | ||
39 | +parser = argparse.ArgumentParser(description='Korean Autospacing Trainer') | ||
40 | +parser.add_argument('--num-epoch', | ||
41 | + type=int, | ||
42 | + default=5, | ||
43 | + help='number of iterations to train (default: 5)') | ||
44 | + | ||
45 | +parser.add_argument('--n-hidden', | ||
46 | + type=int, | ||
47 | + default=200, | ||
48 | + help='GRU hidden size (default: 200)') | ||
49 | + | ||
50 | +parser.add_argument('--max-seq-len', | ||
51 | + type=int, | ||
52 | + default=200, | ||
53 | + help='max sentence length on input (default: 200)') | ||
54 | + | ||
55 | +parser.add_argument('--num-gpus', | ||
56 | + type=int, | ||
57 | + default=1, | ||
58 | + help='number of gpus (default: 1)') | ||
59 | + | ||
60 | +parser.add_argument('--vocab-file', | ||
61 | + type=str, | ||
62 | + default='model/w2idx.dic', | ||
63 | +                    help='vocabulary file (default: model/w2idx.dic)') | ||
64 | + | ||
65 | +parser.add_argument( | ||
66 | + '--embedding-file', | ||
67 | + type=str, | ||
68 | + default='model/kospacing_wv.np', | ||
69 | + help='embedding matrix file (default: model/kospacing_wv.np)') | ||
70 | + | ||
71 | +parser.add_argument('--train', | ||
72 | + action='store_true', | ||
73 | + default=False, | ||
74 | +                    help='do training (default: False)') | ||
75 | + | ||
76 | +parser.add_argument( | ||
77 | + '--model-file', | ||
78 | + type=str, | ||
79 | + default='kospacing_wv.mdl', | ||
80 | + help='output object from Word2Vec() (default: kospacing_wv.mdl)') | ||
81 | + | ||
82 | +parser.add_argument('--train-samp-ratio', | ||
83 | + type=float, | ||
84 | + default=0.50, | ||
85 | +                    help='random train sample ratio (default: 0.50)') | ||
86 | + | ||
87 | +parser.add_argument('--model-prefix', | ||
88 | + type=str, | ||
89 | + default='kospacing', | ||
90 | + help='prefix of output model file (default: kospacing)') | ||
91 | + | ||
92 | +parser.add_argument('--model-params', | ||
93 | + type=str, | ||
94 | + default='kospacing_0.params', | ||
95 | + help='model params file (default: kospacing_0.params)') | ||
96 | + | ||
97 | +parser.add_argument('--test', | ||
98 | + action='store_true', | ||
99 | + default=False, | ||
100 | + help='eval train set (default: False)') | ||
101 | + | ||
102 | +parser.add_argument('--batch_size', | ||
103 | + type=int, | ||
104 | + default=100, | ||
105 | + help='train batch size') | ||
106 | + | ||
107 | +parser.add_argument('--test_batch_size', | ||
108 | + type=int, | ||
109 | + default=100, | ||
110 | + help='test batch size') | ||
111 | + | ||
112 | +parser.add_argument('--n_workers', | ||
113 | + type=int, | ||
114 | + default=10, | ||
115 | + help='number of dataloader workers') | ||
116 | + | ||
117 | +parser.add_argument('--train_data', | ||
118 | + type=str, | ||
119 | + default='data/UCorpus_spacing_train.txt.bz2', | ||
120 | +                    help='bzip2-compressed train data') | ||
121 | + | ||
122 | +parser.add_argument('--test_data', | ||
123 | + type=str, | ||
124 | + default='data/UCorpus_spacing_test.txt.bz2', | ||
125 | +                    help='bzip2-compressed test data') | ||
126 | + | ||
127 | +parser.add_argument('--model_type', | ||
128 | + type=str, | ||
129 | + default='kospacing', | ||
130 | + help='kospacing or kospacing2') | ||
131 | + | ||
132 | +parser.add_argument('--outputs', | ||
133 | + type=str, | ||
134 | + default='outputs', | ||
135 | + help='directory to save log and model params') | ||
136 | + | ||
137 | +opt = parser.parse_args() | ||
138 | + | ||
139 | +nlp.utils.mkdir(opt.outputs) | ||
140 | + | ||
141 | +fileHandler = logging.FileHandler(opt.outputs + '/' + 'log.log') | ||
142 | +fileHandler.setFormatter(logFormatter) | ||
143 | +logger.addHandler(fileHandler) | ||
144 | + | ||
145 | +consoleHandler = logging.StreamHandler() | ||
146 | +consoleHandler.setFormatter(logFormatter) | ||
147 | +logger.addHandler(consoleHandler) | ||
148 | + | ||
149 | +logger.setLevel(logging.DEBUG) | ||
150 | +logger.info(opt) | ||
151 | + | ||
152 | +GPU_COUNT = opt.num_gpus | ||
153 | +ctx = [mx.gpu(i) for i in range(GPU_COUNT)] | ||
154 | + | ||
155 | + | ||
156 | +# Model class | ||
157 | +class korean_autospacing_base(gluon.HybridBlock): | ||
158 | + def __init__(self, n_hidden, vocab_size, embed_dim, max_seq_length, | ||
159 | + **kwargs): | ||
160 | + super(korean_autospacing_base, self).__init__(**kwargs) | ||
161 | +        # input sequence length | ||
162 | + self.in_seq_len = max_seq_length | ||
163 | +        # output sequence length | ||
164 | + self.out_seq_len = max_seq_length | ||
165 | +        # number of GRU hidden units | ||
166 | + self.n_hidden = n_hidden | ||
167 | +        # number of unique characters (vocabulary size) | ||
168 | + self.vocab_size = vocab_size | ||
169 | + # max_seq_length | ||
170 | + self.max_seq_length = max_seq_length | ||
171 | +        # embedding dimension | ||
172 | + self.embed_dim = embed_dim | ||
173 | + | ||
174 | + with self.name_scope(): | ||
175 | + self.embedding = nn.Embedding(input_dim=self.vocab_size, | ||
176 | + output_dim=self.embed_dim) | ||
177 | + | ||
178 | + self.conv_unigram = nn.Conv2D(channels=128, | ||
179 | + kernel_size=(1, self.embed_dim)) | ||
180 | + | ||
181 | + self.conv_bigram = nn.Conv2D(channels=256, | ||
182 | + kernel_size=(2, self.embed_dim), | ||
183 | + padding=(1, 0)) | ||
184 | + | ||
185 | + self.conv_trigram = nn.Conv2D(channels=128, | ||
186 | + kernel_size=(3, self.embed_dim), | ||
187 | + padding=(1, 0)) | ||
188 | + | ||
189 | + self.conv_forthgram = nn.Conv2D(channels=64, | ||
190 | + kernel_size=(4, self.embed_dim), | ||
191 | + padding=(2, 0)) | ||
192 | + | ||
193 | + self.conv_fifthgram = nn.Conv2D(channels=32, | ||
194 | + kernel_size=(5, self.embed_dim), | ||
195 | + padding=(2, 0)) | ||
196 | + | ||
197 | + self.bi_gru = rnn.GRU(hidden_size=self.n_hidden, layout='NTC', bidirectional=True) | ||
198 | + self.dense_sh = nn.Dense(100, activation='relu', flatten=False) | ||
199 | + self.dense = nn.Dense(1, activation='sigmoid', flatten=False) | ||
200 | + | ||
201 | + def hybrid_forward(self, F, inputs): | ||
202 | + embed = self.embedding(inputs) | ||
203 | + embed = F.expand_dims(embed, axis=1) | ||
204 | + unigram = self.conv_unigram(embed) | ||
205 | + bigram = self.conv_bigram(embed) | ||
206 | + trigram = self.conv_trigram(embed) | ||
207 | + forthgram = self.conv_forthgram(embed) | ||
208 | + fifthgram = self.conv_fifthgram(embed) | ||
209 | + | ||
210 | + grams = F.concat(unigram, | ||
211 | + F.slice_axis(bigram, | ||
212 | + axis=2, | ||
213 | + begin=0, | ||
214 | + end=self.max_seq_length), | ||
215 | + trigram, | ||
216 | + F.slice_axis(forthgram, | ||
217 | + axis=2, | ||
218 | + begin=0, | ||
219 | + end=self.max_seq_length), | ||
220 | + F.slice_axis(fifthgram, | ||
221 | + axis=2, | ||
222 | + begin=0, | ||
223 | + end=self.max_seq_length), | ||
224 | + dim=1) | ||
225 | + | ||
226 | + grams = F.transpose(grams, (0, 2, 3, 1)) | ||
227 | + grams = F.reshape(grams, (-1, self.max_seq_length, -3)) | ||
228 | + grams = self.bi_gru(grams) | ||
229 | + fc1 = self.dense_sh(grams) | ||
230 | + return (self.dense(fc1)) | ||
231 | + | ||
232 | + | ||
233 | +# https://raw.githubusercontent.com/haven-jeon/Train_KoSpacing/master/img/kosapcing_img.png | ||
234 | +class korean_autospacing2(gluon.HybridBlock): | ||
235 | + def __init__(self, n_hidden, vocab_size, embed_dim, max_seq_length, | ||
236 | + **kwargs): | ||
237 | + super(korean_autospacing2, self).__init__(**kwargs) | ||
238 | +        # input sequence length | ||
239 | + self.in_seq_len = max_seq_length | ||
240 | +        # output sequence length | ||
241 | + self.out_seq_len = max_seq_length | ||
242 | +        # number of GRU hidden units | ||
243 | + self.n_hidden = n_hidden | ||
244 | +        # number of unique characters (vocabulary size) | ||
245 | + self.vocab_size = vocab_size | ||
246 | + # max_seq_length | ||
247 | + self.max_seq_length = max_seq_length | ||
248 | +        # embedding dimension | ||
249 | + self.embed_dim = embed_dim | ||
250 | + | ||
251 | + with self.name_scope(): | ||
252 | + self.embedding = nn.Embedding(input_dim=self.vocab_size, | ||
253 | + output_dim=self.embed_dim) | ||
254 | + | ||
255 | + self.conv_unigram = nn.Conv2D(channels=128, | ||
256 | + kernel_size=(1, self.embed_dim)) | ||
257 | + | ||
258 | + self.conv_bigram = nn.Conv2D(channels=128, | ||
259 | + kernel_size=(2, self.embed_dim), | ||
260 | + padding=(1, 0)) | ||
261 | + | ||
262 | + self.conv_trigram = nn.Conv2D(channels=64, | ||
263 | + kernel_size=(3, self.embed_dim), | ||
264 | + padding=(2, 0)) | ||
265 | + | ||
266 | + self.conv_forthgram = nn.Conv2D(channels=32, | ||
267 | + kernel_size=(4, self.embed_dim), | ||
268 | + padding=(3, 0)) | ||
269 | + | ||
270 | + self.conv_fifthgram = nn.Conv2D(channels=16, | ||
271 | + kernel_size=(5, self.embed_dim), | ||
272 | + padding=(4, 0)) | ||
273 | + # for reverse convolution | ||
274 | + self.conv_rev_bigram = nn.Conv2D(channels=128, | ||
275 | + kernel_size=(2, self.embed_dim), | ||
276 | + padding=(1, 0)) | ||
277 | + | ||
278 | + self.conv_rev_trigram = nn.Conv2D(channels=64, | ||
279 | + kernel_size=(3, self.embed_dim), | ||
280 | + padding=(2, 0)) | ||
281 | + | ||
282 | + self.conv_rev_forthgram = nn.Conv2D(channels=32, | ||
283 | + kernel_size=(4, | ||
284 | + self.embed_dim), | ||
285 | + padding=(3, 0)) | ||
286 | + | ||
287 | + self.conv_rev_fifthgram = nn.Conv2D(channels=16, | ||
288 | + kernel_size=(5, | ||
289 | + self.embed_dim), | ||
290 | + padding=(4, 0)) | ||
291 | + self.bi_gru = rnn.GRU(hidden_size=self.n_hidden, layout='NTC', bidirectional=True) | ||
292 | + # self.bi_gru = rnn.BidirectionalCell( | ||
293 | + # rnn.GRUCell(hidden_size=self.n_hidden), | ||
294 | + # rnn.GRUCell(hidden_size=self.n_hidden)) | ||
295 | + self.dense_sh = nn.Dense(100, activation='relu', flatten=False) | ||
296 | + self.dense = nn.Dense(1, activation='sigmoid', flatten=False) | ||
297 | + | ||
298 | + def hybrid_forward(self, F, inputs): | ||
299 | + embed = self.embedding(inputs) | ||
300 | + embed = F.expand_dims(embed, axis=1) | ||
301 | + rev_embed = embed.flip(axis=2) | ||
302 | + | ||
303 | + unigram = self.conv_unigram(embed) | ||
304 | + bigram = self.conv_bigram(embed) | ||
305 | + trigram = self.conv_trigram(embed) | ||
306 | + forthgram = self.conv_forthgram(embed) | ||
307 | + fifthgram = self.conv_fifthgram(embed) | ||
308 | + | ||
309 | + rev_bigram = self.conv_rev_bigram(rev_embed).flip(axis=2) | ||
310 | + rev_trigram = self.conv_rev_trigram(rev_embed).flip(axis=2) | ||
311 | + rev_forthgram = self.conv_rev_forthgram(rev_embed).flip(axis=2) | ||
312 | + rev_fifthgram = self.conv_rev_fifthgram(rev_embed).flip(axis=2) | ||
313 | + | ||
314 | + grams = F.concat(unigram, | ||
315 | + F.slice_axis(bigram, | ||
316 | + axis=2, | ||
317 | + begin=0, | ||
318 | + end=self.max_seq_length), | ||
319 | + F.slice_axis(rev_bigram, | ||
320 | + axis=2, | ||
321 | + begin=0, | ||
322 | + end=self.max_seq_length), | ||
323 | + F.slice_axis(trigram, | ||
324 | + axis=2, | ||
325 | + begin=0, | ||
326 | + end=self.max_seq_length), | ||
327 | + F.slice_axis(rev_trigram, | ||
328 | + axis=2, | ||
329 | + begin=0, | ||
330 | + end=self.max_seq_length), | ||
331 | + F.slice_axis(forthgram, | ||
332 | + axis=2, | ||
333 | + begin=0, | ||
334 | + end=self.max_seq_length), | ||
335 | + F.slice_axis(rev_forthgram, | ||
336 | + axis=2, | ||
337 | + begin=0, | ||
338 | + end=self.max_seq_length), | ||
339 | + F.slice_axis(fifthgram, | ||
340 | + axis=2, | ||
341 | + begin=0, | ||
342 | + end=self.max_seq_length), | ||
343 | + F.slice_axis(rev_fifthgram, | ||
344 | + axis=2, | ||
345 | + begin=0, | ||
346 | + end=self.max_seq_length), | ||
347 | + dim=1) | ||
348 | + | ||
349 | + grams = F.transpose(grams, (0, 2, 3, 1)) | ||
350 | + grams = F.reshape(grams, (-1, self.max_seq_length, -3)) | ||
351 | + grams = self.bi_gru(grams) | ||
352 | + fc1 = self.dense_sh(grams) | ||
353 | + return (self.dense(fc1)) | ||
354 | + | ||
355 | + | ||
356 | +def y_encoding(n_grams, maxlen=200): | ||
357 | +    # encode ground-truth space labels from the input sentences | ||
358 | + init_mat = np.zeros(shape=(len(n_grams), maxlen), dtype=np.int8) | ||
359 | + for i in range(len(n_grams)): | ||
360 | + init_mat[i, np.cumsum([len(j) for j in n_grams[i]]) - 1] = 1 | ||
361 | + return init_mat | ||
362 | + | ||
363 | + | ||
364 | +def split_train_set(x_train, p=0.98): | ||
365 | + """ | ||
366 | + > split_train_set(pd.DataFrame({'a':[1,2,3,4,None], 'b':[5,6,7,8,9]})) | ||
367 | + (array([0, 4, 3]), [1, 2]) | ||
368 | + """ | ||
369 | + import numpy as np | ||
370 | + train_idx = np.random.choice(range(x_train.shape[0]), | ||
371 | + int(x_train.shape[0] * p), | ||
372 | + replace=False) | ||
373 | + set_tr_idx = set(train_idx) | ||
374 | + test_index = [i for i in range(x_train.shape[0]) if i not in set_tr_idx] | ||
375 | + return ((train_idx, np.array(test_index))) | ||
376 | + | ||
377 | + | ||
378 | +def get_generator(x, y, batch_size): | ||
379 | + tr_set = gluon.data.ArrayDataset(x, y.astype('float32')) | ||
380 | + tr_data_iterator = gluon.data.DataLoader(tr_set, | ||
381 | + batch_size=batch_size, | ||
382 | + shuffle=True, | ||
383 | + num_workers=opt.n_workers) | ||
384 | + return (tr_data_iterator) | ||
385 | + | ||
386 | + | ||
387 | +def pick_model(model_nm, n_hidden, vocab_size, embed_dim, max_seq_length): | ||
388 | + if model_nm.lower() == 'kospacing': | ||
389 | + model = korean_autospacing_base(n_hidden=n_hidden, | ||
390 | + vocab_size=vocab_size, | ||
391 | + embed_dim=embed_dim, | ||
392 | + max_seq_length=max_seq_length) | ||
393 | + elif model_nm.lower() == 'kospacing2': | ||
394 | + model = korean_autospacing2(n_hidden=n_hidden, | ||
395 | + vocab_size=vocab_size, | ||
396 | + embed_dim=embed_dim, | ||
397 | + max_seq_length=max_seq_length) | ||
398 | + else: | ||
399 | + assert False | ||
400 | + return model | ||
401 | + | ||
402 | + | ||
403 | +def model_init(n_hidden, vocab_size, embed_dim, max_seq_length, ctx): | ||
404 | +    # create the model instance and define the trainer and loss | ||
405 | + # n_hidden, vocab_size, embed_dim, max_seq_length | ||
406 | + model = pick_model(opt.model_type, n_hidden, vocab_size, embed_dim, max_seq_length) | ||
407 | + model.collect_params().initialize(mx.init.Xavier(), ctx=ctx) | ||
408 | + model.embedding.weight.set_data(weights) | ||
409 | + model.hybridize(static_alloc=True) | ||
410 | +    # freeze the embedding weights | ||
411 | + model.embedding.collect_params().setattr('grad_req', 'null') | ||
412 | + trainer = gluon.Trainer(model.collect_params(), 'rmsprop') | ||
413 | + loss = gluon.loss.SigmoidBinaryCrossEntropyLoss(from_sigmoid=True) | ||
414 | + loss.hybridize(static_alloc=True) | ||
415 | + return (model, loss, trainer) | ||
416 | + | ||
417 | + | ||
418 | +def evaluate_accuracy(data_iterator, net, pad_idx, ctx, n=5000): | ||
419 | +    # iterate over each sequence up to its true length and measure accuracy | ||
420 | +    # not optimized | ||
421 | + acc = mx.metric.Accuracy(axis=0) | ||
422 | + num_of_test = 0 | ||
423 | + for i, (data, label) in enumerate(data_iterator): | ||
424 | + data = data.as_in_context(ctx) | ||
425 | + label = label.as_in_context(ctx) | ||
426 | + # get sentence length | ||
427 | + data_np = data.asnumpy() | ||
428 | + lengths = np.argmax(np.where(data_np == pad_idx, np.ones_like(data_np), | ||
429 | + np.zeros_like(data_np)), | ||
430 | + axis=1) | ||
431 | + output = net(data) | ||
432 | + pred_label = output.squeeze(axis=2) > 0.5 | ||
433 | + | ||
434 | + for i in range(data.shape[0]): | ||
435 | + num_of_test += data.shape[0] | ||
436 | + acc.update(preds=pred_label[i, :lengths[i]], | ||
437 | + labels=label[i, :lengths[i]]) | ||
438 | + if num_of_test > n: | ||
439 | + break | ||
440 | + return acc.get()[1] | ||
441 | + | ||
442 | + | ||
443 | +def train(epochs, | ||
444 | + tr_data_iterator, | ||
445 | + te_data_iterator, | ||
446 | + va_data_iterator, | ||
447 | + model, | ||
448 | + loss, | ||
449 | + trainer, | ||
450 | + pad_idx, | ||
451 | + ctx, | ||
452 | + mdl_desc="spacing_model", | ||
453 | + decay=False): | ||
454 | +    # training loop | ||
455 | + tot_test_acc = [] | ||
456 | + tot_train_loss = [] | ||
457 | + for e in range(epochs): | ||
458 | + tic = time.time() | ||
459 | + # Decay learning rate. | ||
460 | + if e > 1 and decay: | ||
461 | + trainer.set_learning_rate(trainer.learning_rate * 0.7) | ||
462 | + train_loss = [] | ||
463 | + iter_tqdm = tqdm(tr_data_iterator, 'Batches') | ||
464 | + for i, (x_data, y_data) in enumerate(iter_tqdm): | ||
465 | + x_data_l = gluon.utils.split_and_load(x_data, | ||
466 | + ctx, | ||
467 | + even_split=False) | ||
468 | + y_data_l = gluon.utils.split_and_load(y_data, | ||
469 | + ctx, | ||
470 | + even_split=False) | ||
471 | + | ||
472 | + with autograd.record(): | ||
473 | + losses = [ | ||
474 | + loss(model(x), y) for x, y in zip(x_data_l, y_data_l) | ||
475 | + ] | ||
476 | + for l in losses: | ||
477 | + l.backward() | ||
478 | + trainer.step(x_data.shape[0]) | ||
479 | + curr_loss = np.mean([mx.nd.mean(l).asscalar() for l in losses]) | ||
480 | + train_loss.append(curr_loss) | ||
481 | + iter_tqdm.set_description("loss {}".format(curr_loss)) | ||
482 | + mx.nd.waitall() | ||
483 | + | ||
484 | +        # calculate test/validation accuracy | ||
485 | + test_acc = evaluate_accuracy( | ||
486 | + te_data_iterator, | ||
487 | + model, | ||
488 | + pad_idx, | ||
489 | + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0)) | ||
490 | + valid_acc = evaluate_accuracy( | ||
491 | + va_data_iterator, | ||
492 | + model, | ||
493 | + pad_idx, | ||
494 | + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0)) | ||
495 | + logger.info('[Epoch %d] time cost: %f' % (e, time.time() - tic)) | ||
496 | + logger.info("[Epoch %d] Train Loss: %f, Test acc : %f Valid acc : %f" % | ||
497 | + (e, np.mean(train_loss), test_acc, valid_acc)) | ||
498 | + tot_test_acc.append(test_acc) | ||
499 | + tot_train_loss.append(np.mean(train_loss)) | ||
500 | + model.save_parameters(opt.outputs + '/' + "{}_{}.params".format(mdl_desc, e)) | ||
501 | + return (tot_test_acc, tot_train_loss) | ||
502 | + | ||
503 | + | ||
504 | +def pre_processing(sentences): | ||
505 | +    # spaces are marked with ^ | ||
506 | +    char_list = [li.strip().replace(' ', '^') for li in sentences] | ||
507 | +    # sentence start marker « | ||
508 | +    # sentence end marker » | ||
509 | +    char_list = ["«" + li + "»" for li in char_list] | ||
510 | +    # sentence -> character string | ||
511 | +    char_list = [''.join(list(li)) for li in char_list] | ||
512 | +    return char_list | ||
513 | + | ||
514 | + | ||
515 | +def make_input_data(inputs, | ||
516 | + train_ratio, | ||
517 | + sampling, | ||
518 | + make_lag_set=False, | ||
519 | + batch_size=200): | ||
520 | + with bz2.open(inputs, 'rt') as f: | ||
521 | + line_list = [i.strip() for i in f.readlines() if i.strip() != ''] | ||
522 | + logger.info('complete loading train file!') | ||
523 | + | ||
524 | + # 아버지가 방에 들어가신다. -> '«아버지가^방에^들어가신다.»' | ||
525 | + processed_seq = pre_processing(line_list) | ||
526 | + logger.info(processed_seq[0]) | ||
527 | + # n percent random sample | ||
528 | + logger.info('random sampling on training set!') | ||
529 | + samp_idx = np.random.choice(range(len(processed_seq)), | ||
530 | + int(len(processed_seq) * sampling), | ||
531 | + replace=False) | ||
532 | + processed_seq_samp = [processed_seq[i] for i in samp_idx] | ||
533 | + sp_sents = [i.split('^') for i in processed_seq_samp] | ||
534 | + | ||
535 | + sp_sents = list(filter(lambda x: len(x) >= 8, sp_sents)) | ||
536 | + | ||
537 | +    # build training samples of up to 8 어절 (word units), shifted by one 어절 at a time | ||
538 | + if make_lag_set is True: | ||
539 | + n_gram = [[k, v, z, a, c, d, e, f] | ||
540 | + for sent in sp_sents for k, v, z, a, c, d, e, f in zip( | ||
541 | + sent, sent[1:], sent[2:], sent[3:], sent[4:], sent[5:], | ||
542 | + sent[6:], sent[7:])] | ||
543 | + else: | ||
544 | + n_gram = sp_sents | ||
545 | +    # keep only samples of at most max_seq_len (200) characters | ||
546 | + n_gram = [i for i in n_gram if len("^".join(i)) <= opt.max_seq_len] | ||
547 | +    # encode the y labels | ||
548 | + n_gram_y = y_encoding(n_gram, opt.max_seq_len) | ||
549 | + logger.info(n_gram[0]) | ||
550 | + logger.info(n_gram_y[0]) | ||
551 | +    # load the vocab file | ||
552 | + w2idx, _ = load_vocab(opt.vocab_file) | ||
553 | + | ||
554 | +    # strip spaces and encode characters as indices to build the training set | ||
555 | +    logger.info('index encoding!') | ||
556 | + ngram_coding_seq = encoding_and_padding( | ||
557 | + word2idx_dic=w2idx, | ||
558 | + sequences=[''.join(gram) for gram in n_gram], | ||
559 | + maxlen=opt.max_seq_len, | ||
560 | + padding='post', | ||
561 | + truncating='post') | ||
562 | + logger.info(ngram_coding_seq[0]) | ||
563 | + if train_ratio < 1: | ||
564 | +        # split into train and test sets | ||
565 | + tr_idx, te_idx = split_train_set(ngram_coding_seq, train_ratio) | ||
566 | + | ||
567 | + y_train = n_gram_y[tr_idx, ] | ||
568 | + x_train = ngram_coding_seq[tr_idx, ] | ||
569 | + | ||
570 | + y_test = n_gram_y[te_idx, ] | ||
571 | + x_test = ngram_coding_seq[te_idx, ] | ||
572 | + | ||
573 | + # train generator | ||
574 | + train_generator = get_generator(x_train, y_train, batch_size) | ||
575 | + valid_generator = get_generator(x_test, y_test, 500) | ||
576 | + return (train_generator, valid_generator) | ||
577 | + else: | ||
578 | + train_generator = get_generator(ngram_coding_seq, n_gram_y, batch_size) | ||
579 | + return (train_generator) | ||
580 | + | ||
581 | + | ||
582 | +if opt.train: | ||
583 | +    # load the vocabulary file | ||
584 | +    w2idx, idx2w = load_vocab(opt.vocab_file) | ||
585 | +    # load the embedding file | ||
586 | + weights = load_embedding(opt.embedding_file) | ||
587 | + vocab_size = weights.shape[0] | ||
588 | + embed_dim = weights.shape[1] | ||
589 | + | ||
590 | + train_generator, valid_generator = make_input_data( | ||
591 | + opt.train_data, | ||
592 | + train_ratio=0.95, | ||
593 | + sampling=opt.train_samp_ratio, | ||
594 | + make_lag_set=True, | ||
595 | + batch_size=opt.batch_size) | ||
596 | + | ||
597 | + test_generator = make_input_data(opt.test_data, | ||
598 | + sampling=1, | ||
599 | + train_ratio=1, | ||
600 | + make_lag_set=True, | ||
601 | + batch_size=opt.test_batch_size) | ||
602 | + | ||
603 | + model, loss, trainer = model_init(n_hidden=opt.n_hidden, | ||
604 | + vocab_size=vocab_size, | ||
605 | + embed_dim=embed_dim, | ||
606 | + max_seq_length=opt.max_seq_len, | ||
607 | + ctx=ctx) | ||
608 | + logger.info('start training!') | ||
609 | + train(epochs=opt.num_epoch, | ||
610 | + tr_data_iterator=train_generator, | ||
611 | + te_data_iterator=test_generator, | ||
612 | + va_data_iterator=valid_generator, | ||
613 | + model=model, | ||
614 | + loss=loss, | ||
615 | + trainer=trainer, | ||
616 | + pad_idx=w2idx['__PAD__'], | ||
617 | + ctx=ctx, | ||
618 | + mdl_desc=opt.model_prefix) | ||
619 | + | ||
620 | + | ||
621 | +class pred_spacing: | ||
622 | + def __init__(self, model, w2idx): | ||
623 | + self.model = model | ||
624 | + self.w2idx = w2idx | ||
625 | + self.pattern = re.compile(r'\s+') | ||
626 | + | ||
627 | + @lru_cache(maxsize=None) | ||
628 | + def get_spaced_sent(self, raw_sent): | ||
629 | + raw_sent_ = "«" + raw_sent + "»" | ||
630 | + raw_sent_ = raw_sent_.replace(' ', '^') | ||
631 | + sents_in = [ | ||
632 | + raw_sent_, | ||
633 | + ] | ||
634 | + mat_in = encoding_and_padding(word2idx_dic=self.w2idx, | ||
635 | + sequences=sents_in, | ||
636 | + maxlen=opt.max_seq_len, | ||
637 | + padding='post', | ||
638 | + truncating='post') | ||
639 | + mat_in = mx.nd.array(mat_in, ctx=mx.cpu(0)) | ||
640 | + results = self.model(mat_in) | ||
641 | + mat_set = results[0, ] | ||
642 | + | ||
643 | + r = 255 | ||
644 | + c = 1 / np.log(1+r) | ||
645 | + log_scaled = c * mx.nd.log(1 + r * mat_set[:len(raw_sent_)]) | ||
646 | + #print(log_scaled) | ||
647 | + d_2 = [1] | ||
648 | + for i in range(1,len(raw_sent_)): | ||
649 | + d_2.append(mat_set[i-1] - (2 * mat_set[i]) + mat_set[i+1]) | ||
650 | + #print(d_2) | ||
651 | + preds = np.array( | ||
652 | + ['1' if log_scaled[i] > 0.01 and d_2[i] < 0 else '0' for i in range(len(raw_sent_))]) | ||
653 | + print(mat_set[:len(raw_sent_)]) | ||
654 | + # #saveresult | ||
655 | + | ||
656 | + | ||
657 | + # wr.writerow([raw_sent_, temp]) | ||
658 | + # f.close | ||
659 | + return self.make_pred_sents(raw_sent_, preds) | ||
660 | + | ||
661 | + def make_pred_sents(self, x_sents, y_pred): | ||
662 | + res_sent = [] | ||
663 | + for i, j in zip(x_sents, y_pred): | ||
664 | + if j == '1': | ||
665 | + res_sent.append(i) | ||
666 | + res_sent.append(' ') | ||
667 | + else: | ||
668 | + res_sent.append(i) | ||
669 | + subs = re.sub(self.pattern, ' ', ''.join(res_sent).replace('^', ' ')) | ||
670 | + subs = subs.replace('«', '') | ||
671 | + subs = subs.replace('»', '') | ||
672 | + return subs | ||
673 | + | ||
674 | +if not opt.train and not opt.test: | ||
675 | +    # load the vocabulary file | ||
676 | +    w2idx, idx2w = load_vocab(opt.vocab_file) | ||
677 | +    # load the embedding file | ||
678 | + weights = load_embedding(opt.embedding_file) | ||
679 | + vocab_size = weights.shape[0] | ||
680 | + embed_dim = weights.shape[1] | ||
681 | + model = pick_model(opt.model_type, opt.n_hidden, vocab_size, embed_dim, opt.max_seq_len) | ||
682 | + | ||
683 | + # model.collect_params().initialize(mx.init.Xavier(), ctx=mx.cpu(0)) | ||
684 | + # model.embedding.weight.set_data(weights) | ||
685 | + model.load_parameters(opt.model_params, ctx=mx.cpu(0)) | ||
686 | + predictor = pred_spacing(model, w2idx) | ||
687 | + | ||
688 | + # datafile = open('./data/removed.txt', 'r', encoding='utf-8') | ||
689 | + # lines = datafile.readlines() | ||
690 | + # total = len(lines) | ||
691 | + # cnt = 1 | ||
692 | + # for line in lines[:50000]: | ||
693 | + # print() | ||
694 | + # print('#' * 30) | ||
695 | + # print(cnt, ' / ', total) | ||
696 | + # print('#' * 30) | ||
697 | + # predictor.get_spaced_sent(line) | ||
698 | + # cnt += 1 | ||
699 | + | ||
700 | + | ||
701 | + | ||
702 | + while 1: | ||
703 | + sent = input("sent > ") | ||
704 | + print(sent) | ||
705 | + start = timer() | ||
706 | + spaced = predictor.get_spaced_sent(sent) | ||
707 | + end = timer() | ||
708 | + print("spaced sent[{:03.2f}sec/sent] > {}".format(end - start, spaced)) | ||
709 | + | ||
710 | +if not opt.train and opt.test: | ||
711 | + logger.info("calculate accuracy!") | ||
712 | +    # load the vocabulary file | ||
713 | +    w2idx, idx2w = load_vocab(opt.vocab_file) | ||
714 | +    # load the embedding file | ||
715 | + weights = load_embedding(opt.embedding_file) | ||
716 | + vocab_size = weights.shape[0] | ||
717 | + embed_dim = weights.shape[1] | ||
718 | + | ||
719 | + model = pick_model(opt.model_type, opt.n_hidden, vocab_size, embed_dim, opt.max_seq_len) | ||
720 | + | ||
721 | + # model.initialize(ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0)) | ||
722 | + model.load_parameters(opt.model_params, | ||
723 | + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0)) | ||
724 | + valid_generator = make_input_data(opt.test_data, | ||
725 | + sampling=1, | ||
726 | + train_ratio=1, | ||
727 | + make_lag_set=True, | ||
728 | + batch_size=100) | ||
729 | + valid_acc = evaluate_accuracy( | ||
730 | + valid_generator, | ||
731 | + model, | ||
732 | + w2idx['__PAD__'], | ||
733 | + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0), | ||
734 | + n=30000) | ||
735 | + logger.info('valid accuracy : {}'.format(valid_acc)) |
train/utils/embedding_maker.py
0 → 100644
1 | +__all__ = [ | ||
2 | + 'create_embeddings', 'load_embedding', 'load_vocab', | ||
3 | + 'encoding_and_padding', 'get_embedding_model' | ||
4 | +] | ||
5 | + | ||
6 | +import bz2 | ||
7 | +import json | ||
8 | +import os | ||
9 | + | ||
10 | +import numpy as np | ||
11 | +import pkg_resources | ||
12 | +from gensim.models import FastText | ||
13 | + | ||
14 | +from utils.spacing_utils import sent_to_spacing_chars | ||
15 | +from tqdm import tqdm | ||
16 | +from utils.jamo_utils import jamo_sentence, jamo_to_word | ||
17 | + | ||
18 | +def pad_sequences(sequences, | ||
19 | + maxlen=None, | ||
20 | + dtype='int32', | ||
21 | + padding='pre', | ||
22 | + truncating='pre', | ||
23 | + value=0.): | ||
24 | + | ||
25 | + if not hasattr(sequences, '__len__'): | ||
26 | + raise ValueError('`sequences` must be iterable.') | ||
27 | + lengths = [] | ||
28 | + for x in sequences: | ||
29 | + if not hasattr(x, '__len__'): | ||
30 | + raise ValueError('`sequences` must be a list of iterables. ' | ||
31 | + 'Found non-iterable: ' + str(x)) | ||
32 | + lengths.append(len(x)) | ||
33 | + | ||
34 | + num_samples = len(sequences) | ||
35 | + if maxlen is None: | ||
36 | + maxlen = np.max(lengths) | ||
37 | + | ||
38 | + # take the sample shape from the first non empty sequence | ||
39 | + # checking for consistency in the main loop below. | ||
40 | + sample_shape = tuple() | ||
41 | + for s in sequences: | ||
42 | + if len(s) > 0: | ||
43 | + sample_shape = np.asarray(s).shape[1:] | ||
44 | + break | ||
45 | + | ||
46 | + x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype) | ||
47 | + for idx, s in enumerate(sequences): | ||
48 | + if not len(s): | ||
49 | + continue # empty list/array was found | ||
50 | + if truncating == 'pre': | ||
51 | + trunc = s[-maxlen:] | ||
52 | + elif truncating == 'post': | ||
53 | + trunc = s[:maxlen] | ||
54 | + else: | ||
55 | + raise ValueError('Truncating type "%s" not understood' % | ||
56 | + truncating) | ||
57 | + | ||
58 | + # check `trunc` has expected shape | ||
59 | + trunc = np.asarray(trunc, dtype=dtype) | ||
60 | + if trunc.shape[1:] != sample_shape: | ||
61 | + raise ValueError( | ||
62 | + 'Shape of sample %s of sequence at position %s is different from expected shape %s' | ||
63 | + % (trunc.shape[1:], idx, sample_shape)) | ||
64 | + | ||
65 | + if padding == 'post': | ||
66 | + x[idx, :len(trunc)] = trunc | ||
67 | + elif padding == 'pre': | ||
68 | + x[idx, -len(trunc):] = trunc | ||
69 | + else: | ||
70 | + raise ValueError('Padding type "%s" not understood' % padding) | ||
71 | + return x | ||
72 | + | ||
73 | + | ||
74 | +def create_embeddings(data_dir, | ||
75 | + model_file, | ||
76 | + embeddings_file, | ||
77 | + vocab_file, | ||
78 | + splitc=' ', | ||
79 | + **params): | ||
80 | + """ | ||
81 | + making embedding from files. | ||
82 | + :**params additional Word2Vec() parameters | ||
83 | + :splitc char for splitting in data_dir files | ||
84 | + :model_file output object from Word2Vec() | ||
85 | + :data_dir data dir to be process | ||
86 | + :embeddings_file numpy object file path from Word2Vec() | ||
87 | + :vocab_file item to index json dictionary | ||
88 | + """ | ||
89 | + class SentenceGenerator(object): | ||
90 | + def __init__(self, dirname): | ||
91 | + self.dirname = dirname | ||
92 | + | ||
93 | + def __iter__(self): | ||
94 | + for fname in os.listdir(self.dirname): | ||
95 | + print("processing~ '{}'".format(fname)) | ||
96 | + for line in bz2.open(os.path.join(self.dirname, fname), "rt"): | ||
97 | + yield sent_to_spacing_chars(line.strip()).split(splitc) | ||
98 | + | ||
99 | + sentences = SentenceGenerator(data_dir) | ||
100 | + | ||
101 | +    model = FastText(sentences, **params)  # train FastText on the corpus generator | ||
102 | +    model.save(model_file) | ||
103 | + weights = model.wv.syn0 | ||
104 | + default_vec = np.mean(weights, axis=0, keepdims=True) | ||
105 | + padding_vec = np.zeros((1, weights.shape[1])) | ||
106 | + | ||
107 | + weights_default = np.concatenate([weights, default_vec, padding_vec], | ||
108 | + axis=0) | ||
109 | + | ||
110 | + np.save(open(embeddings_file, 'wb'), weights_default) | ||
111 | + | ||
112 | + vocab = dict([(k, v.index) for k, v in model.wv.vocab.items()]) | ||
113 | + vocab['__PAD__'] = weights_default.shape[0] - 1 | ||
114 | + with open(vocab_file, 'w') as f: | ||
115 | + f.write(json.dumps(vocab)) | ||
116 | + | ||
117 | + | ||
118 | +def load_embedding(embeddings_file): | ||
119 | + return (np.load(embeddings_file)) | ||
120 | + | ||
121 | + | ||
122 | +def load_vocab(vocab_path): | ||
123 | + with open(vocab_path, 'r') as f: | ||
124 | + data = json.loads(f.read()) | ||
125 | + word2idx = data | ||
126 | + idx2word = dict([(v, k) for k, v in data.items()]) | ||
127 | + return word2idx, idx2word | ||
128 | + | ||
129 | +def get_similar_char(word2idx_dic, model, jamo_model, text, try_cnt, OOV_CNT, HIT_CNT): | ||
130 | + OOV_CNT += 1 | ||
131 | + jamo_text = jamo_sentence(text) | ||
132 | +    similar_list = jamo_model.wv.most_similar(jamo_text)[:try_cnt] | ||
133 | +    for char in similar_list: | ||
134 | + result = jamo_to_word(char[0]) | ||
135 | + | ||
136 | + if result in word2idx_dic.keys(): | ||
137 | + # print('#' * 20) | ||
138 | + # print('hit') | ||
139 | + # print('origin: ', text, 'reuslt: ', result) | ||
140 | + HIT_CNT += 1 | ||
141 | + return OOV_CNT, HIT_CNT,result | ||
142 | + | ||
143 | + # print('#' * 20) | ||
144 | + # print('no hit') | ||
145 | + # print('origin: ', text) | ||
146 | + return OOV_CNT, HIT_CNT, model.wv.most_similar(text)[0][0] | ||
147 | + | ||
148 | + | ||
149 | +def encoding_and_padding(word2idx_dic, sequences, **params): | ||
150 | + """ | ||
151 | + 1. making item to idx | ||
152 | + 2. padding | ||
153 | + :word2idx_dic | ||
154 | + :sequences: list of lists where each element is a sequence | ||
155 | + :maxlen: int, maximum length | ||
156 | + :dtype: type to cast the resulting sequence. | ||
157 | + :padding: 'pre' or 'post', pad either before or after each sequence. | ||
158 | + :truncating: 'pre' or 'post', remove values from sequences larger than | ||
159 | + maxlen either in the beginning or in the end of the sequence | ||
160 | + :value: float, value to pad the sequences to the desired value. | ||
161 | + """ | ||
162 | + model_file = 'model/fasttext' | ||
163 | + jamo_model_path = 'jamo_model/fasttext' | ||
164 | + print('seq_idx start') | ||
165 | + model = FastText.load(model_file) | ||
166 | + jamo_model = FastText.load(jamo_model_path) | ||
167 | + seq_idx = [] | ||
168 | + OOV_CNT = 0 | ||
169 | + HIT_CNT = 0 | ||
170 | + TOTAL_CNT = 0 | ||
171 | + | ||
172 | + for word in tqdm(sequences): | ||
173 | + temp = [] | ||
174 | + for char in word: | ||
175 | + TOTAL_CNT += 1 | ||
176 | + if char in word2idx_dic.keys(): | ||
177 | + temp.append(word2idx_dic[char]) | ||
178 | + else: | ||
179 | + OOV_CNT, HIT_CNT, result = get_similar_char(word2idx_dic, model, jamo_model, char, 3, OOV_CNT, HIT_CNT) | ||
180 | + temp.append(word2idx_dic[result]) | ||
181 | + seq_idx.append(temp) | ||
182 | + print('TOTAL CNT: ', TOTAL_CNT, 'OOV CNT: ', OOV_CNT, 'HIT_CNT: ', HIT_CNT) | ||
183 | + if OOV_CNT > 0 and HIT_CNT > 0: | ||
184 | + print('OOV RATE:', float(OOV_CNT) / TOTAL_CNT * 100, '%' ,'HIT_RATE: ', float(HIT_CNT) / float(OOV_CNT) * 100, '%') | ||
185 | + | ||
186 | + params['value'] = word2idx_dic['__PAD__'] | ||
187 | + return (pad_sequences(seq_idx, **params)) | ||
188 | + | ||
189 | + | ||
190 | +def get_embedding_model(name='fee_prods', path='data/embedding'): | ||
191 | + weights = pkg_resources.resource_filename( | ||
192 | + 'dsc', os.path.join(path, name, 'weights.np')) | ||
193 | + w2idx = pkg_resources.resource_filename( | ||
194 | + 'dsc', os.path.join(path, name, 'idx.json')) | ||
195 | + return ((load_embedding(weights), load_vocab(w2idx)[0])) |
train/utils/jamo_utils.py
0 → 100644
1 | +import re | ||
2 | +from soynlp.hangle import compose, decompose, character_is_korean | ||
3 | + | ||
4 | + | ||
5 | +doublespace_pattern = re.compile(r'\s+') | ||
6 | + | ||
7 | +def jamo_sentence(sent): | ||
8 | + def transform(char): | ||
9 | + if char == ' ': | ||
10 | + return char | ||
11 | + | ||
12 | + cjj = decompose(char) | ||
13 | + if len(cjj) == 1: | ||
14 | + return cjj | ||
15 | + | ||
16 | + cjj_ = ''.join(c if c != ' ' else '-' for c in cjj) | ||
17 | + return cjj_ | ||
18 | + | ||
19 | + sent_ = [] | ||
20 | + for char in sent: | ||
21 | + if character_is_korean(char): | ||
22 | + sent_.append(transform(char)) | ||
23 | + else: | ||
24 | + sent_.append(char) | ||
25 | + sent_ = doublespace_pattern.sub(' ', ''.join(sent_)) | ||
26 | + return sent_ | ||
27 | + | ||
28 | +def jamo_to_word(jamo): | ||
29 | + jamo_list, idx = [], 0 | ||
30 | + | ||
31 | + while idx < len(jamo): | ||
32 | + if not character_is_korean(jamo[idx]): | ||
33 | + jamo_list.append(jamo[idx]) | ||
34 | + idx += 1 | ||
35 | + else: | ||
36 | + jamo_list.append(jamo[idx:idx + 3]) | ||
37 | + idx += 3 | ||
38 | + | ||
39 | + word = "" | ||
40 | + for jamo_char in jamo_list: | ||
41 | + if len(jamo_char) == 1: | ||
42 | + word += jamo_char | ||
43 | + elif jamo_char[2] == "-": | ||
44 | + word += compose(jamo_char[0], jamo_char[1], " ") | ||
45 | + else: word += compose(jamo_char[0], jamo_char[1], jamo_char[2]) | ||
46 | + | ||
47 | + return word | ||
48 | + | ||
49 | +def break_char (jamo_sentence): | ||
50 | + idx = 0 | ||
51 | + corpus = [] | ||
52 | + | ||
53 | + while idx < len(jamo_sentence): | ||
54 | + if not character_is_korean(jamo_sentence[idx]): | ||
55 | + corpus.append(jamo_sentence[idx]) | ||
56 | + idx += 1 | ||
57 | + else: | ||
58 | + corpus.append(jamo_sentence[idx : idx+3]) | ||
59 | + idx += 3 | ||
60 | + return corpus | ||
\ No newline at end of file
train/utils/spacing_utils.py
0 → 100644
1 | +# coding=utf-8 | ||
2 | +# Copyright 2020 Heewon Jeon. All rights reserved. | ||
3 | +# | ||
4 | +# Licensed under the Apache License, Version 2.0 (the "License"); | ||
5 | +# you may not use this file except in compliance with the License. | ||
6 | +# You may obtain a copy of the License at | ||
7 | +# | ||
8 | +# http://www.apache.org/licenses/LICENSE-2.0 | ||
9 | +# | ||
10 | +# Unless required by applicable law or agreed to in writing, software | ||
11 | +# distributed under the License is distributed on an "AS IS" BASIS, | ||
12 | +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
13 | +# See the License for the specific language governing permissions and | ||
14 | +# limitations under the License. | ||
15 | + | ||
16 | +def sent_to_spacing_chars(sent): | ||
17 | +    # spaces are marked with ^ | ||
18 | + chars = sent.strip().replace(' ', '^') | ||
19 | + # char_list = [li.strip().replace(' ', '^') for li in sents] | ||
20 | + | ||
21 | +    # sentence start marker « | ||
22 | +    # sentence end marker » | ||
23 | + tagged_chars = "«" + chars + "»" | ||
24 | + # char_list = [ "«" + li + "»" for li in char_list] | ||
25 | + | ||
26 | +    # sentence -> space-separated character string | ||
27 | + char_list = ' '.join(list(tagged_chars)) | ||
28 | + # char_list = [ ' '.join(list(li)) for li in char_list] | ||
29 | + return(char_list) |