submit train init

yomapi
Commit 48da6e7d8642425195a2c89705e430501dfd469c 48da6e7d 1 parent a0839ebc
Showing 19 changed files with 1463 additions and 0 deletions
README.md
img/2-stage-FastText.png
img/Thresholding_result.png
img/kosapcing_img.png
img/probability_distribution_of_output_vector.png
train/LICENSE
train/data/example.txt.bz2
train/embedding.py
train/jamo_model/.gitignore
train/model/.gitignore
train/output/.gitignore
train/requirements.txt
train/train.py
train/utils/__pycache__/embedding_maker.cpython-37.pyc
train/utils/__pycache__/jamo_utils.cpython-37.pyc
train/utils/__pycache__/spacing_utils.cpython-37.pyc
train/utils/embedding_maker.py
train/utils/jamo_utils.py
train/utils/spacing_utils.py
--- a/README.md
+++ b/README.md
+ # ML-based Spacing Corrector
+ This model is an improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing") that uses FastText instead of Word2Vec.
+ 
+ ## Performance
+ | Model | Test Accuracy (%) | Encoding Time Cost |
+ | :------------: | :------------: | :------------: |
+ | TrainKoSpacing | 96.6147 | 02m 23s |
+ | Jamo-decomposed (자모분해) FastText | 98.9915 | 08h 20m 11s |
+ | 2-Stage FastText | 99.0888 | 03m 23s |
+ 
+ ## Data
+ #### Corpus
+ 
+ We mainly use the National Institute of Korean Language 모두의 말뭉치 (Modu Corpus) and the National Information Society Agency AI-Hub data. Because of license restrictions, we cannot redistribute these datasets. You should be able to obtain them through the links below:
+ - [National Institute of Korean Language 모두의 말뭉치](https://corpus.korean.go.kr/)
+ - [National Information Society Agency AI-Hub](https://aihub.or.kr/aihub-data/natural-language/about "National Information Society Agency AI-Hub")
+ 
+ #### Data format
+ A bzip2-compressed file with one sentence per line.
+ 
+ ```
+ ~/KoSpacing/data$ bzcat train.txt.bz2 | head
+ 엠마누엘 웅가로 / 의상서 실내 장식품으로… 디자인 세계 넓혀
+ 프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.
+ 웅가로는 침실과 식당, 욕실에서 사용하는 갖가지 직물제품을 디자인해 최근 파리의 갤러리 라파예트백화점에서 '색의 컬렉션'이라는 이름으로 전시회를 열었다.
+ ```
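
A minimal sketch of reading this format with Python's standard `bz2` module. The helper name `read_sentences` is an illustrative assumption and is not part of the repository.

```python
import bz2


def read_sentences(path):
    """Yield non-empty sentences from a bzip2 file with one sentence per line."""
    # Text mode ("rt") decompresses and decodes UTF-8 in one step.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if sentence:
                yield sentence
```

Because the function is a generator, a large corpus can be streamed without loading the whole file into memory.
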
+ 
+ 
+ ## Architecture
+ 
+ ### Model
+ ![kosapcing_img](img/kosapcing_img.png)
+ 
+ ### Word Embedding
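+ #### Jamo decomposition (자모분해)
+ To capture Korean characters with similar shapes, use a FastText word embedding trained on jamo-decomposed (자모분해) text. Each syllable is split into its initial consonant, vowel, and final consonant, and `–` marks an empty final consonant.
+ 
+ Example: 자연어처리 → ㅈ ㅏ – ㅇ ㅕ ㄴ ㅇ ㅓ – ㅊ ㅓ – ㄹ ㅣ –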
+ 
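
The decomposition can be reproduced with plain Unicode arithmetic, since precomposed Hangul syllables are laid out as 19 initials × 21 vowels × 28 finals starting at U+AC00. The sketch below is an illustration only: the function name `decompose` and the ASCII `-` placeholder are assumptions, and it is not the repository's `jamo_utils.py`.

```python
# Compatibility jamo, listed in the order used by precomposed Hangul syllables.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ",
             "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ",
             "ㅋ", "ㅌ", "ㅍ", "ㅎ"]

HANGUL_BASE = 0xAC00  # '가'
HANGUL_LAST = 0xD7A3  # '힣'


def decompose(text, empty="-"):
    """Split each Hangul syllable into initial, vowel, and final jamo."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            cho, rest = divmod(code - HANGUL_BASE, 21 * 28)
            jung, jong = divmod(rest, 28)
            # An empty final consonant is replaced by the placeholder.
            out.append(CHOSEONG[cho] + JUNGSEONG[jung] + (JONGSEONG[jong] or empty))
        else:
            out.append(ch)  # non-Hangul characters pass through unchanged
    return "".join(out)
```

Every syllable maps to exactly three symbols, so the decomposed text is three times as long as the original, which is consistent with the much longer encoding time reported for this variant.
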
+ #### 2-stage FastText
+ Because jamo decomposition (자모분해) is slow, the jamo-decomposed FastText embedding is used only for out-of-vocabulary characters.
+ ![2-stage-FastText_img](img/2-stage-FastText.png)
+ 
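
The two-stage idea amounts to a lookup with a fallback. The sketch below uses toy stand-ins: `two_stage_vector`, `char_vectors`, and `jamo_embed` are hypothetical names chosen for illustration, not the repository's API.

```python
def two_stage_vector(char, char_vectors, jamo_embed):
    """Return the character-level vector if known, else fall back to the jamo model."""
    vector = char_vectors.get(char)
    if vector is not None:
        return vector        # stage 1: cheap table lookup for in-vocabulary characters
    return jamo_embed(char)  # stage 2: slower jamo-based embedding, OOV characters only


# Toy stand-ins for the two embedding models (assumed data, for illustration).
char_vectors = {"가": [0.1, 0.2], "나": [0.3, 0.4]}
jamo_calls = []


def jamo_embed(char):
    jamo_calls.append(char)  # record how often the slow path is taken
    return [0.0, 0.0]
```

Only characters missing from the first table pay the cost of the slower path, which is the motivation stated above.
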
+ ### Thresholding
+ Thresholding is applied because the middle part of the output probability distribution is evenly distributed.
+ ![probability_distribution_of_output_vector](img/probability_distribution_of_output_vector.png)
+ 
+ The threshold is determined using a log transform and the second derivative.
+ 
+ Result:
+ ![Thresholding_result](img/Thresholding_result.png)
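
The README does not spell out how the log transform and the second derivative are combined. One possible reading, sketched here purely as an assumption, is to sort the output probabilities, take their logarithm, and pick the point where the discrete second derivative is largest in magnitude. The name `elbow_threshold` and the `eps` parameter are hypothetical.

```python
import math


def elbow_threshold(probs, eps=1e-6):
    """Pick the sorted probability where the log-transformed curve bends most sharply."""
    xs = sorted(probs)
    if len(xs) < 3:
        raise ValueError("need at least three probabilities")
    logs = [math.log(p + eps) for p in xs]  # eps avoids log(0)
    best_index, best_curvature = 1, -1.0
    for i in range(1, len(logs) - 1):
        # Discrete second derivative of the log-transformed curve.
        curvature = abs(logs[i + 1] - 2 * logs[i] + logs[i - 1])
        if curvature > best_curvature:
            best_index, best_curvature = i, curvature
    return xs[best_index]
```

Whether this matches the procedure behind the figure above is not confirmed by the source; it only illustrates the general technique.
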
+ 
+ 
+ 
+ ## How to Run
+ 
+ 
+ ### Installation
+ 
+ - For training, a GPU is strongly recommended for speed. CPU is supported, but training could be extremely slow.
+ - Only Python 3.7 or later is supported.
+ 
+ ### Requirements
+ 
+ - Python (>= 3.7)
+ - MXNet (>= 1.6.0)
+ - tqdm (>= 4.19.5)
+ - Pandas (>= 0.22.0)
+ - Gensim (>= 3.8.1)
+ - GluonNLP (>= 0.9.1)
+ - soynlp (>= 0.0.493)
+ 
+ ### Dependencies
+ 
+ ```bash
+ pip install -r requirements.txt
+ ```
+ 
+ ### Training
+ 
+ ```bash
+ python train.py --train --train-samp-ratio 1.0 --num-epoch 50 --train_data data/train.txt.bz2 --test_data data/test.txt.bz2 --outputs train_log_to --model_type kospacing --model-file fasttext
+ ```
+ 
+ ### Evaluation
+ 
+ ```bash
+ python train.py --model-params model/kospacing.params --model_type kospacing
+ sent > 중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
+ 중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
+ spaced sent[0.12sec/sent]  > 중국은 2018년 평창동계올림픽의 반환점에 이르기까지 아직 노골드 행진이다.  
+ ```
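
The model ends in a one-unit sigmoid layer per time step, so it produces one probability per input character. A minimal sketch of turning those probabilities into spaced text follows. It assumes that a high probability means "insert a space after this character" and that the cutoff is 0.5; both are assumptions, and `apply_spacing` is a hypothetical helper.

```python
def apply_spacing(chars, probs, threshold=0.5):
    """Insert a space after each character whose probability exceeds the threshold."""
    out = []
    for ch, p in zip(chars, probs):
        out.append(ch)
        if p > threshold:
            out.append(" ")
    # A predicted space after the final character is dropped.
    return "".join(out).strip()
```

For example, `apply_spacing("아직노골드", [0.1, 0.9, 0.2, 0.1, 0.3])` splits the string after the second character.
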
+ 
+ ### Directory
+ Directory guide for the embedding model files. Entries in **bold** are required.
+ 
+ - model
+ 	- **fasttext**
+ 	- fasttext_vis
+ 	- **fasttext.trainables.vectors_ngrams_lockf.npy**
+ 	- **fasttext.wv.vectors_ngrams.npy**
+ 	- **kospacing_wv.np**
+ 	- **w2idx.dic**
+ 
+ - jamo_model
+ 	- **fasttext**
+ 	- fasttext_vis
+ 	- **fasttext.trainables.vectors_ngrams_lockf.npy**
+ 	- **fasttext.wv.vectors_ngrams.npy**
+ 	- **kospacing_wv.np**
+ 	- **w2idx.dic**
+ 
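
Before training or evaluation, the layout above can be checked with a few lines of standard-library Python. This is a sketch: `missing_files` is a hypothetical helper, and the file list simply mirrors the bold entries above.

```python
from pathlib import Path

# Required entries, copied from the bold items in the directory guide.
REQUIRED = [
    "fasttext",
    "fasttext.trainables.vectors_ngrams_lockf.npy",
    "fasttext.wv.vectors_ngrams.npy",
    "kospacing_wv.np",
    "w2idx.dic",
]


def missing_files(root, dirs=("model", "jamo_model")):
    """Return the relative paths of required files that do not exist under root."""
    root = Path(root)
    return [
        (Path(d) / name).as_posix()
        for d in dirs
        for name in REQUIRED
        if not (root / d / name).exists()
    ]
```

An empty list means both `model` and `jamo_model` contain every required file.
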
+ ### References
+ - TrainKoSpacing: https://github.com/haven-jeon/TrainKoSpacing
+ - Introduction to Natural Language Processing Using Deep Learning (딥 러닝을 이용한 자연어 처리 입문): https://wikidocs.net/book/2155
+ 
--- a/img/2-stage-FastText.png 0 → 100644
+++ b/img/2-stage-FastText.png 0 → 100644
--- a/img/Thresholding_result.png 0 → 100644
+++ b/img/Thresholding_result.png 0 → 100644
--- a/img/kosapcing_img.png 0 → 100644
+++ b/img/kosapcing_img.png 0 → 100644
--- a/img/probability_distribution_of_output_vector.png 0 → 100644
+++ b/img/probability_distribution_of_output_vector.png 0 → 100644
--- a/train/LICENSE 0 → 100644
+++ b/train/LICENSE 0 → 100644
+                                  Apache License
+                            Version 2.0, January 2004
+                         http://www.apache.org/licenses/
+ 
+    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+ 
+    1. Definitions.
+ 
+       "License" shall mean the terms and conditions for use, reproduction,
+       and distribution as defined by Sections 1 through 9 of this document.
+ 
+       "Licensor" shall mean the copyright owner or entity authorized by
+       the copyright owner that is granting the License.
+ 
+       "Legal Entity" shall mean the union of the acting entity and all
+       other entities that control, are controlled by, or are under common
+       control with that entity. For the purposes of this definition,
+       "control" means (i) the power, direct or indirect, to cause the
+       direction or management of such entity, whether by contract or
+       otherwise, or (ii) ownership of fifty percent (50%) or more of the
+       outstanding shares, or (iii) beneficial ownership of such entity.
+ 
+       "You" (or "Your") shall mean an individual or Legal Entity
+       exercising permissions granted by this License.
+ 
+       "Source" form shall mean the preferred form for making modifications,
+       including but not limited to software source code, documentation
+       source, and configuration files.
+ 
+       "Object" form shall mean any form resulting from mechanical
+       transformation or translation of a Source form, including but
+       not limited to compiled object code, generated documentation,
+       and conversions to other media types.
+ 
+       "Work" shall mean the work of authorship, whether in Source or
+       Object form, made available under the License, as indicated by a
+       copyright notice that is included in or attached to the work
+       (an example is provided in the Appendix below).
+ 
+       "Derivative Works" shall mean any work, whether in Source or Object
+       form, that is based on (or derived from) the Work and for which the
+       editorial revisions, annotations, elaborations, or other modifications
+       represent, as a whole, an original work of authorship. For the purposes
+       of this License, Derivative Works shall not include works that remain
+       separable from, or merely link (or bind by name) to the interfaces of,
+       the Work and Derivative Works thereof.
+ 
+       "Contribution" shall mean any work of authorship, including
+       the original version of the Work and any modifications or additions
+       to that Work or Derivative Works thereof, that is intentionally
+       submitted to Licensor for inclusion in the Work by the copyright owner
+       or by an individual or Legal Entity authorized to submit on behalf of
+       the copyright owner. For the purposes of this definition, "submitted"
+       means any form of electronic, verbal, or written communication sent
+       to the Licensor or its representatives, including but not limited to
+       communication on electronic mailing lists, source code control systems,
+       and issue tracking systems that are managed by, or on behalf of, the
+       Licensor for the purpose of discussing and improving the Work, but
+       excluding communication that is conspicuously marked or otherwise
+       designated in writing by the copyright owner as "Not a Contribution."
+ 
+       "Contributor" shall mean Licensor and any individual or Legal Entity
+       on behalf of whom a Contribution has been received by Licensor and
+       subsequently incorporated within the Work.
+ 
+    2. Grant of Copyright License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       copyright license to reproduce, prepare Derivative Works of,
+       publicly display, publicly perform, sublicense, and distribute the
+       Work and such Derivative Works in Source or Object form.
+ 
+    3. Grant of Patent License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       (except as stated in this section) patent license to make, have made,
+       use, offer to sell, sell, import, and otherwise transfer the Work,
+       where such license applies only to those patent claims licensable
+       by such Contributor that are necessarily infringed by their
+       Contribution(s) alone or by combination of their Contribution(s)
+       with the Work to which such Contribution(s) was submitted. If You
+       institute patent litigation against any entity (including a
+       cross-claim or counterclaim in a lawsuit) alleging that the Work
+       or a Contribution incorporated within the Work constitutes direct
+       or contributory patent infringement, then any patent licenses
+       granted to You under this License for that Work shall terminate
+       as of the date such litigation is filed.
+ 
+    4. Redistribution. You may reproduce and distribute copies of the
+       Work or Derivative Works thereof in any medium, with or without
+       modifications, and in Source or Object form, provided that You
+       meet the following conditions:
+ 
+       (a) You must give any other recipients of the Work or
+           Derivative Works a copy of this License; and
+ 
+       (b) You must cause any modified files to carry prominent notices
+           stating that You changed the files; and
+ 
+       (c) You must retain, in the Source form of any Derivative Works
+           that You distribute, all copyright, patent, trademark, and
+           attribution notices from the Source form of the Work,
+           excluding those notices that do not pertain to any part of
+           the Derivative Works; and
+ 
+       (d) If the Work includes a "NOTICE" text file as part of its
+           distribution, then any Derivative Works that You distribute must
+           include a readable copy of the attribution notices contained
+           within such NOTICE file, excluding those notices that do not
+           pertain to any part of the Derivative Works, in at least one
+           of the following places: within a NOTICE text file distributed
+           as part of the Derivative Works; within the Source form or
+           documentation, if provided along with the Derivative Works; or,
+           within a display generated by the Derivative Works, if and
+           wherever such third-party notices normally appear. The contents
+           of the NOTICE file are for informational purposes only and
+           do not modify the License. You may add Your own attribution
+           notices within Derivative Works that You distribute, alongside
+           or as an addendum to the NOTICE text from the Work, provided
+           that such additional attribution notices cannot be construed
+           as modifying the License.
+ 
+       You may add Your own copyright statement to Your modifications and
+       may provide additional or different license terms and conditions
+       for use, reproduction, or distribution of Your modifications, or
+       for any such Derivative Works as a whole, provided Your use,
+       reproduction, and distribution of the Work otherwise complies with
+       the conditions stated in this License.
+ 
+    5. Submission of Contributions. Unless You explicitly state otherwise,
+       any Contribution intentionally submitted for inclusion in the Work
+       by You to the Licensor shall be under the terms and conditions of
+       this License, without any additional terms or conditions.
+       Notwithstanding the above, nothing herein shall supersede or modify
+       the terms of any separate license agreement you may have executed
+       with Licensor regarding such Contributions.
+ 
+    6. Trademarks. This License does not grant permission to use the trade
+       names, trademarks, service marks, or product names of the Licensor,
+       except as required for reasonable and customary use in describing the
+       origin of the Work and reproducing the content of the NOTICE file.
+ 
+    7. Disclaimer of Warranty. Unless required by applicable law or
+       agreed to in writing, Licensor provides the Work (and each
+       Contributor provides its Contributions) on an "AS IS" BASIS,
+       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+       implied, including, without limitation, any warranties or conditions
+       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+       PARTICULAR PURPOSE. You are solely responsible for determining the
+       appropriateness of using or redistributing the Work and assume any
+       risks associated with Your exercise of permissions under this License.
+ 
+    8. Limitation of Liability. In no event and under no legal theory,
+       whether in tort (including negligence), contract, or otherwise,
+       unless required by applicable law (such as deliberate and grossly
+       negligent acts) or agreed to in writing, shall any Contributor be
+       liable to You for damages, including any direct, indirect, special,
+       incidental, or consequential damages of any character arising as a
+       result of this License or out of the use or inability to use the
+       Work (including but not limited to damages for loss of goodwill,
+       work stoppage, computer failure or malfunction, or any and all
+       other commercial damages or losses), even if such Contributor
+       has been advised of the possibility of such damages.
+ 
+    9. Accepting Warranty or Additional Liability. While redistributing
+       the Work or Derivative Works thereof, You may choose to offer,
+       and charge a fee for, acceptance of support, warranty, indemnity,
+       or other liability obligations and/or rights consistent with this
+       License. However, in accepting such obligations, You may act only
+       on Your own behalf and on Your sole responsibility, not on behalf
+       of any other Contributor, and only if You agree to indemnify,
+       defend, and hold each Contributor harmless for any liability
+       incurred by, or claims asserted against, such Contributor by reason
+       of your accepting any such warranty or additional liability.
+ 
+    END OF TERMS AND CONDITIONS
+ 
+    APPENDIX: How to apply the Apache License to your work.
+ 
+       To apply the Apache License to your work, attach the following
+       boilerplate notice, with the fields enclosed by brackets "[]"
+       replaced with your own identifying information. (Don't include
+       the brackets!)  The text should be enclosed in the appropriate
+       comment syntax for the file format. We also recommend that a
+       file or class name and description of purpose be included on the
+       same "printed page" as the copyright notice for easier
+       identification within third-party archives.
+ 
+    Copyright [yyyy] [name of copyright owner]
+ 
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+ 
+        http://www.apache.org/licenses/LICENSE-2.0
+ 
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
--- a/train/data/example.txt.bz2 0 → 100644
+++ b/train/data/example.txt.bz2 0 → 100644
--- a/train/embedding.py 0 → 100644
+++ b/train/embedding.py 0 → 100644
+ # coding=utf-8
+ # Copyright 2020 Heewon Jeon. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ 
+ import argparse
+ from utils.embedding_maker import create_embeddings
+ 
+ 
+ parser = argparse.ArgumentParser(description='Korean Autospacing Embedding Maker')
+ 
+ parser.add_argument('--num-iters', type=int, default=5,
+                     help='number of iterations to train (default: 5)')
+ 
+ parser.add_argument('--min-count', type=int, default=100,
+                     help='minimum word counts to filter (default: 100)')
+ 
+ parser.add_argument('--embedding-size', type=int, default=100,
+                     help='embedding dimension size (default: 100)')
+ 
+ parser.add_argument('--num-worker', type=int, default=16,
+                     help='number of thread (default: 16)')
+ 
+ parser.add_argument('--window-size', type=int, default=8,
+                     help='skip-gram window size (default: 8)')
+ 
+ parser.add_argument('--corpus_dir', type=str, default='data',
+                     help='training resource dir')
+ 
+ parser.add_argument('--train', action='store_true', default=True,
+                     help='do embedding training (default: True)')
+ 
+ parser.add_argument('--model-file', type=str, default='kospacing_wv.mdl',
+                     help='output object from Word2Vec() (default: kospacing_wv.mdl)')
+ 
+ parser.add_argument('--numpy-wv', type=str, default='kospacing_wv.np',
+                     help='numpy object file path from Word2Vec() (default: kospacing_wv.np)')
+ 
+ parser.add_argument('--w2idx', type=str, default='w2idx.dic',
+                     help='item to index json dictionary (default: w2idx.dic)')
+ 
+ parser.add_argument('--model-dir', type=str, default='model',
+                     help='dir to save models (default: model)')
+ 
+ opt = parser.parse_args()
+ 
+ if opt.train:
+     create_embeddings(opt.corpus_dir, opt.model_dir + '/' +
+                       opt.model_file, opt.model_dir + '/' + opt.numpy_wv,
+                       opt.model_dir + '/' + opt.w2idx, min_count=opt.min_count,
+                       iter=opt.num_iters,
+                       size=opt.embedding_size, workers=opt.num_worker, window=opt.window_size)
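
Note that with `action='store_true'` and `default=True`, `opt.train` is `True` whether or not `--train` is passed, so the `if opt.train:` guard never skips training. A minimal sketch of one way to make the flag switchable is shown below. The `--no-train` option is a hypothetical addition and is not in the original script.

```python
import argparse

parser = argparse.ArgumentParser()
# As in embedding.py: the flag can only set True, and the default is already True.
parser.add_argument('--train', action='store_true', default=True)
# Hypothetical companion flag (an assumption): stores False in the same destination.
parser.add_argument('--no-train', dest='train', action='store_false')

default_value = parser.parse_args([]).train
with_flag = parser.parse_args(['--train']).train
switched_off = parser.parse_args(['--no-train']).train
```

Here `default_value` and `with_flag` are both `True`, and only `--no-train` yields `False`.
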
--- a/train/jamo_model/.gitignore 0 → 100644
+++ b/train/jamo_model/.gitignore 0 → 100644
--- a/train/model/.gitignore 0 → 100644
+++ b/train/model/.gitignore 0 → 100644
--- a/train/output/.gitignore 0 → 100644
+++ b/train/output/.gitignore 0 → 100644
--- a/train/requirements.txt 0 → 100644
+++ b/train/requirements.txt 0 → 100644
+ absl-py==0.11.0
+ astunparse==1.6.3
+ cachetools==4.2.1
+ certifi==2020.12.5
+ chardet==4.0.0
+ click==7.1.2
+ cmake==3.18.4.post1
+ Cython==0.29.21
+ Flask==1.1.2
+ Flask-Cors==3.0.9
+ flatbuffers==1.12
+ gast==0.3.3
+ gensim==3.8.3
+ gluonnlp==0.10.0
+ google-auth==1.26.1
+ google-auth-oauthlib==0.4.2
+ google-pasta==0.2.0
+ graphviz==0.8.4
+ grpcio==1.32.0
+ h5py==2.10.0
+ idna==2.10
+ importlib-metadata==3.4.0
+ itsdangerous==1.1.0
+ Jinja2==2.11.2
+ joblib==1.0.1
+ Keras==2.4.3
+ Keras-Preprocessing==1.1.2
+ Markdown==3.3.3
+ MarkupSafe==1.1.1
+ mxnet-cu101==1.7.0
+ mxnet-cu101mkl==1.6.0.post0
+ mxnet-mkl==1.6.0
+ numpy==1.19.5
+ oauthlib==3.1.0
+ opt-einsum==3.3.0
+ packaging==20.9
+ pandas==1.2.2
+ protobuf==3.14.0
+ psutil==5.8.0
+ pyasn1==0.4.8
+ pyasn1-modules==0.2.8
+ pyparsing==2.4.7
+ python-dateutil==2.8.1
+ pytz==2020.5
+ PyYAML==5.3.1
+ requests==2.25.1
+ requests-oauthlib==1.3.0
+ rsa==4.6
+ scikit-learn==0.24.1
+ scipy==1.6.0
+ six==1.15.0
+ smart-open==4.0.1
+ soynlp==0.0.493
+ tensorboard==2.4.0
+ tensorboard-plugin-wit==1.7.0
+ tensorflow==2.4.1
+ tensorflow-estimator==2.4.0
+ termcolor==1.1.0
+ threadpoolctl==2.1.0
+ tqdm==4.56.0
+ typing-extensions==3.7.4.3
+ urllib3==1.26.3
+ Werkzeug==1.0.1
+ wrapt==1.12.1
+ zipp==3.4.0
--- a/train/train.py 0 → 100644
+++ b/train/train.py 0 → 100644
+ # coding=utf-8
+ # Copyright 2020 Heewon Jeon. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ 
+ import argparse
+ import bz2
+ import logging
+ import re
+ import time
+ from functools import lru_cache
+ from timeit import default_timer as timer
+ 
+ import gluonnlp as nlp
+ import mxnet as mx
+ import mxnet.autograd as autograd
+ import numpy as np
+ from mxnet import gluon
+ from mxnet.gluon import nn, rnn
+ from tqdm import tqdm
+ import csv
+ 
+ from utils.embedding_maker import (encoding_and_padding, load_embedding,
+                                    load_vocab)
+ 
+ logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s]  %(message)s")
+ logger = logging.getLogger()
+ 
+ parser = argparse.ArgumentParser(description='Korean Autospacing Trainer')
+ parser.add_argument('--num-epoch',
+                     type=int,
+                     default=5,
+                     help='number of iterations to train (default: 5)')
+ 
+ parser.add_argument('--n-hidden',
+                     type=int,
+                     default=200,
+                     help='GRU hidden size (default: 200)')
+ 
+ parser.add_argument('--max-seq-len',
+                     type=int,
+                     default=200,
+                     help='max sentence length on input (default: 200)')
+ 
+ parser.add_argument('--num-gpus',
+                     type=int,
+                     default=1,
+                     help='number of gpus (default: 1)')
+ 
+ parser.add_argument('--vocab-file',
+                     type=str,
+                     default='model/w2idx.dic',
+                     help='vocabulary file (default: model/w2idx.dic)')
+ 
+ parser.add_argument(
+     '--embedding-file',
+     type=str,
+     default='model/kospacing_wv.np',
+     help='embedding matrix file (default: model/kospacing_wv.np)')
+ 
+ parser.add_argument('--train',
+                     action='store_true',
+                     default=False,
+                     help='do training (default: False)')
+ 
+ parser.add_argument(
+     '--model-file',
+     type=str,
+     default='kospacing_wv.mdl',
+     help='output object from Word2Vec() (default: kospacing_wv.mdl)')
+ 
+ parser.add_argument('--train-samp-ratio',
+                     type=float,
+                     default=0.50,
+                     help='random train sample ratio (default: 0.50)')
+ 
+ parser.add_argument('--model-prefix',
+                     type=str,
+                     default='kospacing',
+                     help='prefix of output model file (default: kospacing)')
+ 
+ parser.add_argument('--model-params',
+                     type=str,
+                     default='kospacing_0.params',
+                     help='model params file (default: kospacing_0.params)')
+ 
+ parser.add_argument('--test',
+                     action='store_true',
+                     default=False,
+                     help='eval train set (default: False)')
+ 
+ parser.add_argument('--batch_size',
+                     type=int,
+                     default=100,
+                     help='train batch size')
+ 
+ parser.add_argument('--test_batch_size',
+                     type=int,
+                     default=100,
+                     help='test batch size')
+ 
+ parser.add_argument('--n_workers',
+                     type=int,
+                     default=10,
+                     help='number of dataloader workers')
+ 
+ parser.add_argument('--train_data',
+                     type=str,
+                     default='data/UCorpus_spacing_train.txt.bz2',
+                     help='bzip2-compressed train data')
+ 
+ parser.add_argument('--test_data',
+                     type=str,
+                     default='data/UCorpus_spacing_test.txt.bz2',
+                     help='bzip2-compressed test data')
+ 
+ parser.add_argument('--model_type',
+                     type=str,
+                     default='kospacing',
+                     help='kospacing or kospacing2')
+ 
+ parser.add_argument('--outputs',
+                     type=str,
+                     default='outputs',
+                     help='directory to save log and model params')
+ 
+ opt = parser.parse_args()
+ 
+ nlp.utils.mkdir(opt.outputs)
+ 
+ fileHandler = logging.FileHandler(opt.outputs + '/' + 'log.log')
+ fileHandler.setFormatter(logFormatter)
+ logger.addHandler(fileHandler)
+ 
+ consoleHandler = logging.StreamHandler()
+ consoleHandler.setFormatter(logFormatter)
+ logger.addHandler(consoleHandler)
+ 
+ logger.setLevel(logging.DEBUG)
+ logger.info(opt)
+ 
+ GPU_COUNT = opt.num_gpus
+ # Fall back to CPU when no GPU is requested (--num-gpus 0)
+ ctx = [mx.gpu(i) for i in range(GPU_COUNT)] if GPU_COUNT > 0 else [mx.cpu()]
+ 
+ 
+ # Model class
+ class korean_autospacing_base(gluon.HybridBlock):
+     def __init__(self, n_hidden, vocab_size, embed_dim, max_seq_length,
+                  **kwargs):
+         super(korean_autospacing_base, self).__init__(**kwargs)
+         # input sequence length
+         self.in_seq_len = max_seq_length
+         # output sequence length
+         self.out_seq_len = max_seq_length
+         # number of GRU hidden units
+         self.n_hidden = n_hidden
+         # number of unique characters
+         self.vocab_size = vocab_size
+         # max_seq_length
+         self.max_seq_length = max_seq_length
+         # embedding dimension
+         self.embed_dim = embed_dim
+ 
+         with self.name_scope():
+             self.embedding = nn.Embedding(input_dim=self.vocab_size,
+                                           output_dim=self.embed_dim)
+ 
+             self.conv_unigram = nn.Conv2D(channels=128,
+                                           kernel_size=(1, self.embed_dim))
+ 
+             self.conv_bigram = nn.Conv2D(channels=256,
+                                          kernel_size=(2, self.embed_dim),
+                                          padding=(1, 0))
+ 
+             self.conv_trigram = nn.Conv2D(channels=128,
+                                           kernel_size=(3, self.embed_dim),
+                                           padding=(1, 0))
+ 
+             self.conv_forthgram = nn.Conv2D(channels=64,
+                                             kernel_size=(4, self.embed_dim),
+                                             padding=(2, 0))
+ 
+             self.conv_fifthgram = nn.Conv2D(channels=32,
+                                             kernel_size=(5, self.embed_dim),
+                                             padding=(2, 0))
+ 
+             self.bi_gru = rnn.GRU(hidden_size=self.n_hidden, layout='NTC', bidirectional=True)
+             self.dense_sh = nn.Dense(100, activation='relu', flatten=False)
+             self.dense = nn.Dense(1, activation='sigmoid', flatten=False)
+ 
+     def hybrid_forward(self, F, inputs):
+         embed = self.embedding(inputs)
+         embed = F.expand_dims(embed, axis=1)
+         unigram = self.conv_unigram(embed)
+         bigram = self.conv_bigram(embed)
+         trigram = self.conv_trigram(embed)
+         forthgram = self.conv_forthgram(embed)
+         fifthgram = self.conv_fifthgram(embed)
+ 
+         grams = F.concat(unigram,
+                          F.slice_axis(bigram,
+                                       axis=2,
+                                       begin=0,
+                                       end=self.max_seq_length),
+                          trigram,
+                          F.slice_axis(forthgram,
+                                       axis=2,
+                                       begin=0,
+                                       end=self.max_seq_length),
+                          F.slice_axis(fifthgram,
+                                       axis=2,
+                                       begin=0,
+                                       end=self.max_seq_length),
+                          dim=1)
+ 
+         grams = F.transpose(grams, (0, 2, 3, 1))
+         grams = F.reshape(grams, (-1, self.max_seq_length, -3))
+         grams = self.bi_gru(grams)
+         fc1 = self.dense_sh(grams)
+         return self.dense(fc1)
+ 
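
The `slice_axis` calls in `korean_autospacing_base.hybrid_forward` follow from the convolution output-length formula for stride 1, `T + 2*pad - kernel + 1`. The sketch below works through the arithmetic with the kernel sizes, paddings, and channel counts declared in the class. The helper `conv_out_len` is a hypothetical name, and `T = 200` is the default `--max-seq-len`.

```python
def conv_out_len(seq_len, kernel, pad):
    """Output length along the sequence axis for stride 1: T + 2*pad - kernel + 1."""
    return seq_len + 2 * pad - kernel + 1


T = 200  # default --max-seq-len
# (kernel height, padding, channels) as declared in korean_autospacing_base
branches = {
    "unigram": (1, 0, 128),
    "bigram": (2, 1, 256),
    "trigram": (3, 1, 128),
    "forthgram": (4, 2, 64),
    "fifthgram": (5, 2, 32),
}
lengths = {name: conv_out_len(T, k, p) for name, (k, p, _) in branches.items()}
# Even kernel heights overshoot by one position and must be sliced back to T.
needs_slice = [name for name, n in lengths.items() if n > T]
# After concatenating along the channel axis, this is the GRU input feature size.
gru_input_features = sum(c for _, _, c in branches.values())
```

Only the bigram and forthgram branches produce `T + 1` positions; the slice applied to the fifthgram branch is a no-op with these paddings. Each kernel spans the full embedding width, so the width axis collapses to 1 and the GRU receives 608 features per time step.
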
+ 
+ # https://raw.githubusercontent.com/haven-jeon/Train_KoSpacing/master/img/kosapcing_img.png
+ class korean_autospacing2(gluon.HybridBlock):
+     def __init__(self, n_hidden, vocab_size, embed_dim, max_seq_length,
+                  **kwargs):
+         super(korean_autospacing2, self).__init__(**kwargs)
+         # input sequence length
+         self.in_seq_len = max_seq_length
+         # output sequence length
+         self.out_seq_len = max_seq_length
+         # number of GRU hidden units
+         self.n_hidden = n_hidden
+         # number of unique characters
+         self.vocab_size = vocab_size
+         # max_seq_length
+         self.max_seq_length = max_seq_length
+         # embedding dimension
+         self.embed_dim = embed_dim
+ 
+         with self.name_scope():
+             self.embedding = nn.Embedding(input_dim=self.vocab_size,
+                                           output_dim=self.embed_dim)
+ 
+             self.conv_unigram = nn.Conv2D(channels=128,
+                                           kernel_size=(1, self.embed_dim))
+ 
+             self.conv_bigram = nn.Conv2D(channels=128,
+                                          kernel_size=(2, self.embed_dim),
+                                          padding=(1, 0))
+ 
+             self.conv_trigram = nn.Conv2D(channels=64,
+                                           kernel_size=(3, self.embed_dim),
+                                           padding=(2, 0))
+ 
+             self.conv_forthgram = nn.Conv2D(channels=32,
+                                             kernel_size=(4, self.embed_dim),
+                                             padding=(3, 0))
+ 
+             self.conv_fifthgram = nn.Conv2D(channels=16,
+                                             kernel_size=(5, self.embed_dim),
+                                             padding=(4, 0))
+             # for reverse convolution
+             self.conv_rev_bigram = nn.Conv2D(channels=128,
+                                              kernel_size=(2, self.embed_dim),
+                                              padding=(1, 0))
+ 
+             self.conv_rev_trigram = nn.Conv2D(channels=64,
+                                               kernel_size=(3, self.embed_dim),
+                                               padding=(2, 0))
+ 
+             self.conv_rev_forthgram = nn.Conv2D(channels=32,
+                                                 kernel_size=(4,
+                                                              self.embed_dim),
+                                                 padding=(3, 0))
+ 
+             self.conv_rev_fifthgram = nn.Conv2D(channels=16,
+                                                 kernel_size=(5,
+                                                              self.embed_dim),
+                                                 padding=(4, 0))
+             self.bi_gru = rnn.GRU(hidden_size=self.n_hidden, layout='NTC', bidirectional=True)
+             # self.bi_gru = rnn.BidirectionalCell(
+             #     rnn.GRUCell(hidden_size=self.n_hidden),
+             #     rnn.GRUCell(hidden_size=self.n_hidden))
+             self.dense_sh = nn.Dense(100, activation='relu', flatten=False)
+             self.dense = nn.Dense(1, activation='sigmoid', flatten=False)
+ 
+     def hybrid_forward(self, F, inputs):
+         embed = self.embedding(inputs)
+         embed = F.expand_dims(embed, axis=1)
+         rev_embed = embed.flip(axis=2)
+ 
+         unigram = self.conv_unigram(embed)
+         bigram = self.conv_bigram(embed)
+         trigram = self.conv_trigram(embed)
+         forthgram = self.conv_forthgram(embed)
+         fifthgram = self.conv_fifthgram(embed)
+ 
+         rev_bigram = self.conv_rev_bigram(rev_embed).flip(axis=2)
+         rev_trigram = self.conv_rev_trigram(rev_embed).flip(axis=2)
+         rev_forthgram = self.conv_rev_forthgram(rev_embed).flip(axis=2)
+         rev_fifthgram = self.conv_rev_fifthgram(rev_embed).flip(axis=2)
+ 
+         grams = F.concat(unigram,
+                          F.slice_axis(bigram,
+                                       axis=2,
+                                       begin=0,
+                                       end=self.max_seq_length),
+                          F.slice_axis(rev_bigram,
+                                       axis=2,
+                                       begin=0,
+                                       end=self.max_seq_length),
+                          F.slice_axis(trigram,
+                                       axis=2,
+                                       begin=0,
+                                       end=self.max_seq_length),
+                          F.slice_axis(rev_trigram,
+                                       axis=2,
+                                       begin=0,
+                                       end=self.max_seq_length),
+                          F.slice_axis(forthgram,
+                                       axis=2,
+                                       begin=0,
+                                       end=self.max_seq_length),
+                          F.slice_axis(rev_forthgram,
+                                       axis=2,
+                                       begin=0,
+                                       end=self.max_seq_length),
+                          F.slice_axis(fifthgram,
+                                       axis=2,
+                                       begin=0,
+                                       end=self.max_seq_length),
+                          F.slice_axis(rev_fifthgram,
+                                       axis=2,
+                                       begin=0,
+                                       end=self.max_seq_length),
+                          dim=1)
+ 
+         grams = F.transpose(grams, (0, 2, 3, 1))
+         grams = F.reshape(grams, (-1, self.max_seq_length, -3))
+         grams = self.bi_gru(grams)
+         fc1 = self.dense_sh(grams)
+         return (self.dense(fc1))
+ 
+ 
+ def y_encoding(n_grams, maxlen=200):
+     # encode the answer (label) set from the input sentences:
+     # the last character of each word is marked with 1
+     init_mat = np.zeros(shape=(len(n_grams), maxlen), dtype=np.int8)
+     for i in range(len(n_grams)):
+         init_mat[i, np.cumsum([len(j) for j in n_grams[i]]) - 1] = 1
+     return init_mat
+ 
+ 
+ def split_train_set(x_train, p=0.98):
+     """
+     Randomly split the row indices of x_train: a fraction p goes to the
+     training set and the rest is held out.
+     Returns (train_idx, test_idx) as NumPy arrays.
+     """
+     import numpy as np
+     train_idx = np.random.choice(range(x_train.shape[0]),
+                                  int(x_train.shape[0] * p),
+                                  replace=False)
+     set_tr_idx = set(train_idx)
+     test_index = [i for i in range(x_train.shape[0]) if i not in set_tr_idx]
+     return ((train_idx, np.array(test_index)))
+ 
+ 
+ def get_generator(x, y, batch_size):
+     tr_set = gluon.data.ArrayDataset(x, y.astype('float32'))
+     tr_data_iterator = gluon.data.DataLoader(tr_set,
+                                              batch_size=batch_size,
+                                              shuffle=True,
+                                              num_workers=opt.n_workers)
+     return (tr_data_iterator)
+ 
+ 
+ def pick_model(model_nm, n_hidden, vocab_size, embed_dim, max_seq_length):
+     if model_nm.lower() == 'kospacing':
+         model = korean_autospacing_base(n_hidden=n_hidden,
+                                         vocab_size=vocab_size,
+                                         embed_dim=embed_dim,
+                                         max_seq_length=max_seq_length)
+     elif model_nm.lower() == 'kospacing2':
+         model = korean_autospacing2(n_hidden=n_hidden,
+                                     vocab_size=vocab_size,
+                                     embed_dim=embed_dim,
+                                     max_seq_length=max_seq_length)
+     else:
+         raise ValueError('unknown model type: {}'.format(model_nm))
+     return model
+ 
+ 
+ def model_init(n_hidden, vocab_size, embed_dim, max_seq_length, ctx):
+     # create the model instance and define the trainer and loss
+     # n_hidden, vocab_size, embed_dim, max_seq_length
+     model = pick_model(opt.model_type, n_hidden, vocab_size, embed_dim, max_seq_length)
+     model.collect_params().initialize(mx.init.Xavier(), ctx=ctx)
+     model.embedding.weight.set_data(weights)
+     model.hybridize(static_alloc=True)
+     # freeze the embedding layer weights
+     model.embedding.collect_params().setattr('grad_req', 'null')
+     trainer = gluon.Trainer(model.collect_params(), 'rmsprop')
+     loss = gluon.loss.SigmoidBinaryCrossEntropyLoss(from_sigmoid=True)
+     loss.hybridize(static_alloc=True)
+     return (model, loss, trainer)
+ 
+ 
+ def evaluate_accuracy(data_iterator, net, pad_idx, ctx, n=5000):
+     # measure accuracy by iterating over each sequence up to its own length
+     # not optimized
+     acc = mx.metric.Accuracy(axis=0)
+     num_of_test = 0
+     for i, (data, label) in enumerate(data_iterator):
+         data = data.as_in_context(ctx)
+         label = label.as_in_context(ctx)
+         # get sentence length
+         data_np = data.asnumpy()
+         lengths = np.argmax(np.where(data_np == pad_idx, np.ones_like(data_np),
+                                      np.zeros_like(data_np)),
+                             axis=1)
+         output = net(data)
+         pred_label = output.squeeze(axis=2) > 0.5
+ 
+         for j in range(data.shape[0]):
+             # count one evaluated sentence at a time
+             num_of_test += 1
+             acc.update(preds=pred_label[j, :lengths[j]],
+                        labels=label[j, :lengths[j]])
+         if num_of_test > n:
+             break
+     return acc.get()[1]
+ 
+ 
+ def train(epochs,
+           tr_data_iterator,
+           te_data_iterator,
+           va_data_iterator,
+           model,
+           loss,
+           trainer,
+           pad_idx,
+           ctx,
+           mdl_desc="spacing_model",
+           decay=False):
+     # training loop
+     tot_test_acc = []
+     tot_train_loss = []
+     for e in range(epochs):
+         tic = time.time()
+         # Decay learning rate.
+         if e > 1 and decay:
+             trainer.set_learning_rate(trainer.learning_rate * 0.7)
+         train_loss = []
+         iter_tqdm = tqdm(tr_data_iterator, 'Batches')
+         for i, (x_data, y_data) in enumerate(iter_tqdm):
+             x_data_l = gluon.utils.split_and_load(x_data,
+                                                   ctx,
+                                                   even_split=False)
+             y_data_l = gluon.utils.split_and_load(y_data,
+                                                   ctx,
+                                                   even_split=False)
+ 
+             with autograd.record():
+                 losses = [
+                     loss(model(x), y) for x, y in zip(x_data_l, y_data_l)
+                 ]
+             for l in losses:
+                 l.backward()
+             trainer.step(x_data.shape[0])
+             curr_loss = np.mean([mx.nd.mean(l).asscalar() for l in losses])
+             train_loss.append(curr_loss)
+             iter_tqdm.set_description("loss {}".format(curr_loss))
+             mx.nd.waitall()
+ 
+         # calculate test and validation accuracy
+         test_acc = evaluate_accuracy(
+             te_data_iterator,
+             model,
+             pad_idx,
+             ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
+         valid_acc = evaluate_accuracy(
+             va_data_iterator,
+             model,
+             pad_idx,
+             ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
+         logger.info('[Epoch %d] time cost: %f' % (e, time.time() - tic))
+         logger.info("[Epoch %d] Train Loss: %f, Test acc : %f Valid acc : %f" %
+                     (e, np.mean(train_loss), test_acc, valid_acc))
+         tot_test_acc.append(test_acc)
+         tot_train_loss.append(np.mean(train_loss))
+         model.save_parameters(opt.outputs + '/' + "{}_{}.params".format(mdl_desc, e))
+     return (tot_test_acc, tot_train_loss)
+ 
+ 
+ def pre_processing(sentences):
+     # spaces are replaced with ^
+     char_list = [li.strip().replace(' ', '^') for li in sentences]
+     # sentence start marker «
+     # sentence end marker »
+     char_list = ["«" + li + "»" for li in char_list]
+     # sentence -> character string
+     char_list = [''.join(list(li)) for li in char_list]
+     return char_list
+ 
+ 
+ def make_input_data(inputs,
+                     train_ratio,
+                     sampling,
+                     make_lag_set=False,
+                     batch_size=200):
+     with bz2.open(inputs, 'rt') as f:
+         line_list = [i.strip() for i in f.readlines() if i.strip() != '']
+     logger.info('complete loading train file!')
+ 
+     # 아버지가 방에 들어가신다. -> '«아버지가^방에^들어가신다.»'
+     processed_seq = pre_processing(line_list)
+     logger.info(processed_seq[0])
+     # n percent random sample
+     logger.info('random sampling on training set!')
+     samp_idx = np.random.choice(range(len(processed_seq)),
+                                 int(len(processed_seq) * sampling),
+                                 replace=False)
+     processed_seq_samp = [processed_seq[i] for i in samp_idx]
+     sp_sents = [i.split('^') for i in processed_seq_samp]
+ 
+     sp_sents = list(filter(lambda x: len(x) >= 8, sp_sents))
+ 
+     # build training data from windows of 8 eojeol (space-delimited words), shifted by 1 eojeol
+     if make_lag_set is True:
+         n_gram = [[k, v, z, a, c, d, e, f]
+                   for sent in sp_sents for k, v, z, a, c, d, e, f in zip(
+                       sent, sent[1:], sent[2:], sent[3:], sent[4:], sent[5:],
+                       sent[6:], sent[7:])]
+     else:
+         n_gram = sp_sents
+     # use only sequences of at most 200 characters
+     n_gram = [i for i in n_gram if len("^".join(i)) <= opt.max_seq_len]
+     # encode the answer labels y
+     n_gram_y = y_encoding(n_gram, opt.max_seq_len)
+     logger.info(n_gram[0])
+     logger.info(n_gram_y[0])
+     # load the vocab file
+     w2idx, _ = load_vocab(opt.vocab_file)
+ 
+     # to build the training set, remove spaces and encode characters as indices
+     logger.info('index encoding!')
+     ngram_coding_seq = encoding_and_padding(
+         word2idx_dic=w2idx,
+         sequences=[''.join(gram) for gram in n_gram],
+         maxlen=opt.max_seq_len,
+         padding='post',
+         truncating='post')
+     logger.info(ngram_coding_seq[0])
+     if train_ratio < 1:
+         # create the training and test sets
+         tr_idx, te_idx = split_train_set(ngram_coding_seq, train_ratio)
+ 
+         y_train = n_gram_y[tr_idx, ]
+         x_train = ngram_coding_seq[tr_idx, ]
+ 
+         y_test = n_gram_y[te_idx, ]
+         x_test = ngram_coding_seq[te_idx, ]
+ 
+         # train generator
+         train_generator = get_generator(x_train, y_train, batch_size)
+         valid_generator = get_generator(x_test, y_test, 500)
+         return (train_generator, valid_generator)
+     else:
+         train_generator = get_generator(ngram_coding_seq, n_gram_y, batch_size)
+         return (train_generator)
+ 
+ 
+ if opt.train:
+     # load the vocab file
+     w2idx, idx2w = load_vocab(opt.vocab_file)
+     # load the embedding file
+     weights = load_embedding(opt.embedding_file)
+     vocab_size = weights.shape[0]
+     embed_dim = weights.shape[1]
+ 
+     train_generator, valid_generator = make_input_data(
+         opt.train_data,
+         train_ratio=0.95,
+         sampling=opt.train_samp_ratio,
+         make_lag_set=True,
+         batch_size=opt.batch_size)
+ 
+     test_generator = make_input_data(opt.test_data,
+                                      sampling=1,
+                                      train_ratio=1,
+                                      make_lag_set=True,
+                                      batch_size=opt.test_batch_size)
+ 
+     model, loss, trainer = model_init(n_hidden=opt.n_hidden,
+                                       vocab_size=vocab_size,
+                                       embed_dim=embed_dim,
+                                       max_seq_length=opt.max_seq_len,
+                                       ctx=ctx)
+     logger.info('start training!')
+     train(epochs=opt.num_epoch,
+           tr_data_iterator=train_generator,
+           te_data_iterator=test_generator,
+           va_data_iterator=valid_generator,
+           model=model,
+           loss=loss,
+           trainer=trainer,
+           pad_idx=w2idx['__PAD__'],
+           ctx=ctx,
+           mdl_desc=opt.model_prefix)
+ 
+ 
+ class pred_spacing:
+     def __init__(self, model, w2idx):
+         self.model = model
+         self.w2idx = w2idx
+         self.pattern = re.compile(r'\s+')
+ 
+     @lru_cache(maxsize=None)
+     def get_spaced_sent(self, raw_sent):
+         raw_sent_ = "«" + raw_sent + "»"
+         raw_sent_ = raw_sent_.replace(' ', '^')
+         sents_in = [
+             raw_sent_,
+         ]
+         mat_in = encoding_and_padding(word2idx_dic=self.w2idx,
+                                       sequences=sents_in,
+                                       maxlen=opt.max_seq_len,
+                                       padding='post',
+                                       truncating='post')
+         mat_in = mx.nd.array(mat_in, ctx=mx.cpu(0))
+         results = self.model(mat_in)
+         mat_set = results[0, ]
+ 
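+         # log-scale the sigmoid outputs so small probabilities are spread out
+         r = 255
+         c = 1 / np.log(1+r)
+         log_scaled = c * mx.nd.log(1 + r * mat_set[:len(raw_sent_)])
+         #print(log_scaled)
+         # discrete second difference of the raw outputs; position 0 is fixed
+         # to 1 so it is never marked as a space
+         # note: mat_set[i+1] assumes the sentence is shorter than opt.max_seq_len
+         d_2 = [1]
+         for i in range(1,len(raw_sent_)):
+             d_2.append(mat_set[i-1] - (2 * mat_set[i]) + mat_set[i+1])
+         #print(d_2)
+         # insert a space where the scaled output exceeds 0.01 and the curve is concave
+         preds = np.array(
+             ['1' if log_scaled[i] > 0.01 and d_2[i] < 0 else '0' for i in range(len(raw_sent_))])
+         print(mat_set[:len(raw_sent_)])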
+         # #saveresult
+         
+         
+         # wr.writerow([raw_sent_, temp])
+         # f.close
+         return self.make_pred_sents(raw_sent_, preds)
+ 
+     def make_pred_sents(self, x_sents, y_pred):
+         res_sent = []
+         for i, j in zip(x_sents, y_pred):
+             if j == '1':
+                 res_sent.append(i)
+                 res_sent.append(' ')
+             else:
+                 res_sent.append(i)
+         subs = re.sub(self.pattern, ' ', ''.join(res_sent).replace('^', ' '))
+         subs = subs.replace('«', '')
+         subs = subs.replace('»', '')
+         return subs
+ 
+ if not opt.train and not opt.test:
+     # load the vocab file
+     w2idx, idx2w = load_vocab(opt.vocab_file)
+     # load the embedding file
+     weights = load_embedding(opt.embedding_file)
+     vocab_size = weights.shape[0]
+     embed_dim = weights.shape[1]
+     model = pick_model(opt.model_type, opt.n_hidden, vocab_size, embed_dim, opt.max_seq_len)
+ 
+     # model.collect_params().initialize(mx.init.Xavier(), ctx=mx.cpu(0))
+     # model.embedding.weight.set_data(weights)
+     model.load_parameters(opt.model_params, ctx=mx.cpu(0))
+     predictor = pred_spacing(model, w2idx)
+     
+     # datafile = open('./data/removed.txt', 'r', encoding='utf-8')
+     # lines = datafile.readlines()
+     # total = len(lines)
+     # cnt = 1
+     # for line in lines[:50000]:
+     #     print()
+     #     print('#' * 30)
+     #     print(cnt, ' / ', total)
+     #     print('#' * 30)
+     #     predictor.get_spaced_sent(line)
+     #     cnt += 1
+ 
+ 
+ 
+     while 1:
+         sent = input("sent > ")
+         print(sent)
+         start = timer()
+         spaced = predictor.get_spaced_sent(sent)
+         end = timer()
+         print("spaced sent[{:03.2f}sec/sent]  > {}".format(end - start, spaced))
+ 
+ if not opt.train and opt.test:
+     logger.info("calculate accuracy!")
+     # load the vocab file
+     w2idx, idx2w = load_vocab(opt.vocab_file)
+     # load the embedding file
+     weights = load_embedding(opt.embedding_file)
+     vocab_size = weights.shape[0]
+     embed_dim = weights.shape[1]
+ 
+     model = pick_model(opt.model_type, opt.n_hidden, vocab_size, embed_dim, opt.max_seq_len)
+ 
+     # model.initialize(ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
+     model.load_parameters(opt.model_params,
+                           ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
+     valid_generator = make_input_data(opt.test_data,
+                                       sampling=1,
+                                       train_ratio=1,
+                                       make_lag_set=True,
+                                       batch_size=100)
+     valid_acc = evaluate_accuracy(
+         valid_generator,
+         model,
+         w2idx['__PAD__'],
+         ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0),
+         n=30000)
+     logger.info('valid accuracy : {}'.format(valid_acc))
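The decision rule used in `pred_spacing.get_spaced_sent` above (log-scaling the sigmoid outputs, then requiring a negative second difference) can be sketched without MXNet. This is a minimal illustration, not the commit's code: the function name `pick_spaces` is hypothetical, and it assumes the value to the right of the last position is 0, whereas the original reads the model output at the next (padded) position.

```python
import math


def pick_spaces(probs, r=255, threshold=0.01):
    """Return 0/1 marks; 1 means 'insert a space after this character'."""
    # Log scaling maps [0, 1] onto [0, 1] while stretching small values.
    c = 1.0 / math.log(1 + r)
    scaled = [c * math.log(1 + r * p) for p in probs]

    # Position 0 is never marked (the original seeds d_2 with 1).
    marks = [0]
    for i in range(1, len(probs)):
        # Assumption: treat the value past the end as 0.0.
        right = probs[i + 1] if i + 1 < len(probs) else 0.0
        # Discrete second difference; negative means the curve is concave here.
        d2 = probs[i - 1] - 2 * probs[i] + right
        marks.append(1 if scaled[i] > threshold and d2 < 0 else 0)
    return marks


print(pick_spaces([0.02, 0.95, 0.03, 0.01, 0.8, 0.05]))
```

With these made-up probabilities, only the two local peaks (0.95 and 0.8) are marked; positions next to a peak have a positive second difference and are skipped even if their scaled value is above the threshold.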
--- a/train/utils/__pycache__/embedding_maker.cpython-37.pyc 0 → 100644
+++ b/train/utils/__pycache__/embedding_maker.cpython-37.pyc 0 → 100644
--- a/train/utils/__pycache__/jamo_utils.cpython-37.pyc 0 → 100644
+++ b/train/utils/__pycache__/jamo_utils.cpython-37.pyc 0 → 100644
--- a/train/utils/__pycache__/spacing_utils.cpython-37.pyc 0 → 100644
+++ b/train/utils/__pycache__/spacing_utils.cpython-37.pyc 0 → 100644
--- a/train/utils/embedding_maker.py 0 → 100644
+++ b/train/utils/embedding_maker.py 0 → 100644
+ __all__ = [
+     'create_embeddings', 'load_embedding', 'load_vocab',
+     'encoding_and_padding', 'get_embedding_model'
+ ]
+ 
+ import bz2
+ import json
+ import os
+ 
+ import numpy as np
+ import pkg_resources
+ from gensim.models import FastText
+ 
+ from utils.spacing_utils import sent_to_spacing_chars
+ from tqdm import tqdm
+ from utils.jamo_utils import jamo_sentence, jamo_to_word
+ 
+ def pad_sequences(sequences,
+                   maxlen=None,
+                   dtype='int32',
+                   padding='pre',
+                   truncating='pre',
+                   value=0.):
+ 
+     if not hasattr(sequences, '__len__'):
+         raise ValueError('`sequences` must be iterable.')
+     lengths = []
+     for x in sequences:
+         if not hasattr(x, '__len__'):
+             raise ValueError('`sequences` must be a list of iterables. '
+                              'Found non-iterable: ' + str(x))
+         lengths.append(len(x))
+ 
+     num_samples = len(sequences)
+     if maxlen is None:
+         maxlen = np.max(lengths)
+ 
+     # take the sample shape from the first non empty sequence
+     # checking for consistency in the main loop below.
+     sample_shape = tuple()
+     for s in sequences:
+         if len(s) > 0:
+             sample_shape = np.asarray(s).shape[1:]
+             break
+ 
+     x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
+     for idx, s in enumerate(sequences):
+         if not len(s):
+             continue  # empty list/array was found
+         if truncating == 'pre':
+             trunc = s[-maxlen:]
+         elif truncating == 'post':
+             trunc = s[:maxlen]
+         else:
+             raise ValueError('Truncating type "%s" not understood' %
+                              truncating)
+ 
+         # check `trunc` has expected shape
+         trunc = np.asarray(trunc, dtype=dtype)
+         if trunc.shape[1:] != sample_shape:
+             raise ValueError(
+                 'Shape of sample %s of sequence at position %s is different from expected shape %s'
+                 % (trunc.shape[1:], idx, sample_shape))
+ 
+         if padding == 'post':
+             x[idx, :len(trunc)] = trunc
+         elif padding == 'pre':
+             x[idx, -len(trunc):] = trunc
+         else:
+             raise ValueError('Padding type "%s" not understood' % padding)
+     return x
+ 
+ 
+ def create_embeddings(data_dir,
+                       model_file,
+                       embeddings_file,
+                       vocab_file,
+                       splitc=' ',
+                       **params):
+     """
+     Build the embedding matrix and vocab file from a saved FastText model.
+     :**params additional FastText() parameters (currently unused)
+     :splitc   char for splitting in data_dir files
+     :model_file path of the saved FastText model to load
+     :data_dir data dir to be processed
+     :embeddings_file output path for the NumPy embedding matrix
+     :vocab_file item to index json dictionary
+     """
+     class SentenceGenerator(object):
+         def __init__(self, dirname):
+             self.dirname = dirname
+ 
+         def __iter__(self):
+             for fname in os.listdir(self.dirname):
+                 print("processing~  '{}'".format(fname))
+                 for line in bz2.open(os.path.join(self.dirname, fname), "rt"):
+                     yield sent_to_spacing_chars(line.strip()).split(splitc)
+ 
+     sentences = SentenceGenerator(data_dir)
+ 
+     model = FastText.load(model_file)
+     model.save(model_file)
+     weights = model.wv.syn0
+     default_vec = np.mean(weights, axis=0, keepdims=True)
+     padding_vec = np.zeros((1, weights.shape[1]))
+ 
+     weights_default = np.concatenate([weights, default_vec, padding_vec],
+                                      axis=0)
+ 
+     # close the file handle once the matrix is written
+     with open(embeddings_file, 'wb') as f:
+         np.save(f, weights_default)
+ 
+     vocab = dict([(k, v.index) for k, v in model.wv.vocab.items()])
+     vocab['__PAD__'] = weights_default.shape[0] - 1
+     with open(vocab_file, 'w') as f:
+         f.write(json.dumps(vocab))
+ 
+ 
+ def load_embedding(embeddings_file):
+     return (np.load(embeddings_file))
+ 
+ 
+ def load_vocab(vocab_path):
+     with open(vocab_path, 'r') as f:
+         data = json.loads(f.read())
+     word2idx = data
+     idx2word = dict([(v, k) for k, v in data.items()])
+     return word2idx, idx2word
+ 
+ def get_similar_char(word2idx_dic, model, jamo_model, text, try_cnt, OOV_CNT, HIT_CNT):
+     OOV_CNT += 1
+     jamo_text = jamo_sentence(text)
+     similar_list = jamo_model.wv.most_similar(jamo_text)[:try_cnt]
+     for char in similar_list:
+         result = jamo_to_word(char[0])
+ 
+         if result in word2idx_dic:
+             # print('#' * 20)
+             # print('hit')
+             # print('origin: ', text, 'result: ', result)
+             HIT_CNT += 1
+             return OOV_CNT, HIT_CNT, result
+ 
+     # print('#' * 20)
+     # print('no hit')
+     # print('origin: ', text)
+     return OOV_CNT, HIT_CNT, model.wv.most_similar(text)[0][0]
+ 
+ 
+ def encoding_and_padding(word2idx_dic, sequences, **params):
+     """
+     1. making item to idx
+     2. padding
+     :word2idx_dic
+     :sequences: list of lists where each element is a sequence
+     :maxlen: int, maximum length
+     :dtype: type to cast the resulting sequence.
+     :padding: 'pre' or 'post', pad either before or after each sequence.
+     :truncating: 'pre' or 'post', remove values from sequences larger than
+         maxlen either in the beginning or in the end of the sequence
+     :value: float, value to pad the sequences to the desired value.
+     """
+     model_file = 'model/fasttext'
+     jamo_model_path = 'jamo_model/fasttext'
+     print('seq_idx start')
+     model = FastText.load(model_file)
+     jamo_model = FastText.load(jamo_model_path)
+     seq_idx = []
+     OOV_CNT = 0
+     HIT_CNT = 0
+     TOTAL_CNT = 0
+     
+     for word in tqdm(sequences):
+         temp = []
+         for char in word:
+             TOTAL_CNT += 1
+             if char in word2idx_dic.keys():
+                 temp.append(word2idx_dic[char])
+             else:
+                 OOV_CNT, HIT_CNT, result = get_similar_char(word2idx_dic, model, jamo_model, char, 3, OOV_CNT, HIT_CNT)
+                 temp.append(word2idx_dic[result])
+         seq_idx.append(temp)
+     print('TOTAL CNT: ', TOTAL_CNT, 'OOV CNT: ', OOV_CNT, 'HIT_CNT: ', HIT_CNT)
+     if OOV_CNT > 0 and HIT_CNT > 0:
+         print('OOV RATE:', float(OOV_CNT) / TOTAL_CNT * 100, '%' ,'HIT_RATE: ', float(HIT_CNT) / float(OOV_CNT) * 100, '%')
+     
+     params['value'] = word2idx_dic['__PAD__']
+     return (pad_sequences(seq_idx, **params))
+ 
+ 
+ def get_embedding_model(name='fee_prods', path='data/embedding'):
+     weights = pkg_resources.resource_filename(
+         'dsc', os.path.join(path, name, 'weights.np'))
+     w2idx = pkg_resources.resource_filename(
+         'dsc', os.path.join(path, name, 'idx.json'))
+     return ((load_embedding(weights), load_vocab(w2idx)[0]))
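The `padding='post'` / `truncating='post'` behaviour that `train.py` requests from `encoding_and_padding` can be illustrated with plain lists. This is a simplified sketch, not the `pad_sequences` function above: `pad_post` and the toy vocabulary are hypothetical, and the out-of-vocabulary lookup through FastText is left out.

```python
def pad_post(seqs, maxlen, pad_value):
    """Keep the first maxlen ids of each sequence and pad at the end."""
    padded = []
    for seq in seqs:
        kept = list(seq[:maxlen])  # truncating='post' drops the tail
        padded.append(kept + [pad_value] * (maxlen - len(kept)))
    return padded


# Toy vocabulary; the real one is loaded from the vocab JSON file.
vocab = {"«": 0, "가": 1, "»": 2, "__PAD__": 3}
ids = [[vocab[ch] for ch in "«가»"], [vocab[ch] for ch in "«가가가가»"]]
print(pad_post(ids, 4, vocab["__PAD__"]))
```

Because padding goes at the end and uses the `__PAD__` index, the position of the first pad id also gives the sentence length, which is how `evaluate_accuracy` recovers it.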
--- a/train/utils/jamo_utils.py 0 → 100644
+++ b/train/utils/jamo_utils.py 0 → 100644
+ import re 
+ from soynlp.hangle import compose, decompose, character_is_korean 
+ 
+ 
+ doublespace_pattern = re.compile(r'\s+')
+ 
+ def jamo_sentence(sent): 
+     def transform(char): 
+         if char == ' ': 
+             return char 
+             
+         cjj = decompose(char) 
+         if len(cjj) == 1: 
+             return cjj 
+         
+         cjj_ = ''.join(c if c != ' ' else '-' for c in cjj) 
+         return cjj_ 
+         
+     sent_ = [] 
+     for char in sent: 
+         if character_is_korean(char): 
+             sent_.append(transform(char)) 
+         else: 
+             sent_.append(char) 
+     sent_ = doublespace_pattern.sub(' ', ''.join(sent_)) 
+     return sent_ 
+         
+ def jamo_to_word(jamo): 
+     jamo_list, idx = [], 0 
+     
+     while idx < len(jamo): 
+         if not character_is_korean(jamo[idx]): 
+             jamo_list.append(jamo[idx]) 
+             idx += 1 
+         else: 
+             jamo_list.append(jamo[idx:idx + 3]) 
+             idx += 3 
+         
+     word = "" 
+     for jamo_char in jamo_list: 
+         if len(jamo_char) == 1: 
+             word += jamo_char 
+         elif jamo_char[2] == "-":
+             word += compose(jamo_char[0], jamo_char[1], " ")
+         else:
+             word += compose(jamo_char[0], jamo_char[1], jamo_char[2])
+             
+     return word
+ 
+ def break_char(jamo_sentence):
+     idx = 0
+     corpus = []
+ 
+     while idx < len(jamo_sentence):
+         if not character_is_korean(jamo_sentence[idx]):
+             corpus.append(jamo_sentence[idx])
+             idx += 1
+         else:
+             corpus.append(jamo_sentence[idx:idx + 3])
+             idx += 3
+     return corpus
\ No newline at end of file
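The fixed-width jamo encoding that `jamo_sentence` and `jamo_to_word` rely on (three symbols per syllable, with `-` standing in for a missing final consonant) can be reproduced with the standard Unicode Hangul arithmetic. The sketch below is standard-library only and does not call soynlp; `to_jamo` and `from_jamo` are hypothetical names, and it assumes the input contains no standalone jamo characters.

```python
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"           # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
JONG = "-ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"  # '-' = no final
BASE = 0xAC00  # code point of the first precomposed syllable


def to_jamo(text):
    """Replace each Hangul syllable by exactly three jamo symbols."""
    out = []
    for ch in text:
        if 0xAC00 <= ord(ch) <= 0xD7A3:
            code = ord(ch) - BASE
            out.append(CHO[code // 588] + JUNG[(code % 588) // 28] + JONG[code % 28])
        else:
            out.append(ch)  # non-Hangul characters pass through unchanged
    return "".join(out)


def from_jamo(jamo):
    """Inverse of to_jamo; assumes every initial consonant starts a triple."""
    out, idx = [], 0
    while idx < len(jamo):
        triple = jamo[idx:idx + 3]
        if len(triple) == 3 and triple[0] in CHO and triple[1] in JUNG and triple[2] in JONG:
            code = (CHO.index(triple[0]) * 21 + JUNG.index(triple[1])) * 28 + JONG.index(triple[2])
            out.append(chr(BASE + code))
            idx += 3
        else:
            out.append(jamo[idx])
            idx += 1
    return "".join(out)


print(to_jamo("한가!"))
print(from_jamo(to_jamo("한국어 띄어쓰기")))
```

The fixed width of three is what lets `jamo_to_word` and `break_char` step through a string with `idx += 3` instead of needing a separator between syllables.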
--- a/train/utils/spacing_utils.py 0 → 100644
+++ b/train/utils/spacing_utils.py 0 → 100644
+ # coding=utf-8
+ # Copyright 2020 Heewon Jeon. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ 
+ def sent_to_spacing_chars(sent):
+     # spaces are replaced with ^
+     chars = sent.strip().replace(' ', '^')
+     # char_list = [li.strip().replace(' ', '^') for li in sents]
+ 
+     # sentence start marker «
+     # sentence end marker »
+     tagged_chars = "«" + chars + "»"
+     # char_list = [ "«" + li + "»" for li in char_list]
+ 
+     # sentence -> space-separated characters
+     char_list = ' '.join(list(tagged_chars))
+     # char_list = [ ' '.join(list(li))  for li in char_list]
+     return(char_list)
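To see how the `«`, `»` and `^` markers turn into training pairs, the sketch below combines the tagging step with the label rule from `y_encoding` in `train.py`: the input is the sentence with spaces removed, and the label is 1 on the last character of each word. `make_example` is a hypothetical helper; it omits padding to `max_seq_len` and, as an assumption, collapses repeated spaces first.

```python
def make_example(sentence):
    """Return (characters without spaces, 0/1 labels of the same length)."""
    # Assumption: normalize whitespace so no empty words appear.
    normalized = " ".join(sentence.split())
    tagged = "«" + normalized.replace(" ", "^") + "»"
    words = tagged.split("^")
    chars = "".join(words)

    labels = [0] * len(chars)
    pos = 0
    for word in words:
        pos += len(word)
        labels[pos - 1] = 1  # a space follows the last character of each word
    return chars, labels


chars, labels = make_example("가 나다")
print(chars, labels)
```

The end marker `»` always receives label 1 because it closes the last word; `make_pred_sents` strips both markers from the final output.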