yomapi

submit train init

1 +# ML-based Spacing Corrector
2 +This model is an improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing"), using FastText instead of Word2Vec.
3 +
4 +## Performances
5 +| Model | Test Accuracy(%) | Encoding Time Cost |
6 +| :------------: | :------------: | :------------: |
7 +| TrainKoSpacing | 96.6147 | 02m 23s|
8 +| 자모분해 FastText | 98.9915 | 08h 20m 11s |
9 +| 2 Stage FastText | 99.0888 | 03m 23s |
10 +
11 +## Data
12 +#### Corpus
13 +
14 +We mainly focus on the National Institute of Korean Language 모두의 말뭉치 corpus and the National Information Society Agency AI-Hub data. However, due to licensing restrictions, we cannot distribute these datasets ourselves. You should be able to obtain them through the links below.
15 +[National Institute of Korean Language 모두의 말뭉치](https://corpus.korean.go.kr/).
16 +[National Information Society Agency AI-Hub](https://aihub.or.kr/aihub-data/natural-language/about "National Information Society Agency AI-Hub")
17 +
18 +#### Data format
19 +A bzip2-compressed text file with one sentence per line.
20 +
21 +```
22 +~/KoSpacing/data$ bzcat train.txt.bz2 | head
23 +엠마누엘 웅가로 / 의상서 실내 장식품으로… 디자인 세계 넓혀
24 +프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.
25 +웅가로는 침실과 식당, 욕실에서 사용하는 갖가지 직물제품을 디자인해 최근 파리의 갤러리 라파예트백화점에서 '색의 컬렉션'이라는 이름으로 전시회를 열었다.
26 +```
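A quick way to check that a corpus file matches this format can be sketched in Python (the file name `sample.txt.bz2` is illustrative, not a project file):

```python
import bz2

# Write and read back a tiny corpus in the expected format:
# a bz2-compressed text file with one sentence per line.
sentences = [
    '프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.',
    '웅가로는 침실과 식당, 욕실에서 사용하는 갖가지 직물제품을 디자인해 전시회를 열었다.',
]
with bz2.open('sample.txt.bz2', 'wt', encoding='utf-8') as f:
    for s in sentences:
        f.write(s + '\n')

# Reading it back yields the original sentences, one per line.
with bz2.open('sample.txt.bz2', 'rt', encoding='utf-8') as f:
    loaded = [line.rstrip('\n') for line in f]

print(len(loaded))
```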
27 +
28 +
29 +## Architecture
30 +
31 +### Model
32 +![kosapcing_img](img/kosapcing_img.png)
33 +
34 +### Word Embedding
35 +#### 자모분해
36 +To capture shape similarity between Korean characters, we use a 자모분해 (jamo decomposition) FastText word embedding.
37 +ex)
38 +자연어처리
39 +ㅈ ㅏ – ㅇ ㅕ ㄴ ㅇ ㅓ – ㅊ ㅓ – ㄹ ㅣ –
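The decomposition above can be sketched with plain Unicode arithmetic (an illustrative snippet, not the repository's actual preprocessing; `-` marks an empty final consonant):

```python
# Jamo decomposition of precomposed Hangul syllables (U+AC00..U+D7A3).
# '-' is used as a filler when a syllable has no final consonant (종성).
CHO = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")            # 19 initial consonants
JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")      # 21 vowels
JONG = list("-ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals, index 0 = none

def decompose(text):
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:              # precomposed Hangul syllable
            out.append(CHO[code // 588])   # 588 = 21 * 28
            out.append(JUNG[(code % 588) // 28])
            out.append(JONG[code % 28])
        else:                              # pass non-Hangul characters through
            out.append(ch)
    return ''.join(out)

print(decompose('자연어처리'))  # ㅈㅏ-ㅇㅕㄴㅇㅓ-ㅊㅓ-ㄹㅣ-
```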
40 +
41 +#### 2 stage FastText
42 +Because 자모분해 is expensive to compute, the 자모분해 FastText embedding is used only for out-of-vocabulary characters.
43 +![2-stage-FastText_img](img/2-stage-FastText.png)
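The two-stage lookup can be sketched as follows. All names here are illustrative stand-ins (the real project loads gensim FastText models from `model/` and `jamo_model/`); the point is only the control flow: a cheap in-vocabulary lookup first, with the jamo-FastText fallback reserved for OOV characters.

```python
import numpy as np

# Stage 1: toy character-level vector table (stand-in for the word FastText model).
word_wv = {'한': np.array([1.0, 0.0])}

# Stage 2: stand-in for the jamo-decomposed FastText model, which in the
# real pipeline builds a vector from the character's jamo n-grams.
def jamo_fasttext_vector(ch):
    return np.array([0.0, 1.0])

def embed_char(ch):
    vec = word_wv.get(ch)
    if vec is not None:                    # in-vocabulary: cheap direct lookup
        return vec
    return jamo_fasttext_vector(ch)        # OOV: fall back to the jamo model

print(embed_char('한'))   # stage-1 hit
print(embed_char('℃'))    # OOV, handled by the stage-2 fallback
```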
44 +
45 +### Thresholding
46 +The middle part of the output probability distribution is nearly uniform, which makes a naive fixed threshold unreliable.
47 +![probability_distribution_of_output_vector](img/probability_distribution_of_output_vector.png)
48 +
49 +We apply a log transform and use the second derivative to locate the threshold.
50 +Result:
51 +![Thresholding_result](img/Thresholding_result.png)
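One plausible reading of this elbow heuristic, sketched below with a hypothetical `pick_threshold` helper (the actual implementation may differ): sort the predicted probabilities, log-transform them, and take the point of sharpest curvature in the discrete second derivative.

```python
import numpy as np

def pick_threshold(probs):
    # Sort the output probabilities, flatten the plateau with a log
    # transform, then find the elbow via a discrete second derivative.
    p = np.sort(probs)
    logp = np.log(p + 1e-8)        # small epsilon guards log(0)
    d2 = np.diff(logp, n=2)        # discrete second derivative
    elbow = np.argmax(d2) + 1      # index of sharpest curvature change
    return p[elbow]

# Toy distribution with confident low/high clusters and an ambiguous middle.
probs = np.array([0.01, 0.02, 0.03, 0.45, 0.50, 0.55, 0.97, 0.98, 0.99])
print(pick_threshold(probs))  # elbow at the edge of the low cluster: 0.03
```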
52 +
53 +
54 +
55 +## How to Run
56 +
57 +
58 +### Installation
59 +
60 +- For training, a GPU is strongly recommended. CPU is supported, but training may be extremely slow.
61 +- Only Python 3.7 and above are supported.
62 +### Requirements
63 +
64 +- Python (>= 3.7)
65 +- MXNet (>= 1.6.0)
66 +- tqdm (>= 4.19.5)
67 +- Pandas (>= 0.22.0)
68 +- Gensim (>= 3.8.1)
69 +- GluonNLP (>= 0.9.1)
70 +- soynlp (>= 0.0.493)
71 +
72 +### Dependencies
73 +
74 +```bash
75 +pip install -r requirements.txt
76 +```
77 +
78 +### Training
79 +
80 +```bash
81 +python train.py --train --train-samp-ratio 1.0 --num-epoch 50 --train_data data/train.txt.bz2 --test_data data/test.txt.bz2 --outputs train_log_to --model_type kospacing --model-file fasttext
82 +```
83 +
84 +### Evaluation
85 +
86 +```bash
87 +python train.py --model-params model/kospacing.params --model_type kospacing
88 +sent > 중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
89 +중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
90 +spaced sent[0.12sec/sent] > 중국은 2018년 평창동계올림픽의 반환점에 이르기까지 아직 노골드 행진이다.
91 +```
92 +
93 +### Directory
94 +Directory guide for the embedding model files.
95 + Bold entries are required.
96 +
97 +- model
98 + - **fasttext**
99 + - fasttext_vis
100 + - **fasttext.trainables.vectors_ngrams_lockf.npy**
101 + - **fasttext.wv.vectors_ngrams.npy**
102 + - **kospacing_wv.np**
103 + - **w2idx.dic**
104 +
105 +- jamo_model
106 + - **fasttext**
107 + - fasttext_vis
108 + - **fasttext.trainables.vectors_ngrams_lockf.npy**
109 + - **fasttext.wv.vectors_ngrams.npy**
110 + - **kospacing_wv.np**
111 + - **w2idx.dic**
112 +
113 +### Reference
114 +TrainKoSpacing: https://github.com/haven-jeon/TrainKoSpacing
115 +딥 러닝을 이용한 자연어 처리 입문: https://wikidocs.net/book/2155
116 +
......
1 + Apache License
2 + Version 2.0, January 2004
3 + http://www.apache.org/licenses/
4 +
5 + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 +
7 + 1. Definitions.
8 +
9 + "License" shall mean the terms and conditions for use, reproduction,
10 + and distribution as defined by Sections 1 through 9 of this document.
11 +
12 + "Licensor" shall mean the copyright owner or entity authorized by
13 + the copyright owner that is granting the License.
14 +
15 + "Legal Entity" shall mean the union of the acting entity and all
16 + other entities that control, are controlled by, or are under common
17 + control with that entity. For the purposes of this definition,
18 + "control" means (i) the power, direct or indirect, to cause the
19 + direction or management of such entity, whether by contract or
20 + otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 + outstanding shares, or (iii) beneficial ownership of such entity.
22 +
23 + "You" (or "Your") shall mean an individual or Legal Entity
24 + exercising permissions granted by this License.
25 +
26 + "Source" form shall mean the preferred form for making modifications,
27 + including but not limited to software source code, documentation
28 + source, and configuration files.
29 +
30 + "Object" form shall mean any form resulting from mechanical
31 + transformation or translation of a Source form, including but
32 + not limited to compiled object code, generated documentation,
33 + and conversions to other media types.
34 +
35 + "Work" shall mean the work of authorship, whether in Source or
36 + Object form, made available under the License, as indicated by a
37 + copyright notice that is included in or attached to the work
38 + (an example is provided in the Appendix below).
39 +
40 + "Derivative Works" shall mean any work, whether in Source or Object
41 + form, that is based on (or derived from) the Work and for which the
42 + editorial revisions, annotations, elaborations, or other modifications
43 + represent, as a whole, an original work of authorship. For the purposes
44 + of this License, Derivative Works shall not include works that remain
45 + separable from, or merely link (or bind by name) to the interfaces of,
46 + the Work and Derivative Works thereof.
47 +
48 + "Contribution" shall mean any work of authorship, including
49 + the original version of the Work and any modifications or additions
50 + to that Work or Derivative Works thereof, that is intentionally
51 + submitted to Licensor for inclusion in the Work by the copyright owner
52 + or by an individual or Legal Entity authorized to submit on behalf of
53 + the copyright owner. For the purposes of this definition, "submitted"
54 + means any form of electronic, verbal, or written communication sent
55 + to the Licensor or its representatives, including but not limited to
56 + communication on electronic mailing lists, source code control systems,
57 + and issue tracking systems that are managed by, or on behalf of, the
58 + Licensor for the purpose of discussing and improving the Work, but
59 + excluding communication that is conspicuously marked or otherwise
60 + designated in writing by the copyright owner as "Not a Contribution."
61 +
62 + "Contributor" shall mean Licensor and any individual or Legal Entity
63 + on behalf of whom a Contribution has been received by Licensor and
64 + subsequently incorporated within the Work.
65 +
66 + 2. Grant of Copyright License. Subject to the terms and conditions of
67 + this License, each Contributor hereby grants to You a perpetual,
68 + worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 + copyright license to reproduce, prepare Derivative Works of,
70 + publicly display, publicly perform, sublicense, and distribute the
71 + Work and such Derivative Works in Source or Object form.
72 +
73 + 3. Grant of Patent License. Subject to the terms and conditions of
74 + this License, each Contributor hereby grants to You a perpetual,
75 + worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 + (except as stated in this section) patent license to make, have made,
77 + use, offer to sell, sell, import, and otherwise transfer the Work,
78 + where such license applies only to those patent claims licensable
79 + by such Contributor that are necessarily infringed by their
80 + Contribution(s) alone or by combination of their Contribution(s)
81 + with the Work to which such Contribution(s) was submitted. If You
82 + institute patent litigation against any entity (including a
83 + cross-claim or counterclaim in a lawsuit) alleging that the Work
84 + or a Contribution incorporated within the Work constitutes direct
85 + or contributory patent infringement, then any patent licenses
86 + granted to You under this License for that Work shall terminate
87 + as of the date such litigation is filed.
88 +
89 + 4. Redistribution. You may reproduce and distribute copies of the
90 + Work or Derivative Works thereof in any medium, with or without
91 + modifications, and in Source or Object form, provided that You
92 + meet the following conditions:
93 +
94 + (a) You must give any other recipients of the Work or
95 + Derivative Works a copy of this License; and
96 +
97 + (b) You must cause any modified files to carry prominent notices
98 + stating that You changed the files; and
99 +
100 + (c) You must retain, in the Source form of any Derivative Works
101 + that You distribute, all copyright, patent, trademark, and
102 + attribution notices from the Source form of the Work,
103 + excluding those notices that do not pertain to any part of
104 + the Derivative Works; and
105 +
106 + (d) If the Work includes a "NOTICE" text file as part of its
107 + distribution, then any Derivative Works that You distribute must
108 + include a readable copy of the attribution notices contained
109 + within such NOTICE file, excluding those notices that do not
110 + pertain to any part of the Derivative Works, in at least one
111 + of the following places: within a NOTICE text file distributed
112 + as part of the Derivative Works; within the Source form or
113 + documentation, if provided along with the Derivative Works; or,
114 + within a display generated by the Derivative Works, if and
115 + wherever such third-party notices normally appear. The contents
116 + of the NOTICE file are for informational purposes only and
117 + do not modify the License. You may add Your own attribution
118 + notices within Derivative Works that You distribute, alongside
119 + or as an addendum to the NOTICE text from the Work, provided
120 + that such additional attribution notices cannot be construed
121 + as modifying the License.
122 +
123 + You may add Your own copyright statement to Your modifications and
124 + may provide additional or different license terms and conditions
125 + for use, reproduction, or distribution of Your modifications, or
126 + for any such Derivative Works as a whole, provided Your use,
127 + reproduction, and distribution of the Work otherwise complies with
128 + the conditions stated in this License.
129 +
130 + 5. Submission of Contributions. Unless You explicitly state otherwise,
131 + any Contribution intentionally submitted for inclusion in the Work
132 + by You to the Licensor shall be under the terms and conditions of
133 + this License, without any additional terms or conditions.
134 + Notwithstanding the above, nothing herein shall supersede or modify
135 + the terms of any separate license agreement you may have executed
136 + with Licensor regarding such Contributions.
137 +
138 + 6. Trademarks. This License does not grant permission to use the trade
139 + names, trademarks, service marks, or product names of the Licensor,
140 + except as required for reasonable and customary use in describing the
141 + origin of the Work and reproducing the content of the NOTICE file.
142 +
143 + 7. Disclaimer of Warranty. Unless required by applicable law or
144 + agreed to in writing, Licensor provides the Work (and each
145 + Contributor provides its Contributions) on an "AS IS" BASIS,
146 + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 + implied, including, without limitation, any warranties or conditions
148 + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 + PARTICULAR PURPOSE. You are solely responsible for determining the
150 + appropriateness of using or redistributing the Work and assume any
151 + risks associated with Your exercise of permissions under this License.
152 +
153 + 8. Limitation of Liability. In no event and under no legal theory,
154 + whether in tort (including negligence), contract, or otherwise,
155 + unless required by applicable law (such as deliberate and grossly
156 + negligent acts) or agreed to in writing, shall any Contributor be
157 + liable to You for damages, including any direct, indirect, special,
158 + incidental, or consequential damages of any character arising as a
159 + result of this License or out of the use or inability to use the
160 + Work (including but not limited to damages for loss of goodwill,
161 + work stoppage, computer failure or malfunction, or any and all
162 + other commercial damages or losses), even if such Contributor
163 + has been advised of the possibility of such damages.
164 +
165 + 9. Accepting Warranty or Additional Liability. While redistributing
166 + the Work or Derivative Works thereof, You may choose to offer,
167 + and charge a fee for, acceptance of support, warranty, indemnity,
168 + or other liability obligations and/or rights consistent with this
169 + License. However, in accepting such obligations, You may act only
170 + on Your own behalf and on Your sole responsibility, not on behalf
171 + of any other Contributor, and only if You agree to indemnify,
172 + defend, and hold each Contributor harmless for any liability
173 + incurred by, or claims asserted against, such Contributor by reason
174 + of your accepting any such warranty or additional liability.
175 +
176 + END OF TERMS AND CONDITIONS
177 +
178 + APPENDIX: How to apply the Apache License to your work.
179 +
180 + To apply the Apache License to your work, attach the following
181 + boilerplate notice, with the fields enclosed by brackets "[]"
182 + replaced with your own identifying information. (Don't include
183 + the brackets!) The text should be enclosed in the appropriate
184 + comment syntax for the file format. We also recommend that a
185 + file or class name and description of purpose be included on the
186 + same "printed page" as the copyright notice for easier
187 + identification within third-party archives.
188 +
189 + Copyright [yyyy] [name of copyright owner]
190 +
191 + Licensed under the Apache License, Version 2.0 (the "License");
192 + you may not use this file except in compliance with the License.
193 + You may obtain a copy of the License at
194 +
195 + http://www.apache.org/licenses/LICENSE-2.0
196 +
197 + Unless required by applicable law or agreed to in writing, software
198 + distributed under the License is distributed on an "AS IS" BASIS,
199 + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 + See the License for the specific language governing permissions and
201 + limitations under the License.
1 +# coding=utf-8
2 +# Copyright 2020 Heewon Jeon. All rights reserved.
3 +#
4 +# Licensed under the Apache License, Version 2.0 (the "License");
5 +# you may not use this file except in compliance with the License.
6 +# You may obtain a copy of the License at
7 +#
8 +# http://www.apache.org/licenses/LICENSE-2.0
9 +#
10 +# Unless required by applicable law or agreed to in writing, software
11 +# distributed under the License is distributed on an "AS IS" BASIS,
12 +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 +# See the License for the specific language governing permissions and
14 +# limitations under the License.
15 +
16 +import argparse
17 +from utils.embedding_maker import create_embeddings
18 +
19 +
20 +parser = argparse.ArgumentParser(description='Korean Autospacing Embedding Maker')
21 +
22 +parser.add_argument('--num-iters', type=int, default=5,
23 + help='number of iterations to train (default: 5)')
24 +
25 +parser.add_argument('--min-count', type=int, default=100,
26 + help='minimum word count to filter (default: 100)')
27 +
28 +parser.add_argument('--embedding-size', type=int, default=100,
29 + help='embedding dimension size (default: 100)')
30 +
31 +parser.add_argument('--num-worker', type=int, default=16,
32 + help='number of worker threads (default: 16)')
33 +
34 +parser.add_argument('--window-size', type=int, default=8,
35 + help='skip-gram window size (default: 8)')
36 +
37 +parser.add_argument('--corpus_dir', type=str, default='data',
38 + help='training resource dir')
39 +
40 +parser.add_argument('--train', action='store_true', default=True,
41 + help='do embedding training (default: True)')
42 +
43 +parser.add_argument('--model-file', type=str, default='kospacing_wv.mdl',
44 + help='output object from Word2Vec() (default: kospacing_wv.mdl)')
45 +
46 +parser.add_argument('--numpy-wv', type=str, default='kospacing_wv.np',
47 + help='numpy object file path from Word2Vec() (default: kospacing_wv.np)')
48 +
49 +parser.add_argument('--w2idx', type=str, default='w2idx.dic',
50 + help='item to index json dictionary (default: w2idx.dic)')
51 +
52 +parser.add_argument('--model-dir', type=str, default='model',
53 + help='dir to save models (default: model)')
54 +
55 +opt = parser.parse_args()
56 +
57 +if opt.train:
58 + create_embeddings(opt.corpus_dir, opt.model_dir + '/' +
59 + opt.model_file, opt.model_dir + '/' + opt.numpy_wv,
60 + opt.model_dir + '/' + opt.w2idx, min_count=opt.min_count,
61 + iter=opt.num_iters,
62 + size=opt.embedding_size, workers=opt.num_worker, window=opt.window_size)
File mode changed
File mode changed
1 +absl-py==0.11.0
2 +astunparse==1.6.3
3 +cachetools==4.2.1
4 +certifi==2020.12.5
5 +chardet==4.0.0
6 +click==7.1.2
7 +cmake==3.18.4.post1
8 +Cython==0.29.21
9 +Flask==1.1.2
10 +Flask-Cors==3.0.9
11 +flatbuffers==1.12
12 +gast==0.3.3
13 +gensim==3.8.3
14 +gluonnlp==0.10.0
15 +google-auth==1.26.1
16 +google-auth-oauthlib==0.4.2
17 +google-pasta==0.2.0
18 +graphviz==0.8.4
19 +grpcio==1.32.0
20 +h5py==2.10.0
21 +idna==2.10
22 +importlib-metadata==3.4.0
23 +itsdangerous==1.1.0
24 +Jinja2==2.11.2
25 +joblib==1.0.1
26 +Keras==2.4.3
27 +Keras-Preprocessing==1.1.2
28 +Markdown==3.3.3
29 +MarkupSafe==1.1.1
30 +mxnet-cu101==1.7.0
31 +mxnet-cu101mkl==1.6.0.post0
32 +mxnet-mkl==1.6.0
33 +numpy==1.19.5
34 +oauthlib==3.1.0
35 +opt-einsum==3.3.0
36 +packaging==20.9
37 +pandas==1.2.2
38 +protobuf==3.14.0
39 +psutil==5.8.0
40 +pyasn1==0.4.8
41 +pyasn1-modules==0.2.8
42 +pyparsing==2.4.7
43 +python-dateutil==2.8.1
44 +pytz==2020.5
45 +PyYAML==5.3.1
46 +requests==2.25.1
47 +requests-oauthlib==1.3.0
48 +rsa==4.6
49 +scikit-learn==0.24.1
50 +scipy==1.6.0
51 +six==1.15.0
52 +smart-open==4.0.1
53 +soynlp==0.0.493
54 +tensorboard==2.4.0
55 +tensorboard-plugin-wit==1.7.0
56 +tensorflow==2.4.1
57 +tensorflow-estimator==2.4.0
58 +termcolor==1.1.0
59 +threadpoolctl==2.1.0
60 +tqdm==4.56.0
61 +typing-extensions==3.7.4.3
62 +urllib3==1.26.3
63 +Werkzeug==1.0.1
64 +wrapt==1.12.1
65 +zipp==3.4.0
1 +# coding=utf-8
2 +# Copyright 2020 Heewon Jeon. All rights reserved.
3 +#
4 +# Licensed under the Apache License, Version 2.0 (the "License");
5 +# you may not use this file except in compliance with the License.
6 +# You may obtain a copy of the License at
7 +#
8 +# http://www.apache.org/licenses/LICENSE-2.0
9 +#
10 +# Unless required by applicable law or agreed to in writing, software
11 +# distributed under the License is distributed on an "AS IS" BASIS,
12 +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 +# See the License for the specific language governing permissions and
14 +# limitations under the License.
15 +
16 +import argparse
17 +import bz2
18 +import logging
19 +import re
20 +import time
21 +from functools import lru_cache
22 +from timeit import default_timer as timer
23 +
24 +import gluonnlp as nlp
25 +import mxnet as mx
26 +import mxnet.autograd as autograd
27 +import numpy as np
28 +from mxnet import gluon
29 +from mxnet.gluon import nn, rnn
30 +from tqdm import tqdm
31 +import csv
32 +
33 +from utils.embedding_maker import (encoding_and_padding, load_embedding,
34 + load_vocab)
35 +
36 +logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s] %(message)s")
37 +logger = logging.getLogger()
38 +
39 +parser = argparse.ArgumentParser(description='Korean Autospacing Trainer')
40 +parser.add_argument('--num-epoch',
41 + type=int,
42 + default=5,
43 + help='number of iterations to train (default: 5)')
44 +
45 +parser.add_argument('--n-hidden',
46 + type=int,
47 + default=200,
48 + help='GRU hidden size (default: 200)')
49 +
50 +parser.add_argument('--max-seq-len',
51 + type=int,
52 + default=200,
53 + help='max sentence length on input (default: 200)')
54 +
55 +parser.add_argument('--num-gpus',
56 + type=int,
57 + default=1,
58 + help='number of gpus (default: 1)')
59 +
60 +parser.add_argument('--vocab-file',
61 + type=str,
62 + default='model/w2idx.dic',
63 + help='vocabulary file (default: model/w2idx.dic)')
64 +
65 +parser.add_argument(
66 + '--embedding-file',
67 + type=str,
68 + default='model/kospacing_wv.np',
69 + help='embedding matrix file (default: model/kospacing_wv.np)')
70 +
71 +parser.add_argument('--train',
72 + action='store_true',
73 + default=False,
74 + help='do training (default: False)')
75 +
76 +parser.add_argument(
77 + '--model-file',
78 + type=str,
79 + default='kospacing_wv.mdl',
80 + help='output object from Word2Vec() (default: kospacing_wv.mdl)')
81 +
82 +parser.add_argument('--train-samp-ratio',
83 + type=float,
84 + default=0.50,
85 + help='random train sample ratio (default: 0.50)')
86 +
87 +parser.add_argument('--model-prefix',
88 + type=str,
89 + default='kospacing',
90 + help='prefix of output model file (default: kospacing)')
91 +
92 +parser.add_argument('--model-params',
93 + type=str,
94 + default='kospacing_0.params',
95 + help='model params file (default: kospacing_0.params)')
96 +
97 +parser.add_argument('--test',
98 + action='store_true',
99 + default=False,
100 + help='eval train set (default: False)')
101 +
102 +parser.add_argument('--batch_size',
103 + type=int,
104 + default=100,
105 + help='train batch size')
106 +
107 +parser.add_argument('--test_batch_size',
108 + type=int,
109 + default=100,
110 + help='test batch size')
111 +
112 +parser.add_argument('--n_workers',
113 + type=int,
114 + default=10,
115 + help='number of dataloader workers')
116 +
117 +parser.add_argument('--train_data',
118 + type=str,
119 + default='data/UCorpus_spacing_train.txt.bz2',
120 + help='bzip2-compressed train data')
121 +
122 +parser.add_argument('--test_data',
123 + type=str,
124 + default='data/UCorpus_spacing_test.txt.bz2',
125 + help='bzip2-compressed test data')
126 +
127 +parser.add_argument('--model_type',
128 + type=str,
129 + default='kospacing',
130 + help='kospacing or kospacing2')
131 +
132 +parser.add_argument('--outputs',
133 + type=str,
134 + default='outputs',
135 + help='directory to save log and model params')
136 +
137 +opt = parser.parse_args()
138 +
139 +nlp.utils.mkdir(opt.outputs)
140 +
141 +fileHandler = logging.FileHandler(opt.outputs + '/' + 'log.log')
142 +fileHandler.setFormatter(logFormatter)
143 +logger.addHandler(fileHandler)
144 +
145 +consoleHandler = logging.StreamHandler()
146 +consoleHandler.setFormatter(logFormatter)
147 +logger.addHandler(consoleHandler)
148 +
149 +logger.setLevel(logging.DEBUG)
150 +logger.info(opt)
151 +
152 +GPU_COUNT = opt.num_gpus
153 +ctx = [mx.gpu(i) for i in range(GPU_COUNT)]
154 +
155 +
156 +# Model class
157 +class korean_autospacing_base(gluon.HybridBlock):
158 + def __init__(self, n_hidden, vocab_size, embed_dim, max_seq_length,
159 + **kwargs):
160 + super(korean_autospacing_base, self).__init__(**kwargs)
161 + # input sequence length
162 + self.in_seq_len = max_seq_length
163 + # output sequence length
164 + self.out_seq_len = max_seq_length
165 + # number of GRU hidden units
166 + self.n_hidden = n_hidden
167 + # number of unique characters
168 + self.vocab_size = vocab_size
169 + # max_seq_length
170 + self.max_seq_length = max_seq_length
171 + # embedding dimension
172 + self.embed_dim = embed_dim
173 +
174 + with self.name_scope():
175 + self.embedding = nn.Embedding(input_dim=self.vocab_size,
176 + output_dim=self.embed_dim)
177 +
178 + self.conv_unigram = nn.Conv2D(channels=128,
179 + kernel_size=(1, self.embed_dim))
180 +
181 + self.conv_bigram = nn.Conv2D(channels=256,
182 + kernel_size=(2, self.embed_dim),
183 + padding=(1, 0))
184 +
185 + self.conv_trigram = nn.Conv2D(channels=128,
186 + kernel_size=(3, self.embed_dim),
187 + padding=(1, 0))
188 +
189 + self.conv_forthgram = nn.Conv2D(channels=64,
190 + kernel_size=(4, self.embed_dim),
191 + padding=(2, 0))
192 +
193 + self.conv_fifthgram = nn.Conv2D(channels=32,
194 + kernel_size=(5, self.embed_dim),
195 + padding=(2, 0))
196 +
197 + self.bi_gru = rnn.GRU(hidden_size=self.n_hidden, layout='NTC', bidirectional=True)
198 + self.dense_sh = nn.Dense(100, activation='relu', flatten=False)
199 + self.dense = nn.Dense(1, activation='sigmoid', flatten=False)
200 +
201 + def hybrid_forward(self, F, inputs):
202 + embed = self.embedding(inputs)
203 + embed = F.expand_dims(embed, axis=1)
204 + unigram = self.conv_unigram(embed)
205 + bigram = self.conv_bigram(embed)
206 + trigram = self.conv_trigram(embed)
207 + forthgram = self.conv_forthgram(embed)
208 + fifthgram = self.conv_fifthgram(embed)
209 +
210 + grams = F.concat(unigram,
211 + F.slice_axis(bigram,
212 + axis=2,
213 + begin=0,
214 + end=self.max_seq_length),
215 + trigram,
216 + F.slice_axis(forthgram,
217 + axis=2,
218 + begin=0,
219 + end=self.max_seq_length),
220 + F.slice_axis(fifthgram,
221 + axis=2,
222 + begin=0,
223 + end=self.max_seq_length),
224 + dim=1)
225 +
226 + grams = F.transpose(grams, (0, 2, 3, 1))
227 + grams = F.reshape(grams, (-1, self.max_seq_length, -3))
228 + grams = self.bi_gru(grams)
229 + fc1 = self.dense_sh(grams)
230 + return (self.dense(fc1))
231 +
232 +
233 +# https://raw.githubusercontent.com/haven-jeon/Train_KoSpacing/master/img/kosapcing_img.png
234 +class korean_autospacing2(gluon.HybridBlock):
235 + def __init__(self, n_hidden, vocab_size, embed_dim, max_seq_length,
236 + **kwargs):
237 + super(korean_autospacing2, self).__init__(**kwargs)
238 + # input sequence length
239 + self.in_seq_len = max_seq_length
240 + # output sequence length
241 + self.out_seq_len = max_seq_length
242 + # number of GRU hidden units
243 + self.n_hidden = n_hidden
244 + # number of unique characters
245 + self.vocab_size = vocab_size
246 + # max_seq_length
247 + self.max_seq_length = max_seq_length
248 + # embedding dimension
249 + self.embed_dim = embed_dim
250 +
251 + with self.name_scope():
252 + self.embedding = nn.Embedding(input_dim=self.vocab_size,
253 + output_dim=self.embed_dim)
254 +
255 + self.conv_unigram = nn.Conv2D(channels=128,
256 + kernel_size=(1, self.embed_dim))
257 +
258 + self.conv_bigram = nn.Conv2D(channels=128,
259 + kernel_size=(2, self.embed_dim),
260 + padding=(1, 0))
261 +
262 + self.conv_trigram = nn.Conv2D(channels=64,
263 + kernel_size=(3, self.embed_dim),
264 + padding=(2, 0))
265 +
266 + self.conv_forthgram = nn.Conv2D(channels=32,
267 + kernel_size=(4, self.embed_dim),
268 + padding=(3, 0))
269 +
270 + self.conv_fifthgram = nn.Conv2D(channels=16,
271 + kernel_size=(5, self.embed_dim),
272 + padding=(4, 0))
273 + # for reverse convolution
274 + self.conv_rev_bigram = nn.Conv2D(channels=128,
275 + kernel_size=(2, self.embed_dim),
276 + padding=(1, 0))
277 +
278 + self.conv_rev_trigram = nn.Conv2D(channels=64,
279 + kernel_size=(3, self.embed_dim),
280 + padding=(2, 0))
281 +
282 + self.conv_rev_forthgram = nn.Conv2D(channels=32,
283 + kernel_size=(4,
284 + self.embed_dim),
285 + padding=(3, 0))
286 +
287 + self.conv_rev_fifthgram = nn.Conv2D(channels=16,
288 + kernel_size=(5,
289 + self.embed_dim),
290 + padding=(4, 0))
291 + self.bi_gru = rnn.GRU(hidden_size=self.n_hidden, layout='NTC', bidirectional=True)
292 + # self.bi_gru = rnn.BidirectionalCell(
293 + # rnn.GRUCell(hidden_size=self.n_hidden),
294 + # rnn.GRUCell(hidden_size=self.n_hidden))
295 + self.dense_sh = nn.Dense(100, activation='relu', flatten=False)
296 + self.dense = nn.Dense(1, activation='sigmoid', flatten=False)
297 +
298 + def hybrid_forward(self, F, inputs):
299 + embed = self.embedding(inputs)
300 + embed = F.expand_dims(embed, axis=1)
301 + rev_embed = embed.flip(axis=2)
302 +
303 + unigram = self.conv_unigram(embed)
304 + bigram = self.conv_bigram(embed)
305 + trigram = self.conv_trigram(embed)
306 + forthgram = self.conv_forthgram(embed)
307 + fifthgram = self.conv_fifthgram(embed)
308 +
309 + rev_bigram = self.conv_rev_bigram(rev_embed).flip(axis=2)
310 + rev_trigram = self.conv_rev_trigram(rev_embed).flip(axis=2)
311 + rev_forthgram = self.conv_rev_forthgram(rev_embed).flip(axis=2)
312 + rev_fifthgram = self.conv_rev_fifthgram(rev_embed).flip(axis=2)
313 +
314 + grams = F.concat(unigram,
315 + F.slice_axis(bigram,
316 + axis=2,
317 + begin=0,
318 + end=self.max_seq_length),
319 + F.slice_axis(rev_bigram,
320 + axis=2,
321 + begin=0,
322 + end=self.max_seq_length),
323 + F.slice_axis(trigram,
324 + axis=2,
325 + begin=0,
326 + end=self.max_seq_length),
327 + F.slice_axis(rev_trigram,
328 + axis=2,
329 + begin=0,
330 + end=self.max_seq_length),
331 + F.slice_axis(forthgram,
332 + axis=2,
333 + begin=0,
334 + end=self.max_seq_length),
335 + F.slice_axis(rev_forthgram,
336 + axis=2,
337 + begin=0,
338 + end=self.max_seq_length),
339 + F.slice_axis(fifthgram,
340 + axis=2,
341 + begin=0,
342 + end=self.max_seq_length),
343 + F.slice_axis(rev_fifthgram,
344 + axis=2,
345 + begin=0,
346 + end=self.max_seq_length),
347 + dim=1)
348 +
349 + grams = F.transpose(grams, (0, 2, 3, 1))
350 + grams = F.reshape(grams, (-1, self.max_seq_length, -3))
351 + grams = self.bi_gru(grams)
352 + fc1 = self.dense_sh(grams)
353 + return (self.dense(fc1))
354 +
355 +
356 +def y_encoding(n_grams, maxlen=200):
357 + # encode the label matrix from the input sentences
358 + init_mat = np.zeros(shape=(len(n_grams), maxlen), dtype=np.int8)
359 + for i in range(len(n_grams)):
360 + init_mat[i, np.cumsum([len(j) for j in n_grams[i]]) - 1] = 1
361 + return init_mat
362 +
363 +
364 +def split_train_set(x_train, p=0.98):
365 + """
366 + > split_train_set(pd.DataFrame({'a':[1,2,3,4,None], 'b':[5,6,7,8,9]}))
367 + (array([0, 4, 3]), [1, 2])
368 + """
370 + train_idx = np.random.choice(range(x_train.shape[0]),
371 + int(x_train.shape[0] * p),
372 + replace=False)
373 + set_tr_idx = set(train_idx)
374 + test_index = [i for i in range(x_train.shape[0]) if i not in set_tr_idx]
375 + return ((train_idx, np.array(test_index)))
376 +
377 +
378 +def get_generator(x, y, batch_size):
379 + tr_set = gluon.data.ArrayDataset(x, y.astype('float32'))
380 + tr_data_iterator = gluon.data.DataLoader(tr_set,
381 + batch_size=batch_size,
382 + shuffle=True,
383 + num_workers=opt.n_workers)
384 + return (tr_data_iterator)
385 +
386 +
387 +def pick_model(model_nm, n_hidden, vocab_size, embed_dim, max_seq_length):
388 + if model_nm.lower() == 'kospacing':
389 + model = korean_autospacing_base(n_hidden=n_hidden,
390 + vocab_size=vocab_size,
391 + embed_dim=embed_dim,
392 + max_seq_length=max_seq_length)
393 + elif model_nm.lower() == 'kospacing2':
394 + model = korean_autospacing2(n_hidden=n_hidden,
395 + vocab_size=vocab_size,
396 + embed_dim=embed_dim,
397 + max_seq_length=max_seq_length)
398 + else:
399 + raise ValueError('unknown model_type: ' + model_nm)
400 + return model
401 +
402 +
403 +def model_init(n_hidden, vocab_size, embed_dim, max_seq_length, ctx):
404 + # create the model instance and define the trainer and loss
405 + # n_hidden, vocab_size, embed_dim, max_seq_length
406 + model = pick_model(opt.model_type, n_hidden, vocab_size, embed_dim, max_seq_length)
407 + model.collect_params().initialize(mx.init.Xavier(), ctx=ctx)
408 + model.embedding.weight.set_data(weights)
409 + model.hybridize(static_alloc=True)
410 + # freeze the embedding weights
411 + model.embedding.collect_params().setattr('grad_req', 'null')
412 + trainer = gluon.Trainer(model.collect_params(), 'rmsprop')
413 + loss = gluon.loss.SigmoidBinaryCrossEntropyLoss(from_sigmoid=True)
414 + loss.hybridize(static_alloc=True)
415 + return (model, loss, trainer)
416 +
417 +
418 +def evaluate_accuracy(data_iterator, net, pad_idx, ctx, n=5000):
419 +    # iterate over each sequence up to its true length and measure accuracy
420 +    # (not optimized)
421 + acc = mx.metric.Accuracy(axis=0)
422 + num_of_test = 0
423 + for i, (data, label) in enumerate(data_iterator):
424 + data = data.as_in_context(ctx)
425 + label = label.as_in_context(ctx)
426 + # get sentence length
427 + data_np = data.asnumpy()
428 + lengths = np.argmax(np.where(data_np == pad_idx, np.ones_like(data_np),
429 + np.zeros_like(data_np)),
430 + axis=1)
431 + output = net(data)
432 + pred_label = output.squeeze(axis=2) > 0.5
433 +
434 +        for j in range(data.shape[0]):
435 +            acc.update(preds=pred_label[j, :lengths[j]],
436 +                       labels=label[j, :lengths[j]])
437 +        num_of_test += data.shape[0]
438 +        if num_of_test > n:
439 +            break
440 + return acc.get()[1]
441 +
442 +
443 +def train(epochs,
444 + tr_data_iterator,
445 + te_data_iterator,
446 + va_data_iterator,
447 + model,
448 + loss,
449 + trainer,
450 + pad_idx,
451 + ctx,
452 + mdl_desc="spacing_model",
453 + decay=False):
454 +    # training loop
455 + tot_test_acc = []
456 + tot_train_loss = []
457 + for e in range(epochs):
458 + tic = time.time()
459 + # Decay learning rate.
460 + if e > 1 and decay:
461 + trainer.set_learning_rate(trainer.learning_rate * 0.7)
462 + train_loss = []
463 + iter_tqdm = tqdm(tr_data_iterator, 'Batches')
464 + for i, (x_data, y_data) in enumerate(iter_tqdm):
465 + x_data_l = gluon.utils.split_and_load(x_data,
466 + ctx,
467 + even_split=False)
468 + y_data_l = gluon.utils.split_and_load(y_data,
469 + ctx,
470 + even_split=False)
471 +
472 + with autograd.record():
473 + losses = [
474 + loss(model(x), y) for x, y in zip(x_data_l, y_data_l)
475 + ]
476 + for l in losses:
477 + l.backward()
478 + trainer.step(x_data.shape[0])
479 + curr_loss = np.mean([mx.nd.mean(l).asscalar() for l in losses])
480 + train_loss.append(curr_loss)
481 + iter_tqdm.set_description("loss {}".format(curr_loss))
482 + mx.nd.waitall()
483 +
484 +        # calculate test and validation accuracy
485 + test_acc = evaluate_accuracy(
486 + te_data_iterator,
487 + model,
488 + pad_idx,
489 + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
490 + valid_acc = evaluate_accuracy(
491 + va_data_iterator,
492 + model,
493 + pad_idx,
494 + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
495 + logger.info('[Epoch %d] time cost: %f' % (e, time.time() - tic))
496 + logger.info("[Epoch %d] Train Loss: %f, Test acc : %f Valid acc : %f" %
497 + (e, np.mean(train_loss), test_acc, valid_acc))
498 + tot_test_acc.append(test_acc)
499 + tot_train_loss.append(np.mean(train_loss))
500 + model.save_parameters(opt.outputs + '/' + "{}_{}.params".format(mdl_desc, e))
501 + return (tot_test_acc, tot_train_loss)
502 +
503 +
504 +def pre_processing(sentences):
505 +    # spaces become '^'
506 +    char_list = [li.strip().replace(' ', '^') for li in sentences]
507 +    # '«' marks the sentence start, '»' marks the sentence end
508 +    char_list = ["«" + li + "»" for li in char_list]
509 +    return char_list
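The marker convention implemented above, in miniature (spaces become `'^'`, `'«'`/`'»'` delimit the sentence):

```python
sent = '아버지가 방에 들어가신다.'
tagged = '«' + sent.strip().replace(' ', '^') + '»'
print(tagged)  # → «아버지가^방에^들어가신다.»
```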
513 +
514 +
515 +def make_input_data(inputs,
516 + train_ratio,
517 + sampling,
518 + make_lag_set=False,
519 + batch_size=200):
520 + with bz2.open(inputs, 'rt') as f:
521 + line_list = [i.strip() for i in f.readlines() if i.strip() != '']
522 + logger.info('complete loading train file!')
523 +
524 + # 아버지가 방에 들어가신다. -> '«아버지가^방에^들어가신다.»'
525 + processed_seq = pre_processing(line_list)
526 + logger.info(processed_seq[0])
527 + # n percent random sample
528 + logger.info('random sampling on training set!')
529 + samp_idx = np.random.choice(range(len(processed_seq)),
530 + int(len(processed_seq) * sampling),
531 + replace=False)
532 + processed_seq_samp = [processed_seq[i] for i in samp_idx]
533 + sp_sents = [i.split('^') for i in processed_seq_samp]
534 +
535 + sp_sents = list(filter(lambda x: len(x) >= 8, sp_sents))
536 +
537 +    # build training samples: windows of up to 8 eojeol, shifted by one eojeol
538 +    if make_lag_set:
539 + n_gram = [[k, v, z, a, c, d, e, f]
540 + for sent in sp_sents for k, v, z, a, c, d, e, f in zip(
541 + sent, sent[1:], sent[2:], sent[3:], sent[4:], sent[5:],
542 + sent[6:], sent[7:])]
543 + else:
544 + n_gram = sp_sents
545 +    # keep only samples of at most max_seq_len characters
546 + n_gram = [i for i in n_gram if len("^".join(i)) <= opt.max_seq_len]
547 +    # encode the target labels (y)
548 + n_gram_y = y_encoding(n_gram, opt.max_seq_len)
549 + logger.info(n_gram[0])
550 + logger.info(n_gram_y[0])
551 +    # load the vocab file
552 + w2idx, _ = load_vocab(opt.vocab_file)
553 +
554 +    # strip spaces and encode characters as indices to build the training set
555 +    logger.info('index encoding!')
556 + ngram_coding_seq = encoding_and_padding(
557 + word2idx_dic=w2idx,
558 + sequences=[''.join(gram) for gram in n_gram],
559 + maxlen=opt.max_seq_len,
560 + padding='post',
561 + truncating='post')
562 + logger.info(ngram_coding_seq[0])
563 + if train_ratio < 1:
564 +        # split into train and test sets
565 + tr_idx, te_idx = split_train_set(ngram_coding_seq, train_ratio)
566 +
567 + y_train = n_gram_y[tr_idx, ]
568 + x_train = ngram_coding_seq[tr_idx, ]
569 +
570 + y_test = n_gram_y[te_idx, ]
571 + x_test = ngram_coding_seq[te_idx, ]
572 +
573 + # train generator
574 + train_generator = get_generator(x_train, y_train, batch_size)
575 + valid_generator = get_generator(x_test, y_test, 500)
576 + return (train_generator, valid_generator)
577 + else:
578 + train_generator = get_generator(ngram_coding_seq, n_gram_y, batch_size)
579 + return (train_generator)
580 +
581 +
582 +if opt.train:
583 +    # load the vocab file
584 + w2idx, idx2w = load_vocab(opt.vocab_file)
585 +    # load the embedding file
586 + weights = load_embedding(opt.embedding_file)
587 + vocab_size = weights.shape[0]
588 + embed_dim = weights.shape[1]
589 +
590 + train_generator, valid_generator = make_input_data(
591 + opt.train_data,
592 + train_ratio=0.95,
593 + sampling=opt.train_samp_ratio,
594 + make_lag_set=True,
595 + batch_size=opt.batch_size)
596 +
597 + test_generator = make_input_data(opt.test_data,
598 + sampling=1,
599 + train_ratio=1,
600 + make_lag_set=True,
601 + batch_size=opt.test_batch_size)
602 +
603 + model, loss, trainer = model_init(n_hidden=opt.n_hidden,
604 + vocab_size=vocab_size,
605 + embed_dim=embed_dim,
606 + max_seq_length=opt.max_seq_len,
607 + ctx=ctx)
608 + logger.info('start training!')
609 + train(epochs=opt.num_epoch,
610 + tr_data_iterator=train_generator,
611 + te_data_iterator=test_generator,
612 + va_data_iterator=valid_generator,
613 + model=model,
614 + loss=loss,
615 + trainer=trainer,
616 + pad_idx=w2idx['__PAD__'],
617 + ctx=ctx,
618 + mdl_desc=opt.model_prefix)
619 +
620 +
621 +class pred_spacing:
622 + def __init__(self, model, w2idx):
623 + self.model = model
624 + self.w2idx = w2idx
625 + self.pattern = re.compile(r'\s+')
626 +
627 + @lru_cache(maxsize=None)
628 + def get_spaced_sent(self, raw_sent):
629 + raw_sent_ = "«" + raw_sent + "»"
630 + raw_sent_ = raw_sent_.replace(' ', '^')
631 + sents_in = [
632 + raw_sent_,
633 + ]
634 + mat_in = encoding_and_padding(word2idx_dic=self.w2idx,
635 + sequences=sents_in,
636 + maxlen=opt.max_seq_len,
637 + padding='post',
638 + truncating='post')
639 + mat_in = mx.nd.array(mat_in, ctx=mx.cpu(0))
640 + results = self.model(mat_in)
641 + mat_set = results[0, ]
642 +
643 +        # log transform stretches the low/middle probability range
644 +        r = 255
645 +        c = 1 / np.log(1 + r)
646 +        log_scaled = c * mx.nd.log(1 + r * mat_set[:len(raw_sent_)])
647 +        # discrete second derivative; a local peak has d_2 < 0
648 +        d_2 = [1]
649 +        for i in range(1, len(raw_sent_)):
650 +            d_2.append(mat_set[i - 1] - (2 * mat_set[i]) + mat_set[i + 1])
651 +        preds = np.array(
652 +            ['1' if log_scaled[i] > 0.01 and d_2[i] < 0 else '0'
653 +             for i in range(len(raw_sent_))])
659 + return self.make_pred_sents(raw_sent_, preds)
660 +
661 + def make_pred_sents(self, x_sents, y_pred):
662 + res_sent = []
663 + for i, j in zip(x_sents, y_pred):
664 + if j == '1':
665 + res_sent.append(i)
666 + res_sent.append(' ')
667 + else:
668 + res_sent.append(i)
669 + subs = re.sub(self.pattern, ' ', ''.join(res_sent).replace('^', ' '))
670 + subs = subs.replace('«', '')
671 + subs = subs.replace('»', '')
672 + return subs
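The thresholding in `get_spaced_sent` can be sketched on toy probabilities: the log transform stretches the low/middle range of the output distribution, and the discrete second derivative picks out local peaks. A self-contained numpy sketch (the probabilities below are made up, not model output):

```python
import numpy as np

probs = np.array([0.05, 0.2, 0.9, 0.3, 0.1])

# log transform: maps [0, 1] onto [0, 1] while stretching small values
r = 255
c = 1 / np.log(1 + r)
log_scaled = c * np.log(1 + r * probs)

# discrete second derivative; a local peak has d2 < 0
d2 = [1] + [probs[i - 1] - 2 * probs[i] + probs[i + 1]
            for i in range(1, len(probs) - 1)]

preds = ['1' if i < len(d2) and log_scaled[i] > 0.01 and d2[i] < 0 else '0'
         for i in range(len(probs))]
print(preds)  # → ['0', '0', '1', '0', '0'] — only the peak at index 2
```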
673 +
674 +if not opt.train and not opt.test:
675 +    # load the vocab file
676 + w2idx, idx2w = load_vocab(opt.vocab_file)
677 +    # load the embedding file
678 + weights = load_embedding(opt.embedding_file)
679 + vocab_size = weights.shape[0]
680 + embed_dim = weights.shape[1]
681 + model = pick_model(opt.model_type, opt.n_hidden, vocab_size, embed_dim, opt.max_seq_len)
682 +
683 + # model.collect_params().initialize(mx.init.Xavier(), ctx=mx.cpu(0))
684 + # model.embedding.weight.set_data(weights)
685 + model.load_parameters(opt.model_params, ctx=mx.cpu(0))
686 + predictor = pred_spacing(model, w2idx)
687 +
702 +    while True:
703 + sent = input("sent > ")
704 + print(sent)
705 + start = timer()
706 + spaced = predictor.get_spaced_sent(sent)
707 + end = timer()
708 + print("spaced sent[{:03.2f}sec/sent] > {}".format(end - start, spaced))
709 +
710 +if not opt.train and opt.test:
711 + logger.info("calculate accuracy!")
712 +    # load the vocab file
713 + w2idx, idx2w = load_vocab(opt.vocab_file)
714 +    # load the embedding file
715 + weights = load_embedding(opt.embedding_file)
716 + vocab_size = weights.shape[0]
717 + embed_dim = weights.shape[1]
718 +
719 + model = pick_model(opt.model_type, opt.n_hidden, vocab_size, embed_dim, opt.max_seq_len)
720 +
721 + # model.initialize(ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
722 + model.load_parameters(opt.model_params,
723 + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0))
724 + valid_generator = make_input_data(opt.test_data,
725 + sampling=1,
726 + train_ratio=1,
727 + make_lag_set=True,
728 + batch_size=100)
729 + valid_acc = evaluate_accuracy(
730 + valid_generator,
731 + model,
732 + w2idx['__PAD__'],
733 + ctx=ctx[0] if isinstance(ctx, list) else mx.gpu(0),
734 + n=30000)
735 + logger.info('valid accuracy : {}'.format(valid_acc))
1 +__all__ = [
2 + 'create_embeddings', 'load_embedding', 'load_vocab',
3 + 'encoding_and_padding', 'get_embedding_model'
4 +]
5 +
6 +import bz2
7 +import json
8 +import os
9 +
10 +import numpy as np
11 +import pkg_resources
12 +from gensim.models import FastText
13 +
14 +from utils.spacing_utils import sent_to_spacing_chars
15 +from tqdm import tqdm
16 +from utils.jamo_utils import jamo_sentence, jamo_to_word
17 +
18 +def pad_sequences(sequences,
19 + maxlen=None,
20 + dtype='int32',
21 + padding='pre',
22 + truncating='pre',
23 + value=0.):
24 +
25 + if not hasattr(sequences, '__len__'):
26 + raise ValueError('`sequences` must be iterable.')
27 + lengths = []
28 + for x in sequences:
29 + if not hasattr(x, '__len__'):
30 + raise ValueError('`sequences` must be a list of iterables. '
31 + 'Found non-iterable: ' + str(x))
32 + lengths.append(len(x))
33 +
34 + num_samples = len(sequences)
35 + if maxlen is None:
36 + maxlen = np.max(lengths)
37 +
38 + # take the sample shape from the first non empty sequence
39 + # checking for consistency in the main loop below.
40 + sample_shape = tuple()
41 + for s in sequences:
42 + if len(s) > 0:
43 + sample_shape = np.asarray(s).shape[1:]
44 + break
45 +
46 + x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
47 + for idx, s in enumerate(sequences):
48 + if not len(s):
49 + continue # empty list/array was found
50 + if truncating == 'pre':
51 + trunc = s[-maxlen:]
52 + elif truncating == 'post':
53 + trunc = s[:maxlen]
54 + else:
55 + raise ValueError('Truncating type "%s" not understood' %
56 + truncating)
57 +
58 + # check `trunc` has expected shape
59 + trunc = np.asarray(trunc, dtype=dtype)
60 + if trunc.shape[1:] != sample_shape:
61 + raise ValueError(
62 + 'Shape of sample %s of sequence at position %s is different from expected shape %s'
63 + % (trunc.shape[1:], idx, sample_shape))
64 +
65 + if padding == 'post':
66 + x[idx, :len(trunc)] = trunc
67 + elif padding == 'pre':
68 + x[idx, -len(trunc):] = trunc
69 + else:
70 + raise ValueError('Padding type "%s" not understood' % padding)
71 + return x
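The `padding='post', truncating='post'` combination used throughout this repo behaves as below; a minimal re-derivation with numpy (not calling `pad_sequences` itself):

```python
import numpy as np

seqs = [[5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8]]
maxlen, value = 5, 0

out = np.full((len(seqs), maxlen), value, dtype='int32')
for idx, s in enumerate(seqs):
    trunc = s[:maxlen]             # truncating='post': keep the head
    out[idx, :len(trunc)] = trunc  # padding='post': pad at the tail

print(out.tolist())  # → [[5, 6, 7, 0, 0], [1, 2, 3, 4, 5]]
```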
72 +
73 +
74 +def create_embeddings(data_dir,
75 + model_file,
76 + embeddings_file,
77 + vocab_file,
78 + splitc=' ',
79 + **params):
80 + """
81 + making embedding from files.
82 + :**params additional Word2Vec() parameters
83 + :splitc char for splitting in data_dir files
84 + :model_file output object from Word2Vec()
85 + :data_dir data dir to be process
86 + :embeddings_file numpy object file path from Word2Vec()
87 + :vocab_file item to index json dictionary
88 + """
89 + class SentenceGenerator(object):
90 + def __init__(self, dirname):
91 + self.dirname = dirname
92 +
93 + def __iter__(self):
94 + for fname in os.listdir(self.dirname):
95 + print("processing~ '{}'".format(fname))
96 + for line in bz2.open(os.path.join(self.dirname, fname), "rt"):
97 + yield sent_to_spacing_chars(line.strip()).split(splitc)
98 +
99 + sentences = SentenceGenerator(data_dir)
100 +
101 +    # train FastText on the streamed sentences, then persist the model
102 +    model = FastText(sentences, **params)
102 + model.save(model_file)
103 + weights = model.wv.syn0
104 + default_vec = np.mean(weights, axis=0, keepdims=True)
105 + padding_vec = np.zeros((1, weights.shape[1]))
106 +
107 + weights_default = np.concatenate([weights, default_vec, padding_vec],
108 + axis=0)
109 +
110 + np.save(open(embeddings_file, 'wb'), weights_default)
111 +
112 + vocab = dict([(k, v.index) for k, v in model.wv.vocab.items()])
113 + vocab['__PAD__'] = weights_default.shape[0] - 1
114 + with open(vocab_file, 'w') as f:
115 + f.write(json.dumps(vocab))
116 +
117 +
118 +def load_embedding(embeddings_file):
119 + return (np.load(embeddings_file))
120 +
121 +
122 +def load_vocab(vocab_path):
123 + with open(vocab_path, 'r') as f:
124 + data = json.loads(f.read())
125 + word2idx = data
126 + idx2word = dict([(v, k) for k, v in data.items()])
127 + return word2idx, idx2word
128 +
129 +def get_similar_char(word2idx_dic, model, jamo_model, text, try_cnt, OOV_CNT, HIT_CNT):
130 + OOV_CNT += 1
131 + jamo_text = jamo_sentence(text)
132 +    similar_list = jamo_model.wv.most_similar(jamo_text)[:try_cnt]
133 +    for char in similar_list:
134 +        result = jamo_to_word(char[0])
135 +
136 +        if result in word2idx_dic:
137 +            HIT_CNT += 1
138 +            return OOV_CNT, HIT_CNT, result
139 +
140 +    # no similar in-vocab character found; fall back to the base model
141 +    return OOV_CNT, HIT_CNT, model.wv.most_similar(text)[0][0]
147 +
148 +
149 +def encoding_and_padding(word2idx_dic, sequences, **params):
150 + """
151 + 1. making item to idx
152 + 2. padding
153 + :word2idx_dic
154 + :sequences: list of lists where each element is a sequence
155 + :maxlen: int, maximum length
156 + :dtype: type to cast the resulting sequence.
157 + :padding: 'pre' or 'post', pad either before or after each sequence.
158 + :truncating: 'pre' or 'post', remove values from sequences larger than
159 + maxlen either in the beginning or in the end of the sequence
160 + :value: float, value to pad the sequences to the desired value.
161 + """
162 + model_file = 'model/fasttext'
163 + jamo_model_path = 'jamo_model/fasttext'
164 + print('seq_idx start')
165 + model = FastText.load(model_file)
166 + jamo_model = FastText.load(jamo_model_path)
167 + seq_idx = []
168 + OOV_CNT = 0
169 + HIT_CNT = 0
170 + TOTAL_CNT = 0
171 +
172 + for word in tqdm(sequences):
173 + temp = []
174 + for char in word:
175 + TOTAL_CNT += 1
176 + if char in word2idx_dic.keys():
177 + temp.append(word2idx_dic[char])
178 + else:
179 + OOV_CNT, HIT_CNT, result = get_similar_char(word2idx_dic, model, jamo_model, char, 3, OOV_CNT, HIT_CNT)
180 + temp.append(word2idx_dic[result])
181 + seq_idx.append(temp)
182 + print('TOTAL CNT: ', TOTAL_CNT, 'OOV CNT: ', OOV_CNT, 'HIT_CNT: ', HIT_CNT)
183 + if OOV_CNT > 0 and HIT_CNT > 0:
184 + print('OOV RATE:', float(OOV_CNT) / TOTAL_CNT * 100, '%' ,'HIT_RATE: ', float(HIT_CNT) / float(OOV_CNT) * 100, '%')
185 +
186 + params['value'] = word2idx_dic['__PAD__']
187 + return (pad_sequences(seq_idx, **params))
188 +
189 +
190 +def get_embedding_model(name='fee_prods', path='data/embedding'):
191 + weights = pkg_resources.resource_filename(
192 + 'dsc', os.path.join(path, name, 'weights.np'))
193 + w2idx = pkg_resources.resource_filename(
194 + 'dsc', os.path.join(path, name, 'idx.json'))
195 + return ((load_embedding(weights), load_vocab(w2idx)[0]))
1 +import re
2 +from soynlp.hangle import compose, decompose, character_is_korean
3 +
4 +
5 +doublespace_pattern = re.compile(r'\s+')
6 +
7 +def jamo_sentence(sent):
8 + def transform(char):
9 + if char == ' ':
10 + return char
11 +
12 + cjj = decompose(char)
13 + if len(cjj) == 1:
14 + return cjj
15 +
16 + cjj_ = ''.join(c if c != ' ' else '-' for c in cjj)
17 + return cjj_
18 +
19 + sent_ = []
20 + for char in sent:
21 + if character_is_korean(char):
22 + sent_.append(transform(char))
23 + else:
24 + sent_.append(char)
25 + sent_ = doublespace_pattern.sub(' ', ''.join(sent_))
26 + return sent_
27 +
28 +def jamo_to_word(jamo):
29 + jamo_list, idx = [], 0
30 +
31 + while idx < len(jamo):
32 + if not character_is_korean(jamo[idx]):
33 + jamo_list.append(jamo[idx])
34 + idx += 1
35 + else:
36 + jamo_list.append(jamo[idx:idx + 3])
37 + idx += 3
38 +
39 + word = ""
40 + for jamo_char in jamo_list:
41 + if len(jamo_char) == 1:
42 + word += jamo_char
43 +        elif jamo_char[2] == "-":
44 +            word += compose(jamo_char[0], jamo_char[1], " ")
45 +        else:
46 +            word += compose(jamo_char[0], jamo_char[1], jamo_char[2])
46 +
47 + return word
48 +
49 +def break_char(jamo_sent):
50 +    idx = 0
51 +    corpus = []
52 +
53 +    while idx < len(jamo_sent):
54 +        if not character_is_korean(jamo_sent[idx]):
55 +            corpus.append(jamo_sent[idx])
56 +            idx += 1
57 +        else:
58 +            corpus.append(jamo_sent[idx:idx + 3])
59 +            idx += 3
60 +    return corpus
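`compose`/`decompose` from soynlp implement the standard Unicode Hangul syllable arithmetic. A self-contained sketch of that arithmetic (jamo are represented by their Unicode indices here, rather than soynlp's character strings):

```python
# Unicode Hangul: syllable = 0xAC00 + (cho * 21 + jung) * 28 + jong
def decompose_syllable(ch):
    code = ord(ch) - 0xAC00
    return code // (21 * 28), (code // 28) % 21, code % 28

def compose_syllable(cho, jung, jong=0):
    return chr(0xAC00 + (cho * 21 + jung) * 28 + jong)

print(decompose_syllable('한'))   # → (18, 0, 4): ㅎ, ㅏ, ㄴ
print(compose_syllable(18, 0, 4))  # → '한'
```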
...\ No newline at end of file ...\ No newline at end of file
1 +# coding=utf-8
2 +# Copyright 2020 Heewon Jeon. All rights reserved.
3 +#
4 +# Licensed under the Apache License, Version 2.0 (the "License");
5 +# you may not use this file except in compliance with the License.
6 +# You may obtain a copy of the License at
7 +#
8 +# http://www.apache.org/licenses/LICENSE-2.0
9 +#
10 +# Unless required by applicable law or agreed to in writing, software
11 +# distributed under the License is distributed on an "AS IS" BASIS,
12 +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 +# See the License for the specific language governing permissions and
14 +# limitations under the License.
15 +
16 +def sent_to_spacing_chars(sent):
17 +    # spaces become '^'
18 +    chars = sent.strip().replace(' ', '^')
19 +    # '«' marks the sentence start, '»' marks the sentence end
20 +    tagged_chars = "«" + chars + "»"
21 +    # sentence -> space-separated character string
22 +    char_list = ' '.join(list(tagged_chars))
23 +    return char_list