submit train init

yomapi
Commit 48da6e7d8642425195a2c89705e430501dfd469c 48da6e7d 1 parent a0839ebc
Showing 19 changed files with 527 additions and 0 deletions
README.md
img/2-stage-FastText.png
img/Thresholding_result.png
img/kosapcing_img.png
img/probability_distribution_of_output_vector.png
train/LICENSE
train/data/example.txt.bz2
train/embedding.py
train/jamo_model/.gitignore
train/model/.gitignore
train/output/.gitignore
train/requirements.txt
train/train.py
train/utils/__pycache__/embedding_maker.cpython-37.pyc
train/utils/__pycache__/jamo_utils.cpython-37.pyc
train/utils/__pycache__/spacing_utils.cpython-37.pyc
train/utils/embedding_maker.py
train/utils/jamo_utils.py
train/utils/spacing_utils.py
--- a/README.md
+++ b/README.md
+ # ML-based Spacing Corrector
+ This model is an improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing"), using FastText instead of Word2Vec.
+ 
+ ## Performance
+ | Model | Test accuracy (%) | Encoding time |
+ | :------------: | :------------: | :------------: |
+ | TrainKoSpacing | 96.6147 | 02m 23s |
+ | Jamo-decomposed (자모분해) FastText | 98.9915 | 08h 20m 11s |
+ | 2-stage FastText | 99.0888 | 03m 23s |
+ 
+ ## Data
+ #### Corpus
+ 
+ We mainly use the National Institute of Korean Language 모두의 말뭉치 (Modu Corpus) and the National Information Society Agency AI-Hub data. Because of licensing restrictions, we cannot redistribute these datasets. You should be able to obtain them through the links below:
+ - [National Institute of Korean Language 모두의 말뭉치](https://corpus.korean.go.kr/)
+ - [National Information Society Agency AI-Hub](https://aihub.or.kr/aihub-data/natural-language/about "National Information Society Agency AI-Hub")
+ 
+ #### Data format
+ A bzip2-compressed text file with one sentence per line.
+ 
+ ```
+ ~/KoSpacing/data$ bzcat train.txt.bz2 | head
+ 엠마누엘 웅가로 / 의상서 실내 장식품으로… 디자인 세계 넓혀
+ 프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.
+ 웅가로는 침실과 식당, 욕실에서 사용하는 갖가지 직물제품을 디자인해 최근 파리의 갤러리 라파예트백화점에서 '색의 컬렉션'이라는 이름으로 전시회를 열었다.
+ ```
+ 
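Note on the data format: a minimal sketch of reading a corpus file in this layout, using only the standard library. The function name `iter_sentences` is an assumption made for illustration; the commit itself opens files with `bz2.open(..., "rt")` inside `create_embeddings`.

```python
import bz2
from typing import Iterator


def iter_sentences(path: str) -> Iterator[str]:
    """Yield one stripped, non-empty sentence per line from a bzip2 text file."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if sentence:  # skip blank lines
                yield sentence
```

Because it is a generator, a large corpus can be streamed without loading the whole file into memory.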
+ 
+ ## Architecture
+ 
+ ### Model
+ ![kosapcing_img](img/kosapcing_img.png)
+ 
+ ### Word Embedding
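+ #### Jamo decomposition (자모분해)
+ To find Korean characters with a similar shape, we use a FastText word embedding trained on jamo-decomposed text. Each syllable is split into its initial consonant, vowel, and final consonant, and a missing final consonant is written as a placeholder.
+ Example: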
+ 자연어처리
+ ㅈ ㅏ – ㅇ ㅕ ㄴ ㅇ ㅓ – ㅊ ㅓ – ㄹ ㅣ –
+ 
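Note on the decomposition: the commit relies on `soynlp.hangle.decompose` for this step. The sketch below is a stand-in that does the same split with plain Unicode arithmetic on the Hangul Syllables block (U+AC00 to U+D7A3). The names `decompose_syllable`, `decompose_text`, and `compose_syllable` are hypothetical, and `-` is used as the empty-final placeholder as in `jamo_utils.py`.

```python
# Compatibility jamo in the order used by the Hangul Syllables block.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals, index 0 = none
HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3
EMPTY_FINAL = "-"  # placeholder for a syllable without a final consonant


def decompose_syllable(ch: str) -> str:
    """Split one precomposed syllable into three jamo; pass other characters through."""
    code = ord(ch)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return ch
    cho, rest = divmod(code - HANGUL_BASE, 21 * 28)
    jung, jong = divmod(rest, 28)
    return CHOSEONG[cho] + JUNGSEONG[jung] + (JONGSEONG[jong] or EMPTY_FINAL)


def decompose_text(text: str) -> str:
    """Decompose every character of a string."""
    return "".join(decompose_syllable(ch) for ch in text)


def compose_syllable(cho: str, jung: str, jong: str = EMPTY_FINAL) -> str:
    """Inverse of decompose_syllable for a single (initial, vowel, final) triple."""
    jong_index = 0 if jong == EMPTY_FINAL else JONGSEONG.index(jong)
    offset = (CHOSEONG.index(cho) * 21 + JUNGSEONG.index(jung)) * 28 + jong_index
    return chr(HANGUL_BASE + offset)
```

Every Hangul syllable becomes exactly three symbols, which is why `jamo_to_word` in this commit can walk the string in steps of three.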
+ #### 2-stage FastText
+ Because encoding every character through jamo decomposition takes a long time, the jamo-decomposed FastText model is used only for out-of-vocabulary (OOV) characters.
+ ![2-stage-FastText_img](img/2-stage-FastText.png)
+ 
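Note on the two-stage lookup: the sketch below illustrates only the control flow, under loud assumptions. Stage 1 is a plain dictionary lookup. Stage 2 runs only for OOV characters and picks the in-vocabulary character whose jamo form is most similar. The commit uses a trained jamo-level FastText model for stage 2; here a Jaccard score over jamo bigrams is swapped in as a toy similarity so the example runs without any model files. All function names and the `unk_index` parameter are hypothetical.

```python
from typing import Dict, Set

CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = ["-", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # index 0 = no final


def to_jamo(ch: str) -> str:
    """Decompose a Hangul syllable into three jamo; other characters are unchanged."""
    code = ord(ch)
    if not 0xAC00 <= code <= 0xD7A3:
        return ch
    cho, rest = divmod(code - 0xAC00, 21 * 28)
    jung, jong = divmod(rest, 28)
    return CHOSEONG[cho] + JUNGSEONG[jung] + JONGSEONG[jong]


def jamo_bigrams(ch: str) -> Set[str]:
    """Boundary-padded bigrams of the jamo form (toy stand-in for subword n-grams)."""
    padded = "<" + to_jamo(ch) + ">"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode_char(ch: str, vocab: Dict[str, int], unk_index: int) -> int:
    """Stage 1: direct lookup. Stage 2 (OOV only): nearest in-vocabulary character."""
    if ch in vocab:
        return vocab[ch]
    target = jamo_bigrams(ch)
    best, best_score = None, 0.0
    for candidate in vocab:
        other = jamo_bigrams(candidate)
        score = len(target & other) / len(target | other)  # Jaccard similarity
        if score > best_score:
            best, best_score = candidate, score
    return vocab[best] if best is not None else unk_index
```

The expensive similarity search is skipped for every character already in the vocabulary, which is the point of the two-stage design.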
+ ### Thresholding
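+ The middle part of the output probability distribution is spread almost evenly, so the threshold is not obvious from the distribution itself:
+ ![probability_distribution_of_output_vector](img/probability_distribution_of_output_vector.png)
+ 
+ To choose the threshold, we apply a log transform and take the second derivative.
+ Result: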
+ ![Thresholding_result](img/Thresholding_result.png)
+ 
+ 
+ 
+ ## How to Run
+ 
+ 
+ ### Installation
+ 
+ - For training, a GPU is strongly recommended for speed. CPU is supported but training could be extremely slow.
+ - Only Python 3.7 or later is supported.
+ 
+ ### Requirements
+ 
+ - Python (>= 3.7)
+ - MXNet (>= 1.6.0)
+ - tqdm (>= 4.19.5)
+ - Pandas (>= 0.22.0)
+ - Gensim (>= 3.8.1)
+ - GluonNLP (>= 0.9.1)
+ - soynlp (>= 0.0.493)
+ 
+ ### Dependencies
+ 
+ ```bash
+ pip install -r requirements.txt
+ ```
+ 
+ ### Training
+ 
+ ```bash
+ python train.py --train --train-samp-ratio 1.0 --num-epoch 50 --train_data data/train.txt.bz2 --test_data data/test.txt.bz2 --outputs train_log_to --model_type kospacing --model-file fasttext
+ ```
+ 
+ ### Evaluation
+ 
+ ```bash
+ python train.py --model-params model/kospacing.params --model_type kospacing
+ sent > 중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
+ 중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
+ spaced sent[0.12sec/sent]  > 중국은 2018년 평창동계올림픽의 반환점에 이르기까지 아직 노골드 행진이다.  
+ ```
+ 
+ ### Directory
+ Directory layout for the embedding model files.
+ Entries in bold are required.
+ 
+ - model
+ 	- **fasttext**
+ 	- fasttext_vis
+ 	- **fasttext.trainables.vectors_ngrams_lockf.npy**
+ 	- **fasttext.wv.vectors_ngrams.npy**
+ 	- **kospacing_wv.np**
+ 	- **w2idx.dic**
+ 
+ - jamo_model
+ 	- **fasttext**
+ 	- fasttext_vis
+ 	- **fasttext.trainables.vectors_ngrams_lockf.npy**
+ 	- **fasttext.wv.vectors_ngrams.npy**
+ 	- **kospacing_wv.np**
+ 	- **w2idx.dic**
+ 
+ ### References
+ - TrainKoSpacing: https://github.com/haven-jeon/TrainKoSpacing
+ - 딥 러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing Using Deep Learning): https://wikidocs.net/book/2155
+ 
--- a/img/2-stage-FastText.png 0 → 100644
+++ b/img/2-stage-FastText.png 0 → 100644
--- a/img/Thresholding_result.png 0 → 100644
+++ b/img/Thresholding_result.png 0 → 100644
--- a/img/kosapcing_img.png 0 → 100644
+++ b/img/kosapcing_img.png 0 → 100644
--- a/img/probability_distribution_of_output_vector.png 0 → 100644
+++ b/img/probability_distribution_of_output_vector.png 0 → 100644
--- a/train/LICENSE 0 → 100644
+++ b/train/LICENSE 0 → 100644
--- a/train/data/example.txt.bz2 0 → 100644
+++ b/train/data/example.txt.bz2 0 → 100644
--- a/train/embedding.py 0 → 100644
+++ b/train/embedding.py 0 → 100644
+ # coding=utf-8
+ # Copyright 2020 Heewon Jeon. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ 
+ import argparse
+ from utils.embedding_maker import create_embeddings
+ 
+ 
+ parser = argparse.ArgumentParser(description='Korean Autospacing Embedding Maker')
+ 
+ parser.add_argument('--num-iters', type=int, default=5,
+                     help='number of iterations to train (default: 5)')
+ 
+ parser.add_argument('--min-count', type=int, default=100,
+                     help='minimum word counts to filter (default: 100)')
+ 
+ parser.add_argument('--embedding-size', type=int, default=100,
+                     help='embedding dimension size (default: 100)')
+ 
+ parser.add_argument('--num-worker', type=int, default=16,
+                     help='number of thread (default: 16)')
+ 
+ parser.add_argument('--window-size', type=int, default=8,
+                     help='skip-gram window size (default: 8)')
+ 
+ parser.add_argument('--corpus_dir', type=str, default='data',
+                     help='training resource dir')
+ 
+ parser.add_argument('--train', action='store_true', default=True,
+                     help='do embedding training (default: True)')
+ 
+ parser.add_argument('--model-file', type=str, default='kospacing_wv.mdl',
+                     help='output object from Word2Vec() (default: kospacing_wv.mdl)')
+ 
+ parser.add_argument('--numpy-wv', type=str, default='kospacing_wv.np',
+                     help='numpy object file path from Word2Vec() (default: kospacing_wv.np)')
+ 
+ parser.add_argument('--w2idx', type=str, default='w2idx.dic',
+                     help='item to index json dictionary (default: w2idx.dic)')
+ 
+ parser.add_argument('--model-dir', type=str, default='model',
+                     help='dir to save models (default: model)')
+ 
+ opt = parser.parse_args()
+ 
+ if opt.train:
+     create_embeddings(opt.corpus_dir, opt.model_dir + '/' +
+                       opt.model_file, opt.model_dir + '/' + opt.numpy_wv,
+                       opt.model_dir + '/' + opt.w2idx, min_count=opt.min_count,
+                       iter=opt.num_iters,
+                       size=opt.embedding_size, workers=opt.num_worker, window=opt.window_size)
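Note on the `--train` flag: with `action='store_true'` and `default=True`, `opt.train` is `True` whether or not the flag is passed, so the `if opt.train:` branch always runs. A minimal sketch of one way to make it switchable while keeping the same default is shown below. The `--no-train` flag is hypothetical and not part of this commit.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with a train switch that defaults to True but can be disabled."""
    parser = argparse.ArgumentParser(description="train flag sketch")
    parser.add_argument("--train", dest="train", action="store_true",
                        default=True,
                        help="do embedding training (default: True)")
    # Hypothetical companion flag: writes False into the same destination.
    parser.add_argument("--no-train", dest="train", action="store_false",
                        help="skip embedding training")
    return parser
```

Both flags write to the same `dest`, so existing command lines that pass `--train` keep working. On Python 3.9 or later, `argparse.BooleanOptionalAction` offers a similar result with a single argument, but the README targets Python 3.7.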
--- a/train/jamo_model/.gitignore 0 → 100644
+++ b/train/jamo_model/.gitignore 0 → 100644
--- a/train/model/.gitignore 0 → 100644
+++ b/train/model/.gitignore 0 → 100644
--- a/train/output/.gitignore 0 → 100644
+++ b/train/output/.gitignore 0 → 100644
--- a/train/requirements.txt 0 → 100644
+++ b/train/requirements.txt 0 → 100644
+ absl-py==0.11.0
+ astunparse==1.6.3
+ cachetools==4.2.1
+ certifi==2020.12.5
+ chardet==4.0.0
+ click==7.1.2
+ cmake==3.18.4.post1
+ Cython==0.29.21
+ Flask==1.1.2
+ Flask-Cors==3.0.9
+ flatbuffers==1.12
+ gast==0.3.3
+ gensim==3.8.3
+ gluonnlp==0.10.0
+ google-auth==1.26.1
+ google-auth-oauthlib==0.4.2
+ google-pasta==0.2.0
+ graphviz==0.8.4
+ grpcio==1.32.0
+ h5py==2.10.0
+ idna==2.10
+ importlib-metadata==3.4.0
+ itsdangerous==1.1.0
+ Jinja2==2.11.2
+ joblib==1.0.1
+ Keras==2.4.3
+ Keras-Preprocessing==1.1.2
+ Markdown==3.3.3
+ MarkupSafe==1.1.1
+ mxnet-cu101==1.7.0
+ mxnet-cu101mkl==1.6.0.post0
+ mxnet-mkl==1.6.0
+ numpy==1.19.5
+ oauthlib==3.1.0
+ opt-einsum==3.3.0
+ packaging==20.9
+ pandas==1.2.2
+ protobuf==3.14.0
+ psutil==5.8.0
+ pyasn1==0.4.8
+ pyasn1-modules==0.2.8
+ pyparsing==2.4.7
+ python-dateutil==2.8.1
+ pytz==2020.5
+ PyYAML==5.3.1
+ requests==2.25.1
+ requests-oauthlib==1.3.0
+ rsa==4.6
+ scikit-learn==0.24.1
+ scipy==1.6.0
+ six==1.15.0
+ smart-open==4.0.1
+ soynlp==0.0.493
+ tensorboard==2.4.0
+ tensorboard-plugin-wit==1.7.0
+ tensorflow==2.4.1
+ tensorflow-estimator==2.4.0
+ termcolor==1.1.0
+ threadpoolctl==2.1.0
+ tqdm==4.56.0
+ typing-extensions==3.7.4.3
+ urllib3==1.26.3
+ Werkzeug==1.0.1
+ wrapt==1.12.1
+ zipp==3.4.0
--- a/train/train.py 0 → 100644
+++ b/train/train.py 0 → 100644
--- a/train/utils/__pycache__/embedding_maker.cpython-37.pyc 0 → 100644
+++ b/train/utils/__pycache__/embedding_maker.cpython-37.pyc 0 → 100644
--- a/train/utils/__pycache__/jamo_utils.cpython-37.pyc 0 → 100644
+++ b/train/utils/__pycache__/jamo_utils.cpython-37.pyc 0 → 100644
--- a/train/utils/__pycache__/spacing_utils.cpython-37.pyc 0 → 100644
+++ b/train/utils/__pycache__/spacing_utils.cpython-37.pyc 0 → 100644
--- a/train/utils/embedding_maker.py 0 → 100644
+++ b/train/utils/embedding_maker.py 0 → 100644
+ __all__ = [
+     'create_embeddings', 'load_embedding', 'load_vocab',
+     'encoding_and_padding', 'get_embedding_model'
+ ]
+ 
+ import bz2
+ import json
+ import os
+ 
+ import numpy as np
+ import pkg_resources
+ from gensim.models import FastText
+ 
+ from utils.spacing_utils import sent_to_spacing_chars
+ from tqdm import tqdm
+ from utils.jamo_utils import jamo_sentence, jamo_to_word
+ 
+ def pad_sequences(sequences,
+                   maxlen=None,
+                   dtype='int32',
+                   padding='pre',
+                   truncating='pre',
+                   value=0.):
+ 
+     if not hasattr(sequences, '__len__'):
+         raise ValueError('`sequences` must be iterable.')
+     lengths = []
+     for x in sequences:
+         if not hasattr(x, '__len__'):
+             raise ValueError('`sequences` must be a list of iterables. '
+                              'Found non-iterable: ' + str(x))
+         lengths.append(len(x))
+ 
+     num_samples = len(sequences)
+     if maxlen is None:
+         maxlen = np.max(lengths)
+ 
+     # take the sample shape from the first non empty sequence
+     # checking for consistency in the main loop below.
+     sample_shape = tuple()
+     for s in sequences:
+         if len(s) > 0:
+             sample_shape = np.asarray(s).shape[1:]
+             break
+ 
+     x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
+     for idx, s in enumerate(sequences):
+         if not len(s):
+             continue  # empty list/array was found
+         if truncating == 'pre':
+             trunc = s[-maxlen:]
+         elif truncating == 'post':
+             trunc = s[:maxlen]
+         else:
+             raise ValueError('Truncating type "%s" not understood' %
+                              truncating)
+ 
+         # check `trunc` has expected shape
+         trunc = np.asarray(trunc, dtype=dtype)
+         if trunc.shape[1:] != sample_shape:
+             raise ValueError(
+                 'Shape of sample %s of sequence at position %s is different from expected shape %s'
+                 % (trunc.shape[1:], idx, sample_shape))
+ 
+         if padding == 'post':
+             x[idx, :len(trunc)] = trunc
+         elif padding == 'pre':
+             x[idx, -len(trunc):] = trunc
+         else:
+             raise ValueError('Padding type "%s" not understood' % padding)
+     return x
+ 
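Note on `pad_sequences` defaults: both `padding` and `truncating` default to `'pre'`, so long sequences keep their tail and short sequences are filled at the front. The pure-Python sketch below shows only that default behaviour for flat integer sequences. The name `pad_pre` is hypothetical, and unlike the function in the commit it returns nested lists instead of a NumPy array.

```python
from typing import List, Sequence


def pad_pre(sequences: Sequence[Sequence[int]], maxlen: int,
            value: int = 0) -> List[List[int]]:
    """Pad or truncate every sequence to `maxlen`, working from the front."""
    if maxlen < 1:
        raise ValueError("maxlen must be at least 1")
    padded = []
    for seq in sequences:
        kept = list(seq[-maxlen:])               # 'pre' truncation keeps the tail
        fill = [value] * (maxlen - len(kept))    # 'pre' padding goes in front
        padded.append(fill + kept)
    return padded
```

For example, with `maxlen=2` the sequence `[1, 2, 3]` becomes `[2, 3]` and `[4]` becomes `[0, 4]`.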
+ 
+ def create_embeddings(data_dir,
+                       model_file,
+                       embeddings_file,
+                       vocab_file,
+                       splitc=' ',
+                       **params):
+     """
+     Build the embedding matrix and vocabulary files from a saved FastText model.
+     :**params additional keyword arguments (not used in the current body)
+     :splitc   char for splitting lines in data_dir files
+     :model_file path of the FastText model to load and re-save
+     :data_dir directory of bzip2-compressed corpus files
+     :embeddings_file output path for the numpy embedding matrix
+     :vocab_file output path for the item-to-index JSON dictionary
+     """
+     class SentenceGenerator(object):
+         def __init__(self, dirname):
+             self.dirname = dirname
+ 
+         def __iter__(self):
+             for fname in os.listdir(self.dirname):
+                 print("processing~  '{}'".format(fname))
+                 for line in bz2.open(os.path.join(self.dirname, fname), "rt"):
+                     yield sent_to_spacing_chars(line.strip()).split(splitc)
+ 
+     sentences = SentenceGenerator(data_dir)
+ 
+     model = FastText.load(model_file)
+     model.save(model_file)
+     weights = model.wv.syn0
+     default_vec = np.mean(weights, axis=0, keepdims=True)
+     padding_vec = np.zeros((1, weights.shape[1]))
+ 
+     weights_default = np.concatenate([weights, default_vec, padding_vec],
+                                      axis=0)
+ 
+     np.save(open(embeddings_file, 'wb'), weights_default)
+ 
+     vocab = dict([(k, v.index) for k, v in model.wv.vocab.items()])
+     vocab['__PAD__'] = weights_default.shape[0] - 1
+     with open(vocab_file, 'w') as f:
+         f.write(json.dumps(vocab))
+ 
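Note on the index layout written by `create_embeddings`: the saved matrix is the model's vectors, followed by one mean ("default") row, followed by one all-zero padding row. Only the padding row gets a vocabulary key (`__PAD__`). The sketch below reproduces that layout with plain dictionaries. The helper names `build_vocab` and `invert_vocab` are hypothetical.

```python
import json
from typing import Dict, List, Tuple


def build_vocab(tokens: List[str]) -> Tuple[Dict[str, int], int]:
    """Return (vocab, default_index) for a matrix of len(tokens) + 2 rows."""
    vocab = {tok: i for i, tok in enumerate(tokens)}
    default_index = len(tokens)          # mean-vector row, has no vocabulary key
    vocab["__PAD__"] = len(tokens) + 1   # last row, all zeros
    return vocab, default_index


def invert_vocab(vocab: Dict[str, int]) -> Dict[int, str]:
    """Index-to-item mapping, as load_vocab builds for idx2word."""
    return {idx: tok for tok, idx in vocab.items()}


def roundtrip(vocab: Dict[str, int]) -> Dict[str, int]:
    """Serialize and parse the vocabulary the way the JSON file is written and read."""
    return json.loads(json.dumps(vocab))
```

Integer values survive the JSON round trip unchanged, so the loaded dictionary can index the matrix directly.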
+ 
+ def load_embedding(embeddings_file):
+     return (np.load(embeddings_file))
+ 
+ 
+ def load_vocab(vocab_path):
+     with open(vocab_path, 'r') as f:
+         data = json.loads(f.read())
+     word2idx = data
+     idx2word = dict([(v, k) for k, v in data.items()])
+     return word2idx, idx2word
+ 
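+ def get_similar_char(word2idx_dic, model, jamo_model, text, try_cnt, OOV_CNT, HIT_CNT):
+     """Map an out-of-vocabulary character to a similar in-vocabulary one.
+ 
+     The `try_cnt` nearest neighbours of `text` in the jamo-level model are
+     checked first; if none of them is in `word2idx_dic`, the nearest
+     neighbour from the character-level model is returned instead.
+     Returns the updated (OOV_CNT, HIT_CNT) counters and the chosen character.
+     """
+     OOV_CNT += 1
+     jamo_text = jamo_sentence(text)
+     similar_list = jamo_model.wv.most_similar(jamo_text)[:try_cnt]
+     for char in similar_list:
+         result = jamo_to_word(char[0])
+         if result in word2idx_dic:
+             HIT_CNT += 1
+             return OOV_CNT, HIT_CNT, result
+ 
+     # No jamo-level neighbour is in the vocabulary; fall back to the character-level model.
+     return OOV_CNT, HIT_CNT, model.wv.most_similar(text)[0][0]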
+ 
+ 
+ def encoding_and_padding(word2idx_dic, sequences, **params):
+     """
+     1. making item to idx
+     2. padding
+     :word2idx_dic
+     :sequences: list of lists where each element is a sequence
+     :maxlen: int, maximum length
+     :dtype: type to cast the resulting sequence.
+     :padding: 'pre' or 'post', pad either before or after each sequence.
+     :truncating: 'pre' or 'post', remove values from sequences larger than
+         maxlen either in the beginning or in the end of the sequence
+     :value: float, value to pad the sequences to the desired value.
+     """
+     model_file = 'model/fasttext'
+     jamo_model_path = 'jamo_model/fasttext'
+     print('seq_idx start')
+     model = FastText.load(model_file)
+     jamo_model = FastText.load(jamo_model_path)
+     seq_idx = []
+     OOV_CNT = 0
+     HIT_CNT = 0
+     TOTAL_CNT = 0
+     
+     for word in tqdm(sequences):
+         temp = []
+         for char in word:
+             TOTAL_CNT += 1
+             if char in word2idx_dic.keys():
+                 temp.append(word2idx_dic[char])
+             else:
+                 OOV_CNT, HIT_CNT, result = get_similar_char(word2idx_dic, model, jamo_model, char, 3, OOV_CNT, HIT_CNT)
+                 temp.append(word2idx_dic[result])
+         seq_idx.append(temp)
+     print('TOTAL CNT: ', TOTAL_CNT, 'OOV CNT: ', OOV_CNT, 'HIT_CNT: ', HIT_CNT)
+     if OOV_CNT > 0 and HIT_CNT > 0:
+         print('OOV RATE:', float(OOV_CNT) / TOTAL_CNT * 100, '%' ,'HIT_RATE: ', float(HIT_CNT) / float(OOV_CNT) * 100, '%')
+     
+     params['value'] = word2idx_dic['__PAD__']
+     return (pad_sequences(seq_idx, **params))
+ 
+ 
+ def get_embedding_model(name='fee_prods', path='data/embedding'):
+     weights = pkg_resources.resource_filename(
+         'dsc', os.path.join(path, name, 'weights.np'))
+     w2idx = pkg_resources.resource_filename(
+         'dsc', os.path.join(path, name, 'idx.json'))
+     return ((load_embedding(weights), load_vocab(w2idx)[0]))
--- a/train/utils/jamo_utils.py 0 → 100644
+++ b/train/utils/jamo_utils.py 0 → 100644
+ import re 
+ from soynlp.hangle import compose, decompose, character_is_korean 
+ 
+ 
+ doublespace_pattern = re.compile(r'\s+')
+ 
+ def jamo_sentence(sent): 
+     def transform(char): 
+         if char == ' ': 
+             return char 
+             
+         cjj = decompose(char) 
+         if len(cjj) == 1: 
+             return cjj 
+         
+         cjj_ = ''.join(c if c != ' ' else '-' for c in cjj) 
+         return cjj_ 
+         
+     sent_ = [] 
+     for char in sent: 
+         if character_is_korean(char): 
+             sent_.append(transform(char)) 
+         else: 
+             sent_.append(char) 
+     sent_ = doublespace_pattern.sub(' ', ''.join(sent_)) 
+     return sent_ 
+         
+ def jamo_to_word(jamo): 
+     jamo_list, idx = [], 0 
+     
+     while idx < len(jamo): 
+         if not character_is_korean(jamo[idx]): 
+             jamo_list.append(jamo[idx]) 
+             idx += 1 
+         else: 
+             jamo_list.append(jamo[idx:idx + 3]) 
+             idx += 3 
+         
+     word = "" 
+     for jamo_char in jamo_list: 
+         if len(jamo_char) == 1: 
+             word += jamo_char 
+         elif jamo_char[2] == "-":
+             word += compose(jamo_char[0], jamo_char[1], " ")
+         else:
+             word += compose(jamo_char[0], jamo_char[1], jamo_char[2])
+             
+     return word
+ 
+ def break_char(jamo_sentence):
+     idx = 0
+     corpus = []
+ 
+     while idx < len(jamo_sentence):
+         if not character_is_korean(jamo_sentence[idx]):
+             corpus.append(jamo_sentence[idx])
+             idx += 1
+         else:
+             corpus.append(jamo_sentence[idx:idx + 3])
+             idx += 3
+     return corpus
\ No newline at end of file
--- a/train/utils/spacing_utils.py 0 → 100644
+++ b/train/utils/spacing_utils.py 0 → 100644
+ # coding=utf-8
+ # Copyright 2020 Heewon Jeon. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ 
+ def sent_to_spacing_chars(sent):
+     # Spaces are encoded as '^'.
+     chars = sent.strip().replace(' ', '^')
+ 
+     # '«' marks the start of the sentence and '»' marks its end.
+     tagged_chars = "«" + chars + "»"
+ 
+     # Sentence -> space-separated characters.
+     char_list = ' '.join(list(tagged_chars))
+     return char_list
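Note on the character encoding: the sketch below restates the transformation as a standalone function and adds an inverse, so the format can be checked on a short sentence. `to_spacing_chars` mirrors `sent_to_spacing_chars` above. `from_spacing_chars` is a hypothetical helper that does not appear in the commit.

```python
def to_spacing_chars(sent: str) -> str:
    """Encode spaces as '^', wrap in '«' and '»', then separate characters by spaces."""
    chars = sent.strip().replace(" ", "^")
    return " ".join("«" + chars + "»")


def from_spacing_chars(encoded: str) -> str:
    """Invert to_spacing_chars: drop the boundary markers and restore spaces."""
    symbols = encoded.split(" ")
    inner = symbols[1:-1]  # strip the leading '«' and trailing '»'
    return "".join(inner).replace("^", " ")
```

For the input `아버지가 방에` the encoded form is `« 아 버 지 가 ^ 방 에 »`, which is what `SentenceGenerator` later splits on `splitc`.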