yomapi

submit train init

# ML-based Spacing Corrector
This model is an improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing"), using FastText instead of Word2Vec.

## Performances
| Model | Test Accuracy (%) | Encoding Time Cost |
| :------------: | :------------: | :------------: |
| TrainKoSpacing | 96.6147 | 02m 23s |
| 자모분해 (jamo decomposition) FastText | 98.9915 | 08h 20m 11s |
| 2-stage FastText | 99.0888 | 03m 23s |

## Data
#### Corpus

We mainly use the National Institute of Korean Language's 모두의 말뭉치 corpus and the National Information Society Agency's AI-Hub data. However, due to licensing restrictions, we cannot redistribute these datasets. You should be able to obtain them through the links below:
- [National Institute of Korean Language 모두의 말뭉치](https://corpus.korean.go.kr/)
- [National Information Society Agency AI-Hub](https://aihub.or.kr/aihub-data/natural-language/about "National Information Society Agency AI-Hub")

#### Data format
A bzip2-compressed file consisting of one sentence per line.

```
~/KoSpacing/data$ bzcat train.txt.bz2 | head
엠마누엘 웅가로 / 의상서 실내 장식품으로… 디자인 세계 넓혀
프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.
웅가로는 침실과 식당, 욕실에서 사용하는 갖가지 직물제품을 디자인해 최근 파리의 갤러리 라파예트백화점에서 '색의 컬렉션'이라는 이름으로 전시회를 열었다.
```
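
If your corpus is a plain text file with one sentence per line, it can be compressed into this format with bzip2 (a minimal example; the paths are illustrative):

```bash
bzip2 -k data/train.txt   # writes data/train.txt.bz2 and keeps the original file
```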


## Architecture

### Model
![kosapcing_img](img/kosapcing_img.png)

### Word Embedding
#### 자모분해 (jamo decomposition)
To capture shape similarity between Korean characters, we use a 자모분해 (jamo decomposition) FastText word embedding, e.g.:

자연어처리
ㅈ ㅏ – ㅇ ㅕ ㄴ ㅇ ㅓ – ㅊ ㅓ – ㄹ ㅣ –
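
The decomposition above can be reproduced with soynlp's `decompose`, the same helper this repo uses in `utils/jamo_utils.py`. A minimal sketch, assuming Korean-syllable input (the function name `to_jamo` is illustrative):

```python
from soynlp.hangle import decompose

def to_jamo(word):
    out = []
    for char in word:
        cjj = decompose(char)  # (choseong, jungseong, jongseong)
        out.extend(c if c != ' ' else '-' for c in cjj)  # '-' marks an empty slot
    return ' '.join(out)

print(to_jamo('자연어처리'))  # ㅈ ㅏ - ㅇ ㅕ ㄴ ㅇ ㅓ - ㅊ ㅓ - ㄹ ㅣ -
```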

#### 2-stage FastText
Because 자모분해 (jamo decomposition) is slow to compute, the jamo-level FastText model is used only for out-of-vocabulary characters, as sketched below.
![2-stage-FastText_img](img/2-stage-FastText.png)
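
A condensed sketch of the two-stage lookup (the full logic lives in `get_similar_char` in `utils/embedding_maker.py`; the function name `char_to_idx` is illustrative):

```python
from utils.jamo_utils import jamo_sentence, jamo_to_word

def char_to_idx(char, word2idx, model, jamo_model, try_cnt=3):
    # model / jamo_model: trained gensim FastText models (word-level / jamo-level)
    if char in word2idx:                  # stage 1: in-vocabulary lookup
        return word2idx[char]
    jamo = jamo_sentence(char)            # stage 2: jamo-level FastText for OOV
    for similar, _ in jamo_model.wv.most_similar(jamo)[:try_cnt]:
        candidate = jamo_to_word(similar)
        if candidate in word2idx:
            return word2idx[candidate]
    # last resort, as in the repo: nearest neighbour from the word-level model
    return word2idx[model.wv.most_similar(char)[0][0]]
```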

### Thresholding
The middle part of the output probability distribution is close to uniform, so a naive cutoff there is unreliable.
![probability_distribution_of_output_vector](img/probability_distribution_of_output_vector.png)

We apply a log transform and use the second derivative to choose the threshold.
Result:
![Thresholding_result](img/Thresholding_result.png)
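
The repo's exact procedure is not shown here, so this is a minimal sketch of the described idea with illustrative names: sort the probabilities, take logs so the flat middle spreads out, and place the threshold at the sharpest bend found via the discrete second derivative.

```python
import numpy as np

def pick_threshold(probs):
    p = np.sort(np.asarray(probs))
    log_p = np.log(p + 1e-12)              # log transform spreads out the flat middle
    d2 = np.diff(log_p, n=2)               # discrete second derivative
    knee = int(np.argmax(np.abs(d2))) + 1  # index of the sharpest curvature change
    return p[knee]
```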


## How to Run

### Installation

- For training, a GPU is strongly recommended for speed. CPU is supported, but training can be extremely slow.
- Only Python 3.7 and above is supported.

### Requirement

- Python (>= 3.7)
- MXNet (>= 1.6.0)
- tqdm (>= 4.19.5)
- Pandas (>= 0.22.0)
- Gensim (>= 3.8.1)
- GluonNLP (>= 0.9.1)
- soynlp (>= 0.0.493)

### Dependencies

```bash
pip install -r requirements.txt
```

### Training

```bash
python train.py --train --train-samp-ratio 1.0 --num-epoch 50 --train_data data/train.txt.bz2 --test_data data/test.txt.bz2 --outputs train_log_to --model_type kospacing --model-file fasttext
```

### Evaluation

```bash
python train.py --model-params model/kospacing.params --model_type kospacing
sent > 중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
spaced sent[0.12sec/sent] > 중국은 2018년 평창동계올림픽의 반환점에 이르기까지 아직 노골드 행진이다.
```

### Directory
Directory layout for the embedding model files (bold entries are required):

- model
    - **fasttext**
    - fasttext_vis
    - **fasttext.trainables.vectors_ngrams_lockf.npy**
    - **fasttext.wv.vectors_ngrams.npy**
    - **kospacing_wv.np**
    - **w2idx.dic**

- jamo_model
    - **fasttext**
    - fasttext_vis
    - **fasttext.trainables.vectors_ngrams_lockf.npy**
    - **fasttext.wv.vectors_ngrams.npy**
    - **kospacing_wv.np**
    - **w2idx.dic**

### Reference
- TrainKoSpacing: https://github.com/haven-jeon/TrainKoSpacing
- Introduction to Natural Language Processing Using Deep Learning (딥 러닝을 이용한 자연어 처리 입문): https://wikidocs.net/book/2155

# coding=utf-8
# Copyright 2020 Heewon Jeon. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse

from utils.embedding_maker import create_embeddings

parser = argparse.ArgumentParser(description='Korean Autospacing Embedding Maker')

parser.add_argument('--num-iters', type=int, default=5,
                    help='number of iterations to train (default: 5)')

parser.add_argument('--min-count', type=int, default=100,
                    help='minimum word count to filter (default: 100)')

parser.add_argument('--embedding-size', type=int, default=100,
                    help='embedding dimension size (default: 100)')

parser.add_argument('--num-worker', type=int, default=16,
                    help='number of threads (default: 16)')

parser.add_argument('--window-size', type=int, default=8,
                    help='skip-gram window size (default: 8)')

parser.add_argument('--corpus_dir', type=str, default='data',
                    help='training resource dir')

parser.add_argument('--train', action='store_true', default=True,
                    help='do embedding training (default: True)')

parser.add_argument('--model-file', type=str, default='kospacing_wv.mdl',
                    help='output object from Word2Vec() (default: kospacing_wv.mdl)')

parser.add_argument('--numpy-wv', type=str, default='kospacing_wv.np',
                    help='numpy object file path from Word2Vec() (default: kospacing_wv.np)')

parser.add_argument('--w2idx', type=str, default='w2idx.dic',
                    help='item to index json dictionary (default: w2idx.dic)')

parser.add_argument('--model-dir', type=str, default='model',
                    help='dir to save models (default: model)')

opt = parser.parse_args()

if opt.train:
    create_embeddings(opt.corpus_dir,
                      opt.model_dir + '/' + opt.model_file,
                      opt.model_dir + '/' + opt.numpy_wv,
                      opt.model_dir + '/' + opt.w2idx,
                      min_count=opt.min_count,
                      iter=opt.num_iters,
                      size=opt.embedding_size,
                      workers=opt.num_worker,
                      window=opt.window_size)
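
# Example (assumed) invocation; the script's file name is not shown in this
# diff, so `embedding_maker.py` below is illustrative:
#   python embedding_maker.py --corpus_dir data --model-dir model --embedding-size 100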
File mode changed
File mode changed
absl-py==0.11.0
astunparse==1.6.3
cachetools==4.2.1
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
cmake==3.18.4.post1
Cython==0.29.21
Flask==1.1.2
Flask-Cors==3.0.9
flatbuffers==1.12
gast==0.3.3
gensim==3.8.3
gluonnlp==0.10.0
google-auth==1.26.1
google-auth-oauthlib==0.4.2
google-pasta==0.2.0
graphviz==0.8.4
grpcio==1.32.0
h5py==2.10.0
idna==2.10
importlib-metadata==3.4.0
itsdangerous==1.1.0
Jinja2==2.11.2
joblib==1.0.1
Keras==2.4.3
Keras-Preprocessing==1.1.2
Markdown==3.3.3
MarkupSafe==1.1.1
mxnet-cu101==1.7.0
mxnet-cu101mkl==1.6.0.post0
mxnet-mkl==1.6.0
numpy==1.19.5
oauthlib==3.1.0
opt-einsum==3.3.0
packaging==20.9
pandas==1.2.2
protobuf==3.14.0
psutil==5.8.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2020.5
PyYAML==5.3.1
requests==2.25.1
requests-oauthlib==1.3.0
rsa==4.6
scikit-learn==0.24.1
scipy==1.6.0
six==1.15.0
smart-open==4.0.1
soynlp==0.0.493
tensorboard==2.4.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.4.1
tensorflow-estimator==2.4.0
termcolor==1.1.0
threadpoolctl==2.1.0
tqdm==4.56.0
typing-extensions==3.7.4.3
urllib3==1.26.3
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.4.0
__all__ = [
    'create_embeddings', 'load_embedding', 'load_vocab',
    'encoding_and_padding', 'get_embedding_model'
]

import bz2
import json
import os

import numpy as np
import pkg_resources
from gensim.models import FastText
from tqdm import tqdm

from utils.jamo_utils import jamo_sentence, jamo_to_word
from utils.spacing_utils import sent_to_spacing_chars


def pad_sequences(sequences,
                  maxlen=None,
                  dtype='int32',
                  padding='pre',
                  truncating='pre',
                  value=0.):
    """Pad each sequence to maxlen, Keras-style ('pre'/'post' padding and truncating)."""
    if not hasattr(sequences, '__len__'):
        raise ValueError('`sequences` must be iterable.')
    lengths = []
    for x in sequences:
        if not hasattr(x, '__len__'):
            raise ValueError('`sequences` must be a list of iterables. '
                             'Found non-iterable: ' + str(x))
        lengths.append(len(x))

    num_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    # take the sample shape from the first non-empty sequence,
    # checking for consistency in the main loop below
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break

    x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if not len(s):
            continue  # empty list/array was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' %
                             truncating)

        # check that `trunc` has the expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError(
                'Shape of sample %s of sequence at position %s is different from expected shape %s'
                % (trunc.shape[1:], idx, sample_shape))

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)
    return x
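
# Example:
#   pad_sequences([[1, 2], [3]], maxlen=3, padding='post', value=0)
#   -> array([[1, 2, 0], [3, 0, 0]], dtype=int32)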


def create_embeddings(data_dir,
                      model_file,
                      embeddings_file,
                      vocab_file,
                      splitc=' ',
                      **params):
    """
    Make embeddings from the files in data_dir.

    :data_dir: data dir to be processed
    :model_file: output object from FastText()
    :embeddings_file: numpy object file path from FastText()
    :vocab_file: item-to-index json dictionary
    :splitc: split char for the files in data_dir
    :**params: additional FastText() parameters
    """
    class SentenceGenerator(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            for fname in os.listdir(self.dirname):
                print("processing~ '{}'".format(fname))
                for line in bz2.open(os.path.join(self.dirname, fname), "rt"):
                    yield sent_to_spacing_chars(line.strip()).split(splitc)

    sentences = SentenceGenerator(data_dir)

    # train the FastText embedding on the generated sentences and save it
    model = FastText(sentences, **params)
    model.save(model_file)
    weights = model.wv.vectors
    # append a default (mean) vector for unknown items and a zero padding vector
    default_vec = np.mean(weights, axis=0, keepdims=True)
    padding_vec = np.zeros((1, weights.shape[1]))

    weights_default = np.concatenate([weights, default_vec, padding_vec],
                                     axis=0)

    np.save(open(embeddings_file, 'wb'), weights_default)

    vocab = dict([(k, v.index) for k, v in model.wv.vocab.items()])
    vocab['__PAD__'] = weights_default.shape[0] - 1
    with open(vocab_file, 'w') as f:
        f.write(json.dumps(vocab))


def load_embedding(embeddings_file):
    return np.load(embeddings_file)


def load_vocab(vocab_path):
    with open(vocab_path, 'r') as f:
        data = json.loads(f.read())
    word2idx = data
    idx2word = dict([(v, k) for k, v in data.items()])
    return word2idx, idx2word


def get_similar_char(word2idx_dic, model, jamo_model, text, try_cnt, OOV_CNT, HIT_CNT):
    """Find an in-vocabulary character similar to the OOV `text` via the jamo model."""
    OOV_CNT += 1
    jamo_text = jamo_sentence(text)
    similar_list = jamo_model.wv.most_similar(jamo_text)[:try_cnt]
    for char in similar_list:
        result = jamo_to_word(char[0])
        if result in word2idx_dic.keys():
            # hit: a similar in-vocabulary character was found
            HIT_CNT += 1
            return OOV_CNT, HIT_CNT, result

    # no hit: fall back to the word-level model's nearest neighbour
    return OOV_CNT, HIT_CNT, model.wv.most_similar(text)[0][0]


def encoding_and_padding(word2idx_dic, sequences, **params):
    """
    1. map items to indexes
    2. pad sequences

    :word2idx_dic: item-to-index dictionary
    :sequences: list of lists where each element is a sequence
    :maxlen: int, maximum length
    :dtype: type to cast the resulting sequence to
    :padding: 'pre' or 'post', pad either before or after each sequence
    :truncating: 'pre' or 'post', remove values from sequences larger than
        maxlen either at the beginning or at the end of the sequence
    :value: float, value to pad the sequences with
    """
    model_file = 'model/fasttext'
    jamo_model_path = 'jamo_model/fasttext'
    print('seq_idx start')
    model = FastText.load(model_file)
    jamo_model = FastText.load(jamo_model_path)
    seq_idx = []
    OOV_CNT = 0
    HIT_CNT = 0
    TOTAL_CNT = 0

    for word in tqdm(sequences):
        temp = []
        for char in word:
            TOTAL_CNT += 1
            if char in word2idx_dic.keys():
                temp.append(word2idx_dic[char])
            else:
                # OOV: look up a similar in-vocabulary character via the jamo model
                OOV_CNT, HIT_CNT, result = get_similar_char(
                    word2idx_dic, model, jamo_model, char, 3, OOV_CNT, HIT_CNT)
                temp.append(word2idx_dic[result])
        seq_idx.append(temp)
    print('TOTAL CNT: ', TOTAL_CNT, 'OOV CNT: ', OOV_CNT, 'HIT_CNT: ', HIT_CNT)
    if OOV_CNT > 0 and HIT_CNT > 0:
        print('OOV RATE: ', float(OOV_CNT) / TOTAL_CNT * 100, '%',
              'HIT RATE: ', float(HIT_CNT) / float(OOV_CNT) * 100, '%')

    params['value'] = word2idx_dic['__PAD__']
    return pad_sequences(seq_idx, **params)


def get_embedding_model(name='fee_prods', path='data/embedding'):
    weights = pkg_resources.resource_filename(
        'dsc', os.path.join(path, name, 'weights.np'))
    w2idx = pkg_resources.resource_filename(
        'dsc', os.path.join(path, name, 'idx.json'))
    return (load_embedding(weights), load_vocab(w2idx)[0])
import re

from soynlp.hangle import compose, decompose, character_is_korean

doublespace_pattern = re.compile(r'\s+')


def jamo_sentence(sent):
    """Decompose each Korean character in `sent` into jamo; '-' marks an empty slot."""
    def transform(char):
        if char == ' ':
            return char

        cjj = decompose(char)
        if len(cjj) == 1:
            return cjj

        cjj_ = ''.join(c if c != ' ' else '-' for c in cjj)
        return cjj_

    sent_ = []
    for char in sent:
        if character_is_korean(char):
            sent_.append(transform(char))
        else:
            sent_.append(char)
    sent_ = doublespace_pattern.sub(' ', ''.join(sent_))
    return sent_
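
# Example:
#   jamo_sentence('자연어')  ->  'ㅈㅏ-ㅇㅕㄴㅇㅓ-'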

def jamo_to_word(jamo):
    """Recompose a jamo sequence (3 jamo per Korean character) into a word."""
    jamo_list, idx = [], 0

    while idx < len(jamo):
        if not character_is_korean(jamo[idx]):
            jamo_list.append(jamo[idx])
            idx += 1
        else:
            jamo_list.append(jamo[idx:idx + 3])
            idx += 3

    word = ""
    for jamo_char in jamo_list:
        if len(jamo_char) == 1:
            word += jamo_char
        elif jamo_char[2] == "-":
            word += compose(jamo_char[0], jamo_char[1], " ")
        else:
            word += compose(jamo_char[0], jamo_char[1], jamo_char[2])

    return word
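
# Example:
#   jamo_to_word('ㅈㅏ-ㅇㅕㄴ')  ->  '자연'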

def break_char(jamo_seq):
    """Split a jamo string into per-character chunks (3 jamo per Korean character)."""
    idx = 0
    corpus = []

    while idx < len(jamo_seq):
        if not character_is_korean(jamo_seq[idx]):
            corpus.append(jamo_seq[idx])
            idx += 1
        else:
            corpus.append(jamo_seq[idx:idx + 3])
            idx += 3
    return corpus
# coding=utf-8
# Copyright 2020 Heewon Jeon. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


def sent_to_spacing_chars(sent):
    # spaces become ^
    chars = sent.strip().replace(' ', '^')

    # « marks the start of a sentence, » marks the end
    tagged_chars = "«" + chars + "»"

    # sentence -> space-separated character string
    char_list = ' '.join(list(tagged_chars))
    return char_list
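
# Example:
#   sent_to_spacing_chars('안녕 하세요')  ->  '« 안 녕 ^ 하 세 요 »'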