yomapi

submit train init

# ML-based Spacing Corrector
This model is an improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing"), using FastText instead of Word2Vec.

## Performances
| Model | Test Accuracy (%) | Encoding Time Cost |
| :------------: | :------------: | :------------: |
| TrainKoSpacing | 96.6147 | 02m 23s |
| 자모분해 (jamo decomposition) FastText | 98.9915 | 08h 20m 11s |
| 2-stage FastText | 99.0888 | 03m 23s |

## Data
#### Corpus

We mainly use the National Institute of Korean Language's 모두의 말뭉치 corpus and the National Information Society Agency's AI-Hub data. However, due to licensing restrictions, we cannot redistribute these datasets. You should be able to obtain them through the links below:
- [National Institute of Korean Language 모두의 말뭉치](https://corpus.korean.go.kr/)
- [National Information Society Agency AI-Hub](https://aihub.or.kr/aihub-data/natural-language/about "National Information Society Agency AI-Hub")

#### Data format
A bzip2-compressed file consisting of one sentence per line.

```
~/KoSpacing/data$ bzcat train.txt.bz2 | head
엠마누엘 웅가로 / 의상서 실내 장식품으로… 디자인 세계 넓혀
프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다.
웅가로는 침실과 식당, 욕실에서 사용하는 갖가지 직물제품을 디자인해 최근 파리의 갤러리 라파예트백화점에서 '색의 컬렉션'이라는 이름으로 전시회를 열었다.
```
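
If your corpus is a plain text file with one sentence per line, it can be compressed into this format with bzip2 (a minimal example; the paths are illustrative):

```bash
bzip2 -k data/train.txt   # writes data/train.txt.bz2 and keeps the original file
```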


## Architecture

### Model
![kosapcing_img](img/kosapcing_img.png)

### Word Embedding
#### 자모분해 (jamo decomposition)
To capture shape similarity between Korean characters, we use a 자모분해 (jamo decomposition) FastText word embedding, e.g.:

자연어처리
ㅈ ㅏ – ㅇ ㅕ ㄴ ㅇ ㅓ – ㅊ ㅓ – ㄹ ㅣ –
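
The decomposition above can be reproduced with soynlp's `decompose`, the same helper this repo uses in `utils/jamo_utils.py`. A minimal sketch, assuming Korean-syllable input (the function name `to_jamo` is illustrative):

```python
from soynlp.hangle import decompose

def to_jamo(word):
    out = []
    for char in word:
        cjj = decompose(char)  # (choseong, jungseong, jongseong)
        out.extend(c if c != ' ' else '-' for c in cjj)  # '-' marks an empty slot
    return ' '.join(out)

print(to_jamo('자연어처리'))  # ㅈ ㅏ - ㅇ ㅕ ㄴ ㅇ ㅓ - ㅊ ㅓ - ㄹ ㅣ -
```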

#### 2-stage FastText
Because 자모분해 (jamo decomposition) is slow to compute, the jamo-level FastText model is used only for out-of-vocabulary characters, as sketched below.
![2-stage-FastText_img](img/2-stage-FastText.png)
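
A condensed sketch of the two-stage lookup (the full logic lives in `get_similar_char` in `utils/embedding_maker.py`; the function name `char_to_idx` is illustrative):

```python
from utils.jamo_utils import jamo_sentence, jamo_to_word

def char_to_idx(char, word2idx, model, jamo_model, try_cnt=3):
    # model / jamo_model: trained gensim FastText models (word-level / jamo-level)
    if char in word2idx:                  # stage 1: in-vocabulary lookup
        return word2idx[char]
    jamo = jamo_sentence(char)            # stage 2: jamo-level FastText for OOV
    for similar, _ in jamo_model.wv.most_similar(jamo)[:try_cnt]:
        candidate = jamo_to_word(similar)
        if candidate in word2idx:
            return word2idx[candidate]
    # last resort, as in the repo: nearest neighbour from the word-level model
    return word2idx[model.wv.most_similar(char)[0][0]]
```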

### Thresholding
The middle part of the output probability distribution is close to uniform, so a naive cutoff there is unreliable.
![probability_distribution_of_output_vector](img/probability_distribution_of_output_vector.png)

We apply a log transform and use the second derivative to choose the threshold.
Result:
![Thresholding_result](img/Thresholding_result.png)
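
The repo's exact procedure is not shown here, so this is a minimal sketch of the described idea with illustrative names: sort the probabilities, take logs so the flat middle spreads out, and place the threshold at the sharpest bend found via the discrete second derivative.

```python
import numpy as np

def pick_threshold(probs):
    p = np.sort(np.asarray(probs))
    log_p = np.log(p + 1e-12)              # log transform spreads out the flat middle
    d2 = np.diff(log_p, n=2)               # discrete second derivative
    knee = int(np.argmax(np.abs(d2))) + 1  # index of the sharpest curvature change
    return p[knee]
```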


## How to Run

### Installation

- For training, a GPU is strongly recommended for speed. CPU is supported, but training can be extremely slow.
- Only Python 3.7 and above is supported.

### Requirement

- Python (>= 3.7)
- MXNet (>= 1.6.0)
- tqdm (>= 4.19.5)
- Pandas (>= 0.22.0)
- Gensim (>= 3.8.1)
- GluonNLP (>= 0.9.1)
- soynlp (>= 0.0.493)

### Dependencies

```bash
pip install -r requirements.txt
```

### Training

```bash
python train.py --train --train-samp-ratio 1.0 --num-epoch 50 --train_data data/train.txt.bz2 --test_data data/test.txt.bz2 --outputs train_log_to --model_type kospacing --model-file fasttext
```

### Evaluation

```bash
python train.py --model-params model/kospacing.params --model_type kospacing
sent > 중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다.
spaced sent[0.12sec/sent] > 중국은 2018년 평창동계올림픽의 반환점에 이르기까지 아직 노골드 행진이다.
```

### Directory
Directory layout for the embedding model files (bold entries are required):

- model
    - **fasttext**
    - fasttext_vis
    - **fasttext.trainables.vectors_ngrams_lockf.npy**
    - **fasttext.wv.vectors_ngrams.npy**
    - **kospacing_wv.np**
    - **w2idx.dic**

- jamo_model
    - **fasttext**
    - fasttext_vis
    - **fasttext.trainables.vectors_ngrams_lockf.npy**
    - **fasttext.wv.vectors_ngrams.npy**
    - **kospacing_wv.np**
    - **w2idx.dic**

### Reference
- TrainKoSpacing: https://github.com/haven-jeon/TrainKoSpacing
- Introduction to Natural Language Processing Using Deep Learning (딥 러닝을 이용한 자연어 처리 입문): https://wikidocs.net/book/2155

# coding=utf-8
# Copyright 2020 Heewon Jeon. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse

from utils.embedding_maker import create_embeddings

parser = argparse.ArgumentParser(description='Korean Autospacing Embedding Maker')

parser.add_argument('--num-iters', type=int, default=5,
                    help='number of iterations to train (default: 5)')

parser.add_argument('--min-count', type=int, default=100,
                    help='minimum word count to filter (default: 100)')

parser.add_argument('--embedding-size', type=int, default=100,
                    help='embedding dimension size (default: 100)')

parser.add_argument('--num-worker', type=int, default=16,
                    help='number of threads (default: 16)')

parser.add_argument('--window-size', type=int, default=8,
                    help='skip-gram window size (default: 8)')

parser.add_argument('--corpus_dir', type=str, default='data',
                    help='training resource dir')

parser.add_argument('--train', action='store_true', default=True,
                    help='do embedding training (default: True)')

parser.add_argument('--model-file', type=str, default='kospacing_wv.mdl',
                    help='output object from Word2Vec() (default: kospacing_wv.mdl)')

parser.add_argument('--numpy-wv', type=str, default='kospacing_wv.np',
                    help='numpy object file path from Word2Vec() (default: kospacing_wv.np)')

parser.add_argument('--w2idx', type=str, default='w2idx.dic',
                    help='item to index json dictionary (default: w2idx.dic)')

parser.add_argument('--model-dir', type=str, default='model',
                    help='dir to save models (default: model)')

opt = parser.parse_args()

if opt.train:
    create_embeddings(opt.corpus_dir,
                      opt.model_dir + '/' + opt.model_file,
                      opt.model_dir + '/' + opt.numpy_wv,
                      opt.model_dir + '/' + opt.w2idx,
                      min_count=opt.min_count,
                      iter=opt.num_iters,
                      size=opt.embedding_size,
                      workers=opt.num_worker,
                      window=opt.window_size)
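
# Example (assumed) invocation; the script's file name is not shown in this
# diff, so `embedding_maker.py` below is illustrative:
#   python embedding_maker.py --corpus_dir data --model-dir model --embedding-size 100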
File mode changed
File mode changed
absl-py==0.11.0
astunparse==1.6.3
cachetools==4.2.1
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
cmake==3.18.4.post1
Cython==0.29.21
Flask==1.1.2
Flask-Cors==3.0.9
flatbuffers==1.12
gast==0.3.3
gensim==3.8.3
gluonnlp==0.10.0
google-auth==1.26.1
google-auth-oauthlib==0.4.2
google-pasta==0.2.0
graphviz==0.8.4
grpcio==1.32.0
h5py==2.10.0
idna==2.10
importlib-metadata==3.4.0
itsdangerous==1.1.0
Jinja2==2.11.2
joblib==1.0.1
Keras==2.4.3
Keras-Preprocessing==1.1.2
Markdown==3.3.3
MarkupSafe==1.1.1
mxnet-cu101==1.7.0
mxnet-cu101mkl==1.6.0.post0
mxnet-mkl==1.6.0
numpy==1.19.5
oauthlib==3.1.0
opt-einsum==3.3.0
packaging==20.9
pandas==1.2.2
protobuf==3.14.0
psutil==5.8.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2020.5
PyYAML==5.3.1
requests==2.25.1
requests-oauthlib==1.3.0
rsa==4.6
scikit-learn==0.24.1
scipy==1.6.0
six==1.15.0
smart-open==4.0.1
soynlp==0.0.493
tensorboard==2.4.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.4.1
tensorflow-estimator==2.4.0
termcolor==1.1.0
threadpoolctl==2.1.0
tqdm==4.56.0
typing-extensions==3.7.4.3
urllib3==1.26.3
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.4.0
__all__ = [
    'create_embeddings', 'load_embedding', 'load_vocab',
    'encoding_and_padding', 'get_embedding_model'
]

import bz2
import json
import os

import numpy as np
import pkg_resources
from gensim.models import FastText
from tqdm import tqdm

from utils.jamo_utils import jamo_sentence, jamo_to_word
from utils.spacing_utils import sent_to_spacing_chars


def pad_sequences(sequences,
                  maxlen=None,
                  dtype='int32',
                  padding='pre',
                  truncating='pre',
                  value=0.):
    """Pad each sequence to maxlen, Keras-style ('pre'/'post' padding and truncating)."""
    if not hasattr(sequences, '__len__'):
        raise ValueError('`sequences` must be iterable.')
    lengths = []
    for x in sequences:
        if not hasattr(x, '__len__'):
            raise ValueError('`sequences` must be a list of iterables. '
                             'Found non-iterable: ' + str(x))
        lengths.append(len(x))

    num_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    # take the sample shape from the first non-empty sequence,
    # checking for consistency in the main loop below
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break

    x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if not len(s):
            continue  # empty list/array was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' %
                             truncating)

        # check that `trunc` has the expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError(
                'Shape of sample %s of sequence at position %s is different from expected shape %s'
                % (trunc.shape[1:], idx, sample_shape))

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)
    return x
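
# Example:
#   pad_sequences([[1, 2], [3]], maxlen=3, padding='post', value=0)
#   -> array([[1, 2, 0], [3, 0, 0]], dtype=int32)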


def create_embeddings(data_dir,
                      model_file,
                      embeddings_file,
                      vocab_file,
                      splitc=' ',
                      **params):
    """
    Make embeddings from the files in data_dir.

    :data_dir: data dir to be processed
    :model_file: output object from FastText()
    :embeddings_file: numpy object file path from FastText()
    :vocab_file: item-to-index json dictionary
    :splitc: split char for the files in data_dir
    :**params: additional FastText() parameters
    """
    class SentenceGenerator(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            for fname in os.listdir(self.dirname):
                print("processing~ '{}'".format(fname))
                for line in bz2.open(os.path.join(self.dirname, fname), "rt"):
                    yield sent_to_spacing_chars(line.strip()).split(splitc)

    sentences = SentenceGenerator(data_dir)

    # train the FastText embedding on the generated sentences and save it
    model = FastText(sentences, **params)
    model.save(model_file)
    weights = model.wv.vectors
    # append a default (mean) vector for unknown items and a zero padding vector
    default_vec = np.mean(weights, axis=0, keepdims=True)
    padding_vec = np.zeros((1, weights.shape[1]))

    weights_default = np.concatenate([weights, default_vec, padding_vec],
                                     axis=0)

    np.save(open(embeddings_file, 'wb'), weights_default)

    vocab = dict([(k, v.index) for k, v in model.wv.vocab.items()])
    vocab['__PAD__'] = weights_default.shape[0] - 1
    with open(vocab_file, 'w') as f:
        f.write(json.dumps(vocab))


def load_embedding(embeddings_file):
    return np.load(embeddings_file)


def load_vocab(vocab_path):
    with open(vocab_path, 'r') as f:
        data = json.loads(f.read())
    word2idx = data
    idx2word = dict([(v, k) for k, v in data.items()])
    return word2idx, idx2word


def get_similar_char(word2idx_dic, model, jamo_model, text, try_cnt, OOV_CNT, HIT_CNT):
    """Find an in-vocabulary character similar to the OOV `text` via the jamo model."""
    OOV_CNT += 1
    jamo_text = jamo_sentence(text)
    similar_list = jamo_model.wv.most_similar(jamo_text)[:try_cnt]
    for char in similar_list:
        result = jamo_to_word(char[0])
        if result in word2idx_dic.keys():
            # hit: a similar in-vocabulary character was found
            HIT_CNT += 1
            return OOV_CNT, HIT_CNT, result

    # no hit: fall back to the word-level model's nearest neighbour
    return OOV_CNT, HIT_CNT, model.wv.most_similar(text)[0][0]


def encoding_and_padding(word2idx_dic, sequences, **params):
    """
    1. map items to indexes
    2. pad sequences

    :word2idx_dic: item-to-index dictionary
    :sequences: list of lists where each element is a sequence
    :maxlen: int, maximum length
    :dtype: type to cast the resulting sequence to
    :padding: 'pre' or 'post', pad either before or after each sequence
    :truncating: 'pre' or 'post', remove values from sequences larger than
        maxlen either at the beginning or at the end of the sequence
    :value: float, value to pad the sequences with
    """
    model_file = 'model/fasttext'
    jamo_model_path = 'jamo_model/fasttext'
    print('seq_idx start')
    model = FastText.load(model_file)
    jamo_model = FastText.load(jamo_model_path)
    seq_idx = []
    OOV_CNT = 0
    HIT_CNT = 0
    TOTAL_CNT = 0

    for word in tqdm(sequences):
        temp = []
        for char in word:
            TOTAL_CNT += 1
            if char in word2idx_dic.keys():
                temp.append(word2idx_dic[char])
            else:
                # OOV: look up a similar in-vocabulary character via the jamo model
                OOV_CNT, HIT_CNT, result = get_similar_char(
                    word2idx_dic, model, jamo_model, char, 3, OOV_CNT, HIT_CNT)
                temp.append(word2idx_dic[result])
        seq_idx.append(temp)
    print('TOTAL CNT: ', TOTAL_CNT, 'OOV CNT: ', OOV_CNT, 'HIT_CNT: ', HIT_CNT)
    if OOV_CNT > 0 and HIT_CNT > 0:
        print('OOV RATE: ', float(OOV_CNT) / TOTAL_CNT * 100, '%',
              'HIT RATE: ', float(HIT_CNT) / float(OOV_CNT) * 100, '%')

    params['value'] = word2idx_dic['__PAD__']
    return pad_sequences(seq_idx, **params)


def get_embedding_model(name='fee_prods', path='data/embedding'):
    weights = pkg_resources.resource_filename(
        'dsc', os.path.join(path, name, 'weights.np'))
    w2idx = pkg_resources.resource_filename(
        'dsc', os.path.join(path, name, 'idx.json'))
    return (load_embedding(weights), load_vocab(w2idx)[0])
import re

from soynlp.hangle import compose, decompose, character_is_korean

doublespace_pattern = re.compile(r'\s+')


def jamo_sentence(sent):
    """Decompose each Korean character in `sent` into jamo; '-' marks an empty slot."""
    def transform(char):
        if char == ' ':
            return char

        cjj = decompose(char)
        if len(cjj) == 1:
            return cjj

        cjj_ = ''.join(c if c != ' ' else '-' for c in cjj)
        return cjj_

    sent_ = []
    for char in sent:
        if character_is_korean(char):
            sent_.append(transform(char))
        else:
            sent_.append(char)
    sent_ = doublespace_pattern.sub(' ', ''.join(sent_))
    return sent_
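
# Example:
#   jamo_sentence('자연어')  ->  'ㅈㅏ-ㅇㅕㄴㅇㅓ-'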

def jamo_to_word(jamo):
    """Recompose a jamo sequence (3 jamo per Korean character) into a word."""
    jamo_list, idx = [], 0

    while idx < len(jamo):
        if not character_is_korean(jamo[idx]):
            jamo_list.append(jamo[idx])
            idx += 1
        else:
            jamo_list.append(jamo[idx:idx + 3])
            idx += 3

    word = ""
    for jamo_char in jamo_list:
        if len(jamo_char) == 1:
            word += jamo_char
        elif jamo_char[2] == "-":
            word += compose(jamo_char[0], jamo_char[1], " ")
        else:
            word += compose(jamo_char[0], jamo_char[1], jamo_char[2])

    return word
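
# Example:
#   jamo_to_word('ㅈㅏ-ㅇㅕㄴ')  ->  '자연'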

def break_char(jamo_seq):
    """Split a jamo string into per-character chunks (3 jamo per Korean character)."""
    idx = 0
    corpus = []

    while idx < len(jamo_seq):
        if not character_is_korean(jamo_seq[idx]):
            corpus.append(jamo_seq[idx])
            idx += 1
        else:
            corpus.append(jamo_seq[idx:idx + 3])
            idx += 3
    return corpus
# coding=utf-8
# Copyright 2020 Heewon Jeon. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


def sent_to_spacing_chars(sent):
    # spaces become ^
    chars = sent.strip().replace(' ', '^')

    # « marks the start of a sentence, » marks the end
    tagged_chars = "«" + chars + "»"

    # sentence -> space-separated character string
    char_list = ' '.join(list(tagged_chars))
    return char_list
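
# Example:
#   sent_to_spacing_chars('안녕 하세요')  ->  '« 안 녕 ^ 하 세 요 »'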