Showing 19 changed files with 527 additions and 0 deletions
1 | +# ML-based Spacing Corrector | ||
2 | +This model is an improved version of [TrainKoSpacing](https://github.com/haven-jeon/TrainKoSpacing "TrainKoSpacing") that uses FastText instead of Word2Vec. | ||
3 | + | ||
4 | +## Performance | ||
5 | +| Model | Test Accuracy (%) | Encoding Time | | ||
6 | +| :------------: | :------------: | :------------: | | ||
7 | +| TrainKoSpacing | 96.6147 | 02m 23s | | ||
8 | +| Jamo-decomposed (자모분해) FastText | 98.9915 | 08h 20m 11s | | ||
9 | +| 2 Stage FastText | 99.0888 | 03m 23s | | ||
10 | + | ||
11 | +## Data | ||
12 | +#### Corpus | ||
13 | + | ||
14 | +We mainly use the National Institute of Korean Language 모두의 말뭉치 corpus and the National Information Society Agency AI-Hub data. However, due to licensing restrictions, we cannot redistribute these datasets. You should be able to obtain them through the links below: | ||
15 | +[National Institute of Korean Language 모두의 말뭉치](https://corpus.korean.go.kr/) | ||
16 | +[National Information Society Agency AI-Hub](https://aihub.or.kr/aihub-data/natural-language/about "National Information Society Agency AI-Hub") | ||
17 | + | ||
18 | +#### Data format | ||
19 | +A bzip2-compressed file containing one sentence per line. | ||
20 | + | ||
21 | +``` | ||
22 | +~/KoSpacing/data$ bzcat train.txt.bz2 | head | ||
23 | +엠마누엘 웅가로 / 의상서 실내 장식품으로… 디자인 세계 넓혀 | ||
24 | +프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다. | ||
25 | +웅가로는 침실과 식당, 욕실에서 사용하는 갖가지 직물제품을 디자인해 최근 파리의 갤러리 라파예트백화점에서 '색의 컬렉션'이라는 이름으로 전시회를 열었다. | ||
26 | +``` | ||
27 | + | ||
28 | + | ||
29 | +## Architecture | ||
30 | + | ||
31 | +### Model | ||
32 | +![kosapcing_img](img/kosapcing_img.png) | ||
33 | + | ||
34 | +### Word Embedding | ||
35 | +#### Jamo decomposition (자모분해) | ||
36 | +To find Korean characters with a similar shape, use a FastText word embedding trained on jamo-decomposed (자모분해) text. | ||
37 | +Example: | ||
38 | +자연어처리 | ||
39 | +ㅈ ㅏ - ㅇ ㅕ ㄴ ㅇ ㅓ - ㅊ ㅓ - ㄹ ㅣ - | ||
40 | + | ||
41 | +#### 2 stage FastText | ||
42 | +Because jamo decomposition (자모분해) is slow, the jamo-decomposed FastText model is used only for out-of-vocabulary characters. | ||
43 | +![2-stage-FastText_img](img/2-stage-FastText.png) | ||
44 | + | ||
45 | +### Thresholding | ||
46 | +Thresholding is applied because the middle part of the output distribution is evenly distributed. | ||
47 | +![probability_distribution_of_output_vector](img/probability_distribution_of_output_vector.png) | ||
48 | + | ||
49 | +The threshold is found using a log transform and the second derivative. | ||
50 | +Result: | ||
51 | +![Thresholding_result](img/Thresholding_result.png) | ||
52 | + | ||
53 | + | ||
54 | + | ||
55 | +## How to Run | ||
56 | + | ||
57 | + | ||
58 | +### Installation | ||
59 | + | ||
60 | +- For training, a GPU is strongly recommended for speed. CPU is supported but training could be extremely slow. | ||
61 | +- Only Python 3.7 or later is supported. | ||
62 | +### Requirements | ||
63 | + | ||
64 | +- Python (>= 3.7) | ||
65 | +- MXNet (>= 1.6.0) | ||
66 | +- tqdm (>= 4.19.5) | ||
67 | +- Pandas (>= 0.22.0) | ||
68 | +- Gensim (>= 3.8.1) | ||
69 | +- GluonNLP (>= 0.9.1) | ||
70 | +- soynlp (>= 0.0.493) | ||
71 | + | ||
72 | +### Dependencies | ||
73 | + | ||
74 | +```bash | ||
75 | +pip install -r requirements.txt | ||
76 | +``` | ||
77 | + | ||
78 | +### Training | ||
79 | + | ||
80 | +```bash | ||
81 | +python train.py --train --train-samp-ratio 1.0 --num-epoch 50 --train_data data/train.txt.bz2 --test_data data/test.txt.bz2 --outputs train_log_to --model_type kospacing --model-file fasttext | ||
82 | +``` | ||
83 | + | ||
84 | +### Evaluation | ||
85 | + | ||
86 | +```bash | ||
87 | +python train.py --model-params model/kospacing.params --model_type kospacing | ||
88 | +sent > 중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다. | ||
89 | +중국은2018년평창동계올림픽의반환점에이르기까지아직노골드행진이다. | ||
90 | +spaced sent[0.12sec/sent] > 중국은 2018년 평창동계올림픽의 반환점에 이르기까지 아직 노골드 행진이다. | ||
91 | +``` | ||
92 | + | ||
93 | +### Directory | ||
94 | +Directory layout for the embedding model files. | ||
95 | +Entries in bold are required. | ||
96 | + | ||
97 | +- model | ||
98 | +  - **fasttext** | ||
99 | +  - fasttext_vis | ||
100 | +  - **fasttext.trainables.vectors_ngrams_lockf.npy** | ||
101 | +  - **fasttext.wv.vectors_ngrams.npy** | ||
102 | +  - **kospacing_wv.np** | ||
103 | +  - **w2idx.dic** | ||
104 | + | ||
105 | +- jamo_model | ||
106 | +  - **fasttext** | ||
107 | +  - fasttext_vis | ||
108 | +  - **fasttext.trainables.vectors_ngrams_lockf.npy** | ||
109 | +  - **fasttext.wv.vectors_ngrams.npy** | ||
110 | +  - **kospacing_wv.np** | ||
111 | +  - **w2idx.dic** | ||
112 | + | ||
113 | +### References | ||
114 | +TrainKoSpacing: https://github.com/haven-jeon/TrainKoSpacing | ||
115 | +Introduction to Natural Language Processing Using Deep Learning (딥 러닝을 이용한 자연어 처리 입문): https://wikidocs.net/book/2155 | ||
116 | + | ... | ... |
img/2-stage-FastText.png
0 → 100644
53.5 KB
img/Thresholding_result.png
0 → 100644
365 KB
img/kosapcing_img.png
0 → 100644
209 KB
32.1 KB
train/LICENSE
0 → 100644
This diff is collapsed. Click to expand it.
train/data/example.txt.bz2
0 → 100644
No preview for this file type
train/embedding.py
0 → 100644
1 | +# coding=utf-8 | ||
2 | +# Copyright 2020 Heewon Jeon. All rights reserved. | ||
3 | +# | ||
4 | +# Licensed under the Apache License, Version 2.0 (the "License"); | ||
5 | +# you may not use this file except in compliance with the License. | ||
6 | +# You may obtain a copy of the License at | ||
7 | +# | ||
8 | +# http://www.apache.org/licenses/LICENSE-2.0 | ||
9 | +# | ||
10 | +# Unless required by applicable law or agreed to in writing, software | ||
11 | +# distributed under the License is distributed on an "AS IS" BASIS, | ||
12 | +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
13 | +# See the License for the specific language governing permissions and | ||
14 | +# limitations under the License. | ||
15 | + | ||
16 | +import argparse | ||
17 | +from utils.embedding_maker import create_embeddings | ||
18 | + | ||
19 | + | ||
20 | +parser = argparse.ArgumentParser(description='Korean Autospacing Embedding Maker') | ||
21 | + | ||
22 | +parser.add_argument('--num-iters', type=int, default=5, | ||
23 | +                    help='number of iterations to train (default: 5)') | ||
24 | + | ||
25 | +parser.add_argument('--min-count', type=int, default=100, | ||
26 | +                    help='minimum word count to keep (default: 100)') | ||
27 | + | ||
28 | +parser.add_argument('--embedding-size', type=int, default=100, | ||
29 | +                    help='embedding dimension size (default: 100)') | ||
30 | + | ||
31 | +parser.add_argument('--num-worker', type=int, default=16, | ||
32 | +                    help='number of threads (default: 16)') | ||
33 | + | ||
34 | +parser.add_argument('--window-size', type=int, default=8, | ||
35 | +                    help='skip-gram window size (default: 8)') | ||
36 | + | ||
37 | +parser.add_argument('--corpus_dir', type=str, default='data', | ||
38 | +                    help='training resource dir (default: data)') | ||
39 | + | ||
40 | +parser.add_argument('--train', action='store_true', default=True, | ||
41 | +                    help='do embedding training (default: True)') | ||
42 | + | ||
43 | +parser.add_argument('--model-file', type=str, default='kospacing_wv.mdl', | ||
44 | +                    help='FastText model file (default: kospacing_wv.mdl)') | ||
45 | + | ||
46 | +parser.add_argument('--numpy-wv', type=str, default='kospacing_wv.np', | ||
47 | +                    help='numpy embedding matrix file path (default: kospacing_wv.np)') | ||
48 | + | ||
49 | +parser.add_argument('--w2idx', type=str, default='w2idx.dic', | ||
50 | +                    help='item to index json dictionary (default: w2idx.dic)') | ||
51 | + | ||
52 | +parser.add_argument('--model-dir', type=str, default='model', | ||
53 | +                    help='dir to save models (default: model)') | ||
54 | + | ||
55 | +opt = parser.parse_args() | ||
56 | + | ||
57 | +if opt.train: | ||
58 | +    create_embeddings(opt.corpus_dir, opt.model_dir + '/' + | ||
59 | +                      opt.model_file, opt.model_dir + '/' + opt.numpy_wv, | ||
60 | +                      opt.model_dir + '/' + opt.w2idx, min_count=opt.min_count, | ||
61 | +                      iter=opt.num_iters, | ||
62 | +                      size=opt.embedding_size, workers=opt.num_worker, window=opt.window_size) |
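One detail of the argument parser in `embedding.py` is worth illustrating: `--train` is declared with `action='store_true'` and `default=True`, so the value is `True` whether or not the flag is passed. The sketch below reproduces that behavior with the standard `argparse` module and shows one possible alternative. The `--no-train` flag is a hypothetical addition for illustration and is not part of this merge request.

```python
import argparse

# Same declaration style as embedding.py (only three arguments reproduced).
parser = argparse.ArgumentParser()
parser.add_argument('--train', action='store_true', default=True)
parser.add_argument('--num-iters', type=int, default=5)
parser.add_argument('--corpus_dir', type=str, default='data')

without_flag = parser.parse_args([])
with_flag = parser.parse_args(['--train', '--num-iters', '10'])

# Hyphens in option names become underscores in attribute names.
iters = with_flag.num_iters

# Hypothetical alternative: keep training on by default, allow opting out.
alt = argparse.ArgumentParser()
alt.add_argument('--no-train', dest='train', action='store_false')
alt_default = alt.parse_args([])
alt_disabled = alt.parse_args(['--no-train'])
```

With the original declaration there is no command-line way to skip training, which may be intended here since the script does nothing else.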
train/jamo_model/.gitignore
0 → 100644
File mode changed
train/model/.gitignore
0 → 100644
File mode changed
train/output/.gitignore
0 → 100644
File mode changed
train/requirements.txt
0 → 100644
1 | +absl-py==0.11.0 | ||
2 | +astunparse==1.6.3 | ||
3 | +cachetools==4.2.1 | ||
4 | +certifi==2020.12.5 | ||
5 | +chardet==4.0.0 | ||
6 | +click==7.1.2 | ||
7 | +cmake==3.18.4.post1 | ||
8 | +Cython==0.29.21 | ||
9 | +Flask==1.1.2 | ||
10 | +Flask-Cors==3.0.9 | ||
11 | +flatbuffers==1.12 | ||
12 | +gast==0.3.3 | ||
13 | +gensim==3.8.3 | ||
14 | +gluonnlp==0.10.0 | ||
15 | +google-auth==1.26.1 | ||
16 | +google-auth-oauthlib==0.4.2 | ||
17 | +google-pasta==0.2.0 | ||
18 | +graphviz==0.8.4 | ||
19 | +grpcio==1.32.0 | ||
20 | +h5py==2.10.0 | ||
21 | +idna==2.10 | ||
22 | +importlib-metadata==3.4.0 | ||
23 | +itsdangerous==1.1.0 | ||
24 | +Jinja2==2.11.2 | ||
25 | +joblib==1.0.1 | ||
26 | +Keras==2.4.3 | ||
27 | +Keras-Preprocessing==1.1.2 | ||
28 | +Markdown==3.3.3 | ||
29 | +MarkupSafe==1.1.1 | ||
30 | +mxnet-cu101==1.7.0 | ||
31 | +mxnet-cu101mkl==1.6.0.post0 | ||
32 | +mxnet-mkl==1.6.0 | ||
33 | +numpy==1.19.5 | ||
34 | +oauthlib==3.1.0 | ||
35 | +opt-einsum==3.3.0 | ||
36 | +packaging==20.9 | ||
37 | +pandas==1.2.2 | ||
38 | +protobuf==3.14.0 | ||
39 | +psutil==5.8.0 | ||
40 | +pyasn1==0.4.8 | ||
41 | +pyasn1-modules==0.2.8 | ||
42 | +pyparsing==2.4.7 | ||
43 | +python-dateutil==2.8.1 | ||
44 | +pytz==2020.5 | ||
45 | +PyYAML==5.3.1 | ||
46 | +requests==2.25.1 | ||
47 | +requests-oauthlib==1.3.0 | ||
48 | +rsa==4.6 | ||
49 | +scikit-learn==0.24.1 | ||
50 | +scipy==1.6.0 | ||
51 | +six==1.15.0 | ||
52 | +smart-open==4.0.1 | ||
53 | +soynlp==0.0.493 | ||
54 | +tensorboard==2.4.0 | ||
55 | +tensorboard-plugin-wit==1.7.0 | ||
56 | +tensorflow==2.4.1 | ||
57 | +tensorflow-estimator==2.4.0 | ||
58 | +termcolor==1.1.0 | ||
59 | +threadpoolctl==2.1.0 | ||
60 | +tqdm==4.56.0 | ||
61 | +typing-extensions==3.7.4.3 | ||
62 | +urllib3==1.26.3 | ||
63 | +Werkzeug==1.0.1 | ||
64 | +wrapt==1.12.1 | ||
65 | +zipp==3.4.0 |
train/train.py
0 → 100644
This diff is collapsed. Click to expand it.
train/utils/embedding_maker.py
0 → 100644
1 | +__all__ = [ | ||
2 | + 'create_embeddings', 'load_embedding', 'load_vocab', | ||
3 | + 'encoding_and_padding', 'get_embedding_model' | ||
4 | +] | ||
5 | + | ||
6 | +import bz2 | ||
7 | +import json | ||
8 | +import os | ||
9 | + | ||
10 | +import numpy as np | ||
11 | +import pkg_resources | ||
12 | +from gensim.models import FastText | ||
13 | + | ||
14 | +from utils.spacing_utils import sent_to_spacing_chars | ||
15 | +from tqdm import tqdm | ||
16 | +from utils.jamo_utils import jamo_sentence, jamo_to_word | ||
17 | + | ||
18 | +def pad_sequences(sequences, | ||
19 | +                  maxlen=None, | ||
20 | +                  dtype='int32', | ||
21 | +                  padding='pre', | ||
22 | +                  truncating='pre', | ||
23 | +                  value=0.): | ||
24 | + | ||
25 | +    if not hasattr(sequences, '__len__'): | ||
26 | +        raise ValueError('`sequences` must be iterable.') | ||
27 | +    lengths = [] | ||
28 | +    for x in sequences: | ||
29 | +        if not hasattr(x, '__len__'): | ||
30 | +            raise ValueError('`sequences` must be a list of iterables. ' | ||
31 | +                             'Found non-iterable: ' + str(x)) | ||
32 | +        lengths.append(len(x)) | ||
33 | + | ||
34 | +    num_samples = len(sequences) | ||
35 | +    if maxlen is None: | ||
36 | +        maxlen = np.max(lengths) | ||
37 | + | ||
38 | +    # take the sample shape from the first non empty sequence | ||
39 | +    # checking for consistency in the main loop below. | ||
40 | +    sample_shape = tuple() | ||
41 | +    for s in sequences: | ||
42 | +        if len(s) > 0: | ||
43 | +            sample_shape = np.asarray(s).shape[1:] | ||
44 | +            break | ||
45 | + | ||
46 | +    x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype) | ||
47 | +    for idx, s in enumerate(sequences): | ||
48 | +        if not len(s): | ||
49 | +            continue  # empty list/array was found | ||
50 | +        if truncating == 'pre': | ||
51 | +            trunc = s[-maxlen:] | ||
52 | +        elif truncating == 'post': | ||
53 | +            trunc = s[:maxlen] | ||
54 | +        else: | ||
55 | +            raise ValueError('Truncating type "%s" not understood' % | ||
56 | +                             truncating) | ||
57 | + | ||
58 | +        # check `trunc` has expected shape | ||
59 | +        trunc = np.asarray(trunc, dtype=dtype) | ||
60 | +        if trunc.shape[1:] != sample_shape: | ||
61 | +            raise ValueError( | ||
62 | +                'Shape of sample %s of sequence at position %s is different from expected shape %s' | ||
63 | +                % (trunc.shape[1:], idx, sample_shape)) | ||
64 | + | ||
65 | +        if padding == 'post': | ||
66 | +            x[idx, :len(trunc)] = trunc | ||
67 | +        elif padding == 'pre': | ||
68 | +            x[idx, -len(trunc):] = trunc | ||
69 | +        else: | ||
70 | +            raise ValueError('Padding type "%s" not understood' % padding) | ||
71 | +    return x | ||
72 | + | ||
73 | + | ||
74 | +def create_embeddings(data_dir, | ||
75 | +                      model_file, | ||
76 | +                      embeddings_file, | ||
77 | +                      vocab_file, | ||
78 | +                      splitc=' ', | ||
79 | +                      **params): | ||
80 | +    """ | ||
81 | +    Make the embedding matrix and vocabulary from a saved FastText model. | ||
82 | +    :**params additional embedding model parameters | ||
83 | +    :splitc char for splitting in data_dir files | ||
84 | +    :model_file saved FastText model file | ||
85 | +    :data_dir data dir to be processed | ||
86 | +    :embeddings_file numpy embedding matrix file path | ||
87 | +    :vocab_file item to index json dictionary | ||
88 | +    """ | ||
89 | +    class SentenceGenerator(object): | ||
90 | +        def __init__(self, dirname): | ||
91 | +            self.dirname = dirname | ||
92 | + | ||
93 | +        def __iter__(self): | ||
94 | +            for fname in os.listdir(self.dirname): | ||
95 | +                print("processing~ '{}'".format(fname)) | ||
96 | +                for line in bz2.open(os.path.join(self.dirname, fname), "rt"): | ||
97 | +                    yield sent_to_spacing_chars(line.strip()).split(splitc) | ||
98 | + | ||
99 | +    sentences = SentenceGenerator(data_dir) | ||
100 | + | ||
101 | +    model = FastText.load(model_file)  # loads an existing model; `sentences` and `params` are not used | ||
102 | +    model.save(model_file) | ||
103 | +    weights = model.wv.syn0 | ||
104 | +    default_vec = np.mean(weights, axis=0, keepdims=True) | ||
105 | +    padding_vec = np.zeros((1, weights.shape[1])) | ||
106 | + | ||
107 | +    weights_default = np.concatenate([weights, default_vec, padding_vec], | ||
108 | +                                     axis=0) | ||
109 | + | ||
110 | +    np.save(open(embeddings_file, 'wb'), weights_default) | ||
111 | + | ||
112 | +    vocab = dict([(k, v.index) for k, v in model.wv.vocab.items()]) | ||
113 | +    vocab['__PAD__'] = weights_default.shape[0] - 1 | ||
114 | +    with open(vocab_file, 'w') as f: | ||
115 | +        f.write(json.dumps(vocab)) | ||
116 | + | ||
117 | + | ||
118 | +def load_embedding(embeddings_file): | ||
119 | +    return (np.load(embeddings_file)) | ||
120 | + | ||
121 | + | ||
122 | +def load_vocab(vocab_path): | ||
123 | +    with open(vocab_path, 'r') as f: | ||
124 | +        data = json.loads(f.read()) | ||
125 | +    word2idx = data | ||
126 | +    idx2word = dict([(v, k) for k, v in data.items()]) | ||
127 | +    return word2idx, idx2word | ||
128 | + | ||
129 | +def get_similar_char(word2idx_dic, model, jamo_model, text, try_cnt, OOV_CNT, HIT_CNT): | ||
130 | +    OOV_CNT += 1 | ||
131 | +    jamo_text = jamo_sentence(text) | ||
132 | +    similar_list = jamo_model.wv.most_similar(jamo_text)[:try_cnt] | ||
133 | +    for char in similar_list: | ||
134 | +        result = jamo_to_word(char[0]) | ||
135 | + | ||
136 | +        if result in word2idx_dic: | ||
137 | +            # print('#' * 20) | ||
138 | +            # print('hit') | ||
139 | +            # print('origin: ', text, 'result: ', result) | ||
140 | +            HIT_CNT += 1 | ||
141 | +            return OOV_CNT, HIT_CNT, result | ||
142 | + | ||
143 | +    # print('#' * 20) | ||
144 | +    # print('no hit') | ||
145 | +    # print('origin: ', text) | ||
146 | +    return OOV_CNT, HIT_CNT, model.wv.most_similar(text)[0][0] | ||
147 | + | ||
148 | + | ||
149 | +def encoding_and_padding(word2idx_dic, sequences, **params): | ||
150 | +    """ | ||
151 | +    1. making item to idx | ||
152 | +    2. padding | ||
153 | +    :word2idx_dic: dict mapping each character to its index | ||
154 | +    :sequences: list of lists where each element is a sequence | ||
155 | +    :maxlen: int, maximum length | ||
156 | +    :dtype: type to cast the resulting sequence. | ||
157 | +    :padding: 'pre' or 'post', pad either before or after each sequence. | ||
158 | +    :truncating: 'pre' or 'post', remove values from sequences larger than | ||
159 | +        maxlen either in the beginning or in the end of the sequence | ||
160 | +    :value: float, value to pad the sequences to the desired value. | ||
161 | +    """ | ||
162 | +    model_file = 'model/fasttext' | ||
163 | +    jamo_model_path = 'jamo_model/fasttext' | ||
164 | +    print('seq_idx start') | ||
165 | +    model = FastText.load(model_file) | ||
166 | +    jamo_model = FastText.load(jamo_model_path) | ||
167 | +    seq_idx = [] | ||
168 | +    OOV_CNT = 0 | ||
169 | +    HIT_CNT = 0 | ||
170 | +    TOTAL_CNT = 0 | ||
171 | + | ||
172 | +    for word in tqdm(sequences): | ||
173 | +        temp = [] | ||
174 | +        for char in word: | ||
175 | +            TOTAL_CNT += 1 | ||
176 | +            if char in word2idx_dic: | ||
177 | +                temp.append(word2idx_dic[char]) | ||
178 | +            else: | ||
179 | +                OOV_CNT, HIT_CNT, result = get_similar_char(word2idx_dic, model, jamo_model, char, 3, OOV_CNT, HIT_CNT) | ||
180 | +                temp.append(word2idx_dic[result]) | ||
181 | +        seq_idx.append(temp) | ||
182 | +    print('TOTAL CNT: ', TOTAL_CNT, 'OOV CNT: ', OOV_CNT, 'HIT_CNT: ', HIT_CNT) | ||
183 | +    if OOV_CNT > 0 and HIT_CNT > 0: | ||
184 | +        print('OOV RATE:', float(OOV_CNT) / TOTAL_CNT * 100, '%', 'HIT_RATE: ', float(HIT_CNT) / float(OOV_CNT) * 100, '%') | ||
185 | + | ||
186 | +    params['value'] = word2idx_dic['__PAD__'] | ||
187 | +    return (pad_sequences(seq_idx, **params)) | ||
188 | + | ||
189 | + | ||
190 | +def get_embedding_model(name='fee_prods', path='data/embedding'): | ||
191 | +    weights = pkg_resources.resource_filename( | ||
192 | +        'dsc', os.path.join(path, name, 'weights.np')) | ||
193 | +    w2idx = pkg_resources.resource_filename( | ||
194 | +        'dsc', os.path.join(path, name, 'idx.json')) | ||
195 | +    return ((load_embedding(weights), load_vocab(w2idx)[0])) |
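The default `padding='pre'` and `truncating='pre'` behavior of `pad_sequences` in `train/utils/embedding_maker.py` can be easy to misread. The following is a simplified pure-Python sketch of the same semantics for flat lists of indices. `pad_ids` is a hypothetical helper written for illustration, and it assumes `maxlen` is a positive integer.

```python
def pad_ids(sequences, maxlen, value, padding="pre", truncating="pre"):
    """Pad or truncate each list of indices to exactly `maxlen` items.

    'pre' truncation drops items from the start (keeps the tail);
    'pre' padding inserts `value` before the kept items.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Keep the last maxlen items for 'pre', the first maxlen for 'post'.
        kept = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(kept))
        padded.append(fill + kept if padding == "pre" else kept + fill)
    return padded


pre_result = pad_ids([[1, 2, 3], [4]], maxlen=2, value=0)
post_result = pad_ids([[1, 2, 3], [4]], maxlen=2, value=0,
                      padding="post", truncating="post")
```

In `encoding_and_padding` the fill value is the `__PAD__` index rather than 0, so padded positions map to the all-zero row appended by `create_embeddings`.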
train/utils/jamo_utils.py
0 → 100644
1 | +import re | ||
2 | +from soynlp.hangle import compose, decompose, character_is_korean | ||
3 | + | ||
4 | + | ||
5 | +doublespace_pattern = re.compile(r'\s+') | ||
6 | + | ||
7 | +def jamo_sentence(sent): | ||
8 | +    def transform(char): | ||
9 | +        if char == ' ': | ||
10 | +            return char | ||
11 | + | ||
12 | +        cjj = decompose(char) | ||
13 | +        if len(cjj) == 1: | ||
14 | +            return cjj | ||
15 | + | ||
16 | +        cjj_ = ''.join(c if c != ' ' else '-' for c in cjj) | ||
17 | +        return cjj_ | ||
18 | + | ||
19 | +    sent_ = [] | ||
20 | +    for char in sent: | ||
21 | +        if character_is_korean(char): | ||
22 | +            sent_.append(transform(char)) | ||
23 | +        else: | ||
24 | +            sent_.append(char) | ||
25 | +    sent_ = doublespace_pattern.sub(' ', ''.join(sent_)) | ||
26 | +    return sent_ | ||
27 | + | ||
28 | +def jamo_to_word(jamo): | ||
29 | +    jamo_list, idx = [], 0 | ||
30 | + | ||
31 | +    while idx < len(jamo): | ||
32 | +        if not character_is_korean(jamo[idx]): | ||
33 | +            jamo_list.append(jamo[idx]) | ||
34 | +            idx += 1 | ||
35 | +        else: | ||
36 | +            jamo_list.append(jamo[idx:idx + 3]) | ||
37 | +            idx += 3 | ||
38 | + | ||
39 | +    word = "" | ||
40 | +    for jamo_char in jamo_list: | ||
41 | +        if len(jamo_char) == 1: | ||
42 | +            word += jamo_char | ||
43 | +        elif jamo_char[2] == "-": | ||
44 | +            word += compose(jamo_char[0], jamo_char[1], " ") | ||
45 | +        else: word += compose(jamo_char[0], jamo_char[1], jamo_char[2]) | ||
46 | + | ||
47 | +    return word | ||
48 | + | ||
49 | +def break_char(jamo_sentence): | ||
50 | +    idx = 0 | ||
51 | +    corpus = [] | ||
52 | + | ||
53 | +    while idx < len(jamo_sentence): | ||
54 | +        if not character_is_korean(jamo_sentence[idx]): | ||
55 | +            corpus.append(jamo_sentence[idx]) | ||
56 | +            idx += 1 | ||
57 | +        else: | ||
58 | +            corpus.append(jamo_sentence[idx : idx+3]) | ||
59 | +            idx += 3 | ||
60 | +    return corpus | ||
... | \ No newline at end of file | ... | \ No newline at end of file |
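For readers without `soynlp` installed, the jamo decomposition that `jamo_sentence` relies on can be sketched with plain Unicode arithmetic. This is an illustrative stand-in, not the library's implementation: `to_jamo` and `from_jamo_triple` are hypothetical names, and the sketch handles only precomposed Hangul syllables (U+AC00 to U+D7A3), passing every other character through unchanged.

```python
# Compatibility jamo tables in Unicode syllable order.
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
JONG = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"  # 28 finals, index 0 = none
HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3


def to_jamo(text, empty_final="-"):
    """Decompose each Hangul syllable into three symbols (initial, vowel, final)."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            # syllable = BASE + (initial * 21 + vowel) * 28 + final
            cho, rest = divmod(code - HANGUL_BASE, 21 * 28)
            jung, jong = divmod(rest, 28)
            final = JONG[jong] if jong else empty_final
            out.append(CHO[cho] + JUNG[jung] + final)
        else:
            out.append(ch)
    return "".join(out)


def from_jamo_triple(cho, jung, jong="-"):
    """Recompose one syllable from its three symbols ('-' means no final)."""
    jong_index = 0 if jong == "-" else JONG.index(jong)
    return chr(HANGUL_BASE + (CHO.index(cho) * 21 + JUNG.index(jung)) * 28 + jong_index)


decomposed = to_jamo("자연어처리")
recomposed = from_jamo_triple("ㅇ", "ㅕ", "ㄴ")
```

Using a fixed width of three symbols per syllable, with `-` as a placeholder for a missing final consonant, is what lets `jamo_to_word` and `break_char` step through the string three characters at a time.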
train/utils/spacing_utils.py
0 → 100644
1 | +# coding=utf-8 | ||
2 | +# Copyright 2020 Heewon Jeon. All rights reserved. | ||
3 | +# | ||
4 | +# Licensed under the Apache License, Version 2.0 (the "License"); | ||
5 | +# you may not use this file except in compliance with the License. | ||
6 | +# You may obtain a copy of the License at | ||
7 | +# | ||
8 | +# http://www.apache.org/licenses/LICENSE-2.0 | ||
9 | +# | ||
10 | +# Unless required by applicable law or agreed to in writing, software | ||
11 | +# distributed under the License is distributed on an "AS IS" BASIS, | ||
12 | +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
13 | +# See the License for the specific language governing permissions and | ||
14 | +# limitations under the License. | ||
15 | + | ||
16 | +def sent_to_spacing_chars(sent): | ||
17 | +    # spaces are marked with ^ | ||
18 | +    chars = sent.strip().replace(' ', '^') | ||
19 | +    # char_list = [li.strip().replace(' ', '^') for li in sents] | ||
20 | + | ||
21 | +    # start-of-sentence marker « | ||
22 | +    # end-of-sentence marker » | ||
23 | +    tagged_chars = "«" + chars + "»" | ||
24 | +    # char_list = [ "«" + li + "»" for li in char_list] | ||
25 | + | ||
26 | +    # sentence -> space-separated string of characters | ||
27 | +    char_list = ' '.join(list(tagged_chars)) | ||
28 | +    # char_list = [ ' '.join(list(li)) for li in char_list] | ||
29 | +    return char_list |
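As a quick illustration of what `sent_to_spacing_chars` produces, the sketch below restates the function in self-contained form and pairs it with an inverse. The inverse, `spacing_chars_to_sent`, is a hypothetical helper added for illustration only, and it assumes the original sentence contains none of the marker characters `^`, `«`, or `»`.

```python
def sent_to_spacing_chars(sent):
    # Mark spaces with '^', wrap the sentence in « », then separate
    # every character with a single space.
    chars = sent.strip().replace(' ', '^')
    return ' '.join('«' + chars + '»')


def spacing_chars_to_sent(char_string):
    # Hypothetical inverse: drop the separators and the sentence markers,
    # then turn '^' back into spaces.
    chars = char_string.split(' ')
    return ''.join(chars[1:-1]).replace('^', ' ')


encoded = sent_to_spacing_chars("아버지가 방에")
decoded = spacing_chars_to_sent(encoded)
```

Each token of the encoded string, including `^`, `«`, and `»`, becomes one "word" for the embedding model, which is why `create_embeddings` splits the result on a single space.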