We used the pre-trained weights provided by CodeBERT (Feng et al., 2020) as the initial weights.
#### Added model
To train the added model, use [CodeBERT's official repository](https://github.com/microsoft/CodeBERT). The cleaned CodeSearchNet dataset was used as the training data; see [this document](https://github.com/microsoft/CodeBERT#fine-tune-1) for details. Training took about 23 hours with a batch size of 256.
```shell script
cd code2nl
lang=python #programming language
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
data_dir=../data/code2nl/CodeSearchNet
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
eval_steps=1000 #400 for ruby, 600 for javascript, 1000 for others
train_steps=50000 #20000 for ruby, 30000 for javascript, 50000 for others
pretrained_model=microsoft/codebert-base

python run.py --do_train --do_eval --model_type roberta \
--model_name_or_path $pretrained_model \
--train_filename $train_file --dev_filename $dev_file \
--output_dir $output_dir \
--max_source_length $source_length --max_target_length $target_length \
--beam_size $beam_size --train_batch_size $batch_size \
--eval_batch_size $batch_size --learning_rate $lr \
--train_steps $train_steps --eval_steps $eval_steps
```
#### Diff model
To train the Diff model, use [our code](https://github.com/graykode/commit-autosuggestions/blob/master/train.py). Unlike the added model, it needs an implementation that distinguishes between added and deleted lines.
As for the training data, only the top 100 Python repositories in [this document](https://github.com/kaxap/arl/blob/master/README-Python.md) were cloned ([gitcloner.py](https://github.com/graykode/commit-autosuggestions/blob/master/gitcloner.py)), and the commit messages, added lines, and deleted lines were preprocessed into JSONL format ([gitparser.py](https://github.com/graykode/commit-autosuggestions/blob/master/gitparser.py)). The data we used is available on [Google Drive](https://drive.google.com/drive/folders/1_8lQmzTH95Nc-4MKd1RP3x4BVc8tBA6W?usp=sharing).
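To get a sense of the data, you can peek at the first record of a preprocessed file. This is a sketch only: the field names shown in the sample output are an assumption for illustration, so check the files on the Google Drive link for the actual schema.
```shell script
# Inspect one preprocessed training example (field names in the sample
# output below are assumed, not confirmed by the repository).
head -n 1 train.jsonl
# {"msg": "Fix typo in log message", "added": ["logger.info('done')"], "deleted": ["logger.info('doen')"]}
```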
Like the added model, training took about 20 hours with a batch size of 256.
Note that the weights of the added model were used as the initial weights. Be sure to set this with the `load_model_path` argument.
```shell script
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
output_dir=model/python
train_file=train.jsonl
dev_file=valid.jsonl
eval_steps=1000
train_steps=50000
saved_model=pytorch_model.bin # weights of the trained added model, used as the initial weights
python train.py --do_train --do_eval --model_type roberta \
--model_name_or_path microsoft/codebert-base \
--load_model_path $saved_model \
--train_filename $train_file \
--dev_filename $dev_file \
--output_dir $output_dir \
--max_source_length $source_length \
--max_target_length $target_length \
--beam_size $beam_size \
--train_batch_size $batch_size \
--eval_batch_size $batch_size \
--learning_rate $lr \
--train_steps $train_steps \
--eval_steps $eval_steps
```
## How to train for your own commit style?
See the [Diff model](#diff-model) section above for the role of the code.
#### 1. Cloning repositories from GitHub
This code clones all of the repositories listed in [repositories.txt](https://github.com/graykode/commit-autosuggestions/blob/master/repositories.txt).
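A minimal invocation sketch follows. The flag names are assumptions in the style of the gitparser.py arguments below; run `python gitcloner.py -h` for the authoritative argument list.
```shell script
# Hypothetical flags; verify with `python gitcloner.py -h` before running.
python gitcloner.py \
--repositories repositories.txt \
--repos_dir repos
```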
#### 2. Parsing commit messages and diffs from the cloned repositories
This code preprocesses the commit message, added lines, and deleted lines of each cloned repository into JSONL format. It takes the following arguments:
```shell script
  --repos_dir REPOS_DIR
                        directory that all repositories had been downloaded.
  --output_dir OUTPUT_DIR
                        The output directory where the preprocessed data will
                        be written.
  --tokenizer_name TOKENIZER_NAME
                        The name of the tokenizer
  --num_workers NUM_WORKERS
                        number of processes
  --max_source_length MAX_SOURCE_LENGTH
                        The maximum total source sequence length after
                        tokenization. Sequences longer than this will be
                        truncated, sequences shorter will be padded.
  --max_target_length MAX_TARGET_LENGTH
                        The maximum total target sequence length after
                        tokenization. Sequences longer than this will be
                        truncated, sequences shorter will be padded.
```
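As an example, a preprocessing run over the cloned repositories might look like the sketch below. The directory names, tokenizer name, and worker count are placeholders, while the flags are the ones documented above.
```shell script
# Paths and values are placeholders; flags are the documented gitparser.py arguments.
python gitparser.py \
--repos_dir repos \
--output_dir data \
--tokenizer_name microsoft/codebert-base \
--num_workers 8 \
--max_source_length 256 \
--max_target_length 128
```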
#### 3. Training the Added model (optional for the Python language).
An Added model has already been trained for Python, so if you only need a Diff model for Python, step 3 can be skipped. For the other languages (JavaScript, Go, Ruby, PHP and Java), however, [Code2NL training](https://github.com/microsoft/CodeBERT#fine-tune-1) is required to produce the initial weights for the model trained in step 4, as sketched below.
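For instance, here is a sketch of the variable changes for JavaScript, following the per-language comments in the Added model block above; everything else, including the `run.py` command, stays the same.
```shell script
lang=javascript #programming language
eval_steps=600 #400 for ruby, 600 for javascript, 1000 for others
train_steps=30000 #20000 for ruby, 30000 for javascript, 50000 for others
# the remaining variables and the run.py command are identical to the
# Added model block above, with output_dir=model/$lang
```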
#### 4. Training the Diff model.
Train the Diff model for each language, using that language's Added model weights as the initial weights (set with the `load_model_path` argument, as in the sketch below).
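For a non-Python language, the command is the same as the Diff model block above; only `load_model_path` changes, pointing at the Added model checkpoint produced in step 3. The checkpoint path below is a placeholder, so adjust it to wherever your step-3 run saved its weights.
```shell script
# Placeholder path; point saved_model at your step-3 Added model checkpoint.
saved_model=model/javascript/pytorch_model.bin
python train.py --do_train --do_eval --model_type roberta \
--model_name_or_path microsoft/codebert-base \
--load_model_path $saved_model \
--train_filename train.jsonl --dev_filename valid.jsonl \
--output_dir model/javascript \
--max_source_length 256 --max_target_length 128 \
--beam_size 10 --train_batch_size 64 --eval_batch_size 64 \
--learning_rate 5e-5 --train_steps 50000 --eval_steps 1000
```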