
## How the model was trained

We used the pre-trained weights provided by CodeBERT (Feng et al., 2020) as the initial weights.

#### Added model

The Added model can be trained with CodeBERT's official repository. For training data, the cleaned CodeSearchNet corpus was used; see this document for details. Training took about 23 hours with a batch size of 256.

```shell script
cd code2nl

lang=python #programming language
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
data_dir=../data/code2nl/CodeSearchNet
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
eval_steps=1000 #400 for ruby, 600 for javascript, 1000 for others
train_steps=50000 #20000 for ruby, 30000 for javascript, 50000 for others
pretrained_model=microsoft/codebert-base #Roberta: roberta-base

python run.py --do_train --do_eval --model_type roberta \
    --model_name_or_path $pretrained_model \
    --train_filename $train_file \
    --dev_filename $dev_file \
    --output_dir $output_dir \
    --max_source_length $source_length \
    --max_target_length $target_length \
    --beam_size $beam_size \
    --train_batch_size $batch_size \
    --eval_batch_size $batch_size \
    --learning_rate $lr \
    --train_steps $train_steps \
    --eval_steps $eval_steps
```

#### Diff model
To train the Diff model, we have to use [our code](https://github.com/graykode/commit-autosuggestions/blob/master/train.py); a separate implementation is needed because the model must distinguish between added and deleted code.
As for the training data, only the top 100 Python repositories listed in [the document](https://github.com/kaxap/arl/blob/master/README-Python.md) were cloned ([gitcloner.py](https://github.com/graykode/commit-autosuggestions/blob/master/gitcloner.py)), and the commit messages, added code, and deleted code were preprocessed into JSONL format ([gitparser.py](https://github.com/graykode/commit-autosuggestions/blob/master/gitparser.py)). The data we used is available on [Google Drive](https://drive.google.com/drive/folders/1_8lQmzTH95Nc-4MKd1RP3x4BVc8tBA6W?usp=sharing).
As with the Added model, training took about 20 hours with a batch size of 256.
Note that the weights of the Added model were used as the initial weights; be sure to set this with the `load_model_path` argument.
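
Before launching the run below, it can help to sanity-check the preprocessed data. This is only a sketch: the record shown in the comment is a made-up illustration, and the field names are assumptions, so check gitparser.py for the actual schema it writes.

```shell script
# Peek at the first training record (illustrative only; the key names below are
# assumptions -- see gitparser.py for the real schema).
head -n 1 train.jsonl
# e.g. {"msg": "...", "added": ["..."], "deleted": ["..."]}
```

The training run itself: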

```shell script
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
output_dir=model/python
train_file=train.jsonl
dev_file=valid.jsonl

eval_steps=1000
train_steps=50000
saved_model=pytorch_model.bin # this is added model weight

python train.py --do_train --do_eval --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --load_model_path $saved_model \
    --train_filename $train_file \
    --dev_filename $dev_file \
    --output_dir $output_dir \
    --max_source_length $source_length \
    --max_target_length $target_length \
    --beam_size $beam_size \
    --train_batch_size $batch_size \
    --eval_batch_size $batch_size \
    --learning_rate $lr \
    --train_steps $train_steps \
    --eval_steps $eval_steps
```

## How to train for your lint style?

See the Diff model section above for how each of these scripts is used.

#### 1. Cloning repositories from GitHub

This code clones all repositories listed in repositories.txt.

```shell script
usage: gitcloner.py [-h] --repositories REPOSITORIES --repos_dir REPOS_DIR [--num_worker_threads NUM_WORKER_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  --repositories REPOSITORIES
                        repositories file path.
  --repos_dir REPOS_DIR
                        directory that all repositories will be downloaded.
  --num_worker_threads NUM_WORKER_THREADS
                        number of threads in a worker
```

#### 2. Parsing added code, deleted code, and commit messages from cloned repositories
This code preprocesses cloned repositories and divides them into train, valid, and test data.

```shell script
usage: gitparser.py [-h] --repositories REPOSITORIES --repos_dir REPOS_DIR --output_dir OUTPUT_DIR [--tokenizer_name TOKENIZER_NAME] [--num_workers NUM_WORKERS]
                    [--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH]

optional arguments:
  -h, --help            show this help message and exit
  --repositories REPOSITORIES
                        repositories file path.
  --repos_dir REPOS_DIR
                        directory that all repositories had been downloaded.
  --output_dir OUTPUT_DIR
                        The output directory where the preprocessed data will be written.
  --tokenizer_name TOKENIZER_NAME
                        The name of tokenizer
  --num_workers NUM_WORKERS
                        number of process
  --max_source_length MAX_SOURCE_LENGTH
                        The maximum total source sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --max_target_length MAX_TARGET_LENGTH
                        The maximum total target sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
```
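
A hypothetical invocation that continues from step 1. The tokenizer name and sequence lengths below simply mirror the training settings used elsewhere in this document; they are assumptions, not values required by the script:

```shell script
# Preprocess the clones in ./repos into train/valid/test JSONL files under ./data.
python gitparser.py \
    --repositories repositories.txt \
    --repos_dir repos \
    --output_dir data \
    --tokenizer_name microsoft/codebert-base \
    --num_workers 8 \
    --max_source_length 256 \
    --max_target_length 128
```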

If a `UnicodeDecodeError` occurs while using gitparser.py, make sure your GitPython installation includes at least this commit.

#### 3. Training the Added model (optional for the Python language)

An Added model has already been trained for Python, so if you only want to build a Diff model for Python, step 3 can be skipped. For the other languages (JavaScript, Go, Ruby, PHP, and Java), however, Code2NL training is required to produce the initial weights for the model trained in step 4.
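
For example, following the per-language comments in the Code2NL script from the Added model section, a JavaScript run would only change the values below; everything else stays the same. This is a sketch based on those comments, not a separately verified configuration.

```shell script
# Code2NL (Added model) settings for JavaScript, per the comments in the script above.
lang=javascript
eval_steps=600    # 400 for ruby, 1000 for the other languages
train_steps=30000 # 20000 for ruby, 50000 for the other languages
```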

#### 4. Training the Diff model

Train the Diff model for each language, using the weights of its Added model as the initial weights.

```shell script
usage: train.py [-h] --model_type MODEL_TYPE --model_name_or_path MODEL_NAME_OR_PATH --output_dir OUTPUT_DIR [--load_model_path LOAD_MODEL_PATH]
                [--train_filename TRAIN_FILENAME] [--dev_filename DEV_FILENAME] [--test_filename TEST_FILENAME] [--config_name CONFIG_NAME]
                [--tokenizer_name TOKENIZER_NAME] [--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH] [--do_train]
                [--do_eval] [--do_test] [--do_lower_case] [--no_cuda] [--train_batch_size TRAIN_BATCH_SIZE] [--eval_batch_size EVAL_BATCH_SIZE]
                [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS] [--learning_rate LEARNING_RATE] [--beam_size BEAM_SIZE]
                [--weight_decay WEIGHT_DECAY] [--adam_epsilon ADAM_EPSILON] [--max_grad_norm MAX_GRAD_NORM] [--num_train_epochs NUM_TRAIN_EPOCHS]
                [--max_steps MAX_STEPS] [--eval_steps EVAL_STEPS] [--train_steps TRAIN_STEPS] [--warmup_steps WARMUP_STEPS] [--local_rank LOCAL_RANK]
                [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  --model_type MODEL_TYPE
                        Model type: e.g. roberta
  --model_name_or_path MODEL_NAME_OR_PATH
                        Path to pre-trained model: e.g. roberta-base
  --output_dir OUTPUT_DIR
                        The output directory where the model predictions and checkpoints will be written.
  --load_model_path LOAD_MODEL_PATH
                        Path to trained model: Should contain the .bin files
  --train_filename TRAIN_FILENAME
                        The train filename. Should contain the .jsonl files for this task.
  --dev_filename DEV_FILENAME
                        The dev filename. Should contain the .jsonl files for this task.
  --test_filename TEST_FILENAME
                        The test filename. Should contain the .jsonl files for this task.
  --config_name CONFIG_NAME
                        Pretrained config name or path if not the same as model_name
  --tokenizer_name TOKENIZER_NAME
                        Pretrained tokenizer name or path if not the same as model_name
  --max_source_length MAX_SOURCE_LENGTH
                        The maximum total source sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --max_target_length MAX_TARGET_LENGTH
                        The maximum total target sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
  --do_train            Whether to run training.
  --do_eval             Whether to run eval on the dev set.
  --do_test             Whether to run eval on the dev set.
  --do_lower_case       Set this flag if you are using an uncased model.
  --no_cuda             Avoid using CUDA when available
  --train_batch_size TRAIN_BATCH_SIZE
                        Batch size per GPU/CPU for training.
  --eval_batch_size EVAL_BATCH_SIZE
                        Batch size per GPU/CPU for evaluation.
  --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                        Number of updates steps to accumulate before performing a backward/update pass.
  --learning_rate LEARNING_RATE
                        The initial learning rate for Adam.
  --beam_size BEAM_SIZE
                        beam size for beam search
  --weight_decay WEIGHT_DECAY
                        Weight decay if we apply some.
  --adam_epsilon ADAM_EPSILON
                        Epsilon for Adam optimizer.
  --max_grad_norm MAX_GRAD_NORM
                        Max gradient norm.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Total number of training epochs to perform.
  --max_steps MAX_STEPS
                        If > 0: set total number of training steps to perform. Override num_train_epochs.
  --eval_steps EVAL_STEPS
  --train_steps TRAIN_STEPS
  --warmup_steps WARMUP_STEPS
                        Linear warmup over warmup_steps.
  --local_rank LOCAL_RANK
                        For distributed training: local_rank
  --seed SEED           random seed for initialization
```
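
Once training finishes, the same script can also be pointed at the test split. The sketch below uses only the flags listed above; the checkpoint path is an assumption about where your run saved its weights, so adjust it to your actual output directory.

```shell script
# Hypothetical test-time evaluation of a trained Diff model checkpoint.
python train.py --do_test --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --load_model_path model/python/pytorch_model.bin \
    --test_filename test.jsonl \
    --output_dir model/python \
    --max_source_length 256 \
    --max_target_length 128 \
    --beam_size 10 \
    --eval_batch_size 64
```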