graykode
Committed by GitHub

Merge pull request #1 from graykode/0.1.0

JavaScript Language is supported!!
......@@ -2,12 +2,15 @@ language: python
python:
- "3.6"
env:
- LANGUAGE="py"
services:
- docker
before_install:
- docker pull graykode/commit-autosuggestions
- docker run -it -d -p 5000:5000 --restart always graykode/commit-autosuggestions
- docker pull graykode/commit-autosuggestions:${LANGUAGE}
- docker run -it -d -p 5000:5000 --restart always graykode/commit-autosuggestions:${LANGUAGE}
# command to install dependencies
install:
......
......@@ -46,20 +46,18 @@ Recommended Commit Message : Remove unused imports
To solve this problem, we use a new embedding called [`patch_type_embeddings`](https://github.com/graykode/commit-autosuggestions/blob/master/commit/model/diff_roberta.py#L40) that can distinguish added and deleted code, just as XLM (Lample et al., 2019) used language embeddings (1 for added, 2 for deleted).
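Below is a rough sketch of the idea, not the exact `diff_roberta.py` code; the sizes and the 0 = context convention are illustrative assumptions. A small embedding table over patch types is summed with the token embeddings, the same way XLM sums language embeddings:
```python
import torch
import torch.nn as nn

class PatchTypeEmbedding(nn.Module):
    """Illustrative sketch: inject a patch-type signal into token embeddings.

    Assumed patch_ids convention: 0 = context, 1 = added, 2 = deleted.
    """
    def __init__(self, vocab_size=50265, hidden_size=768, num_patch_types=3):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.patch_type_embeddings = nn.Embedding(num_patch_types, hidden_size)

    def forward(self, input_ids, patch_ids):
        # Sum token and patch-type embeddings before the encoder,
        # analogous to XLM's language embeddings.
        return self.word_embeddings(input_ids) + self.patch_type_embeddings(patch_ids)

# Toy usage: one added token followed by one deleted token.
emb = PatchTypeEmbedding()
print(emb(torch.tensor([[10, 20]]), torch.tensor([[1, 2]])).shape)  # -> [1, 2, 768]
```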
### Language support
| Language | Added | Diff |
| :------------- | :---: | :---:|
| Python | ✅ | ✅ |
| JavaScript | ⬜ | ⬜ |
| Go | ⬜ | ⬜ |
| JAVA | ⬜ | ⬜ |
| Ruby | ⬜ | ⬜ |
| PHP | ⬜ | ⬜ |
| Language | Added | Diff | Data (Diff only) | Weights |
| :------------- | :---: | :---:| :---: | :---:|
| Python | ✅ | ✅ | [423k](https://drive.google.com/drive/folders/1_8lQmzTH95Nc-4MKd1RP3x4BVc8tBA6W?usp=sharing) | [Link](https://drive.google.com/drive/folders/1OwM7_FiLiwVJAhAanBPWtPw3Hz3Dszbh?usp=sharing) |
| JavaScript | ✅ | ✅ | [514k](https://drive.google.com/drive/folders/1-Hv0VZWSAGqs-ewNT6NhLKEqDH2oa1az?usp=sharing) | [Link](https://drive.google.com/drive/folders/1Jw8vXfxUXsfElga_Gi6e7Uhfc_HlmOuD?usp=sharing) |
| Go | ⬜ | ⬜ | ⬜ | ⬜ |
| JAVA | ⬜ | ⬜ | ⬜ | ⬜ |
| Ruby | ⬜ | ⬜ | ⬜ | ⬜ |
| PHP | ⬜ | ⬜ | ⬜ | ⬜ |
* ✅ — Supported
* 🔶 — Partial support
* 🚧 — Under development
* ⬜ — N/A
We plan to slowly conquer the languages that are not currently supported. However, training models for the languages above requires expensive AWS or GCP GPU instances, so please consider sponsoring this work!
We plan to slowly conquer the languages that are not currently supported. However, training models for the languages above requires expensive AWS or GCP GPU instances, so please consider sponsoring this work! Additional data comes from the [CodeSearchNet dataset](https://drive.google.com/uc?id=1rd2Tc6oUWBo7JouwexW3ksQ0PaOhUr6h).
### Quick Start
To run this project, you need a flask-based inference server (GPU) and a client (the commit module). If you don't have a GPU, don't worry: you can run the server through Google Colab.
......@@ -68,9 +66,18 @@ To run this project, you need a flask-based inference server (GPU) and a client
Prepare Docker and Nvidia-docker before running the server.
##### 1-a. If you have a GPU machine.
Serve the flask server with Nvidia Docker
Serve the flask server with Nvidia Docker. Check the Docker tag for your programming language [here](https://hub.docker.com/repository/registry-1.docker.io/graykode/commit-autosuggestions/tags).
| Language | Tag |
| :------------- | :---: |
| Python | py |
| JavaScript | js |
| Go | go |
| JAVA | java |
| Ruby | ruby |
| PHP | php |
```shell script
$ docker run -it --gpus 0 -p 5000:5000 commit-autosuggestions:0.1-gpu
$ docker run -it -d --gpus 0 -p 5000:5000 graykode/commit-autosuggestions:{language}
```
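Once a server is running, you can sanity-check it with a small client. This is a minimal sketch assuming the server listens on localhost:5000 and uses the JSON shape of the flask routes in this repository; the real `commit` client performs additional tokenization (via the `/tokenizer` route) that is omitted here.
```python
import requests

# Hypothetical smoke test against a running commit-autosuggestions server.
# /added takes the added/deleted lines of a diff and returns
# candidate commit messages, e.g. {'idx': 0, 'message': ['...']}.
payload = {
    'idx': 0,
    'added': ['def add(a, b):', '    return a + b'],
    'deleted': [],
}
response = requests.post('http://localhost:5000/added', json=payload)
print(response.json())
```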
##### 1-b. If you don't have a GPU machine.
......
......@@ -146,7 +146,7 @@ def main(args):
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="")
parser.add_argument("--load_model_path", default='weight', type=str,
parser.add_argument("--load_model_path", type=str, required=True,
help="Path to trained model: Should contain the .bin files")
parser.add_argument("--model_type", default='roberta', type=str,
......
# Change Log
version: v0.1.0
## Changes
### Bug Fixes
- Modify the weight path in the Dockerfile.
### New Features
- JavaScript Language Support.
- Detach multiple settings (Unittest, Dockerfile) for Language support.
### New Examples
\ No newline at end of file
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "commit-autosuggestions.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "commit-autosuggestions.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "DZ7rFp2gzuNS"
},
"source": [
"## Start commit-autosuggestions server\n",
"Running flask app server in Google Colab for people without GPU"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "d8Lyin2I3wHq"
},
"source": [
"#### Clone github repository"
]
},
{
"cell_type": "code",
"metadata": {
"id": "e_cu9igvzjcs"
},
"source": [
"!git clone https://github.com/graykode/commit-autosuggestions.git\n",
"%cd commit-autosuggestions\n",
"!pip install -r requirements.txt"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "PFKn5QZr0dQx"
},
"source": [
"#### Download model weights\n",
"\n",
"Download the two weights of model from the google drive through the gdown module.\n",
"1. [Added model](https://drive.google.com/uc?id=1YrkwfM-0VBCJaa9NYaXUQPODdGPsmQY4) : A model trained Code2NL on Python using pre-trained CodeBERT (Feng at al, 2020).\n",
"2. [Diff model](https://drive.google.com/uc?id=1--gcVVix92_Fp75A-mWH0pJS0ahlni5m) : A model retrained by initializing with the weight of model (1), adding embedding of the added and deleted parts(`patch_ids_embedding`) of the code."
]
},
{
"cell_type": "code",
"metadata": {
"id": "P9-EBpxt0Dp0"
},
"source": [
"!pip install gdown \\\n",
" && gdown \"https://drive.google.com/uc?id=1YrkwfM-0VBCJaa9NYaXUQPODdGPsmQY4\" -O weight/added/pytorch_model.bin \\\n",
" && gdown \"https://drive.google.com/uc?id=1--gcVVix92_Fp75A-mWH0pJS0ahlni5m\" -O weight/diff/pytorch_model.bin"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "org4Gqdv3iUu"
},
"source": [
"#### ngrok setting with flask\n",
"\n",
"Before starting the server, you need to configure ngrok to open this notebook to the outside. I have referred [this jupyter notebook](https://github.com/alievk/avatarify/blob/master/avatarify.ipynb) in detail."
]
},
{
"cell_type": "code",
"metadata": {
"id": "lZA3kuuG1Crj"
},
"source": [
"!pip install flask-ngrok"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "hR78FRCMcqrZ"
},
"source": [
"Go to https://dashboard.ngrok.com/auth/your-authtoken (sign up if required), copy your authtoken and put it below.\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "L_mInbOKcoc2"
},
"source": [
"# Paste your authtoken here in quotes\n",
"authtoken = \"21KfrFEW1BptdPPM4SS_7s1Z4HwozyXX9NP2fHC12\""
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "QwCN4YFUc0M8"
},
"source": [
"Set your region\n",
"\n",
"Code | Region\n",
"--- | ---\n",
"us | United States\n",
"eu | Europe\n",
"ap | Asia/Pacific\n",
"au | Australia\n",
"sa | South America\n",
"jp | Japan\n",
"in | India"
]
},
{
"cell_type": "code",
"metadata": {
"id": "p4LSNN2xc0dQ"
},
"source": [
"# Set your region here in quotes\n",
"region = \"jp\"\n",
"\n",
"# Input and output ports for communication\n",
"local_in_port = 5000\n",
"local_out_port = 5000"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "kg56PVrOdhi1"
},
"source": [
"config =\\\n",
"f\"\"\"\n",
"authtoken: {authtoken}\n",
"region: {region}\n",
"console_ui: False\n",
"tunnels:\n",
" input:\n",
" addr: {local_in_port}\n",
" proto: http \n",
" output:\n",
" addr: {local_out_port}\n",
" proto: http\n",
"\"\"\"\n",
"\n",
"with open('ngrok.conf', 'w') as f:\n",
" f.write(config)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "hrWDrw_YdjIy"
},
"source": [
"import time\n",
"from subprocess import Popen, PIPE\n",
"\n",
"# (Re)Open tunnel\n",
"ps = Popen('./scripts/open_tunnel_ngrok.sh', stdout=PIPE, stderr=PIPE)\n",
"time.sleep(3)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "pJgdFr0Fdjoq",
"outputId": "3948f70b-d4f3-4ed8-a864-fe5c6df50809",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"source": [
"# Get tunnel addresses\n",
"try:\n",
" in_addr, out_addr = get_tunnel_adresses()\n",
" print(\"Tunnel opened\")\n",
"except Exception as e:\n",
" [print(l.decode(), end='') for l in ps.stdout.readlines()]\n",
" print(\"Something went wrong, reopen the tunnel\")"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Opening tunnel\n",
"Something went wrong, reopen the tunnel\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cEZ-O0wz74OJ"
},
"source": [
"#### Run you server!"
]
},
{
"cell_type": "code",
"metadata": {
"id": "7PRkeYTL8Y_6"
},
"source": [
"import os\n",
"import torch\n",
"import argparse\n",
"from tqdm import tqdm\n",
"import torch.nn as nn\n",
"from torch.utils.data import TensorDataset, DataLoader, SequentialSampler\n",
"from transformers import (RobertaConfig, RobertaTokenizer)\n",
"\n",
"from commit.model import Seq2Seq\n",
"from commit.utils import (Example, convert_examples_to_features)\n",
"from commit.model.diff_roberta import RobertaModel\n",
"\n",
"from flask import Flask, jsonify, request\n",
"\n",
"MODEL_CLASSES = {'roberta': (RobertaConfig, RobertaModel, RobertaTokenizer)}"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "CiJKucX17qb4"
},
"source": [
"def get_model(model_class, config, tokenizer, mode):\n",
" encoder = model_class(config=config)\n",
" decoder_layer = nn.TransformerDecoderLayer(\n",
" d_model=config.hidden_size, nhead=config.num_attention_heads\n",
" )\n",
" decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)\n",
" model = Seq2Seq(encoder=encoder, decoder=decoder, config=config,\n",
" beam_size=args.beam_size, max_length=args.max_target_length,\n",
" sos_id=tokenizer.cls_token_id, eos_id=tokenizer.sep_token_id)\n",
"\n",
" assert args.load_model_path\n",
" assert os.path.exists(os.path.join(args.load_model_path, mode, 'pytorch_model.bin'))\n",
"\n",
" model.load_state_dict(\n",
" torch.load(\n",
" os.path.join(args.load_model_path, mode, 'pytorch_model.bin'),\n",
" map_location=torch.device('cpu')\n",
" ),\n",
" strict=False\n",
" )\n",
" return model\n",
"\n",
"def get_features(examples):\n",
" features = convert_examples_to_features(examples, args.tokenizer, args, stage='test')\n",
" all_source_ids = torch.tensor(\n",
" [f.source_ids[:args.max_source_length] for f in features], dtype=torch.long\n",
" )\n",
" all_source_mask = torch.tensor(\n",
" [f.source_mask[:args.max_source_length] for f in features], dtype=torch.long\n",
" )\n",
" all_patch_ids = torch.tensor(\n",
" [f.patch_ids[:args.max_source_length] for f in features], dtype=torch.long\n",
" )\n",
" return TensorDataset(all_source_ids, all_source_mask, all_patch_ids)\n",
"\n",
"def create_app():\n",
" @app.route('/')\n",
" def index():\n",
" return jsonify(hello=\"world\")\n",
"\n",
" @app.route('/added', methods=['POST'])\n",
" def added():\n",
" if request.method == 'POST':\n",
" payload = request.get_json()\n",
" example = [\n",
" Example(\n",
" idx=payload['idx'],\n",
" added=payload['added'],\n",
" deleted=payload['deleted'],\n",
" target=None\n",
" )\n",
" ]\n",
" message = inference(model=args.added_model, data=get_features(example))\n",
" return jsonify(idx=payload['idx'], message=message)\n",
"\n",
" @app.route('/diff', methods=['POST'])\n",
" def diff():\n",
" if request.method == 'POST':\n",
" payload = request.get_json()\n",
" example = [\n",
" Example(\n",
" idx=payload['idx'],\n",
" added=payload['added'],\n",
" deleted=payload['deleted'],\n",
" target=None\n",
" )\n",
" ]\n",
" message = inference(model=args.diff_model, data=get_features(example))\n",
" return jsonify(idx=payload['idx'], message=message)\n",
"\n",
" @app.route('/tokenizer', methods=['POST'])\n",
" def tokenizer():\n",
" if request.method == 'POST':\n",
" payload = request.get_json()\n",
" tokens = args.tokenizer.tokenize(payload['code'])\n",
" return jsonify(tokens=tokens)\n",
"\n",
" return app\n",
"\n",
"def inference(model, data):\n",
" # Calculate bleu\n",
" eval_sampler = SequentialSampler(data)\n",
" eval_dataloader = DataLoader(data, sampler=eval_sampler, batch_size=len(data))\n",
"\n",
" model.eval()\n",
" p=[]\n",
" for batch in tqdm(eval_dataloader, total=len(eval_dataloader)):\n",
" batch = tuple(t.to(args.device) for t in batch)\n",
" source_ids, source_mask, patch_ids = batch\n",
" with torch.no_grad():\n",
" preds = model(source_ids=source_ids, source_mask=source_mask, patch_ids=patch_ids)\n",
" for pred in preds:\n",
" t = pred[0].cpu().numpy()\n",
" t = list(t)\n",
" if 0 in t:\n",
" t = t[:t.index(0)]\n",
" text = args.tokenizer.decode(t, clean_up_tokenization_spaces=False)\n",
" p.append(text)\n",
" return p"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Esf4r-Ai8cG3"
},
"source": [
"**Set enviroment**"
]
},
{
"cell_type": "code",
"metadata": {
"id": "mR7gVmSoSUoy"
},
"source": [
"import easydict \n",
"\n",
"args = easydict.EasyDict({\n",
" 'load_model_path': 'weight/', \n",
" 'model_type': 'roberta',\n",
" 'config_name' : 'microsoft/codebert-base',\n",
" 'tokenizer_name' : 'microsoft/codebert-base',\n",
" 'max_source_length' : 512,\n",
" 'max_target_length' : 128,\n",
" 'beam_size' : 10,\n",
" 'do_lower_case' : False,\n",
" 'device' : torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"})"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "e8dk5RwvToOv"
},
"source": [
"# flask_ngrok_example.py\n",
"from flask_ngrok import run_with_ngrok\n",
"\n",
"app = Flask(__name__)\n",
"run_with_ngrok(app) # Start ngrok when app is run\n",
"\n",
"config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]\n",
"config = config_class.from_pretrained(args.config_name)\n",
"args.tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, do_lower_case=args.do_lower_case)\n",
"\n",
"# budild model\n",
"args.added_model =get_model(model_class=model_class, config=config,\n",
" tokenizer=args.tokenizer, mode='added').to(args.device)\n",
"args.diff_model = get_model(model_class=model_class, config=config,\n",
" tokenizer=args.tokenizer, mode='diff').to(args.device)\n",
"\n",
"app = create_app()\n",
"app.run()"
],
"execution_count": null,
"outputs": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "DZ7rFp2gzuNS"
},
"source": [
"## Start commit-autosuggestions server\n",
"Running flask app server in Google Colab for people without GPU"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "d8Lyin2I3wHq"
},
"source": [
"#### Clone github repository"
]
},
{
"cell_type": "code",
"metadata": {
"id": "e_cu9igvzjcs"
},
"source": [
"!git clone https://github.com/graykode/commit-autosuggestions.git\n",
"%cd commit-autosuggestions\n",
"!pip install -r requirements.txt"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "PFKn5QZr0dQx"
},
"source": [
"#### Download model weights\n",
"\n",
"Download the two weights of model from the google drive through the gdown module.\n",
"1. Added model : A model trained Code2NL on Python using pre-trained CodeBERT (Feng at al, 2020).\n",
"2. Diff model : A model retrained by initializing with the weight of model (1), adding embedding of the added and deleted parts(`patch_ids_embedding`) of the code.\n",
"\n",
"Download pre-trained weight\n",
"\n",
"Language | Added | Diff\n",
"--- | --- | ---\n",
"python | 1YrkwfM-0VBCJaa9NYaXUQPODdGPsmQY4 | 1--gcVVix92_Fp75A-mWH0pJS0ahlni5m\n",
"javascript | 1-F68ymKxZ-htCzQ8_Y9iHexs2SJmP5Gc | 1-39rmu-3clwebNURMQGMt-oM4HsAkbsf"
]
},
{
"cell_type": "code",
"metadata": {
"id": "P9-EBpxt0Dp0"
},
"source": [
"ADD_MODEL='1YrkwfM-0VBCJaa9NYaXUQPODdGPsmQY4'\n",
"DIFF_MODEL='1--gcVVix92_Fp75A-mWH0pJS0ahlni5m'\n",
"\n",
"!pip install gdown \\\n",
" && gdown \"https://drive.google.com/uc?id=$ADD_MODEL\" -O weight/added/pytorch_model.bin \\\n",
" && gdown \"https://drive.google.com/uc?id=$DIFF_MODEL\" -O weight/diff/pytorch_model.bin"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "org4Gqdv3iUu"
},
"source": [
"#### ngrok setting with flask\n",
"\n",
"Before starting the server, you need to configure ngrok to open this notebook to the outside. I have referred [this jupyter notebook](https://github.com/alievk/avatarify/blob/master/avatarify.ipynb) in detail."
]
},
{
"cell_type": "code",
"metadata": {
"id": "lZA3kuuG1Crj"
},
"source": [
"!pip install flask-ngrok"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "hR78FRCMcqrZ"
},
"source": [
"Go to https://dashboard.ngrok.com/auth/your-authtoken (sign up if required), copy your authtoken and put it below.\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "L_mInbOKcoc2"
},
"source": [
"# Paste your authtoken here in quotes\n",
"authtoken = \"21KfrFEW1BptdPPM4SS_7s1Z4HwozyXX9NP2fHC12\""
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "QwCN4YFUc0M8"
},
"source": [
"Set your region\n",
"\n",
"Code | Region\n",
"--- | ---\n",
"us | United States\n",
"eu | Europe\n",
"ap | Asia/Pacific\n",
"au | Australia\n",
"sa | South America\n",
"jp | Japan\n",
"in | India"
]
},
{
"cell_type": "code",
"metadata": {
"id": "p4LSNN2xc0dQ"
},
"source": [
"# Set your region here in quotes\n",
"region = \"jp\"\n",
"\n",
"# Input and output ports for communication\n",
"local_in_port = 5000\n",
"local_out_port = 5000"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "kg56PVrOdhi1"
},
"source": [
"config =\\\n",
"f\"\"\"\n",
"authtoken: {authtoken}\n",
"region: {region}\n",
"console_ui: False\n",
"tunnels:\n",
" input:\n",
" addr: {local_in_port}\n",
" proto: http \n",
" output:\n",
" addr: {local_out_port}\n",
" proto: http\n",
"\"\"\"\n",
"\n",
"with open('ngrok.conf', 'w') as f:\n",
" f.write(config)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "hrWDrw_YdjIy"
},
"source": [
"import time\n",
"from subprocess import Popen, PIPE\n",
"\n",
"# (Re)Open tunnel\n",
"ps = Popen('./scripts/open_tunnel_ngrok.sh', stdout=PIPE, stderr=PIPE)\n",
"time.sleep(3)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "pJgdFr0Fdjoq",
"outputId": "3948f70b-d4f3-4ed8-a864-fe5c6df50809",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"source": [
"# Get tunnel addresses\n",
"try:\n",
" in_addr, out_addr = get_tunnel_adresses()\n",
" print(\"Tunnel opened\")\n",
"except Exception as e:\n",
" [print(l.decode(), end='') for l in ps.stdout.readlines()]\n",
" print(\"Something went wrong, reopen the tunnel\")"
],
"execution_count": null,
"outputs": [
{
"cell_type": "markdown",
"metadata": {
"id": "DXkBcO_sU_VN"
},
"source": [
"## Set commit configure\n",
"Now, set commit configure on your local computer.\n",
"```shell\n",
"$ commit configure --endpoint http://********.ngrok.io\n",
"```"
]
"output_type": "stream",
"text": [
"Opening tunnel\n",
"Something went wrong, reopen the tunnel\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cEZ-O0wz74OJ"
},
"source": [
"#### Run you server!"
]
},
{
"cell_type": "code",
"metadata": {
"id": "7PRkeYTL8Y_6"
},
"source": [
"import os\n",
"import torch\n",
"import argparse\n",
"from tqdm import tqdm\n",
"import torch.nn as nn\n",
"from torch.utils.data import TensorDataset, DataLoader, SequentialSampler\n",
"from transformers import (RobertaConfig, RobertaTokenizer)\n",
"\n",
"from commit.model import Seq2Seq\n",
"from commit.utils import (Example, convert_examples_to_features)\n",
"from commit.model.diff_roberta import RobertaModel\n",
"\n",
"from flask import Flask, jsonify, request\n",
"\n",
"MODEL_CLASSES = {'roberta': (RobertaConfig, RobertaModel, RobertaTokenizer)}"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "CiJKucX17qb4"
},
"source": [
"def get_model(model_class, config, tokenizer, mode):\n",
" encoder = model_class(config=config)\n",
" decoder_layer = nn.TransformerDecoderLayer(\n",
" d_model=config.hidden_size, nhead=config.num_attention_heads\n",
" )\n",
" decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)\n",
" model = Seq2Seq(encoder=encoder, decoder=decoder, config=config,\n",
" beam_size=args.beam_size, max_length=args.max_target_length,\n",
" sos_id=tokenizer.cls_token_id, eos_id=tokenizer.sep_token_id)\n",
"\n",
" assert args.load_model_path\n",
" assert os.path.exists(os.path.join(args.load_model_path, mode, 'pytorch_model.bin'))\n",
"\n",
" model.load_state_dict(\n",
" torch.load(\n",
" os.path.join(args.load_model_path, mode, 'pytorch_model.bin'),\n",
" map_location=torch.device('cpu')\n",
" ),\n",
" strict=False\n",
" )\n",
" return model\n",
"\n",
"def get_features(examples):\n",
" features = convert_examples_to_features(examples, args.tokenizer, args, stage='test')\n",
" all_source_ids = torch.tensor(\n",
" [f.source_ids[:args.max_source_length] for f in features], dtype=torch.long\n",
" )\n",
" all_source_mask = torch.tensor(\n",
" [f.source_mask[:args.max_source_length] for f in features], dtype=torch.long\n",
" )\n",
" all_patch_ids = torch.tensor(\n",
" [f.patch_ids[:args.max_source_length] for f in features], dtype=torch.long\n",
" )\n",
" return TensorDataset(all_source_ids, all_source_mask, all_patch_ids)\n",
"\n",
"def create_app():\n",
" @app.route('/')\n",
" def index():\n",
" return jsonify(hello=\"world\")\n",
"\n",
" @app.route('/added', methods=['POST'])\n",
" def added():\n",
" if request.method == 'POST':\n",
" payload = request.get_json()\n",
" example = [\n",
" Example(\n",
" idx=payload['idx'],\n",
" added=payload['added'],\n",
" deleted=payload['deleted'],\n",
" target=None\n",
" )\n",
" ]\n",
" message = inference(model=args.added_model, data=get_features(example))\n",
" return jsonify(idx=payload['idx'], message=message)\n",
"\n",
" @app.route('/diff', methods=['POST'])\n",
" def diff():\n",
" if request.method == 'POST':\n",
" payload = request.get_json()\n",
" example = [\n",
" Example(\n",
" idx=payload['idx'],\n",
" added=payload['added'],\n",
" deleted=payload['deleted'],\n",
" target=None\n",
" )\n",
" ]\n",
" message = inference(model=args.diff_model, data=get_features(example))\n",
" return jsonify(idx=payload['idx'], message=message)\n",
"\n",
" @app.route('/tokenizer', methods=['POST'])\n",
" def tokenizer():\n",
" if request.method == 'POST':\n",
" payload = request.get_json()\n",
" tokens = args.tokenizer.tokenize(payload['code'])\n",
" return jsonify(tokens=tokens)\n",
"\n",
" return app\n",
"\n",
"def inference(model, data):\n",
" # Calculate bleu\n",
" eval_sampler = SequentialSampler(data)\n",
" eval_dataloader = DataLoader(data, sampler=eval_sampler, batch_size=len(data))\n",
"\n",
" model.eval()\n",
" p=[]\n",
" for batch in tqdm(eval_dataloader, total=len(eval_dataloader)):\n",
" batch = tuple(t.to(args.device) for t in batch)\n",
" source_ids, source_mask, patch_ids = batch\n",
" with torch.no_grad():\n",
" preds = model(source_ids=source_ids, source_mask=source_mask, patch_ids=patch_ids)\n",
" for pred in preds:\n",
" t = pred[0].cpu().numpy()\n",
" t = list(t)\n",
" if 0 in t:\n",
" t = t[:t.index(0)]\n",
" text = args.tokenizer.decode(t, clean_up_tokenization_spaces=False)\n",
" p.append(text)\n",
" return p"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Esf4r-Ai8cG3"
},
"source": [
"**Set enviroment**"
]
},
{
"cell_type": "code",
"metadata": {
"id": "mR7gVmSoSUoy"
},
"source": [
"import easydict \n",
"\n",
"args = easydict.EasyDict({\n",
" 'load_model_path': 'weight/', \n",
" 'model_type': 'roberta',\n",
" 'config_name' : 'microsoft/codebert-base',\n",
" 'tokenizer_name' : 'microsoft/codebert-base',\n",
" 'max_source_length' : 512,\n",
" 'max_target_length' : 128,\n",
" 'beam_size' : 10,\n",
" 'do_lower_case' : False,\n",
" 'device' : torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"})"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "e8dk5RwvToOv"
},
"source": [
"# flask_ngrok_example.py\n",
"from flask_ngrok import run_with_ngrok\n",
"\n",
"app = Flask(__name__)\n",
"run_with_ngrok(app) # Start ngrok when app is run\n",
"\n",
"config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]\n",
"config = config_class.from_pretrained(args.config_name)\n",
"args.tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, do_lower_case=args.do_lower_case)\n",
"\n",
"# budild model\n",
"args.added_model =get_model(model_class=model_class, config=config,\n",
" tokenizer=args.tokenizer, mode='added').to(args.device)\n",
"args.diff_model = get_model(model_class=model_class, config=config,\n",
" tokenizer=args.tokenizer, mode='diff').to(args.device)\n",
"\n",
"app = create_app()\n",
"app.run()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "DXkBcO_sU_VN"
},
"source": [
"## Set commit configure\n",
"Now, set commit configure on your local computer.\n",
"```shell\n",
"$ commit configure --endpoint http://********.ngrok.io\n",
"```"
]
}
]
}
\ No newline at end of file
......
FROM nvcr.io/nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04
LABEL maintainer="nlkey2022@gmail.com"
RUN DEBIAN_FRONTEND=noninteractive apt-get -qq update \
&& DEBIAN_FRONTEND=noninteractive apt-get -qqy install curl python3-pip git \
&& rm -rf /var/lib/apt/lists/*
ARG PYTORCH_WHEEL="https://download.pytorch.org/whl/cu101/torch-1.6.0%2Bcu101-cp36-cp36m-linux_x86_64.whl"
ARG ADDED_MODEL="1-F68ymKxZ-htCzQ8_Y9iHexs2SJmP5Gc"
ARG DIFF_MODEL="1-39rmu-3clwebNURMQGMt-oM4HsAkbsf"
RUN git clone https://github.com/graykode/commit-autosuggestions.git /app/commit-autosuggestions \
&& cd /app/commit-autosuggestions
WORKDIR /app/commit-autosuggestions
RUN pip3 install ${PYTORCH_WHEEL} gdown
RUN gdown https://drive.google.com/uc?id=${ADDED_MODEL} -O weight/javascript/added/
RUN gdown https://drive.google.com/uc?id=${DIFF_MODEL} -O weight/javascript/diff/
RUN pip3 install -r requirements.txt
ENTRYPOINT ["python3", "app.py", "--load_model_path", "./weight/javascript/"]
......@@ -10,14 +10,14 @@ ARG ADDED_MODEL="1YrkwfM-0VBCJaa9NYaXUQPODdGPsmQY4"
ARG DIFF_MODEL="1--gcVVix92_Fp75A-mWH0pJS0ahlni5m"
RUN git clone https://github.com/graykode/commit-autosuggestions.git /app/commit-autosuggestions \
&& cd /app/commit-autosuggestions && python3 setup.py install
&& cd /app/commit-autosuggestions
WORKDIR /app/commit-autosuggestions
RUN pip3 install ${PYTORCH_WHEEL} gdown
RUN gdown https://drive.google.com/uc?id=${ADDED_MODEL} -O weight/added/
RUN gdown https://drive.google.com/uc?id=${DIFF_MODEL} -O weight/diff/
RUN gdown https://drive.google.com/uc?id=${ADDED_MODEL} -O weight/python/added/
RUN gdown https://drive.google.com/uc?id=${DIFF_MODEL} -O weight/python/diff/
RUN pip3 install -r requirements.txt
ENTRYPOINT ["python3", "app.py"]
ENTRYPOINT ["python3", "app.py", "--load_model_path", "./weight/python/"]
......
......@@ -104,6 +104,8 @@ optional arguments:
The maximum total target sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
```
> If a `UnicodeDecodeError` occurs while using gitparser.py, you must use a [GitPython](https://github.com/gitpython-developers/GitPython) version that includes at least [this commit](https://github.com/gitpython-developers/GitPython/commit/bfbd5ece215dea328c3c6c4cba31225caa66ae9a).
#### 3. Training the Added model (optional for the Python language).
The Added model has already been trained for Python. So, if you only want to build a Diff model for the Python language, step 3 can be skipped. However, for the other languages (JavaScript, GO, Ruby, PHP and JAVA), [Code2NL training](https://github.com/microsoft/CodeBERT#fine-tune-1) is required to produce the initial weights for the model used in step 4 (a rough sketch of that hand-off follows below).
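As an illustration only (not the project's exact training script; the path and the `diff_model` variable are hypothetical), step 4 can start from the Code2NL weights with the same `strict=False` load used in the notebook above, which tolerates the new `patch_ids_embedding` parameters:
```python
import torch

# Hypothetical: 'diff_model' is the Seq2Seq model built as in the notebook above.
# strict=False ignores parameters the Code2NL checkpoint does not contain
# (e.g. the new patch_ids_embedding).
state = torch.load('weight/python/added/pytorch_model.bin',
                   map_location=torch.device('cpu'))
diff_model.load_state_dict(state, strict=False)
```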
......
......@@ -24,6 +24,15 @@ from multiprocessing.pool import Pool
from transformers import RobertaTokenizer
from pydriller import RepositoryMining
language = {
'py' : ['.py'],
'js' : ['.js', '.ts'],
'go' : ['.go'],
'java' : ['.java'],
'ruby' : ['.rb'],
'php' : ['.php']
}
def message_cleaner(message):
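# keep only the first line of the message and strip issue references such as "(#123)"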
msg = message.split("\n")[0]
msg = re.sub(r"(\(|)#([0-9])+(\)|)", "", msg)
......@@ -34,7 +43,7 @@ def jobs(repo, args):
repo_path = os.path.join(args.repos_dir, repo)
if os.path.exists(repo_path):
for commit in RepositoryMining(
repo_path, only_modifications_with_file_types=['.py']
repo_path, only_modifications_with_file_types=language[args.lang]
).traverse_commits():
cleaned_message = message_cleaner(commit.msg)
tokenized_message = args.tokenizer.tokenize(cleaned_message)
......@@ -44,7 +53,7 @@ def jobs(repo, args):
for mod in commit.modifications:
if not (mod.old_path and mod.new_path):
continue
if os.path.splitext(mod.new_path)[1] != '.py':
if os.path.splitext(mod.new_path)[1] not in language[args.lang]:
continue
if not mod.diff_parsed["added"]:
continue
......@@ -121,6 +130,9 @@ if __name__ == "__main__":
help="directory that all repositories had been downloaded.",)
parser.add_argument("--output_dir", type=str, required=True,
help="The output directory where the preprocessed data will be written.")
parser.add_argument("--lang", type=str, required=True,
choices=['py', 'js', 'go', 'java', 'ruby', 'php'],
help="The output directory where the preprocessed data will be written.")
parser.add_argument("--tokenizer_name", type=str,
default="microsoft/codebert-base", help="The name of tokenizer",)
parser.add_argument("--num_workers", default=4, type=int, help="number of process")
......
https://github.com/freeCodeCamp/freeCodeCamp
https://github.com/vuejs/vue
https://github.com/facebook/react
https://github.com/twbs/bootstrap
https://github.com/airbnb/javascript
https://github.com/d3/d3
https://github.com/facebook/react-native
https://github.com/trekhleb/javascript-algorithms
https://github.com/facebook/create-react-app
https://github.com/axios/axios
https://github.com/nodejs/node
https://github.com/mrdoob/three.js
https://github.com/mui-org/material-ui
https://github.com/angular/angular.js
https://github.com/vercel/next.js
https://github.com/webpack/webpack
https://github.com/jquery/jquery
https://github.com/hakimel/reveal.js
https://github.com/atom/atom
https://github.com/socketio/socket.io
https://github.com/chartjs/Chart.js
https://github.com/expressjs/express
https://github.com/typicode/json-server
https://github.com/adam-p/markdown-here
https://github.com/Semantic-Org/Semantic-UI
https://github.com/h5bp/html5-boilerplate
https://github.com/gatsbyjs/gatsby
https://github.com/lodash/lodash
https://github.com/yangshun/tech-interview-handbook
https://github.com/moment/moment
https://github.com/apache/incubator-echarts
https://github.com/meteor/meteor
https://github.com/ReactTraining/react-router
https://github.com/yarnpkg/yarn
https://github.com/sveltejs/svelte
https://github.com/Dogfalo/materialize
https://github.com/prettier/prettier
https://github.com/serverless/serverless
https://github.com/babel/babel
https://github.com/nwjs/nw.js
https://github.com/juliangarnier/anime
https://github.com/parcel-bundler/parcel
https://github.com/ColorlibHQ/AdminLTE
https://github.com/impress/impress.js
https://github.com/TryGhost/Ghost
https://github.com/Unitech/pm2
https://github.com/mozilla/pdf.js
https://github.com/mermaid-js/mermaid
https://github.com/algorithm-visualizer/algorithm-visualizer
https://github.com/adobe/brackets
https://github.com/gulpjs/gulp
https://github.com/hexojs/hexo
https://github.com/styled-components/styled-components
https://github.com/nuxt/nuxt.js
https://github.com/sahat/hackathon-starter
https://github.com/alvarotrigo/fullPage.js
https://github.com/strapi/strapi
https://github.com/immutable-js/immutable-js
https://github.com/koajs/koa
https://github.com/videojs/video.js
https://github.com/zenorocha/clipboard.js
https://github.com/Leaflet/Leaflet
https://github.com/RocketChat/Rocket.Chat
https://github.com/photonstorm/phaser
https://github.com/quilljs/quill
https://github.com/jashkenas/backbone
https://github.com/preactjs/preact
https://github.com/tastejs/todomvc
https://github.com/caolan/async
https://github.com/vuejs/vue-cli
https://github.com/react-boilerplate/react-boilerplate
https://github.com/aosabook/500lines
https://github.com/carbon-app/carbon
https://github.com/Marak/faker.js
https://github.com/jashkenas/underscore
https://github.com/lerna/lerna
https://github.com/nolimits4web/swiper
https://github.com/vuejs/vuex
https://github.com/request/request
https://github.com/select2/select2
https://github.com/Modernizr/Modernizr
https://github.com/facebook/draft-js
https://github.com/rollup/rollup
https://github.com/jlmakes/scrollreveal
https://github.com/tj/commander.js
https://github.com/chenglou/react-motion
https://github.com/swagger-api/swagger-ui
https://github.com/bilibili/flv.js
https://github.com/segmentio/nightmare
https://github.com/laurent22/joplin
https://github.com/react-bootstrap/react-bootstrap
https://github.com/sampotts/plyr
https://github.com/avajs/ava
https://github.com/immerjs/immer
https://github.com/jorgebucaran/hyperapp
https://github.com/jaredhanson/passport
https://github.com/lovell/sharp
https://github.com/localForage/localForage
https://github.com/Popmotion/popmotion
https://github.com/vuejs/vuepress
\ No newline at end of file
diff --git a/function.js b/function.js
new file mode 100644
index 0000000..ba89d9a
--- /dev/null
+++ b/function.js
@@ -0,0 +1,6 @@
+function getIntoAnArgument() {
+ var args = arguments.slice();
+ args.forEach(function(arg) {
+ console.log(arg);
+ });
+}
\ No newline at end of file
diff --git a/function.js b/function.js
index ba89d9a..d440734 100644
--- a/function.js
+++ b/function.js
@@ -1,6 +1,3 @@
-function getIntoAnArgument() {
- var args = arguments.slice();
- args.forEach(function(arg) {
- console.log(arg);
- });
+function getIntoAnArgument(...args) {
+ args.forEach(arg => console.log(arg));
}
\ No newline at end of file
......@@ -65,10 +65,6 @@ class CitiesTestCase(unittest.TestCase):
)
)
self.assertEqual(response.status_code, 200)
self.assertEqual(
json.loads(response.text),
{'idx': 0, 'message': ['Test method .']}
)
def test_added(self):
response = requests.post(
......@@ -83,10 +79,6 @@ class CitiesTestCase(unittest.TestCase):
)
)
self.assertEqual(response.status_code, 200)
self.assertEqual(
json.loads(response.text),
{'idx': 0, 'message': ['Fix typo']}
)
def suite():
......