graykode

(add) support other language in dockerfile and gitparser

......@@ -46,17 +46,15 @@ Recommended Commit Message : Remove unused imports
To solve this problem, we use a new embedding called [`patch_type_embeddings`](https://github.com/graykode/commit-autosuggestions/blob/master/commit/model/diff_roberta.py#L40) that distinguishes added code from deleted code, just as Lample et al., 2019 (XLM) used language embeddings (1 for added, 2 for deleted).
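As a rough illustration of the idea above, the sketch below assigns a patch-type id to every token of a diff: 1 for tokens from added lines and 2 for tokens from deleted lines. The helper name `build_patch_type_ids` and the added-then-deleted ordering are assumptions made for this example, not the repository's code. In a model, such ids would typically index an embedding table whose vectors are summed with the token embeddings.

```python
# Patch-type ids as stated in the text: 1 for added, 2 for deleted.
ADDED, DELETED = 1, 2


def build_patch_type_ids(added_tokens, deleted_tokens):
    """Hypothetical helper: concatenate both token lists and tag each token."""
    tokens = list(added_tokens) + list(deleted_tokens)
    patch_type_ids = [ADDED] * len(added_tokens) + [DELETED] * len(deleted_tokens)
    return tokens, patch_type_ids


tokens, type_ids = build_patch_type_ids(["import", "os"], ["import", "sys"])
for token, type_id in zip(tokens, type_ids):
    print(token, type_id)
```

The two `import` tokens are identical as text, so only the patch-type id tells the model which one was added and which one was deleted.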
### Language support
| Language | Added | Diff |
| :------------- | :---: | :---:|
| Python | ✅ | ✅ |
| JavaScript | ⬜ | ⬜ |
| Go | ⬜ | ⬜ |
| JAVA | ⬜ | ⬜ |
| Ruby | ⬜ | ⬜ |
| PHP | ⬜ | ⬜ |
| Language | Added | Diff | Data(Diff) | Weights |
| :------------- | :---: | :---:| :---: | :---:|
| Python | ✅ | ✅ | [link](https://drive.google.com/drive/folders/1_8lQmzTH95Nc-4MKd1RP3x4BVc8tBA6W?usp=sharing) | [link](https://drive.google.com/drive/folders/1OwM7_FiLiwVJAhAanBPWtPw3Hz3Dszbh?usp=sharing) |
| JavaScript | ⬜ | ⬜ | ⬜ | ⬜ |
| Go | ⬜ | ⬜ | ⬜ | ⬜ |
| JAVA | ⬜ | ⬜ | ⬜ | ⬜ |
| Ruby | ⬜ | ⬜ | ⬜ | ⬜ |
| PHP | ⬜ | ⬜ | ⬜ | ⬜ |
* ✅ — Supported
* 🔶 — Partial support
* 🚧 — Under development
* ⬜ - N/A
We plan to gradually add support for the languages that are not yet covered. However, training models for them requires expensive GPU instances on AWS or GCP. Please consider sponsoring this work!
......@@ -68,9 +66,18 @@ To run this project, you need a flask-based inference server (GPU) and a client
Prepare Docker and Nvidia-docker before running the server.
##### 1-a. If you have a GPU machine.
Serve flask server with Nvidia Docker
Serve the Flask server with Nvidia Docker. Check the Docker tag for each programming language [here](https://hub.docker.com/repository/registry-1.docker.io/graykode/commit-autosuggestions/tags).
| Language | Tag |
| :------------- | :---: |
| Python | py |
| JavaScript | js |
| Go | go |
| JAVA | java |
| Ruby | ruby |
| PHP | php |
```shell script
$ docker run -it --gpus 0 -p 5000:5000 commit-autosuggestions:0.1-gpu
$ docker run -it -d --gpus 0 -p 5000:5000 graykode/commit-autosuggestions:{language}
```
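To make the `{language}` placeholder concrete, here is a minimal sketch that fills it in from the tag table. The helper `docker_run_command` is hypothetical and only assembles the command shown in the README; it does not run Docker.

```python
# Language -> Docker tag, copied from the table in the README.
DOCKER_TAGS = {
    "Python": "py",
    "JavaScript": "js",
    "Go": "go",
    "JAVA": "java",
    "Ruby": "ruby",
    "PHP": "php",
}


def docker_run_command(language):
    """Hypothetical helper: build the GPU `docker run` command for one language."""
    tag = DOCKER_TAGS[language]  # raises KeyError for a language not in the table
    return [
        "docker", "run", "-it", "-d", "--gpus", "0",
        "-p", "5000:5000",
        f"graykode/commit-autosuggestions:{tag}",
    ]


command = " ".join(docker_run_command("Go"))
print(command)
```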
##### 1-b. If you don't have a GPU machine.
......
......@@ -10,14 +10,14 @@ ARG ADDED_MODEL="1YrkwfM-0VBCJaa9NYaXUQPODdGPsmQY4"
ARG DIFF_MODEL="1--gcVVix92_Fp75A-mWH0pJS0ahlni5m"
RUN git clone https://github.com/graykode/commit-autosuggestions.git /app/commit-autosuggestions \
&& cd /app/commit-autosuggestions && python3 setup.py install
&& cd /app/commit-autosuggestions
WORKDIR /app/commit-autosuggestions
RUN pip3 install ${PYTORCH_WHEEL} gdown
RUN gdown https://drive.google.com/uc?id=${ADDED_MODEL} -O weight/added/
RUN gdown https://drive.google.com/uc?id=${DIFF_MODEL} -O weight/diff/
RUN gdown https://drive.google.com/uc?id=${ADDED_MODEL} -O weight/python/added/
RUN gdown https://drive.google.com/uc?id=${DIFF_MODEL} -O weight/python/diff/
RUN pip3 install -r requirements.txt
ENTRYPOINT ["python3", "app.py"]
ENTRYPOINT ["python3", "app.py", "--load_model_path", "./weight/python/"]
......
......@@ -24,6 +24,15 @@ from multiprocessing.pool import Pool
from transformers import RobertaTokenizer
from pydriller import RepositoryMining
language = {
'py' : ['.py'],
'js' : ['.js', '.ts'],
'go' : ['.go'],
'java' : ['.java'],
'ruby' : ['.rb'],
'php' : ['.php']
}
def message_cleaner(message):
msg = message.split("\n")[0]
msg = re.sub(r"(\(|)#([0-9])+(\)|)", "", msg)
......@@ -34,7 +43,7 @@ def jobs(repo, args):
repo_path = os.path.join(args.repos_dir, repo)
if os.path.exists(repo_path):
for commit in RepositoryMining(
repo_path, only_modifications_with_file_types=['.py']
repo_path, only_modifications_with_file_types=language[args.lang]
).traverse_commits():
cleaned_message = message_cleaner(commit.msg)
tokenized_message = args.tokenizer.tokenize(cleaned_message)
......@@ -44,7 +53,7 @@ def jobs(repo, args):
for mod in commit.modifications:
if not (mod.old_path and mod.new_path):
continue
if os.path.splitext(mod.new_path)[1] != '.py':
if os.path.splitext(mod.new_path)[1] not in language[args.lang]:
continue
if not mod.diff_parsed["added"]:
continue
......@@ -121,6 +130,9 @@ if __name__ == "__main__":
help="Directory where all repositories have been downloaded.",)
parser.add_argument("--output_dir", type=str, required=True,
help="The output directory where the preprocessed data will be written.")
parser.add_argument("--lang", type=str, required=True,
choices=['py', 'js', 'go', 'java', 'ruby', 'php'],
help="The programming language of the source files to preprocess.")
parser.add_argument("--tokenizer_name", type=str,
default="microsoft/codebert-base", help="The name of tokenizer",)
parser.add_argument("--num_workers", default=4, type=int, help="number of processes")
......
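The language filter introduced in this change can be tried in isolation. The following is a minimal sketch, assuming the same `language` mapping and `--lang` choices as the diff above; the helper `is_target_file` and the sample paths are made up for illustration, and the pydriller traversal is left out.

```python
import argparse
import os

# Same mapping as in the diff: language key -> accepted file extensions.
language = {
    'py': ['.py'],
    'js': ['.js', '.ts'],
    'go': ['.go'],
    'java': ['.java'],
    'ruby': ['.rb'],
    'php': ['.php'],
}


def is_target_file(path, lang):
    """Hypothetical helper: True if the file extension belongs to `lang`."""
    return os.path.splitext(path)[1] in language[lang]


parser = argparse.ArgumentParser()
parser.add_argument("--lang", type=str, required=True, choices=list(language))
args = parser.parse_args(["--lang", "js"])  # stands in for the real command line

paths = ["src/app.ts", "README.md", "lib/util.js", "main.go"]
kept = [p for p in paths if is_target_file(p, args.lang)]
print(kept)  # ['src/app.ts', 'lib/util.js']
```

Because `choices` is derived from the mapping, an unsupported value such as `--lang rust` is rejected by `argparse` before any repository is traversed.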