(add) support other language in dockerfile and gitparser

graykode
Commit f7ef6cd002b31512a2236a0f8ba22f5e210f344e f7ef6cd0 1 parent 705ee9a5
Showing 3 changed files with 37 additions and 18 deletions
README.md
docker/Dockerfile → docker/python/Dockerfile
gitparser.py
--- a/README.md
+++ b/README.md
@@ -46,17 +46,15 @@ Recommended Commit Message : Remove unused imports
 To solve this problem, use a new embedding called [`patch_type_embeddings`](https://github.com/graykode/commit-autosuggestions/blob/master/commit/model/diff_roberta.py#L40) that distinguishes added from deleted code, just as Lample et al., 2019 (XLM) used language embeddings. (1 for added, 2 for deleted.)
 
 ### Language support
- | Language       | Added | Diff |
- | :------------- | :---: | :---:|
- | Python         | ✅    | ✅    |
- | JavaScript     | ⬜    | ⬜    |
- | Go             | ⬜    | ⬜    |
- | JAVA           | ⬜    | ⬜    |
- | Ruby           | ⬜    | ⬜    |
- | PHP            | ⬜    | ⬜    |
+ | Language       | Added | Diff |  Data(Diff) | Weights |
+ | :------------- | :---: | :---:| :---: | :---:|
+ | Python         | ✅    | ✅   | [link](https://drive.google.com/drive/folders/1_8lQmzTH95Nc-4MKd1RP3x4BVc8tBA6W?usp=sharing) |  [link](https://drive.google.com/drive/folders/1OwM7_FiLiwVJAhAanBPWtPw3Hz3Dszbh?usp=sharing)  |
+ | JavaScript     | ⬜    | ⬜   | ⬜ |  ⬜  |
+ | Go             | ⬜    | ⬜   | ⬜ |  ⬜  |
+ | JAVA           | ⬜    | ⬜   | ⬜ |  ⬜  |
+ | Ruby           | ⬜    | ⬜   | ⬜ |  ⬜  |
+ | PHP            | ⬜    | ⬜   | ⬜ |  ⬜  |
 * ✅ — Supported
- * 🔶 — Partial support
- * 🚧 — Under development
 * ⬜ — N/A
 
 We plan to gradually add the languages that are not yet supported. However, training models for them requires expensive GPU instances on AWS or GCP. Please consider sponsoring this work!
@@ -68,9 +66,18 @@ To run this project, you need a flask-based inference server (GPU) and a client 
 Prepare Docker and Nvidia-docker before running the server.
 
 ##### 1-a. If you have a GPU machine.
- Serve flask server with Nvidia Docker
+ Serve the Flask server with Nvidia Docker. The Docker tag for each programming language is listed [here](https://hub.docker.com/repository/registry-1.docker.io/graykode/commit-autosuggestions/tags).
+ | Language       | Tag   |
+ | :------------- | :---: |
+ | Python         | py    |
+ | JavaScript     | js    |
+ | Go             | go    |
+ | JAVA           | java  |
+ | Ruby           | ruby  |
+ | PHP            | php   |
+ 
 ```shell script
- $ docker run -it --gpus 0 -p 5000:5000 commit-autosuggestions:0.1-gpu
+ $ docker run -it -d --gpus 0 -p 5000:5000 graykode/commit-autosuggestions:{language}
 ```
 
 ##### 1-b. If you don't have a GPU machine.
--- a/docker/Dockerfile → docker/python/Dockerfile
+++ b/docker/Dockerfile → docker/python/Dockerfile
@@ -10,14 +10,14 @@ ARG ADDED_MODEL="1YrkwfM-0VBCJaa9NYaXUQPODdGPsmQY4"
 ARG DIFF_MODEL="1--gcVVix92_Fp75A-mWH0pJS0ahlni5m"
 
 RUN git clone https://github.com/graykode/commit-autosuggestions.git /app/commit-autosuggestions \
-     && cd /app/commit-autosuggestions && python3 setup.py install
+     && cd /app/commit-autosuggestions
 
 WORKDIR /app/commit-autosuggestions
 
 RUN pip3 install ${PYTORCH_WHEEL} gdown
- RUN gdown https://drive.google.com/uc?id=${ADDED_MODEL} -O weight/added/
- RUN gdown https://drive.google.com/uc?id=${DIFF_MODEL} -O weight/diff/
+ RUN gdown https://drive.google.com/uc?id=${ADDED_MODEL} -O weight/python/added/
+ RUN gdown https://drive.google.com/uc?id=${DIFF_MODEL} -O weight/python/diff/
 
 RUN pip3 install -r requirements.txt
 
- ENTRYPOINT ["python3", "app.py"]
+ ENTRYPOINT ["python3", "app.py", "--load_model_path", "./weight/python/"]
--- a/gitparser.py
+++ b/gitparser.py
@@ -24,6 +24,15 @@ from multiprocessing.pool import Pool
 from transformers import RobertaTokenizer
 from pydriller import RepositoryMining
 
+ language = {
+     'py' : ['.py'],
+     'js' : ['.js', '.ts'],
+     'go' : ['.go'],
+     'java' : ['.java'],
+     'ruby' : ['.rb'],
+     'php' : ['.php']
+ }
+ 
 def message_cleaner(message):
     msg = message.split("\n")[0]
     msg = re.sub(r"(\(|)#([0-9])+(\)|)", "", msg)
@@ -34,7 +43,7 @@ def jobs(repo, args):
     repo_path = os.path.join(args.repos_dir, repo)
     if os.path.exists(repo_path):
         for commit in RepositoryMining(
-             repo_path, only_modifications_with_file_types=['.py']
+             repo_path, only_modifications_with_file_types=language[args.lang]
         ).traverse_commits():
             cleaned_message = message_cleaner(commit.msg)
             tokenized_message = args.tokenizer.tokenize(cleaned_message)
@@ -44,7 +53,7 @@ def jobs(repo, args):
             for mod in commit.modifications:
                 if not (mod.old_path and mod.new_path):
                     continue
-                 if os.path.splitext(mod.new_path)[1] != '.py':
+                 if os.path.splitext(mod.new_path)[1] not in language[args.lang]:
                     continue
                 if not mod.diff_parsed["added"]:
                     continue
@@ -121,6 +130,9 @@ if __name__ == "__main__":
                         help="directory where all repositories have been downloaded.",)
     parser.add_argument("--output_dir", type=str, required=True,
                         help="The output directory where the preprocessed data will be written.")
+     parser.add_argument("--lang", type=str, required=True,
+                         choices=['py', 'js', 'go', 'java', 'ruby', 'php'],
+                         help="The programming language of the repositories to preprocess.")
     parser.add_argument("--tokenizer_name", type=str,
                         default="microsoft/codebert-base", help="The name of tokenizer",)
     parser.add_argument("--num_workers", default=4, type=int, help="number of processes")
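
Note on the gitparser.py hunks: the change replaces the hard-coded `'.py'` check with a lookup keyed by `--lang`. The following is a minimal standalone sketch of that idea, not the project's code. The names `LANGUAGE_EXTENSIONS`, `is_supported`, and `build_parser` are hypothetical and exist only for illustration; the extension table itself is copied from the diff above.

```python
import argparse
import os

# Same mapping as the `language` dict added in gitparser.py.
# The constant name here is hypothetical.
LANGUAGE_EXTENSIONS = {
    'py': ['.py'],
    'js': ['.js', '.ts'],
    'go': ['.go'],
    'java': ['.java'],
    'ruby': ['.rb'],
    'php': ['.php'],
}


def is_supported(path, lang):
    """Return True when the file extension belongs to the selected language."""
    return os.path.splitext(path)[1] in LANGUAGE_EXTENSIONS[lang]


def build_parser():
    """Build a parser exposing only the --lang option discussed here."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lang", type=str, required=True,
                        choices=sorted(LANGUAGE_EXTENSIONS),
                        help="Programming language whose files are kept.")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--lang", "js"])

# Hypothetical file paths, as they might appear in a commit's modifications.
paths = ["web/index.js", "server/app.py", "web/types.ts", "README.md"]
kept = [p for p in paths if is_supported(p, args.lang)]
print(kept)
```

Deriving `choices` from the dictionary keys keeps the accepted `--lang` values and the extension table in sync, so a new language only needs to be added in one place.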