(add) support other language in dockerfile and gitparser

graykode
Commit f7ef6cd002b31512a2236a0f8ba22f5e210f344e, 1 parent: 705ee9a5
Showing 3 changed files with 37 additions and 18 deletions
README.md
docker/Dockerfile → docker/python/Dockerfile
gitparser.py
--- a/README.md
+++ b/README.md
@@ -46,17 +46,15 @@ Recommended Commit Message : Remove unused imports
 To solve this problem, use a new embedding called [`patch_type_embeddings`](https://github.com/graykode/commit-autosuggestions/blob/master/commit/model/diff_roberta.py#L40) that can distinguish added and deleted, just as Lample et al., 2019 (XLM) used language embeddings. (1 for added, 2 for deleted.)
 ### Language support
-| Language       | Added | Diff |
+| Language       | Added | Diff |  Data(Diff) | Weights |
-| :------------- | :---: | :---:|
+| :------------- | :---: | :---:| :---: | :---:|
-| Python         | ✅    | ✅    |
+| Python         | ✅    | ✅   | [link](https://drive.google.com/drive/folders/1_8lQmzTH95Nc-4MKd1RP3x4BVc8tBA6W?usp=sharing) |  [link](https://drive.google.com/drive/folders/1OwM7_FiLiwVJAhAanBPWtPw3Hz3Dszbh?usp=sharing)  |
-| JavaScript     | ⬜    | ⬜    |
+| JavaScript     | ⬜    | ⬜   | ⬜ |  ⬜  |
-| Go             | ⬜    | ⬜    |
+| Go             | ⬜    | ⬜   | ⬜ |  ⬜  |
-| JAVA           | ⬜    | ⬜    |
+| JAVA           | ⬜    | ⬜   | ⬜ |  ⬜  |
-| Ruby           | ⬜    | ⬜    |
+| Ruby           | ⬜    | ⬜   | ⬜ |  ⬜  |
-| PHP            | ⬜    | ⬜    |
+| PHP            | ⬜    | ⬜   | ⬜ |  ⬜  |
 * ✅ — Supported
-* 🔶 — Partial support
-* 🚧 — Under development
 * ⬜ - N/A
 We plan to gradually add the languages that are not currently supported. However, I need expensive GPU instances on AWS or GCP to train models for the languages above. Please consider sponsoring this work!
@@ -68,9 +66,18 @@ To run this project, you need a flask-based inference server (GPU) and a client 
 Prepare Docker and Nvidia-docker before running the server.
 ##### 1-a. If you have GPU machine.
-Serve flask server with Nvidia Docker
+Serve the Flask server with Nvidia Docker. Check the Docker tag for each programming language [here](https://hub.docker.com/repository/registry-1.docker.io/graykode/commit-autosuggestions/tags).
+| Language       | Tag   |
+| :------------- | :---: |
+| Python         | py    |
+| JavaScript     | js    |
+| Go             | go    |
+| JAVA           | java  |
+| Ruby           | ruby  |
+| PHP            | php   |
+
 ```shell script
-$ docker run -it --gpus 0 -p 5000:5000 commit-autosuggestions:0.1-gpu
+$ docker run -it -d --gpus 0 -p 5000:5000 graykode/commit-autosuggestions:{language}
 ```
 ##### 1-b. If you don't have GPU machine.
--- a/docker/Dockerfile
+++ b/docker/python/Dockerfile
@@ -10,14 +10,14 @@ ARG ADDED_MODEL="1YrkwfM-0VBCJaa9NYaXUQPODdGPsmQY4"
 ARG DIFF_MODEL="1--gcVVix92_Fp75A-mWH0pJS0ahlni5m"
 RUN git clone https://github.com/graykode/commit-autosuggestions.git /app/commit-autosuggestions \
-    && cd /app/commit-autosuggestions && python3 setup.py install
+    && cd /app/commit-autosuggestions
 WORKDIR /app/commit-autosuggestions
 RUN pip3 install ${PYTORCH_WHEEL} gdown
-RUN gdown https://drive.google.com/uc?id=${ADDED_MODEL} -O weight/added/
+RUN gdown https://drive.google.com/uc?id=${ADDED_MODEL} -O weight/python/added/
-RUN gdown https://drive.google.com/uc?id=${DIFF_MODEL} -O weight/diff/
+RUN gdown https://drive.google.com/uc?id=${DIFF_MODEL} -O weight/python/diff/
 RUN pip3 install -r requirements.txt
-ENTRYPOINT ["python3", "app.py"]
+ENTRYPOINT ["python3", "app.py", "--load_model_path", "./weights/python/"]
--- a/gitparser.py
+++ b/gitparser.py
@@ -24,6 +24,15 @@ from multiprocessing.pool import Pool
 from transformers import RobertaTokenizer
 from pydriller import RepositoryMining
+language = {
+    'py' : ['.py'],
+    'js' : ['.js', '.ts'],
+    'go' : ['.go'],
+    'java' : ['.java'],
+    'ruby' : ['.rb'],
+    'php' : ['.php']
+}
+
 def message_cleaner(message):
     msg = message.split("\n")[0]
     msg = re.sub(r"(\(|)#([0-9])+(\)|)", "", msg)
@@ -34,7 +43,7 @@ def jobs(repo, args):
     repo_path = os.path.join(args.repos_dir, repo)
     if os.path.exists(repo_path):
         for commit in RepositoryMining(
-            repo_path, only_modifications_with_file_types=['.py']
+            repo_path, only_modifications_with_file_types=language[args.lang]
         ).traverse_commits():
             cleaned_message = message_cleaner(commit.msg)
             tokenized_message = args.tokenizer.tokenize(cleaned_message)
@@ -44,7 +53,7 @@ def jobs(repo, args):
             for mod in commit.modifications:
                 if not (mod.old_path and mod.new_path):
                     continue
-                if os.path.splitext(mod.new_path)[1] != '.py':
+                if os.path.splitext(mod.new_path)[1] not in language[args.lang]:
                     continue
                 if not mod.diff_parsed["added"]:
                     continue
@@ -121,6 +130,9 @@ if __name__ == "__main__":
                         help="directory that all repositories had been downloaded.",)
     parser.add_argument("--output_dir", type=str, required=True,
                         help="The output directory where the preprocessed data will be written.")
+    parser.add_argument("--lang", type=str, required=True,
+                        choices=['py', 'js', 'go', 'java', 'ruby', 'php'],
+                        help="The programming language whose source files will be parsed.")
     parser.add_argument("--tokenizer_name", type=str,
                         default="microsoft/codebert-base", help="The name of tokenizer",)
     parser.add_argument("--num_workers", default=4, type=int, help="number of process")
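
The extension check that this commit adds to `jobs` can be tried in isolation. The following is a minimal sketch, not the code in `gitparser.py`: the mapping mirrors the `language` dict from the diff, while `LANGUAGE_EXTENSIONS`, `matches_language`, `filter_paths`, and the sample paths are hypothetical names introduced only for illustration.

```python
import os

# Mirrors the `language` mapping added to gitparser.py in this commit.
# The constant name is an assumption made for this sketch.
LANGUAGE_EXTENSIONS = {
    "py": [".py"],
    "js": [".js", ".ts"],
    "go": [".go"],
    "java": [".java"],
    "ruby": [".rb"],
    "php": [".php"],
}


def matches_language(path, lang):
    """Return True if the extension of `path` is listed for `lang`."""
    return os.path.splitext(path)[1] in LANGUAGE_EXTENSIONS[lang]


def filter_paths(paths, lang):
    """Keep only the paths whose extension belongs to `lang`."""
    return [p for p in paths if matches_language(p, lang)]


if __name__ == "__main__":
    # Hypothetical file paths, used only to exercise the filter.
    sample = ["src/app.ts", "lib/util.js", "README.md", "main.go"]
    print(filter_paths(sample, "js"))  # ['src/app.ts', 'lib/util.js']
```

Because `js` maps to both `.js` and `.ts`, TypeScript files are grouped with JavaScript, which is why the membership test uses `in` rather than the single-extension comparison it replaces.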