Merge branch 'Youtube' into 'master'
Youtube Crawl: YouTube crawl results and feature implementation. See merge request !1
Showing 9 changed files with 573 additions and 24 deletions
JPype1-0.7.0-cp38-cp38-win_amd64.whl
0 → 100644
No preview for this file type
Youtube/.gitignore
0 → 100644
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*,cover
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
Youtube/LICENSE
0 → 100644
+The MIT License (MIT)
+
+Copyright (c) 2015 Egbert Bouman
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
Youtube/README.md
0 → 100644
+# youtube-comment-downloader
+Simple script for downloading YouTube comments without using the YouTube API. The output is in line-delimited JSON.
+
+### Dependencies
+* Python 3 (the script imports urllib.parse)
+* requests
+* lxml
+* cssselect
+
+The Python packages can be installed with
+
+    pip install requests lxml cssselect
+
+### Usage
+```
+usage: downloader.py [--help] [--youtubeid YOUTUBEID] [--output OUTPUT]
+
+Download YouTube comments without using the YouTube API
+
+optional arguments:
+  --help, -h            Show this help message and exit
+  --youtubeid YOUTUBEID, -y YOUTUBEID
+                        ID of YouTube video for which to download the comments
+  --output OUTPUT, -o OUTPUT
+                        Output filename (output format is line-delimited JSON)
+```
+
+Note: in this merge request the script prompts for the video URL and output settings interactively (see downloader.py below) instead of using the flags above.
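Each output line is a standalone JSON object with the keys `cid`, `text`, `time` and `author` (see `extract_comments` in downloader.py). A minimal sketch for reading the results back, assuming the output file was named `comments.json` (a hypothetical name):

```python
import json

# Read the line-delimited JSON written by downloader.py; each line is
# one comment with the keys 'cid', 'text', 'time' and 'author'.
with open('comments.json', encoding='utf8') as fp:
    comments = [json.loads(line) for line in fp]

print(len(comments), 'comments loaded')
```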
Youtube/downloader.py
0 → 100644
+#!/usr/bin/env python
+
+from __future__ import print_function
+import sys
+import os
+import time
+import json
+import requests
+import argparse
+import lxml.html
+import io
+from urllib.parse import urlparse, parse_qs
+from lxml.cssselect import CSSSelector
+
+YOUTUBE_COMMENTS_URL = 'https://www.youtube.com/all_comments?v={youtube_id}'
+YOUTUBE_COMMENTS_AJAX_URL = 'https://www.youtube.com/comment_ajax'
+
+USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
+
+
+def find_value(html, key, num_chars=2):
+    pos_begin = html.find(key) + len(key) + num_chars
+    pos_end = html.find('"', pos_begin)
+    return html[pos_begin: pos_end]
+
+
+def extract_comments(html):
+    tree = lxml.html.fromstring(html)
+    item_sel = CSSSelector('.comment-item')
+    text_sel = CSSSelector('.comment-text-content')
+    time_sel = CSSSelector('.time')
+    author_sel = CSSSelector('.user-name')
+
+    for item in item_sel(tree):
+        yield {'cid': item.get('data-cid'),
+               'text': text_sel(item)[0].text_content(),
+               'time': time_sel(item)[0].text_content().strip(),
+               'author': author_sel(item)[0].text_content()}
+
+
+def extract_reply_cids(html):
+    tree = lxml.html.fromstring(html)
+    sel = CSSSelector('.comment-replies-header > .load-comments')
+    return [i.get('data-cid') for i in sel(tree)]
+
+
+def ajax_request(session, url, params, data, retries=10, sleep=20):
+    for _ in range(retries):
+        response = session.post(url, params=params, data=data)
+        if response.status_code == 200:
+            response_dict = json.loads(response.text)
+            return response_dict.get('page_token', None), response_dict['html_content']
+        else:
+            time.sleep(sleep)
+
+
+def download_comments(youtube_id, sleep=1):
+    session = requests.Session()
+    session.headers['User-Agent'] = USER_AGENT
+    # Get Youtube page with initial comments
+    response = session.get(YOUTUBE_COMMENTS_URL.format(youtube_id=youtube_id))
+    html = response.text
+    reply_cids = extract_reply_cids(html)
+
+    ret_cids = []
+    for comment in extract_comments(html):
+        ret_cids.append(comment['cid'])
+        yield comment
+    page_token = find_value(html, 'data-token')
+    session_token = find_value(html, 'XSRF_TOKEN', 4)
+    first_iteration = True
+
+    # Get remaining comments (the same as pressing the 'Show more' button)
+    while page_token:
+        data = {'video_id': youtube_id,
+                'session_token': session_token}
+
+        params = {'action_load_comments': 1,
+                  'order_by_time': True,
+                  'filter': youtube_id}
+
+        if first_iteration:
+            params['order_menu'] = True
+        else:
+            data['page_token'] = page_token
+
+        response = ajax_request(session, YOUTUBE_COMMENTS_AJAX_URL, params, data)
+        if not response:
+            break
+
+        page_token, html = response
+
+        reply_cids += extract_reply_cids(html)
+        for comment in extract_comments(html):
+            if comment['cid'] not in ret_cids:
+                ret_cids.append(comment['cid'])
+                yield comment
+
+        first_iteration = False
+        time.sleep(sleep)
+
+    # Get replies (the same as pressing the 'View all X replies' link)
+    for cid in reply_cids:
+        data = {'comment_id': cid,
+                'video_id': youtube_id,
+                'can_reply': 1,
+                'session_token': session_token}
+        params = {'action_load_replies': 1,
+                  'order_by_time': True,
+                  'filter': youtube_id,
+                  'tab': 'inbox'}
+        response = ajax_request(session, YOUTUBE_COMMENTS_AJAX_URL, params, data)
+        if not response:
+            break
+
+        _, html = response
+
+        for comment in extract_comments(html):
+            if comment['cid'] not in ret_cids:
+                ret_cids.append(comment['cid'])
+                yield comment
+        time.sleep(sleep)
+
+## Parse the video id out of the link the user entered
+def video_id(value):
+    query = urlparse(value)
+    if query.hostname == 'youtu.be':
+        return query.path[1:]
+    if query.hostname in ('www.youtube.com', 'youtube.com'):
+        if query.path == '/watch':
+            p = parse_qs(query.query)
+            return p['v'][0]
+        if query.path[:7] == '/embed/':
+            return query.path.split('/')[2]
+        if query.path[:3] == '/v/':
+            return query.path.split('/')[2]
+    # fail?
+    return None
+
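+# Illustrative examples (not part of the original source; VIDEOID is a placeholder):
+#   video_id('https://youtu.be/VIDEOID')                -> 'VIDEOID'
+#   video_id('https://www.youtube.com/watch?v=VIDEOID') -> 'VIDEOID'
+#   video_id('not a url')                               -> None
+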
+
+def main():
+
+    # The original argparse-based CLI, kept for reference:
+    #parser = argparse.ArgumentParser(add_help=False, description=('Download Youtube comments without using the Youtube API'))
+    #parser.add_argument('--help', '-h', action='help', default=argparse.SUPPRESS, help='Show this help message and exit')
+    #parser.add_argument('--youtubeid', '-y', help='ID of Youtube video for which to download the comments')
+    #parser.add_argument('--output', '-o', help='Output filename (output format is line delimited JSON)')
+    #parser.add_argument('--limit', '-l', type=int, help='Limit the number of comments')
+
+    # Take the video via input() instead of command-line arguments.
+    raw_link = input('Enter the YouTube video URL or ID: ')
+    # Accept a full link and cut out just the id; fall back to the raw
+    # input so that a bare video id also works.
+    youtube_id = video_id(raw_link) or raw_link
+    try:
+        result_List = []
+
+        if not youtube_id:
+            raise ValueError('Please enter a valid YouTube URL or ID')
+
+        print('Downloading Youtube comments for video:', youtube_id)
+        number = input(' Save to a file - 0, do not save - 1 : ')
+        if number == '0':
+            output = input('Enter the output filename: ')
+            # An empty limit falls back to a default of 100 comments.
+            limit_input = input('Enter the comment limit: ')
+            limit = int(limit_input) if limit_input else 100
+
+            count = 0  # was missing before the loop, causing an UnboundLocalError
+            with io.open(output, 'w', encoding='utf8') as fp:
+                for comment in download_comments(youtube_id):
+                    # json.dumps returns str on Python 3, so it can be written directly
+                    print(json.dumps(comment, ensure_ascii=False), file=fp)
+                    count += 1
+                    sys.stdout.flush()
+                    if limit and count >= limit:
+                        print('Downloaded {} comment(s)\r'.format(count))
+                        print('\nDone!')
+                        break
+
+        else:
+            count = 0
+            limit = 40
+            for comment in download_comments(youtube_id):
+                result_List.append({'cid': comment['cid'],
+                                    'text': comment['text'],
+                                    'time': comment['time'],
+                                    'author': comment['author']})
+                count += 1
+                if limit == count:
+                    print(' Comment thread created')
+                    print('\n\n\n\n\n\n\n')
+                    break
+            return result_List
+            #goto_Menu(result_List)
+
+    except Exception as e:
+        print('Error:', str(e))
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
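`download_comments()` is a generator that yields one comment dict at a time, so it can also be consumed directly from other code (main.py below drives it indirectly through `downloader.main()`). A minimal sketch, assuming `downloader.py` is importable and using a placeholder video id:

```python
# Hypothetical direct use of the generator; VIDEO_ID is a placeholder,
# not an id from this repository.
from downloader import download_comments

VIDEO_ID = 'dQw4w9WgXcQ'
for n, comment in enumerate(download_comments(VIDEO_ID), start=1):
    print(comment['time'], comment['author'], comment['text'][:60])
    if n >= 10:  # look at the first 10 comments only
        break
```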
Youtube/main.py
0 → 100644
+import downloader
+from time import sleep
+from konlpy.tag import Twitter
+from collections import Counter
+from matplotlib import rc
+import matplotlib.pyplot as plt
+from matplotlib import font_manager as fm
+import pytagcloud
+import operator
+
+
+def get_tags(Comment_List):
+    okja = []
+    for temp in Comment_List:
+        okja.append(temp['text'])
+    twitter = Twitter()
+    sentence_tag = []
+    for sentence in okja:
+        morph = twitter.pos(sentence)
+        sentence_tag.append(morph)
+        print(morph)
+        print('-' * 30)
+    print(sentence_tag)
+    print(len(sentence_tag))
+    print('\n' * 3)
+
+    # Keep only nouns of two or more characters
+    noun_adj_list = []
+    for sentence1 in sentence_tag:
+        for word, tag in sentence1:
+            if len(word) >= 2 and tag == 'Noun':
+                noun_adj_list.append(word)
+    counts = Counter(noun_adj_list)
+    print(' The 10 most frequent keywords.\n')
+    print(counts.most_common(10))
+    tags2 = counts.most_common(10)
+    taglist = pytagcloud.make_tags(tags2, maxsize=80)
+    pytagcloud.create_tag_image(taglist, 'wordcloud.jpg', size=(900, 600), fontname='Nanum Gothic', rectangular=False)
+
+
+def print_result(Comment_List):
+    for var in Comment_List:
+        print(var)
+    print('******* Search complete *******')
+    print('\n\n\n')
+
+
+def search_by_author(Comment_List, author_name):
+    result_List = []
+    for var in Comment_List:
+        if var['author'] == author_name:
+            result_List.append(var)
+    return result_List
+
+
+def search_by_keyword(Comment_List, keyword):
+    result_List = []
+    for var in Comment_List:
+        print(var['text'])
+        if keyword in var['text']:
+            result_List.append(var)
+    return result_List
+
+
+def search_by_time(Comment_List, Time_input):
+    result_List = []
+    for var in Comment_List:
+        if var['time'] == Time_input:
+            result_List.append(var)
+    return result_List
+
+
+def make_time_chart(Comment_List):
+    result_List = []
+    save_List = []
+    day_dict = {}
+    month_dict = {}
+    year_dict = {}
+    hour_dict = {}
+    minute_dict = {}
+    week_dict = {}
+    for var in Comment_List:
+        result_List.append(var['time'])
+    for i in range(len(result_List)):
+        print(result_List[i] + ' ')
+    print('\n\n\n\n')
+    temp_List = list(set(result_List))
+    for i in range(len(temp_List)):
+        print(temp_List[i] + ' ')
+    print('\n\n\n\n')
+    # Count how often each distinct timestamp string occurs
+    for i in range(len(temp_List)):
+        result_dict = {}
+        result_dict[temp_List[i]] = result_List.count(temp_List[i])
+        save_List.append(result_dict)
+
+    # Bucket the Korean relative timestamps by unit: '개월' months, '일' days,
+    # '년' years, '시간' hours, '주' weeks, '분' minutes
+    for entry in save_List:
+        for num, k in entry.items():  # each entry holds one {timestamp: count} pair
+            if num.find('개월') >= 0:
+                month_dict[num] = k
+            elif num.find('일') >= 0:
+                day_dict[num] = k
+            elif num.find('년') >= 0:
+                year_dict[num] = k
+            elif num.find('시간') >= 0:
+                hour_dict[num] = k
+            elif num.find('주') >= 0:
+                week_dict[num] = k
+            elif num.find('분') >= 0:
+                minute_dict[num] = k
+    year_data = sorted(year_dict.items(), key=operator.itemgetter(0))
+    month_data = sorted(month_dict.items(), key=operator.itemgetter(0))
+    week_data = sorted(week_dict.items(), key=operator.itemgetter(0))
+    day_data = sorted(day_dict.items(), key=operator.itemgetter(0))
+    hour_data = sorted(hour_dict.items(), key=operator.itemgetter(0))
+    minute_data = sorted(minute_dict.items(), key=operator.itemgetter(0))
+    #print(month_data)
+    #print(week_data)
+    #print(day_data)
+    make_chart(year_data, month_data, week_data, day_data, hour_data, minute_data)
+
+def make_chart(year_data, month_data, week_data, day_data, hour_data, minute_data):
+    temp_list = [year_data, month_data, week_data, day_data, hour_data, minute_data]
+    # Map each Korean time unit to its English label for the chart axis
+    units = [('년', 'years'), ('개월', 'months'), ('주', 'weeks'),
+             ('일', 'days'), ('시간', 'hours'), ('분', 'minutes')]
+    x_list = []
+    y_list = []
+    print(temp_list)
+    for var1 in temp_list:
+        for var2 in var1:
+            for korean, english in units:
+                pos = var2[0].find(korean)
+                if pos >= 0:
+                    # Keep the whole number (e.g. '10개월' -> '10months'),
+                    # not just its first digit as before
+                    x_list.append(var2[0][:pos] + english)
+                    y_list.append(int(var2[1]))
+                    break
+    print(x_list)
+    plt.bar(x_list, y_list, width=0.5, color="blue")
+    # plt.show()  # display the chart interactively instead of saving it
+    plt.savefig('chart.png', dpi=300)
+
+def call_main():
+    print(' Creating the comment thread \n')
+
+    sleep(1)
+    print(' **************************************************************')
+    print(' **************************************************************')
+    print(' **************************************************************')
+    print(' *********** Done. Please enter the details below. ************')
+    print(' **************************************************************')
+    print(' **************************************************************')
+    print(' **************************************************************')
+    a = downloader.main()
+
+    return a
+
+
+if __name__ == "__main__":
+    CommentList = call_main()
+    make_time_chart(CommentList)
+    ##author_results = search_by_author(CommentList, '광고제거기')
+    ##text_results = search_by_keyword(CommentList, '지현')
+    ##get_tags(CommentList)
+    ##print_result(author_results)
+    ##print_result(text_results)
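A note on the konlpy dependency used above: in newer konlpy releases the `Twitter` tagger was renamed `Okt`, and the old name emits a deprecation warning. A minimal keyword-count sketch in the spirit of `get_tags`, assuming konlpy and its Java backend (JPype, hence the bundled wheel) are installed; `top_keywords` is a hypothetical helper, not part of this repository:

```python
from collections import Counter
from konlpy.tag import Okt  # Twitter was renamed Okt in newer konlpy releases

def top_keywords(texts, n=10):
    # Count nouns of two or more characters, as get_tags does.
    okt = Okt()
    nouns = [word for text in texts
             for word, tag in okt.pos(text)
             if tag == 'Noun' and len(word) >= 2]
    return Counter(nouns).most_common(n)

# Example: top_keywords([c['text'] for c in downloader.main()])
```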
Youtube/requirements.txt
0 → 100644
readme.md
deleted
100644 → 0
-Features to develop
-- Find comments by your own nickname
-- Find comments by someone else's nickname
-- Find comments by keyword
-- Find the comments with the most likes
-
-2019.11.01 ~ 2019.11.08
-Phase 1
-- Decide what to analyze and how to implement it
-
-2019.11.09 ~ 2019.11.16
-Phase 2
-- Actual implementation
-
-2019.11.17 ~ 2019.11.23
-Phase 3
-- Merge the analyses and exchange feedback on each other's work
-
-2019.11.24 ~ 2019.12.01
-Phase 4
-- Implement a web server with node js
-
-2019.12.02 ~ 2019.12.05
-Final review and presentation preparation
\ No newline at end of file
youtube.md
0 → 100644
+Youtube revisions, round 3
+-----------------------------------------------------
+Items to implement on top of round 1
+
+1. A function that takes the command-line parameters via input
+2. A function that reads the list back in from a csv file
+3. Functions that process the fetched data
+   * a function that finds the most frequent keywords
+   * a function that searches by author
+   * a function that finds the comments I wrote
+   * a function that finds the most active commenter
+-----------------------------------------------------
+Round 2 updates
+
+1. Changed the command-line parameters to be taken via input
+2. The script now asks whether to save to a csv file; if not, the comments are stored in a list as dictionaries
+3. Checked, as a test, that the values stored in the list print correctly
+-----------------------------------------------------
+Further work
+
+1. Split into modules (a list-returning module and the main part); if they are not split, extra functions have to be implemented instead
+2. Additional features are needed for how the data set should be split up and served
+
+-----------------------------------------------------
+
+1. Fixed the bugs from round 2
+2. Implemented some functions for processing the fetched comments:
+   (1) a function to search by keyword
+   (2) a function to search by author name
+
+-----------------------------------------------------
+Further work
+
+1. Extract nouns with konlpy (http://konlpy.org/ko/latest/) and analyze keywords
+2. Extract the timestamps and organize the comments by time period
+-----------------------------------------------------
+Round 4 updates
+
+1. Analyzed keywords with konlpy and printed a list of the most frequent keywords
+2. Built a wordcloud from feature 1
+3. Implemented search by time period
+4. Implemented a list sorted by time period
+-----------------------------------------------------
+Further work
+
+1. Chart the time-sorted list with matplotlib
+2. Organize the code so each feature is accessible on its own
+-----------------------------------------------------
+Round 5 updates
+
+1. Charted the time-sorted list with matplotlib
\ No newline at end of file