김건

Merge branch 'Youtube' into 'master'

Youtube Crawl

Youtube Crawl results and feature implementation

See merge request !1
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+The MIT License (MIT)
+
+Copyright (c) 2015 Egbert Bouman
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+# youtube-comment-downloader
+Simple script for downloading Youtube comments without using the Youtube API. The output is line-delimited JSON.
+
+### Dependencies
+* Python 3 (the script uses `urllib.parse`)
+* requests
+* lxml
+* cssselect
+
+The Python packages can be installed with
+
+    pip install requests
+    pip install lxml
+    pip install cssselect
+
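+or, equivalently, in a single command:
+
+    pip install requests lxml cssselect
+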
+### Usage
+The upstream script took `--youtubeid` and `--output` as command-line
+arguments. In this fork the argparse handling is commented out, and
+`downloader.py` instead prompts interactively for the video URL or ID,
+whether to save the results to a file, and (when saving) the output
+filename and a comment limit (an empty limit defaults to 100).
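+
+A session might look like the following (the video URL and output filename
+here are placeholders):
+```
+$ python downloader.py
+Enter Youtube URL or ID: https://www.youtube.com/watch?v=XXXXXXXXXXX
+Downloading Youtube comments for video: XXXXXXXXXXX
+Save to file - 0, do not save - 1: 0
+Enter output filename: comments.json
+Enter comment limit (default 100):
+Downloaded 100 comment(s)
+
+Done!
+```
+Each line of the output file is one JSON object with `cid`, `text`, `time`,
+and `author` fields.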
+#!/usr/bin/env python
+
+import sys
+import time
+import json
+import requests
+import lxml.html
+import io
+from urllib.parse import urlparse, parse_qs
+from lxml.cssselect import CSSSelector
+
+YOUTUBE_COMMENTS_URL = 'https://www.youtube.com/all_comments?v={youtube_id}'
+YOUTUBE_COMMENTS_AJAX_URL = 'https://www.youtube.com/comment_ajax'
+
+USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
+
+
+def find_value(html, key, num_chars=2):
+    # Return the quoted value that follows `key` in the raw page source.
+    pos_begin = html.find(key) + len(key) + num_chars
+    pos_end = html.find('"', pos_begin)
+    return html[pos_begin: pos_end]
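+# Example (hypothetical markup): if the page source contains
+# data-token="abc123", then find_value(html, 'data-token') skips the two
+# separator characters '="' and returns 'abc123'. download_comments()
+# below uses this to pull the paging token and XSRF token from the page.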
+
+
+def extract_comments(html):
+    # Parse one batch of comment HTML into dicts of id/text/time/author.
+    tree = lxml.html.fromstring(html)
+    item_sel = CSSSelector('.comment-item')
+    text_sel = CSSSelector('.comment-text-content')
+    time_sel = CSSSelector('.time')
+    author_sel = CSSSelector('.user-name')
+
+    for item in item_sel(tree):
+        yield {'cid': item.get('data-cid'),
+               'text': text_sel(item)[0].text_content(),
+               'time': time_sel(item)[0].text_content().strip(),
+               'author': author_sel(item)[0].text_content()}
+
+
+def extract_reply_cids(html):
+    # Ids of comments that have a 'load replies' link under them.
+    tree = lxml.html.fromstring(html)
+    sel = CSSSelector('.comment-replies-header > .load-comments')
+    return [i.get('data-cid') for i in sel(tree)]
+
+
+def ajax_request(session, url, params, data, retries=10, sleep=20):
+    for _ in range(retries):
+        response = session.post(url, params=params, data=data)
+        if response.status_code == 200:
+            response_dict = json.loads(response.text)
+            return response_dict.get('page_token', None), response_dict['html_content']
+        else:
+            time.sleep(sleep)
+    # Implicitly returns None when every retry fails; callers stop paging.
+
+
+def download_comments(youtube_id, sleep=1):
+    session = requests.Session()
+    session.headers['User-Agent'] = USER_AGENT
+
+    # Get Youtube page with initial comments
+    response = session.get(YOUTUBE_COMMENTS_URL.format(youtube_id=youtube_id))
+    html = response.text
+    reply_cids = extract_reply_cids(html)
+
+    ret_cids = []
+    for comment in extract_comments(html):
+        ret_cids.append(comment['cid'])
+        yield comment
+
+    page_token = find_value(html, 'data-token')
+    session_token = find_value(html, 'XSRF_TOKEN', 4)
+    first_iteration = True
+
+    # Get remaining comments (the same as pressing the 'Show more' button)
+    while page_token:
+        data = {'video_id': youtube_id,
+                'session_token': session_token}
+
+        params = {'action_load_comments': 1,
+                  'order_by_time': True,
+                  'filter': youtube_id}
+
+        if first_iteration:
+            params['order_menu'] = True
+        else:
+            data['page_token'] = page_token
+
+        response = ajax_request(session, YOUTUBE_COMMENTS_AJAX_URL, params, data)
+        if not response:
+            break
+
+        page_token, html = response
+
+        reply_cids += extract_reply_cids(html)
+        for comment in extract_comments(html):
+            if comment['cid'] not in ret_cids:
+                ret_cids.append(comment['cid'])
+                yield comment
+
+        first_iteration = False
+        time.sleep(sleep)
+
+    # Get replies (the same as pressing the 'View all X replies' link)
+    for cid in reply_cids:
+        data = {'comment_id': cid,
+                'video_id': youtube_id,
+                'can_reply': 1,
+                'session_token': session_token}
+        params = {'action_load_replies': 1,
+                  'order_by_time': True,
+                  'filter': youtube_id,
+                  'tab': 'inbox'}
+        response = ajax_request(session, YOUTUBE_COMMENTS_AJAX_URL, params, data)
+        if not response:
+            break
+
+        _, html = response
+
+        for comment in extract_comments(html):
+            if comment['cid'] not in ret_cids:
+                ret_cids.append(comment['cid'])
+                yield comment
+        time.sleep(sleep)
+
+
+def video_id(value):
+    # Parse the video id out of the user's input (a URL or a bare id).
+    query = urlparse(value)
+    if query.hostname == 'youtu.be':
+        return query.path[1:]
+    if query.hostname in ('www.youtube.com', 'youtube.com'):
+        if query.path == '/watch':
+            p = parse_qs(query.query)
+            return p['v'][0]
+        if query.path[:7] == '/embed/':
+            return query.path.split('/')[2]
+        if query.path[:3] == '/v/':
+            return query.path.split('/')[2]
+    # A bare id has no hostname: pass it through unchanged.
+    if not query.hostname:
+        return value
+    # Unrecognised URL format.
+    return None
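+# Examples (hypothetical video id 'dQw4w9WgXcQ'):
+#   video_id('https://youtu.be/dQw4w9WgXcQ')                 -> 'dQw4w9WgXcQ'
+#   video_id('https://www.youtube.com/watch?v=dQw4w9WgXcQ')  -> 'dQw4w9WgXcQ'
+#   video_id('dQw4w9WgXcQ')                                  -> 'dQw4w9WgXcQ'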
+
+
+def main():
+    # The upstream version took --youtubeid, --output and --limit on the
+    # command line; this fork asks for the same values interactively.
+    youtube_id = video_id(input('Enter Youtube URL or ID: '))
+    try:
+        result_List = []
+        if not youtube_id:
+            raise ValueError('Please enter a valid Youtube URL or ID')
+
+        print('Downloading Youtube comments for video:', youtube_id)
+        Number = input('Save to file - 0, do not save - 1: ')
+        if Number == '0':
+            output = input('Enter output filename: ')
+            limit_input = input('Enter comment limit (default 100): ')
+            # An empty answer falls back to the default limit of 100.
+            limit = int(limit_input) if limit_input else 100
+            count = 0
+            with io.open(output, 'w', encoding='utf8') as fp:
+                for comment in download_comments(youtube_id):
+                    print(json.dumps(comment, ensure_ascii=False), file=fp)
+                    result_List.append(comment)
+                    count += 1
+                    sys.stdout.flush()
+                    if limit and count >= limit:
+                        print('Downloaded {} comment(s)\r'.format(count))
+                        print('\nDone!')
+                        break
+        else:
+            count = 0
+            limit = 40  # hard-coded number of comments when not saving
+            for comment in download_comments(youtube_id):
+                result_List.append({'cid': comment['cid'],
+                                    'text': comment['text'],
+                                    'time': comment['time'],
+                                    'author': comment['author']})
+                count += 1
+                if limit == count:
+                    print(' Comment thread created')
+                    print('\n\n\n\n\n\n\n')
+                    break
+        # Both branches hand the collected comments back to the caller
+        # (see call_main() in the analysis script).
+        return result_List
+
+    except Exception as e:
+        print('Error:', str(e))
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
+import re
+import operator
+import downloader
+from time import sleep
+from collections import Counter
+from konlpy.tag import Twitter
+import matplotlib.pyplot as plt
+import pytagcloud
+
+
+def get_tags(Comment_List):
+    # Collect the comment text bodies.
+    okja = []
+    for temp in Comment_List:
+        okja.append(temp['text'])
+
+    # POS-tag every comment with konlpy's Twitter tagger.
+    twitter = Twitter()
+    sentence_tag = []
+    for sentence in okja:
+        morph = twitter.pos(sentence)
+        sentence_tag.append(morph)
+        print(morph)
+    print('-' * 30)
+    print(sentence_tag)
+    print(len(sentence_tag))
+    print('\n' * 3)
+
+    # Keep nouns of two or more characters and count them.
+    noun_adj_list = []
+    for sentence1 in sentence_tag:
+        for word, tag in sentence1:
+            if len(word) >= 2 and tag == 'Noun':
+                noun_adj_list.append(word)
+    counts = Counter(noun_adj_list)
+    print('The 10 most frequent keywords.\n')
+    print(counts.most_common(10))
+    tags2 = counts.most_common(10)
+    taglist = pytagcloud.make_tags(tags2, maxsize=80)
+    pytagcloud.create_tag_image(taglist, 'wordcloud.jpg', size=(900, 600),
+                                fontname='Nanum Gothic', rectangular=False)
+
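+# Note: pytagcloud bundles only a couple of fonts, so fontname='Nanum Gothic'
+# above assumes the font was registered with pytagcloud beforehand: roughly,
+# copying NanumGothic.ttf into pytagcloud's fonts directory and adding a
+# matching entry to the fonts.json file there (paths illustrative).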
+
+def print_result(Comment_List):
+    for var in Comment_List:
+        print(var)
+    print('******* Search complete *******')
+    print('\n\n\n')
+
+
+def search_by_author(Comment_List, author_name):
+    result_List = []
+    for var in Comment_List:
+        if var['author'] == author_name:
+            result_List.append(var)
+    return result_List
+
+
+def search_by_keyword(Comment_List, keyword):
+    result_List = []
+    for var in Comment_List:
+        print(var['text'])
+        if keyword in var['text']:
+            result_List.append(var)
+    return result_List
+
+
+def search_by_time(Comment_List, Time_input):
+    result_List = []
+    for var in Comment_List:
+        if var['time'] == Time_input:
+            result_List.append(var)
+    return result_List
+
+
+def make_time_chart(Comment_List):
+    # Youtube serves relative timestamps in Korean, e.g. '3개월 전'
+    # (3 months ago). Count each distinct label, then bucket the counts
+    # by unit: year(년)/month(개월)/week(주)/day(일)/hour(시간)/minute(분).
+    result_List = []
+    save_List = []
+    day_dict = {}
+    month_dict = {}
+    year_dict = {}
+    hour_dict = {}
+    minute_dict = {}
+    week_dict = {}
+    for var in Comment_List:
+        result_List.append(var['time'])
+    for label in result_List:
+        print(label + ' ')
+    print('\n\n\n\n')
+    temp_List = list(set(result_List))
+    for label in temp_List:
+        print(label + ' ')
+    print('\n\n\n\n')
+    for i in range(len(temp_List)):
+        result_dict = {}
+        result_dict[temp_List[i]] = result_List.count(temp_List[i])
+        save_List.append(result_dict)
+
+    for i in range(len(save_List)):
+        # Each entry is a single-item dict: {time_label: count}.
+        for num, count in save_List[i].items():
+            if num.find('개월') >= 0:
+                month_dict[num] = count
+            elif num.find('일') >= 0:
+                day_dict[num] = count
+            elif num.find('년') >= 0:
+                year_dict[num] = count
+            elif num.find('시간') >= 0:
+                hour_dict[num] = count
+            elif num.find('주') >= 0:
+                week_dict[num] = count
+            elif num.find('분') >= 0:
+                minute_dict[num] = count
+    year_data = sorted(year_dict.items(), key=operator.itemgetter(0))
+    month_data = sorted(month_dict.items(), key=operator.itemgetter(0))
+    week_data = sorted(week_dict.items(), key=operator.itemgetter(0))
+    day_data = sorted(day_dict.items(), key=operator.itemgetter(0))
+    hour_data = sorted(hour_dict.items(), key=operator.itemgetter(0))
+    minute_data = sorted(minute_dict.items(), key=operator.itemgetter(0))
+    make_chart(year_data, month_data, week_data, day_data, hour_data, minute_data)
+
+
+def make_chart(year_data, month_data, week_data, day_data, hour_data, minute_data):
+    # Turn each Korean label ('3개월 전') into an English x-axis label
+    # ('3 months') and plot the per-label comment counts as a bar chart.
+    units = [('년', 'years'), ('개월', 'months'), ('주', 'weeks'),
+             ('일', 'days'), ('시간', 'hours'), ('분', 'minutes')]
+    temp_list = [year_data, month_data, week_data, day_data, hour_data, minute_data]
+    x_list = []
+    y_list = []
+    print(temp_list)
+    for var1 in temp_list:
+        for label, count in var1:
+            # Take the whole leading number, not just the first character,
+            # so multi-digit labels like '10개월' map correctly.
+            match = re.match(r'\d+', label)
+            number = match.group() if match else label
+            for korean, english in units:
+                if label.find(korean) >= 0:
+                    x_list.append(number + ' ' + english)
+                    break
+            else:
+                x_list.append(number + ' minutes')
+            y_list.append(int(count))
+    print(x_list)
+    plt.bar(x_list, y_list, width=0.5, color="blue")
+    # plt.show()  # uncomment to display the chart interactively
+    plt.savefig('chart.png', dpi=300)
+
+
+def call_main():
+    print(' Creating comment thread \n')
+    sleep(1)
+    print(' **************************************************************')
+    print(' ********* Creation complete. Enter the information. **********')
+    print(' **************************************************************')
+    a = downloader.main()
+    return a
+
+
+if __name__ == "__main__":
+    CommentList = call_main()
+    make_time_chart(CommentList)
+    ##author_results = search_by_author(CommentList, '광고제거기')
+    ##text_results = search_by_keyword(CommentList, '지현')
+    ##get_tags(CommentList)
+    ##print_result(author_results)
+    ##print_result(text_results)
+requests
+beautifulsoup4
+lxml
+cssselect
+### crawling
+pygame
+pytagcloud
+### wordcloud
+JPype1
+konlpy
+### keyword analysis
+matplotlib==3.2.0rc1
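+### note: konlpy runs its Korean NLP on the JVM via JPype1, so a Java
+### runtime must be installed separately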
-Features to implement
-- Find comments by your own nickname
-- Find comments by other users' nicknames
-- Find comments by keyword
-- Find comments ordered by most likes
-
-2019.11.01 ~ 2019.11.08
-Phase 1
-- Decide what to analyze and how to implement it
-
-2019.11.09 ~ 2019.11.16
-Phase 2
-- The actual implementation
-
-2019.11.17 ~ 2019.11.23
-Phase 3
-- Merge everyone's analysis work and exchange feedback on it
-
-2019.11.24 ~ 2019.12.01
-Phase 4
-- Build a web server with node js
-
-2019.12.02 ~ 2019.12.05
-Final review and presentation preparation
+Youtube: round 3 revisions
+-----------------------------------------------------
+Items to implement on top of round 1
+
+1. A function that takes the command-line parameters via input()
+2. A function that reads the list back from a csv file
+3. Functions that process the downloaded data
+   * find the most frequent keywords
+   * search by author
+   * check the comments I wrote
+   * find the user who posted the most comments
+-----------------------------------------------------
+Round 2 updates
+
+1. Changed the command-line parameters to be read via input()
+2. Ask whether to save to a csv file; when not saving, the comments are
+   stored in a list as dictionaries
+3. Checked, as a test, that the values stored in the list print correctly
+-----------------------------------------------------
+Further work
+
+1. Split into modules (a list-returning module and the main part); if they
+   are not split, additional functions have to be implemented
+2. Need to design how the data set is split up and served as features
+
+-----------------------------------------------------
+
+1. Fixed the bugs found in the round 2 work
+2. Implemented some functions for processing the downloaded comments
+   (1) search by keyword
+   (2) search by author name
+
+-----------------------------------------------------
+Further work
+
+1. Extract nouns with konlpy (http://konlpy.org/ko/latest/) and analyze keywords
+2. Extract the timestamps and organize comments by time period
+-----------------------------------------------------
+Round 4 development
+
+1. Analyzed keywords with konlpy and printed the most frequent ones
+2. Built a wordcloud on top of feature 1
+3. Implemented search by time period
+4. Implemented a list sorted by time period
+-----------------------------------------------------
+Further work
+
+1. Chart the time-sorted list with matplotlib
+2. Organize the code so each feature can be reached on its own
+-----------------------------------------------------
+Round 5 development
+
+1. Charted the time-sorted list with matplotlib