Eungi Hong (홍은기)
PYTHON LEARNING FOR NATURAL LANGUAGE PROCESSING
CONTENTS
1. Learning Sequence
2. Lists and Functions
3. Loops
4. Processing Raw Text with NLTK
LEARNING SEQUENCE (WWW.CODECADEMY.COM)
• 1. Python Syntax
• 2. Strings and Console Output
• 3. Conditionals and Control Flow
• 4. Functions
• 5. Lists & Dictionaries
• 6. Student Becomes the Teacher (test)
• 7. Lists and Functions
• 8. Loops
• 9. Exam Statistics (test)
• 10. Advanced Topics in Python
• 11. Introduction to Classes
• 12. File Input and Output
LISTS AND FUNCTIONS
LOOPS
PROCESSING RAW TEXT WITH NLTK
(http://www.nltk.org/book/)
After extracting the text from an HTML document on the web, I used NLTK to extract keywords from that text.
EXAMPLES
"Multicultural Korean Society" Turns Its Back on Migrant Children (이주 아동 외면하는 '다문화 한국사회')
(http://www.huffingtonpost.kr/kyongwhan-ahn/story_b_6927970.html?utm_hp_ref=korea)
[('', 65), ('(', 9), (')', 9), ('한다', 6), ("'", 6), ('있다', 5),
('아동', 5), ('큰', 5), ('모든', 5), ('일', 5), ('국제', 4),
('대한민국', 4), ('나라', 4), ('땅', 4), ('국제사회', 4),
('인권', 4), ('의원', 3), ('세계', 3), ('여의', 3), ('수', 3),
('안', 3), ('강한', 3), ('불문', 2), ('이주', 2), ('법무부', 2)]
1. HTML TO RAW TEXT
# -*- coding: utf-8 -*-
from urllib import request
import nltk, re, pprint
from nltk import FreqDist, word_tokenize
from bs4 import BeautifulSoup

url = "http://www.huffingtonpost.kr/kyongwhan-ahn/story_b_6927970.html?utm_hp_ref=korea"

# Download the page and decode the bytes as UTF-8
html = request.urlopen(url).read().decode('utf8')
# Drop the HTML markup and keep only the text; naming the parser explicitly avoids a bs4 warning
raw = BeautifulSoup(html, 'html.parser').get_text()
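To make the HTML-to-text step concrete without a network connection, here is a minimal sketch that uses only the standard library's `html.parser` in place of BeautifulSoup. It is a stand-in, not the method used on the slide: the class `TextExtractor`, the helper `html_to_text`, and the sample HTML are all hypothetical, and the output will not match `get_text()` exactly on real pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample page, standing in for the downloaded article
sample_html = (
    "<html><head><title>Sample</title><style>p {color: red}</style></head>"
    "<body><h1>Heading</h1><p>First <b>bold</b> paragraph.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(sample_html)
print(text)
```

On a real page the extracted string still contains menus, captions, and other page furniture, which is why the next step cuts out a character range before tokenizing.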
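2. RAW TEXT TO LIST
# Keep only the character range of interest (these offsets are specific to this page)
raw = raw[30123:32364]
print(type(raw))     # -> <class 'str'>

# Split the string into a list of tokens
tokens = word_tokenize(raw)
print(type(tokens))  # -> <class 'list'>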
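The string-to-list conversion can be tried without NLTK's tokenizer models. The sketch below uses a simple regular expression as a rough stand-in for `word_tokenize`; the function name `simple_tokenize` is hypothetical, and the real tokenizer applies more rules than this. It does show why punctuation marks end up as separate entries in the token list.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits (Hangul included);
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


raw = "철수는 동생에게 자전거를 빌려주었다."
tokens = simple_tokenize(raw)

print(type(raw))     # -> <class 'str'>
print(type(tokens))  # -> <class 'list'>
print(tokens)        # -> ['철수는', '동생에게', '자전거를', '빌려주었다', '.']
```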
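3. LIST TO VOCABULARIES
# Example sentence: "철수는 동생에게 자전거를 빌려주었다" (Cheolsu lent his younger sibling a bicycle)
tokens = ['철수는', '동생에게', '자전거를', '빌려주었다']
# Trial appears to be the author's own module (its source is not shown here).
# In this example NounExtractor removes the particles 는, 에게, 를 and leaves the verb unchanged.
words = Trial.NounExtractor(tokens)
# -> ['철수', '동생', '자전거', '빌려주었다']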
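The source of `Trial.NounExtractor` is not included in the slides, so the following is only a naive sketch of one way to get the same result on the example sentence: strip a known particle from the end of each token. The names `PARTICLES`, `strip_particle`, and `extract_nouns` are hypothetical, and the particle list is deliberately tiny.

```python
# Hypothetical, very incomplete list of Korean particles (assumption for illustration)
PARTICLES = ("에게", "는", "은", "를", "을", "이", "가")


def strip_particle(token):
    """Remove one trailing particle from a token, if any is found."""
    # Try longer particles first so that '에게' is matched before shorter endings
    for particle in sorted(PARTICLES, key=len, reverse=True):
        if token.endswith(particle) and len(token) > len(particle):
            return token[:-len(particle)]
    return token


def extract_nouns(tokens):
    """Apply strip_particle to every token in the list."""
    return [strip_particle(t) for t in tokens]


tokens = ['철수는', '동생에게', '자전거를', '빌려주었다']
words = extract_nouns(tokens)
print(words)  # -> ['철수', '동생', '자전거', '빌려주었다']
```

Suffix stripping like this breaks on nouns that happen to end in a listed syllable (for example 아이 would lose its final 이) and, like the slide's example, it leaves verbs in the list. A morphological analyzer would be the more reliable choice for real text.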
4. FREQUENCY DISTRIBUTION
# Count how often each word occurs and print the 25 most frequent
fdist = FreqDist(words)
print(fdist.most_common(25))
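In both outputs shown on the EXAMPLES slides, the most frequent entry is the empty string, and punctuation marks such as `(` and `!` also rank highly. A small filtering step before counting keeps them out of the keyword list. The sketch below uses `collections.Counter`, which NLTK 3's `FreqDist` is built on, so `most_common` behaves the same way. The helper name `keyword_counts` and the sample word list are made up for illustration and are not taken from the articles.

```python
import string
from collections import Counter


def keyword_counts(words):
    """Count words, ignoring empty strings and ASCII-punctuation-only tokens."""
    kept = [
        w for w in words
        if w and not all(ch in string.punctuation for ch in w)
    ]
    return Counter(kept)


# Hypothetical token list, including the kinds of noise seen in the real output
sample_words = ["", "아동", "(", "인권", "아동", "", ")", "인권", "아동", "국제"]

counts = keyword_counts(sample_words)
print(counts.most_common(3))  # -> [('아동', 3), ('인권', 2), ('국제', 1)]
```

Only ASCII punctuation is filtered here; typographic quotes and other symbols would need to be added to the check separately.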
EXAMPLES
The Lotte World That Pulled Me Down (나를 끌어내린 롯데월드)
(http://www.huffingtonpost.kr/seungjoon-ahn/story_b_6928016.html?utm_hp_ref=korea)
[('', 63), ('그', 19), ('것', 12), ('우리', 10), ('!', 8), ('없', 8),
('놀이기구', 8), ('직원', 8), ('수', 8), ('시각장애인', 8), ('안', 6),
('내', 5), ('있었다', 5), ('않', 5), ('매뉴얼', 5), ('근거', 5), ('사람', 5),
('롯데월드', 5), ('다른', 5), ('있던', 4), ('한', 4), ('장애인', 4),
('설명', 4), ('때', 4), ('상황', 4)]
POS TAGGED
Thank_VB You_PRP !_.