
Python learning for Natural Language Processing (2nd)


Page 1: Python learning for Natural Language Processing (2nd)

홍은기

PYTHON LEARNING FOR NATURAL LANGUAGE PROCESSING

Page 2: Python learning for Natural Language Processing (2nd)

1. Learning Sequence
2. Lists and Functions
3. Loops
4. Processing Raw Text with NLTK

CONTENTS

Page 3: Python learning for Natural Language Processing (2nd)

• 1. Python Syntax
• 2. Strings and Console Output
• 3. Conditionals and Control Flow
• 4. Functions
• 5. Lists & Dictionaries
• 6. Student Becomes the Teacher (test)
• 7. Lists and Functions
• 8. Loops
• 9. Exam Statistics (test)
• 10. Advanced Topics in Python
• 11. Introduction to Classes
• 12. File Input and Output

LEARNING SEQUENCE (WWW.CODECADEMY.COM)

Page 4: Python learning for Natural Language Processing (2nd)

LISTS AND FUNCTIONS

Page 5: Python learning for Natural Language Processing (2nd)

LOOPS

Page 6: Python learning for Natural Language Processing (2nd)

PROCESSING RAW TEXT WITH NLTK

(http://www.nltk.org/book/)

After extracting the text from an HTML document on the web, I tried to extract keywords from the text with NLTK.

Page 7: Python learning for Natural Language Processing (2nd)

EXAMPLES

이주 아동 외면하는 '다문화 한국사회' ("The 'Multicultural Korean Society' That Ignores Migrant Children")
(http://www.huffingtonpost.kr/kyongwhan-ahn/story_b_6927970.html?utm_hp_ref=korea)

[('', 65), ('(', 9), (')', 9), ('한다', 6), ("'", 6), ('있다', 5),
('아동', 5), ('큰', 5), ('모든', 5), ('일', 5), ('국제', 4),
('대한민국', 4), ('나라', 4), ('땅', 4), ('국제사회', 4),
('인권', 4), ('의원', 3), ('세계', 3), ('여의', 3), ('수', 3),
('안', 3), ('강한', 3), ('불문', 2), ('이주', 2), ('법무부', 2)]

Page 8: Python learning for Natural Language Processing (2nd)

1. HTML TO RAW TEXT

# -*- coding: utf-8 -*-
from urllib import request
import nltk, re, pprint
from nltk import word_tokenize
from nltk import *
from bs4 import BeautifulSoup

url = "http://www.huffingtonpost.kr/kyongwhan-ahn/story_b_6927970.html?utm_hp_ref=korea"

html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html).get_text()
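
Recent versions of Beautiful Soup warn when no parser is named; a minimal variant of the same step with the parser made explicit (assuming the standard-library html.parser is adequate for this page):

from urllib import request
from bs4 import BeautifulSoup

url = "http://www.huffingtonpost.kr/kyongwhan-ahn/story_b_6927970.html?utm_hp_ref=korea"
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html, 'html.parser').get_text()  # name the parser explicitly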


Page 10: Python learning for Natural Language Processing (2nd)

2. RAW TEXT TO LIST

raw = raw[30123:32364]
print(type(raw))
-> <class 'str'>

tokens = word_tokenize(raw)
print(type(tokens))
-> <class 'list'>
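
word_tokenize depends on NLTK's Punkt tokenizer models, which are installed separately from the library itself; a minimal setup sketch, assuming the models have not been downloaded yet:

import nltk
nltk.download('punkt')           # tokenizer models required by word_tokenize

from nltk import word_tokenize
tokens = word_tokenize(raw)      # raw is the sliced article text from the step above
print(type(tokens), len(tokens))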

Page 11: Python learning for Natural Language Processing (2nd)

3. LIST TO VOCABULARIES

words = Trial.NounExtractor(token)


Page 13: Python learning for Natural Language Processing (2nd)

3. LIST TO VOCABULARIES

token = ['철수는', '동생에게', '자전거를', '빌려주었다']

words = Trial.NounExtractor(token)

words = ['철수', '동생', '자전거', '빌려주었다']
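
Trial.NounExtractor appears to be the author's own helper for pulling nouns out of Korean word forms; it is not part of NLTK. As one possible substitute, a sketch of the same step with KoNLPy's Okt tagger (an assumption, not the method used in the slides):

# Noun extraction with KoNLPy's Okt tagger (assumed substitute for Trial.NounExtractor)
from konlpy.tag import Okt

okt = Okt()
token = ['철수는', '동생에게', '자전거를', '빌려주었다']
words = [noun for t in token for noun in okt.nouns(t)]
print(words)   # e.g. ['철수', '동생', '자전거'] -- a pure verb such as '빌려주었다' yields no noun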

Page 14: Python learning for Natural Language Processing (2nd)

4. FREQUENCY DISTRIBUTION

fdist = FreqDist(words)
print(fdist.most_common(25))
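
FreqDist is essentially a counter over the list of words; a self-contained toy example (the short word list here is only for illustration):

from nltk import FreqDist

words = ['철수', '동생', '자전거', '동생']   # toy list; the article yields a much longer one
fdist = FreqDist(words)
print(fdist.most_common(3))
# [('동생', 2), ('철수', 1), ('자전거', 1)]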


Page 16: Python learning for Natural Language Processing (2nd)

EXAMPLES

나를 끌어내린 롯데월드 ("The Lotte World That Dragged Me Off")
(http://www.huffingtonpost.kr/seungjoon-ahn/story_b_6928016.html?utm_hp_ref=korea)

[('', 63), ('그', 19), ('것', 12), ('우리', 10), ('!', 8), ('없', 8), ('놀이기구', 8), ('직원', 8),
('수', 8), ('시각장애인', 8), ('안', 6), ('내', 5), ('있었다', 5), ('않', 5), ('매뉴얼', 5),
('근거', 5), ('사람', 5), ('롯데월드', 5), ('다른', 5), ('있던', 4), ('한', 4), ('장애인', 4),
('설명', 4), ('때', 4), ('상황', 4)]

Page 17: Python learning for Natural Language Processing (2nd)

POS TAGGED

Thank_VB You_PRP !_.
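
A minimal sketch of producing this word_TAG style of output with NLTK's default English POS tagger (this assumes the tagger models are downloaded; the exact tags depend on the tagger and may differ from the slide):

import nltk
from nltk import word_tokenize, pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tagged = pos_tag(word_tokenize("Thank You !"))
print(' '.join('{}_{}'.format(word, tag) for word, tag in tagged))
# e.g. Thank_VB You_PRP !_.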