Python learning for Natural Language Processing (2nd)

홍은기

PYTHON LEARNING FOR NATURAL LANGUAGE PROCESSING

CONTENTS

1. Learning Sequence
2. Lists and Functions
3. Loops
4. Processing Raw Text with NLTK

LEARNING SEQUENCE (WWW.CODECADEMY.COM)

• 1. Python Syntax
• 2. Strings and Console Output
• 3. Conditionals and Control Flow
• 4. Functions
• 5. Lists & Dictionaries
• 6. Student Becomes the Teacher (test)
• 7. Lists and Functions
• 8. Loops
• 9. Exam Statistics (test)
• 10. Advanced Topic in Python
• 11. Introduction to Classes
• 12. File Input and Output

LISTS AND FUNCTIONS

LOOPS

PROCESSING RAW TEXT WITH NLTK

(http://www.nltk.org/book/)

After extracting the text from an HTML document on the web, I used NLTK to extract keywords from that text.

EXAMPLES

이주 아동 외면하는 '다문화 한국사회' ("A 'multicultural Korean society' that turns its back on migrant children")
(http://www.huffingtonpost.kr/kyongwhan-ahn/story_b_6927970.html?utm_hp_ref=korea)

[('', 65), ('(', 9), (')', 9), (' 한다 ', 6), ("'", 6), (' 있다 ', 5),
(' 아동 ', 5), (' 큰 ', 5), (' 모든 ', 5), (' 일 ', 5), (' 국제 ', 4),
(' 대한민국 ', 4), (' 나라 ', 4), (' 땅 ', 4), (' 국제사회 ', 4),
(' 인권 ', 4), (' 의원 ', 3), (' 세계 ', 3), (' 여의 ', 3), (' 수 ', 3),
(' 안 ', 3), (' 강한 ', 3), (' 불문 ', 2), (' 이주 ', 2), (' 법무부 ', 2)]

1. HTML TO RAW TEXT

# -*- coding: utf-8 -*-
from urllib import request
import nltk, re, pprint
from nltk import word_tokenize, FreqDist
from bs4 import BeautifulSoup

url = "http://www.huffingtonpost.kr/kyongwhan-ahn/story_b_6927970.html?utm_hp_ref=korea"

# Download the page and decode the bytes as UTF-8
html = request.urlopen(url).read().decode('utf8')
# Strip the markup and keep only the text (parser named explicitly to avoid a bs4 warning)
raw = BeautifulSoup(html, "html.parser").get_text()
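To try the HTML-to-text step without network access or third-party packages, the following is a minimal sketch that uses only the standard library's html.parser in place of BeautifulSoup. The class name TextExtractor, the function html_to_text, and the sample HTML are assumptions made for illustration; skipping script and style content is a choice of this sketch, not something the slides specify.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []       # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text of an HTML string, fragments joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, used instead of a live download
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>제목</h1><p>본문 <b>텍스트</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample_html))
```

On a real page the result still contains menus, footers, and other page chrome, which is why the next step cuts out a slice of the text.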


2. RAW TEXT TO LIST

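# Keep only a slice of the page text (hard-coded character offsets for this page)
raw = raw[30123:32364]
print(type(raw))     # -> <class 'str'>

# Split the string into a list of tokens
tokens = word_tokenize(raw)
print(type(tokens))  # -> <class 'list'>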
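The string-to-list step can be tried without NLTK's tokenizer data. The sketch below is an assumption-laden stand-in: simple_tokenize is a hypothetical regex tokenizer, not word_tokenize, and the text is cut out by searching for marker strings rather than by fixed offsets. The sample string and markers are invented for illustration.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    A rough stand-in for nltk.word_tokenize, not a reimplementation of it.
    """
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical page text with chrome before and after the body
raw = "MENU | 철수는 동생에게 자전거를 빌려주었다. | FOOTER"

# Locate the body by marker strings instead of hard-coded offsets
start = raw.index("철수는")
end = raw.index(" | FOOTER")
body = raw[start:end]

tokens = simple_tokenize(body)
print(type(body), type(tokens))
print(tokens)
```

Searching for markers is less brittle than fixed offsets when the page layout changes, but it still depends on knowing text that brackets the article body.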

3. LIST TO VOCABULARIES

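# Trial.NounExtractor is not part of NLTK; its source is not shown in these slides
words = Trial.NounExtractor(tokens)

Example (particles are stripped from each token):

tokens = ['철수는', '동생에게', '자전거를', '빌려주었다']   # "Cheolsu (topic)", "to the younger sibling", "bicycle (object)", "lent"
words = ['철수', '동생', '자전거', '빌려주었다']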
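Since the source of Trial.NounExtractor is not shown, the following is only a toy sketch of the kind of transformation the example illustrates: removing a trailing particle from each token. The particle list, strip_particle, and extract_stems are all hypothetical, and a real Korean noun extractor would need morphological analysis rather than suffix matching.

```python
# Hypothetical, deliberately tiny list of Korean particles (assumption of this sketch)
PARTICLES = ("에게", "은", "는", "을", "를")


def strip_particle(token):
    """Remove one trailing particle, trying longer particles first."""
    for particle in sorted(PARTICLES, key=len, reverse=True):
        # Never strip the whole token away
        if token.endswith(particle) and len(token) > len(particle):
            return token[: -len(particle)]
    return token


def extract_stems(tokens):
    """Apply strip_particle to every token in the list."""
    return [strip_particle(t) for t in tokens]


tokens = ["철수는", "동생에게", "자전거를", "빌려주었다"]
words = extract_stems(tokens)
print(words)
```

Suffix matching will mis-strip words that merely end in one of these syllables, so this is useful only for seeing the shape of the input and output.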

4. FREQUENCY DISTRIBUTION

fdist = FreqDist(words)
print(fdist.most_common(25))
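The example outputs in this deck list the empty string and punctuation marks among the most frequent items. The sketch below shows one possible way to drop them before counting. It uses collections.Counter, which NLTK's FreqDist subclasses, so most_common behaves the same way; the function name keyword_counts and the sample word list are assumptions for illustration.

```python
from collections import Counter
import string


def keyword_counts(words, n=25):
    """Count words after dropping empty strings and pure-punctuation tokens."""
    kept = []
    for word in words:
        word = word.strip()
        if not word:
            continue  # empty or whitespace-only token
        if all(ch in string.punctuation for ch in word):
            continue  # tokens such as "(" or "'"
        kept.append(word)
    return Counter(kept).most_common(n)


# Hypothetical sample in the style of the outputs shown in the slides
sample_words = ["", " 아동 ", "(", "아동", ")", "인권", "'", ""]
print(keyword_counts(sample_words))
```

string.punctuation covers ASCII punctuation only, so typographic quotes or other Unicode marks would need an extra check.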


EXAMPLES

나를 끌어내린 롯데월드 ("The Lotte World that pulled me down")
(http://www.huffingtonpost.kr/seungjoon-ahn/story_b_6928016.html?utm_hp_ref=korea)

[('', 63), (' 그 ', 19), (' 것 ', 12), (' 우리 ', 10), ('!', 8), (' 없 ', 8),
(' 놀이기구 ', 8), (' 직원 ', 8), (' 수 ', 8), (' 시각장애인 ', 8), (' 안 ', 6),
(' 내 ', 5), (' 있었다 ', 5), (' 않 ', 5), (' 매뉴얼 ', 5), (' 근거 ', 5),
(' 사람 ', 5), (' 롯데월드 ', 5), (' 다른 ', 5), (' 있던 ', 4), (' 한 ', 4),
(' 장애인 ', 4), (' 설명 ', 4), (' 때 ', 4), (' 상황 ', 4)]

POS TAGGED

Thank_VB You_PRP !_.
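Text in this word_TAG form can be turned back into (word, tag) pairs with a few lines of plain Python. The sketch below is a minimal illustration; the function name parse_tagged is hypothetical, and it splits on the last underscore so that words containing underscores keep them.

```python
def parse_tagged(text, sep="_"):
    """Convert 'word_TAG word_TAG ...' into a list of (word, tag) tuples."""
    pairs = []
    for item in text.split():
        if sep in item:
            # Split on the last separator: everything before it is the word
            word, _, tag = item.rpartition(sep)
            pairs.append((word, tag))
        else:
            # No tag present; keep the word and mark the tag as missing
            pairs.append((item, None))
    return pairs


print(parse_tagged("Thank_VB You_PRP !_."))
```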
