Knowledge Extraction from Textdakchigo.kr/events/part3/pdf/LOD(20140123,02,김평).pdf · 2014-01-23 · Knowledge Extraction from Text (KET) NIPS 2013 (Neural Information Processing

Copyright © 20012~ JNUE

The Power of Knowledge!! And Linked Open Data

Knowledge Extraction from Text

2014.1.24

김평 ([email protected])


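How should knowledge be extracted?

Linked Data creation
Knowledge acquisition: the process of extracting the required knowledge from various kinds of knowledge sources and organizing it into a structured form

DBMS
• Select what to build (identify the service to be built and the resources on hand)
• Data analysis centered on service scenarios
• Conceptualization (classes, properties)
• Conversion (DBMS -> RDF)
• Validation -> publication

Text
• Extracting knowledge from unstructured documents is additionally required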
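The DBMS -> RDF conversion step can be illustrated with a minimal sketch. It assumes a hypothetical `person` table, a made-up `http://example.org/` namespace, and a naive "one column = one property" mapping; a real project would normally define its classes and properties during conceptualization and use a dedicated mapping tool.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, not from the slides
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def rows_to_ntriples(conn, table, key_column):
    """Turn every row of `table` into N-Triples lines (one class per table,
    one property per column). `table` must be a trusted name, since it is
    interpolated into the SQL text."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    lines = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key_column]}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{BASE}{table.capitalize()}> .")
        for column, value in record.items():
            if column == key_column or value is None:
                continue
            # Escape backslashes and quotes so the literal stays well formed
            literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{BASE}{column}> "{literal}" .')
    return lines

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'George Harrison', 'Liverpool')")

triples = rows_to_ntriples(conn, "person", "id")

# A very small stand-in for the validation step before publication
is_valid = all(line.endswith(" .") for line in triples)
```

The sketch only covers conversion and a token validation check; selecting the target and analysing service scenarios remain manual design work.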




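Related technologies

Data mining
• The task of extracting useful information from potentially valuable data that is generated in many forms, such as e-commerce records or web logs
• Targets structured data, such as the data in a database
• Various algorithms (decision trees, neural networks, association rules) have been developed for tasks such as finding associations between features and generating rules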
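As an illustration of the association-rule idea mentioned above, the following minimal sketch computes support and confidence for single-item rules. The transactions and both thresholds are made up for the example.

```python
from itertools import combinations

# Hypothetical shopping baskets (structured, database-like data)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

MIN_SUPPORT = 0.5      # assumed threshold
MIN_CONFIDENCE = 0.6   # assumed threshold

rules = []
items = sorted(set().union(*transactions))
for a, b in combinations(items, 2):
    for lhs, rhs in ((a, b), (b, a)):
        s = support({lhs, rhs})
        c = confidence({lhs}, {rhs})
        if s >= MIN_SUPPORT and c >= MIN_CONFIDENCE:
            rules.append((lhs, rhs, s, c))
```

Each entry of `rules` reads "lhs -> rhs" together with its support and confidence; full algorithms such as Apriori add pruning so that larger itemsets stay tractable.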

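Text mining
• A mining technique that extracts patterns or relationships from unstructured text data written in natural language
• Spans fields such as natural language processing, information extraction, visualization, databases, and machine learning
• Studied in many applications, such as brand monitoring, opinion mining, and QA systems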




Knowledge Extraction from Text (KET) NIPS 2013 (Neural Information Processing Systems Foundation)

Text understanding is an old yet-unsolved AI problem consisting of a number of nontrivial steps. The critical step in solving the problem is knowledge acquisition from text, i.e. a transition from a non-formalized text into a formalized actionable language (i.e. capable of reasoning).

Other steps in the text understanding pipeline include linguistic processing, reasoning, text generation, search, question answering etc. which are more or less solved to the degree which allows composition of a text understanding service.

On the other hand, we know that knowledge acquisition, as the key bottleneck, can be done by humans, while automating the process is still out of reach in its full breadth.



Knowledge extraction and services

Text understanding and knowledge acquisition
• AI research groups: computational linguistics, machine learning, probabilistic & logical reasoning, and semantic web
• Machine learning is a branch of artificial intelligence that develops algorithms and techniques that enable computers to learn

Use of machine learning
• Carnegie Mellon University
• Cycorp
• IBM Research
• IDIAP Research Institute
• Jozef Stefan Institute
• KU Leuven (Katholieke Universiteit Leuven)
• Max Planck Institute
• MIT Media Lab
• University of Washington
• Vulcan Inc.



Read the Web

A research project at Carnegie Mellon University (since January 2010)
NELL: Never-Ending Language Learning

First, it attempts to "read," or extract facts from text found in hundreds of millions of web pages (e.g., playsInstrument(George_Harrison, guitar)).

Second, it attempts to improve its reading competence, so that tomorrow it can extract more facts from the web, more accurately.

So far, NELL has accumulated over 50 million candidate beliefs by reading the web, and it is considering these at different levels of confidence. NELL has high confidence in 2,069,313 of these beliefs.
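The idea of candidate beliefs held at different confidence levels can be sketched as follows. This is only an illustration: the `Belief` type, the confidence scores, the second fact, and the 0.9 cut-off are assumptions made for the example, not NELL's actual data model or values.

```python
import re
from typing import NamedTuple

class Belief(NamedTuple):
    relation: str
    subject: str
    obj: str
    confidence: float

# Matches facts written as relation(subject, object)
FACT_PATTERN = re.compile(r"^\s*(\w+)\(\s*(\w+)\s*,\s*(\w+)\s*\)\s*$")

def parse_belief(text, confidence):
    """Parse 'relation(subject, object)' into a Belief with a confidence score."""
    match = FACT_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a binary fact: {text!r}")
    relation, subject, obj = match.groups()
    return Belief(relation, subject, obj, confidence)

# Hypothetical candidate beliefs with made-up confidence scores
candidates = [
    parse_belief("playsInstrument(George_Harrison, guitar)", 0.97),
    parse_belief("playsInstrument(George_Harrison, trumpet)", 0.41),
]

HIGH_CONFIDENCE = 0.9  # assumed threshold
trusted = [b for b in candidates if b.confidence >= HIGH_CONFIDENCE]
```

Keeping low-confidence candidates around instead of discarding them lets later evidence raise or lower their scores, which fits the "improve its reading competence" goal described above.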

http://rtw.ml.cmu.edu/rtw/

http://rtw.ml.cmu.edu/rtw/kbbrowser/



OpenCyc (1)

Cyc: the world's largest and most complete general knowledge base and commonsense reasoning engine.
• rich domain modeling
• semantic data integration
• text understanding
• domain-specific expert systems
• game AIs

http://www.cyc.com/platform/opencyc
• ~239,000 terms (up from ~177,000 terms in the previous release)
• ~2,093,000 triples (up from ~1,500,000 in the previous release)
• ~69,000 owl:sameAs links to external (non-Cyc) semantic data namespaces

http://www.cyc.com/vocabulary/basics

http://sw.opencyc.org/



OpenCyc (2)

Semantic Construction Grammar

Semantic Construction Grammar: Michael Witbrock (2012.1.18)


Watson

IBM: Watson understands natural language, breaking down the barrier between people and machines.


The Science Behind an Answer
• Question Analysis (2:11)
• Hypothesis Generation (2:45)
• Hypothesis & Evidence Scoring (3:19)
• Final Merging & Ranking (4:17)

http://youtu.be/WFR3lOm_xhE

http://youtu.be/DywO4zksfXw


Deep Learning for NLP (1)

Deep Learning
A new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence.


http://ninacsmith.com/3CLearning/Ninas3CTools/ConstructiveTools/DeeporShallow.aspx




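The hot topic in AI technology
• The algorithm in question is the learning algorithm based on deep neural networks.
• By giving future ICT devices human-like cognition and judgment, it could make possible machines that advance their own intelligence.
• Google, MIT, Gartner, and others selected it as a technology to watch in 2013.

In its May/June 2013 issue, MIT Technology Review introduced "deep learning," an artificial intelligence system developed by Google, as one of "10 technologies that will enhance human potential." The system is said to learn the way a person does and to develop its language ability on its own.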




http://deeplearning.net/
Demo: http://deeplearning.net/demos/

Stanford's sentiment analysis demo using a recursive neural network (sentiment analysis of movie reviews): http://nlp.stanford.edu:8080/sentiment/rntnDemo.html



Xlike (1)

Jožef Stefan Institute, Slovenia (FP7)
Cross-lingual Knowledge Extraction

Two key open research problems:
• to extract and integrate formal knowledge from multilingual texts with cross-lingual knowledge bases
• to adapt linguistic techniques and crowdsourcing to deal with irregularities in informal language used primarily in social media


http://www.xlike.org/home/partners/


Xlike (2)

Demo
• Newsfeed: clean stream of semantically enriched news articles
• Multilingual Language Processing: web services for multilingual language processing
• Cross-lingual Document Linking: demo of cross-lingual similarity search
• News Data Visualization: interactive interface to Newsfeed data enriched with XLike technologies


http://newsfeed.ijs.si/

http://sandbox-xlike.isoco.com/demo/index

http://aidemo.ijs.si/xling/wikipedia.html

http://sandbox-xlike.isoco.com/portal/


SemEval

SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems


http://en.wikipedia.org/wiki/File:SemEval_framework.jpg

http://en.wikipedia.org/wiki/SemEval



Spatial Role Labeling

Spatial relationships between objects http://en.wikipedia.org/wiki/SemEval#Semantic_evaluation_tasks



YAGO2s (1)

Max-Planck-Institut Informatik
YAGO2s
• A huge semantic knowledge base, derived from Wikipedia, WordNet and GeoNames.
• Currently, YAGO2s has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities.
• The accuracy of YAGO has been manually evaluated, proving a confirmed accuracy of 95%. Every relation is annotated with its confidence value.
• YAGO combines the clean taxonomy of WordNet with the richness of the Wikipedia category system, assigning the entities to more than 350,000 classes.
• YAGO is an ontology that is anchored in time and space. YAGO attaches a temporal dimension and a spatial dimension to many of its facts and entities.
• In addition to a taxonomy, YAGO has thematic domains such as "music" or "science" from WordNet Domains.


http://wordnet.princeton.edu/

http://wndomains.fbk.eu/


YAGO2s (2)

YAGO2s Demo
YAGO as Linked Open Data

YAGO2 is part of the linked data cloud. We are linked directly to DBpedia. You can download these links.
• sameAs-links between the classes of DBpedia and YAGO2s
• sameAs-links between the individuals of YAGO2s and the YAGO-based classes of DBpedia
• subClassOf-links between the classes of YAGO2s and the manual ontology classes of DBpedia: download as TSV (with precision), download as RDF/TTL (cut at 60% precision). These links have been computed automatically by the PARIS project and are not 100% accurate.
• subPropertyOf-links between the relations of YAGO2s and the manual ontology properties of DBpedia: download as TSV (with precision), download as RDF/TTL (cut at 40% precision). These links have been computed automatically by the PARIS project. They are not perfect, but of very reasonable quality.
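The relation between the TSV download (with precision) and the RDF/TTL download (cut at a precision threshold) can be sketched as below. The three-column layout (source, target, precision) and every row in the sample are assumptions for illustration; the real files may be organized differently.

```python
import csv
import io

# Made-up sample rows: source class, target class, estimated precision
SAMPLE_TSV = (
    "yago:wordnet_city\tdbpedia-owl:City\t0.93\n"
    "yago:wordnet_singer\tdbpedia-owl:MusicalArtist\t0.71\n"
    "yago:wordnet_island\tdbpedia-owl:Place\t0.35\n"
)

def links_above(tsv_text, min_precision):
    """Keep only the links whose precision reaches the cut-off."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [(s, t, float(p)) for s, t, p in reader if float(p) >= min_precision]

def to_turtle(links, predicate="rdfs:subClassOf"):
    """Render the kept links as Turtle-style statements (prefixes omitted)."""
    return [f"{source} {predicate} {target} ." for source, target, _ in links]

kept = links_above(SAMPLE_TSV, 0.60)   # same cut as the subClassOf download
statements = to_turtle(kept)
```

Lowering the cut-off trades precision for coverage, which is presumably why the subPropertyOf links are offered with a different threshold (40%).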


http://www.mpi-inf.mpg.de/yago-naga/yago/demo.html


ConceptNet (1)

ConceptNet is a semantic network containing lots of things computers should know about the world, especially when understanding text written by people.


http://conceptnet5.media.mit.edu/


ConceptNet (2)

ConceptNet
A freely available commonsense knowledge base and natural-language-processing toolkit which supports many practical textual-reasoning tasks over real-world documents right out of the box (without additional statistical training), including:
• topic-gisting (e.g. a news article containing the concepts "gun," "convenience store," "demand money" and "make getaway" might suggest the topics "robbery" and "crime")
• affect-sensing (e.g. this email is sad and angry)
• analogy-making (e.g. "scissors," "razor," "nail clipper," and "sword" are perhaps like a "knife" because they are all "sharp," and can be used to "cut something")
• text summarization
• contextual expansion
• causal projection
• cold document classification
• other context-oriented inferences
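The topic-gisting example above can be mimicked with a toy semantic network. The `RELATED_TOPICS` edges are hand-written for this sketch and the voting scheme is an assumption; this is not ConceptNet's data or its actual inference method.

```python
from collections import Counter

# Hypothetical concept -> related-topic edges (hand-written for the example)
RELATED_TOPICS = {
    "gun": ["crime", "robbery", "hunting"],
    "convenience store": ["shopping", "robbery"],
    "demand money": ["robbery", "crime"],
    "make getaway": ["robbery", "crime"],
}

def gist(concepts, top_n=2):
    """Let every concept vote for its related topics and return the winners."""
    votes = Counter()
    for concept in concepts:
        votes.update(RELATED_TOPICS.get(concept, []))
    return [topic for topic, _ in votes.most_common(top_n)]

topics = gist(["gun", "convenience store", "demand money", "make getaway"])
# topics == ["robbery", "crime"]
```

No statistical training is involved: the result depends only on the commonsense links stored in the network, which is the point the slide makes with "right out of the box."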



Open Information Extraction (1)

University of Washington AI
Get answers to natural-language questions!

• How can a computer accumulate a massive body of knowledge?
• What will Web search engines look like in ten years?

To address the questions above, the Open IE project has been developing a Web-scale information extraction system that reads arbitrary text from any domain on the Web, extracts meaningful information, and stores it in a unified knowledge base for efficient querying. In contrast to traditional information extraction, the Open Information Extraction paradigm attempts to overcome the knowledge acquisition bottleneck by extracting a large number of relations at once.
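To make "extracting a large number of relations at once" more concrete, here is a toy extractor that does not fix the set of relations in advance: any verb phrase found between two argument phrases becomes the relation. The tiny `TAGS` lexicon stands in for a real part-of-speech tagger and is purely hypothetical, and the pattern is a drastic simplification rather than the method of any Open IE system.

```python
import re

# Hypothetical mini lexicon: V = verb, P = preposition; everything else
# is treated as part of an argument (tag N).
TAGS = {
    "was": "V", "is": "V", "born": "V", "invented": "V", "located": "V",
    "in": "P", "of": "P", "by": "P",
}

def tag(token):
    return TAGS.get(token.lower(), "N")

def extract(sentence):
    """Return (arg1, relation, arg2) or None if no verb phrase sits between arguments."""
    tokens = re.findall(r"\w+", sentence)
    tags = "".join(tag(t) for t in tokens)  # one character per token
    # Relation phrase: verbs, optionally ending in a preposition,
    # with an argument token directly before and after it.
    match = re.search(r"(?<=N)(V+P?)(?=N)", tags)
    if match is None:
        return None
    start, end = match.span(1)
    left = start
    while left > 0 and tags[left - 1] == "N":
        left -= 1
    right = end
    while right < len(tags) and tags[right] == "N":
        right += 1
    return (
        " ".join(tokens[left:start]),
        " ".join(tokens[start:end]),
        " ".join(tokens[end:right]),
    )

fact_1 = extract("Einstein was born in Ulm")
fact_2 = extract("Edison invented the light bulb")
```

Here `fact_1` is `("Einstein", "was born in", "Ulm")` and `fact_2` is `("Edison", "invented", "the light bulb")`: two different relations come out of the same pattern without either being declared beforehand.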


http://ai.cs.washington.edu/projects/open-information-extraction


Open Information Extraction (2)

Demo
• Demo: TextRunner extracted over 500,000,000 assertions from 100 million Web pages.
• Software: ReVerb Open Information Extraction Software and additional information.
• Data: Horn-clause inference rules learned by the Sherlock system.
• Demo: Selectional Preferences from Web Text compute admissible argument values for a relation.
• Data: 10,000 Functional Relations learned from Web Text predict the functionality of a phrase.


http://www.cs.washington.edu/research/textrunner


SILK

SILK (Semantic Inferencing on Large Knowledge)
SILK is the newest part of Vulcan Inc.'s Project Halo

http://projecthalo.com/

http://silk.semwebcentral.org/talk-silk-ruleml2011.pdf






Wolfram|Alpha

Making the world’s knowledge computable http://www.wolframalpha.com/examples/



Knowledge Extraction

the creation of knowledge from structured and unstructured sources (Wikipedia)


http://en.wikipedia.org/wiki/Knowledge_extraction


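Conclusion

The steps needed for LOD to spread: what are the obstacles?

Who, what, and how????

How should it be built and disseminated?

The long and difficult road to automating knowledge.....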
