WATSON @ RPI
PROFESSOR JIM HENDLER
SIMON ELLIS
KATE MCGUIRE, NICOLE NEGEDLY, AVI WEINSTOCK, MATT KLAWONN, JENN CHAN, SARABETH JAFFE
WATSON TECHNOLOGIES AND OPEN ARCHITECTURE QUESTION ANSWERING
INSIDE DEEPQA
Managing complex unstructured data with UIMA
Simon Ellis
22nd November, 2013
WATSON RPI
INTRODUCTION
IBM Watson
Watson is…
… a piece of software that will run on your laptop
- Though very slowly
- Specialised hardware and control platform
… an implementation of the DeepQA concept
… the first iteration of the ‘cognitive computing’ platform
… a very clever artificial intelligence
- A very clever application of human intelligence
Inside Watson
Watson pipeline as published by IBM; see IBM J Res & Dev 56 (3/4), May/July 2012, p. 15:2
Nicole Negedly
QUESTION ANALYSIS
Question analysis
What is the question asking for?
Which terms in the question refer to the answer?
Given any natural language question, how can Watson accurately discover this information?
Who is the president of Rensselaer Polytechnic Institute?
- Focus Terms: “Who”, “president of Rensselaer Polytechnic Institute”
- Answer Types: Person, President
Parsing and semantic analysis
What information about a previously unseen piece of English text can Watson determine?
How is this information useful?
Natural Language Parsing
- grammatical structure
- parts of speech
- relationships between words
- ...etc.
Semantic Analysis
- meanings of words, phrases, etc.
- synonyms, entailment
- hypernyms, hyponyms
- ...etc.
Parsing
Stanford’s NLP toolset is used
Semantic relations in WordNet
Princeton University’s WordNet
Words are grouped into sets of synonyms called synsets
Relationships exist between noun synsets
- hypernym/hyponym: type-of relation, e.g. canine is a hypernym of dog
- holonym/meronym: part-of relation, e.g. building is a holonym of window
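The two noun relations above can be sketched in a few lines of Java. This is a toy model only: the class `ToyWordNet` and its methods are hypothetical names invented for illustration, not the WordNet API, and each synset is reduced to a single label.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of WordNet-style noun relations; NOT the real WordNet API. */
final class ToyWordNet {
    // Each map points from a synset label to its parent in that relation.
    private final Map<String, String> hypernymOf = new HashMap<>();
    private final Map<String, String> holonymOf = new HashMap<>();

    void addHypernym(String hyponym, String hypernym) {
        hypernymOf.put(hyponym, hypernym);
    }

    void addHolonym(String meronym, String holonym) {
        holonymOf.put(meronym, holonym);
    }

    /** True if 'ancestor' can be reached from 'word' by following type-of links. */
    boolean isKindOf(String word, String ancestor) {
        String current = hypernymOf.get(word);
        int steps = 0;
        // The step counter guards against accidental cycles in the toy data.
        while (current != null && steps++ < hypernymOf.size()) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = hypernymOf.get(current);
        }
        return false;
    }

    /** The whole that 'part' belongs to, if one was recorded. */
    Optional<String> wholeOf(String part) {
        return Optional.ofNullable(holonymOf.get(part));
    }

    public static void main(String[] args) {
        ToyWordNet wn = new ToyWordNet();
        wn.addHypernym("dog", "canine");
        wn.addHolonym("window", "building");
        System.out.println(wn.isKindOf("dog", "canine"));
        System.out.println(wn.wholeOf("window").orElse("unknown"));
    }
}
```

Following type-of links transitively is what lets an answer-type check accept a candidate whose type is more specific than the type the question asks for.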
How is this useful?
This information can be used to “understand” a question
Current Question Analysis work with RPI’s version of Watson
- Creating and training machine learning classifiers
[Diagram labels: Parse Trees, Dependency Relations, Coreferences, Named Entities, Semantic Relations; Classifiers; Manually Annotated Questions; New Question; Critical Elements of Question]
Question analysis pipeline
Unstructured Question Text → Parsing & Semantic Analysis → Machine Learning Classifiers → Structured Annotations of Question: focus, answer types, useful search queries
Kate McGuire
CANDIDATE GENERATION
Search Result Processing and Candidate Generation
Primary Search
Primary Search generates the corpus of information from which we take candidate answers, passages, supporting evidence, and essentially all other textual input to the system
It formulates queries based on the results of Question Analysis
These queries are passed to a search engine, which returns a set number of highly relevant documents and their ranks.
Search Result Processing
Search Result Processing restructures the information in each document so that it is useful.
- HTML tags are cleaned from the document
- Passage Retrieval/Chunking
  - Breaks the document down into smaller pieces
  - Adds information, such as the HTML text, length, place in the document, etc.
- Passage Parsing
  - Parse trees are formed for each passage
Candidate Generation
Candidate Generation casts a wide net of possible answers to the question from each document.
Using each document, and the passages created by Search Result Processing, we generate candidates using three techniques:
- Title of Document (T.O.D.): Adds the title of the document as a candidate.
- Wikipedia Title Candidate Generation: Adds any noun phrases within the document’s passage texts that are also the titles of Wikipedia articles.
- Anchor Text Candidate Generation: Adds candidates based on the hyperlinks and metadata within the document.
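The three techniques can be sketched roughly as follows. This is a simplified illustration, not the RPI or IBM implementation: the `Doc` class, its fields, and the in-memory set of Wikipedia titles are all assumptions, and the noun phrases are assumed to have been extracted already by Passage Parsing.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the three candidate generation techniques on toy data. */
final class CandidateGenerationSketch {

    /** Hypothetical, simplified view of one processed search result. */
    static final class Doc {
        final String title;
        final List<String> nounPhrases; // assumed output of passage parsing
        final List<String> anchorTexts; // assumed to come from hyperlink metadata

        Doc(String title, List<String> nounPhrases, List<String> anchorTexts) {
            this.title = title;
            this.nounPhrases = nounPhrases;
            this.anchorTexts = anchorTexts;
        }
    }

    /** Returns the de-duplicated candidates for one document, in insertion order. */
    static Set<String> generate(Doc doc, Set<String> wikipediaTitles) {
        Set<String> candidates = new LinkedHashSet<>();
        // 1. Title of Document
        candidates.add(doc.title);
        // 2. Wikipedia Title: keep noun phrases that are also article titles
        for (String phrase : doc.nounPhrases) {
            if (wikipediaTitles.contains(phrase)) {
                candidates.add(phrase);
            }
        }
        // 3. Anchor Text: text of hyperlinks found in the document
        candidates.addAll(doc.anchorTexts);
        return candidates;
    }

    public static void main(String[] args) {
        Doc doc = new Doc(
                "Rensselaer Polytechnic Institute",
                List.of("Shirley Ann Jackson", "the campus"),
                List.of("Troy, New York"));
        System.out.println(generate(doc, Set.of("Shirley Ann Jackson")));
    }
}
```

Nothing is filtered for correctness at this stage; the aim is recall, and the scorers decide later which candidates are plausible.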
Matt Klawonn
SCORING & RANKING
Scoring
Analyzes how well a candidate answer relates to the question
Two basic types of scoring algorithm
- Context-independent scoring
- Context-dependent scoring
Types of scorers
Context-independent
- Question Analysis
- Ontologies (DBpedia, YAGO, etc.)
- Reasoning
Context-dependent
- Analyzes the natural language that candidates appear in
- Relies on “passages” found during search
Scorers
Examples of scorers include
- Passage Term Match
- Textual Alignment
- Skip-Bigram
Each of these scores supporting evidence
Scores are then merged to produce a single candidate score
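As a rough illustration of scoring and merging, the sketch below scores a passage by plain term overlap and merges scorer outputs with a weighted sum. Both are simplifying assumptions: the real Passage Term Match scorer and the merging step are more sophisticated, and the weights here are made-up placeholders.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Simplified term-overlap scorer and weighted-sum merge (illustrative only). */
final class ScoringSketch {

    /** Lower-cased distinct word tokens of a text. */
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        out.remove(""); // split() can yield an empty leading token
        return out;
    }

    /** Fraction of distinct question terms that also occur in the passage. */
    static double termMatch(String question, String passage) {
        Set<String> questionTerms = terms(question);
        if (questionTerms.isEmpty()) {
            return 0.0;
        }
        Set<String> passageTerms = terms(passage);
        long hits = questionTerms.stream().filter(passageTerms::contains).count();
        return (double) hits / questionTerms.size();
    }

    /** Weighted sum of individual scorer outputs; the weights are placeholders. */
    static double merge(double[] scores, double[] weights) {
        if (scores.length != weights.length) {
            throw new IllegalArgumentException("one weight per score is required");
        }
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            total += scores[i] * weights[i];
        }
        return total;
    }

    public static void main(String[] args) {
        double match = termMatch(
                "Who is the president of RPI?",
                "Shirley Ann Jackson is the president of RPI.");
        System.out.println(merge(new double[] {match, 0.5}, new double[] {0.5, 0.5}));
    }
}
```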
Simon Ellis
THE TAO OF UIMA
UIMA
‘Unstructured Information Management Architecture’
A platform for the analysis of unstructured information and its integration with search technologies
Permits multi-modal analysis of collections or archives
UIMA
http://uima.apache.org/d/uimaj-2.4.0/
‘Unstructured information’
The most rapidly-growing source of information in existence
- The internet
- Print media
- Video recordings
- Audio recordings
- ...
“Unstructured information is just information that doesn’t have the kind of structure you need it to have for what you’re doing.” [Peter Fox, X-Informatics class]
UIMA (again)
The UIMA platform can be thought of in four ways:
- A specification of component interfaces for, and in, an analytics pipeline
- A specification of certain design patterns for that pipeline
- An outline of two data representations: in-memory annotations for local analysis and an XML representation for remote web integration
- An outline of possible development roles, allowing tools to be used by users with a wide range of skills
CAS
Common Analysis Structure (CAS)
- Object-based structure
- Allows representation of objects, properties and values
- Stores arbitrary data structures
- Annotations
- Types
- Object types may be related by single inheritance
- Contains the document being analysed, either physically or logically
Results of analysis are shared and recorded in a CAS
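To make the idea concrete, here is a toy stand-in for a CAS: one document plus typed annotations that point back into it by character offsets. The class `ToyCas` and its methods are hypothetical and far simpler than the UIMA classes; the sketch only shows how analysis results can be recorded alongside the document they describe.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a CAS; NOT the UIMA API. */
final class ToyCas {

    /** A typed span over the document text. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    private final String documentText;
    private final List<Annotation> annotations = new ArrayList<>();

    ToyCas(String documentText) {
        this.documentText = documentText;
    }

    /** Records an analysis result as an annotation over [begin, end). */
    Annotation annotate(String type, int begin, int end) {
        if (begin < 0 || begin > end || end > documentText.length()) {
            throw new IllegalArgumentException("span out of range");
        }
        Annotation a = new Annotation(type, begin, end);
        annotations.add(a);
        return a;
    }

    /** The piece of the document that an annotation covers. */
    String coveredText(Annotation a) {
        return documentText.substring(a.begin, a.end);
    }

    /** All annotations of one type, in the order they were added. */
    List<Annotation> select(String type) {
        List<Annotation> out = new ArrayList<>();
        for (Annotation a : annotations) {
            if (a.type.equals(type)) {
                out.add(a);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String question = "Who is the president of RPI?";
        ToyCas cas = new ToyCas(question);
        cas.annotate("Focus", 0, 3);
        int start = question.indexOf("RPI");
        cas.annotate("NamedEntity", start, start + 3);
        for (Annotation a : cas.select("NamedEntity")) {
            System.out.println(a.type + ": " + cas.coveredText(a));
        }
    }
}
```

Because every component reads from and writes to the same structure, a later annotator can build on spans recorded by an earlier one without knowing which component produced them.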
Annotator
Core UIMA component type
Contains analysis algorithms designed to work on data contained in a CAS
- Original document
- Annotation
- Search evidence
- Candidate score
- ...
Annotators form the building blocks of Analysis Engines
Analysis Engine
Building blocks of a UIMA pipeline
Section of code containing one or more annotators
Analyses source document(s) and provides analysis results
- Results typically represent metadata about the source
Analysis Engines are effectively software agents that discover and record metadata
Example
http://uima.apache.org/d/uimaj-2.4.0/
Sofas and CAS Views
Sofa
- Subject of Analysis
- A piece of data intended for analysis by UIMA components
CAS View
- A section of a CAS dedicated to one Sofa
- Shares the same name as its Sofa
- May be dynamically created as needed by applications or AEs
Each Sofa permits a different perspective of an artefact
Example
Dr Shirley Ann Jackson
Teacher of physics
President, RPI
Researcher at Bell Labs
IBM Board of Directors
Chairman, USNRC
Descriptors
All components consist of two parts
- Code
- Descriptor (declaration)
Functions of the descriptor
- Contains metadata about the code block
  - Name
  - Structure
  - Behaviour
- Used in component discovery, reuse, and tool composition
UIMA (again, again)
Highly reliant on XML
- Flexible
- Extensible
XML...
- ... describes components and their behaviour
- ... controls data (CAS) flow through the pipeline
- ... is used to create larger components from subcomponents
  - Aggregate Analysis Engines
Aggregate Analysis Engine
A complex analysis engine made up of other components
- May contain simple AEs or other AAEs
- Components further down the pipeline may rely on the output of all earlier components
- Performs a larger, complete task, e.g. named entity recognition
  - language detection and tokenisation
  - part-of-speech detection
  - deep grammatical parsing
  - named entity recognition
CAS Multiplier
Creates 0 or more new CAS objects from an input CAS
May be used to duplicate or merge CAS objects, e.g....
- ... creating alternative versions of an input Sofa
- ... breaking a large input CAS into multiple smaller pieces
- ... aggregating multiple input CASes into a single output
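The splitting and aggregating cases can be pictured with plain strings standing in for CAS objects. This is a sketch under that assumption only; `ToyCasMultiplier` is a hypothetical name, and the paragraph-based split is just one possible chunking rule.

```java
import java.util.ArrayList;
import java.util.List;

/** Strings stand in for CAS objects here; this is NOT the UIMA API. */
final class ToyCasMultiplier {

    /** One input becomes zero or more outputs: one per non-blank paragraph. */
    static List<String> split(String input) {
        List<String> pieces = new ArrayList<>();
        for (String paragraph : input.split("\\n\\s*\\n")) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                pieces.add(trimmed);
            }
        }
        return pieces;
    }

    /** The reverse direction: several inputs aggregated into one output. */
    static String merge(List<String> pieces) {
        return String.join("\n\n", pieces);
    }

    public static void main(String[] args) {
        List<String> pieces = split("First passage.\n\nSecond passage.\n\n\nThird passage.");
        System.out.println(pieces.size() + " pieces");
        System.out.println(merge(pieces));
    }
}
```

An empty input yields an empty list, which corresponds to the "0 or more" case above.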
UIMA, once more
UIMA runs in the Java Runtime Environment
- Uses XML code to run the system
- The UIMA framework reads XML dynamically and creates objects from it
- Only the UIMA framework itself is compiled
SO HOW DOES IT WORK?
How it works
Abstract class prototyping
- UIMA Framework objects are usually derived from a base class
Function signature
- UIMA Framework objects each have certain functions which can or must be overridden
  - initialize()
  - process()
This ensures all classes are of known supertypes and have a recognisable function signature for all key functions
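The base-class pattern described above might look like the following in miniature. None of these classes are UIMA classes: `ToyAnnotator`, its two subclasses and the `run` method are hypothetical, and the method parameters are simplified (a string and a results list instead of a CAS).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy illustration of the base-class pattern; these are NOT UIMA classes. */
final class BaseClassSketch {

    /** Hypothetical base class that every component extends. */
    abstract static class ToyAnnotator {
        protected Map<String, String> settings = Map.of();

        /** May be overridden; the default just stores the settings. */
        void initialize(Map<String, String> settings) {
            this.settings = settings;
        }

        /** Must be overridden; appends analysis results for the text. */
        abstract void process(String text, List<String> results);
    }

    static final class LengthAnnotator extends ToyAnnotator {
        @Override
        void process(String text, List<String> results) {
            results.add("length=" + text.length());
        }
    }

    static final class KeywordAnnotator extends ToyAnnotator {
        private String keyword = "";

        @Override
        void initialize(Map<String, String> settings) {
            super.initialize(settings);
            keyword = settings.getOrDefault("keyword", "");
        }

        @Override
        void process(String text, List<String> results) {
            results.add("hasKeyword=" + (!keyword.isEmpty() && text.contains(keyword)));
        }
    }

    /** The "framework" drives any component through the base type alone. */
    static List<String> run(List<ToyAnnotator> annotators,
                            Map<String, String> settings, String text) {
        List<String> results = new ArrayList<>();
        for (ToyAnnotator annotator : annotators) {
            annotator.initialize(settings);
            annotator.process(text, results);
        }
        return results;
    }

    public static void main(String[] args) {
        List<ToyAnnotator> pipeline = List.of(new LengthAnnotator(), new KeywordAnnotator());
        System.out.println(run(pipeline, Map.of("keyword", "Watson"), "Watson @ RPI"));
    }
}
```

The `run` method never names a concrete subclass, which is the point of the pattern: new components can be added without changing the code that drives them.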
How it works
Reflection
- The ability of a computer program to examine and modify the structure and behavior (specifically the values, meta-data, properties and functions) of an object at runtime.
XML descriptors define the nature of objects
- class name
- constructor parameters
- ...
UIMA dynamically creates objects using reflection
The ‘magic code’
import java.lang.reflect.Constructor;
// variable that will hold the annotator we want to create
JCasAnnotator ann = null;
// look up the Class object for the class named in the descriptor
Class<?> annClass = Class.forName("com.ibm.tutorial.tycor");
// get the public constructor that matches the given parameter types
Constructor<?> cons = annClass.getConstructor(<params>);
// newInstance() returns Object, so cast the result to the annotator type
ann = (JCasAnnotator) cons.newInstance(<params>);
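A self-contained variant of the same idea is sketched below so that it can be compiled and run without UIMA. The `Annotator` interface, the `Shouter` class and the single `String` constructor parameter are all invented for this example; in the real framework the class name and parameters would come from the XML descriptor.

```java
import java.lang.reflect.Constructor;
import java.util.Locale;

/** Runnable reflection demo; the types here are hypothetical, not UIMA's. */
final class ReflectionSketch {

    /** Known supertype that the "framework" programs against. */
    interface Annotator {
        String process(String text);
    }

    /** A concrete component the framework knows only by name. */
    public static final class Shouter implements Annotator {
        private final String suffix;

        public Shouter(String suffix) {
            this.suffix = suffix;
        }

        @Override
        public String process(String text) {
            return text.toUpperCase(Locale.ROOT) + suffix;
        }
    }

    /** Creates an Annotator from a class name and one constructor argument. */
    static Annotator create(String className, String argument) {
        try {
            // Load the class by name, as a descriptor-driven framework would.
            Class<?> cls = Class.forName(className);
            // Find the public constructor that takes a single String.
            Constructor<?> constructor = cls.getConstructor(String.class);
            // newInstance() returns Object, so cast to the known supertype.
            return (Annotator) constructor.newInstance(argument);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        // The name is taken from the class itself only to keep the demo portable.
        Annotator annotator = create(Shouter.class.getName(), "!");
        System.out.println(annotator.process("hello"));
    }
}
```

The cast to a known supertype is where the base-class convention matters: the framework can create a class it has never seen at compile time and still call it through a fixed interface.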
UIMA, finally
Effectively an interpreter for code ‘scripted’ in XML and Java
Component-oriented design makes scaling easy
- BlueJ (Jeopardy! hardware) had ≫ 2,000 cores
Most easily written in Java
- Java runs in the Java Runtime Environment
- Dynamic typing & reflection are therefore possible
- Could not have been written in C++08
An OS for multimodal, unstructured information management
QUESTIONS & ANSWERS