Knowledge Graph Curation with Deep Learning and Semantic Reasoning
Jiaoyan Chen (陈矫彦)
Department of Computer Science, University of Oxford
https://www.cs.ox.ac.uk/isg/people/jiaoyan.chen/
08/11/2019


Knowledge Graph (KG)

• The Knowledge Graph is a knowledge base used by Google and its services to enhance its search engine's results with information gathered from a variety of sources.

• Proposed around 2012

-------- From Wikipedia

Semantic Web

• The Semantic Web is an extension of the World Wide Web through standards by the World Wide Web Consortium (W3C)

• URI: Uniform Resource Identifier
  • e.g., http://dbpedia.org/resource/University_of_Oxford

• RDF (Resource Description Framework)
  • Triple: <Subject, Property, Object>

• RDF Schema
  • e.g., class hierarchy, property domain and range

• Web Ontology Language (OWL)
• SPARQL (RDF query language)
• Semantic Reasoning

KG: RDF triples + RDF Schema or OWL Ontology

[Figure: example KG. Ian Horrocks worksFor University of Oxford; University of Oxford locatedIn Oxford; Oxford locatedIn UK; Ian Horrocks isA Professor; Professor subClassOf Person]
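The "triples + schema" view of a KG can be sketched in plain Python. This is only an illustration: the edge directions are read off the slide's figure, and the helper name `infer_types` is hypothetical. It shows how one schema axiom (subClassOf) lets a reasoner derive a fact that was never asserted.

```python
# Toy KG: facts and a tiny schema, both written as (subject, property, object).
# The triples are an assumed reading of the example figure.
triples = {
    ("Ian_Horrocks", "worksFor", "University_of_Oxford"),
    ("Ian_Horrocks", "isA", "Professor"),
    ("University_of_Oxford", "locatedIn", "Oxford"),
    ("Oxford", "locatedIn", "UK"),
    ("Professor", "subClassOf", "Person"),  # schema axiom
}


def infer_types(kg):
    """Add (x, isA, D) whenever (x, isA, C) and (C, subClassOf, D) hold."""
    inferred = set(kg)
    changed = True
    while changed:  # repeat until no new fact is derived
        changed = False
        for s, p, o in list(inferred):
            if p != "isA":
                continue
            for c, q, d in list(inferred):
                if q == "subClassOf" and c == o and (s, "isA", d) not in inferred:
                    inferred.add((s, "isA", d))
                    changed = True
    return inferred


closure = infer_types(triples)
print(("Ian_Horrocks", "isA", "Person") in closure)
```

A real system would delegate this to an RDFS/OWL reasoner; the loop above only covers the subclass rule.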

KG Applications

• Search engines

• Chat robots, e.g., Alexa and Siri

• Data analytics and machine learning, e.g., explanation

• Intelligent finance

• Knowledge engineering

• ...

From Data Curation to Knowledge Graph Curation

• Data curation: a series of data management and cleaning tasks, e.g., organization, integration, annotation, publication, maintenance, reuse, preservation, etc.

• From data curation to KG curation:
  • Knowledge boosting
    • E.g., extract RDF triples from text, tabular data
  • Refinement
    • Canonicalization, error detection and correction
  • Quality measurement
  • Etc.

Tabular Data Matching (for Data Analytics and KG Boosting)

Case Study

Basic_Info
Stock Ticker   Name       GICS Sector
AMZN           Amazon     …
GOOG           Alphabet   …

Background_Info
Stock   URL              CEO Name     Country
AMZN    www.amazon.com   Jeff Bezos   USA

Finance_Result
Ticker   Date         Income ($)       Liabilities
AMZN     06/01/2018   177.86 billion   ...

(Soft) Foreign Key Discovery

• W3C RDB to RDF (Ontology) Mapping
  • table "Basic_Info" into ontology class,
  • foreign key into object property,
  • column "Income" into data property,
  • column "CEO_Name" into data property,
  • Etc.

[Figure: mapped ontology with classes Basic_Info, Background_Info and Finance_Result, object properties has_background and has_finance_result, and data properties has_name (xsd:string), has_ceo_name (xsd:string) and has_income (xsd:int)]

However, such straightforward matching brings limited semantics, e.g., what does "Jeff Bezos" mean?
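A minimal sketch of such a straightforward mapping, assuming a simplified naming scheme (`has_<column>` properties and `Table/key` subjects, both hypothetical rather than the exact W3C Direct Mapping IRIs), shows where the semantics are lost: the CEO ends up as a plain string, not as a KG entity.

```python
def row_to_triples(table, key_column, row):
    """Map one table row to triples: the table becomes a class,
    every non-key column becomes a data property with a literal value."""
    subject = f"{table}/{row[key_column]}"
    triples = [(subject, "rdf:type", table)]
    for column, value in row.items():
        if column == key_column:
            continue
        # Hypothetical naming convention for the generated property.
        triples.append((subject, "has_" + column.lower(), value))
    return triples


row = {
    "Stock": "AMZN",
    "URL": "www.amazon.com",
    "CEO_Name": "Jeff Bezos",
    "Country": "USA",
}
triples = row_to_triples("Background_Info", "Stock", row)
for t in triples:
    print(t)
```

The triple `("Background_Info/AMZN", "has_ceo_name", "Jeff Bezos")` carries no type or link for "Jeff Bezos"; semantic matching is what would replace that literal with an entity.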


(Soft) Foreign Key Discovery

• W3C RDB to RDF (Ontology) Mapping (e.g., BootOX)
  • table "Basic_Info" into ontology class,
  • foreign key into object property,
  • column "Income" into data property,
  • column "CEO_Name" into data property,
  • Etc.

• Semantic matching:
  • Column "ColumnName" by class Company, column "CEO_Name" by class Person,
  • Columns "Stock" and "CEO_Name" by property has_ceo,
  • Cell "Jeff Bezos" by entity Jeff_Preston_Bezos,
  • Etc.

[Figure: semantically matched ontology with Basic_Info, Background_Info (has_background) and Finance_Result (has_finance_result, has_income with xsd:int), now linked to class Company (has_company) and class Person (has_ceo), and to external KGs through works_for (link prediction)]

Tasks of Table Matching

• Task #1: Entity Column → Class
  • e.g., Column "CEO Name" to Class CEO and Class Person
  • equal to Table → Class matching if the entity column is a primary key

• Task #2: Cell → Entity
  • e.g., "Jeff Bezos" to Entity Jeff_Preston_Bezos
  • equal to Row → Class matching if the column is a primary key

• Task #3: Column Pair → Property
  • e.g., "Stock" and "CEO Name" → property has_ceo
  • includes data property and object property
  • two columns can come from two tables
  • column order matters

Related Work (1) -- Probabilistic Graphical Model (PGM)

• Steps
  • Models tasks #1, #2 and #3 with a PGM (e.g., Markov Random Fields): a variable for each table component; a value for each KG component
    • E.g., cell to entity by string similarity, column header to class
  • Searches for an assignment of values to the variables that maximizes the joint probability

• Examples
  • [Limaye et al. 2010, Mulwad et al. 2013, Bhagavatula et al. 2015, Takeoka et al. 2019]

• Limitations:
  • Mostly rely on metadata
  • Low scalability and efficiency

• Limaye, G., Sarawagi, S., & Chakrabarti, S. (2010). Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment, 3(1–2), 1338–1347.
• Mulwad, V., Finin, T., & Joshi, A. (2013). Semantic Message Passing for Generating Linked Data from Tables. In ISWC.
• Bhagavatula, C., Noraset, T., & Downey, D. (2015). TabEL: Entity Linking in Web Tables. ISWC (Vol. 9366).
• Venetis, P., Halevy, A., Madhavan, J., Paşca, M., Shen, W., Wu, F., … Wu, C. (2011). Recovering semantics of tables on the web. Proceedings of the VLDB Endowment, 4(9), 528–538.
• Chu, X., Morcos, J., Ilyas, I. F., Ouzzani, M., Papotti, P., Tang, N., & Ye, Y. (2015). KATARA: A Data Cleaning System Powered by Knowledge Base and Crowdsourcing. In SIGMOD (Vol. 8, pp. 1247–1261). New York, New York, USA: ACM Press.
• Takeoka, K., Oyamada, M., Nakadai, S., & Okadome, T. (2019). Meimei: An Efficient Probabilistic Approach for Semantically Annotating Tables. In AAAI.

Related Work (2) -- Iterative Approach

• A bootstrapping approach with two iterations (TableMiner+ [Zhang 2017]):
  • Gets some initial matchings using partial data from the table, e.g., column type by column header
  • Uses the initial matchings as constraints to interpret the remaining matchings of the table

• Multiple iterations with an overall score (T2KMatch [Ritze 2015])

• Limitations:
  • Performance drops without metadata
  • Hard to capture the contextual semantics

• Zhang, Z. (2017). Effective and efficient Semantic Table Interpretation using TableMiner+. Semantic Web, 8(6), 921–957.
• Ritze, D., Lehmberg, O., & Bizer, C. (2015). Matching HTML Tables to DBpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics - WIMS '15 (pp. 1–6). New York, New York, USA: ACM Press.
• Syed, Z., Finin, T., Mulwad, V., & Joshi, A. (2010). Exploiting a Web of Semantic Data for Interpreting Tables. Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, April 26-27th, 2010, Raleigh, NC: US, (April), 26–27.
• Mulwad, V., Finin, T., Syed, Z., & Joshi, A. (2010). Using linked data to interpret tables. Proceedings of the First International Workshop on Consuming Linked Data, 665.

Related Work (3) -- Deep Learning for the Semantics

• Disambiguate by learning representations of the context of the KG and the table
  • [Efthymiou 2017]: semantic embeddings of KG entities
  • [Luo 2018]: learn table features of cells

• Limitations
  • The context has not been fully explored (e.g., semantics of surrounding relations)
  • Focus more on cell to entity matching, but less on column type matching and relation matching
  • Need labeled sample sets (labor cost, generalization, transferability)

• Efthymiou, V., Hassanzadeh, O., Rodriguez-Muro, M., & Christophides, V. (2017). Matching web tables with knowledge base entities: From entity lookups to entity embeddings. In ISWC (Vol. 10587 LNCS, pp. 260–277).
• Luo, X., Luo, K., Chen, X., & Zhu, K. Q. (2018). Cross-Lingual Entity Linking for Web Tables. In AAAI.
• Nishida, K., Sadamitsu, K., Higashinaka, R., & Matsuo, Y. (2017). Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture. AAAI, 168–174. Retrieved from www.aaai.org
• Fetahu, et al. (2019). TableNet: An Approach for Determining Fine-grained Relations for Wikipedia Tables. WWW.

Related Work (4) -- From cell to entity and knowledge gap

• A straightforward solution: match cells to entities, and then determine the type by majority voting
  • e.g., [Zwicklbauer 2013]

• Knowledge gap: when (a part of) the cells have no entity correspondences
  • [Pham 2016]: compare a target column with a set of labeled columns by machine learning
    • It is hard to get enough labeled columns
  • [Quercini 2013]: predict the type with web pages queried by a search engine
    • It introduces additional noise and irrelevant content, both of which change over time
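The lookup-and-vote baseline and its weakness under a knowledge gap can be sketched as follows. The lookup table is a hypothetical stand-in for a KG lookup service, and scoring a class by the fraction of cells that have a matched entity of that class is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical lookup result: cell text -> classes of its matched entities.
LOOKUP = {
    "Google": ["Company", "ITCompany"],
    "Amazon": ["Company", "ITCompany"],
    "Tripadvisor": ["Company", "ITCompany"],
}


def vote_column_type(cells, lookup):
    """Score each class by the fraction of cells voting for it.
    Cells without any matched entity contribute no votes."""
    votes = Counter()
    for cell in cells:
        votes.update(set(lookup.get(cell, [])))
    return {cls: n / len(cells) for cls, n in votes.items()}


# Every cell is covered by the KG: the vote is confident.
covered = vote_column_type(["Google", "Amazon"], LOOKUP)

# Large knowledge gap: only one of five cells has an entity correspondence.
gap = vote_column_type(
    ["Oxford Semantic Technology", "DeepReason.ai", "Oxstem", "Oxbotica", "Tripadvisor"],
    LOOKUP,
)
print(covered, gap)
```

With the gap column every class scores only 0.2, so any reasonable threshold rejects even the correct class Company.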

• Zwicklbauer, S., Einsiedler, C., Granitzer, M., & Seifert, C. (2013). Towards disambiguating web tables. In ISWC (Poster) (Vol. 1035, pp. 205–208).
• Efthymiou, V., Hassanzadeh, O., Rodriguez-Muro, M., & Christophides, V. (2017). Matching web tables with knowledge base entities: From entity lookups to entity embeddings. In ISWC (Vol. 10587 LNCS, pp. 260–277).
• Luo, X., Luo, K., Chen, X., & Zhu, K. Q. (2018). Cross-Lingual Entity Linking for Web Tables. In AAAI.
• Pham, M., Alse, S., Knoblock, C. A., & Szekely, P. (2016). Semantic Labeling: A Domain-Independent Approach. In ISWC (pp. 446–462).
• Quercini, G., Reynaud-Delaître, C., & Reynaud, C. (2013). Entity Discovery and Annotation in Tables. EDBT/ICDT.

ColNet: An automatic column type matching framework

Problem: Column Type Prediction

• Problem
  • Annotating a column whose cells are text phrases (i.e., entity mentions) with classes of a KG, without assuming any metadata (e.g., column headers)

• Example
  • Annotate the following column "Name" by classes Company and ITCompany

Name         Employee #   CEO                  Income ($)
Google       85,050       Sundar Pichai        …
Amazon       613,300      Jeff Bezos           …
Apple        132,000      Arthur D. Levinson   ...
BlackBerry   4,044        John S. Chen         ...
Orange       155,202      Stéphane Richard     ...
A Web Table Example

Difficulty in Column Type Prediction

• Multiple and hierarchical output classes
  • Identify fine-grained but completely correct classes
  • In the above example: Company (√), ITCompany (√√), USCompany (⨉)

• Disambiguation
  • "BlackBerry" to "BlackBerry Limited (Company)" or "BlackBerry (Mobile Phone)" or "Blackberry (Fruit)"

• Column cells may have few or even no KB entity correspondences (Knowledge Gap)

ColNet in a Nutshell

• Does not assume the existence of metadata

• Column Classifier
  • Convolutional Neural Networks (CNNs) to explore contextual semantics, i.e., inter-cell correlations

• Knowledge-based learning
  • Relies on KG lookup and SPARQL query (reasoning) for automatic sampling
  • Transfer learning to bridge the knowledge gap

ColNet Framework

[Figure: ColNet framework. Components: KG (hierarchical classes, entities & facts), tabular data, entities & candidate classes, samples, column classifiers, predict & type, estimated matching, class annotations]

Step #1: Entity Lookup

[Figure: columns are looked up in the Knowledge Graph (entities & triples, schema) to obtain matched entities and candidate classes]

Example

• Column cells: Google, Amazon, Apple, BlackBerry, Orange
• Matched entities: "Google", "Amazon.com", "Amazon rainforest", "Apple Inc.", "Apple (fruit)", "OS X", "BlackBerry Limited", "Blackberry (fruit)", "Orange (fruit)", "Orange County", "Orange S.A.", etc.
• Candidate classes: Company, ITCompany, USCompany, Fruit, Software, Forest, AdministrativeRegion, etc.

Step #2: Sampling & Training

[Figure: from the Knowledge Graph (entities & triples, schema), the candidate classes and matched entities are sampled into particular entities, general entities and negative entities, which form synthetic columns used to train the CNNs]

Example for candidate class USCompany

• Particular entities: "Amazon.com", "Google", "Apple Inc.", "BlackBerry Limited", etc.
• General entities: "Twitter Inc.", "Facebook, Inc.", "US Airways", "Ford Motor Company", etc.
• Negative entities: "OS X", "Orange S.A.", "Orange (fruit)", "Apple (fruit)", "Rolls-Royce Limited", etc.
• Synthetic column: ["Google", "Amazon.com", "Apple Inc.", "BlackBerry Limited"]
• Result: a binary classifier for USCompany

Step #2: Sampling & Training (some details)

• One-vs-rest
  • Unfixed candidate classes, multiple and hierarchical output classes

• Synthetic column (size fixed): learn inter-cell correlations by CNNs
  • E.g., given a column ["Google", "BlackBerry", "Apple"], predicting cell by cell gives a final score from 0.33 to 0.66 to Company, while considering their correlations increases the score to around 1.0

• Transfer learning and automatic sampling
  • Pre-training with general entities; fine-tuning with particular entities

• Negative sampling
  • Entities of disjoint classes
  • Give higher priority to classes that appear together in a column
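The construction of fixed-size synthetic columns can be sketched as below. This is a hedged illustration only: the function name `make_samples`, the column size, and the choice to build negative columns purely from negative entities are assumptions, not details confirmed by the slides.

```python
import random

# Entity pools for one candidate class (names taken from the example).
particular = ["Amazon.com", "Google", "Apple Inc.", "BlackBerry Limited"]
negative = ["OS X", "Orange S.A.", "Orange (fruit)", "Apple (fruit)"]


def make_samples(positive_pool, negative_pool, size=3, count=4, seed=0):
    """Build labeled synthetic columns of a fixed size.
    Label 1: column drawn from entities of the class; label 0: from negatives."""
    rng = random.Random(seed)  # seeded for reproducibility
    samples = []
    for _ in range(count):
        samples.append((rng.sample(positive_pool, size), 1))
        samples.append((rng.sample(negative_pool, size), 0))
    return samples


samples = make_samples(particular, negative)
for column, label in samples:
    print(label, column)
```

Each column would then be embedded (e.g., with word2vec) and stacked into a matrix for the CNN; pre-training would use the same routine with the general-entity pool.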


Step #2: Sampling & Training (CNN)

Step #3: Prediction & Typing (Example 1)

A target column of IT companies (low knowledge gap): Google, Amazon, Apple, BlackBerry, Orange

• Synthetic columns: [Google, Amazon, Apple, BlackBerry], [Amazon, Apple, BlackBerry, Orange], ...

• Predict & Average, predicted score p:
  • Company: 0.9, ITCompany: 0.78, USCompany: 0.38, Fruit: 0.08, Software: 0.03, Forest: 0.02, AdministrativeRegion: 0.01, etc.

• KG Lookup & Vote (top-2 matching over DBpedia), voted score v:
  • Company: 1.0, ITCompany: 1.0, USCompany: 0.8, Fruit: 0.6, Forest: 0.2, etc.

• Ensemble (σ1 = 0.9, σ2 = 0.1):
  • Company: 1.0, ITCompany: 1.0, USCompany: 0.38, Fruit: 0.32, Software: 0.03, Forest: 0.02, AdministrativeRegion: 0.01, etc.

• Threshold: 0.5

Step #3: Prediction & Typing (Example 2 with big knowledge gap)

A target column with large knowledge gap (high knowledge gap): Oxford Semantic Technology, DeepReason.ai, Oxstem, Oxbotica, Tripadvisor

• Predict & Average, predicted score p:
  • Company: 0.65, ITCompany: 0.45, University: 0.21, ResearchInstitute: 0.51, etc.

• KG Lookup & Vote (top-2 matching over DBpedia), voted score v:
  • Company: 0.2, ITCompany: 0.2, University: 0, ResearchInstitute: 0, etc.

• Ensemble (σ1 = 0.9, σ2 = 0.1):
  • Company: 0.65, ITCompany: 0.45, University: 0, ResearchInstitute: 0, etc.

• Threshold: 0.5
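One way to read the ensemble step is a thresholded rule: trust the vote when it is very confident (at least σ1), reject the class when the vote is very low (below σ2), and otherwise fall back to the classifier's prediction. The rule below is an assumption inferred from the numbers of Example 2, which it reproduces; it is not claimed to reproduce every number in Example 1.

```python
def ensemble(predicted, voted, sigma1=0.9, sigma2=0.1):
    """Combine predicted scores p and voted scores v per class (assumed rule)."""
    scores = {}
    for cls in set(predicted) | set(voted):
        p = predicted.get(cls, 0.0)
        v = voted.get(cls, 0.0)
        if v >= sigma1:      # vote is confident: accept
            scores[cls] = 1.0
        elif v < sigma2:     # almost no KG support: reject
            scores[cls] = 0.0
        else:                # uncertain vote: use the classifier
            scores[cls] = p
    return scores


# Scores from Example 2 (column with a large knowledge gap).
predicted = {"Company": 0.65, "ITCompany": 0.45, "University": 0.21, "ResearchInstitute": 0.51}
voted = {"Company": 0.2, "ITCompany": 0.2, "University": 0.0, "ResearchInstitute": 0.0}

scores = ensemble(predicted, voted)
# Keep classes at or above the final threshold of 0.5.
annotations = sorted(c for c, s in scores.items() if s >= 0.5)
print(scores, annotations)
```

Under this reading the vote acts as a filter: ResearchInstitute is predicted at 0.51 but has no KG support, so it is dropped and only Company survives the 0.5 threshold.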

Evaluation #1: Web Table

• DBpedia as the KG, DBpedia lookup service, SPARQL endpoint
• Word2vec, trained on a Wikipedia article dump

• Limaye (tables from Wikipedia pages), T2Dv2 (tables from the Web)
• Both have two sets of ground truth classes, "Best" and "Okay", resulting in two evaluation models: "Strict" and "Tolerant".

Name     Columns   Avg. Cells   Different "Best" ("Tolerant") Classes
Limaye   428       23           21 (24)
T2Dv2    411       124          56 (35)

Overall Results on Limaye

• Prediction impact
  • ColNet ensemble and ColNet > Lookup-Vote
  • Improves recall dramatically

• Ensemble impact
  • ColNet ensemble > ColNet
  • Improves precision dramatically

• Comparison with the state-of-the-art
  • ColNet ensemble and ColNet > T2KMatch
  • ColNet ensemble and ColNet have precision competitive with Efthymiou17-Vote, but much higher recall and F1 score

• Baselines
  • Lookup-Vote: DBpedia entity lookup + Voting
  • T2KMatch (iterative approach)
  • Efthymiou17-Vote: entity matching by Efthymiou et al. 2017 (entity embedding) + Voting

[Figure: precision, recall and F1 score on Limaye (PK = Primary Key)]

Overall Results on T2Dv2

• Prediction impact √
• Ensemble impact √
• Comparison with the state-of-the-art √

• Knowledge gap impact
  • Limaye has shorter columns on average, which leads to a larger knowledge gap
  • The absolute precision and recall are higher on T2Dv2 than on Limaye
  • Improvements of ColNet ensemble and ColNet are more significant on Limaye than on T2Dv2, especially w.r.t. Lookup-Vote

[Figure: precision, recall and F1 score (PK = Primary Key); panels: results on Limaye, results on T2Dv2]

Evaluation #2: CSV Files (AIDA)

• Three AIDA datasets:
  • Tundra Traits, Broadband, Home Electricity Survey
  • 20 different CSV files
  • 81 entity columns

[Figure: results per dataset; panels: Broadband, Home Electricity Survey, Tundra Traits]

• Comparison with Baseline:
  • ColNet > Lookup-Vote, ColNet > T2KMatch

ColNet Extension: Richer Context for the Classifier

Column Classifier Nutshell

• Semantics from surrounding rows and columns

• Hybrid Neural Network (HNN): table feature learning that captures the contextual semantics: cell phrases and cell correlations

• A KG lookup and semantic query algorithm for column features

Col 1.A      Col 1.B
Apple        Arthur D. Levinson
BlackBerry   John S. Chen
Orange       Stéphane Richard

Col 2.A      Col 2.B
Apple        Vitamin C 7%
BlackBerry   Vitamin C 35%
Orange       Vitamin C 88%

Impact of the surrounding column: Col 1.A belongs to Company while Col 2.A belongs to Fruit

HNN Insights and Architecture

• Bidirectional Recurrent Neural Networks (BiRNNs)
  • Cell phrase embedding with word correlations

• Attention Layer
  • Higher weight on "Limited" than on "BlackBerry" w.r.t. Company;

• Convolution Layer
  • Encode the correlation between cells within a column or a row;
  • e.g., Conv("Apple", "Google Inc.") > "Apple"

[Figure: HNN architecture. Table → cell embedding (BiRNNs + attention layer) → column features and row features (convolution layer over the column and row) and cell features (fully connected layer over the main cell) → max pooling + concatenation + dropout → fully connected layer + softmax layer → scores]

Remaining Challenges

• Unfixed relative column position makes it different from traditional feature learning for text and images

• The potential relations (properties) between columns are quite predictive and transferable
  • E.g., "has_CEO" vs. "has_Vitamin"

• However, the neural network fails to capture such semantic relations


Property Vector (P2Vec)

• P2Vec is a multi-hot vector: one slot represents the likelihood of one property

• Example: assume that has_CEO corresponds to the first slot and has_Vitamin corresponds to the second-to-last slot
  • Columns (Col 1.A, Col 1.B), e.g., (Apple, Arthur D. Levinson): [1, 0, 0, …, 0, 0]
  • Columns (Col 2.A, Col 2.B), e.g., (Apple, Vitamin C 7%): [0, 0, 0, …, 1, 0]

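Property Vector (P2Vec): A Brief Procedure for P2Vec Calculation

[Figure: for a row such as ("BlackBerry", "John S. Chen"): a KB lookup on the main cell returns entities, e.g., e; a query returns their triples, e.g., <e, p, e'>, with properties, e.g., p; matching e' against the neighbouring cell, e.g., "John S. Chen", sets the slot of p in the P2Vec]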
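The procedure can be sketched with a toy KB. Everything named here is hypothetical (the property list, the lookup table, the triples, and the substring-based `matches` helper), and the slots are set to 0 or 1 rather than a graded likelihood, which is a simplification of this sketch.

```python
# Hypothetical property vocabulary: has_CEO is the first slot and
# has_Vitamin the second-to-last slot, as in the example.
PROPERTIES = ["has_CEO", "has_founder", "has_Vitamin", "has_color"]

# Toy KB: lookup table (cell text -> candidate entities) and triples.
LOOKUP = {"BlackBerry": ["BlackBerry_Limited", "Blackberry_(fruit)"]}
KB = {
    ("BlackBerry_Limited", "has_CEO", "John S. Chen"),
    ("Blackberry_(fruit)", "has_Vitamin", "Vitamin C"),
}


def matches(kb_object, cell):
    """Crude lexical match between a triple object and a cell."""
    a, b = kb_object.lower(), cell.lower()
    return a in b or b in a


def p2vec(main_cell, neighbour_cell):
    """Mark each property p for which some <e, p, e'> exists with
    e a lookup result of main_cell and e' matching neighbour_cell."""
    vec = [0] * len(PROPERTIES)
    for entity in LOOKUP.get(main_cell, []):
        for s, p, o in KB:
            if s == entity and matches(o, neighbour_cell):
                vec[PROPERTIES.index(p)] = 1
    return vec


print(p2vec("BlackBerry", "John S. Chen"))
print(p2vec("BlackBerry", "Vitamin C 35%"))
```

The same ambiguous cell "BlackBerry" yields different vectors depending on its neighbour, which is the signal the column classifier would otherwise miss.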

Ensemble: HNN + P2Vec

I. Train a classifier
  • with the input of HNN features (intermediate output) and P2Vec
  • using e.g., Multi-Layer Perceptron (MLP) and Logistic Regression (LR)

II. Average the scores predicted by
  • a classifier trained with the input of P2Vec using e.g., MLP and LR
  • the HNN

Experiment

• Data
  • KG: DBpedia
  • Table sets: Limaye (8 classes, 114 columns), Efthymiou (31 classes, 620 columns), T2Dv2 (37 classes, 411 columns)

• Settings
  • Transfer: T2D-Tr (70% of T2Dv2) for training; Limaye, Efthymiou and T2D-Te (30% of T2Dv2) for testing
  • Independent: for each table set, 70% for training; 30% for testing

• Evaluation
  • Performance of HNN, P2Vec, ensemble approaches
  • Overall accuracy in comparison with baselines

Overall results (accuracy)

• HNN + P2Vec > CNNs used in ColNet, in both independent and transfer settings
• HNN + P2Vec >> lexical matching based baselines (Lookup-Vote and T2KMatch), in the independent setting

• Transfer setting:
  • HNN + P2Vec > Lookup-Vote, T2KMatch on Limaye
  • Lookup-Vote > HNN + P2Vec > T2KMatch on Efthymiou
  • A good compromise is using 10% of local data plus transferred data for training

[Figure: accuracy results; panels labeled Transfer, Independent and Compromise]

For more

• Papers
  • Chen, Jiaoyan, et al. "ColNet: Embedding the semantics of web tables for column type prediction." AAAI, 2019.
  • Chen, Jiaoyan, et al. "Learning Semantic Annotations for Tabular Data." IJCAI, 2019.

• More project information:
  • https://github.com/alan-turing-institute/SemAIDA/
  • https://www.turing.ac.uk/research/research-projects/artificial-intelligence-data-analytics-aida

• Semantic Web Challenge on Tabular Data to Knowledge Graph Matching
  • ISWC 2019
  • With large synthetic and real data for all three tasks
  • http://www.cs.ox.ac.uk/isg/challenges/sem-tab/

Knowledge Graph Correction

Problem

• Triple assertions with literal objects (entity mentions)
  • e.g., <Yangtze_River, passesArea, "three gorges district">;
  • Such literal objects fail to capture semantics like type and linkage;
  • Quite common:
    • in KGs from wikis like DBpedia and zhishi.me
    • in KGs from tables like LinkedGeoData
    • when multiple KGs are aligned

• Triple assertions with erroneous objects
  • e.g., <Sergio_Aguero, playsFor, Manchester_United> (Aguero actually plays for Manchester City)
  • such errors are often caused by semantic or lexical confusions

Solution

• Step #1: Detection
  • Downstream applications
  • Logic rules and constraints (e.g., SPARQL query template [Kontokostas et al. WWW 2014])
  • Machine learning methods, e.g., semantic embedding, outlier detection and supervised learning

• Step #2: Correction
  • Find an entity from the KB for substitution,
    • "three gorges district" to Three_Gorges_Reservoir_Region
    • <Sergio_Aguero, playsFor, Manchester_United> to <Sergio_Aguero, playsFor, Manchester_City>

• Step #3: Canonicalization
  • Create a new entity for substitution if no existing entity can be found in Step #2

The type of the entity substitute is useful. Chen, Jiaoyan, et al. "Canonicalizing Knowledge Base Literals", ISWC 2019

Nutshell for Typing

• A knowledge based learning framework for literal object classification

• BiRNNs + attention for exploring the contextual semantics: literal and triple

• The role of constraints
  • High quality negative sampling
    • Get negative samples (entities) from the sibling classes of a target class
    • Ensure that the entity cannot be inferred
  • Hierarchical prediction (utilizing disjointness)
    • Place: 0.9, Professor: 0.8 → adopt Place;
    • Place: 0.9, Town: 0.8 → adopt both Place and Town
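A sketch of hierarchical prediction with disjointness, under stated assumptions: the class hierarchy, the disjointness axiom between Place and Person, the 0.5 threshold, and the greedy highest-score-first strategy are all illustrative choices, not taken from the slides.

```python
# Hypothetical schema: all superclasses of each class, and disjoint class pairs.
SUPERCLASSES = {
    "Place": set(),
    "Town": {"Place"},
    "Person": set(),
    "Professor": {"Person"},
}
DISJOINT = {frozenset({"Place", "Person"})}


def are_disjoint(a, b):
    """Two classes are disjoint if they, or any of their superclasses,
    are declared disjoint."""
    up_a = {a} | SUPERCLASSES.get(a, set())
    up_b = {b} | SUPERCLASSES.get(b, set())
    return any(frozenset({x, y}) in DISJOINT for x in up_a for y in up_b)


def select_types(scores, threshold=0.5):
    """Adopt classes by descending score, skipping any class that is
    disjoint with a class already adopted."""
    adopted = []
    for cls, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score < threshold:
            continue
        if any(are_disjoint(cls, kept) for kept in adopted):
            continue
        adopted.append(cls)
    return adopted


print(select_types({"Place": 0.9, "Professor": 0.8}))
print(select_types({"Place": 0.9, "Town": 0.8}))
```

Professor is rejected despite its high score because it inherits disjointness with Place from Person, while Town is kept because it is a subclass of Place.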


Framework

[Figure: literal typing and canonicalization framework.
  • Input: <Subject, Property, Literal> triples, e.g., <-, passesArea, "Three Gorges District">, <-, passesArea, "London, England">
  • Candidate classes: literal matching and a query for property objects each yield a set of entities and their classes; the two class sets are merged into the candidate classes
  • Sampling, for each candidate class c: samples of the form (triple <s, p, l>, label); particular and general samples; sample refinement
  • Neural network (classifier): triple <s, p, l> to sequence [word_1, …, word_N]; bidirectional RNNs + attention layer; training
  • Literal typing & canonicalization: for each literal, predict types and look up entities, e.g., "Three Gorges District": Location and BodyOfWater (types); "London, England": London (entity)]

Attentive Recurrent Neural Network

[Figure: network architecture.
  • Input: subject s ("Yangtze" "River"), predicate p ("passes" "Area"), literal l ("Three" "Gorges" "District")
  • Word vectors x_t, t ∈ [1, T]
  • Bidirectional RNNs: forward hidden states h_t, t ∈ [1, T]; backward hidden states h_t, t ∈ [T, 1]
  • Attention layer: query u_w, word attentions α_t, t ∈ [1, T], output vector v
  • FC layer + logistic regression → f(s, p, l)]
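The symbols in the figure are consistent with the standard word-level attention formulation over bidirectional RNN states. The equations below are a sketch under that assumption; the weights $W_w$, $b_w$, $w_f$, $b_f$ and the sigmoid output $\sigma$ are notation introduced here, not taken from the slides.

```latex
\begin{aligned}
\overrightarrow{h_t} &= \overrightarrow{\mathrm{RNN}}(x_t), \quad t \in [1, T], \\
\overleftarrow{h_t}  &= \overleftarrow{\mathrm{RNN}}(x_t), \quad t \in [T, 1], \\
h_t &= [\overrightarrow{h_t}; \overleftarrow{h_t}], \\
u_t &= \tanh(W_w h_t + b_w), \\
\alpha_t &= \frac{\exp(u_t^{\top} u_w)}{\sum_{t'=1}^{T} \exp(u_{t'}^{\top} u_w)}, \quad t \in [1, T], \\
v &= \sum_{t=1}^{T} \alpha_t h_t, \\
f(s, p, l) &= \sigma(w_f^{\top} v + b_f).
\end{aligned}
```

Here $x_t$ is the vector of the $t$-th word of the sequence built from the triple, $u_w$ is the learned query vector, and $\alpha_t$ is the attention weight of word $t$, so words such as "District" can dominate the representation $v$ when they are indicative of the target class.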

Conclusion and Outlook

Discussion and Outlook

• Semantic table annotation and KG boosting
  • Column type matching
  • Property matching
  • Robust feature learning

• KG correction
  • Detection
  • Correction and canonicalization
  • Heterogeneous KG (alignment of multiple sources)

• KG & Machine learning
  • Machine learning explanation
    • Chen, Jiaoyan, et al. "Knowledge-based transfer learning explanation." KR 2018.
  • Integration of learning and reasoning (neural-symbolic)

• Main contributors:
  • Ian Horrocks (University of Oxford)
  • Ernesto Jimenez-Ruiz (City, University of London & University of Oslo)
  • Charles Sutton (The University of Edinburgh & Google)
  • Maximilian Pflueger (Ph.D. student, University of Oxford)

• Partners and Sponsors:
  • SIRIUS, University of Oslo
  • Alan Turing Institute, London
  • Samsung Research, London (new)
  • Siemens, München (new)

Thanks for Your Attention!