
Linked Open data: CNR


Page 1: Linked Open data: CNR

1

data.cnr.it and the Semantic Scout

CNR Semantic Technology Lab
ISTC - SI

Aldo Gangemi, Alberto Salvati, Enrico Daga, Gianluca Troiani
Thanks to Claudio Baldassarre (UN-FAO) and Alfio Gliozzo (IBM-Watson)

http://stlab.istc.cnr.it
http://data.cnr.it

http://bit.ly/semanticscout

Page 2: Linked Open data: CNR

data.cnr.it

2

Page 3: Linked Open data: CNR

Enhanced SPARQL endpoint

3

Page 4: Linked Open data: CNR

Ontologies

4

Page 5: Linked Open data: CNR

Sample class from ontology

5

Page 6: Linked Open data: CNR

6

The Semantic Scout

• A framework for search, presentation, and analysis of entities and their associated knowledge
• Employs Semantic Web (SW), Linked Open Data (LOD), NLP, and IR techniques
• Scientific work goes back to 2006, first presented at ISWC2007
• An evolving prototype for requirements of the EU IP IKS: semantic search, hybrid IR/SW identity management, automatic document classification (against DBpedia)
• 2009 requirements from the technology transfer office of CNR for the NetwOrK initiative

Page 7: Linked Open data: CNR

The CNR

• CNR is the largest research institution in Italy
  – about 8000 permanent researchers (+14000)
  – 7 departments focused on the main scientific research areas
  – 108 institutes spread all over Italy
• Subdivided into research units, labs, etc.

7

Page 8: Linked Open data: CNR

The CNR data sources

[Diagram: the CNR data sources]

• Data sources (each a separate DB unless noted)
  – Curricula
  – Frameworks, Programmes, Workpackages
  – Departments
  – Institutes, Central admin, Publications
  – Permanent employees
  – Other research employees, Externally funded projects
  – Accounting, Contracts, Invoicing
  – Administration documentation (File System)
• Kinds of data covered
  – Organizational data
  – Personnel-related data
  – Activity-related data
  – Financial data
• Only partly as open data!

8

Page 9: Linked Open data: CNR

The CNR tasks

• Strategic objective: matching the research demand to the research supply
• Requirements
  – Semantic interoperability between heterogeneous data sources
  – Expert finding based on competence
  – Monitoring funding and evolution of different research areas and units
  – Browsing and reporting capabilities

9

Page 10: Linked Open data: CNR

Architecture

10

Page 11: Linked Open data: CNR

11

Page 12: Linked Open data: CNR

12

Methods for data conversion, extraction, inference, integration, linking, publishing, and searching

Page 13: Linked Open data: CNR

13

Figures

CNR Ontology
• 28 modules
• 120 classes
• 300 relations
• 1200 axioms

CNR Data
• >200K entities
• ≈3M facts (about 2M inferred or extracted)
• ≈240 datasets

Page 14: Linked Open data: CNR

14

Sources and lifting

• Situation usually not as clean as using a unique CMS for most organizational tasks
• DB (e.g. SQL Server) + a lot of textual records + HTML Web Site + textual corpus + linked open data
• DB + interaction schemata (XML templates and HTML scraping, needed because of schemata degradation and user perspective evolution)

Page 15: Linked Open data: CNR

15

Ontology design

• Starting from XML templates as module/pattern drafts
• Reengineering XML and scraped templates
• Reengineering DB schemata (system engineer involved)
• Obtained modular, pattern-based, task-based ontology
• Textual DB records with identity: precondition for hybridizing IR and SW (see later)
• Alignments to FOAF, SIOC, SKOS, WordNet ontologies
• Used patterns: situation, place, transitive reduction
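The "transitive reduction" pattern named above can be illustrated with a small sketch. It assumes the pattern is used to recover a non-transitive "direct" version of a transitive relation (for example, the immediate parent of an organizational unit). The unit names are hypothetical and the function is only an illustration, not the ontology's actual axiomatization.

```python
def transitive_reduction(pairs):
    """Keep only the pairs (a, c) with no intermediate b where a->b and b->c.

    Assumes `pairs` is the transitive closure of an acyclic, irreflexive
    relation, such as a part-of hierarchy.
    """
    pairs = set(pairs)
    nodes = {n for pair in pairs for n in pair}
    return {
        (a, c)
        for a, c in pairs
        if not any((a, b) in pairs and (b, c) in pairs for b in nodes)
    }

# Hypothetical part-of closure: lab < institute < department.
closure = {
    ("lab", "institute"),
    ("institute", "department"),
    ("lab", "department"),  # implied by transitivity, not a direct link
}
direct = transitive_reduction(closure)
print(sorted(direct))
```

The implied pair ("lab", "department") is dropped, leaving only the direct links.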

Page 16: Linked Open data: CNR

The CNR ontology

16

Page 17: Linked Open data: CNR

17

Data design

• Triplifiers based on SQL rules (automatic scripting on JDBC drivers not enough because of legacy degradation of physical schemata)
  – Cf. also: Semion reengineering tool
• Inferences: OWL (Pellet, HermiT), SPARQL CONSTRUCT
• Extraction tool: Semiosearch, categorizer over Wikipedia categories
  – Next: deep parsing approach (facts, relations, entities)
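A rule-based triplifier of the kind mentioned on this slide might look roughly like the sketch below. It is a minimal illustration, not the project's tool: it uses Python's built-in sqlite3 in place of the real database, and the table, columns, base URI, and predicate names are all hypothetical. The "rule" is a hand-written mapping from column names to predicates, which is the part an automatic script cannot guess when the physical schema has degraded.

```python
import sqlite3

# Hypothetical base URI and predicate names, used only for illustration.
BASE = "http://example.org/cnr/"

def triplify(conn, table, key, column_to_predicate):
    """Turn each row of `table` into (subject, predicate, object) triples.

    `column_to_predicate` is the hand-written rule: it maps a column name
    to the predicate URI that should carry that column's value.
    """
    columns = [key] + list(column_to_predicate)
    # Table and column names come from the trusted rule, not from user input.
    sql = "SELECT {} FROM {} ORDER BY {}".format(", ".join(columns), table, key)
    triples = []
    for row in conn.execute(sql):
        subject = "{}{}/{}".format(BASE, table, row[0])
        for column, value in zip(columns[1:], row[1:]):
            if value is None:  # legacy rows often have holes; skip them
                continue
            triples.append((subject, column_to_predicate[column], value))
    return triples

# Toy in-memory database standing in for a legacy source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE institute (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO institute VALUES (?, ?, ?)",
    [(1, "Institute A", "Rome"), (2, "Institute B", None)],
)

rules = {"name": BASE + "ontology/name", "city": BASE + "ontology/city"}
triples = triplify(conn, "institute", "id", rules)
for triple in triples:
    print(triple)
```

The second row has no city, so it yields one triple instead of two; a real pipeline would serialize the result as RDF and load it into a triple store.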

Page 18: Linked Open data: CNR

18

Publishing and hybridizing

• Publishing OWL-RDF datasets
  – linked data approach (persistent URIs, triple stores for RDF dataset management, linking to common vocabularies: FOAF, DBpedia, Geonames, Bibo, ...)
  – OWL ontologies for dataset generation, querying, inference (new enriched datasets)
• Subgraph extraction through SNA
• Virtual semantic corpus
  – IRW to distinguish information and non-information resources
  – SPARQL rules to generate virtual texts associated with entities
• Indexing
  – Lucene+LSA indexing of semantic corpus
  – “Semantic” Lucene extension to produce tight coupling of virtual texts with entities
  – Multilinguality
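The idea of a virtual semantic corpus can be sketched in a few lines: each entity gets a "virtual text" assembled from the literals around it in the graph, and that text is indexed so that a keyword hit leads straight back to the entity. The sketch below is a toy stand-in under stated assumptions: plain Python functions replace the SPARQL rules, a simple inverted index replaces Lucene+LSA, and all URIs and labels are hypothetical.

```python
from collections import defaultdict

# Toy triples; all URIs and labels are hypothetical.
# Objects starting with "ex:" are treated as resources, anything else as a literal.
TRIPLES = [
    ("ex:person1", "ex:name", "Researcher One"),
    ("ex:person1", "ex:worksOn", "ex:project1"),
    ("ex:project1", "ex:title", "Reputation and social networks"),
    ("ex:person2", "ex:name", "Researcher Two"),
    ("ex:person2", "ex:worksOn", "ex:project2"),
    ("ex:project2", "ex:title", "Ontology design patterns"),
]

def literals_of(entity):
    """Literal values directly attached to an entity."""
    return [o for s, _, o in TRIPLES if s == entity and not o.startswith("ex:")]

def virtual_text(entity):
    """Concatenate the entity's literals with those of its direct neighbours."""
    parts = literals_of(entity)
    for s, _, o in TRIPLES:
        if s == entity and o.startswith("ex:"):
            parts.extend(literals_of(o))
    return " ".join(parts)

def build_index(entities):
    """Inverted index: lower-cased token -> set of entity URIs."""
    index = defaultdict(set)
    for entity in entities:
        for token in virtual_text(entity).lower().split():
            index[token].add(entity)
    return index

def search(index, query):
    """Entities whose virtual text contains every query token."""
    hits = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*hits) if hits else set()

index = build_index(["ex:person1", "ex:person2"])
print(virtual_text("ex:person1"))
print(search(index, "reputation"))
```

Because the index maps tokens to entity URIs rather than to documents, a hit on "reputation" returns the person, even though the word only occurs in the title of a project that person works on.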

Page 19: Linked Open data: CNR

19

Consuming

• SPARQL endpoint, with interface enhancement
• Keyword-based search
  – Semantic browsing with SPARQL-based AJAX DHTML, RDF relation browser, or XML-based relation browser
• Category-based search
  – Keyword-based result focusing
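For readers unfamiliar with SPARQL endpoints, the sketch below shows how a client could address one over HTTP using the standard SPARQL Protocol, where a GET request carries the query in a `query` parameter. The endpoint path is an assumption (the slides only give the site address), and the query is deliberately generic so that no CNR vocabulary terms are invented. The code only builds the request URL and does not contact the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint path; the slides only give the site, http://data.cnr.it.
ENDPOINT = "http://data.cnr.it/sparql"

# Generic query: list any ten triples.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""

def sparql_get_url(endpoint, query):
    """Build a SPARQL Protocol GET URL: the query goes in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = sparql_get_url(ENDPOINT, QUERY)
print(url)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY.strip())
```

To actually run the query, the URL would be fetched with an `Accept: application/sparql-results+json` header, for example through `urllib.request`.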

Page 20: Linked Open data: CNR

20

Page 21: Linked Open data: CNR

21

Page 22: Linked Open data: CNR

http://bit.ly/semanticscout

22

Page 23: Linked Open data: CNR

23

Expert finding: Task-based testing

• It is based on the ability to materialize on demand a contextual network of relevant information.

• It is performed with a combination of tools in the toolkit to:
  – Identify the main topics of research
  – Recursively search the CNR data cloud
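Materializing a contextual network on demand can be pictured as a bounded graph expansion from an entry point. The sketch below uses an iterative breadth-first search (in place of literal recursion) over a toy, undirected data cloud. The identifiers and the depth limit are hypothetical, and the real system works over RDF datasets with a mix of tools.

```python
from collections import deque

# Toy undirected data cloud; every identifier here is hypothetical.
EDGES = [
    ("programme:P1", "project:X"),
    ("project:X", "person:A"),
    ("project:X", "person:B"),
    ("person:B", "publication:Q"),
    ("publication:Q", "person:C"),
    ("programme:P2", "person:D"),
]

def neighbours(node):
    """Nodes linked to `node` in either direction."""
    out = []
    for a, b in EDGES:
        if a == node:
            out.append(b)
        elif b == node:
            out.append(a)
    return out

def contextual_network(entry_point, max_depth):
    """Breadth-first expansion from an entry point, up to `max_depth` hops.

    Returns a dict mapping each reached node to its distance from the entry.
    """
    distance = {entry_point: 0}
    queue = deque([entry_point])
    while queue:
        node = queue.popleft()
        if distance[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in neighbours(node):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance

network = contextual_network("programme:P1", max_depth=2)
people = sorted(n for n in network if n.startswith("person:"))
print(people)
```

Raising `max_depth` widens the network: at depth 4 the search also reaches `person:C` through a shared publication, while `person:D` stays out because it is not connected to the entry point.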

Page 24: Linked Open data: CNR

24

Identifying the main topics of research: project description

• “Reputation is a social knowledge, on which a number of social decisions are accomplished. Regulating society from the morning of mankind becomes more crucial with the pace of development of ICT technologies, dramatically enlarging the range of interaction and generating new types of aggregation. Despite its critical role, reputation generation, transmission and use are unclear. The project aims to an interdisciplinary theory of reputation and to modeling the interplay between direct evaluations and meta-evaluations in three types of decisions, epistemic (whether to form a given evaluation), strategic (whether and how interact with target), and memetic (whether and which evaluation to transmit).”
  – Project About: Social Knowledge for e-Governance.
  – Topics can be manually annotated, or automatically induced, e.g.: ethics, sociology, collaboration, social network, reputation

Page 25: Linked Open data: CNR

25

Identifying the main topics of research: text categorization

• Query: “ethics, sociology, collaboration, social network, reputation”

Page 26: Linked Open data: CNR

26

Search the CNR data cloud: identify an entry point

• “Commessa” (programme): “Il Circuito dell’Integrazione: Mente, Relazioni e Reti Sociali. Simulazione Sociale e Strumenti di Governance”

Page 27: Linked Open data: CNR

27

Search the CNR data cloud: identify key people

• Ing. Jordi Sabater: Cognitive Science;
• Dott. Mario Paolucci: Sociology, Psychology;
• Gennaro di Tosto: Artificial Intelligence;
• Walter Quattrociocchi: Interdisciplinary Fields;
• Giuseppe Castaldi: Ethics;
• Aldo Gangemi: Semantic Web, Knowledge representation.

Page 28: Linked Open data: CNR

28

Expert Finding: Results

• The description of the “eRep” project was adopted as a gold standard to evaluate the results when testing the Semantic Scout.
• 6 out of 10 CNR researchers were correctly retrieved, along with a project member affiliated with another institution.
  – Project Coordinator: Dott. Mario Paolucci
  – External Member: Jordi Sabater Mir

Page 29: Linked Open data: CNR

29

Functional evaluation of Semantic Scout (example)

• Expert finding accuracy
  – All 6 retrieved people scored among the first 10 in the result from the search engine.
• Benefit of integrated data cloud
  – The user judged an “activity” to be relevant to his goal and used it as an entry point to the CNR network of resources.

Page 30: Linked Open data: CNR

30

Functional evaluation of Semantic Scout

• Accessibility and Interaction
  – Multiple user interfaces guarantee the users an adaptive level of interaction for each specific type of required information
• Completeness of retrieval
  – 4 people have not been included in our result set.
  – Antonietta Di Salvatore: scored below the first 10 people in the list; (+1)
  – Giulia Andrighetto was not listed among the people relevant to the query, but belongs to the social network of Dr. Rosaria Conte. (+1)
  – Marco Capenni and Stefano Picascia: have a technician profile, hence they are neither reported among the people relevant to the search query, nor belong to the network of any of the other researchers.
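The completeness figures can be turned into a simple recall calculation. The counts below come from the evaluation slides. Reading "(+1)" as "one more person could be recovered by looking further down the list or by following the social network" is an assumption, so the relaxed figure is only indicative.

```python
# Counts taken from the evaluation slides; the reading of "(+1)" is an assumption.
gold_cnr_members = 10      # CNR researchers in the gold standard
retrieved_in_top_10 = 6    # CNR researchers found among the first 10 results
recoverable = 2            # the two people marked "(+1)" (assumed meaning)

def recall(found, total):
    """Fraction of gold-standard people that were found."""
    return found / total

strict = recall(retrieved_in_top_10, gold_cnr_members)
relaxed = recall(retrieved_in_top_10 + recoverable, gold_cnr_members)
print(f"strict recall:  {strict:.0%}")
print(f"relaxed recall: {relaxed:.0%}")
```

Under these assumptions, strict recall is 60% and the relaxed figure is 80%; the two people with a technician profile remain out of reach either way.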

Page 31: Linked Open data: CNR

Ongoing work

• More data linking (e.g. DBLP, Georeferencing)
• Synchronization with data sources
• More interaction paradigms
• Privacy issues interlaced with hierarchical and idiosyncratic practices

31

Page 32: Linked Open data: CNR

Conclusions

• Hybridizing several semantic and retrieval technologies provides added value to a research organization
• Scalability works for CNR figures
• Interaction is a core selling point
• Try it at http://bit.ly/semanticscout
• @data_cnr_it, @semanticscout, @aldogangemi

32