DH101 2013/2014 course 6 - Semantic coding, RDF, CIDOC-CRM


Digital Humanities 101 - 2013/2014 - Course 6

Digital Humanities Laboratory

Frederic Kaplan

frederic.kaplan@epfl.ch

Semester 1 : Content of each course

• (1) 19.09 Introduction to the course / Live tweeting and collective note-taking

• (2) 25.09 Introduction to Digital Humanities / Wordpress / First assignment

• (3) 2.10 Introduction to the Venice Time Machine project / Zotero

•9.10 No course

• (4) 16.10 Digitization techniques / Deadline first assignment

• (5) 23.10 Datafication / Presentation of projects

• (6) 30.10 Semantic modelling / RDF / Deadline peer-reviewing of first assignment


• (7) 6.11 Pattern recognition / OCR / Semantic disambiguation

• (8) 13.11 Historical Geographical Information Systems, Procedural modelling / City Engine / Deadline Project selection

• (9) 20.11 Crowdsourcing / Wikipedia / OpenStreetMap

• (10) 27.11 Cultural heritage interfaces and visualisation / Museographic experiences

•4.12 Group work on the projects

•11.12 Oral exam / Presentation of projects / Deadline Project blog

•18.12 Oral exam / Presentation of projects


Objective of today’s course

• Show you the beauty, and make you feel the power, of semantic coding.

• Give you a quick idea of what lies behind the following strange acronyms: RDF, URI, OWL, SPARQL, SWRL, CIDOC-CRM.

• Motivate you to look deeper.


A short introduction to semantic coding

• Many good books exist. I recommend this one.

• I will reuse some of their examples in the following slides.


Doris Stockly


incanti.dhlab.ch


The simplest kind of dataset, one that everyone is familiar with, is tabular data (any data kept in a table, such as an Excel spreadsheet).


Data kept in a table is easy to display, sort, print and edit.


You might not even think of data in an Excel spreadsheet as modeled. But there are semantics in a data table. Where?


There are also obvious limitations with this kind of storage.


You cannot search for the routes that stay more than 2 days at Corfu. Sorting the columns does not capture the deeper meaning of the text we entered.


Relational databases are a solution. Many very mature products exist, like Oracle DB, MySQL and PostgreSQL. A relational database allows multiple tables to be joined in a standardized way.
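
A minimal sketch of such a join, using Python's built-in sqlite3 module rather than any of the products named above. The table layout, the column names and every value except route R1 leaving Venice are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical schema: one table of routes, one table of stops on each route.
# The date assumes that 2.7.1422 reads as day.month.year.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE route (id TEXT PRIMARY KEY, departure TEXT, departure_date TEXT);
    CREATE TABLE stop  (route_id TEXT REFERENCES route(id), port TEXT, days INTEGER);
    INSERT INTO route VALUES ('R1', 'Venice', '1422-07-02'), ('R2', 'Venice', '1423-06-15');
    INSERT INTO stop  VALUES ('R1', 'Corfu', 3), ('R1', 'Candia', 5), ('R2', 'Corfu', 1);
""")

def routes_staying_longer(port, min_days):
    """Return the ids of routes that stay more than min_days at the given port."""
    rows = conn.execute(
        "SELECT route.id FROM route JOIN stop ON stop.route_id = route.id "
        "WHERE stop.port = ? AND stop.days > ? ORDER BY route.id",
        (port, min_days),
    )
    return [row[0] for row in rows]

print(routes_staying_longer("Corfu", 2))
```

Once the length of stay is a typed column in its own table, the Corfu question that defeated the spreadsheet becomes a one-line query.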


But as our project goes on, we may need to reformat our tables. This is called schema migration, a painful process.


For big databases, schemas can get incredibly complex.


Trying to normalize these databases into a single schema is a labor-intensive process.


How to make future-proof schemata


• With this mode of coding, we can easily add new properties (price of the route, captain, etc.). The schema is future-proof.

• In addition, the data about the data (i.e. the metadata, the names of the columns) is now part of the data itself.

• This is ideal for projects in perpetual beta.
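
One way to picture this mode of coding is to turn each cell of a table row into its own (entity, property, value) record. The sketch below assumes a made-up row for route R1; the column names and the captain identifier C1 are illustrative, not taken from a real dataset.

```python
# A made-up spreadsheet row for route R1 (column names are illustrative).
row = {"id": "R1", "departure": "Venice", "departure-date": "2.7.1422"}

def row_to_triples(row, key="id"):
    """Turn one table row into (entity, property, value) records."""
    subject = row[key]
    return [(subject, column, value) for column, value in row.items() if column != key]

triples = row_to_triples(row)

# Adding a new property is just one more record: no table has to be altered.
# "captain" and "C1" are hypothetical.
triples.append(("R1", "captain", "C1"))

for triple in triples:
    print(triple)
```

The column names now appear inside the records themselves, which is what it means for the metadata to become part of the data.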


And, most importantly, it makes a direct and simple connection with a well-developed research field: logic.


Indeed, this can be written in a different way


RDF statements

• (Subject Predicate Object)

• (R1 departure Venice)

• This is called an RDF statement, an atomic relation in a database.

• (R1 departure-date 2.7.1422)


This is a graph


As RDF statements can be understood both as logic statements and as parts of a graph, one can use many tools and ideas from logic and graph theory to manipulate them.
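
To make the graph reading concrete, here is a small sketch that indexes statements as a directed labelled graph and walks it. The third statement (a stop at Corfu) is a hypothetical addition; the function names are illustrative.

```python
from collections import defaultdict

triples = [
    ("R1", "departure", "Venice"),
    ("R1", "departure-date", "2.7.1422"),
    ("R1", "stop", "Corfu"),  # hypothetical extra statement
]

def build_graph(triples):
    """Index triples as a directed labelled graph: node -> [(edge label, node)]."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph

def reachable(graph, start):
    """Return every node reachable from start by following edges (depth-first)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

graph = build_graph(triples)
print(reachable(graph, "R1"))
```

Each subject and object is a node, and each predicate is the label of an edge between them, so standard graph algorithms apply unchanged.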


URIs

•The nodes of the Graph are called Resources.

• When you want to coordinate multiple datasets, it can become increasingly difficult to guarantee unique and consistent identifiers for each node.

• The R1 that we use in our database may mean something else in another database.

• For naming resources, RDF uses URIs (Uniform Resource Identifiers) with an optional fragment identifier.


• You are probably familiar with URLs (Uniform Resource Locators), the strings used to specify how web pages are retrieved.

• URIs generalize this concept further by saying that anything, whether you can retrieve it electronically or not, can be uniquely identified in a similar way.


Since URIs can identify anything as a resource, the subject of an RDF statement can be a resource, the object can be a resource and, most importantly, predicates are always resources.


An example of a URIref for a common RDF predicate: http://www.w3.org/1999/02/22-rdf-syntax-ns#type


It is common in RDF to shorten URIs by assigning a namespace prefix to the base URI and writing only the distinctive part of the identifier. The previous URI can be written in a shorter manner: rdf:type
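
The shortening is a plain string substitution, as this sketch suggests. The rdf namespace URI is the standard W3C one; the "ex" prefix and its URI are made up for the example.

```python
# Prefix table. The rdf namespace is the standard one; "ex" is a made-up
# namespace used only for this example.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/routes#",
}

def expand(qname):
    """Expand a prefixed name such as 'rdf:type' into a full URI."""
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local

def shorten(uri):
    """Shorten a full URI to prefix:local if a known namespace matches."""
    for prefix, base in PREFIXES.items():
        if uri.startswith(base):
            return prefix + ":" + uri[len(base):]
    return uri

print(expand("rdf:type"))
print(shorten("http://example.org/routes#R1"))
```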


Serialization

• While the data model that RDF uses is very simple, the serialized representation tends to get complicated when an RDF graph is saved in a file or sent over a network.

• Different serialization formats exist: N3, RDF/XML (the most frequently used), RDFa (RDF in attributes).
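
As a rough illustration of what serialization involves, the sketch below writes triples one statement per line, in the simple line-based style of N-Triples (a format not listed above, chosen here only because it is the easiest to write by hand). The example.org namespace and the rule for telling URIs from literals are assumptions.

```python
EX = "http://example.org/routes#"  # made-up namespace

def to_ntriples(triples):
    """Serialize (subject, predicate, object) triples, one statement per line.

    Subjects and predicates are treated as URIs. An object is written as a
    URI if it starts with 'http', otherwise as a plain string literal
    (a simplification for this sketch).
    """
    lines = []
    for s, p, o in triples:
        if o.startswith("http"):
            obj = "<" + o + ">"
        else:
            obj = '"' + o.replace("\\", "\\\\").replace('"', '\\"') + '"'
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines)

print(to_ntriples([
    (EX + "R1", EX + "departure", EX + "Venice"),
    (EX + "R1", EX + "departureDate", "2.7.1422"),
]))
```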


Vocabularies

•A set of URIRefs is known as a vocabulary.

•We can design a specific vocabulary for our maritime route examples.

• There are also famous vocabularies like the RDF vocabulary (the set of URIRefs describing the RDF concepts, e.g. rdf:resource, rdf:type).


SPARQL

• Just as SQL provides a standard query language across relational databases, SPARQL (pronounced "sparkle") provides a query language for RDF graphs.

• SPARQL queries attempt to match patterns in the graph and bind wildcard variables as they find solutions.

• Departure(?x1, Venice)

• Captain(?x1, ?x2), Gender(?x2, Woman)

•Semantic coding is all about asking bigger questions.
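
The pattern matching described above can be sketched in a few lines of Python. This is a toy matcher written for illustration, not how a SPARQL engine is implemented; apart from (R1 departure Venice), all data are made up.

```python
triples = [
    ("R1", "departure", "Venice"),
    ("R1", "captain", "C1"),     # hypothetical data from here on
    ("C1", "gender", "Woman"),
    ("R2", "departure", "Genoa"),
    ("R2", "captain", "C2"),
    ("C2", "gender", "Man"),
]

def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)

# Routes leaving Venice whose captain is a woman.
query = [("?x1", "departure", "Venice"),
         ("?x1", "captain", "?x2"),
         ("?x2", "gender", "Woman")]
results = list(match(query, triples))
print(results)
```

Because ?x1 and ?x2 are shared between patterns, a solution must satisfy all three at once, which is what lets a query chain several statements into one bigger question.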


SWRL

• With RDF coding, we can also write rules to infer new triples.

• If hasParent(?x1, ?x2) and hasBrother(?x2, ?x3) then hasUncle(?x1, ?x3)

• This is also a way of detecting possible inconsistencies in the knowledge coded in the triple store (e.g. actors doing things after their death).

• One standard language to do this is SWRL (Semantic Web Rule Language).
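
The uncle rule above can be applied by hand with a short loop. This is a plain Python sketch of the inference step, not SWRL syntax; the three individuals are made up.

```python
triples = {
    ("Anna", "hasParent", "Marco"),    # made-up individuals
    ("Marco", "hasBrother", "Pietro"),
}

def infer_uncles(triples):
    """Apply: hasParent(x1, x2) and hasBrother(x2, x3) => hasUncle(x1, x3)."""
    inferred = set()
    for x1, p1, x2 in triples:
        if p1 != "hasParent":
            continue
        for y, p2, x3 in triples:
            if p2 == "hasBrother" and y == x2:
                inferred.add((x1, "hasUncle", x3))
    # Only return statements that were not already in the store.
    return inferred - triples

new_triples = infer_uncles(triples)
print(new_triples)
```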


Ontologies

• An ontology provides a special vocabulary with which knowledge can be represented.

• This vocabulary allows us to specify which entities will be represented, how they can be grouped and which relationships connect them.

• (Venice isa Place), (Corfu isa Place), (Place haslat latitude), (Place haslong longitude)

•Now, something very beautiful...


An ontology can be expressed as RDF triples and stored in a graph alongside the data it describes.


OWL

• OWL (Web Ontology Language) is an ontology language layered on top of RDF and RDFS.

• Terminology statements

• ex:Bridge rdf:type rdfs:Class

• ex:Bridge rdfs:subClassOf ex:Place

• Assertion statements

• ex:Rialto rdf:type ex:Bridge

• ex:RialtoCons ex:broughtIntoExistence ex:Rialto
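
Because terminology and assertions live in the same graph, a few lines are enough to combine them. The sketch below follows rdfs:subClassOf links to conclude that the Rialto is also a Place; the function name is illustrative and only this one RDFS inference is covered.

```python
triples = {
    ("ex:Bridge", "rdf:type", "rdfs:Class"),
    ("ex:Bridge", "rdfs:subClassOf", "ex:Place"),
    ("ex:Rialto", "rdf:type", "ex:Bridge"),
}

def types_of(resource, triples):
    """Return the asserted types of a resource plus all their superclasses."""
    found = set()
    todo = [o for s, p, o in triples if s == resource and p == "rdf:type"]
    while todo:
        cls = todo.pop()
        if cls not in found:
            found.add(cls)
            # Climb one level up the class hierarchy.
            todo.extend(o for s, p, o in triples
                        if s == cls and p == "rdfs:subClassOf")
    return found

print(types_of("ex:Rialto", triples))
```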


It is relatively easy to create your own ontology using software like Protégé. But some ontologies aim at being universal.


CIDOC-CRM

• CIDOC-CRM is an ontology for cultural heritage.

• About 20 years of work.

• An ISO standard (ISO 21127).

• 100+ schema. Very stable.

• CIDOC-CRM is an attempt to formalise an underlying semantics common to many classifications. It includes very interesting ideas.


CIDOC-CRM : Events

• In CIDOC-CRM, the modelling is event-centric.

• The underlying idea is to model change, not state. Therefore, temporal entities play a central role.

• Instead of coding the birthdate of an actor, it is better to code the event of their birth.


Actors relate to things only via temporal entities and events.


• The participation or presence of several non-temporal entities in an event e1 allows us to conclude that they have been in the same time interval and space, even without knowledge of the particular time or space.

• They must have existed at that time. They cannot have been somewhere else at that time (with electronic communication, the space volume in which events occur can become very large).

• The events e0i of creation of each participant i have happened before or at the time of e1. The events e2i of destruction (or vanishing) of each participant i have happened after or at the time of e1.
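
The ordering constraint above (creation, then event, then destruction) can be checked mechanically, even when each date is only known as an interval. The sketch below is one possible encoding; the class, the function names and all years are assumptions made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    """A period known only by its earliest and latest possible year."""
    earliest: int
    latest: int

def may_precede(a, b):
    """True if a can have happened before or at the time of b."""
    return a.earliest <= b.latest

def participation_is_possible(creation, event, destruction):
    """Check creation <= event <= destruction for one participant."""
    return may_precede(creation, event) and may_precede(event, destruction)

# Made-up years for one actor and two events.
birth = Interval(1390, 1395)
death = Interval(1440, 1450)
voyage = Interval(1422, 1422)
late_voyage = Interval(1460, 1465)

print(participation_is_possible(birth, voyage, death))
print(participation_is_possible(birth, late_voyage, death))
```

The second check fails because the later voyage starts after the latest possible death, the kind of "actor doing things after their death" inconsistency that such constraints expose.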


CIDOC-CRM : Properties

• The property "P11 had participant" denotes active or passive involvement of Actors, whereas "P12 occurred in the presence of" extends to objects just being there (e.g. a desk where a treaty was signed).

• The properties "P92 brought into existence" and "P93 took out of existence" delimit the existence of things that have a persistent existence.


CIDOC-CRM : Place

• CIDOC-CRM has also implemented a very interesting model for places. What is hard about places?

• The question "where is it?" can be answered in natural language by relation to two different kinds of entities: geometric areas or objects.


In France, in Athens, 39N 124E. Points given by spatial coordinates are typically understood as the centre of a wider, extended area.


on mount St Helens, at the Rhine river.


on Queen Elizabeth (the ship), in my suitcase, at home.


CIDOC-CRM : Place

• Following the CIDOC CRM, geometric areas (E53 Place) can only be defined relative to larger objects, including the surface of the Earth.

• Those objects in turn may be located at different places at different times (relative to a larger object).

• The cultural interest is in the relation to other things and not to an abstract absolute space. Absolute coordinates seem to make no sense when the reference objects move.

• As historical information is incomplete and sparse, and many reference objects move, normalization of place information to absolute coordinates should not replace the primary information, which is typically relative.


CIDOC-CRM : Influence

• Another problematic issue is the notion of influence. It is difficult to develop a systematic understanding of the different forms of influence and their mutual relations.

• Some are more physical, like using a mould or a tool. The influence of a mould on a produced object is strong and can often be verified on the object afterwards. The influence of a hammer is less specific.

• Similarly, making a copy of a painting has a strong influence on the product; copying the idea of a painting, a weak one. The latter is more an intellectual influence than a physical one.

• If a real influence existed, a temporal sequence can be deduced.


Summary : Guidelines for coding historical data


(1) Prefer events to properties. Actors do not have properties, they participate in events. Instead of coding the birthdate of an actor, it is better to code the event of their birth.


(2) Code date intervals instead of dates. This is much more flexible and makes it possible to detect inconsistencies.


(3) Code places in a relative manner and not an absolute manner. The cultural interest is in the relation to other things and not to an abstract absolute space. Absolute coordinates seem to make no sense when the reference objects move.


All this is very beautiful, but is it sufficient to do the kind of historical modeling we want to do? We have an issue. Which one?


Metaknowledge: knowledge about how knowledge is produced.


How can we encode metaknowledge

• Expressed knowledge (RDF triples) is not in the same space as resources (URIs). We can easily attach new information to a resource, but not to a triple.

• It is not easy to represent metaknowledge, such as the origin of the uncertainty linked with a piece of information.

• To overcome this issue, we need to introduce two levels of knowledge and use a trick.


Reified RDF vs. Standard RDF

• An expressed RDF triple (RialtoReconstruction hasTimeSpan 1588-1591) can be transformed into 3 reified triples

• (s1 rdf:subject RialtoReconstruction)

• (s1 rdf:predicate hasTimeSpan)

• (s1 rdf:object 1588-1591)


• (s1 metardf:reliability 0.8)

• (s1 metardf:creator FredericKaplan)
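
The transformation above is mechanical and can be sketched as a small helper. The function name and the way statement identifiers (s1, s2, ...) are generated are assumptions; the metardf properties are the ones used on the slide.

```python
import itertools

_ids = itertools.count(1)  # source of statement identifiers s1, s2, ...

def reify(triple, **metadata):
    """Describe a triple with a statement node that metadata can attach to."""
    subject, predicate, obj = triple
    node = "s%d" % next(_ids)
    statements = [
        (node, "rdf:subject", subject),
        (node, "rdf:predicate", predicate),
        (node, "rdf:object", obj),
    ]
    # Each keyword argument becomes one metadata triple about the statement.
    statements += [(node, "metardf:" + key, value)
                   for key, value in metadata.items()]
    return statements

reified = reify(("RialtoReconstruction", "hasTimeSpan", "1588-1591"),
                reliability=0.8, creator="FredericKaplan")
for statement in reified:
    print(statement)
```

The statement node is an ordinary resource, so reliability, creator or any other annotation can be attached to it like any other property.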


Possible historical spaces

• Now our RDF store includes both historical knowledge and knowledge about the creation of this historical knowledge.

• These kinds of metainformation can document all the construction phases (whether realized by humans or machines).

• With this approach, we can extract through queries the historical knowledge corresponding to some specific sources and thus create a possible historical reality.
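
Extracting the knowledge attributed to one source could look like the sketch below. The second statement (s2), its time span and the creator name AnotherSource are made up for illustration, as is the function name.

```python
# Reified statements from two sources about the same event.
# Statement s2 and its creator are hypothetical.
store = [
    ("s1", "rdf:subject", "RialtoReconstruction"),
    ("s1", "rdf:predicate", "hasTimeSpan"),
    ("s1", "rdf:object", "1588-1591"),
    ("s1", "metardf:creator", "FredericKaplan"),
    ("s2", "rdf:subject", "RialtoReconstruction"),
    ("s2", "rdf:predicate", "hasTimeSpan"),
    ("s2", "rdf:object", "1587-1592"),
    ("s2", "metardf:creator", "AnotherSource"),
]

def possible_reality(store, creator):
    """Rebuild the plain triples asserted by one creator."""
    nodes = {s for s, p, o in store
             if p == "metardf:creator" and o == creator}
    result = []
    for node in sorted(nodes):
        parts = {p: o for s, p, o in store if s == node}
        result.append((parts["rdf:subject"],
                       parts["rdf:predicate"],
                       parts["rdf:object"]))
    return result

print(possible_reality(store, "FredericKaplan"))
```

Each choice of sources yields its own set of plain triples, so conflicting accounts can coexist in one store without either being declared the truth.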


Summary


Encoding metahistorical information

• We must not only model historical information, but model each step of the construction of historical knowledge.

• There is a need for a semantic framework capable of coding historical information and meta-historical information.

• Coding meta-historical information implies documenting the choice of sources, transcription phases, and interpretation processes realized by humans or machines.


No unique global truth but fully documented possible historical reconstructions
