VOLT: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data & its Application to Spatiotemporally Dependent Data
Blake Regalia, Krzysztof Janowicz, and Song Gao
June 2, 2016 - ESWC 2016
STKO Lab, University of California, Santa Barbara, CA, USA
motivation
linked data
Linked Data has successfully provided methods and tools that ease the publication, retrieval, sharing, reuse, and integration of rich data across heterogeneous sources on the web.
For these reasons, we have seen a rapid increase of data sources in Linked Open Data as well as an uptake of the involved technologies by organizations in academia, government, and industry.
problem
However, several hurdles still prevent data consumers from using and applying Linked Data to its full potential.
We believe that these key issues need to be addressed:
▶ data quality, coverage, and longevity
▶ background knowledge needed to query distant data
▶ reproducibility of query results and their derived findings
▶ lack of accessible computational capabilities
solution
To address these issues, we propose a computational framework, VOLT (VOLT Ontology and Linked-data Technology), and its proxy.
In this presentation, we:
1. Illustrate the need for computation in Linked Data
2. Introduce the VOLT framework
3. Explain how the VOLT proxy works
4. Examine a case study
5. Demonstrate how the framework generalizes
need for computation
dependent data
How can we use the population density of a place?
Area = Population / Density = 8491079 ppl ÷ 10755.995322 ppl/km² ≈ 789.4275 km²
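The arithmetic on this slide can be checked directly; a quick sketch in Python, using the numbers shown above:

```python
# Derive the implied land area from DBpedia's population and density values.
population = 8491079      # ppl (dbo:populationTotal)
density = 10755.995322    # ppl/km^2 (dbo:populationDensity)

area_km2 = population / density
print(area_km2)           # ≈ 789.4275 km^2
```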
dependent data
So, population density reflects dbo:areaLand?
Area = Population / Density = 446007 ppl ÷ 1100 ppl/km² ≈ 405.461 km²
filter out inconsistencies
select ?place ?density ?landAreaErrorKm ?totalAreaErrorKm ?errorMargin ?closerAreaProperty {
  # triple patterns
  ?place dbo:populationDensity ?density ;
         dbo:populationTotal ?population ;
         dbo:areaTotal ?totalArea ;
         dbo:areaLand ?landArea .
  # avoid division by zero, ignore bad values
  filter(?density != 0 && ?landArea != 0 && ?totalArea != 0 && ?population != 0)
  # no duplications
  filter not exists { ?place dbo:populationDensity ?wd . filter(?density != ?wd) }
  filter not exists { ?place dbo:populationTotal ?wp . filter(?population != ?wp) }
  filter not exists { ?place dbo:areaLand ?wla . filter(?landArea != ?wla) }
  # calculate expected area
  bind(?population / ?density as ?expectedAreaKm)
  # convert given area values to km² units
  bind(?landArea / 1000000 as ?landAreaKm)
  bind(?totalArea / 1000000 as ?totalAreaKm)
  # compute amount of error in each area property
  bind(abs(?landAreaKm - ?expectedAreaKm) as ?landAreaErrorKm)
  bind(abs(?totalAreaKm - ?expectedAreaKm) as ?totalAreaErrorKm)
  # only show places whose area leans towards the wrong property
  filter(?totalAreaErrorKm > ?landAreaErrorKm)
  # bind closer area property by which has smaller error
  bind(if(?landAreaErrorKm < ?totalAreaErrorKm, dbo:areaTotal, dbo:areaLand) as ?closerAreaProperty)
  # compute difference among errors
  bind(?landAreaErrorKm - ?totalAreaErrorKm as ?errorMargin)
  # get the closer area property's value
  ?place ?closerAreaProperty ?closerAreaValue .
  # only show those where the error is less than a fraction of the closer area value
  filter(?errorMargin < ?closerAreaValue / 10)
} order by desc(?errorMargin)
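The core of the consistency check above can be sketched outside SPARQL as well; a minimal Python version, where the function and its inputs are illustrative rather than taken from DBpedia:

```python
def area_errors(population, density, land_area_m2, total_area_m2):
    """Compare the area implied by population/density against both stored areas."""
    expected_km2 = population / density     # area implied by the density value
    land_km2 = land_area_m2 / 1_000_000     # m^2 -> km^2
    total_km2 = total_area_m2 / 1_000_000
    return abs(land_km2 - expected_km2), abs(total_km2 - expected_km2)

# a record where dbo:areaLand matches the implied area and dbo:areaTotal does not
land_err, total_err = area_errors(446007, 1100, 405_461_000, 1_000_000_000)
print(land_err < total_err)  # True: the land-area property is the consistent one
```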
... or, just compute it
select ?place (?population / ?area as ?density) {
  ?place dbo:areaLand ?area ;
         dbo:populationTotal ?population .
  filter(?area != 0)
}
It's more reliable to derive your own population density value on-the-fly.
A Linked Data consumer expects dbo:populationDensity to be reliable ... but it is inconsistent.
?density := ?population / ?area
The nature of the population density property is that its value is derived from other data, so why not just compute it anyway?
framework
framework don’ts
Some things to avoid:
1. requiring data providers to adopt new software
2. not revealing source code of rules to the end-user
3. deviating from W3C standards or "reinventing the wheel"
How can we aid the computation of dependent data without violating the philosophies of Linked Open Data?
framework ideals
To encourage adoption of our framework, we want to:
▶ operate ad hoc, without requiring data providers to change anything
▶ be fully transparent about what is being done to data by keeping everything openly available for inspection
▶ conform to existing W3C standards and maintain interoperability
How do we seamlessly integrate an extendable computational engine into the Semantic Web Layer Cake?
transparent proxy
man in the middle... support
We propose a framework that functions as a transparent proxy to any existing SPARQL 1.1 endpoint.
The layers of VOLT
sparql as an api
Take advantage of existing SPARQL grammar to create an API
By using this format,
▶ the end-user writes a normal SPARQL query
▶ these syntactic patterns match their materialized form
▶ the same query can be reused elsewhere
the volt ontology
The VOLT Ontology serializes program logic in RDF
... a volt:IfThenElse ;
  volt:if [
    a volt:Operation ;
    volt:operator "<"^^volt:Operator ;
    volt:lhs "?lower"^^volt:Variable ;
    volt:rhs 0 ;
  ] ;
  volt:then (
    [
      a volt:Assignment ;
      volt:assign [
        volt:variable "?lower"^^volt:Variable ;
        volt:operator "+="^^volt:Operator ;
        volt:expression 6.283185307179586 ;
      ] ;
    ]
    [
      a volt:Yield ;
      volt:expression [ ... ]
    ]
  ) ;
...
transparency of procedures
describe ?procedure {
  graph volt:graphs { ?modelGraph a volt:ModelGraph }
  graph ?modelGraph {
    ?procedure rdf:type/rdfs:subClassOf volt:Procedure .
    ?procedure (!</>)+ geo:geometry .
  }
}
Source of procedures remains open and readily accessible
Clients may use that capability to:
▶ search for procedures that match some criteria
▶ inspect a procedure to understand its assumptions
▶ copy/modify/redistribute procedures from data providers
reproducibility
Procedures are only invoked if the triple in question does not already exist
▶ Caching spares computation
▶ Provenance ensures reproducibility and invalidation of stale cache
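This cache-or-compute behavior can be sketched as follows; a hypothetical in-memory model, not the proxy's actual implementation:

```python
cache = {}  # (subject, predicate) -> (object, provenance)

def resolve(subject, predicate, procedure):
    """Invoke the procedure only if the triple is not already materialized."""
    key = (subject, predicate)
    if key not in cache:
        value = procedure(subject)
        # record how the value was produced so stale entries can be invalidated
        cache[key] = (value, {"procedure": procedure.__name__})
    return cache[key][0]

def population_density(place):
    # toy data standing in for a SPARQL lookup
    data = {"dbr:Example_City": (446007, 405.461)}  # (population, area km^2)
    population, area = data[place]
    return population / area

# first call computes and caches; repeated calls are served from the cache
print(resolve("dbr:Example_City", "dbo:populationDensity", population_density))
```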
Cardinal Directions
diversity
statistics
• 1.15 million places¹ on DBpedia²
• ~3.2% of them (36.7k) take part in cardinal direction relations
• 138.8k cardinal direction triples on DBpedia in total
¹ Individuals that are dbo:Place or have geo:geometry
² As of DBpedia 2015-10
accuracy
136,964 combinations of geometries³ among places with cardinal direction relations

Using 8 equal divisions (π/4) of the compass, nearly 1/3 of all relations are inaccurate.

³ Formatted in Well-Known Text geographic coordinates
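The eight-sector test can be sketched as a small function; a sketch assuming the usual bearing convention (degrees clockwise from north), not the VOLT procedure's actual source:

```python
DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def cardinal_direction(bearing_deg):
    """Map a compass bearing to one of 8 sectors, each spanning 45° (π/4)."""
    # shift by half a sector so each direction is centered on its bearing
    sector = int(((bearing_deg + 22.5) % 360) // 45)
    return DIRECTIONS[sector]

print(cardinal_direction(0), cardinal_direction(95), cardinal_direction(200))
# prints: N E S
```

A stated relation is counted as inaccurate when the sector computed from the two geometries disagrees with the asserted cardinal direction triple.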
strategy
Enumerating all possible combinations of cardinal direction relations between places with geometries...
(951.2k choose 2) > 452 billion triples

Currently only 1.1 billion triples on English DBpedia, or 8.8 billion triples overall (i.e., globally)
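The pair count follows from a binomial coefficient; a quick check using the slide's 951.2k figure:

```python
import math

places = 951_200
pairs = math.comb(places, 2)    # unordered pairs of places
print(pairs > 452_000_000_000)  # True: over 452 billion candidate relations
```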
on-demand computation
We tackle these relations using VOLT, only computing triples on-demand.
generalizing
extending volt
The proxy natively handles flow control, scoped variables, operational expressions, and SPARQL queries.
For more advanced operations, it also supports external systems, such as spawning child processes to employ algorithms from libraries, make HTTP requests, read/write the file system, etc.
postgis
For instance, we developed a VOLT plugin that enables users and developers to call the spatial functions found in PostGIS on their data.
@prefix postgis: <http://postgis.net/functions/>
postgis:area
postgis:azimuth
postgis:centroid
postgis:closestPoint
postgis:clusterWithin
postgis:contains
postgis:covers
postgis:coveredBy
postgis:crosses
postgis:disjoint
...
With this set of EVT functions combined with the native capabilities of the proxy, we were able to write complex spatial queries that are much shorter and simpler than their GeoSPARQL equivalents.
using geosparql
Suppose we want to compute the sum of populations for counties along the coast using dbo:populationTotal, or dbp:populationTotal if the former is not valid/available.
With GeoSPARQL we can do this with the following query:

# count the population of coastal counties in California
select (sum(?countyPopulation) as ?coastalPopulation) where {
  data:PacificCoast geo:hasGeometry/geo:asWKT ?pacificCoastWkt .
  { select ?county (sample(?population) as ?countyPopulation) {
      ?county a yago:CaliforniaCounties .
      ?county geo:hasGeometry/geo:asWKT ?countyWkt .
      filter(regex(?countyWkt, '^(<[^>]*>)?(MULTI)?POLYGON', 'i'))
      filter(geof:sfTouches(?countyWkt, ?pacificCoastWkt))
      { ?county dbo:populationTotal ?population .
        filter(isNumeric(?population))
      } union {
        ?county dbp:populationTotal ?population .
        filter(isNumeric(?population))
        filter not exists {
          ?county dbo:populationTotal ?best_population .
          filter(isNumeric(?best_population))
        }
      }
  } group by ?county }
}
using volt with postgis
Or, we can call specific VOLT procedures to do the heavy lifting for us
# count the population of coastal counties in California
select ?population ?area where {
  { select (volt:cluster(?county) as ?setOfCounties) {
      ?county a yago:CaliforniaCounties .
      ?county stko:along data:PacificCoast .
  } }
  [] stko:sumOfPlaces [
    input:places ?setOfCounties ;
    input:propertyList (dbo:populationTotal dbp:populationTotal) ;
    output:sum ?population ;
    output:coveredArea ?area ;
  ]
}
volt:cluster acts as an aggregate function that compiles an RDF collection to be used by the surrounding outer select
stko:along tests for adjacency by using PostGIS functions to deal with sliver polygons
stko:sumOfPlaces computes the sum of a given property (or its fallbacks) among a collection of places with geometries
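The fallback behavior described for stko:sumOfPlaces can be sketched as follows; a hypothetical re-implementation with toy data, not the procedure's actual source:

```python
def sum_of_places(places, property_list):
    """Sum, per place, the first property in the list with a valid numeric value."""
    total = 0
    for place in places:
        for prop in property_list:
            value = place.get(prop)
            if isinstance(value, (int, float)):
                total += value
                break  # stop at the first valid property; later ones are fallbacks
    return total

counties = [
    {"dbo:populationTotal": 100_000},
    {"dbp:populationTotal": 50_000},                                # dbo missing
    {"dbo:populationTotal": "n/a", "dbp:populationTotal": 25_000},  # dbo invalid
]
print(sum_of_places(counties, ["dbo:populationTotal", "dbp:populationTotal"]))
# prints: 175000
```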
conclusions
recap
▶ Data that are dependent should be computed to improve quality, coverage, and longevity
▶ Some data are better suited for on-demand computation rather than being pre-computed
▶ Here, we explored its use with spatiotemporal data - but VOLT is generic, and is equally prepared for any domain
▶ The VOLT proxy integrates seamlessly into existing technology and acts fully transparently
Our aim is to empower end-users' computational abilities, provide the means to inspect how computations are made, and track the provenance of computed data.
thank you!
https://github.com/blake-regalia/volt
blake regalia @ gmail