Metadata Quality Assurance Framework
Part II. – The implementation begins
Péter Király
[email protected]
Göttingen, Geiststraße 10, GWDG meeting room
20/05/2016
Oberseminar Datenmanagement, Cloud und e-Infrastructure
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen

Why is data quality important?

„Fitness for purpose”

no metadata → no access to data → no data usage

More explanation: Data on the Web Best Practices, W3C Working Draft, 17 December 2015
http://www.w3.org/TR/2015/WD-dwbp-20151217/

What is it good for?

Improve the metadata
Improve the metadata schema and its documentation
Propagate „good practice”
Improve services: „good” data is ranked higher in the search result list

Specifically for GWDG: could be built into current and planned data management / data archiving tools

Project principles

Full transparency
Open source, open data (CC0)
Minimal viable product
„Release early. Release often. And listen to your customers” (Eric S. Raymond)
„Eat your own dog food”
Getting real: https://gettingreal.37signals.com/

Measurements

Schema-independent structural features: existence, cardinality, uniqueness
Use case scenarios („fit for purpose”): requirements of the most important functions
Problem catalog: known metadata problems
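
As a rough illustration of the schema-independent structural features, the sketch below computes existence and cardinality per field for one record, and a simple uniqueness score across a set of records. The record layout (a dict mapping field names to lists of string values), the function names, and the definition of uniqueness as "share of values that occur only once" are assumptions made for this example, not the project's actual data model or formulas.

```python
from collections import Counter

def existence(record, fields):
    """Return {field: True/False}: does the field hold at least one non-empty value?"""
    return {f: any(v.strip() for v in record.get(f, [])) for f in fields}

def cardinality(record, fields):
    """Return {field: number of values the field holds in this record}."""
    return {f: len(record.get(f, [])) for f in fields}

def uniqueness(records, field):
    """Share of the values of `field` that occur exactly once across all records."""
    counts = Counter(v for r in records for v in r.get(field, []))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / total

# Hypothetical sample records, for illustration only
records = [
    {"dc:title": ["Map of Göttingen"], "dc:subject": ["maps", "Göttingen"]},
    {"dc:title": ["Untitled"], "dc:subject": []},
    {"dc:title": ["Untitled"]},
]
fields = ["dc:title", "dc:subject", "dc:creator"]

for r in records:
    print(existence(r, fields), cardinality(r, fields))
print(uniqueness(records, "dc:title"))
```

Keeping the three measures as separate functions makes it possible to run them over any schema, since none of them needs to know what a field means.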

Europeana Data Quality Committee

Online collaboration
Use case documents
Problem catalog
Tickets
Discussion forum
#EuropeanaDataQuality
Bi-weekly teleconf
Bi-yearly face-to-face meeting

Topics
Usage scenarios
Metadata profiles
Schema modification
Measuring
Event model

Discovery scenarios and their metadata requirements

1. Basic retrieval with high precision and recall
2. Cross-language recall
3. Entity-based facets
4. Date-based facets
5. Improved language facets
6. Browse by subjects and resource types
7. Browse by agents
8. Browse/Search by Event
9. Entity-based knowledge cards and pages
10. Categorised similar items
11. Spatial search, browse, and map display
12. Entity-based autocompletion
13. Diversification of results
14. Hierarchical search and facets

Credit: the document was initiated by Tim Hill, Europeana’s search engineer

Discovery scenarios and their metadata requirements – 3. Entity-based facets

Scenario
As a user, ... I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.

Metadata analysis
In each case the underlying requirement is that the relevant EDM fields for objects be populated by identifying URIs rather than free text. These URIs need to be related, at a minimum, to a label for each of the supported languages.

Measurement rules
The relevant field values should be resolvable URIs
Each URI should have labels in multiple languages
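
The two measurement rules can be sketched as follows. This is a minimal example under stated assumptions: the URI test is purely syntactic (it does not dereference the URI, so it cannot prove that it resolves), the labels come from a hypothetical in-memory lookup table rather than from a real entity service, and all names and sample data are invented for illustration.

```python
from urllib.parse import urlparse

def looks_like_uri(value):
    """Syntactic check only: an http(s) URI with a host. Does not dereference it."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def label_languages(uri, labels):
    """Languages for which `labels` (uri -> {lang: label}) holds a non-empty label."""
    return {lang for lang, text in labels.get(uri, {}).items() if text}

def check_entity_field(values, labels, min_languages=2):
    """Report, per value, whether it is a URI and whether it has enough labels."""
    report = []
    for v in values:
        is_uri = looks_like_uri(v)
        langs = label_languages(v, labels) if is_uri else set()
        report.append({"value": v, "is_uri": is_uri,
                       "multilingual": len(langs) >= min_languages})
    return report

# Hypothetical data, for illustration only
labels = {"http://example.org/agent/1": {"en": "Albrecht Dürer", "de": "Albrecht Dürer"}}
values = ["http://example.org/agent/1", "Dürer, Albrecht"]
for row in check_entity_field(values, labels):
    print(row)
```

A free-text value such as "Dürer, Albrecht" fails the first rule, which is exactly the case the scenario wants to detect.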

Discovery scenarios and their metadata requirements – 4. Date-based facets

Scenario
I want to be able to filter my results by a variety of timespans, e.g.:
Date of creation
Date of publication
Date as subject

Metadata analysis
Dates should be fully and consistently normalised to follow the XSD date-time data types. Dates expressed in styles like “490 avant J.C” that are inherently language dependent should be avoided as they’re very difficult to normalise (e.g. this should be represented as “-0490”^^xsd:gYear).

Measurement rules
Field values should conform to the XSD date-time data types
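
A possible way to test this rule is to match each field value against the lexical forms of a few XSD date types. The patterns below are deliberately simplified (they do not check month and day ranges, for example), and the function name and the selection of types are assumptions for this sketch.

```python
import re

# Simplified lexical patterns; the XSD specification is stricter
# (for example about month/day ranges and leading zeros in long years).
TZ = r"(Z|[+-]\d{2}:\d{2})?"
XSD_PATTERNS = {
    "xsd:gYear": re.compile(r"-?\d{4,}" + TZ),
    "xsd:gYearMonth": re.compile(r"-?\d{4,}-\d{2}" + TZ),
    "xsd:date": re.compile(r"-?\d{4,}-\d{2}-\d{2}" + TZ),
    "xsd:dateTime": re.compile(
        r"-?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?" + TZ),
}

def xsd_date_type(value):
    """Return the name of the first matching XSD type, or None if nothing matches."""
    for name, pattern in XSD_PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return None

for sample in ["-0490", "490 avant J.C", "2016-05-20", "2016-05-20T10:00:00Z"]:
    print(repr(sample), xsd_date_type(sample))
```

The language-dependent form from the slide matches none of the patterns, while its normalised form is recognised as a year.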

Problem catalog

Title contents same as description contents
Systematic use of the same title
Bad string: "empty" (and variants)
Shelfmarks and other identifiers in fields
Creator not an agent name
Absurd geographical location
Subject field used as description field
Unicode U+FFFD (�)
Very short description field

Credit: the document was initiated by Tim Hill, Europeana’s search engineer

Problem catalog

Description: Title contents same as description contents
Example: /2023702/35D943DF60D779EC9EF31F5DF...
Motivation: Distorts search weightings
Checking Method: Field comparison
Notes: Record display: creator concatenated onto title
Metadata Scenario: Basic Retrieval
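
The "field comparison" checking method for this catalog entry could look like the sketch below. The field names `dc:title` and `dc:description`, the record layout, and the normalisation step (lower-casing and collapsing whitespace) are assumptions for the example; the catalog entry itself only says that the two fields are compared.

```python
def normalize(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())

def title_equals_description(record):
    """True if any title value equals any description value after normalisation."""
    titles = {normalize(t) for t in record.get("dc:title", [])}
    descriptions = {normalize(d) for d in record.get("dc:description", [])}
    titles.discard("")  # empty strings should not count as a match
    return bool(titles & descriptions)

# Hypothetical records, for illustration only
same = {"dc:title": ["Portrait of a  Lady"], "dc:description": ["portrait of a lady"]}
different = {"dc:title": ["Portrait of a Lady"], "dc:description": ["Oil on canvas, 1780"]}
print(title_equals_description(same), title_equals_description(different))
```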

Problem catalog – proposed basis of implementation

Shapes Constraint Language (SHACL)
https://www.w3.org/TR/shacl/

SHACL (Shapes Constraint Language) is a language for describing and constraining the contents of RDF graphs. SHACL groups these descriptions and constraints into "shapes", which specify conditions that apply at a given RDF node. Shapes provide a high-level vocabulary to identify predicates and their associated cardinalities, datatypes and other constraints.

sh:equals, sh:notEquals
sh:hasValue
sh:in
sh:lessThan, sh:lessThanOrEquals
sh:minCount, sh:maxCount
sh:minLength, sh:maxLength
sh:pattern
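
To give a feeling for what some of these constraints express, here is a plain Python analogue of `sh:minCount`, `sh:maxCount`, `sh:minLength`, `sh:maxLength` and `sh:pattern` applied to the values of one property. It is not SHACL and does not use a SHACL engine; the function, its parameters and the sample rule are hypothetical, and Python's regular expression dialect differs in details from the one SHACL relies on.

```python
import re

def check_property(values, min_count=None, max_count=None,
                   min_length=None, max_length=None, pattern=None):
    """Return a list of violation messages for the values of one property."""
    violations = []
    if min_count is not None and len(values) < min_count:
        violations.append(f"minCount: expected at least {min_count}, got {len(values)}")
    if max_count is not None and len(values) > max_count:
        violations.append(f"maxCount: expected at most {max_count}, got {len(values)}")
    for v in values:
        if min_length is not None and len(v) < min_length:
            violations.append(f"minLength: {v!r} is shorter than {min_length}")
        if max_length is not None and len(v) > max_length:
            violations.append(f"maxLength: {v!r} is longer than {max_length}")
        # Unanchored search: the pattern may match anywhere in the value
        if pattern is not None and not re.search(pattern, v):
            violations.append(f"pattern: {v!r} does not match {pattern!r}")
    return violations

# Hypothetical rule: exactly one title, at least 10 characters, not the bad string "empty"
title_rule = dict(min_count=1, max_count=1, min_length=10, pattern=r"^(?!empty$)")
print(check_property(["empty"], **title_rule))
print(check_property(["Map of Göttingen, 1750"], **title_rule))
```

Several problem catalog entries (bad strings, very short fields) map naturally onto such declarative constraints, which is the appeal of a shape language as the basis of implementation.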

Field frequency / main

Field frequency per collection / all

Field frequency per collection / >0%

Field frequency per collection / =100%
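
The field frequency views shown on these slides can be approximated with a small computation: for every collection, the share of records in which a field is present, followed by filters for fields that occur at all (>0%) or in every record (=100%). The record layout, field names and function names below are assumptions for this sketch, not the project's code.

```python
from collections import defaultdict

def field_frequency(records, fields):
    """Return {collection: {field: share of records in which the field is present}}."""
    totals = defaultdict(int)
    present = defaultdict(lambda: defaultdict(int))
    for rec in records:
        coll = rec["collection"]
        totals[coll] += 1
        for f in fields:
            if rec["fields"].get(f):  # present means at least one value
                present[coll][f] += 1
    return {c: {f: present[c][f] / totals[c] for f in fields} for c in totals}

def filter_fields(freq, predicate):
    """Per collection, the sorted field names whose share satisfies `predicate`."""
    return {c: sorted(f for f, share in per_field.items() if predicate(share))
            for c, per_field in freq.items()}

# Hypothetical records, for illustration only
records = [
    {"collection": "A", "fields": {"dc:title": ["x"], "dc:creator": ["y"]}},
    {"collection": "A", "fields": {"dc:title": ["z"]}},
    {"collection": "B", "fields": {"dc:title": ["w"]}},
]
freq = field_frequency(records, ["dc:title", "dc:creator"])
print(freq)
print(filter_fields(freq, lambda s: s > 0))     # the ">0%" view
print(filter_fields(freq, lambda s: s == 1.0))  # the "=100%" view
```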

Field cardinality – overview

Field cardinality – histogram

Field cardinality – an outlier

Multilinguality

@ = language notation in RDF

resource notation

no language
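
The three kinds of values named above (language-tagged literal, resource, literal without language) can be counted per field with a few lines of code. The value layout used here, dicts with hypothetical `value`, `lang` and `resource` keys, is an assumption made for this sketch and may differ from the actual record format.

```python
from collections import Counter

def language_key(value):
    """Classify one field value: its language tag, 'resource', or 'no language'."""
    if "resource" in value:
        return "resource"
    return value.get("lang") or "no language"

def language_frequency(records, field):
    """Count how often each language key occurs in `field` across all records."""
    counts = Counter()
    for rec in records:
        for value in rec.get(field, []):
            counts[language_key(value)] += 1
    return counts

# Hypothetical records, for illustration only
records = [
    {"dc:subject": [{"value": "Karte", "lang": "de"},
                    {"resource": "http://example.org/concept/maps"}]},
    {"dc:subject": [{"value": "map", "lang": "en"}, {"value": "maps"}]},
    {"dc:subject": [{"value": "Stadtplan", "lang": "de"}]},
]
print(language_frequency(records, "dc:subject"))
```

Counts of this kind are the raw material for bar charts and treemaps of language frequency.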

Language frequency / barchart

Language frequency / barchart

Language frequency / Treemap

Language frequency / Treemap with resources

Language frequency / Treemap + interaction + table

Entropy – term uniqueness / main

Entropy – term uniqueness / collection

Entropy – term uniqueness / field value

Entropy – term uniqueness / terms

Problem catalog – Long subject

Problem catalog – Long subject – example (not so long...)

Conclusion: we have to refine the definition of „long”

Problem catalog – same title and description

Problem catalog – same title and description – example

Record view – functionality matrix

Other elements of the record view

Further steps

Building completeness measurements into Europeana’s ingestion tool
Including usage statistics (log files, Google Analytics API)
Human evaluation of metadata quality
Measuring timeliness (changes of scores over time)
Machine learning:
  Classification/clustering of records
  Statistical relevancy of measurements
Göttingen use case: proposed SUB project „Shared Print Study”
Göttingen use case: incorporating into research data management tool
Cooperation with other projects

Architectural overview

[Diagram]
Components: OAI-PMH client (PHP), Apache Spark (Java), Analysis with Spark (Scala), Analysis with R, Web interface (PHP, d3.js)
Storage and indexing: Hadoop File System, Apache Solr, Apache Cassandra
Files passed between components: JSON files, CSV files, image files
Legend: recent workflow, planned workflow

Articles, reports, presentations

Follow me

Project plan and blog: http://pkiraly.github.io

Site: http://144.76.218.178/europeana-qa/

Software development:
https://github.com/pkiraly/europeana-qa-spark: Europeana Metadata Quality Assurance Toolkit
https://github.com/pkiraly/europeana-qa-r: Europeana Metadata Quality Assurance Toolkit

@kiru, https://www.linkedin.com/in/peterkiraly