Dr. Ross King AIT Austrian Institute of Technology GmbH Preservation at Scale Workshop Lisbon, September 5, 2013 SCAPE Tools and Infrastructure for Preservation at Scale

SCAPE - Scalable Preservation Environments


DESCRIPTION

Ross King, Project Director of SCAPE, gave a short presentation of the EU-funded project SCAPE, covering tools for planning and monitoring digital preservation, scalable computation and repositories, the SCAPE Testbeds, and where to learn more. The presentation was given at the workshop ‘Preservation at Scale’ http://bit.ly/17ppAln, held in connection with the iPRES 2013 conference in Lisbon, Portugal, in September 2013.


Page 1: SCAPE - Scalable Preservation Environments

Dr. Ross King AIT Austrian Institute of Technology GmbH

Preservation at Scale Workshop Lisbon, September 5, 2013

SCAPE Tools and Infrastructure for Preservation at Scale

Page 2: SCAPE - Scalable Preservation Environments

• SCAPE Project
• SCAPE Solutions
  • Scalable Planning
  • Scalable Tools
  • Scalable Computation
  • Scalable Repositories
• SCAPE Testbeds
• SCAPE Additional Information
  • Online Resources
  • Training Events
  • Contact Information


Outline

This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).

Page 3: SCAPE - Scalable Preservation Environments

SCAPE – what is it about?

• Planning and executing computing-intensive digital preservation processes such as the large-scale ingestion, characterisation or migration of large (multi-terabyte) and complex data sets
• SCAPE results include
  • Preservation scenarios
  • Preservation tools
  • Preservation workflows
  • Preservation infrastructure
  • Preservation best practices

SCAPE is a follow-up to the highly successful FP6 IP Planets.


Page 4: SCAPE - Scalable Preservation Environments

SCAPE Project Data

• Project instrument: FP7 Collaborative Project
• Call 6
  • Objective ICT-2009.4.1: Digital Libraries and Digital Preservation
  • Target outcome (a): Scalable systems and services for preserving digital content
• Call 10
  • Objective ICT-2013.11.4: Supplements to Strengthen Cooperation in ICT R&D in an Enlarged European Union
• Duration: 44 months (originally 42)
  • February 2011 – September 2014 (originally July 2014)
• Budget: 12.0 million euro (originally 11.3)
  • Funded: 9.2 million euro (originally 8.6)

Page 5: SCAPE - Scalable Preservation Environments

SCAPE Consortium


Page 6: SCAPE - Scalable Preservation Environments

SCAPE Solutions


Page 7: SCAPE - Scalable Preservation Environments

• SCOUT: an automated preservation watch system
  • Enables the planning tool and decision makers to monitor the world and the organisation
  • Collects relevant knowledge and enables automated notification
  • Open and extensible
• c3po: scalable content profiling
  • Analyses characterisation data based on FITS
  • Scale-out with MongoDB (100k/min/node)
  • Visual drill-down and well-documented profile
  • Automated sample selection
• PLATO 4.1: scalable preservation planning
  • www.ifs.tuwien.ac.at/dp/plato
  • Technology upgrade: refactored, rebuilt, standardised, tested
  • New features
    • Groups allow collaborative planning
    • Integration of control policies for group
    • Quality domain – measures


Scalable Planning and Watch
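The content profiling idea behind c3po can be illustrated with a minimal sketch. It assumes each file has already been characterised into a small record; the field names (`format`, `size`), the function names, and the per-format sampling rule are illustrative assumptions, not c3po's actual data model or algorithm.

```python
from collections import Counter
import random


def build_profile(records):
    """Aggregate per-file characterisation records into a collection profile."""
    formats = Counter(r["format"] for r in records)
    return {
        "count": len(records),
        "total_size": sum(r["size"] for r in records),
        "formats": dict(formats),
    }


def select_sample(records, per_format=1, seed=0):
    """Pick up to `per_format` records for every format seen (hypothetical rule)."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_format = {}
    for r in records:
        by_format.setdefault(r["format"], []).append(r)
    sample = []
    for fmt in sorted(by_format):
        group = by_format[fmt]
        sample.extend(rng.sample(group, min(per_format, len(group))))
    return sample
```

A profile of this kind gives a planner the format distribution at a glance, and the sample gives a small, representative set of files on which to test candidate preservation actions.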


Page 8: SCAPE - Scalable Preservation Environments

• Tool Wrapper
  • Application that adapts existing tools to the SCAPE Platform
  • https://github.com/openplanets/scape-toolwrapper
  • Enhances wrapped tools
    • Standard naming scheme for CC, AS and QA tools
    • Standard invocation method (CLI)
    • Debian packages for easy deployment on the cluster
    • Support for data streaming (useful for Hadoop jobs)
  • Generates Preservation Components
    • Taverna workflows with embedded metadata for easy discovery
    • Automatic publication of components on myExperiment (to support discoverability)
    • Standard ports to enable composition of Preservation Components (based on well-defined component profiles: CC, AS and QA)
• Digital Preservation Toolkit
  • Software suite that contains a large set of DP tools
  • 77 operations in total
  • Easy to deploy on Linux machines (via apt-get)
    • apt-get install digital-preservation-tools


Scalable Tools
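The value of a standard, stream-based command-line invocation can be shown with a small sketch: if every wrapped tool reads from standard input and writes to standard output, one function can drive any of them. The function name `run_wrapped` and the stand-in tool below are hypothetical and do not reflect the SCAPE Tool Wrapper's actual interface.

```python
import subprocess
import sys


def run_wrapped(command, data):
    """Stream `data` (bytes) to the tool's stdin and return what it writes to stdout."""
    result = subprocess.run(
        command,
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the tool exits with a non-zero status
    )
    return result.stdout


# Stand-in "tool" for illustration only: a Python one-liner that upper-cases its input.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())",
]
```

Because the caller only deals in byte streams, the same pattern fits a Hadoop-style job that feeds records to a tool without writing intermediate files.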


Page 9: SCAPE - Scalable Preservation Environments

• Deployment of environments
  • XEN Hypervisor
  • Eucalyptus
• Deployment of tools
  • Debian Packages
  • Tool Spec
• Job Execution Service (JES)
  • Apache Oozie
  • Apache Hadoop


Scalable Computation


from digitalbevaring.dk

User-view on SCAPE development cloud at AIT: Eucalyptus web interface, Hybridfox browser add-on, and terminal-based interaction.

Page 10: SCAPE - Scalable Preservation Environments

• Fedora 4.0.0
  • All REST, no SOAP
  • RDF as first class objects
  • JCR 2.0 implementation (ModeShape)
  • Infinispan distributed NoSQL datastore
• Lily 2.0
  • Built on top of HBase/HDFS
  • Integration of computation and storage


Scalable Repositories


Page 11: SCAPE - Scalable Preservation Environments


SCAPE Architecture


[Figure: SCAPE architecture diagram. Labels recoverable from the figure: Plan Management API; Digital Object Repository; Execution Platform; JES; Hadoop; JES API; Data Connector API; Automated Watch; Automated Planning; PLATO; Plan Management GUI; Digital Objects/Metadata; Preservation Plan Store; Plan; Component Catalogue; Component Lookup API; Taverna Workbench; Component Registration API; Component Profile Validator; Sources; Push API; Pull API; Knowledge; Source Adaptor; Client Service; Watch Request API; Notification API; Report API; Assessment; Data Publication Platform; LDS3 API; Data Loader Application.]

Page 12: SCAPE - Scalable Preservation Environments

SCAPE Testbeds


Page 13: SCAPE - Scalable Preservation Environments

SCAPE Testbeds

• Large-scale Digital Repositories
  • Carry out large-scale image migrations
    • The master files from legacy digitized image collections are typically TIFF files that can be costly to store due to their size. The cost benefit can only be realized if one can remove the original TIFFs, and this can only be done if one can provide evidence of successful migration. (2.2 million pages, 80 TB)
  • Detect poor sound quality
    • In a collection of MP3 files (20 TB, 360,000 files) we have discovered files with very bad sound quality. Before ingesting everything into our DOMS we would like to be able to discover the bad files and potentially get those re-digitized from the original analogue media.
• Research Data Sets
  • RAW to NeXus conversion
    • File size and content volume challenges have been identified for NeXus files. The RAW to NeXus format migration tool can be customised to account for various other types of experiment data files during the migration. However, the scalability challenge is that these other types of experiment data files vary significantly between instruments, which are specific to each facility.


from digitalbevaring.dk

See http://wiki.opf-labs.org/display/SP/Scenarios
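For the image migration scenario above, one simple form of "evidence of successful migration" is a recorded comparison of the decoded image content before and after migration. The sketch below assumes a lossless migration and assumes both images have already been decoded to raw pixel bytes in the same layout by an image library that is not shown; the function names are hypothetical and this is not the quality assurance method used in SCAPE.

```python
import hashlib


def pixel_digest(pixels: bytes) -> str:
    """Return the SHA-256 hex digest of decoded pixel data."""
    return hashlib.sha256(pixels).hexdigest()


def migration_evidence(original_pixels: bytes, migrated_pixels: bytes) -> dict:
    """Build a small evidence record comparing original and migrated pixel data."""
    original = pixel_digest(original_pixels)
    migrated = pixel_digest(migrated_pixels)
    return {
        "original_sha256": original,
        "migrated_sha256": migrated,
        "identical": original == migrated,  # only meaningful for lossless migration
    }
```

Storing such a record per page would let an institution justify deleting an original only when `identical` is true. A lossy target format would need a similarity measure with a tolerance instead of an exact digest match.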

Page 14: SCAPE - Scalable Preservation Environments

SCAPE Testbeds

• Web Content
  • Quality assurance in web harvesting
    • Web crawling is a process that is highly susceptible to errors. Often, essential data is missed by the crawler and thus not captured and preserved. Currently, quality assurance requires manual effort and because crawls often contain millions of pages, manual quality assurance will be neither very efficient
• Data Centers
  • Anonymization of medical data
    • In order to fulfil the safety and security requirements for storing medical data, it will be necessary to develop encryption and anonymization services that allow medical data to be transferred to a data center’s remote storage facilities. On the one hand, encryption techniques will be used to secure sensitive personal data (e.g. internal documents, patient databases), which must only be accessible to authorized services and users. On the other hand, anonymization services will enable medical data (such as x-ray generator outputs, x-ray computed tomography outputs, surgery recordings) to be stored in the data center without sensitive data attached.


from digitalbevaring.dk
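The anonymization step described above can be sketched as stripping identifying fields from a metadata record before it leaves the institution. The field names and the function below are hypothetical examples for illustration; real medical formats and the services planned in SCAPE will differ.

```python
# Hypothetical set of identifying fields; a real deployment would define its own.
SENSITIVE_FIELDS = frozenset({"patient_name", "patient_id", "birth_date"})


def anonymize(record, sensitive=SENSITIVE_FIELDS):
    """Return a copy of `record` without the sensitive fields; the input is left unchanged."""
    return {key: value for key, value in record.items() if key not in sensitive}
```

For example, a record with `patient_name`, `modality` and `file` keys would keep only `modality` and `file`. Encryption of the data that must stay identifiable is a separate concern and is not covered by this sketch.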

Page 15: SCAPE - Scalable Preservation Environments

SCAPE Additional Information


Page 16: SCAPE - Scalable Preservation Environments

Additional Resources of Interest

• Development Infrastructure
  • Code repository hosted by the Open Planets Foundation and GitHub
    • https://github.com/openplanets/scape/
  • Development Wiki
    • http://wiki.opf-labs.org/display/SP/Home
  • Experimental Workflows
    • http://www.myexperiment.org/search?query=SCAPE&type=all&commit=Search
• Publications
  • http://www.scape-project.eu/category/publication
• Public Deliverables
  • http://www.scape-project.eu/category/deliverable
• Tools
  • http://www.scape-project.eu/tools


Page 17: SCAPE - Scalable Preservation Environments

SCAPE Training Events

• Future Formats First: Application Infrastructures for Action Services
  • 16-17 September 2013, London
  • Registration: http://scape-future-formats-first.eventbrite.co.uk/
• Critical Path: Effective Evidence Based Preservation Planning
  • 13 November 2013, Aarhus
• Hadoop-driven Digital Preservation (Hackathon)
  • 2-4 December 2013, Vienna


See http://www.scape-project.eu/events

Page 18: SCAPE - Scalable Preservation Environments

SCAPE Contact Information

• http://www.scape-project.eu/
• Twitter: #scapeproject
• [email protected]
• Dr. Ross King
  AIT Austrian Institute of Technology GmbH
  Donau-City-Strasse 1
  A-1220 Wien


Page 19: SCAPE - Scalable Preservation Environments

Thank you for your attention!

Questions?
