Ross King, Project Director of SCAPE, gave a short presentation of the EU-funded SCAPE project, covering tools for planning and monitoring digital preservation, scalable computation and repositories, the SCAPE Testbeds, and where to learn more. The presentation was given at the workshop ‘Preservation at Scale’ (http://bit.ly/17ppAln), held in connection with the iPres2013 conference in Lisbon, Portugal, in September 2013.
Dr. Ross King AIT Austrian Institute of Technology GmbH
Preservation at Scale Workshop Lisbon, September 5, 2013
SCAPE Tools and Infrastructure for Preservation at Scale
Outline
• SCAPE Project
• SCAPE Solutions
  • Scalable Planning
  • Scalable Tools
  • Scalable Computation
  • Scalable Repositories
• SCAPE Testbeds
• SCAPE Additional Information
  • Online Resources
  • Training Events
  • Contact Information
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
SCAPE – what is it about?
• Planning and executing computing-intensive digital preservation processes such as the large-scale ingestion, characterisation or migration of large (multi-terabyte) and complex data sets
• SCAPE results include
  • Preservation scenarios
  • Preservation tools
  • Preservation workflows
  • Preservation infrastructure
  • Preservation best practices
SCAPE is a follow-up to the highly successful FP6 IP Planets.
SCAPE Project Data
• Project instrument: FP7 Collaborative Project
• 6th Call
  • Objective ICT-2009.4.1: Digital Libraries and Digital Preservation
  • Target outcome (a): Scalable systems and services for preserving digital content
• 10th Call
  • Objective ICT-2013.11.4: Supplements to Strengthen Cooperation in ICT R&D in an Enlarged European Union
• Duration: 44 months (revised from 42)
  • February 2011 – September 2014 (revised from July 2014)
• Budget: 12.0 million euro (revised from 11.3)
• Funded: 9.2 million euro (revised from 8.6)
SCAPE Consortium
SCAPE Solutions
Scalable Planning and Watch
• SCOUT: an automated preservation watch system
  • Enables the planning tool and decision makers to monitor the world and the organisation
  • Collects relevant knowledge and enables automated notification
  • Open and extensible
• c3po: scalable content profiling
  • Analyses characterisation data based on FITS
  • Scale-out with MongoDB (100k/min/node)
  • Visual drill-down and well-documented profile
  • Automated sample selection
• PLATO 4.1: scalable preservation planning
  • www.ifs.tuwien.ac.at/dp/plato
  • Technology upgrade: refactored, rebuilt, standardised, tested
  • New features
    • Groups allow collaborative planning
    • Integration of control policies for groups
    • Quality domain – measures
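The content profiling and sample selection ideas listed for c3po can be illustrated with a minimal sketch. This is not c3po's code or API: the record layout (`path`, `format`, `size`), the function names, and the "largest file per format" sampling rule are all assumptions made for illustration, and the records stand in for already-parsed characterisation output.

```python
from collections import Counter, defaultdict


def build_profile(records):
    """Aggregate per-file characterisation records into a collection profile."""
    formats = Counter(r["format"] for r in records)
    return {
        "count": len(records),
        "total_bytes": sum(r["size"] for r in records),
        "formats": dict(formats),
    }


def select_samples(records, per_format=1):
    """Pick the largest files of each format as samples (hypothetical rule)."""
    by_format = defaultdict(list)
    for r in records:
        by_format[r["format"]].append(r)
    samples = []
    for fmt in sorted(by_format):
        largest = sorted(by_format[fmt], key=lambda r: r["size"], reverse=True)
        samples.extend(r["path"] for r in largest[:per_format])
    return samples


# Hypothetical characterisation records, e.g. distilled from per-file reports.
records = [
    {"path": "a.tif", "format": "image/tiff", "size": 40_000_000},
    {"path": "b.tif", "format": "image/tiff", "size": 35_000_000},
    {"path": "c.pdf", "format": "application/pdf", "size": 1_200_000},
]

profile = build_profile(records)
samples = select_samples(records)
print(profile)
print(samples)
```

A profile of this kind is what a planner would drill into; the sample list is the small subset on which candidate preservation actions could be tried before running them on the whole collection.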
Scalable Tools
• Tool Wrapper
  • Application that adapts existing tools to the SCAPE Platform
  • https://github.com/openplanets/scape-toolwrapper
  • Enhances wrapped tools
    • Standard naming scheme for CC, AS and QA tools
    • Standard invocation method (CLI)
    • Debian packages for easy deployment on the cluster
    • Support for data streaming (useful for Hadoop jobs)
  • Generates Preservation Components
    • Taverna workflows with embedded metadata for easy discovery
    • Automatic publication of components on myExperiment (to support discoverability)
    • Standard ports to enable composition of Preservation Components (based on well-defined component profiles: CC, AS and QA)
• Digital Preservation Toolkit
  • Software suite that contains a large set of DP tools
    • 77 operations in total
  • Easy to deploy on Linux machines (via apt-get)
    • apt-get install digital-preservation-tools
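The "standard invocation method (CLI)" and "data streaming" points can be sketched as a uniform call that pipes bytes through a command-line tool via stdin and stdout. This is only an illustration of the idea, not the SCAPE Tool Wrapper itself: `run_wrapped_tool` is a hypothetical helper, and the "tool" here is a stand-in Python one-liner rather than a real preservation tool.

```python
import subprocess
import sys


def run_wrapped_tool(command, data):
    """Run a command-line tool, streaming `data` to stdin and returning stdout."""
    result = subprocess.run(
        command,
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    return result.stdout


# Stand-in "tool": reads its input stream and writes it back upper-cased.
STAND_IN_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

output = run_wrapped_tool(STAND_IN_TOOL, b"scape")
print(output)
```

Because the tool only sees a byte stream, the same call shape could be used inside a distributed job where input does not live on the local file system, which is presumably why streaming support is flagged as useful for Hadoop.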
Scalable Computation
• Deployment of environments
  • XEN Hypervisor
  • Eucalyptus
• Deployment of tools
  • Debian Packages
  • Tool Spec
• Job Execution Service (JES)
  • Apache Oozie
  • Apache Hadoop
from digitalbevaring.dk
User-view on SCAPE development cloud at AIT: Eucalyptus web interface, Hybridfox browser add-on, and terminal-based interaction.
Scalable Repositories
• Fedora 4.0.0
  • All REST, no SOAP
  • RDF as first class objects
  • JCR 2.0 Implementation (ModeShape)
  • Infinispan distributed NoSQL datastore
• Lily 2.0
  • Built on top of HBase/HDFS
  • Integration of computation and storage
SCAPE Architecture
[Figure: SCAPE architecture diagram. Labels in the diagram: Digital Object Repository, Plan Management API, Data Connector API, Digital Objects/Metadata, Preservation Plan Store, Plan; Execution Platform, JES, JES API, Hadoop; Automated Planning, PLATO, Plan Management GUI; Automated Watch, Sources, Source Adaptor, Push API, Pull API, Knowledge, Client Service, Watch Request API, Notification API, Report API, Assessment; Component Catalogue, Component Lookup API, Component Registration API, Component Profile Validator, Taverna Workbench; Data Publication Platform, LDS3 API, Data Loader Application.]
SCAPE Testbeds
SCAPE Testbeds
• Large-scale Digital Repositories
  • Carry out large-scale image migrations
    • The master files from legacy digitized image collections are typically TIFF files, which are costly to store due to their size. Migrating them can reduce storage costs, but the cost benefit can only be realized if the original TIFFs can be removed, and that requires evidence of successful migration. (2.2 million pages, 80 TB)
  • Detect poor sound quality
    • In a collection of MP3 files (20 TB, 360,000 files) we have discovered files with very bad sound quality. Before ingesting everything into our DOMS, we would like to be able to discover the bad files and potentially have them re-digitized from the original analogue media.
• Research Data Sets
  • RAW to NeXus conversion
    • File size and content volume challenges have been identified for NeXus files. The RAW to NeXus migration tool can be customised to account for various other types of experiment data files during the migration. However, the scalability challenge is that these other types of experiment data files vary significantly between instruments (which are specific to each facility).
from digitalbevaring.dk
See http://wiki.opf-labs.org/display/SP/Scenarios
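The "evidence of successful migration" requirement in the image migration scenario can be sketched as a per-page report that only marks an original as deletable when the migrated copy decodes to identical pixel data. This is a sketch under strong assumptions, not the SCAPE quality assurance workflow: it assumes a lossless migration and that decoded pixel data is already available as bytes (the decoding step is not shown), and all function names are hypothetical. A lossy migration would need an image similarity measure instead of a digest comparison.

```python
import hashlib


def pixel_digest(pixels: bytes) -> str:
    """SHA-256 digest of decoded pixel data (decoding itself is out of scope)."""
    return hashlib.sha256(pixels).hexdigest()


def migration_report(pairs):
    """Build one evidence entry per page from (name, original, migrated) tuples."""
    report = []
    for name, original_pixels, migrated_pixels in pairs:
        ok = pixel_digest(original_pixels) == pixel_digest(migrated_pixels)
        report.append({"page": name, "ok": ok})
    return report


def safe_to_delete(report):
    """Originals may only be removed where the migration was verified."""
    return [entry["page"] for entry in report if entry["ok"]]


# Hypothetical decoded pixel data for two pages; the second one differs.
pairs = [
    ("page-0001", b"\x00\x01\x02", b"\x00\x01\x02"),
    ("page-0002", b"\x10\x20", b"\x10\x21"),
]
report = migration_report(pairs)
deletable = safe_to_delete(report)
print(deletable)

# Scale of the scenario: 80 TB over 2.2 million pages, in decimal megabytes.
avg_page_mb = 80e12 / 2.2e6 / 1e6
print(round(avg_page_mb, 1))
```

The final figure, roughly 36 MB per page on average, shows why keeping the TIFF masters is expensive and why a check has to run automatically at the scale of millions of pages.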
SCAPE Testbeds
• Web Content
  • Quality assurance in web harvesting
    • Web crawling is a process that is highly susceptible to errors. Often, essential data is missed by the crawler and thus not captured and preserved. Currently, quality assurance requires manual effort, and because crawls often contain millions of pages, manual quality assurance will be neither very efficient
• Data Centers
  • Anonymization of medical data
    • To fulfil the safety and security requirements for storing medical data, it will be necessary to develop encryption and anonymization services that allow medical data to be transferred to a data center’s remote storage facilities. On the one hand, encryption techniques will be used to secure sensitive personal data (e.g. internal documents, patient databases), which must only be accessible to authorized services and users. On the other hand, anonymization services will enable medical data (such as x-ray generator outputs, x-ray computed tomography outputs, and surgery recordings) to be stored in the data center without sensitive data attached.
from digitalbevaring.dk
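The idea of storing medical data "without sensitive data attached" can be sketched as stripping direct identifiers from a record and replacing the patient identifier with a keyed hash. Strictly speaking this is pseudonymization rather than full anonymization, and it is not the service developed in SCAPE: the field names, the `DIRECT_IDENTIFIERS` set, and the key handling are all assumptions for illustration, and the encryption side of the scenario is not shown.

```python
import hashlib
import hmac

# Hypothetical set of fields treated as directly identifying.
DIRECT_IDENTIFIERS = {"name", "address", "birth_date"}


def pseudonymise(record, secret_key: bytes):
    """Drop direct identifiers and replace the patient ID with a keyed hash."""
    token = hmac.new(
        secret_key, record["patient_id"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "patient_id"
    }
    # A shortened token keeps records of one patient linkable without the ID.
    cleaned["pseudonym"] = token[:16]
    return cleaned


KEY = b"example-key-kept-outside-the-data-center"  # hypothetical secret
record = {
    "patient_id": "P-12345",
    "name": "Jane Doe",
    "birth_date": "1970-01-01",
    "modality": "CT",
    "image_ref": "scan-0042",
}
result = pseudonymise(record, KEY)
print(result)
```

Using a keyed hash rather than a plain hash means that someone holding only the stored records cannot recompute pseudonyms from guessed patient IDs; the key would stay with the data owner, outside the remote storage facility.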
SCAPE Additional Information
Additional Resources of Interest
• Development Infrastructure
  • Code repository hosted by the Open Planets Foundation and GitHub
    • https://github.com/openplanets/scape/
  • Development Wiki
    • http://wiki.opf-labs.org/display/SP/Home
  • Experimental Workflows
    • http://www.myexperiment.org/search?query=SCAPE&type=all&commit=Search
• Publications
  • http://www.scape-project.eu/category/publication
• Public Deliverables
  • http://www.scape-project.eu/category/deliverable
• Tools
  • http://www.scape-project.eu/tools
SCAPE Training Events
• Future Formats First: Application Infrastructures for Action Services
  • 16-17 September 2013, London
  • Registration: http://scape-future-formats-first.eventbrite.co.uk/
• Critical Path: Effective Evidence Based Preservation Planning
  • 13 November 2013, Aarhus
• Hadoop-driven Digital Preservation (Hackathon)
  • 2-4 December 2013, Vienna
See http://www.scape-project.eu/events
SCAPE Contact Information
• http://www.scape-project.eu/
• Twitter: #scapeproject
• [email protected]
• Dr. Ross King
  AIT Austrian Institute of Technology GmbH
  Donau-City-Strasse 1
  A-1220 Wien
Thank you for your attention!
Questions?