UPTAP Workshop 2007 How Can e-Social Science Promote the Re-Use of Data? Rob Procter National Centre for e-Social Science [email protected] www.ncess.ac.uk


Page 1: How Can e-Social Science Promote the Re-Use of Data?

UPTAP Workshop 2007 1

How Can e-Social Science Promote the Re-Use of Data?

Rob Procter

National Centre for e-Social Science

[email protected]

Page 2: How Can e-Social Science Promote the Re-Use of Data?

The e-Science Vision

“e-Science is about global collaboration in key areas of science and the next generation of infrastructure that will enable it.” (John Taylor, former DG, Research Councils)

That infrastructure is the Grid: “… a software infrastructure that enables flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources” (Foster, Kesselman and Tuecke)

The Grid is not just an enabler of visionary research, however, but can help researchers in more mundane ways.

But, to be successful, the development of the Grid must be driven by researchers’ needs.

I want to use the opportunity provided by this workshop to gather ideas from you about what those needs are with a specific focus on the (re-)use of data.

Page 3: How Can e-Social Science Promote the Re-Use of Data?

NCeSS Overview

Launched in May 2004 to develop and promote UK e-Social Science.

Unified Centre with distributed structure:
– Co-ordinating Hub: Manchester & UKDA
– Seven research Nodes located across UK
– Twelve small projects

Page 4: How Can e-Social Science Promote the Re-Use of Data?

NCeSS Overview

Applications of e-Social Science:
– Harnessing new kinds of research infrastructure and tools to tackle substantive problems and promote innovation in research methods

Social shaping:
– Usability of new infrastructure and tools
– Socio-technical factors in their design, uptake and use
– Research and policy drivers, impacts

Page 5: How Can e-Social Science Promote the Re-Use of Data?

NCeSS 2006

[Diagram: the Hub and the NCeSS nodes and projects. Area labels: Social Shaping, Tools, Analysis, Infrastructure and services, Research methods, Data. Node and project labels: CQeSS, MoSeS, PolicyGrid, Disclosure Risk Assessment, CeSDeMIDE, GeSRM, Intelligent Simulation, MiMeG, HeadTalk, OeSS, DReSS, AGN enabled interviews, Learning Disabilities, Entangled Data, Data chronicles, Replayer, Grid-enabled data collection, GeoVUE, GeODE.]

Page 6: How Can e-Social Science Promote the Re-Use of Data?

Today’s Research Infrastructure

Heterogeneous resources with poor inter-operability and complex administrative arrangements.

[Diagram: a Researcher and separate resources labelled HPC, Computing, Analysis, Data archive, Study and Experiment.]

Doesn’t scale well and makes re-use and sharing of data and other research resources difficult.

Page 7: How Can e-Social Science Promote the Re-Use of Data?

Grid-Enabled Research Infrastructure

[Diagram: social scientists, Grid middleware, and resources labelled Storage, Computing, Analysis, Experiment, HPC, Data archive and Study.]

Grid middleware manages the interactions between users and heterogeneous, distributed resources, providing seamless integration of data, analytic tools and compute resources.

Page 8: How Can e-Social Science Promote the Re-Use of Data?

The Grid Dissected

Tools to support collaboration between distributed researchers.

Computational Grids for scalable, high-performance computation.

Data Grids for accessing and integrating heterogeneous datasets.

Sensor Grids for collecting real-time data.

Page 9: How Can e-Social Science Promote the Re-Use of Data?

Research and Policy Drivers

[Diagram: drivers labelled Ageing population, Migration, Globalisation and Childhood development, shown with data resources labelled Census and population surveys, Administrative data, Longitudinal surveys, Socio-medical data, Business and economic data, and International macro/micro data.]

Page 10: How Can e-Social Science Promote the Re-Use of Data?

Research and Policy Drivers

The range of research resources on offer to the social science community has never been greater.

These include not only traditional research datasets, but new kinds of social data.

However, the often highly distributed and heterogeneous character of these datasets makes it difficult to exploit them to their full potential.

Page 11: How Can e-Social Science Promote the Re-Use of Data?

Research and Policy Drivers

The data deluge in social sciences:
– WWW archive currently contains 55 billion Web pages or 2 petabytes (2 × 10^15 bytes) of data and is growing at the rate of 20 terabytes (20 × 10^12 bytes) per month

Administrative and transactional data is generated on an increasing scale as a by-product of our everyday activities:
– This data is complex and multi-dimensional

Page 12: How Can e-Social Science Promote the Re-Use of Data?

Data Grids for Social Science

Data Grids are designed to provide unimpeded and integrated use of distributed, heterogeneous, autonomous data resources.

Grid enabling a dataset creates new opportunities for (re-)use:
– Enables users to integrate it with other datasets
– Makes it possible to analyse the dataset using techniques that require the kind of computational power that is only feasible using the Grid (e.g., more complex models, more data points)
– Standardisation of procedures and mechanisms used to access and update the dataset increases its shareability
– Automated analyses (i.e., analyses can be re-run automatically when databases are updated)
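
The idea of re-running an analysis automatically when a database is updated can be illustrated with a minimal Python sketch. It is not how any particular Grid middleware works: the `AutoRerun` class, the `fingerprint` helper, the `mean_income` analysis and the sample records are all hypothetical, and the sketch simply assumes the dataset can be serialised to JSON so that a change can be detected by comparing hashes.

```python
import hashlib
import json


def fingerprint(records):
    """Return a stable hash of the dataset contents (hypothetical helper)."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AutoRerun:
    """Re-run an analysis only when the underlying data has changed."""

    def __init__(self, analysis):
        self.analysis = analysis
        self.last_fingerprint = None
        self.result = None
        self.runs = 0

    def refresh(self, records):
        current = fingerprint(records)
        if current != self.last_fingerprint:
            # The data differs from what was last analysed, so run again.
            self.result = self.analysis(records)
            self.last_fingerprint = current
            self.runs += 1
        return self.result


def mean_income(records):
    """Toy analysis: the mean of an invented 'income' variable."""
    return sum(r["income"] for r in records) / len(records)


# Invented survey records, for illustration only.
survey = [{"id": 1, "income": 100}, {"id": 2, "income": 300}]
job = AutoRerun(mean_income)

first = job.refresh(survey)    # first call always runs the analysis
second = job.refresh(survey)   # data unchanged, cached result is returned
survey.append({"id": 3, "income": 500})
third = job.refresh(survey)    # data updated, analysis runs again
```

In this sketch `refresh` would be called on a schedule or by an update notification; which of those a real data service offers is outside what the slides describe.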

Page 13: How Can e-Social Science Promote the Re-Use of Data?

An Example Data Linkage Problem

Many research questions require combination of data from multiple geo-referenced datasets:
– E.g., linking postcoded data to census geography

Conversion of data relating to different geographies to a common target geography is:
– A complex, time-consuming task
– One that requires a range of data handling/processing skills
– A major barrier to use!

The data conversion process requires users to perform the following generic tasks:
– Extract and download data in different formats from a number of databases using different interfaces
– Convert each dataset to the desired target geography using geographical conversion tables
– Combine the converted sets into a single dataset for analysis

These generic tasks can be automated.
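
The convert and combine tasks above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ConvertGrid's implementation: the function names, the unit names ("Ward A", "D1", the postcodes) and all figures are invented, and the conversion table is assumed to give, for each source unit, the share of its value that falls in each target unit.

```python
from collections import defaultdict


def convert_geography(values, lookup):
    """Reallocate values from source units to target units.

    values: {source_unit: value}
    lookup: {source_unit: [(target_unit, weight), ...]}, where the weights
            for each source unit are assumed to sum to 1.
    """
    converted = defaultdict(float)
    for source_unit, value in values.items():
        for target_unit, weight in lookup[source_unit]:
            converted[target_unit] += value * weight
    return dict(converted)


def combine(named_datasets):
    """Merge {name: {unit: value}} into one record per target unit."""
    combined = defaultdict(dict)
    for name, data in named_datasets.items():
        for unit, value in data.items():
            combined[unit][name] = value
    return dict(combined)


# Invented example data: two datasets on different source geographies.
postcode_sales = {"M1 1AA": 10, "M1 2BB": 20}
postcode_to_ward = {
    "M1 1AA": [("Ward A", 1.0)],
    "M1 2BB": [("Ward A", 0.5), ("Ward B", 0.5)],
}
district_students = {"D1": 8}
district_to_ward = {"D1": [("Ward A", 0.25), ("Ward B", 0.75)]}

# Convert each dataset to the common target geography, then combine.
sales_by_ward = convert_geography(postcode_sales, postcode_to_ward)
students_by_ward = convert_geography(district_students, district_to_ward)
linked = combine({"sales": sales_by_ward, "students": students_by_ward})
```

Weighted reallocation suits counts; rates or averages would need a different rule (for example a population-weighted mean), which the sketch does not attempt.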

Page 14: How Can e-Social Science Promote the Re-Use of Data?

A Solution: ConvertGrid

ConvertGrid provides access to 225 UK-wide geography conversion tables between census, electoral, administrative, postal, health and statistical geographies derived from the AFPD.

Facility to convert a researcher’s data from one set of geographical units to another (e.g., from postcode geography to health geography).

Extensible system - further conversion tables from any source can be incorporated.

Page 15: How Can e-Social Science Promote the Re-Use of Data?

ConvertGrid – Data Visualisation Interface

Relationship between average house price sales (Experian) and percentage of 16-19 year olds entering university (Neighbourhood Statistics & Census aggregate statistics).

Contact Keith Cole ([email protected]) for more information.

High average house price sales but low participation rates

Low average house price sales but high participation rates

Ten minutes from start to finish

Page 16: How Can e-Social Science Promote the Re-Use of Data?

Supporting the Research Lifecycle

[Diagram: the research lifecycle, with stages labelled:]
– Share results and conclusions and discuss with collaborators
– Explore datasets and determine suitability
– Analyse results and compare with hypothesis
– Review literature and generate hypothesis
– Write papers
– Build models and execute them
– Publish papers
– Find datasets related to proposed area of work

Page 17: How Can e-Social Science Promote the Re-Use of Data?

Increasing (Re-)Use of Social Data

Removing barriers to more effective use of existing social data collections:
– Data providers (e.g., ONS, data archives)
– Data users

Many researchers are both generators and users of data:
– Preparation of data for submission to data archives is not well rewarded so re-use suffers

Removing barriers to use of new kinds of social data:
– Privacy and confidentiality of personal data

Page 18: How Can e-Social Science Promote the Re-Use of Data?

The Data Provider Perspective

Preparation procedures:
– Cleaning the data
– Generating derived variables
– Re-weighting
– Adding metadata
– Writing user documentation

Maintenance:
– Managing changes in sampling frames, definitions, variables and questionnaire over time
– Re-weighting

User support:
– Handling queries from users about concepts, meaning and linking waves

Page 19: How Can e-Social Science Promote the Re-Use of Data?

The Data User Perspective

Discovering appropriate data:
– Determining what can be done with the data and how.

Accessing the data:
– Are existing provisions, such as VMDLs, for access to confidential data adequate?

Understanding how the data has been used to generate answers to other research questions:
– Provenance of results, links to publications
– Re-running statistical models, comparing results

Ease of use and quality of documentation:
– User manuals

Page 20: How Can e-Social Science Promote the Re-Use of Data?

The Data User Perspective

Data preparation:
– Selecting variables
– Linking waves
– Linking data sets

Performing and possibly repeating analysis with different data.

Interpreting and visualising results.

Supporting the research lifecycle.

Collaboration with other users and with data providers.

Page 21: How Can e-Social Science Promote the Re-Use of Data?

Contacting NCeSS and Getting Involved

[email protected] www.ncess.ac.uk

– Join our email list:
– Participate in events:

•Agenda setting workshop on combining and sharing data, January 22nd-23rd, Manchester

•Annual conference