Produced by: MONTHLY SERIES In partnership with: Data Lake Architecture October 5, 2017

Data Lake Architecture


The First Step in Information Management

looker.com


Topics for Today’s Analytics Webinar

Benefits and Risks of a Data Lake

Data Lake Reference Architecture

Lab and the Factory

Base Environment for Batch Analytics, Streaming and Real-Time Data

Critical Governance Components

Key Take-Aways

Q&A

pg 2© 2017 First San Francisco Partners www.firstsanfranciscopartners.com

Polling Questions

Do you have a data lake?
− Yes
− No
− Unsure

If yes, is it:
− Operational and regularly used for analytics
− Informally used, like a lab or sandbox
− Unsure


Defining the Data Lake

A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format.

The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse).

A data lake can support exploratory analytics, operational uses of data, or both.


Source: Gartner IT Glossary


Benefits and Risks of the Data Lake

Benefits of the Data Lake


− Enables “productionizing” advanced analytics
− Cost-effective scalability and flexibility
− Derives value from unlimited data types (including raw data)
− Reduces long-term cost of ownership across entire spectrum of data use

Risks of the Data Lake

− Loss of trust
− Loss of relevance and momentum
− Increased risk
− Long-term excessive cost


Data Lake Reference Architecture

Modern Reality of the Data Lake


The data lake has changed due to storage availability, data management tools and the ease with which data can be managed.

Today’s data lake consists of:
‒ Landing Zone
‒ Standardization Zone
‒ Analytics Sandbox

Landing Zone: Closest to the original data lake conception, where raw data is stored and available for consumption

Standardization Zone: Standardized, cleaned data; the preferred version for downstream consumers and the Analytics Sandbox

Analytics Sandbox: Where Data Scientists work to create new models
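As a rough illustration of how the three zones relate, the sketch below uses in-memory Python lists to stand in for storage. The function names (`land`, `standardize`, `refresh_standardization_zone`, `sandbox_copy`) and the sample fields are assumptions made for this example; they do not come from the presentation or from any particular product.

```python
import copy

# Hypothetical in-memory stand-ins for two of the zones.
landing_zone = []          # raw records, stored as received
standardization_zone = []  # cleaned, standardized records

def land(raw_record):
    """Store a record exactly as it arrived from the source."""
    landing_zone.append(raw_record)

def standardize(raw_record):
    """Return a cleaned copy: trimmed, lower-cased keys and trimmed string values."""
    return {
        key.strip().lower(): value.strip() if isinstance(value, str) else value
        for key, value in raw_record.items()
    }

def refresh_standardization_zone():
    """Rebuild the standardized view from everything in the landing zone."""
    standardization_zone.clear()
    standardization_zone.extend(standardize(r) for r in landing_zone)

def sandbox_copy():
    """Give a data scientist a private copy so experiments do not alter shared data."""
    return copy.deepcopy(standardization_zone)

land({" Customer_ID ": " 42 ", "Region": "west "})
refresh_standardization_zone()
experiment = sandbox_copy()
experiment[0]["region"] = "WEST"  # the change stays inside the sandbox copy
```

The raw record is kept untouched in the landing zone, so the standardized view can always be rebuilt, and the sandbox works on a copy rather than on the shared data.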

[Diagram, built up across several slides: Data Sources, Landing Zone, Standardization Zone, Analytics Sandbox, Data Scientists and Data Consumers, together with Data Management, Data Governance and Data Operations.]

Reminder: Two Lenses to Derive an Effective Architecture


Form: Developing the architecture so all stakeholders can actually understand and develop it

Progression: Develop architectures that are best fit for purpose and effective, no matter how simple or complex


Lab and the Factory

Why is This Topic Important?

A key to successful data lake management is understanding whether it is a lab, a factory or both.

There are architectural, governance and organizational impacts.

You must clearly identify if you are evolving from a lab to a factory or intend to keep them separate.


First Progression – Lab Elements


[Diagram: grid with columns Organization Elements, Functional Elements and Technology Elements, and rows Data Consumption, Data Supply Chain/Logistics and Data Management. Elements shown for the lab:]

− Landing/Staging
− ETL
− Data Analysts
− Access – Publish, Subscribe, Notify
− Access Tools – BI, Analytics
− Analytics – Descriptive, Predictive, Prescriptive
− HDFS, Columnar and Graph

Operational Elements


[Diagram: grid with columns Organization Elements, Functional Elements and Technology Elements, and rows Data Consumption, Data Supply Chain/Logistics and Data Management. Operational elements shown:]

− Pedigree and Preparation
− Landing/Staging
− Model/Metrics Management
− Data Reduction
− Glossary Management
− Machine Learning/AI
− Data Governance
− Data Operations
− Data Ingestion
− Reference and Master Data
− Competency Centers
− Self-Service/Data Citizens
− ETL/Virtualization
− Distributed Processing
− Metadata
− Data Quality/Hygiene
− Lake, Pond, Warehouse
− HDFS, Columnar and Graph
− Data Streaming
− Data Glossary
− Data Lake Management
− Taxonomy/Ontology
− Web Services
− Policy and Process
− Data Analysts and Scientists
− Collaboration, Decision-Making
− Access – Publish, Subscribe, Notify
− Access Tools – BI, Analytics
− Applications
− Analytics – Descriptive, Predictive, Prescriptive
− Business/Tech. Planning
− Security, Privacy
− Business Continuity


The Lab – Characteristics

Allows for experimentation, testing new models, proofs of concept

Technical
− Flexible architectures, even ad hoc or non-persistent
− Rarely documented
− Schema on read

Organizational
− Run by the main users, hence informal or departmental

Functional
− By nature, results should be evaluated for relevance
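“Schema on read” means the raw data is stored without an enforced structure, and each reader applies the structure it needs when it reads. A minimal sketch using only the Python standard library follows; the sample records, the `read_schema` mapping and the `read_with_schema` helper are hypothetical and serve only to illustrate the idea.

```python
import json

# Raw lines are stored untouched; no schema is enforced when they are written.
raw_lines = [
    '{"id": "1", "amount": "19.99", "note": "first order"}',
    '{"id": "2", "amount": "5"}',
]

# Hypothetical schema chosen by the analyst at read time: field name -> converter.
read_schema = {"id": int, "amount": float}

def read_with_schema(lines, schema):
    """Parse each raw line and apply the schema only now, at read time."""
    for line in lines:
        record = json.loads(line)
        yield {field: convert(record[field]) for field, convert in schema.items()}

rows = list(read_with_schema(raw_lines, read_schema))
```

A different analyst could read the same raw lines with a different schema, which is what makes the lab flexible and also why its results are rarely documented.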


The Factory – Characteristics


Addressing directed requirements, producing regular outputs associated with a business service, product or action

Technical
− Architecture needs to be defined so its use and limits are understood

Organizational
− Published rules of engagement

Functional
− Data quality is monitored and known
− Lineage and metadata support navigation and use of content
− May need scheduled access and loading
− Publishing results will require some form of quality control and approval
− Models that are executed on a scheduled basis will require some sort of administrative and maintenance capabilities
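One way to picture “data quality is monitored and known” together with “quality control and approval” before publishing is a simple gate like the one sketched below. The null-rate metric, the 25% threshold and the function names are assumptions chosen for illustration; a real factory would define its own measures and approval process.

```python
def null_rate(rows, field):
    """Fraction of rows where the field is missing or None."""
    if not rows:
        return 1.0
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)

def quality_report(rows, required_fields, max_null_rate):
    """Return per-field null rates and whether every field is within the threshold."""
    rates = {field: null_rate(rows, field) for field in required_fields}
    passed = all(rate <= max_null_rate for rate in rates.values())
    return {"null_rates": rates, "passed": passed}

def publish(rows, report, approved_by):
    """Release results only if the quality check passed and someone approved them."""
    if not report["passed"]:
        raise ValueError("quality check failed; results not published")
    if not approved_by:
        raise ValueError("approval required before publishing")
    return {"rows": rows, "approved_by": approved_by, "quality": report}

rows = [
    {"customer_id": 1, "score": 0.8},
    {"customer_id": 2, "score": None},
    {"customer_id": 3, "score": 0.4},
    {"customer_id": 4, "score": 0.9},
]
report = quality_report(rows, ["customer_id", "score"], max_null_rate=0.25)
```

The published result carries its quality report with it, so downstream consumers can see what was measured rather than having to trust the output blindly.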


Base Environment for Batch Analytics, Streaming and Real-Time Data

A Base Environment


Data Governance Data Operations

Rapid ingestion – stream, low latency or batch updating

Ease of access – find it, use it, know what it means

Effective data supply chain – data of the correct quality needs to be where it is supposed to be

Flexibility – Data Scientists need to be able to experiment, but without polluting the lake

Additional Components for Real-time Analytics and Ingesting Streaming Data

Are you replacing the Operational Data Store (ODS)?

Will you be doing full CRUD operations (Create, Read, Update, Delete)?

How fast do you need to go? Latencies should match your real needs.

Vendors – Hortonworks, Attunity, Splice
− Ingest
− Process
− Consumption

Technologies you will hear about
− Apache Kafka, Storm (real-time streaming components)
− Apache Spark (fast batches)
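The difference between real-time streaming and “fast batches” can be sketched in plain Python. This is not the Kafka, Storm or Spark API; `per_event` and `micro_batches` are hypothetical helpers that only show the trade-off between handling each event immediately and waiting to group events together.

```python
from itertools import islice

def per_event(events, handle):
    """Streaming style: handle each event as soon as it arrives."""
    for event in events:
        handle([event])

def micro_batches(events, batch_size, handle):
    """Batch style: collect up to batch_size events, then handle them together."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        handle(batch)

stream_calls = []
per_event(range(3), stream_calls.append)

batch_calls = []
micro_batches(range(5), batch_size=2, handle=batch_calls.append)
```

Per-event handling gives the lowest latency at the cost of one call per event, while larger batches reduce the number of calls but delay each result, which is why latency targets should follow the real need.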


Critical Governance Components

Major Areas of Data Governance Concern in the Data Lake

[Diagram: Data Sources, Landing Zone, Standardization Zone, Analytics Sandbox, Data Scientists and Data Consumers, with Data Governance and Data Operations. Six numbered areas of concern are marked on it; area 3 appears three times.]

1. Data Acquisition
2. Data Catalog
3. Data Decisions
4. Analytics Governance
5. Data Usage
6. Model Productionalization

Some Data Governance approaches are new, and others are applications of traditional approaches

Major Areas of Data Governance Concern in the Data Lake


− Data is cataloged/mapped so it’s easily found
− Data is described adequately to permit reuse for any need
− Decisions about data are logged and communicated
− Flow of data (data lineage) is documented, so users/regulators can understand where it came from
− Staff who know and understand the data are identified

Data Governance defines the information you need to maintain your data, develops the processes to do this, trains staff and provides the environments to manage the knowledge, while monitoring and ensuring compliance.
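The governance concerns listed above (findability, description, logged decisions, lineage and known staff) map naturally onto a catalog record. The `CatalogEntry` class below is a hypothetical sketch of such a record, not the schema of any specific catalog tool; all field names and sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    """Hypothetical catalog record covering the concerns listed above."""
    name: str
    description: str                                    # described adequately for reuse
    location: str                                       # where the data can be found
    lineage: List[str] = field(default_factory=list)    # upstream sources, in order
    stewards: List[str] = field(default_factory=list)   # staff who know the data
    decisions: List[str] = field(default_factory=list)  # logged decisions

    def log_decision(self, text: str) -> None:
        """Record a decision so it can be communicated later."""
        self.decisions.append(text)

    def is_documented(self) -> bool:
        """True when description, lineage and at least one steward are present."""
        return bool(self.description and self.lineage and self.stewards)

entry = CatalogEntry(
    name="orders_standardized",
    description="Cleaned order records, one row per order",
    location="standardization_zone/orders",
    lineage=["crm_export", "landing_zone/orders"],
    stewards=["data steward (placeholder)"],
)
entry.log_decision("Amounts are stored in USD")
```

A check such as `is_documented` gives governance something concrete to monitor: entries that lack a description, lineage or a named steward can be flagged before the data is reused.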

Evolution of Critical Governance Components


While flexible, governance is required to ensure appropriate use

While operational, governance will ensure legitimacy and compliance and verify alignment with business needs

To move to operational, governance should supply a road map, new policies, training and organization management


Key Take-Aways


− Make sure you offer up business benefits in addition to traditional “access to data”, such as lower costs and more nimble reactions.
− Avoid additional data risks by providing oversight of data quality and sources. Do not take a casual approach to managing the data lake assets.
− Understand that the architectural aspect of the data lake (as it is evolving) is becoming a standard, much like the data warehouse.
− Maintain an open mind for supporting technologies, because they are changing every day.
− Implement Data Governance. It is a critical success factor, no matter how you view the data lake.


Questions?


Thank you for joining today! Please join us Thursday, Nov. 2 for the Keys to Effective Data Visualization webinar.

John Ladley

Kelle O’Neal