Q4 2016 GeoTrellis Presentation

Page 1: Q4 2016 GeoTrellis Presentation
Page 2: Q4 2016 GeoTrellis Presentation

What we’ll be covering

What GeoTrellis is, what it can do

Demo of GeoTrellis in action

Talk about what the next steps for GeoTrellis are, look into some of the possible use cases for GeoTrellis that we’re excited about, and talk about our roadmap

Feel free to ask questions throughout!

Page 3: Q4 2016 GeoTrellis Presentation

Where did GeoTrellis come from?

Page 4: Q4 2016 GeoTrellis Presentation
Page 5: Q4 2016 GeoTrellis Presentation
Page 6: Q4 2016 GeoTrellis Presentation
Page 7: Q4 2016 GeoTrellis Presentation
Page 8: Q4 2016 GeoTrellis Presentation
Page 9: Q4 2016 GeoTrellis Presentation
Page 10: Q4 2016 GeoTrellis Presentation
Page 11: Q4 2016 GeoTrellis Presentation
Page 12: Q4 2016 GeoTrellis Presentation
Page 13: Q4 2016 GeoTrellis Presentation

2011 - 2013

Page 14: Q4 2016 GeoTrellis Presentation

2013 - Present

Page 15: Q4 2016 GeoTrellis Presentation
Page 16: Q4 2016 GeoTrellis Presentation

What is GeoTrellis?

Page 17: Q4 2016 GeoTrellis Presentation

GeoTrellis

a Scala library for geospatial data types and operations.

gives Spark geospatial capabilities

stores and queries raster data on HDFS, S3, Accumulo, and Cassandra (HBase soon)

Page 18: Q4 2016 GeoTrellis Presentation

Geo +

Rasters +

Page 19: Q4 2016 GeoTrellis Presentation

Rasters, some Vector +

v1.0 Q4 2016

Page 20: Q4 2016 GeoTrellis Presentation

Rasters, Vector, VectorTiles, Point Cloud +

ROADMAP v1.1

w/

Page 21: Q4 2016 GeoTrellis Presentation

Vector Data with GeoTrellis (non-Spark)

Wraps JTS

GeoJson, WKT, WKB reading/writing

Reprojection (Proj4j)

Kriging Interpolation

Page 22: Q4 2016 GeoTrellis Presentation

Rasters with GeoTrellis (non-Spark)

Read GeoTiffs

Map Algebra (local, focal, zonal)

Polygonal Summaries

Generally transform and combine raster data

Kernel Density, rasterization, vectorization

Get histograms

Render via color breaks

Page 23: Q4 2016 GeoTrellis Presentation

GeoTrellis & Spark

Ingest data to local file system, HDFS, Accumulo, S3, or Cassandra

Distributed computations of Spatial and Spatio-temporal raster data

Map algebra on distributed tile sets

General ways to transform and combine distributed tile sets

Page 24: Q4 2016 GeoTrellis Presentation

BACKGROUND

Page 25: Q4 2016 GeoTrellis Presentation

PROCESSING GEOSPATIAL DATA @ SCALE

Page 26: Q4 2016 GeoTrellis Presentation

PROCESSING GEOSPATIAL DATA @ SCALE

Page 27: Q4 2016 GeoTrellis Presentation

Geospatial Data

Core of GIS (Geographic information system)

Raster (images, weather data)

Vector (points of interest, country boundaries)

Page 28: Q4 2016 GeoTrellis Presentation

Geospatial Data

Core of GIS (Geographic information system)

Raster (images, weather data)

Vector (points of interest, country boundaries)

VectorTiles, Point Cloud

Page 29: Q4 2016 GeoTrellis Presentation

Raster Data

Page 30: Q4 2016 GeoTrellis Presentation

Raster Data

Page 31: Q4 2016 GeoTrellis Presentation

Raster Data

Page 32: Q4 2016 GeoTrellis Presentation

Raster Data

Page 33: Q4 2016 GeoTrellis Presentation

Vector Data (Points)

Page 34: Q4 2016 GeoTrellis Presentation

Vector Data (Lines)

Page 35: Q4 2016 GeoTrellis Presentation

Vector Data (Polygons)

Source: https://ryouready.wordpress.com/2009/11/16/infomaps-using-r-visualizing-german-unemployment-rates-by-color-on-a-map/

Page 36: Q4 2016 GeoTrellis Presentation

Vector Data

Page 37: Q4 2016 GeoTrellis Presentation

PROCESSING GEOSPATIAL DATA @ SCALE

Page 38: Q4 2016 GeoTrellis Presentation

Contains

Page 39: Q4 2016 GeoTrellis Presentation

Contains

Page 40: Q4 2016 GeoTrellis Presentation

Heatmap (Kernel Density)

Page 41: Q4 2016 GeoTrellis Presentation

Zonal Statistics
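
As a rough illustration of zonal statistics (plain Python, not the GeoTrellis API): a zone raster labels every cell, and a value raster is summarized per label. The names `zonal_mean`, `values`, and `zones` are made up for this sketch.

```python
from collections import defaultdict

def zonal_mean(values, zones):
    """Average the cells of `values` grouped by the label in `zones`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            totals[zone] += value
            counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}

# Hypothetical 2 x 3 rasters: one of values, one of zone labels.
values = [[1, 2, 3], [4, 5, 6]]
zones = [["a", "a", "b"], ["a", "b", "b"]]
means = zonal_mean(values, zones)
```

Each zone collapses to one number, which is why zonal results are usually tabular rather than another raster.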

Page 42: Q4 2016 GeoTrellis Presentation

Feature Extraction (Image Segmentation)

Source: http://www.professeurs.polymtl.ca/christopher.pal/

Page 43: Q4 2016 GeoTrellis Presentation

Map Algebra

Page 44: Q4 2016 GeoTrellis Presentation

Local Operation
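
A minimal sketch of a local operation, assuming rasters are plain lists of rows (this is not GeoTrellis code; `local_add` is a name invented here). Each output cell depends only on the cells at the same position in the inputs.

```python
def local_add(a, b):
    """Add two equally sized rasters cell by cell."""
    return [
        [x + y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]

# Hypothetical 2 x 2 rasters.
elevation = [[1, 2], [3, 4]]
offset = [[10, 10], [10, 10]]
result = local_add(elevation, offset)
```

Because no cell needs its neighbours, a local operation can run on each tile independently.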

Page 45: Q4 2016 GeoTrellis Presentation

Focal Operation
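
A minimal sketch of a focal operation, again in plain Python rather than the GeoTrellis API. Each output cell is computed from a neighbourhood around the input cell; here it is a square-window mean, with the window clipped at the raster edge (an assumption, since other edge policies exist). `focal_mean` is a name invented for this sketch.

```python
def focal_mean(raster, radius=1):
    """Mean of the (2 * radius + 1)-wide square window around each cell."""
    rows, cols = len(raster), len(raster[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Clip the window so it never reaches outside the raster.
            window = [
                raster[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out

grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
smoothed = focal_mean(grid)
```

Unlike a local operation, a focal operation on a tiled raster needs cells from neighbouring tiles along the tile borders.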

Page 46: Q4 2016 GeoTrellis Presentation

Map Algebra in GeoTrellis

Page 47: Q4 2016 GeoTrellis Presentation

PROCESSING GEOSPATIAL DATA WITH

Page 48: Q4 2016 GeoTrellis Presentation

Polygonal Summary Statistics

Page 49: Q4 2016 GeoTrellis Presentation
Page 50: Q4 2016 GeoTrellis Presentation
Page 51: Q4 2016 GeoTrellis Presentation
Page 52: Q4 2016 GeoTrellis Presentation
Page 53: Q4 2016 GeoTrellis Presentation

PROCESSING GEOSPATIAL DATA @ SCALE

Page 54: Q4 2016 GeoTrellis Presentation

NED 1/3 arc second

Page 55: Q4 2016 GeoTrellis Presentation

NED 1/3 arc second

Page 56: Q4 2016 GeoTrellis Presentation

NED 1/3 arc second

Page 57: Q4 2016 GeoTrellis Presentation

NED 1/3 arc second

Page 58: Q4 2016 GeoTrellis Presentation

NED 1/3 arc second

Page 59: Q4 2016 GeoTrellis Presentation

• 170 X 180 km

• 2 GB each

• 11 bands

• 700 scenes per day

• 1.4 TB / day

• 255,500 scenes / year

• 0.25 PB / year

Landsat 8

Page 60: Q4 2016 GeoTrellis Presentation

Landsat 8 on

• All Landsat 8 scenes from 2015 and beyond.

• Selection of cloud-free scenes from 2013 and 2014.

Page 61: Q4 2016 GeoTrellis Presentation

Landsat 8 on

645,763 scenes

Page 62: Q4 2016 GeoTrellis Presentation

Landsat 8 on

≈1 Petabyte

Page 63: Q4 2016 GeoTrellis Presentation
Page 64: Q4 2016 GeoTrellis Presentation
Page 65: Q4 2016 GeoTrellis Presentation
Page 66: Q4 2016 GeoTrellis Presentation

64 GB

Page 67: Q4 2016 GeoTrellis Presentation

32 Landsat 8 Scenes

Page 68: Q4 2016 GeoTrellis Presentation
Page 69: Q4 2016 GeoTrellis Presentation

This many people’s phones could hold all the Landsat 8 data AWS is holding.

Page 70: Q4 2016 GeoTrellis Presentation

PROCESSING GEOSPATIAL DATA @ SCALE

Page 71: Q4 2016 GeoTrellis Presentation
Page 72: Q4 2016 GeoTrellis Presentation

Project to build a better search engine, back in the early 2000s.

Worked for small datasets, but was not scalable.

Page 73: Q4 2016 GeoTrellis Presentation

The Google papers

Page 74: Q4 2016 GeoTrellis Presentation

After reading the papers, Nutch developers added a distributed file system and MapReduce model to Nutch.

In 2006, those portions were spun out of Nutch to form…

Page 75: Q4 2016 GeoTrellis Presentation
Page 76: Q4 2016 GeoTrellis Presentation

Apache Hadoop

Heavily supported by Yahoo, which moved its large data processing to Hadoop.

By 2007, Twitter, Facebook, LinkedIn and many others were doing serious work with Hadoop.

In 2008, Hadoop graduated to a top-level Apache project.

Page 77: Q4 2016 GeoTrellis Presentation

Hadoop

Source: http://cs.calvin.edu/courses/cs/374/exercises/12/lab/MapReduceWordCount.png

Page 78: Q4 2016 GeoTrellis Presentation

Matei Zaharia

Worked with Hadoop at UC Berkeley

Noticed Hadoop was not a good fit for machine learning algorithms and other iterative models.

So in 2009, he created…

Page 79: Q4 2016 GeoTrellis Presentation
Page 80: Q4 2016 GeoTrellis Presentation
Page 81: Q4 2016 GeoTrellis Presentation

Open sourced in 2010 under BSD license

Maintained by UC Berkeley’s AMPLab

Donated to the Apache Software Foundation in 2013 and relicensed as Apache 2.0

Graduated to a top level Apache project in 2014

Apache Spark

Page 82: Q4 2016 GeoTrellis Presentation

Apache Spark

a distributed computation engine.

An API that lets you work with distributed data as a collection.

Written in Scala, with language bindings for use with Java, Python, and R.
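
To make "distributed data as a collection" concrete, here is a plain-Python sketch of the idea, not Spark's API: the data lives in partitions, each partition is summarized where it sits, and the partial results are combined. The partition contents are made-up per-phone scene counts.

```python
from functools import reduce

# Hypothetical data: three partitions (say, three seating sections),
# each holding the number of Landsat 8 scenes on each phone.
partitions = [[32, 30], [28], [32, 31, 29]]

# Step 1: every partition is summed independently (in Spark this
# would happen in parallel on the workers).
partial_sums = [sum(partition) for partition in partitions]

# Step 2: the partial sums are combined into one answer.
total = reduce(lambda a, b: a + b, partial_sums, 0)
```

The caller writes this as if it were one collection; where each partition lives is the engine's concern.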

Page 83: Q4 2016 GeoTrellis Presentation
Page 84: Q4 2016 GeoTrellis Presentation
Page 85: Q4 2016 GeoTrellis Presentation

Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones?

Page 86: Q4 2016 GeoTrellis Presentation
Page 87: Q4 2016 GeoTrellis Presentation

Data Node

Data Node

Data Node

Name Node

Master

Tablet Server

Tablet Server

Tablet Server

Accumulo

BigTable clone (columnar database)

Records stored on HDFS

Lexicographically sorted table index

Page 88: Q4 2016 GeoTrellis Presentation
Page 89: Q4 2016 GeoTrellis Presentation

Apache Accumulo

Created by the NSA in 2008

Donated to the Apache Foundation in 2011

Graduated to a top level project in 2012

Page 90: Q4 2016 GeoTrellis Presentation

2006

Page 91: Q4 2016 GeoTrellis Presentation
Page 92: Q4 2016 GeoTrellis Presentation

(Sec. 929) Prohibits any DOD component from utilizing the cloud computing database developed by the National Security Agency (NSA) and known as "Accumulo" after the end of FY2013, unless the DOD CIO certifies that: (1) there are no viable commercial open source databases that have such security features, or (2) Accumulo itself has become a successful open source database project. Requires DOD and intelligence community officials to coordinate the use by DOD components of cloud computing infrastructure and services offered by the intelligence community for purposes other than intelligence analysis.

Page 93: Q4 2016 GeoTrellis Presentation

(Sec. 929) Prohibits any DOD component from utilizing the cloud computing database developed by the National Security Agency (NSA) and known as "Accumulo" after the end of FY2013, unless the DOD CIO certifies that: (1) there are no viable commercial open source databases that have such security features, or (2) Accumulo itself has become a successful open source database project. Requires DOD and intelligence community officials to coordinate the use by DOD components of cloud computing infrastructure and services offered by the intelligence community for purposes other than intelligence analysis.

Page 94: Q4 2016 GeoTrellis Presentation

Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones per month?

Page 95: Q4 2016 GeoTrellis Presentation

PROCESSING GEOSPATIAL DATA @ SCALE

Page 96: Q4 2016 GeoTrellis Presentation

Hey Flyers Fans, can you take the average pixel value of each scene’s band and derive an EPSG:3857 tile set of PNGs to be served on web maps?

Page 97: Q4 2016 GeoTrellis Presentation

Hey Flyers Fans, can you take the average pixel value of each scene’s band and derive an EPSG:3857 tile set of PNGs to be served on web maps?

Page 98: Q4 2016 GeoTrellis Presentation
Page 99: Q4 2016 GeoTrellis Presentation

How does GeoTrellis work?

Page 100: Q4 2016 GeoTrellis Presentation
Page 101: Q4 2016 GeoTrellis Presentation
Page 102: Q4 2016 GeoTrellis Presentation

Polygonal Summaries

Page 103: Q4 2016 GeoTrellis Presentation

Polygonal Summaries

Page 104: Q4 2016 GeoTrellis Presentation

SPACE FILLING CURVES

Page 105: Q4 2016 GeoTrellis Presentation

Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones per month, per country?

Page 106: Q4 2016 GeoTrellis Presentation

Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones per country?

Page 107: Q4 2016 GeoTrellis Presentation

SPACE FILLING CURVES

Page 108: Q4 2016 GeoTrellis Presentation

Z curve
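
A minimal sketch of a Z curve (Morton) index, assuming the column occupies the even bit positions and the row the odd ones. The bit order is a convention chosen for this example, and `z_index` is a name invented here, not a GeoTrellis call.

```python
def z_index(col, row, bits=16):
    """Interleave the bits of col and row into one integer."""
    index = 0
    for i in range(bits):
        index |= ((col >> i) & 1) << (2 * i)       # col bit -> even position
        index |= ((row >> i) & 1) << (2 * i + 1)   # row bit -> odd position
    return index

# Indices of the four cells in the top-left 2 x 2 block of a grid.
block = [z_index(c, r) for r in range(2) for c in range(2)]
```

Cells that are close in two dimensions tend to get nearby indices, which is what makes the curve useful as a sorted storage key.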

Page 109: Q4 2016 GeoTrellis Presentation

Hilbert Curve

Page 110: Q4 2016 GeoTrellis Presentation

Space Filling Curves

Page 111: Q4 2016 GeoTrellis Presentation
Page 112: Q4 2016 GeoTrellis Presentation

Range Decomposition

70 -> 75, 92 -> 99, 116 -> 121
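
A rough sketch of range decomposition: a rectangular query over tile coordinates becomes a short list of contiguous index ranges that a sorted store can scan. It enumerates every cell, which only suits small boxes; production code typically recurses over quadrants instead. The grid, the bit order, and the names `z_index` and `decompose` are assumptions of this example, and it does not reproduce the ranges on the slide.

```python
def z_index(col, row, bits=16):
    """Interleave the bits of col (even positions) and row (odd positions)."""
    index = 0
    for i in range(bits):
        index |= ((col >> i) & 1) << (2 * i)
        index |= ((row >> i) & 1) << (2 * i + 1)
    return index

def decompose(col_min, col_max, row_min, row_max):
    """Return inclusive (start, end) index ranges covering the box."""
    indices = sorted(
        z_index(c, r)
        for r in range(row_min, row_max + 1)
        for c in range(col_min, col_max + 1)
    )
    ranges = []
    for i in indices:
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)   # extend the current run
        else:
            ranges.append((i, i))             # start a new run
    return ranges

# Columns 2..3, rows 0..2 of a small grid.
ranges = decompose(2, 3, 0, 2)
```

Six cells collapse into two range scans here; fewer, longer ranges mean fewer seeks against the backing store.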

Page 113: Q4 2016 GeoTrellis Presentation
Page 114: Q4 2016 GeoTrellis Presentation

on

Page 115: Q4 2016 GeoTrellis Presentation

on

S3 key: layerName/zoom/[SFC Index (Hilbert or Z order)]

S3 value: Avro-encoded Seq[(K, V)], where

K = key type (e.g. SpatialKey)

V = value type (e.g. Tile)
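
The key layout above can be sketched as simple string assembly. This is only an illustration of the layerName/zoom/index shape; the exact formatting GeoTrellis uses (padding, prefixes) is not shown on the slide, and `tile_key` and the sample values are hypothetical.

```python
def tile_key(layer_name, zoom, sfc_index):
    """Build an object key shaped like layerName/zoom/index."""
    return f"{layer_name}/{zoom}/{sfc_index}"

# Hypothetical layer name, zoom level, and space filling curve index.
key = tile_key("landsat", 12, 3145)
```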

Page 116: Q4 2016 GeoTrellis Presentation

Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones A) per month, B) per country, C) per both?

Page 117: Q4 2016 GeoTrellis Presentation

Why

?

Page 118: Q4 2016 GeoTrellis Presentation

Sharding raster data across the cluster

Caching operation results across cluster

HDFS support

Advanced fault tolerance

Advanced task scheduling

Page 119: Q4 2016 GeoTrellis Presentation

Source: http://cs.calvin.edu/courses/cs/374/exercises/12/lab/MapReduceWordCount.png

Page 120: Q4 2016 GeoTrellis Presentation

Say we have a large set of imagery, and would like to apply two filters to each band:

First, we want to apply a simple threshold filter: if a value is above 10,000, we want to discard it

Second, we would like to apply a 5 x 5 median filter.

Example Problem: Filtering
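
A single-band, single-machine sketch of the two filters, in plain Python rather than GeoTrellis. Assumptions of this example: a discarded value becomes `None` (standing in for NoData), the median ignores discarded cells, and the window is clipped at the raster edge.

```python
from statistics import median

NODATA = None  # stand-in for a raster NoData value

def threshold(raster, limit=10_000):
    """Discard (set to NODATA) every value above the limit."""
    return [[v if v <= limit else NODATA for v in row] for row in raster]

def focal_median(raster, size=5):
    """Median over a size x size window, skipping NODATA cells."""
    half = size // 2
    rows, cols = len(raster), len(raster[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                raster[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
                if raster[rr][cc] is not NODATA
            ]
            out_row.append(median(window) if window else NODATA)
        out.append(out_row)
    return out

# Hypothetical 5 x 5 band with one outlier in the centre.
band = [[100] * 5 for _ in range(5)]
band[2][2] = 20_000
filtered = focal_median(threshold(band))
```

The threshold is a local operation and the median is a focal one, so only the second step needs data from neighbouring tiles once the imagery is split across nodes.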

Page 121: Q4 2016 GeoTrellis Presentation
Page 122: Q4 2016 GeoTrellis Presentation
Page 123: Q4 2016 GeoTrellis Presentation
Page 124: Q4 2016 GeoTrellis Presentation

(0, 0) (1, 0) (2, 0)

(0, 1) (1, 1) (2, 1)

(0, 2) (1, 2) (2, 2)

Page 125: Q4 2016 GeoTrellis Presentation

(0, 0) (1, 0) (2, 0)

(0, 1) (1, 1) (2, 1)

(0, 2) (1, 2) (2, 2)

Node 1

Node 2

Node 3

Page 126: Q4 2016 GeoTrellis Presentation

(0, 0) (1, 0) (2, 0)

(0, 1) (1, 1) (2, 1)

(0, 2) (1, 2) (2, 2)

Node 1

Node 2

Node 3

Page 127: Q4 2016 GeoTrellis Presentation

(0, 0) (1, 0) (2, 0)

(0, 1) (1, 1) (2, 1)

(0, 2) (1, 2) (2, 2)

Node 1

Node 2

Node 3

Page 128: Q4 2016 GeoTrellis Presentation

(0, 1) (1, 1) (2, 1)

Node 1

Node 2

Node 3

Page 129: Q4 2016 GeoTrellis Presentation

(0, 1) (1, 1) (2, 1)

Node 1

Node 2

Node 3

Page 130: Q4 2016 GeoTrellis Presentation

(c, r)

Page 131: Q4 2016 GeoTrellis Presentation

Example Problem: Querying

We want to retrieve all imagery for the city of Rio de Janeiro taken in March 2016, find the maximum NDVI values for each pixel and save it as a GeoTiff.
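
The per-pixel part of this query can be sketched in plain Python. NDVI is (NIR - Red) / (NIR + Red); the maximum is then taken cell by cell across scenes. The spatial and temporal query and the GeoTiff write are left out, and the tiny one-row rasters below are made-up values.

```python
def ndvi(red, nir):
    """NDVI per cell; None where NIR + Red is zero."""
    return [
        [(n - r) / (n + r) if (n + r) != 0 else None
         for r, n in zip(red_row, nir_row)]
        for red_row, nir_row in zip(red, nir)
    ]

def pixelwise_max(rasters):
    """Largest non-None value at each cell across all rasters."""
    rows, cols = len(rasters[0]), len(rasters[0][0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            values = [ras[r][c] for ras in rasters if ras[r][c] is not None]
            out_row.append(max(values) if values else None)
        out.append(out_row)
    return out

# Two hypothetical scenes over the same 1 x 2 pixel area.
scene_a = ndvi(red=[[1.0, 2.0]], nir=[[3.0, 2.0]])
scene_b = ndvi(red=[[3.0, 1.0]], nir=[[1.0, 3.0]])
max_ndvi = pixelwise_max([scene_a, scene_b])
```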

Page 132: Q4 2016 GeoTrellis Presentation
Page 133: Q4 2016 GeoTrellis Presentation

What are uses of GeoTrellis?

Page 134: Q4 2016 GeoTrellis Presentation
Page 135: Q4 2016 GeoTrellis Presentation

100 spot instance m3.xlarge workers @ $0.04 / hr = $4.00 / hr

400 CPUs / ≈1.5 TB memory

1 master m3.xlarge on-demand instance @ $0.26 / hr

EMR cluster charge, $0.07 / hr

$4.37 / hr

Rendering elevation with hillshade + NLCD on AWS EMR

Page 136: Q4 2016 GeoTrellis Presentation
Page 137: Q4 2016 GeoTrellis Presentation
Page 138: Q4 2016 GeoTrellis Presentation

NED 1/3 arc second + NLCD

Page 139: Q4 2016 GeoTrellis Presentation

NED 1/3 arc second + NLCD

Page 140: Q4 2016 GeoTrellis Presentation

NED 1/3 arc second + NLCD

Page 141: Q4 2016 GeoTrellis Presentation

GLOBAL CIRCULATION MODELS

Models for predicting world temperature and precipitation.

Page 142: Q4 2016 GeoTrellis Presentation

GLOBAL CIRCULATION MODELS

Page 143: Q4 2016 GeoTrellis Presentation

NASA NEX Downscaled Climate Projections (NEX-DCP30)

• Monthly data over conterminous US

• Historical from 1950 - 2006

• 4 RCP scenarios from 2006 - 2099

• 8190 netCDF files on S3 - s3://nasanex/NEX-DCP30

• 15.3 TB in compressed GeoTiff tiles.

• RCP 8.5, max for datatype/model combo: 90.92 GB

Page 144: Q4 2016 GeoTrellis Presentation
Page 145: Q4 2016 GeoTrellis Presentation

Landsat NDVI/NDWI change detection demo

Page 146: Q4 2016 GeoTrellis Presentation

Static vs Dynamic

Page 147: Q4 2016 GeoTrellis Presentation

Serving static data pre-processed through a batch transformation pipeline vs. serving data dynamically processed on demand from unprocessed source data

Page 148: Q4 2016 GeoTrellis Presentation

Static vs Dynamic

GeoTrellis systems tend to have two major components:

A batch pre-processing pipeline, which processes large amounts of data into some static data at rest.

A dynamic pipeline which processes data at the time the user requests it.

Page 149: Q4 2016 GeoTrellis Presentation

“Raw” Data

Served Data

Processing Pipeline

Page 150: Q4 2016 GeoTrellis Presentation

“Raw” Data

Served Data

Completely dynamic

Application Data

Processing at request time

Page 151: Q4 2016 GeoTrellis Presentation

“Raw” Data

Served Data

Completely static

Batch data pre-processing

Application Data

Page 152: Q4 2016 GeoTrellis Presentation

“Raw” Data

Served Data

Application Data

Mix of static and dynamic

Batch data pre-processing

Processing at request time

Page 153: Q4 2016 GeoTrellis Presentation

“Raw” Data

Served Data

Application Data

Mix of static and dynamic

Ingest/ETL Server

Page 154: Q4 2016 GeoTrellis Presentation

“Raw” Data

Served Data

Application Data

More static

Faster to serve, less flexibility

Page 155: Q4 2016 GeoTrellis Presentation

“Raw” Data

Served Data

Application Data

More dynamic

More flexible, slower to serve

Page 156: Q4 2016 GeoTrellis Presentation

Ingesting Landsat data

Landsat images are pulled from S3 or Google’s public Earth Engine storage.

In a Spark job run on EMR, these images are reprojected, tiled, indexed, and saved to Accumulo or HDFS.

The indexed tile set is now ready to be used by the server application.

Page 157: Q4 2016 GeoTrellis Presentation

Landsat GeoTiffs

on S3

PNGs, JSON

EPSG:3857 tiled imagery in Accumulo

Ingest/ETL Server

Page 158: Q4 2016 GeoTrellis Presentation

Landsat GeoTiffs

on S3

PNGs, JSON

EPSG:3857 tiled imagery in Accumulo

Ingest/ETL Server

Page 159: Q4 2016 GeoTrellis Presentation

Landsat GeoTiffs

on S3

PNGs, JSON

EPSG:3857 tiled imagery in Accumulo

Ingest/ETL Server

Page 160: Q4 2016 GeoTrellis Presentation

DEPLOYMENT

Page 161: Q4 2016 GeoTrellis Presentation

Example Deployment

Page 162: Q4 2016 GeoTrellis Presentation

Servicing User Requests

Page 163: Q4 2016 GeoTrellis Presentation
Page 164: Q4 2016 GeoTrellis Presentation

ROAD MAP

Page 165: Q4 2016 GeoTrellis Presentation

Release Schedule

v1.0 Q4 2016

v1.1 Q2 2017

Graduation

Page 166: Q4 2016 GeoTrellis Presentation

Rasters, Vector, VectorTiles, Point Cloud +

ROADMAP v1.1

w/

Page 167: Q4 2016 GeoTrellis Presentation
Page 168: Q4 2016 GeoTrellis Presentation
Page 169: Q4 2016 GeoTrellis Presentation

DOCUMENTATION!

Page 170: Q4 2016 GeoTrellis Presentation

IMPROVED DEPLOYMENT WITH

Page 171: Q4 2016 GeoTrellis Presentation
Page 172: Q4 2016 GeoTrellis Presentation

Integration work

Page 173: Q4 2016 GeoTrellis Presentation

VECTOR TILES

Image: osm2vectortile

Page 174: Q4 2016 GeoTrellis Presentation

POINT CLOUD

Page 175: Q4 2016 GeoTrellis Presentation

MACHINE LEARNING PIPELINES

http://blog.tomnod.com/finding-pools-with-deep-learning

Page 176: Q4 2016 GeoTrellis Presentation