Q4 2016 GeoTrellis Presentation

What we’ll be covering

What GeoTrellis is, what it can do

Demo of GeoTrellis in action

Talk about what the next steps for GeoTrellis are, look into some of the possible use cases for GeoTrellis that we’re excited about, and talk about our roadmap

Feel free to ask questions throughout!

Where did GeoTrellis come from?

2011 - 2013

2013 - Present

What is GeoTrellis?

GeoTrellis

a Scala library for geospatial data types and operations.

gives Spark geospatial capabilities

stores and queries raster data on HDFS, S3, Accumulo, and Cassandra (HBase soon)

Geo +

Rasters +

Rasters, some Vector +

v1.0 Q4 2016

Rasters, Vector, VectorTiles, Point Cloud +

ROADMAP v1.1

w/

Vector Data with GeoTrellis (non-Spark)

Wraps JTS

GeoJson, WKT, WKB reading/writing

Reprojection (Proj4j)

Kriging Interpolation

Rasters with GeoTrellis (non-Spark)

Read GeoTiffs

Map Algebra (local, focal, zonal)

Polygonal Summaries

Generally transform and combine raster data

Kernel Density, rasterization, vectorization

Get histograms

Render via color breaks

GeoTrellis & Spark

Ingest data to local file system, HDFS, Accumulo, S3, or Cassandra

Distributed computations of Spatial and Spatio-temporal raster data

Map algebra on distributed tile sets

General ways to transform and combine distributed tile sets

BACKGROUND

PROCESSING GEOSPATIAL DATA @ SCALE

Geospatial Data

Core of GIS (geographic information systems)

Raster (images, weather data)

Vector (points of interest, country boundaries)

VectorTiles, Point Cloud

Raster Data

Vector Data (Points)

Vector Data (Lines)

Vector Data (Polygons)

Source: https://ryouready.wordpress.com/2009/11/16/infomaps-using-r-visualizing-german-unemployment-rates-by-color-on-a-map/

Vector Data

PROCESSING GEOSPATIAL DATA @ SCALE

Contains

Heatmap (Kernel Density)

Zonal Statistics

Feature Extraction (Image Segmentation)

Source: http://www.professeurs.polymtl.ca/christopher.pal/

Map Algebra

Local Operation

Focal Operation
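
As a minimal sketch of the two operation types named above (plain Python on nested lists, not the GeoTrellis API): a local operation computes each output cell from the cells at the same position in the input rasters, while a focal operation computes each output cell from a neighborhood around it. The names `local_add` and `focal_mean` are illustrative assumptions.

```python
def local_add(a, b):
    """Local operation: combine two rasters cell by cell."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def focal_mean(raster, radius=1):
    """Focal operation: mean over a square neighborhood, clipped at the edges."""
    rows, cols = len(raster), len(raster[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Collect every cell within `radius` of (r, c) that lies inside the raster.
            window = [
                raster[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


elevation = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
offset = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]

print(local_add(elevation, offset))
print(focal_mean(elevation)[1][1])
```

A zonal operation follows the same pattern, except that cells are grouped by a zone raster or polygon rather than by position or neighborhood.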

Map Algebra in GeoTrellis

PROCESSING GEOSPATIAL DATA WITH GEOTRELLIS

Polygonal Summary Statistics
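
A polygonal summary reduces the raster cells that fall inside each polygon to a single statistic. The sketch below assumes the polygons have already been rasterized into a grid of zone ids (`None` meaning "outside every polygon"); `zonal_mean` and the sample data are hypothetical, and this is plain Python rather than GeoTrellis code.

```python
from collections import defaultdict


def zonal_mean(values, zones):
    """Mean of `values` per zone id; cells whose zone is None are skipped."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is not None:
                totals[zone] += value
                counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


values = [[2, 4, 6], [8, 10, 12]]
zones = [["a", "a", None], ["b", "b", "b"]]

print(zonal_mean(values, zones))
```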

PROCESSING GEOSPATIAL DATA @ SCALE

NED 1/3 arc second

• 170 × 180 km

• 2 GB each

• 11 bands

• 700 scenes per day

• 1.4 TB / day

• 255,500 scenes / year

• ≈ 0.5 PB / year

Landsat 8

Landsat 8 on AWS

• All Landsat 8 scenes from 2015 and beyond.

• Selection of cloud-free scenes from 2013 and 2014.

Landsat 8 on AWS

645,763 scenes

Landsat 8 on AWS

≈1 Petabyte

64 GB

32 Landsat 8 Scenes

This many people’s phones could hold all the Landsat 8 data that AWS is hosting.
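
The Landsat figures above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 1,000 GB) and the nominal 2 GB per scene; with those assumptions the yearly volume comes to roughly half a petabyte, and a smaller effective scene size (for example after compression) would lower it.

```python
SCENE_SIZE_GB = 2          # nominal size of one scene
SCENES_PER_DAY = 700
PHONE_CAPACITY_GB = 64
SCENES_ON_AWS = 645_763

daily_tb = SCENE_SIZE_GB * SCENES_PER_DAY / 1_000
scenes_per_year = SCENES_PER_DAY * 365
yearly_pb = scenes_per_year * SCENE_SIZE_GB / 1_000_000

# How many scenes fit on one phone, and how many phones hold the whole archive.
scenes_per_phone = PHONE_CAPACITY_GB // SCENE_SIZE_GB
phones_needed = -(-SCENES_ON_AWS // scenes_per_phone)  # ceiling division

print(daily_tb, scenes_per_year, round(yearly_pb, 2))
print(scenes_per_phone, phones_needed)
```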

PROCESSING GEOSPATIAL DATA @ SCALE

Nutch: a project to build a better search engine, started in the early 2000s.

Worked for small datasets, but was not scalable.

The Google papers

After reading the papers, Nutch developers added a distributed file system and MapReduce model to Nutch.

In 2006, those portions were spun out of Nutch to form…

Apache Hadoop

Heavily supported by Yahoo, which moved its large data processing to Hadoop.

By 2007, Twitter, Facebook, LinkedIn and many others were doing serious work with Hadoop.

In 2008, Hadoop graduated to a top level Apache project.

Hadoop

Source: http://cs.calvin.edu/courses/cs/374/exercises/12/lab/MapReduceWordCount.png
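
The word-count pattern that is commonly used to introduce MapReduce can be sketched in a single process as three phases: map, shuffle, and reduce. This is an illustration of the programming model only, not Hadoop code; the function names and the sample input are assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(counts)
```

In Hadoop the map and reduce calls run on different machines and the shuffle moves data over the network; here everything happens in one list and one dictionary.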

Matei Zaharia

Worked with Hadoop at UC Berkeley

Noticed Hadoop was not a good fit for Machine Learning algorithms and other iterative models.

So in 2009, he created…

Open sourced in 2010 under BSD license

Maintained by UC Berkeley’s AMPLab

Donated to the Apache Software Foundation in 2013 and relicensed as Apache 2.0

Graduated to a top level Apache project in 2014

Apache Spark

a distributed computation engine.

An API that lets you work with distributed data as a collection.

Written in Scala, with language bindings for use with Java, Python, and R.

Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones?

Diagram: an HDFS Name Node with three Data Nodes, and an Accumulo Master with three Tablet Servers

Accumulo

BigTable clone (columnar database)

Records stored on HDFS

Lexicographically sorted table index
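
A lexicographically sorted index is what makes range scans cheap: all keys between a start and an end key sit next to each other. The sketch below imitates that idea with an in-memory sorted list and binary search; `SortedTable` is a hypothetical stand-in, not the Accumulo client API. Note that numeric parts of keys are zero-padded, because as strings "10" sorts before "2".

```python
from bisect import bisect_left, bisect_right


class SortedTable:
    """Toy key/value table whose rows are kept in lexicographic key order."""

    def __init__(self, rows):
        self._rows = sorted(rows, key=lambda kv: kv[0])
        self._keys = [key for key, _ in self._rows]

    def scan(self, start, end):
        """Return all rows with start <= key <= end using two binary searches."""
        lo = bisect_left(self._keys, start)
        hi = bisect_right(self._keys, end)
        return self._rows[lo:hi]


table = SortedTable([
    ("tile:0003", "c"),
    ("tile:0001", "a"),
    ("tile:0010", "d"),
    ("tile:0002", "b"),
])

print(table.scan("tile:0002", "tile:0003"))
```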

Apache Accumulo

Created by the NSA in 2008

Donated to the Apache Software Foundation in 2011

Graduated to a top level project in 2012

2006

(Sec. 929) Prohibits any DOD component from utilizing the cloud computing database developed by the National Security Agency (NSA) and known as "Accumulo" after the end of FY2013, unless the DOD CIO certifies that: (1) there are no viable commercial open source databases that have such security features, or (2) Accumulo itself has become a successful open source database project. Requires DOD and intelligence community officials to coordinate the use by DOD components of cloud computing infrastructure and services offered by the intelligence community for purposes other than intelligence analysis.

Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones per month?

PROCESSING GEOSPATIAL DATA @ SCALE

Hey Flyers Fans, can you take the average pixel value of each scene’s band and derive an EPSG:3857 tile set of PNGs to be served on web maps?

How does GeoTrellis work?

Polygonal Summaries

SPACE FILLING CURVES

Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones per month, per country?

Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones per country?

SPACE FILLING CURVES

Z curve

Hilbert Curve

Space Filling Curves

Range Decomposition

70 -> 75, 92 -> 99, 116 -> 121
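
A space filling curve maps a two-dimensional tile key to a single number, and range decomposition turns a rectangular query into a few contiguous runs of those numbers. The sketch below uses a Z-order (Morton) curve built by bit interleaving. It enumerates every tile in the box, which is fine for illustration only, and it is not intended to reproduce the ranges shown above; `z_index` and `decompose` are hypothetical names, not GeoTrellis functions.

```python
def z_index(col, row, bits=16):
    """Interleave the bits of col and row into one Z-order (Morton) index."""
    index = 0
    for i in range(bits):
        index |= ((col >> i) & 1) << (2 * i)       # col bits go to even positions
        index |= ((row >> i) & 1) << (2 * i + 1)   # row bits go to odd positions
    return index


def decompose(col_min, col_max, row_min, row_max):
    """Turn an inclusive bounding box of tiles into contiguous index ranges."""
    indices = sorted(
        z_index(c, r)
        for c in range(col_min, col_max + 1)
        for r in range(row_min, row_max + 1)
    )
    ranges = []
    for idx in indices:
        if ranges and idx == ranges[-1][1] + 1:
            ranges[-1][1] = idx          # extend the current run
        else:
            ranges.append([idx, idx])    # start a new run
    return [tuple(r) for r in ranges]


print(decompose(0, 1, 0, 2))
```

Each resulting range can be read with one sequential scan of a sorted key space, so fewer and longer ranges mean fewer round trips to the store.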

on

on

S3 key: layerName/zoom/[SFC index (Hilbert or Z-order)]

S3 value: Avro-encoded Seq[(K, V)], where

K = key type (e.g. SpatialKey)

V = value type (e.g. Tile)

Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones A) per month, B) per country, C) per both?

Why Spark?

Sharding raster data across the cluster

Caching operation results across cluster

HDFS support

Advanced fault tolerance

Advanced task scheduling

Source: http://cs.calvin.edu/courses/cs/374/exercises/12/lab/MapReduceWordCount.png

Say we have a large set of imagery, and would like to apply two filters to each band:

First, we want to apply a simple threshold filter: if a value is above 10,000, we want to discard it

Second, we would like to apply a 5 x 5 median filter.

Example Problem: Filtering
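
On a single band held in memory, the two filters could be sketched as follows. This is plain Python on nested lists, assuming that "discard" means replacing the value with a no-data marker (`None` here) and that the median ignores no-data cells and uses a clipped window at the edges; none of this is taken from the GeoTrellis API.

```python
from statistics import median

NODATA = None


def threshold(band, limit=10_000):
    """Discard (set to NODATA) every value above `limit`."""
    return [
        [v if v is not NODATA and v <= limit else NODATA for v in row]
        for row in band
    ]


def median_filter(band, size=5):
    """size x size median; NODATA cells are ignored, edges use a clipped window."""
    half = size // 2
    rows, cols = len(band), len(band[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                band[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
                if band[rr][cc] is not NODATA
            ]
            out_row.append(median(window) if window else NODATA)
        out.append(out_row)
    return out


band = [[100] * 5 for _ in range(5)]
band[2][2] = 20_000  # one spike above the threshold

cleaned = threshold(band)
filtered = median_filter(cleaned)

print(cleaned[2][2], filtered[2][2])
```

The threshold is a local operation and needs only the cell itself. The median filter is a focal operation, so on a tiled, distributed raster each tile also needs cells from its neighboring tiles.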

Diagram: a 3 x 3 grid of tiles keyed by (c, r), from (0, 0) to (2, 2), distributed across Node 1, Node 2, and Node 3

Example Problem: Querying

We want to retrieve all imagery for the city of Rio de Janeiro taken in March 2016, find the maximum NDVI values for each pixel and save it as a GeoTiff.
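
The per-pixel part of this query can be sketched in plain Python. NDVI is (NIR - Red) / (NIR + Red), and the result keeps the largest NDVI seen at each pixel across all scenes. The spatial and temporal query and the GeoTiff export are left out, and the tiny sample scenes are made up for illustration.

```python
def ndvi(red, nir):
    """NDVI = (NIR - Red) / (NIR + Red); None where the denominator is zero."""
    return [
        [(n - r) / (n + r) if (n + r) != 0 else None for r, n in zip(red_row, nir_row)]
        for red_row, nir_row in zip(red, nir)
    ]


def pixelwise_max(rasters):
    """Per-pixel maximum across rasters of equal size, skipping None cells."""
    rows, cols = len(rasters[0]), len(rasters[0][0])
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            values = [ras[r][c] for ras in rasters if ras[r][c] is not None]
            row.append(max(values) if values else None)
        result.append(row)
    return result


# Two hypothetical scenes, each a (red band, near-infrared band) pair.
scenes = [
    ([[1, 2]], [[3, 2]]),
    ([[1, 1]], [[1, 3]]),
]

max_ndvi = pixelwise_max([ndvi(red, nir) for red, nir in scenes])
print(max_ndvi)
```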

What are uses of GeoTrellis?

100 spot instance m3.xlarge workers @ $0.04 / hr = $4.00 / hr

400 CPUs / ≈1.5 TB memory

1 master m3.xlarge on-demand instance @ $0.26 / hr

EMR cluster charge, $0.07 / hr

≈ $4.33 / hr in total

Rendering elevation with hillshade + NLCD on AWS EMR

NED 1/3 arc second + NLCD

GLOBAL CIRCULATION MODELS

Models for predicting world temperature and precipitation.

NASA NEX Downscaled Climate Projections (NEX-DCP30)

• Monthly data over conterminous US

• Historical from 1950 - 2006

• 4 RCP scenarios from 2006 - 2099

• 8190 netCDF files on S3 - s3://nasanex/NEX-DCP30

• 15.3 TB in compressed GeoTiff tiles.

• RCP 8.5, max for datatype/model combo: 90.92 GB

Landsat NDVI/NDWI change detection demo

Static vs Dynamic

Serving static data pre-processed through a batch transformation pipeline vs. serving data processed dynamically, on demand, from unprocessed source data

GeoTrellis systems tend to have two major components:

A batch pre-processing pipeline, which processes large amounts of data into some static data at rest.

A dynamic pipeline which processes data at the time the user requests it.
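
The trade-off between the two components can be sketched with a toy tile service. Everything here is hypothetical (the `RAW` store, the `render` stand-in, the function names); the point is only that the static path does its work ahead of time and serves by lookup, while the dynamic path does its work per request and can therefore accept request-time parameters.

```python
# Hypothetical raw tiles keyed by (col, row).
RAW = {(0, 0): [1, 2, 3], (1, 0): [4, 5, 6]}


def render(values, scale=2):
    """Stand-in for an expensive transformation such as rendering a tile."""
    return [v * scale for v in values]


# Static: a batch job processes everything up front; serving is a plain lookup.
STATIC_STORE = {key: render(values) for key, values in RAW.items()}


def serve_static(key):
    return STATIC_STORE[key]


# Dynamic: nothing is precomputed; each request processes raw data on demand.
def serve_dynamic(key, scale=2):
    return render(RAW[key], scale)


print(serve_static((0, 0)))
print(serve_dynamic((0, 0), scale=3))
```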

Diagram: “Raw” Data, Processing Pipeline, Application Data, Served Data

Completely dynamic: processing at request time

Completely static: batch data pre-processing

Mix of static and dynamic: batch data pre-processing (Ingest/ETL) plus processing at request time (Server)

More static: faster to serve, less flexibility

More dynamic: more flexible, slower to serve

Ingesting Landsat data

Landsat images are pulled off of S3 or Google’s public Earth Engine storage.

In a Spark job run on EMR, these images are reprojected, tiled, indexed, and saved off to Accumulo or HDFS.

The indexed tile set is now ready to be used by the server application.
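
The tiling step assigns every location to a tile in a fixed grid. As a rough sketch, the function below computes which Web Mercator (EPSG:3857) tile contains a longitude/latitude at a given zoom level, using the widely used "slippy map" tile scheme. It is assumed here for illustration and is not taken from the GeoTrellis ingest code; reprojection of the pixels themselves is not shown.

```python
import math


def tile_key(lon, lat, zoom):
    """(col, row) of the Web Mercator tile containing lon/lat, given in degrees."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    col = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    row = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or extreme latitudes stay inside the grid.
    return min(max(col, 0), n - 1), min(max(row, 0), n - 1)


print(tile_key(0.0, 0.0, 1))
print(tile_key(-180.0, 85.0, 0))
```

A (col, row) pair like this is the kind of spatial key that a space filling curve then turns into a single index for storage.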

Diagram: Landsat GeoTiffs on S3 -> Ingest/ETL -> EPSG:3857 tiled imagery in Accumulo -> Server -> PNGs, JSON

DEPLOYMENT

Example Deployment

Servicing User Requests

ROAD MAP

Release Schedule

v1.0 Q4 2016

v1.1 Q2 2017

Graduation

Rasters, Vector, VectorTiles, Point Cloud +

ROADMAP v1.1

w/

DOCUMENTATION!

IMPROVED DEPLOYMENT WITH

Integration work

VECTOR TILES

Image: osm2vectortile

POINT CLOUD

MACHINE LEARNING PIPELINES

http://blog.tomnod.com/finding-pools-with-deep-learning