GoDataDriven, proudly part of the Xebia Group
Real time data driven applications and SQL vs NoSQL databases
Giovanni Lanzani, Data Whisperer
Who am I?
2008-2012: PhD Theoretical Physics
2012-2013: KPMG
2013-Now: GoDataDriven
Feedback
@gglanzani
Real-time, data driven app?
• No store and retrieve;
• Store, {transform, enrich, analyse} and retrieve;
• Real-time: retrieve is not a batch process;
• App: something your mother could use:
SELECT attendees FROM NoSQLMatters WHERE password = '1234';
Get insight about event impact
Challenges
1. Big Data;
2. Privacy;
3. Some real-time analysis;
4. Real-time retrieval.
Is it Big Data?
“Everybody talks about it. Nobody knows how to do it. Everyone thinks everyone else is doing it, so everyone claims they’re doing it…”
Dan Ariely
Is it Big Data?
• Raw logs are in the order of 40TB;
• We use Hadoop for storing, enriching and pre-processing.
2. Privacy
3. (Some) real-time analysis
4. Real-Time Retrieval
• Harder than it looks;
• Large data;
• Retrieval is by giving date, center location + radius.
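To make the "center location + radius" retrieval concrete, here is a minimal sketch of selecting postcodes by great-circle distance. It is not taken from the talk: the `centroids` mapping, its coordinates, and the function names are hypothetical, and it assumes each postcode can be represented by a single (lat, lon) point.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def postcodes_within(centroids, center, radius_km):
    """Return, sorted, the postcodes whose centroid lies within radius_km of center."""
    lat0, lon0 = center
    return sorted(
        postcode
        for postcode, (lat, lon) in centroids.items()
        if haversine_km(lat0, lon0, lat, lon) <= radius_km
    )


# Made-up postcodes and coordinates, for illustration only
centroids = {
    "1012AB": (52.3730, 4.8930),  # close to the chosen center
    "1071AA": (52.3580, 4.8810),  # a couple of km away
    "3011AA": (51.9200, 4.4800),  # tens of km away
}
nearby = postcodes_within(centroids, center=(52.3702, 4.8952), radius_km=10)
```

The resulting list of postcodes is what then has to be matched against the stored data, together with the requested dates.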
Architecture
Front-end (AngularJS) <-> REST / JSON <-> Back-end (python app)
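As an illustration of the back-end half of this architecture, the sketch below shows a Python app answering a REST-style request with JSON. The talk does not say which web framework was used; plain WSGI is an assumption here, and `get_statistics_stub` and the `sbi` query parameter are hypothetical stand-ins.

```python
import json
from urllib.parse import parse_qs


def get_statistics_stub(sbi):
    # Hypothetical stand-in for the real helper; returns fixed numbers
    return {"sbi": sbi, "total": 80, "percentage": 0.4}


def app(environ, start_response):
    """Minimal WSGI app: a request with ?sbi=1 gets the statistics back as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    sbi = int(query.get("sbi", ["0"])[0])
    body = json.dumps(get_statistics_stub(sbi)).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"),
         ("Content-Length", str(len(body)))],
    )
    return [body]
```

Any WSGI server can host `app`; the front-end only needs to know the URL and the shape of the JSON it gets back.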
JS-1
JS-2
Data Example
date        hour  id_activity  postcode  hits  delta  sbi
2013-01-01  12    1234         1234AB    35    22     1
2013-01-08  12    1234         1234AB    45    35     1
2013-01-01  11    2345         5555ZB    2     1      2
2013-01-08  11    2345         5555ZB    55    2      2
helper.py example
def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]  # select * from data where sbi = sbi
    hits = sbi_df.hits.sum()  # select sum(hits) from …
    delta_hits = sbi_df.delta.sum()  # select sum(delta) from …
    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}
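A quick way to check the helper is to run it on the four rows from the Data Example slide. The sketch below repeats the function so it runs on its own, and assumes Python 3 (true division) and pandas.

```python
import pandas as pd

# The four rows shown on the "Data Example" slide
data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 11, 11],
    "id_activity": [1234, 1234, 2345, 2345],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [35, 45, 2, 55],
    "delta": [22, 35, 1, 2],
    "sbi": [1, 1, 2, 2],
})


def get_statistics(data, sbi):
    # Same logic as the slide, repeated so this snippet is self-contained
    sbi_df = data[data.sbi == sbi]
    hits = sbi_df.hits.sum()
    delta_hits = sbi_df.delta.sum()
    percentage = (hits - delta_hits) / delta_hits if delta_hits else 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}


stats = get_statistics(data, sbi=1)
# total = 35 + 45 = 80, delta = 22 + 35 = 57, percentage = (80 - 57) / 57
```

An `sbi` with no rows gives sums of zero, so the guard on `delta_hits` returns a percentage of 0 instead of dividing by zero.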
helper.py example
def get_timeline(data, sbi):
    df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
    # select sum(hits), sum(delta) from data group by date, hour, sbi
    return df_sbi
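The slide's `get_timeline` accepts `sbi` but groups over every `sbi` value. A possible variant, offered only as a sketch, filters on `sbi` first and sums just the numeric `hits` and `delta` columns. The name `get_timeline_for_sbi` and the extra sample row are hypothetical.

```python
import pandas as pd


def get_timeline_for_sbi(data, sbi):
    # select date, hour, sum(hits), sum(delta) from data
    # where sbi = :sbi group by date, hour
    sbi_df = data[data.sbi == sbi]
    return sbi_df.groupby(["date", "hour"])[["hits", "delta"]].sum()


# Sample rows; the second one is made up so that one group holds two rows
data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-01", "2013-01-08", "2013-01-01"],
    "hour": [12, 12, 12, 11],
    "hits": [35, 5, 45, 2],
    "delta": [22, 3, 35, 1],
    "sbi": [1, 1, 1, 2],
})
timeline = get_timeline_for_sbi(data, sbi=1)
```

The result is indexed by `(date, hour)`, which maps naturally onto a time series for the front-end to plot.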
Who has my data?
• First iteration was a (pre)-POC, less data (3GB vs 500GB);
• Time constraints;
• Oops: everything is a pandas df!
Pros and cons of “everything is a df”
Pro:
• Fast!!
• Use what you know
• NO DBA’s!
• We all love CSV’s!
Contra:
• Doesn’t scale;
• Huge startup time;
• NO DBA’s!
• We all hate CSV’s!
If you want to go down this path
• Set the dataframe index wisely;
• Align the data to the index:
source_data.sort_index(inplace=True)
• Beware of modifications of the original dataframe!
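The three pieces of advice on this slide can be sketched together as follows. The column choice for the index and the sample rows are assumptions made for illustration, not the talk's actual schema.

```python
import pandas as pd

source_data = pd.DataFrame({
    "date": ["2013-01-08", "2013-01-01", "2013-01-08", "2013-01-01"],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [45, 35, 55, 2],
})

# Set the index on the columns used for retrieval (assumed: date, postcode)
source_data = source_data.set_index(["date", "postcode"])

# Align the data to the index: sort once, at startup
source_data.sort_index(inplace=True)

# Lookups by the first index level now work on sorted data
one_day = source_data.loc["2013-01-01"]

# Work on a copy so the shared dataframe is not modified by accident
enriched = one_day.copy()
enriched["hits_doubled"] = enriched["hits"] * 2
```

Because the app keeps one dataframe in memory for all requests, any in-place change made while serving one request would leak into every later one; copying before enriching avoids that.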
If you want to go down this path
“The reason pandas is faster is because I came up with a better algorithm”
If you don’t
Front-end (AngularJS) <-> REST <-> Back-end (python app) <-> Database
JSON?
A word about (traditional) databases…
Db: programming language dict
Postgres for data driven apps?
Issues?!
• With a radius of 10km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:
SELECT * FROM datapoints WHERE date IN date_array AND postcode IN postcode_array;
• Even with an index on date and postcode, single queries ran for more than 20 minutes.
Postgres + PostGIS (2.x)
PostGIS is a spatial database extender for PostgreSQL. It supports geographic objects, allowing location queries:
SELECT * FROM datapoints
WHERE ST_DWithin(lon, lat, 1500)
AND dates IN ('2013-02-30', '2013-02-31');
-- every point within 1.5km from (lat, lon), on imaginary dates
-- (shorthand: the real ST_DWithin takes two geometries or geographies plus a distance)
Other db’s?
How we solved it
1. Align data on disk by date;
2. Use the temporary table trick:
CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY);
INSERT INTO tmp (postcodes) VALUES postcode_array;
SELECT * FROM tmp JOIN datapoints d ON d.postcode = tmp.postcodes WHERE d.dt IN dates_array;
3. Lose precision: 1234AB→1234
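The temporary table trick and the loss of precision can be tried end to end from Python. The sketch below uses the standard library's `sqlite3` purely as a stand-in for Postgres so that it runs anywhere; the table contents, the `to_pc4` helper, and the assumption that `datapoints` already stores 4-digit postcodes are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datapoints (dt TEXT, postcode TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?)",
    [
        ("2013-01-01", "1234", 35),
        ("2013-01-08", "1234", 45),
        ("2013-01-01", "5555", 2),
        ("2013-01-01", "9999", 7),
    ],
)


def to_pc4(postcode):
    """Lose precision: 1234AB -> 1234."""
    return postcode[:4]


# Truncating also deduplicates, so far fewer values need to be matched
postcode_array = sorted({to_pc4(pc) for pc in ["1234AB", "1234CD", "5555ZB"]})
dates_array = ["2013-01-01"]

# Put the postcodes in a temporary table and join, instead of a huge IN (...)
conn.execute("CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)")
conn.executemany(
    "INSERT INTO tmp (postcodes) VALUES (?)",
    [(pc,) for pc in postcode_array],
)

placeholders = ", ".join("?" for _ in dates_array)
rows = conn.execute(
    "SELECT d.dt, d.postcode, d.hits FROM tmp "
    "JOIN datapoints d ON d.postcode = tmp.postcodes "
    f"WHERE d.dt IN ({placeholders}) ORDER BY d.postcode",
    dates_array,
).fetchall()
```

The date list stays a short `IN (...)` with bound parameters, while the long postcode list becomes a join against an indexed table.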
Take home messages
1. Geospatial problems are “hard” and can kill your queries;
2. Not everybody has infinite resources: be smart and KISS!
3. SQL or NoSQL? (Size, schema)
We’re hiring / Questions? / Thank you!
Giovanni Lanzani, Data Whisperer