Neo4j Import Webinar, Mark Needham (@markhneedham), 30th July 2015


Page 1: Neo4j Import Webinar

Neo4j Import Webinar
Mark Needham (@markhneedham)
30th July 2015

Page 2: Neo4j Import Webinar


Chicago Crime dataset

Page 3: Neo4j Import Webinar


Chicago Crime dataset

Page 4: Neo4j Import Webinar


The goal

[Diagram: the Chicago Crime CSV file is imported into Neo4j]

Page 5: Neo4j Import Webinar


Exploring the data

Page 6: Neo4j Import Webinar


Exploring the data

LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
RETURN row
LIMIT 1
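A natural follow-up before sketching a model (not on the slide; a sketch reusing the same file) is to see which values the `Primary Type` column takes and how often:

LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
RETURN row.`Primary Type` AS primaryType, count(*) AS rows
ORDER BY rows DESC
LIMIT 10;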

Page 7: Neo4j Import Webinar


Exploring the data

Page 8: Neo4j Import Webinar


Exploring the data

Page 9: Neo4j Import Webinar


Sketch a rough initial model

Page 10: Neo4j Import Webinar


Import a sample: Crimes

LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (crime:Crime {
  id: row.ID,
  description: row.Description,
  caseNumber: row.`Case Number`,
  arrest: row.Arrest,
  domestic: row.Domestic
});

Page 11: Neo4j Import Webinar


Import a sample: Crimes

We can do better than storing everything on the Crime node: split attributes such as the primary type out into their own nodes, as the next slides show.

Page 12: Neo4j Import Webinar


Import a sample: Crime Types

LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (:CrimeType { name: row.`Primary Type` });

Page 13: Neo4j Import Webinar


Import a sample: Crimes -> Crime Types

LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MATCH (crime:Crime { id: row.ID, description: row.Description })
MATCH (crimeType:CrimeType { name: row.`Primary Type` })
MERGE (crime)-[:TYPE]->(crimeType);
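The same two-pass pattern extends to the other attributes we want to split out, for example locations. A sketch, assuming the location name comes from the file's `Location Description` column and using AT_LOCATION as an assumed relationship name:

// first pass: create the Location nodes
LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (:Location { name: row.`Location Description` });

// second pass: connect crimes to their locations
LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MATCH (crime:Crime { id: row.ID })
MATCH (location:Location { name: row.`Location Description` })
MERGE (crime)-[:AT_LOCATION]->(location);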

Page 14: Neo4j Import Webinar


Add indexes

CREATE INDEX ON :Label(property)

Page 15: Neo4j Import Webinar


Add indexes

CREATE INDEX ON :Label(property)

CREATE INDEX ON :Crime(id);
CREATE INDEX ON :CrimeType(name);
CREATE INDEX ON :Location(name);
...
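Since crime ids are expected to be unique, a uniqueness constraint is an alternative worth considering for :Crime(id): it backs MERGE with an index and also guards against duplicate nodes. A sketch in Neo4j 2.x syntax:

CREATE CONSTRAINT ON (crime:Crime) ASSERT crime.id IS UNIQUE;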

Page 16: Neo4j Import Webinar


Periodic Commit

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
MERGE (crime:Crime { id: row.ID, description: row.Description });

Page 17: Neo4j Import Webinar


Periodic Commit

• Neo4j keeps all transaction state in memory, which becomes problematic for large CSV files
• USING PERIODIC COMMIT flushes the transaction after a certain number of rows
• The default is 1000 rows, but it's configurable (see the sketch below)
• Currently only works with LOAD CSV
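For example, to flush every 10,000 rows instead of the default (a sketch reusing the earlier query; the batch size is only an illustration):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
MERGE (crime:Crime { id: row.ID, description: row.Description });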

Page 18: Neo4j Import Webinar


Avoiding the Eager

• Cypher has an Eager operator which will bring forward parts of a query to ensure safety
• We don't want to see this operator when we're importing data – it will slow things down a lot
• Check the query plan for an Eager step before running a big import (see the sketch below)
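A sketch of how to check for it, assuming the Chicago crimes file from earlier: combining several MERGEs in a single LOAD CSV statement is the kind of query that can introduce an Eager step, whereas the separate passes shown on the earlier slides keep the import streaming. Prefixing the statement with EXPLAIN shows the plan without running it:

EXPLAIN
LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
MERGE (crime:Crime { id: row.ID })
MERGE (crimeType:CrimeType { name: row.`Primary Type` })
MERGE (crime)-[:TYPE]->(crimeType);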

Page 19: Neo4j Import Webinar


LOAD CSV in summary

• ETL power tool
• Built into Neo4j since version 2.1
• Can load data from any URL
• Good for medium-sized data (up to 10M rows)

Page 20: Neo4j Import Webinar


Bulk loading an initial data set

• Introducing the Neo4j Import Tool
• Find it in the bin folder of your Neo4j download
• Used for large initial data sets
• Skips the transactional layer of Neo4j and writes the store files directly

Page 21: Neo4j Import Webinar


Expects files in a certain format

Nodes (header rows):
:ID(Crime),:LABEL,description
:ID(Beat),:LABEL

Relationships (header row):
:START_ID(Crime),:END_ID(Beat),:TYPE
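To make that concrete, here is a hypothetical sketch of what the files could contain (the id values, the THEFT description and the ON_BEAT relationship type are made up for illustration and are not from the slides):

crimes.csv
:ID(Crime),:LABEL,description
1234567,Crime,THEFT

beats.csv
:ID(Beat),:LABEL
0915,Beat

crimesBeats.csv
:START_ID(Crime),:END_ID(Beat),:TYPE
1234567,0915,ON_BEAT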

Page 22: Neo4j Import Webinar


What we have…

Page 23: Neo4j Import Webinar


[Diagram: the Chicago Crime CSV file requires a Translation Phase to produce Neo4j-ready CSV files]

Page 24: Neo4j Import Webinar


Spark all the things

[Diagram: the Chicago Crime CSV file is processed by a Spark job, which spits out Neo4j-ready CSV files that are imported into Neo4j]

Page 25: Neo4j Import Webinar


The Spark Job

Page 26: Neo4j Import Webinar


The Spark Job

Page 27: Neo4j Import Webinar


Submitting the Spark Job

./spark-1.3.0-bin-hadoop1/bin/spark-submit \
  --driver-memory 5g \
  --class GenerateCSVFiles \
  --master local[8] \
  target/scala-2.10/playground_2.10-1.0.jar

real  1m25.506s
user  8m2.183s
sys   0m24.267s


Page 29: Neo4j Import Webinar


The generated files

$ ls -1 tmp/*.csv
tmp/beats.csv
tmp/crimeDates.csv
tmp/crimes.csv
tmp/crimesBeats.csv
tmp/crimesDates.csv
tmp/crimesLocations.csv
tmp/crimesPrimaryTypes.csv
tmp/dates.csv
tmp/locations.csv
tmp/primaryTypes.csv

Page 30: Neo4j Import Webinar


Importing into Neo4j

DATA=tmp
NEO=./neo4j-enterprise-2.2.3

$NEO/bin/neo4j-import \
  --into $DATA/crimes.db \
  --nodes $DATA/crimes.csv \
  --nodes $DATA/beats.csv \
  --nodes $DATA/primaryTypes.csv \
  --nodes $DATA/locations.csv \
  --relationships $DATA/crimesBeats.csv \
  --relationships $DATA/crimesPrimaryTypes.csv \
  --relationships $DATA/crimesLocations.csv \
  --stacktrace

IMPORT DONE in 36s 208ms

Page 31: Neo4j Import Webinar


Enriching the crime graph

Page 32: Neo4j Import Webinar


Enriching the crime graph

Page 33: Neo4j Import Webinar


Enriching the crime graph

Page 34: Neo4j Import Webinar


2 options

• Option 1: translate the JSON to CSV (e.g. with jq) and import it with LOAD CSV
• Option 2: load the JSON directly through a language driver talking to the HTTP API

Page 35: Neo4j Import Webinar


Using py2neo to load JSON into Neo4j

import json
from py2neo import Graph, authenticate

authenticate("localhost:7474", "neo4j", "foobar")
graph = Graph()

# load the crime categories document
with open('categories.json') as data_file:
    document = json.load(data_file)

query = """
WITH {json} AS document
UNWIND document.categories AS category
UNWIND category.sub_categories AS subCategory
MERGE (c:CrimeCategory {name: category.name})
MERGE (sc:SubCategory {code: subCategory.code})
ON CREATE SET sc.description = subCategory.description
MERGE (c)-[:CHILD]->(sc)
"""

print graph.cypher.execute(query, json=document)

Page 36: Neo4j Import Webinar


Enriching the crime graph

Translate from JSON to CSV

Page 37: Neo4j Import Webinar


Enriching the crime graph

Import using LOAD CSV
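The command itself only appears as a screenshot; a hedged sketch of what that LOAD CSV might look like, assuming the translation step produced a categories.csv with category, code and description columns (the file name and column names are assumptions):

LOAD CSV WITH HEADERS FROM "file:///path/to/categories.csv" AS row
MERGE (c:CrimeCategory { name: row.category })
MERGE (sc:SubCategory { code: row.code })
ON CREATE SET sc.description = row.description
MERGE (c)-[:CHILD]->(sc);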

Page 38: Neo4j Import Webinar


Updating the graph

• As new crimes come in we want to update the graph to take them into account

Page 39: Neo4j Import Webinar


Updating the graph

• Import them using the REST Transactional API (see the sketch below)
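The slides don't show the statement, but a parameterised upsert along these lines could be posted to the transactional endpoint (a sketch; the parameter names are assumptions and the 2.x {param} syntax is used):

MERGE (crime:Crime { id: {id} })
ON CREATE SET crime.description = {description},
              crime.caseNumber = {caseNumber}
MERGE (crimeType:CrimeType { name: {primaryType} })
MERGE (crime)-[:TYPE]->(crimeType);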

Page 40: Neo4j Import Webinar


This talk brought to you by…

Page 41: Neo4j Import Webinar


And that’s it…