Neo4j Import Webinar, Mark Needham (@markhneedham), 30th July 2015


Page 1: Neo4j Import Webinar

Neo4j Import Webinar
Mark Needham (@markhneedham)
30th July 2015

Page 2: Neo4j Import Webinar


Chicago Crime dataset

Page 3: Neo4j Import Webinar


Chicago Crime dataset

Page 4: Neo4j Import Webinar


The goal

[Diagram: the Chicago Crime CSV file is imported into Neo4j]

Page 5: Neo4j Import Webinar


Exploring the data

Page 6: Neo4j Import Webinar


Exploring the data

LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
RETURN row
LIMIT 1
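A natural follow-up before sketching a model (not on the slide; a sketch reusing the same file) is to see which values the `Primary Type` column takes and how often:

LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
RETURN row.`Primary Type` AS primaryType, count(*) AS rows
ORDER BY rows DESC
LIMIT 10;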

Page 7: Neo4j Import Webinar


Exploring the data

Page 8: Neo4j Import Webinar


Exploring the data

Page 9: Neo4j Import Webinar


Sketch a rough initial model

Page 10: Neo4j Import Webinar


Import a sample: Crimes

LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (crime:Crime {
  id: row.ID,
  description: row.Description,
  caseNumber: row.`Case Number`,
  arrest: row.Arrest,
  domestic: row.Domestic
});

Page 11: Neo4j Import Webinar


Import a sample: Crimes

We can do better than storing everything on the Crime node: split attributes such as the primary type out into their own nodes, as the next slides show.

Page 12: Neo4j Import Webinar


Import a sample: Crime Types

LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (:CrimeType { name: row.`Primary Type` });

Page 13: Neo4j Import Webinar


Import a sample: Crimes -> Crime Types

LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MATCH (crime:Crime { id: row.ID, description: row.Description })
MATCH (crimeType:CrimeType { name: row.`Primary Type` })
MERGE (crime)-[:TYPE]->(crimeType);
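The same two-pass pattern extends to the other attributes we want to split out, for example locations. A sketch, assuming the location name comes from the file's `Location Description` column and using AT_LOCATION as an assumed relationship name:

// first pass: create the Location nodes
LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (:Location { name: row.`Location Description` });

// second pass: connect crimes to their locations
LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MATCH (crime:Crime { id: row.ID })
MATCH (location:Location { name: row.`Location Description` })
MERGE (crime)-[:AT_LOCATION]->(location);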

Page 14: Neo4j Import Webinar


Add indexes

CREATE INDEX ON :Label(property)

Page 15: Neo4j Import Webinar


Add indexes

CREATE INDEX ON :Label(property)

CREATE INDEX ON :Crime(id);
CREATE INDEX ON :CrimeType(name);
CREATE INDEX ON :Location(name);
...
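Since crime ids are expected to be unique, a uniqueness constraint is an alternative worth considering for :Crime(id): it backs MERGE with an index and also guards against duplicate nodes. A sketch in Neo4j 2.x syntax:

CREATE CONSTRAINT ON (crime:Crime) ASSERT crime.id IS UNIQUE;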

Page 16: Neo4j Import Webinar


Periodic Commit

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
MERGE (crime:Crime { id: row.ID, description: row.Description });

Page 17: Neo4j Import Webinar


Periodic Commit

• Neo4j keeps all transaction state in memory, which becomes problematic for large CSV files
• USING PERIODIC COMMIT flushes the transaction after a certain number of rows
• The default is 1000 rows, but it's configurable (see the sketch below)
• Currently only works with LOAD CSV
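For example, to flush every 10,000 rows instead of the default (a sketch reusing the earlier query; the batch size is only an illustration):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
MERGE (crime:Crime { id: row.ID, description: row.Description });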

Page 18: Neo4j Import Webinar


Avoiding the Eager

• Cypher has an Eager operator which will bring forward parts of a query to ensure safety
• We don't want to see this operator when we're importing data – it will slow things down a lot
• Check the query plan for an Eager step before running a big import (see the sketch below)
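A sketch of how to check for it, assuming the Chicago crimes file from earlier: combining several MERGEs in a single LOAD CSV statement is the kind of query that can introduce an Eager step, whereas the separate passes shown on the earlier slides keep the import streaming. Prefixing the statement with EXPLAIN shows the plan without running it:

EXPLAIN
LOAD CSV WITH HEADERS FROM
  "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
MERGE (crime:Crime { id: row.ID })
MERGE (crimeType:CrimeType { name: row.`Primary Type` })
MERGE (crime)-[:TYPE]->(crimeType);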

Page 19: Neo4j Import Webinar


LOAD CSV in summary

• ETL power tool
• Built into Neo4j since version 2.1
• Can load data from any URL
• Good for medium-sized data (up to 10M rows)

Page 20: Neo4j Import Webinar


Bulk loading an initial data set

• Introducing the Neo4j Import Tool
• Find it in the bin folder of your Neo4j download
• Used for large initial data sets
• Skips the transactional layer of Neo4j and writes the store files directly

Page 21: Neo4j Import Webinar


Expects files in a certain format

Nodes (header rows):
:ID(Crime),:LABEL,description
:ID(Beat),:LABEL

Relationships (header row):
:START_ID(Crime),:END_ID(Beat),:TYPE
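To make that concrete, here is a hypothetical sketch of what the files could contain (the id values, the THEFT description and the ON_BEAT relationship type are made up for illustration and are not from the slides):

crimes.csv
:ID(Crime),:LABEL,description
1234567,Crime,THEFT

beats.csv
:ID(Beat),:LABEL
0915,Beat

crimesBeats.csv
:START_ID(Crime),:END_ID(Beat),:TYPE
1234567,0915,ON_BEAT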

Page 22: Neo4j Import Webinar


What we have…

Page 23: Neo4j Import Webinar


[Diagram: the Chicago Crime CSV file requires a Translation Phase to produce Neo4j-ready CSV files]

Page 24: Neo4j Import Webinar


Spark all the things

[Diagram: the Chicago Crime CSV file is processed by a Spark job, which spits out Neo4j-ready CSV files that are imported into Neo4j]

Page 25: Neo4j Import Webinar


The Spark Job

Page 26: Neo4j Import Webinar


The Spark Job

Page 27: Neo4j Import Webinar


Submitting the Spark Job

./spark-1.3.0-bin-hadoop1/bin/spark-submit \
  --driver-memory 5g \
  --class GenerateCSVFiles \
  --master local[8] \
  target/scala-2.10/playground_2.10-1.0.jar

real  1m25.506s
user  8m2.183s
sys   0m24.267s


Page 29: Neo4j Import Webinar


The generated files

$ ls -1 tmp/*.csv
tmp/beats.csv
tmp/crimeDates.csv
tmp/crimes.csv
tmp/crimesBeats.csv
tmp/crimesDates.csv
tmp/crimesLocations.csv
tmp/crimesPrimaryTypes.csv
tmp/dates.csv
tmp/locations.csv
tmp/primaryTypes.csv

Page 30: Neo4j Import Webinar


Importing into Neo4j

DATA=tmp
NEO=./neo4j-enterprise-2.2.3

$NEO/bin/neo4j-import \
  --into $DATA/crimes.db \
  --nodes $DATA/crimes.csv \
  --nodes $DATA/beats.csv \
  --nodes $DATA/primaryTypes.csv \
  --nodes $DATA/locations.csv \
  --relationships $DATA/crimesBeats.csv \
  --relationships $DATA/crimesPrimaryTypes.csv \
  --relationships $DATA/crimesLocations.csv \
  --stacktrace

IMPORT DONE in 36s 208ms

Page 31: Neo4j Import Webinar


Enriching the crime graph

Page 32: Neo4j Import Webinar


Enriching the crime graph

Page 33: Neo4j Import Webinar


Enriching the crime graph

Page 34: Neo4j Import Webinar


2 options

• Option 1: translate the JSON to CSV (e.g. with jq) and import it with LOAD CSV
• Option 2: load the JSON directly through a language driver talking to the HTTP API

Page 35: Neo4j Import Webinar


Using py2neo to load JSON into Neo4j

import json
from py2neo import Graph, authenticate

authenticate("localhost:7474", "neo4j", "foobar")
graph = Graph()

# load the crime categories document
with open('categories.json') as data_file:
    document = json.load(data_file)

query = """
WITH {json} AS document
UNWIND document.categories AS category
UNWIND category.sub_categories AS subCategory
MERGE (c:CrimeCategory {name: category.name})
MERGE (sc:SubCategory {code: subCategory.code})
ON CREATE SET sc.description = subCategory.description
MERGE (c)-[:CHILD]->(sc)
"""

print graph.cypher.execute(query, json=document)

Page 36: Neo4j Import Webinar


Enriching the crime graph

Translate from JSON to CSV

Page 37: Neo4j Import Webinar


Enriching the crime graph

Import using LOAD CSV
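The command itself only appears as a screenshot; a hedged sketch of what that LOAD CSV might look like, assuming the translation step produced a categories.csv with category, code and description columns (the file name and column names are assumptions):

LOAD CSV WITH HEADERS FROM "file:///path/to/categories.csv" AS row
MERGE (c:CrimeCategory { name: row.category })
MERGE (sc:SubCategory { code: row.code })
ON CREATE SET sc.description = row.description
MERGE (c)-[:CHILD]->(sc);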

Page 38: Neo4j Import Webinar


Updating the graph

• As new crimes come in we want to update the graph to take them into account

Page 39: Neo4j Import Webinar


Updating the graph

• Import them using the REST Transactional API (see the sketch below)
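The slides don't show the statement, but a parameterised upsert along these lines could be posted to the transactional endpoint (a sketch; the parameter names are assumptions and the 2.x {param} syntax is used):

MERGE (crime:Crime { id: {id} })
ON CREATE SET crime.description = {description},
              crime.caseNumber = {caseNumber}
MERGE (crimeType:CrimeType { name: {primaryType} })
MERGE (crime)-[:TYPE]->(crimeType);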

Page 40: Neo4j Import Webinar


This talk brought to you by…

Page 41: Neo4j Import Webinar


And that’s it…