etl into neo4j
Post on 10-May-2015
ETL into Neo4j
Max De Marzi
About Me
• My Blog: http://maxdemarzi.com
• Find me on Twitter: @maxdemarzi
• Email me: maxdemarzi@gmail.com
• GitHub: http://github.com/maxdemarzi
Built the Neography Gem (Ruby wrapper for the Neo4j REST API)
Playing with Neo4j since 10/2009
Agenda
• ETL your mind
• ETL with Batch and the REST API
• ETL with Gremlin and Groovy
• ETL with the Batch Importer
• ETL from SQL
ETL your Mind
You have to start there
More Relational than Relational
Stop thinking about how tables are related
Start thinking about relationships
Objects like to mingle
Optimized for “trees” of data
Optimized for seeing the forest and the trees, and the branches, and the trunks
SELECT skills.*, user_skill.* FROM users JOIN user_skill ON users.id = user_skill.user_id JOIN skills ON user_skill.skill_id = skills.id WHERE users.id = 1
START user = node(1) MATCH user -[user_skill]-> skill RETURN skill, user_skill
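As a rough illustration of the contrast between the two queries above, here is a minimal Python sketch. The sample rows, the `level` column, and the function names `skills_via_join` and `skills_via_traversal` are made up for the example. It uses the standard-library `sqlite3` module for the relational side and a plain dictionary as a stand-in for a graph, not Neo4j itself.

```python
import sqlite3

def skills_via_join(conn, user_id):
    """Relational route: two joins through the user_skill link table."""
    rows = conn.execute(
        "SELECT skills.name, user_skill.level "
        "FROM users "
        "JOIN user_skill ON users.id = user_skill.user_id "
        "JOIN skills ON user_skill.skill_id = skills.id "
        "WHERE users.id = ? ORDER BY skills.name",
        (user_id,),
    )
    return rows.fetchall()

def skills_via_traversal(outgoing, user_id):
    """Graph route: follow the relationships stored with the user node."""
    return sorted((skill, props["level"]) for skill, props in outgoing[user_id])

# Hypothetical sample data for the relational side.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE skills (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_skill (user_id INTEGER, skill_id INTEGER, level TEXT);
    INSERT INTO users VALUES (1, 'max');
    INSERT INTO skills VALUES (1, 'ruby'), (2, 'neo4j');
    INSERT INTO user_skill VALUES (1, 1, 'expert'), (1, 2, 'expert');
""")

# The same data as an adjacency list: user id -> [(skill, relationship properties)].
outgoing = {1: [("ruby", {"level": "expert"}), ("neo4j", {"level": "expert"})]}
```

Both functions return the same pairs; the difference is that the traversal starts at one node and walks its relationships instead of matching rows across three tables.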
Property Graph
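Diagram (graph model): Language node (name, code, word_count) and Country node (name, code, flag_uri), connected by an IS_SPOKEN_IN relationship (as_primary).
Diagram (relational model): Language table (language_code, language_name, word_count), Country table (country_code, country_name, flag_uri), and LanguageCountry join table (language_code, country_code, primary).
Diagram (list stored in a property): name: “Canada”, languages_spoken: “[ ‘English’, ‘French’ ]”
Diagram (list modeled as relationships): nodes name: “Canada”, name: “USA”, name: “France”, language: “English”, language: “French”, linked by four spoken_in relationships.
Diagram (everything in one record): Country (name, flag_uri, language_name, number_of_words, yes_in_language, no_in_language, currency_code, currency_name).
Diagram (split into nodes): Country (name, flag_uri), Language (name, number_of_words, yes, no), Currency (code, name), with SPEAKS and USES_CURRENCY relationships.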
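The step from a list-valued property such as languages_spoken to separate language nodes can be sketched in a few lines of Python. This is only an illustration with in-memory dictionaries: the helper name `explode_list_property` is hypothetical, and the direction of the spoken_in relationship (language to country) is an assumption.

```python
def explode_list_property(records, list_key, rel_type):
    """Turn a list-valued property into separate nodes plus relationships."""
    nodes = {}          # value -> node properties, so each language exists once
    relationships = []  # (language, relationship type, country name) triples
    for record in records:
        for value in record[list_key]:
            nodes.setdefault(value, {"language": value})
            relationships.append((value, rel_type, record["name"]))
    return nodes, relationships

countries = [
    {"name": "Canada", "languages_spoken": ["English", "French"]},
    {"name": "USA", "languages_spoken": ["English"]},
    {"name": "France", "languages_spoken": ["French"]},
]
language_nodes, spoken_in = explode_list_property(
    countries, "languages_spoken", "spoken_in"
)
```

Two language nodes and four relationships come out of three country records, so "which countries speak French" becomes a hop from one node rather than a scan over every list.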
ETL with Batch and the REST API
Batch command from REST API
Great for importing Facebook/Twitter friends
Keep each request under 10k commands
Preferably send a request every 2k to 5k commands
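The sizing advice above can be sketched as a small chunking helper. This is a minimal Python sketch, not the Neography API: the job layout (method, to, body, id) follows the legacy REST batch format as I understand it and should be treated as an assumption, the helper names are hypothetical, and no HTTP request is made.

```python
import json

def create_node_job(job_id, properties):
    """One batch job description; the field layout is an assumption, see above."""
    return {"method": "POST", "to": "/node", "body": properties, "id": job_id}

def chunked(jobs, size=2000):
    """Yield consecutive slices of at most `size` jobs."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(jobs), size):
        yield jobs[start:start + size]

# Hypothetical input: 4,500 friends to import.
friends = [{"name": "friend-%d" % i} for i in range(4500)]
jobs = [create_node_job(i, props) for i, props in enumerate(friends)]

# One JSON payload per request, each well under the 10k ceiling.
payloads = [json.dumps(chunk) for chunk in chunked(jobs, size=2000)]
```

If jobs refer to each other by id (for example a relationship pointing at a node created earlier in the batch), those jobs most likely need to travel in the same request, so a real chunker would split on those boundaries rather than at a fixed count.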
Using Batch from Neography
Why Batch
Transactional: if any command fails, nothing is committed.
Ordered: responses guaranteed to be in the same order as sent.
Continuous loading/updating nodes and relationships in spurts or streaming.
ETL with Gremlin and Groovy
Commit every 1,000 changes or so, and stop the transaction at the very end so the last few changes are committed too.
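The commit pattern can be sketched as follows. The talk uses Gremlin and Groovy; this Python version only shows the control flow, and `FakeGraph` is a hypothetical stand-in that counts operations rather than a real graph API.

```python
class FakeGraph:
    """Stand-in for a transactional graph; it only counts what happens."""

    def __init__(self):
        self.pending = 0
        self.committed = 0
        self.commits = 0

    def add_vertex(self, properties):
        self.pending += 1

    def commit(self):
        self.committed += self.pending
        self.pending = 0
        self.commits += 1

def load(graph, rows, commit_every=1000):
    """Add every row, committing periodically and once more at the end."""
    changes = 0
    for row in rows:
        graph.add_vertex(row)
        changes += 1
        if changes % commit_every == 0:
            graph.commit()
    if graph.pending:
        graph.commit()  # the easy-to-forget final commit for the leftovers
    return changes

graph = FakeGraph()
load(graph, ({"id": i} for i in range(2500)))
```

With 2,500 rows this commits after row 1,000, after row 2,000, and once more for the remaining 500. Without the last commit those 500 changes would never be written.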
Look into auto-indexing to make life easier.
Auto-indexing is disabled by default. See the docs for a trick to make it a full-text index instead of an exact one.
http://docs.neo4j.org/chunked/milestone/auto-indexing.html
Crazy format is OK
Id :: Title :: Genre|Genre|Genre
But it’s preferable to stay clear of escape characters like “|”
The location of the data file is given as a string, converted to a URL, then processed one line at a time.
For each line, a movie vertex is created, a genre vertex is created unless it already exists (index lookup), and an edge from the movie to the genre is created.
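The per-line logic described above can be sketched in Python with dictionaries standing in for the graph and its index. The sample lines, the function names, and the `hasGenre` edge label are assumptions made for the illustration; the walk-through itself uses Gremlin and Groovy.

```python
def parse_movie_line(line):
    """Split 'Id :: Title :: Genre|Genre|Genre' into its three parts."""
    movie_id, title, genres = (part.strip() for part in line.split("::", 2))
    return int(movie_id), title, [g.strip() for g in genres.split("|")]

def load_movies(lines):
    movies = {}       # movie id -> vertex properties
    genre_index = {}  # genre name -> vertex properties (plays the role of the index)
    edges = []        # (movie id, edge label, genre name)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        movie_id, title, genres = parse_movie_line(line)
        movies[movie_id] = {"title": title}
        for name in genres:
            if name not in genre_index:  # index lookup before creating
                genre_index[name] = {"name": name}
            edges.append((movie_id, "hasGenre", name))
    return movies, genre_index, edges

# Made-up sample input in the format shown above.
sample = [
    "1 :: Toy Story (1995) :: Animation|Comedy",
    "2 :: Jumanji (1995) :: Adventure|Comedy",
]
movies, genre_index, edges = load_movies(sample)
```

The Comedy genre appears in both lines but is created once, which is the point of the lookup.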
Full walk-through on http://maxdemarzi.com/2012/01/13/neo4j-on-heroku-part-one/
ETL with the Batch Importer
Installation Walk-Through
Testing it
7.5M nodes, 42M relationships in just over 3 minutes on a laptop.
Loading it into Neo4j
Full walk-through on http://maxdemarzi.com/2012/02/28/batch-importer-part-1/
When to use the Batch Importer?
• 1st time loading or periodic reloading
• When you need Speed
• When you don’t mind a little Java
ETL from SQL
Identities who vouched for each other
row_number() and INTO are our friends
The “term” vouched for will serve as our relationship type, status is a relationship property.
Notice there are no node ids.
These are automatic: clkao is node 1.
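To make the implicit numbering concrete, here is a minimal Python sketch that turns vouch rows into node lines and relationship lines. Everything beyond clkao is invented for the example (the other names, the terms, the statuses), and the tab-separated layout with a `start`, `end`, `type` header is an assumption about the importer's input format, so check it against the walk-through.

```python
def build_import_lines(vouches):
    """vouches: (voucher, vouchee, term, status) tuples in export order."""
    node_ids = {}
    for voucher, vouchee, _term, _status in vouches:
        for name in (voucher, vouchee):
            # Ids follow order of first appearance, starting at 1,
            # which is what row_number() provides on the SQL side.
            node_ids.setdefault(name, len(node_ids) + 1)

    # The node file carries no id column: the line position is the id.
    node_lines = ["name"] + list(node_ids)

    rel_lines = ["start\tend\ttype\tstatus"]
    for voucher, vouchee, term, status in vouches:
        rel_lines.append("\t".join(
            [str(node_ids[voucher]), str(node_ids[vouchee]), term, status]
        ))
    return node_lines, rel_lines

vouches = [
    ("clkao", "alice", "perl", "confirmed"),
    ("alice", "bob", "ruby", "pending"),
]
node_lines, rel_lines = build_import_lines(vouches)
```

Here the term becomes the relationship type and the status rides along as a relationship property, matching the mapping described above.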
No time to get coffee >8-[
What about multiple types of nodes?
No problem, just add the MAX(node_id) from the first table.
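The offset trick is simple arithmetic, sketched here in Python. The second node type (skills) and all sample values other than clkao are hypothetical; `number_rows` mimics what row_number() plus an offset would produce in SQL.

```python
def number_rows(rows, offset=0):
    """Give each row a 1-based node id, shifted by `offset`."""
    return {key: offset + position for position, key in enumerate(rows, start=1)}

people = ["clkao", "alice", "bob"]
skills = ["perl", "ruby"]

people_ids = number_rows(people)
# Start the second table where the first one stopped: offset = MAX(node_id).
skill_ids = number_rows(skills, offset=max(people_ids.values()))
```

The two id ranges never overlap, so relationship rows can refer to either kind of node by a single number.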
Full walk-through at: http://maxdemarzi.com/2012/02/28/batch-importer-part-2/
Need help? E-mail me, catch me on Google chat or Skype.
Please don’t be shy…. and read my blog:
http://maxdemarzi.com
Thank you!