OpenData, Graphs and do-it-yourself "Journalism"

Sascha Peukert

2019. 9. 23.

Sascha Peukert

Cypher Implementation Developer

[email protected]

@SasPeuk

About Neo4j

• Creators of the Neo4j Graph Platform and Neo4j, the world's leading open-source graph database

• Company founded in 2007 in Sweden

• Today 250+ employees in San Mateo, London, Malmö and remote

• You can join us!

What's that graph thing again?

Data!

Labeled Property Graph Model

( :City {name:"Dresden"} ) <-[ :HAS_SEAT_IN ]- ( :Company {name:"T-Systems"} )
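To make the model concrete outside of Cypher, here is a minimal Python sketch of the same pattern. The `Node` and `Relationship` classes are hypothetical helpers invented for this illustration; they are not part of Neo4j or any driver. They only show the three ingredients of the labeled property graph model: labels on nodes, a type and direction on relationships, and key/value properties on both.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node carries a set of labels and a map of properties."""
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A relationship has exactly one type, a direction (start -> end)
    and, like a node, its own properties."""
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


# The pattern from the slide: (:Company)-[:HAS_SEAT_IN]->(:City)
dresden = Node({"City"}, {"name": "Dresden"})
t_systems = Node({"Company"}, {"name": "T-Systems"})
has_seat_in = Relationship(t_systems, "HAS_SEAT_IN", dresden)
```

Note that the arrow in the Cypher pattern points from the company to the city, so the company is the start node of the relationship.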

Motivation

opendata-illustration by Julie Beck

Agenda

https://pixabay.com/photos/files-paper-office-paperwork-stack-1614223/
https://pixabay.com/photos/image-statue-alive-artist-3895819/

Import all the data

Cities, postcodes and federal states

CSV file from: https://www.suche-postleitzahl.org/download_files/public/zuordnung_plz_ort.csv

Import all the data: cities, postcodes and states

Indexes!

CREATE INDEX ON :City(name);
CREATE INDEX ON :PostCode(code);
CREATE INDEX ON :State(name);

Import all the data: cities, postcodes and states

Loading the CSV

LOAD CSV WITH HEADERS FROM 'file:///Users/Sascha/Documents/JUG/zuordnung_plz_ort.csv' AS line

The path to the file is given as a string.

Import all the data: cities, postcodes and states

Loading the CSV & creating the data

LOAD CSV WITH HEADERS FROM 'file:///Users/Sascha/Documents/JUG/zuordnung_plz_ort.csv' AS line
CREATE ( p:PostCode {code:line.plz} )
CREATE ( b:State {name:line.bundesland} )

Import all the data: cities, postcodes and states

Loading the CSV & creating the data

LOAD CSV WITH HEADERS FROM 'file:///Users/Sascha/Documents/JUG/zuordnung_plz_ort.csv' AS line
CREATE ( p:PostCode {code:line.plz} )
// CREATE ( b:State {name:line.bundesland} ) would add a duplicate State node for every row, so use MERGE instead:
MERGE ( b:State {name:line.bundesland} )

Import all the data: cities, postcodes and states

Loading the CSV & creating the data

LOAD CSV WITH HEADERS FROM 'file:///Users/Sascha/Documents/JUG/zuordnung_plz_ort.csv' AS line
CREATE ( p:PostCode {code:line.plz} )
MERGE ( b:State {name:line.bundesland} )
MERGE ( b )<-[:LOCATED_IN]-( c:City {name:line.ort} )
CREATE ( c )<-[:BELONGS_TO]-( p )

Import all the data: cities, postcodes and states

Loading the CSV & creating the data

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Users/Sascha/Documents/JUG/zuordnung_plz_ort.csv' AS line
CREATE ( p:PostCode {code:line.plz} )
MERGE ( b:State {name:line.bundesland} )
MERGE ( b )<-[:LOCATED_IN]-( c:City {name:line.ort} )
CREATE ( c )<-[:BELONGS_TO]-( p )
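The difference between CREATE and MERGE in the statement above can be mimicked in plain Python. This is only a sketch of the semantics, not how Neo4j executes the query: CREATE always adds something new per row, while MERGE reuses an existing match and only creates when nothing matches. The column names (ort, plz, bundesland) are the ones the slides read from the CSV; the three sample rows and the function name `import_rows` are assumptions made for this illustration.

```python
import csv
import io

# Illustrative sample rows; only the column names come from the slides.
SAMPLE = """ort,plz,bundesland
Dresden,01067,Sachsen
Dresden,01069,Sachsen
Leipzig,04103,Sachsen
"""


def import_rows(text):
    states = {}      # MERGE (b:State {name}): at most one entry per state name
    cities = {}      # MERGE (b)<-[:LOCATED_IN]-(c:City {name}): one per (state, city)
    postcodes = []   # CREATE (p:PostCode): one entry per CSV row, no deduplication
    belongs_to = []  # CREATE (c)<-[:BELONGS_TO]-(p): one entry per CSV row
    for line in csv.DictReader(io.StringIO(text)):
        postcodes.append(line["plz"])
        # setdefault only inserts when the key is missing, like MERGE
        states.setdefault(line["bundesland"], {"name": line["bundesland"]})
        # The city is merged relative to the already bound state, so two
        # cities with the same name in different states stay separate.
        city_key = (line["bundesland"], line["ort"])
        cities.setdefault(city_key, {"name": line["ort"]})
        belongs_to.append((line["plz"], city_key))
    return states, cities, postcodes, belongs_to


states, cities, postcodes, belongs_to = import_rows(SAMPLE)
```

With the sample rows this yields one state, two cities and three postcodes. The postcodes stay strings, which keeps leading zeros such as in 01067 intact.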


Import all the data

Open register / OpenCorporates

JSONL file from: https://offeneregister.de/

Import all the data: Open register / OpenCorporates

• Simple version from Bert Radke: https://blog.faboo.org/2019/03/handelregister-jsonl/

• Remarks about the data and import:

  • 5,305,727 companies and 4,803,514 officers

  • Unexpected nulls (some examples):

    • “Registered address” is missing on 68.5% of all companies

    • “Registered office” is null for one active company...

    • 10% of officers don’t have a city set

  • Problem: persons and cities do not have a unique key in the JSON
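Numbers like the missing-value shares above can be gathered with a short pass over the JSONL file before importing it. The following is a minimal sketch, not the script used for the talk: the field name `registered_address`, the helper `missing_share` and the four sample records are assumptions made for illustration, and the real file may name or nest its fields differently.

```python
import io
import json


def missing_share(lines, field):
    """Return the fraction of JSONL records where `field` is absent or null."""
    total = 0
    missing = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)  # one JSON object per line
        total += 1
        if record.get(field) is None:  # covers both a missing key and an explicit null
            missing += 1
    return missing / total if total else 0.0


# Made-up records standing in for the real file.
sample = io.StringIO(
    '{"name": "A GmbH", "registered_address": "Some Street 1"}\n'
    '{"name": "B GmbH", "registered_address": null}\n'
    '{"name": "C GmbH"}\n'
    '{"name": "D GmbH", "registered_address": null}\n'
)
share = missing_share(sample, "registered_address")
```

Because the function iterates line by line, it also works on an open file handle for the full dump without loading several million records into memory at once.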

Import all the data: Open register / OpenCorporates

Intermediate status

Import all the data

Lobbypedia party donations

Tool: https://lobbypedia.de/wiki/Spezial:Abfrage_ausf%C3%BChren/Parteispenden

JSON files:

2000 - 2010: https://lobbypedia.de/wiki/Spezial:Semantische_Suche/-5B-5BKategorie:Parteispende-5D-5D-20-5B-5BJahr::2000-7C-7C2001-7C-7C2002-7C-7C2003-7C-7C2004-7C-7C2005-7C-7C2006-7C-7C2007-7C-7C2008-7C-7C2009-7C-7C2010-5D-5D/-3FGeldgeber/-3FParteispende-2FKategorie%3DKategorie/-3FBetrag/-3FEmpf%C3%A4nger/-3FJahr-23-2Dn/-3FOrt/-3FBundesland/-3FBranche/-3FSchlagworte/mainlabel%3D/limit%3D10000/order%3Ddescending/sort%3DModification-20date/offset%3D0/format%3Djson/default%3Dkeine-20Ergebnisse-20mit-20der-20aktuellen-20Auswahl

2011 - 2019: https://lobbypedia.de/wiki/Spezial:Semantische_Suche/-5B-5BKategorie:Parteispende-5D-5D-20-5B-5BJahr::2011-7C-7C2012-7C-7C2013-7C-7C2014-7C-7C2015-7C-7C2016-7C-7C2017-7C-7C2018-7C-7C2019-5D-5D/-3FGeldgeber/-3FParteispende-2FKategorie%3DKategorie/-3FBetrag/-3FEmpf%C3%A4nger/-3FJahr-23-2Dn/-3FOrt/-3FBundesland/-3FBranche/-3FSchlagworte/mainlabel%3D/limit%3D10000/order%3Ddescending/sort%3DModification-20date/offset%3D0/format%3Djson/default%3Dkeine-20Ergebnisse-20mit-20der-20aktuellen-20Auswahl

Import all the data: Lobbypedia party donations

Names as “join keys” between datasets are… problematic!


Import all the data: Lobbypedia party donations

Names as “join keys” between datasets are… problematic!

My solution: adding “index nodes” for persons and relationships that indicate context closeness
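The slides do not spell out how the index nodes are built, so the following Python sketch is only one possible reading of the idea and not the import script from the talk. It assumes that an index node is identified by a crudely normalized name and that "context closeness" can be approximated by two records sharing a city. The helper names, the record layout (`source`, `name`, `city`) and the sample persons are all invented for illustration.

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Crude normalization: strip diacritics, case-fold, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def build_index(records):
    """Group person records from all datasets under one 'index node' per name key."""
    index = defaultdict(list)
    for record in records:
        index[name_key(record["name"])].append(record)
    return index


def same_city_pairs(entries):
    """Pairs of records under one index node that also share a city,
    used here as a stand-in signal for context closeness."""
    pairs = []
    for i, a in enumerate(entries):
        for b in entries[i + 1:]:
            if a["city"] and a["city"] == b["city"]:
                pairs.append((a["source"], b["source"]))
    return pairs


# Fictional sample records from two hypothetical sources.
records = [
    {"source": "register", "name": "Max  Müller", "city": "Dresden"},
    {"source": "donations", "name": "max muller", "city": "Dresden"},
    {"source": "register", "name": "Max Müller", "city": "Hamburg"},
]
index = build_index(records)
close = same_city_pairs(index["max muller"])
```

All three records end up under the same index entry, but only the two Dresden records are flagged as close. The point of keeping the records separate instead of merging them into one person is that a shared name alone is not treated as proof of identity.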


Schema graph

https://b0ef77c6.databases.neo4j.io/browser/

User: party

Password: party

Demo-Disclaimer

• All data is from public and open sources or common knowledge

• I did not change those sources, nor do I claim them to be correct

• Due to the imperfect nature of the data, the import cannot be perfectly accurate, so do NOT blindly take the outcomes as fact!

Takeaways

( Graphs ) -[ :ARE ]-> ( Everywhere )

Use indexes

Expect some data wrangling when working with (open) data

Link to full import script

Play with the data at: https://b0ef77c6.databases.neo4j.io/browser/

User & password: party

Free O’Reilly Book

neo4j.com/graph-algorithms-book

• Spark & Neo4j Examples

• Machine Learning Chapter

Graph & ML Algorithms in Neo4j

+35

neo4j.com/graph-algorithms-book/

• Pathfinding & Search: finds optimal paths or evaluates route availability and quality

• Centrality / Importance: determines the importance of distinct nodes in the network

• Community Detection: detects group clustering or partition options

• Similarity: evaluates how alike nodes are

• Link Prediction: estimates the likelihood of nodes forming a future relationship

Thank you!

Questions? Ideas?

[email protected]

Twitter: @SasPeuk