Tracking data lineage with Neo4j and Linkurious.
SAS founded in 2013 in Paris | http://linkurio.us | @linkurious
French startup specializing in graph visualization.
CEO: Sébastien Heymann. Created Gephi. PhD in CS and complex systems from UPMC.
CTO: David Rapin. Web-scale archiving. Université de Technologie de Compiègne.
CMO: Jean Villedieu. >5 years in consulting. Sciences Po + Ecole de Guerre Economique.
What is a graph?
PERSON (name: Séb, age: 29)
PERSON (name: Jean, age: 31)
LOCATION (name: Paris)
Séb, Lives in, Paris
Jean, Lives in, Paris
Séb, Knows, Jean
A graph is a set of nodes and relationships.
This is a node: PERSON, LOCATION.
This is a relationship: Knows, Lives in.
This is a property: name: Séb, age: 29.
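The definition above can be sketched in a few lines of Python. This is only an illustration of the node / relationship / property vocabulary, not how Neo4j stores data. The class names, the relationship type spellings (LIVES_IN, KNOWS) and the direction of the Knows relationship are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                      # e.g. PERSON or LOCATION
    properties: dict = field(default_factory=dict)  # key/value pairs on the node


@dataclass
class Relationship:
    start: Node
    type: str   # e.g. LIVES_IN or KNOWS (spelling is an assumption)
    end: Node


# The three nodes from the slide, each carrying its properties.
seb = Node("PERSON", {"name": "Séb", "age": 29})
jean = Node("PERSON", {"name": "Jean", "age": 31})
paris = Node("LOCATION", {"name": "Paris"})

# The three relationships; the direction of KNOWS is assumed.
relationships = [
    Relationship(seb, "LIVES_IN", paris),
    Relationship(jean, "LIVES_IN", paris),
    Relationship(seb, "KNOWS", jean),
]

# A simple question on the graph: who lives in Paris?
residents = [
    r.start.properties["name"]
    for r in relationships
    if r.type == "LIVES_IN" and r.end is paris
]
print(residents)
```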
What is data lineage?
“Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes”
- Wikipedia
A real-world data pipeline.
Top 5 data lineage questions.
1. Where is this data coming from?
2. Who has access to that information?
3. Do we have sensitive data that’s being propagated unsafely?
4. Is my database still being used in an important company process or can I remove it?
5. What systems and reports would be impacted by a change in that particular process?
Traditional databases are poorly suited to data lineage.
Hard to query
Querying connected data in SQL is difficult and error-prone.
Slow
Questions that require following many connections perform poorly.
Too rigid
Hard to accommodate an evolving data model in a relational database.
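To make the "hard to query" point concrete, here is a minimal sketch of an upstream-lineage question asked in SQL, using Python's built-in sqlite3 module. The flows table and every name in it are hypothetical. Following an unknown number of hops requires a recursive common table expression, or one self-join per hop.

```python
import sqlite3

# In-memory database with one hypothetical table: each row says that
# data flows from `source` to `target`.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flows (source TEXT, target TEXT);
    INSERT INTO flows VALUES
        ('crm_db', 'extract_job'),
        ('extract_job', 'sales_table'),
        ('sales_table', 'sales_report');
""")

# "Where is this data coming from?" over any number of hops.
# The recursive CTE repeatedly joins the table to itself.
UPSTREAM_SQL = """
WITH RECURSIVE upstream(name) AS (
    SELECT source FROM flows WHERE target = ?
    UNION
    SELECT f.source
    FROM flows AS f
    JOIN upstream AS u ON f.target = u.name
)
SELECT name FROM upstream
"""

upstream = sorted(row[0] for row in conn.execute(UPSTREAM_SQL, ("sales_report",)))
print(upstream)
```

The query works, but its shape says little about the question being asked, and each new kind of entity or link tends to mean more tables and more joins.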
The cost of bad data lineage.
● A general lack of confidence in data;
● Potential legal exposure;
● Finding answers and making decisions becomes complex and time-consuming;
...it results in wasted time, money, opportunities, etc.
Graph DBs are perfect for data lineage.
● Easy to model the flow of data in a graph;
● Query relationships with ease and in real-time;
● Adapt your schema to accommodate new data and relationships;
● The popularity of graph databases has increased 500% in the last 2 years, and our partner Neo4j is the leader.
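As a rough sketch of what "query relationships with ease" can look like, the helper below builds a Cypher query that follows incoming relationships from a report for up to a given number of hops. The Report label, the name property, the $report_name parameter and the direction of the relationships are assumptions, not taken from the slides. The hop limit is written into the query text because, to the best of our knowledge, Cypher expects a literal there rather than a query parameter.

```python
def upstream_lineage_query(max_hops: int = 5) -> str:
    """Build a Cypher query for everything upstream of a report.

    The label `Report`, the property `name` and the parameter
    `$report_name` are hypothetical names used for illustration.
    """
    if isinstance(max_hops, bool) or not isinstance(max_hops, int) or max_hops < 1:
        raise ValueError("max_hops must be a positive integer")
    return (
        "MATCH path = (r:Report {name: $report_name})"
        f"<-[*1..{max_hops}]-(upstream) "
        "RETURN path"
    )


print(upstream_lineage_query(3))
```

The report name stays a query parameter, so only the validated integer is ever interpolated into the query text.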
Linkurious brings the ability to find answers.
● Tech and business users can search the data lineage intuitively and find answers;
● Visualization brings the ability to understand and communicate complex connections;
● Accelerate and improve decisions.
Unique ability to store and analyse your data lineage.
Neo4j
Your data lineage is a large graph. Store and query it quickly with Neo4j.
Linkurious
Search and find answers easily through a visual interface.
Example: a graph model for data lineage.
Node types shown in the diagram: Metadata, Process, System, Report.
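One possible reading of this model, written as plain Python data. The four node types come from the slide. Every node name and every relationship type (HOSTS, INPUT_TO, PRODUCES, FEEDS) is a hypothetical choice for illustration, since the slide does not name them.

```python
from collections import Counter

# Node name -> node type. Types are from the slide; names are made up.
NODES = {
    "crm_server": "System",
    "warehouse_server": "System",
    "customer_table": "Metadata",
    "sales_totals": "Metadata",
    "monthly_aggregation": "Process",
    "sales_report": "Report",
}

# (start, relationship type, end). Relationship types are assumptions.
EDGES = [
    ("crm_server", "HOSTS", "customer_table"),
    ("customer_table", "INPUT_TO", "monthly_aggregation"),
    ("monthly_aggregation", "PRODUCES", "sales_totals"),
    ("warehouse_server", "HOSTS", "sales_totals"),
    ("sales_totals", "FEEDS", "sales_report"),
]

# Sanity check: every relationship must connect two declared nodes.
unknown = {n for start, _, end in EDGES for n in (start, end) if n not in NODES}

# How many nodes of each type the model contains.
label_counts = Counter(NODES.values())
print(dict(label_counts), "unknown nodes:", unknown)
```

Adding a new kind of entity later means adding entries to these two collections; nothing existing has to be restructured.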
Question #1: what’s the data lineage of this report?
“Our business people need to know what data was used to generate this month’s sales report. I need to understand which metadata, which systems and which processes were involved.”
- IT Analyst
Question #1: visualize the data lineage of a report.
It only takes a few minutes to search for a report and analyse its lineage. No need to be an expert!
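Behind the visualization, the lineage question amounts to walking the graph backwards from the report. A minimal in-memory sketch follows; all node names are hypothetical, and this is not how Linkurious or Neo4j implement the search.

```python
from collections import deque

# Hypothetical data flows (source -> target) and node types.
FLOWS = [
    ("crm_server", "customer_table"),
    ("customer_table", "monthly_aggregation"),
    ("monthly_aggregation", "sales_totals"),
    ("warehouse_server", "sales_totals"),
    ("sales_totals", "sales_report"),
    ("hr_server", "payroll_table"),      # unrelated branch
    ("payroll_table", "hr_report"),
]
LABELS = {
    "crm_server": "System", "warehouse_server": "System", "hr_server": "System",
    "customer_table": "Metadata", "sales_totals": "Metadata",
    "payroll_table": "Metadata", "monthly_aggregation": "Process",
    "sales_report": "Report", "hr_report": "Report",
}


def lineage(report):
    """Return every node that directly or indirectly feeds `report`."""
    parents = {}
    for source, target in FLOWS:
        parents.setdefault(target, []).append(source)
    seen, queue = set(), deque([report])
    while queue:  # breadth-first walk against the direction of the flow
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def by_type(names):
    """Group node names by their type, sorted for stable output."""
    grouped = {}
    for name in sorted(names):
        grouped.setdefault(LABELS[name], []).append(name)
    return grouped


print(by_type(lineage("sales_report")))
```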
Question #2: what is this database used for?
“We’re relocating our datacenter and need to move a server on which a database is stored. Can we decommission it? I need to understand what processes and reports rely on this server.”
- IT Analyst
Question #2: visualize an impact analysis.
We can visualize and inspect the complex set of relationships involved in the impact analysis.
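Impact analysis is the same walk in the other direction: start from the server and follow the flow of data downstream. The sketch below uses hypothetical names throughout. It treats a server as a candidate for decommissioning only when no process or report depends on it, which is one simple rule among many possible ones.

```python
# Hypothetical data flows (source -> target) and node types.
FLOWS = [
    ("legacy_server", "legacy_orders"),
    ("legacy_orders", "nightly_export"),
    ("nightly_export", "orders_summary"),
    ("orders_summary", "orders_report"),
    ("idle_server", "old_archive"),      # hosts data nobody consumes
]
LABELS = {
    "legacy_server": "System", "idle_server": "System",
    "legacy_orders": "Metadata", "orders_summary": "Metadata",
    "old_archive": "Metadata", "nightly_export": "Process",
    "orders_report": "Report",
}


def downstream(start):
    """Return every node reachable from `start` along the data flow."""
    children = {}
    for source, target in FLOWS:
        children.setdefault(source, []).append(target)
    seen, stack = set(), [start]
    while stack:  # depth-first walk along the direction of the flow
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def dependents(server):
    """Processes and reports that rely on `server`, grouped by type."""
    impacted = downstream(server)
    return {
        label: sorted(n for n in impacted if LABELS[n] == label)
        for label in ("Process", "Report")
    }


def can_decommission(server):
    """True when no process or report depends on the server."""
    return not any(dependents(server).values())


print(dependents("legacy_server"), can_decommission("legacy_server"))
print(dependents("idle_server"), can_decommission("idle_server"))
```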