collaboration insights from data access analytics 'follow ... · elements of the pipeline...

21
Collaboration insights from data access analytics "Follow the data" Ravi Krishnaswamy Autodesk Inc.

Upload: others

Post on 12-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Collaboration insights from data access analytics

"Follow the data"

Ravi Krishnaswamy

Autodesk Inc.

Page 2: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

How Valuable is a Network ?

Reed: the utility of large networks, particularly social networks, can scale exponentially with the size of the network

Page 3: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Community detection through dataCore concepts

Bob, a Desktop user Scott , a Desktop user

opens

opens

Page 4: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Community detection through dataCore concepts

Bob, a Desktop user Scott , a Desktop user

opensSaves/exports

opensreferences

Saves/exports

Page 5: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Community detection through dataCore concepts

Bob, a Desktop user

Mary, a Desktop user

Yan, a Mobile userJoe and John, web users

Scott , a Desktop user

opens

opens

Saves/exports

opens opens opens

opensreferences

Saves/exports

Saves/exports

“Lineage”

Page 6: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Community detection through dataCore concepts

ScottMary

Yan

Bob

Joe John

Bob, a Desktop user

Mary, a Desktop user

Yan, a Mobile userJoe and John, web users

Scott , a Desktop user

opens

opens

Saves/exports

opens opens opens

opensreferences

Saves/exports

Saves/exports

“Lineage”

Page 7: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Hash fingerprints to connect versionsExisting use cases

Page 8: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Connecting versions through hashes

EF8A09D

D9A22B

opens

saves

Log Item: (anonymized-user-id, platform, file-operation, hash-before, hash-after, time)

(u88, ‘desktop-win’, ‘save’, ‘EF8A09D’, ‘D9A22B’, 9320031)(u89, ‘mobile-ios’, ‘open’, ‘D9A22B’, ‘D9A22B’, 10311299)

opens

Ours is another use case

Page 9: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Connecting by hashes at scale

User88

EF8A09D’

D9A22B’

User89

Distinct users

on different

platforms who

share data

User88 Desktop-win Save EF8A09D D9A22B

User89 Ios Open D9A22B n/a

Page 10: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Mixpanel

Query/Extract to CSV

SPARK/Qubole

Bulk Import to Neo4j

Layout and VisuzalizationTool (Gephi)

Query/Output Query

Results to GraphML/ CSV

The Pipeline

Page 11: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Elements of the pipeline

• Hive data processed in Spark 2.4 cluster

• Scala scripts to clean and export edgelists

• Scala scripts to import to Neo4j with loadCSV

• Postprocess graph to build lineages, interval information, access counts

• Data Exploration: Cypher queries to answer basic questions

• Data Exploration: Visualize graphs (Neovis, Gephi)

• Export queries (Cypher) for more post processing (Pandas)

Page 12: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Db Schema

Page 13: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Industry types that interact

Identify lineages with algo.unionFind()

Page 14: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Web/Mobile/Desktop interaction

Purple: Fingerprint of specific file version

Chain of purple nodes: Lineages

Size of arrow: Number of accesses to specific fingerprint version

Green: Desktop; Red: Web; Blue: Mobile

Page 15: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Lineages and access patterns

Page 16: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Connections by indirect reference to data

Page 17: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

What fraction of data is accessed by distinct devices?

Minimum number of file versions per lineage

(%) lineages accessed by more than 1 device

Minimum number of file versions per lineage

Page 18: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

What fraction of data is accessed by distinct devices?

Page 19: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Time Series: access patterns

Page 20: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Takeaways

• Relatively easy to integrate into spark pipelines

• ‘Sweet spot’ size for data sets

• Flexibility of Graphs: Augmenting/Changing schema

• Rich set of queries possibly by Cypher and plugins algo and apoc

• Rich set of queries to provide input to advanced Analytics/ML

Page 21: Collaboration insights from data access analytics 'Follow ... · Elements of the pipeline •Hive data processed in Spark 2.4 cluster •Scala scripts to clean and export edgelists

Questions

1. Efficient load of external file data into Neo4j can be achieved with which of the clauses?

(a) MERGE

(b) SET

(c) LOAD CSV

2. The value of a social network of n nodes using Reeds law can be thought to be

(a) O (n)

(b) O (n2)

(c) O (2n)

3. Name the procedure used in this talk to determine the connected components of the graph