Post on 22-Jan-2018

Page 1: Modeling employees relationships with Apache Spark

Wassim Trifi – Apr 2016

Page 2: Modeling employees relationships with Apache Spark

Objectives

Build a model of business relationships in an organization, based on transactions between employees.

Express the business model.

Understand the communication process between employees.

Measure each employee's influence by degree.

Identify the most connected employees.

Gain more insight into communities inside the organization.

Page 3: Modeling employees relationships with Apache Spark

How to do it?

Data

• HR data from HRIS DB

• Daily email communications between employees from the EmailServer

ETL

• Collect data from the sources with Spark SQL

• Clean, transform and aggregate

• Load into structured DataFrames

Analyze

• Prepare the inputs for the graph

• Create the communication graph with GraphX

• Apply algorithms and comment results

Visualize

• Load the graph into Neo4J

• Visualize the components and their respective connections.

Page 4: Modeling employees relationships with Apache Spark

Architecture

[Architecture diagram. Labels: HRIS DB, EmailServer, Tables, File, HDFS, Create On, Visualize]

Page 5: Modeling employees relationships with Apache Spark

Data Source: Email Server

A JSON file of 42,000 lines containing email details: «ID» (the email identifier), «Receivers» and «Sender».

I wrote a Python program to generate this file.
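
The generator itself is not shown in the slides. A minimal sketch of what such a program could look like follows. It assumes employees are identified by integers, that each email has one sender and a few distinct receivers, and that the file holds one JSON object per line. The function names, the `max_receivers` parameter and the seed are hypothetical.

```python
import json
import random


def generate_emails(n_emails, n_employees, max_receivers=5, seed=0):
    """Return synthetic email records with the keys ID, Sender and Receivers.

    Assumes n_employees >= 2 so that every sender has at least one
    possible receiver.
    """
    rng = random.Random(seed)
    employees = list(range(1, n_employees + 1))
    emails = []
    for email_id in range(1, n_emails + 1):
        sender = rng.choice(employees)
        # Receivers are distinct employees other than the sender.
        others = [e for e in employees if e != sender]
        count = rng.randint(1, min(max_receivers, len(others)))
        receivers = rng.sample(others, count)
        emails.append({"ID": email_id, "Sender": sender, "Receivers": receivers})
    return emails


def to_json_lines(emails):
    """Serialize the records as one JSON object per line."""
    return "\n".join(json.dumps(email) for email in emails)


if __name__ == "__main__":
    # Small demo; the deck mentions 42,000 lines and 1,000 employees.
    print(to_json_lines(generate_emails(n_emails=3, n_employees=10)))
```

Calling `generate_emails(42000, 1000)` would produce a file of the size mentioned above, although the actual distribution of senders and receivers in the original data is unknown.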

Page 6: Modeling employees relationships with Apache Spark

Data Source: Employee DB

A PostgreSQL table of 1,000 rows containing employee data.

Page 7: Modeling employees relationships with Apache Spark

ETL: Extract Employee Table

Apache Spark is a powerful tool for ETL and analytics.

We use the Spark SQL library in Scala to load the data from PostgreSQL.

Page 8: Modeling employees relationships with Apache Spark

ETL: Employee DataFrame

The new DataFrame is created with 1,000 rows of employee details.

Page 9: Modeling employees relationships with Apache Spark

ETL: Extract Emails File

The JSON file is stored in HDFS.

We load its content and transform it into a DataFrame with Spark SQL, in Scala.

Page 10: Modeling employees relationships with Apache Spark

ETL: Emails DataFrame

The new DataFrame is stored in memory, with 372,011 rows of email details.

Page 11: Modeling employees relationships with Apache Spark

ETL: Merge All in one Table

Join the email and employee details into a new DataFrame.

As a result, we have a new DataFrame of 372,011 rows that maps receivers and senders to the ID of each email sent.
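
The Spark code for this step is not part of the transcript. The plain-Python sketch below only illustrates the logic of such a merge, under the assumption that the result holds one row per (email, receiver) pair. The employee schema (an id mapped to a name) and all output field names are hypothetical.

```python
def merge_emails_with_employees(emails, employees):
    """Return one row per (email, receiver) pair, enriched with names.

    emails:    list of dicts with the keys "ID", "Sender", "Receivers"
    employees: dict mapping employee id -> name (hypothetical schema)
    """
    rows = []
    for email in emails:
        sender = email["Sender"]
        for receiver in email["Receivers"]:
            # Inner-join semantics: skip pairs with an unknown employee id.
            if sender in employees and receiver in employees:
                rows.append({
                    "email_id": email["ID"],
                    "sender_id": sender,
                    "sender_name": employees[sender],
                    "receiver_id": receiver,
                    "receiver_name": employees[receiver],
                })
    return rows


# Made-up sample data, for illustration only.
employees = {1: "Alice", 2: "Bob", 3: "Carol"}
emails = [
    {"ID": 100, "Sender": 1, "Receivers": [2, 3]},
    {"ID": 101, "Sender": 2, "Receivers": [1]},
]
merged = merge_emails_with_employees(emails, employees)
print(len(merged))  # 3 rows: two for email 100, one for email 101
```

In Spark the same idea would typically be expressed as a flattening of the receivers column followed by joins, but the exact transformations used in the deck are not known.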

Page 12: Modeling employees relationships with Apache Spark

Build the Graph

Once the data is prepared, we build the graph of employees connected by emails.

A graph takes vertices and edges as input. The vertices are the employees, and the edges are the emails sent and received.

We create the graph with GraphX, Spark's graph library.
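
To make the vertex and edge inputs concrete, here is a small plain-Python sketch (the deck itself uses GraphX in Scala, and that code is not reproduced in the transcript). It follows the same shape as a property graph: each vertex has an id and an attribute, and each edge has a source id, a destination id and an attribute. The row field names and the sample values are made up for the example.

```python
from collections import namedtuple

# An edge carries its source, its destination and the email it represents.
Edge = namedtuple("Edge", ["src", "dst", "email_id"])


def build_graph(rows):
    """Derive (vertices, edges) from merged email rows.

    vertices: dict mapping employee id -> employee name
    edges:    list of Edge, one per (sender, receiver, email)
    """
    vertices = {}
    edges = []
    for row in rows:
        vertices[row["sender_id"]] = row["sender_name"]
        vertices[row["receiver_id"]] = row["receiver_name"]
        edges.append(Edge(row["sender_id"], row["receiver_id"], row["email_id"]))
    return vertices, edges


# Hypothetical merged rows.
rows = [
    {"email_id": 100, "sender_id": 1, "sender_name": "Alice",
     "receiver_id": 2, "receiver_name": "Bob"},
    {"email_id": 100, "sender_id": 1, "sender_name": "Alice",
     "receiver_id": 3, "receiver_name": "Carol"},
    {"email_id": 101, "sender_id": 2, "sender_name": "Bob",
     "receiver_id": 1, "receiver_name": "Alice"},
]
vertices, edges = build_graph(rows)
print(len(vertices), len(edges))  # 3 3
```

Keeping one edge per email (instead of one weighted edge per pair of employees) is a design choice of this sketch; it preserves the email ID on each edge.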

Page 13: Modeling employees relationships with Apache Spark

Analyze: Degree of influence by employee (1)

The PageRank algorithm provides a ranking value for each vertex in a graph. It assumes that a vertex is important when many edges point to it, especially edges coming from vertices that are themselves important.
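
To show how such a ranking can be computed, here is a textbook power-iteration version of PageRank in plain Python. It is a sketch of the general algorithm, not the GraphX implementation used in the deck, and its scores are not expected to match GraphX's values. The damping factor, the iteration count and the sample names are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a list of (src, dst) edges.

    Returns a dict mapping each vertex to a score; the scores sum to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if out_degree[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        # Each vertex passes its rank, split equally, along its outgoing edges.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Made-up example: three employees write to "dana", who replies to "ann".
edges = [("ann", "dana"), ("bob", "dana"), ("carol", "dana"), ("dana", "ann")]
ranks = pagerank(edges)
top = max(ranks, key=ranks.get)
print(top)  # dana
```

Here an edge from sender to receiver means that receiving many emails raises a vertex's score, which is why the employee who receives the most mail ends up on top.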

Page 14: Modeling employees relationships with Apache Spark

Analyze: Degree of influence by employee (2)

The employee «Lewis Mark» has the highest rank. He receives most of the emails exchanged. That person most likely occupies a key role in the organization (leader, CEO?).

Page 15: Modeling employees relationships with Apache Spark

Analyze: Most connected employees (1)

This algorithm provides a measure of connectivity between employees: it shows which employees are strongly connected and which are not connected.
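
The transcript does not name the algorithm used on this slide. One common way to measure connectivity is to compute connected components, so the sketch below shows that technique as an assumption, in plain Python with a union-find structure. Each vertex is labeled with the smallest vertex id in its component, which is the same labeling convention as GraphX's `connectedComponents`. The sample data is made up.

```python
def connected_components(nodes, edges):
    """Map each node to the smallest node id in its connected component.

    Edge direction is ignored: two employees are connected as soon as
    one of them has emailed the other.
    """
    parent = {node: node for node in nodes}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so labels are deterministic.
            low, high = sorted((root_a, root_b))
            parent[high] = low
    return {node: find(node) for node in nodes}


nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]
labels = connected_components(nodes, edges)
print(labels)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}
```

In this example employees 1, 2 and 3 form one group, 4 and 5 another, and employee 6 is isolated because no email involves them.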

Page 16: Modeling employees relationships with Apache Spark

Analyze: Most connected employees (2)


Page 17: Modeling employees relationships with Apache Spark

Visualize with Neo4J

Neo4J is a graph database with its own web server for browsing and visualizing data. This makes it easier to understand the graph's components and the connections between them.

It is possible to run several algorithms, such as the famous PageRank, on Neo4J graphs.

To create the graph in Neo4J, we extract the vertices and edges from Spark into CSV files, then run the neo4j-import script to load the database.
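
As an illustration of the export step, the plain-Python sketch below writes vertices and edges in the CSV header style that neo4j-import reads (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`). The `Employee` label, the `SENT_TO` relationship type, the property names and the sample data are hypothetical; the deck's actual export code and file layout are not shown.

```python
import csv
import io


def nodes_csv(vertices):
    """Render vertices (employee id -> name) as a neo4j-import node file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["employeeId:ID", "name", ":LABEL"])
    for employee_id, name in sorted(vertices.items()):
        writer.writerow([employee_id, name, "Employee"])
    return buf.getvalue()


def relationships_csv(edges):
    """Render (src, dst, email_id) edges as a neo4j-import relationship file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", "emailId", ":TYPE"])
    for src, dst, email_id in edges:
        writer.writerow([src, dst, email_id, "SENT_TO"])
    return buf.getvalue()


# Made-up sample graph.
vertices = {1: "Alice", 2: "Bob"}
edges = [(1, 2, 100), (2, 1, 101)]
print(nodes_csv(vertices))
print(relationships_csv(edges))
```

The two strings would then be saved to files and passed to the import script through its node and relationship file options.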

Page 18: Modeling employees relationships with Apache Spark

Neo4J Graph

Page 19: Modeling employees relationships with Apache Spark

Conclusion

The Spark platform is very powerful and provides excellent tools for data analysis.

HR data becomes more relevant when analyzed together with an organization's other data sources.

It is possible to cut HR costs and to act more effectively once we understand how the organization is structured.

Page 20: Modeling employees relationships with Apache Spark

For feedback, reach me on LinkedIn: fr.linkedin.com/in/WassTRIFI