Modeling Employee Relationships with Apache Spark

Post on 22-Jan-2018

Wassim Trifi – Apr 2016

Objectives

Build a business relationship model of an organization based on transactions between employees.

Express the business model.

Understand the communication process between employees.

Measure each employee’s degree of influence.

Get most connected employees.

Gain more insight into communities inside an organization.

How to do it?

Data

• HR data from HRIS DB

• Daily email communications between employees from EmailServer

ETL

• Collect data from the sources with Spark SQL

• Clean, transform and aggregate

• Load into structured data frames

Analyze

• Prepare inputs for the graph

• Create the communication graph with GraphX

• Apply algorithms and comment on the results

Visualize

• Load the graph into Neo4J

• Visualize the components and their respective connections.

Architecture

[Architecture diagram; surviving labels: HDFS, HRIS DB (tables), EmailServer (file), Visualize]

Data Source: Email Server

A JSON file of 42,000 lines with email details («ID»: email identifier, «Receivers», «Sender»).

I developed a Python program to generate this file.
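The original generator is not shown in the transcript. As a minimal sketch of what such a program could look like, the following assumes one JSON object per line, integer employee IDs for «Sender» and «Receivers», and at most five receivers per email. The function names and parameters are hypothetical.

```python
import json
import random

def generate_emails(n_emails, n_employees, max_receivers=5, seed=0):
    """Return a list of email records with ID, Sender and Receivers fields."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    employee_ids = list(range(1, n_employees + 1))
    emails = []
    for email_id in range(1, n_emails + 1):
        sender = rng.choice(employee_ids)
        # Receivers are distinct employees other than the sender.
        others = [e for e in employee_ids if e != sender]
        k = rng.randint(1, min(max_receivers, len(others)))
        receivers = rng.sample(others, k)
        emails.append({"ID": email_id, "Sender": sender, "Receivers": receivers})
    return emails

def to_json_lines(emails):
    """Serialize one email per line, the layout Spark's JSON reader reads by default."""
    return "\n".join(json.dumps(e) for e in emails)
```

Calling `to_json_lines(generate_emails(42000, 1000))` would produce a file of the size described on this slide.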

Data Source: Employee DB

A PostgreSQL table of 1,000 rows containing employee data.

ETL: Extract Employee Table

Apache Spark is a powerful tool for ETL and analytics.

We use the Spark SQL library in Scala to load the data from PostgreSQL.

ETL: Employee DataFrame

The new DataFrame contains 1,000 rows of employee details.

ETL: Extract Emails File

The JSON file is stored in HDFS.

We load it and transform its content into a DataFrame with Spark SQL, in Scala.

ETL: Emails DataFrame

The new DataFrame is held in memory and contains 372,011 rows of email details.

ETL: Merge All in one Table

We join the email and employee details into a new DataFrame.

The result is a DataFrame of 372,011 rows that maps each sender and receiver to the ID of the email sent.
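The merge step can be illustrated without Spark. This plain-Python sketch assumes the merged table holds one row per (email, receiver) pair and that employees are looked up by ID. Both are assumptions, since the original Scala code is not in the transcript, and the function and field names are hypothetical.

```python
def explode_emails(emails):
    """Turn each email into one (email_id, sender, receiver) row per receiver."""
    rows = []
    for email in emails:
        for receiver in email["Receivers"]:
            rows.append((email["ID"], email["Sender"], receiver))
    return rows

def attach_names(rows, employees):
    """Inner-join the rows with an {employee_id: name} lookup.

    Rows whose sender or receiver is missing from the lookup are dropped.
    """
    return [
        {
            "email_id": email_id,
            "sender_id": sender,
            "sender_name": employees[sender],
            "receiver_id": receiver,
            "receiver_name": employees[receiver],
        }
        for email_id, sender, receiver in rows
        if sender in employees and receiver in employees
    ]
```

In Spark the same shape is usually obtained by exploding the receivers array and joining on the employee ID.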

Build the Graph

Once the data is prepared, we build the graph of employees connected by emails.

A graph takes vertices and edges as input. The vertices are the employees and the edges are the emails sent and received.

We create the graph with GraphX, the Spark graph library.
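To make the vertex and edge inputs concrete, here is a small plain-Python sketch rather than GraphX code. It assumes the merged rows are (email_id, sender, receiver) triples and collapses repeated emails between the same pair into one weighted edge. The weighting is an illustrative choice, not something the slides state.

```python
from collections import Counter

def build_graph(rows, employees):
    """Build vertex and edge collections from (email_id, sender, receiver) rows.

    vertices: {employee_id: name}
    edges:    sorted list of (sender, receiver, number_of_emails)
    """
    vertices = dict(employees)
    # Count how many emails went from each sender to each receiver.
    counts = Counter((sender, receiver) for _, sender, receiver in rows)
    edges = [(s, r, n) for (s, r), n in sorted(counts.items())]
    return vertices, edges
```

GraphX takes the same two inputs: a collection of (vertex ID, attribute) pairs and a collection of edges carrying a source ID, a destination ID and an attribute.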

Analyze: Degree of Influence by Employee (1)

The PageRank algorithm provides a ranking value for each vertex in a graph. It assumes that a vertex is important when many edges point to it, especially edges coming from other important vertices.
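The idea behind PageRank can be shown with a short power-iteration sketch in plain Python. This is not the GraphX implementation, which may scale its scores differently. The damping factor of 0.85 and the fixed iteration count are conventional defaults assumed here.

```python
def pagerank(vertices, edges, damping=0.85, iterations=50):
    """Compute PageRank scores that sum to 1.

    vertices: iterable of vertex ids
    edges:    iterable of (src, dst) pairs; repeated pairs count as extra links
    """
    vertices = list(vertices)
    n = len(vertices)
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Vertices without outgoing edges spread their rank evenly over all vertices.
        dangling = sum(rank[v] for v in vertices if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in vertices}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank
```

An employee who receives emails from many colleagues, or from a few highly ranked ones, ends up with a high score.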

Analyze: Degree of Influence by Employee (2)

The employee «Lewis Mark» has the highest rank. He receives most of the emails exchanged. This person most likely holds a key role in the organization (leader, CEO?).

Analyze: Most Connected Employees (1)

This algorithm provides a measure of connectivity between employees: it shows which employees are strongly connected and which are not connected at all.
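The transcript does not name the algorithm used on this slide. Assuming it is a connected-components computation (GraphX offers `connectedComponents` and `stronglyConnectedComponents`), the following plain-Python sketch shows the idea, treating every email as an undirected link. It labels each employee with the smallest vertex ID in their component.

```python
from collections import defaultdict, deque

def connected_components(vertices, edges):
    """Map each vertex to the smallest vertex id in its connected component."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        # Ignore direction: an email links both employees.
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    component = {}
    # Visiting vertices in ascending order makes the first vertex reached
    # in each component the smallest id of that component.
    for start in sorted(vertices):
        if start in component:
            continue
        component[start] = start
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for other in neighbors[current]:
                if other not in component:
                    component[other] = start
                    queue.append(other)
    return component
```

Employees who never exchange emails with anyone end up alone in their own component, which makes isolated people easy to spot.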

Visualize with Neo4J

Neo4J is a graph database with its own web interface for browsing and visualizing data. This makes it easier to understand the graph's components and the connections between them.

It is possible to run several algorithms, such as the well-known PageRank, on Neo4J graphs.

To create the graph in Neo4J, we export the vertices and edges from Spark into CSV files, then run the neo4j-import script to load the database.
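As a rough sketch of the export step, the plain-Python code below writes the two CSV files using the header convention the neo4j-import tool reads (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`). The `Employee` label, the `EMAILED` relationship type and the `count` property are hypothetical names chosen for illustration.

```python
import csv
import io

def nodes_csv(vertices):
    """Render {employee_id: name} as a node file for neo4j-import."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["employeeId:ID", "name", ":LABEL"])
    for emp_id, name in sorted(vertices.items()):
        writer.writerow([emp_id, name, "Employee"])
    return buf.getvalue()

def relationships_csv(edges):
    """Render (sender, receiver, email_count) triples as a relationship file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", "count:int", ":TYPE"])
    for sender, receiver, count in edges:
        writer.writerow([sender, receiver, count, "EMAILED"])
    return buf.getvalue()
```

The returned strings can be written to files and passed to the import script as its node and relationship inputs.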

Neo4J Graph

Conclusion

The Spark platform is powerful and provides a rich set of tools for data analysis.

HR data is more relevant when analyzed together with an organization's other data sources.

It becomes possible to cut HR costs and act more effectively once we understand how the organization is structured.

For feedback, reach me on LinkedIn:

fr.linkedin.com/in/WassTRIFI
