Capgemini - Project industrialization with Apache Spark

Posted on 16-Apr-2017

Apache Spark and Bluemix Meetup

Jean-Baptiste Martin

July 6, 2016

Project industrialization with Apache Spark

Copyright © Capgemini 2015. All Rights Reserved

Who am I

Jean-Baptiste Martin
• Managing Consultant at Capgemini
• Background: technical
• Big Data Analytics for 2 years
• Product manager People Analytics
• Founder at Top Notch

Project industrialization with Apache Spark

1. Spark in People Analytics

2. Team Organization

3. Issue #1: Text Replace

4. Issue #2: Non-Serializable Objects

5. Issue #3: Unit Testing

6. Issue #4: Wall of Code

Code available at:

https://github.com/jeanbmar/meetup-spark

Spark in People Analytics

What is People Analytics?

[Architecture diagram. Labels: Employees, Candidates, Jobs; Structured sources (RDBMS, CSV files) and Unstructured sources; HDFS Store with HDFS Access (ODPi); Analytics Engine (Data Reconciliation); Watson Explorer with WEX Engine (Data Indexing) and WEX AppBuilder (Visualization); steps numbered 1 to 4.]

Team Organization

1. Prototyping:
• Technologies: Hadoop, Java, R, Watson Explorer
• Team Profiles: 4 big data dev (Java), 1 data scientist, 1 data analyst

2. Industrialization:
• Technologies: Hadoop, Java and Scala, Spark, Watson Explorer
• Team Profiles:
– 2 data scientists
– 2 software developers
– 1 sys admin
– 2 web developers

3. All along:
• Strong support from IBM (expertise, implementation, go-to-market)

Issues we faced

• Issue #1: Text Replace
• Issue #2: Non-Serializable Objects
• Issue #3: Unit Testing
• Issue #4: Wall of Code

Issue #1: Text Replace

Scanning text and replacing terms is a common step in natural language processing.

Input: « I work with WEX at Cap Gemini »

Dictionary:
• Cap Gemini → Capgemini
• WEX → Watson Explorer

Output: « I work with Watson Explorer at Capgemini »

Problems arise when:
• There are many documents to process
• Dictionaries (synonyms, stopwords, protected words, …) contain 1000+ entries

Traditional implementations:
• Loop over dictionary entries → low performance and/or incorrect results
• Regular expressions → low performance

What we want: read the text once and apply the transformations on the fly
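To make the "incorrect" part concrete, here is a minimal sketch of the loop-over-entries approach. The class and method names are hypothetical and not taken from the talk's repository. It only illustrates why a plain substring replace per entry is both slow (one full pass over the text per dictionary entry) and unsafe (it matches inside longer words).

```java
import java.util.Map;

/** Hypothetical illustration of the naive approach; not the talk's code. */
final class NaiveReplace {
    private NaiveReplace() { }

    /** Applies every dictionary entry in turn: one full pass over the text per entry. */
    static String replaceAll(String text, Map<String, String> dictionary) {
        String result = text;
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            // String.replace matches anywhere, including inside longer words.
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

With the single entry en → english, the text "Engineer. English. Fluent en." comes back as "Engineer. English. Fluenglisht english.": the word "Fluent" is corrupted because it contains "en".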

Solution

1. Expand dictionaries into HashMap objects, e.g. for the entry WEX → Watson Explorer:
• W → X
• WE → X
• WEX → Watson Explorer

2. Read the text character by character and look up the accumulated characters in the HashMap objects:
– X → the combination of characters is part of an existing word
– null → no match
– Other → match
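The two steps above can be sketched as follows. This is an illustrative reconstruction, not the code from the talk's repository: the class name `PrefixReplacer` is hypothetical, `Optional.empty()` plays the role of the "X" marker, and the word-boundary and longest-match rules are assumptions added so that the sketch handles cases such as "Fluent" versus "en".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical single-pass replacer built on an expanded prefix table. */
final class PrefixReplacer {
    // Absent key: no term starts with these characters ("null" on the slide).
    // Optional.empty(): the characters are a prefix of a term ("X" on the slide).
    // Optional.of(r): the characters are a complete term with replacement r.
    private final Map<String, Optional<String>> table = new HashMap<>();

    PrefixReplacer(Map<String, String> dictionary) {
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            String term = entry.getKey();
            for (int i = 1; i < term.length(); i++) {
                table.putIfAbsent(term.substring(0, i), Optional.empty());
            }
            table.put(term, Optional.of(entry.getValue()));
        }
    }

    /** Scans the text left to right and emits replacements on the fly. */
    String replace(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int matchEnd = -1;
            String matchValue = null;
            if (isBoundary(text, i)) { // assumption: terms only start at word boundaries
                StringBuilder candidate = new StringBuilder();
                for (int j = i; j < text.length(); j++) {
                    candidate.append(text.charAt(j));
                    Optional<String> hit = table.get(candidate.toString());
                    if (hit == null) {
                        break; // no term starts with these characters
                    }
                    if (hit.isPresent() && isBoundary(text, j + 1)) {
                        matchEnd = j + 1; // remember the longest complete match so far
                        matchValue = hit.get();
                    }
                }
            }
            if (matchEnd != -1) {
                out.append(matchValue);
                i = matchEnd; // replaced text is not scanned again
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    /** True unless the position sits between two letters or digits. */
    private static boolean isBoundary(String text, int pos) {
        if (pos <= 0 || pos >= text.length()) {
            return true;
        }
        return !(Character.isLetterOrDigit(text.charAt(pos - 1))
                && Character.isLetterOrDigit(text.charAt(pos)));
    }
}
```

With the entries WEX → Watson Explorer and Cap Gemini → Capgemini, `replace("I work with WEX at Cap Gemini")` yields "I work with Watson Explorer at Capgemini". The inner loop calls `toString()` on every character, which is fine for a sketch; a production version would more likely walk a trie to avoid the repeated string copies.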

Case 1:
• Have: “Engineer. English. Fluent en.”
• Want: “Engineer. English. Fluent english.”

Case 2:
• Have: “Cap Gemini consultant and Big Data developer with strong xp on Hadoop, mostly Hadoop FS. BI background (DataStage, Cognos, Oracle, DB2). Worked on multiple Watson technologies, including Watson API and WEX.”
• Dictionary, 875 entries including:
– Cap Gemini → Capgemini
– Hadoop FS → HDFS
– DataStage → IBM DataStage
– Cognos → IBM Cognos
– DB2 → IBM DB2
– WEX → Watson Explorer

Issue #2: Non-Serializable Objects

Jobs sometimes need external libraries to perform specific transformations on objects.

Example: perform NLP transformations with Apache OpenNLP

Problem:
• OpenNLP objects are not serializable → no broadcast
• OpenNLP objects take time to initialize → never-ending closures
• We don’t want to convert OpenNLP source code (actually we tried)

Solution: Initialize singletons and bind them to Spark tasks using Java ThreadLocal

Code annotations on the slide:
• Singleton class
• Bind singleton to task thread
• Will be called in closure

Then call the transformation in closures.

Code annotations on the slide:
• Retrieve holder from current task
• Get singleton object
• Get SimpleClass object

Benefit: objects are initialized only once per task instead of once per RDD element
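The slide's code is not part of this transcript, so the following is only a sketch of the ThreadLocal pattern it describes. `SimpleClass` is the name used on the slide. `SimpleClassHolder`, the instance counter, and the `transform` method are hypothetical additions that make the example self-contained and free of Spark or OpenNLP dependencies.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in for a costly, non-serializable object such as an OpenNLP model wrapper. */
final class SimpleClass {
    /** Hypothetical counter showing how often the costly setup runs. */
    static final AtomicInteger INSTANCES = new AtomicInteger();

    SimpleClass() {
        INSTANCES.incrementAndGet();
    }

    String transform(String input) {
        return input.trim().toLowerCase(Locale.ROOT);
    }
}

/** Holds one SimpleClass per thread; no instance is ever captured by a closure. */
final class SimpleClassHolder {
    private static final ThreadLocal<SimpleClass> INSTANCE =
            ThreadLocal.withInitial(SimpleClass::new);

    private SimpleClassHolder() { }

    /** Returns the calling thread's instance, creating it lazily on first use. */
    static SimpleClass get() {
        return INSTANCE.get();
    }
}
```

In a Spark job the closure would then look something like `rdd.map(line -> SimpleClassHolder.get().transform(line))`. Because the lambda only refers to a static method, Spark has nothing non-serializable to ship. Executor threads are typically reused across tasks, so in practice the object may be built even less often than once per task.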

Issue #3: Unit Testing

One major step when moving from prototype to production is defining a proper testing strategy.

How people run their tests (non-exhaustive):
1. They run everything on the cluster \o/
2. They use a local context

What we did:
• Use a local context

Problem: jobs grab content from HDFS using the Oozie job.properties

Solution: set up a flexible configuration that operates seamlessly on the cluster and locally

What it looks like:

Code annotations on the slide:
• Class applying a set of transformations
• This grabs files on HDFS, so it can’t be used locally

How can we operate seamlessly with a remote or local job.properties?

Using this ConfigHelper class

Code annotations on the slide:
• Call conf
• Grab on FS
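The ConfigHelper source is not included in this transcript. The sketch below shows one way such a helper could be shaped, under the assumption that the only thing differing between cluster and local runs is how the job.properties stream is opened. The `Opener` interface, the `LOCAL` constant, and the `get` method are hypothetical names. On the cluster, an `Opener` backed by Hadoop's `FileSystem.open(Path)` could be passed instead of `LOCAL`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical configuration helper; not the class from the talk's repository. */
final class ConfigHelper {
    /** Opens job.properties; the only part that differs between cluster and local runs. */
    interface Opener {
        InputStream open(String location) throws IOException;
    }

    /** Local runs and unit tests: read from the local file system. */
    static final Opener LOCAL = location -> Files.newInputStream(Paths.get(location));

    private final Properties properties = new Properties();

    ConfigHelper(String location, Opener opener) throws IOException {
        try (InputStream in = opener.open(location)) {
            properties.load(in);
        }
    }

    /** Returns the value for a key, failing fast when the key is missing. */
    String get(String key) {
        String value = properties.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value;
    }
}
```

A transformation class that receives a `ConfigHelper` no longer cares where the properties came from, so a unit test can point it at a small local file and run the same code against a local Spark context.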

Finally, our test:

Issue #4: Wall of Code

Object-oriented modeling doesn’t map well onto Spark.

As a result, we tend to write huge functions with tons of transformations → People Analytics V0.01alpha: 1 class

How we managed this: we regrouped consistent sets of transformations into functional classes.

Code annotations on the slide:
• Functional class
• Consecutive operations grouped in the class’s run method
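As an illustration of the idea, here is a small sketch with two functional classes, each exposing a single `run` method that chains its consecutive operations. The class names are hypothetical, and plain `java.util.stream` pipelines stand in for RDD transformations so that the example runs without a Spark dependency. In a real job, `run` would presumably take and return RDDs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Groups the operations that turn raw lines into lower-case tokens. */
final class Tokenize {
    List<String> run(List<String> lines) {
        return lines.stream()
                .map(line -> line.toLowerCase(Locale.ROOT))
                .flatMap(line -> Arrays.stream(line.split("\\W+")))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }
}

/** Groups the operations that drop unwanted tokens. */
final class RemoveStopwords {
    private final Set<String> stopwords;

    RemoveStopwords(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    List<String> run(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !stopwords.contains(token))
                .collect(Collectors.toList());
    }
}
```

The job itself then reads as a short sequence of `run` calls, for example `new RemoveStopwords(stopwords).run(new Tokenize().run(lines))`, and each class can be unit tested on its own.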

Thank You

Credits:
• jean-baptiste.martin@capgemini.com
• jerome.delvigne@capgemini.com

Code available at: https://github.com/jeanbmar/meetup-spark

The information contained in this presentation is proprietary. Copyright © 2015 Capgemini. All rights reserved.

Rightshore® is a trademark belonging to Capgemini.

www.capgemini.com

About Capgemini

With 180,000 people in over 40 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2014 global revenues of EUR 10.573 billion.

Together with its clients, Capgemini creates and delivers business, technology and digital solutions that fit their needs, enabling them to achieve innovation and competitiveness. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience™, and draws on Rightshore®, its worldwide delivery model.

Learn more about us at www.capgemini.com.
