Capgemini - Project industrialization with Apache Spark

Apache Spark and Bluemix Meetup Jean-Baptiste Martin July 6, 2016 Project industrialization with Apache Spark

Upload: jean-baptiste-martin

Post on 16-Apr-2017


Page 1: Capgemini - Project industrialization with apache spark

Apache Spark and Bluemix Meetup
Jean-Baptiste Martin
July 6, 2016
Project industrialization with Apache Spark

Copyright © Capgemini 2015. All Rights Reserved

Who am I

Jean-Baptiste Martin
• Managing Consultant at Capgemini
• Background: technical
• Big Data Analytics for 2 years
• Product manager, People Analytics
• Founder at Top Notch

Project industrialization with Apache Spark

1. Spark in People Analytics

2. Team Organization

3. Issue #1: Text Replace

4. Issue #2: Non-Serializable Objects

5. Issue #3: Unit Testing

6. Issue #4: Wall of Code

Code available at:

https://github.com/jeanbmar/meetup-spark

Spark in People Analytics

What is People Analytics?

Spark in People Analytics

[Figure: architecture diagram. Recoverable labels: Unstructured; Structured (DBMS, CSV files); Employees, Candidates, Jobs; HDFS Store, HDFS Access, ODPi; Analytics Engine, Data Reconciliation; Watson Explorer, WEX Engine, Data Indexing, WEX AppBuilder, Visualization; numbered steps 1 to 4.]

Team Organization

1. Prototyping:
• Technologies: Hadoop, Java, R, Watson Explorer
• Team Profiles: 4 big data dev (Java), 1 data scientist, 1 data analyst

2. Industrialization:
• Technologies: Hadoop, Java and Scala, Spark, Watson Explorer
• Team Profiles:
– 2 data scientists
– 2 software developers
– 1 sys admin
– 2 web developers

3. All along:
• Strong support from IBM (expertise, implementation, go-to-market)

Issues we faced

Issue #1: Text Replace
Issue #2: Non-Serializable Objects
Issue #3: Unit Testing
Issue #4: Wall of Code

Issue #1: Text Replace

Scanning text and replacing terms is a common task in natural language processing

« I work with WEX at Cap Gemini »
+
Cap Gemini → Capgemini
WEX → Watson Explorer
=
« I work with Watson Explorer at Capgemini »

Issue #1: Text Replace

Issues when:
• There are a lot of documents to process
• Dictionaries (synonyms, stopwords, protected words, …) contain 1000+ entries

Traditional implementations:
• Loop over dictionary entries → LOW PERF AND/OR INCORRECT
• Regular expressions → LOW PERF

We want: read the text once and perform the transformations on the fly
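Why the dictionary loop can be incorrect is easy to reproduce. The sketch below is only an illustration, not the project's code: the class name `NaiveReplace` is made up, and the sentence is the "Fluent en." example this talk itself uses as a test case. Each entry costs one full pass over the text, and a short key such as "en" is also rewritten inside longer words.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the "loop over dictionary entries" approach.
public class NaiveReplace {

    /** Applies every entry in turn: one full pass over the text per entry. */
    public static String replaceAll(String text, Map<String, String> dictionary) {
        String result = text;
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            // Plain substring replacement, with no notion of word boundaries.
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("en", "english");
        // "en" is also found inside "Fluent", so that word gets mangled too.
        System.out.println(replaceAll("Engineer. English. Fluent en.", dictionary));
    }
}
```

With 1000+ entries the cost grows with the dictionary size, and replacements made by one entry can be picked up again by a later one.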

Issue #1: Text Replace

Solution

1. Expand dictionaries in HashMap objects, e.g.
W → X
WE → X
WEX → Watson Explorer

2. Read text character by character and perform lookups over HashMap objects
– X → combination of characters is a part of an existing word
– null → no match
– Other → match
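The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the talk's repository: the class name `DictionaryReplacer` is invented, `Optional.empty()` stands in for the "X" marker, matching is case-sensitive, the longest entry wins, and entries are only matched on word boundaries (an assumption added so that "en" is left alone inside "Fluent").

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the expanded-HashMap lookup.
public class DictionaryReplacer {

    // Optional.empty() = "X" (prefix of a longer entry); a present value = replacement.
    private final Map<String, Optional<String>> expanded = new HashMap<>();

    public DictionaryReplacer(Map<String, String> dictionary) {
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            String key = entry.getKey();
            for (int len = 1; len < key.length(); len++) {
                // Mark every proper prefix, without overwriting a real entry.
                expanded.putIfAbsent(key.substring(0, len), Optional.empty());
            }
            expanded.put(key, Optional.of(entry.getValue()));
        }
    }

    public String replace(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int matchEnd = -1;
            String replacement = null;
            if (isBoundary(text, i - 1)) { // only start a lookup at the start of a word
                StringBuilder candidate = new StringBuilder();
                for (int j = i; j < text.length(); j++) {
                    candidate.append(text.charAt(j));
                    Optional<String> hit = expanded.get(candidate.toString());
                    if (hit == null) {
                        break; // null: no entry starts with these characters
                    }
                    if (hit.isPresent() && isBoundary(text, j + 1)) {
                        matchEnd = j + 1; // keep the longest match seen so far
                        replacement = hit.get();
                    }
                }
            }
            if (replacement != null) {
                out.append(replacement);
                i = matchEnd;
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    private static boolean isBoundary(String text, int index) {
        return index < 0 || index >= text.length()
                || !Character.isLetterOrDigit(text.charAt(index));
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("Cap Gemini", "Capgemini");
        dictionary.put("WEX", "Watson Explorer");
        DictionaryReplacer replacer = new DictionaryReplacer(dictionary);
        System.out.println(replacer.replace("I work with WEX at Cap Gemini"));
    }
}
```

The text is walked once from left to right. A failed candidate only rescans a few characters, bounded by the longest dictionary entry, so the cost no longer depends on the number of entries.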

Issue #1: Text Replace

Case 1:
• Have: “Engineer. English. Fluent en.”
• Want: “Engineer. English. Fluent english.”

Case 2:
• Have: “Cap Gemini consultant and Big Data developer with strong xp on Hadoop, mostly Hadoop FS. BI background (DataStage, Cognos, Oracle, DB2). Worked on multiple Watson technologies, including Watson API and WEX.”
• Dictionary, 875 entries including:
Cap Gemini → Capgemini
Hadoop FS → HDFS
DataStage → IBM DataStage
Cognos → IBM Cognos
DB2 → IBM DB2
WEX → Watson Explorer

Issue #2: Non-Serializable Objects

Sometimes, people need to use external libraries to perform specific transformations on objects

Example: perform NLP transformations with Apache OpenNLP

Problem:
• OpenNLP objects are not serializable → No broadcast
• OpenNLP objects take time to initialize → Never-ending closures
• We don’t want to convert OpenNLP source code (actually we tried)

Issue #2: Non-Serializable Objects

Solution: Initialize singletons and bind them to Spark tasks using Java ThreadLocal

Singleton class

Bind singleton to task thread

Will be called in closure
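The code shown on this slide did not survive in the transcript. A minimal sketch of such a holder, assuming nothing beyond the JDK, could look like the following. `HeavyTokenizer` is a made-up stand-in for a non-serializable, slow-to-build library object, and `TokenizerHolder` is an equally hypothetical name. Note that `ThreadLocal` binds the instance to the thread that runs the task.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one lazily created instance per thread.
public class ThreadLocalHolderDemo {

    /** Stand-in for a non-serializable library object that is slow to build. */
    public static final class HeavyTokenizer {
        public static final AtomicInteger CREATED = new AtomicInteger();

        public HeavyTokenizer() {
            CREATED.incrementAndGet(); // imagine loading a large model here
        }

        public String[] tokenize(String text) {
            return text.trim().split("\\s+");
        }
    }

    /** Holder: nothing is serialized, each thread builds its own instance on first use. */
    public static final class TokenizerHolder {
        private static final ThreadLocal<HeavyTokenizer> INSTANCE =
                ThreadLocal.withInitial(HeavyTokenizer::new);

        private TokenizerHolder() { }

        /** Meant to be called from inside a closure, on the worker side. */
        public static HeavyTokenizer get() {
            return INSTANCE.get();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HeavyTokenizer first = TokenizerHolder.get();
        // Same thread: the same instance comes back.
        System.out.println(first == TokenizerHolder.get());
        // Another thread: it gets its own instance.
        Thread other = new Thread(() -> System.out.println(first == TokenizerHolder.get()));
        other.start();
        other.join();
    }
}
```

Because the holder is reached through a static method, the closure shipped to the workers never has to carry the object itself.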

Issue #2: Non-Serializable Objects

Then call the transformation in closures:

Benefits: objects are initialized only 1x per task instead of 1x per RDD element

Retrieve holder from current task

Get singleton object

Get SimpleClass object
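The calling side can be simulated without a cluster. In the sketch below, which is an illustration and not the talk's code, a thread pool plays the role of the executor threads, each inner list plays the role of a partition, and `SimpleClass` (the name comes from the slide, its body is invented) counts how often it is constructed. `ClosureDemo`, `HOLDER`, `CLOSURE` and `runPartitions` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical simulation: worker threads stand in for Spark task threads.
public class ClosureDemo {

    public static final AtomicInteger INITIALIZATIONS = new AtomicInteger();

    /** Invented body for the slide's SimpleClass: expensive to build, cheap to call. */
    public static final class SimpleClass {
        public SimpleClass() {
            INITIALIZATIONS.incrementAndGet();
        }

        public String transform(String s) {
            return s.toUpperCase(Locale.ROOT);
        }
    }

    static final ThreadLocal<SimpleClass> HOLDER = ThreadLocal.withInitial(SimpleClass::new);

    /** The closure captures nothing heavy: it looks the object up when it runs. */
    public static final Function<String, String> CLOSURE = s -> HOLDER.get().transform(s);

    /** Processes each partition on a worker thread and gathers the results in order. */
    public static List<String> runPartitions(List<List<String>> partitions, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> partition : partitions) {
                futures.add(pool.submit(() -> {
                    List<String> out = new ArrayList<>();
                    for (String element : partition) {
                        out.add(CLOSURE.apply(element));
                    }
                    return out;
                }));
            }
            List<String> result = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                result.addAll(future.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b", "c"), Arrays.asList("d", "e"));
        System.out.println(runPartitions(partitions, 1));
        // Five elements were transformed, but with one worker thread
        // SimpleClass was constructed only once.
        System.out.println("initializations: " + INITIALIZATIONS.get());
    }
}
```

Building a `new SimpleClass()` inside the closure instead would construct one object per element, which is the cost this pattern avoids.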

Issue #3: Unit Testing

One major step when moving from prototype to production is to define a proper testing strategy

How people run their tests (non-exhaustive):
1. They run everything on the cluster \o/
2. They use a local context

What we did:
• Use a local context

Problem: jobs grab content from HDFS using Oozie job.properties
Solution: set up a flexible configuration to operate seamlessly on the cluster and locally

Issue #3: Unit Testing

What it looks like:

Class applying a set of transformations

This grabs files on HDFS, so it can’t be used locally

Issue #3: Unit Testing

How can we operate seamlessly with a remote or local job.properties?

Using this

ConfigHelper class
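The ConfigHelper source is not part of this transcript, so the following is only a guess at its shape. The idea sketched here is to hide where job.properties comes from behind a small interface: unit tests pass a local opener, while the cluster would pass one that reads from HDFS (not shown, since it needs the Hadoop client library). The `Opener` interface, the `LOCAL` constant and the `inputDir` key are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical sketch: the same job code reads its configuration locally or remotely.
public class ConfigHelper {

    /** Opens a configuration file; a cluster implementation could read from HDFS instead. */
    @FunctionalInterface
    public interface Opener {
        InputStream open(String location) throws IOException;
    }

    /** Local file system opener, the one a unit test would use. */
    public static final Opener LOCAL = location -> Files.newInputStream(Paths.get(location));

    private final Properties properties = new Properties();

    public ConfigHelper(String location, Opener opener) throws IOException {
        try (InputStream in = opener.open(location)) {
            properties.load(in);
        }
    }

    /** Fails fast on a missing key rather than handing back null. */
    public String get(String key) {
        String value = properties.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        // A throwaway local job.properties, as a test would create it.
        Path file = Files.createTempFile("job", ".properties");
        Files.write(file, "inputDir=src/test/resources/candidates\n"
                .getBytes(StandardCharsets.UTF_8));
        ConfigHelper conf = new ConfigHelper(file.toString(), LOCAL);
        System.out.println(conf.get("inputDir"));
        Files.delete(file);
    }
}
```

The transformation classes then ask the helper for their paths and never open job.properties themselves, which is what lets the same class run under a local context.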

Issue #3: Unit Testing

Call conf

Grab on FS

Issue #3: Unit Testing

Finally, our test:

Issue #4: Wall of Code

Object-oriented modeling doesn’t apply well in Spark

As a result, we tend to write huge functions with tons of transformations
People Analytics V0.01alpha: 1 class

How we managed this: we grouped consistent sets of transformations into functional classes

Functional class

Consecutive operations are ordered in the run method
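The shape of such a functional class can be sketched without Spark. In this illustration `java.util.stream.Stream` stands in for an RDD, and the class name `NormalizeSkills` and its three steps are invented for the example. Each transformation is a small named method, and `run` only states the order in which they are applied.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: Stream stands in for an RDD.
public class FunctionalClassDemo {

    /** One coherent set of transformations, exposed through a single run method. */
    public static final class NormalizeSkills {

        /** Lists the consecutive operations in the order they are applied. */
        public Stream<String> run(Stream<String> lines) {
            Stream<String> trimmed = trim(lines);
            Stream<String> nonEmpty = dropEmpty(trimmed);
            return lowerCase(nonEmpty);
        }

        private Stream<String> trim(Stream<String> in) {
            return in.map(String::trim);
        }

        private Stream<String> dropEmpty(Stream<String> in) {
            return in.filter(s -> !s.isEmpty());
        }

        private Stream<String> lowerCase(Stream<String> in) {
            return in.map(s -> s.toLowerCase(Locale.ROOT));
        }
    }

    public static void main(String[] args) {
        List<String> out = new NormalizeSkills()
                .run(Stream.of(" Hadoop ", "", "Spark"))
                .collect(Collectors.toList());
        System.out.println(out);
    }
}
```

A job then becomes a short sequence of such classes, and each one can be tested on its own against a small input.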

Thank You

Credits:

[email protected]@capgemini.com

Code available at: https://github.com/jeanbmar/meetup-spark
