Capgemini - Project industrialization with Apache Spark
Post on 16-Apr-2017
Apache Spark and Bluemix Meetup
Jean-Baptiste Martin
July 6, 2016
Project industrialization with Apache Spark
Copyright © Capgemini 2015. All Rights Reserved
Who am I
Jean-Baptiste Martin
• Managing Consultant at Capgemini
• Background: technical
• Big Data Analytics for 2 years
• Product manager, People Analytics
• Founder at Top Notch
Project industrialization with Apache Spark
1. Spark in People Analytics
2. Team Organization
3. Issue #1: Text Replace
4. Issue #2: Non-Serializable Objects
5. Issue #3: Unit Testing
6. Issue #4: Wall of Code
Code available at:
https://github.com/jeanbmar/meetup-spark
Spark in People Analytics
What is People Analytics?
Spark in People Analytics
[Architecture diagram. Labels: Structured; SGBD (DBMS); CSV Files; Unstructured; Employees; Candidates; Jobs; HDFS Store; HDFS Access; ODPi; Analytics Engine; Data Reconciliation; Watson Explorer; WEX Engine; WEX AppBuilder; Data Indexing; Visualization; step markers 1 to 4]
Team Organization
1. Prototyping:
• Technologies: Hadoop, Java, R, Watson Explorer
• Team Profiles: 4 big data developers (Java), 1 data scientist, 1 data analyst
2. Industrialization:
• Technologies: Hadoop, Java and Scala, Spark, Watson Explorer
• Team Profiles:
– 2 data scientists
– 2 software developers
– 1 sys admin
– 2 web developers
3. All along:
• Strong support from IBM (expertise, implementation, go-to-market)
Issues we faced
Issue #1: Text Replace
Issue #2: Non-Serializable Objects
Issue #3: Unit Testing
Issue #4: Wall of Code
Issue #1 : Text Replace
Scanning text and replacing terms is a common task in natural language processing
« I work with WEX at Cap Gemini »
+
Cap Gemini → Capgemini
WEX → Watson Explorer
=
« I work with Watson Explorer at Capgemini »
Issue #1 : Text Replace
Issues arise when:
• There are many documents to process
• Dictionaries (synonyms, stopwords, protected words, …) contain 1000+ entries
Traditional implementations:
• Loop over dictionary entries → LOW PERF AND/OR INCORRECT
• Regular expressions → LOW PERF
We want: read the text once and perform transformations on the fly
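To illustrate why looping over dictionary entries can be incorrect and slow, here is a minimal Java sketch. The class and method names (`NaiveReplace`, `replaceAll`) are hypothetical and do not come from the talk; the input sentence and the en → english entry are taken from Case 1 of this deck.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NaiveReplace {

    // Hypothetical helper: applies every dictionary entry with a plain substring replace.
    static String replaceAll(String text, Map<String, String> dictionary) {
        String result = text;
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            // Each entry costs one full pass over the text,
            // so a 1000-entry dictionary means 1000 passes per document.
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("en", "english");
        // The "en" inside "Fluent" is replaced too, because nothing checks word boundaries.
        System.out.println(replaceAll("Engineer. English. Fluent en.", dictionary));
        // prints: Engineer. English. Fluenglisht english.
    }
}
```

The output shows both problems at once: the replacement corrupts "Fluent", and the cost grows with the number of dictionary entries.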
Issue #1 : Text Replace
Solution
1. Expand dictionaries into HashMap objects, e.g.
W → X
WE → X
WEX → Watson Explorer
2. Read text character by character and perform lookups over HashMap objects
– X → the combination of characters is part of an existing word
– null → no match
– Other → match
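The lookup scheme above can be sketched in Java as follows. This is an illustration under stated assumptions, not the code from the linked repository: the class name `TextReplacer`, the use of `Optional.empty()` as the "X" marker, the word-boundary rule (letters and digits count as word characters) and the longest-match policy are all choices made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class TextReplacer {

    // Key: a dictionary entry or one of its prefixes.
    // Optional.empty() plays the role of "X" (prefix only); a present value is the replacement.
    private final Map<String, Optional<String>> lookup = new HashMap<>();

    public TextReplacer(Map<String, String> dictionary) {
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            String key = entry.getKey();
            for (int i = 1; i < key.length(); i++) {
                // Do not overwrite a shorter entry that is also a prefix of this one.
                lookup.putIfAbsent(key.substring(0, i), Optional.empty());
            }
            lookup.put(key, Optional.of(entry.getValue()));
        }
    }

    public String replace(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int matchEnd = -1;
            String replacement = null;
            if (isBoundary(text, i - 1)) {           // only start matching at a word start
                for (int j = i + 1; j <= text.length(); j++) {
                    Optional<String> hit = lookup.get(text.substring(i, j));
                    if (hit == null) {
                        break;                        // "null": no entry starts with these characters
                    }
                    if (hit.isPresent() && isBoundary(text, j)) {
                        matchEnd = j;                 // remember the longest full match so far
                        replacement = hit.get();
                    }
                }
            }
            if (replacement != null) {
                out.append(replacement);
                i = matchEnd;
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True when the index is outside the text or points at a non-word character.
    private static boolean isBoundary(String text, int index) {
        return index < 0 || index >= text.length()
                || !Character.isLetterOrDigit(text.charAt(index));
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("WEX", "Watson Explorer");
        dictionary.put("Cap Gemini", "Capgemini");
        System.out.println(new TextReplacer(dictionary).replace("I work with WEX at Cap Gemini"));
        // prints: I work with Watson Explorer at Capgemini
    }
}
```

The dictionary is expanded once, so the per-document cost depends on the text length and the longest entry rather than on the number of entries. The sketch builds substrings for each lookup for readability; a production version would likely avoid those allocations.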
Issue #1 : Text Replace
Case 1:
• Have: “Engineer. English. Fluent en.”
• Want: “Engineer. English. Fluent english.”
Case 2:
• Have: “Cap Gemini consultant and Big Data developer with strong xp on Hadoop, mostly Hadoop FS. BI background (DataStage, Cognos, Oracle, DB2). Worked on multiple Watson technologies, including Watson API and WEX.”
• Dictionary, 875 entries including:
Cap Gemini → Capgemini
Hadoop FS → HDFS
DataStage → IBM DataStage
Cognos → IBM Cognos
DB2 → IBM DB2
WEX → Watson Explorer
Issue #2: Non-Serializable Objects
Sometimes, people need to use external libraries to perform specific transformations on objects
Example: perform NLP transformations with Apache OpenNLP
Problem:
• OpenNLP objects are not serializable → No broadcast
• OpenNLP objects take time to initialize → Never-ending closures
• We don’t want to convert OpenNLP source code (actually we tried)
Issue #2: Non-Serializable Objects
Solution: Initialize singletons and bind them to Spark tasks using Java ThreadLocal
[Code slide. Callouts: singleton class; bind singleton to task thread; will be called in closure]
Issue #2: Non-Serializable Objects
Then call the transformation in closures:
[Code slide. Callouts: retrieve holder from current task; get singleton object; get SimpleClass object]
Benefits: objects are initialized only 1x per task instead of 1x per RDD element
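The holder pattern described on these two slides can be sketched in plain Java as follows. Only the names `SimpleClass` and "holder" come from the slide callouts; the class bodies, the instance counter and the `runOnWorkerThread` helper are assumptions made for this illustration, and a plain Java thread stands in for the thread that runs a Spark task. The original code is in the linked repository and may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLocalHolderDemo {

    /** Stand-in for a costly, non-serializable library object (body is hypothetical). */
    static final class SimpleClass {
        static final AtomicInteger INSTANCES = new AtomicInteger();

        SimpleClass() {
            INSTANCES.incrementAndGet(); // counts how often the expensive setup runs
        }

        String transform(String s) {
            return s.toUpperCase(Locale.ROOT);
        }
    }

    /** Holder: one lazily created SimpleClass per thread, never serialized with the closure. */
    static final class SimpleClassHolder {
        private static final ThreadLocal<SimpleClass> LOCAL =
                ThreadLocal.withInitial(SimpleClass::new);

        private SimpleClassHolder() { }

        static SimpleClass get() {
            return LOCAL.get();
        }
    }

    /** What the body of a map closure would do for each element. */
    static String closureBody(String element) {
        return SimpleClassHolder.get().transform(element);
    }

    /** Assumption: a plain thread plays the role of the thread executing a Spark task. */
    static List<String> runOnWorkerThread(List<String> elements) throws InterruptedException {
        List<String> out = new ArrayList<>();
        Thread worker = new Thread(() -> {
            for (String element : elements) {
                out.add(closureBody(element));
            }
        });
        worker.start();
        worker.join(); // join makes the worker's writes to 'out' visible here
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runOnWorkerThread(List.of("a", "b", "c")));
        System.out.println("SimpleClass instances created: " + SimpleClass.INSTANCES.get());
    }
}
```

Because the closure only references a static holder, nothing non-serializable is captured, and the expensive object is built once for the thread rather than once per element.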
Issue #3: Unit Testing
One major step when moving from prototype to production is to define a proper testing strategy
Ways people run their tests (non-exhaustive):
1. They run everything on the cluster \o/
2. They use a local context
What we did:
• Use a local context
Problem: jobs grab content from HDFS using Oozie job.properties
Solution: set up a flexible configuration that operates seamlessly on the cluster and locally
Issue #3: Unit Testing
What it looks like:
[Code slide. Callouts: class applying a set of transformations; this grabs files on HDFS, so it can’t be used locally]
Issue #3: Unit Testing
How can we operate seamlessly with a remote or local job.properties?
Using this ConfigHelper class
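One way such a helper could look is sketched below. Only the name `ConfigHelper` and the job.properties file come from the slides; the `Opener` interface, the `LOCAL` opener, the constructor signature and the `inputDir` key used in the example are assumptions for this illustration. On a cluster, an opener that reads from HDFS would be supplied instead; it is left out here so the sketch has no Hadoop dependency.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public final class ConfigHelper {

    /** Hypothetical abstraction: turns a location into a stream (local file, HDFS, ...). */
    @FunctionalInterface
    public interface Opener {
        InputStream open(String location) throws IOException;
    }

    /** Opener for local files, suitable for unit tests with a local context. */
    public static final Opener LOCAL = location -> Files.newInputStream(Paths.get(location));

    private final Properties properties = new Properties();

    public ConfigHelper(String location, Opener opener) throws IOException {
        try (InputStream in = opener.open(location)) {
            properties.load(in);
        }
    }

    /** Returns the value for a key and fails fast when the key is missing. */
    public String get(String key) {
        String value = properties.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        // Example with a temporary local job.properties; "inputDir" is a made-up key.
        Path file = Files.createTempFile("job", ".properties");
        Files.write(file, "inputDir=/tmp/in\n".getBytes(java.nio.charset.StandardCharsets.UTF_8));
        ConfigHelper conf = new ConfigHelper(file.toString(), LOCAL);
        System.out.println(conf.get("inputDir"));
    }
}
```

The job code only ever calls `get`, so the same transformation classes can run against a local file in tests and against the cluster configuration in production.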
Issue #3: Unit Testing
[Code slide. Callouts: call conf; grab on FS]
Issue #3: Unit Testing
Finally, our test:
Issue #4: Wall of Code
Object-oriented modeling doesn’t map well onto Spark
As a result, we tend to write huge functions with tons of transformations → People Analytics V0.01alpha: 1 class
How we managed this: we grouped consistent sets of transformations into functional classes
[Code slide. Callouts: functional class; consecutive operations grouped in the run method]
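A rough sketch of what "functional classes" with a run method might look like is given below. Everything here is hypothetical: the `Transformation` interface, the `NormalizeText` and `CountTokens` classes, and the use of Java streams over a `List<String>` as a stand-in for RDD transformations. It only illustrates the idea of splitting one large job into small classes that each group a consistent set of operations.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class FunctionalClassesDemo {

    /** Hypothetical contract: each functional class exposes a single run method. */
    interface Transformation<I, O> {
        O run(I input);
    }

    /** Groups the cleaning steps: trim, drop empty lines, lowercase. */
    static final class NormalizeText implements Transformation<List<String>, List<String>> {
        @Override
        public List<String> run(List<String> lines) {
            return lines.stream()
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .map(line -> line.toLowerCase(Locale.ROOT))
                    .collect(Collectors.toList());
        }
    }

    /** Groups the counting steps: split into tokens, then count each token. */
    static final class CountTokens implements Transformation<List<String>, Map<String, Long>> {
        @Override
        public Map<String, Long> run(List<String> lines) {
            return lines.stream()
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))
                    .collect(Collectors.groupingBy(
                            Function.identity(), TreeMap::new, Collectors.counting()));
        }
    }

    /** The job itself shrinks to a readable chain of run calls. */
    static Map<String, Long> job(List<String> rawLines) {
        return new CountTokens().run(new NormalizeText().run(rawLines));
    }

    public static void main(String[] args) {
        System.out.println(job(List.of(" Spark  ", "", "spark Hadoop")));
        // prints: {hadoop=1, spark=2}
    }
}
```

Each class can then be unit tested on its own with a small in-memory input, which ties in with the local-context testing approach from Issue #3.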
Thank You
Credits:
jean-baptiste.martin@capgemini.com
jerome.delvigne@capgemini.com
Code available at: https://github.com/jeanbmar/meetup-spark
The information contained in this presentation is proprietary. Copyright © 2015 Capgemini. All rights reserved.
Rightshore® is a trademark belonging to Capgemini.
www.capgemini.com
About Capgemini
With 180,000 people in over 40 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2014 global revenues of EUR 10.573 billion.
Together with its clients, Capgemini creates and delivers business, technology and digital solutions that fit their needs, enabling them to achieve innovation and competitiveness. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience™, and draws on Rightshore®, its worldwide delivery model.
Learn more about us at www.capgemini.com.