![Page 1: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/1.jpg)
Apache Spark and Bluemix Meetup
Jean-Baptiste Martin
July 6, 2016
Project industrialization with Apache Spark
![Page 2: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/2.jpg)
Copyright © Capgemini 2015. All Rights Reserved
Who am I
Jean-Baptiste Martin
• Managing Consultant at Capgemini
• Background: technical
• Big Data Analytics for 2 years
• Product manager, People Analytics
• Founder at Top Notch
![Page 3: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/3.jpg)
Project industrialization with Apache Spark
1. Spark in People Analytics
2. Team Organization
3. Issue #1: Text Replace
4. Issue #2: Non-Serializable Objects
5. Issue #3: Unit Testing
6. Issue #4: Wall of Code
Code available at:
https://github.com/jeanbmar/meetup-spark
![Page 4: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/4.jpg)
Spark in People Analytics
What is People Analytics?
![Page 5: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/5.jpg)
Spark in People Analytics
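[Architecture diagram. Labels: Structured sources (SGBD, i.e. DBMS, and CSV files); Unstructured sources; Employees, Candidates, Jobs; HDFS Store; ODPi; HDFS Access; Analytics Engine; Data Reconciliation; Watson Explorer; WEX Engine; Data Indexing; WEX AppBuilder; Visualization. Flow steps are numbered 1 to 4.]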
![Page 6: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/6.jpg)
Team Organization
1. Prototyping:
• Technologies: Hadoop, Java, R, Watson Explorer
• Team profiles: 4 big data developers (Java), 1 data scientist, 1 data analyst
2. Industrialization:
• Technologies: Hadoop, Java and Scala, Spark, Watson Explorer
• Team profiles:
– 2 data scientists
– 2 software developers
– 1 sys admin
– 2 web developers
3. All along:
• Strong support from IBM (expertise, implementation, go-to-market)
![Page 7: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/7.jpg)
Issues we faced
Issue #1: Text Replace
Issue #2: Non-Serializable Objects
Issue #3: Unit Testing
Issue #4: Wall of Code
![Page 8: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/8.jpg)
Issue #1: Text Replace
Scanning text and replacing terms is a common operation in natural language processing.
Input: « I work with WEX at Cap Gemini »
+ Dictionary:
Cap Gemini → Capgemini
WEX → Watson Explorer
= Output: « I work with Watson Explorer at Capgemini »
![Page 9: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/9.jpg)
Issue #1: Text Replace
Issues arise when:
• There are a lot of documents to process
• Dictionaries (synonyms, stopwords, protected words, …) contain 1000+ entries
Traditional implementations:
• Loop over dictionary entries → low performance and/or incorrect results
• Regular expressions → low performance
What we want: read the text once and perform transformations on the fly
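To illustrate why looping over dictionary entries can give incorrect results, here is a minimal Java sketch. The class name `NaiveReplace` and the single entry `en → english` are illustrative assumptions, not code from the talk. Each dictionary entry triggers a full scan of the text, and `String.replace` also rewrites matches that sit inside longer words.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the "loop over dictionary entries" approach.
class NaiveReplace {

    // Applies every dictionary entry in turn: one full pass over the text per entry.
    static String replaceAll(String text, Map<String, String> dictionary) {
        String result = text;
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            // String.replace has no notion of word boundaries.
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("en", "english");
        // "en" is also replaced inside "Fluent".
        System.out.println(replaceAll("Fluent en.", dictionary));
    }
}
```

With that entry, `replaceAll("Fluent en.", dictionary)` returns `Fluenglisht english.`, because the `en` inside `Fluent` is replaced as well.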
![Page 10: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/10.jpg)
Issue #1: Text Replace
Solution
1. Expand dictionaries into HashMap objects, e.g.
W → X
WE → X
WEX → Watson Explorer
2. Read the text character by character and perform lookups over the HashMap objects:
– X → the combination of characters is part of an existing word
– null → no match
– Other → match
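The two steps above can be sketched in plain Java as follows. This is a minimal sketch under stated assumptions, not the code from the talk: the class name `PrefixReplacer` is hypothetical, `Optional.empty()` plays the role of the `X` marker, the longest match wins, and matches are only accepted on word boundaries (letters and digits count as word characters). The boundary rule is an assumption added so that short entries are not replaced inside longer words.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the expanded-dictionary lookup.
class PrefixReplacer {

    // Optional.empty() = prefix of a longer entry ("X"), Optional.of(v) = full match,
    // absent key (null on lookup) = no match.
    private final Map<String, Optional<String>> table = new HashMap<>();

    PrefixReplacer(Map<String, String> dictionary) {
        for (Map.Entry<String, String> e : dictionary.entrySet()) {
            String key = e.getKey();
            // Register every strict prefix of the key without overwriting full matches.
            for (int len = 1; len < key.length(); len++) {
                table.putIfAbsent(key.substring(0, len), Optional.empty());
            }
            table.put(key, Optional.of(e.getValue()));
        }
    }

    String replace(String text) {
        StringBuilder out = new StringBuilder();
        int n = text.length();
        int i = 0;
        while (i < n) {
            int bestEnd = -1;
            String bestValue = null;
            if (isBoundary(text, i - 1)) {
                // Extend the candidate while it is still a known prefix or entry.
                for (int j = i + 1; j <= n; j++) {
                    Optional<String> hit = table.get(text.substring(i, j));
                    if (hit == null) {
                        break; // no dictionary entry starts with these characters
                    }
                    if (hit.isPresent() && isBoundary(text, j)) {
                        bestEnd = j; // remember the longest full match so far
                        bestValue = hit.get();
                    }
                }
            }
            if (bestEnd > 0) {
                out.append(bestValue);
                i = bestEnd;
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True when index is outside the text or points at a non-word character.
    private static boolean isBoundary(String text, int index) {
        return index < 0 || index >= text.length()
                || !Character.isLetterOrDigit(text.charAt(index));
    }
}
```

With the entries `Cap Gemini → Capgemini` and `WEX → Watson Explorer`, `replace("I work with WEX at Cap Gemini")` returns `I work with Watson Explorer at Capgemini`. The sketch builds a substring at each step for readability; a production version would likely avoid those allocations.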
![Page 11: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/11.jpg)
Issue #1: Text Replace
Case 1:
• Have: “Engineer. English. Fluent en.”
• Want: “Engineer. English. Fluent english.”
Case 2:
• Have: “Cap Gemini consultant and Big Data developer with strong xp on Hadoop, mostly Hadoop FS. BI background (DataStage, Cognos, Oracle, DB2). Worked on multiple Watson technologies, including Watson API and WEX.”
• Dictionary, 875 entries including:
Cap Gemini → Capgemini
Hadoop FS → HDFS
DataStage → IBM DataStage
Cognos → IBM Cognos
DB2 → IBM DB2
WEX → Watson Explorer
![Page 12: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/12.jpg)
Issue #2: Non-Serializable Objects
Sometimes, people need to use external libraries to perform specific transformations on objects
Example: perform NLP transformations with Apache OpenNLP
Problem:
• OpenNLP objects are not serializable → no broadcast
• OpenNLP objects take time to initialize → never-ending closures
• We don’t want to convert OpenNLP source code (actually we tried)
![Page 13: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/13.jpg)
Issue #2: Non-Serializable Objects
Solution: Initialize singletons and bind them to Spark tasks using Java ThreadLocal
Singleton class
Bind singleton to task thread
Will be called in closure
![Page 14: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/14.jpg)
Issue #2: Non-Serializable Objects
Then call transformation in closures:
Benefits: objects are initialized only 1x per task instead of 1x per RDD element
Retrieve holder from current task
Get singleton object
Get SimpleClass object
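The ThreadLocal pattern can be sketched in plain Java as follows. All names here (`ExpensiveTokenizer`, `TokenizerHolder`, `TokenizeClosure`) are hypothetical stand-ins, not the classes from the talk, and `ExpensiveTokenizer` only imitates a costly, non-serializable library object such as an OpenNLP component. The sketch assumes the closure body runs on the thread that executes the task, so the instance is created on first use in that thread and reused for later elements.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive, non-serializable library object.
class ExpensiveTokenizer {
    static final AtomicInteger INSTANCES = new AtomicInteger();

    ExpensiveTokenizer() {
        INSTANCES.incrementAndGet(); // counts how often the init cost is paid
    }

    String[] tokenize(String sentence) {
        return sentence.trim().split("\\s+");
    }
}

// Holds one tokenizer per thread. The holder itself is never serialized.
class TokenizerHolder {
    private static final ThreadLocal<ExpensiveTokenizer> LOCAL =
            ThreadLocal.withInitial(ExpensiveTokenizer::new);

    private TokenizerHolder() {
    }

    // Lazily creates the tokenizer on first call from the current thread.
    static ExpensiveTokenizer get() {
        return LOCAL.get();
    }
}

// Body of what would be a map() closure: only the String element is captured.
class TokenizeClosure {
    static int countTokens(String sentence) {
        return TokenizerHolder.get().tokenize(sentence).length;
    }
}
```

Because the closure only references the static holder, nothing non-serializable is captured. Calling `countTokens` repeatedly from the same thread constructs `ExpensiveTokenizer` once.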
![Page 15: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/15.jpg)
Issue #3: Unit Testing
One major step when moving from prototype to production is to define a proper testing strategy
Ways people run their tests (non-exhaustive):
1. They run everything on the cluster \o/
2. They use a local context
What we did:
• Use a local context
Problem: jobs grab content from HDFS using Oozie job.properties
Solution: set up a flexible configuration that operates seamlessly both on the cluster and locally
![Page 16: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/16.jpg)
Issue #3: Unit Testing
What it looks like:
Class applying a set of transformations
This grabs files on HDFS, so it can’t be used locally
![Page 17: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/17.jpg)
Issue #3: Unit Testing
How can we operate seamlessly with a remote or local job.properties?
Using this ConfigHelper class
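One way such a helper could look is sketched below in plain Java. Only the name `ConfigHelper` comes from the slides. Its contents here are assumptions: the `StreamOpener` interface, the `hdfs://` prefix check and the `inputDir` property are hypothetical. On a cluster, the opener would be backed by an HDFS client; in a local test, the path simply points to a file on disk.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical sketch: loads job.properties from a local path or a remote location.
class ConfigHelper {

    // Opens a remote location. Cluster code would plug in an HDFS-backed implementation.
    interface StreamOpener {
        InputStream open(String location) throws IOException;
    }

    private final StreamOpener remoteOpener;

    ConfigHelper(StreamOpener remoteOpener) {
        this.remoteOpener = remoteOpener;
    }

    // Same call on cluster and locally: only the location string differs.
    Properties load(String location) throws IOException {
        boolean remote = location.startsWith("hdfs://"); // assumed convention
        try (InputStream in = remote
                ? remoteOpener.open(location)
                : Files.newInputStream(Paths.get(location))) {
            Properties props = new Properties();
            props.load(in);
            return props;
        }
    }
}
```

A unit test can then call `load` with a path to a local job.properties, while the job on the cluster passes an `hdfs://` location, and the transformation classes read their settings from the returned `Properties` in both cases.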
![Page 18: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/18.jpg)
Issue #3: Unit Testing
Call conf
Grab on FS
![Page 19: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/19.jpg)
Issue #3: Unit Testing
Finally, our test:
![Page 20: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/20.jpg)
Issue #4: Wall of Code
Object-oriented modeling doesn’t apply well to Spark
As a result, we tend to write huge functions with tons of transformations
People Analytics V0.01alpha: 1 class
How we managed this: we grouped consistent sets of transformations into functional classes
Functional class
Class consecutive operations in run method
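As a rough illustration of a functional class, here is a minimal plain-Java sketch. The class name `CleanSkills` and its transformations are hypothetical, and a `List<String>` stands in for a Spark RDD so that the example runs without a cluster. The idea is that one class groups one consistent set of transformations and exposes them through a single `run` method.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical functional class: one consistent set of transformations per class.
class CleanSkills {

    // Applies the consecutive operations in order.
    // List<String> is a stand-in for an RDD in this sketch.
    List<String> run(List<String> skills) {
        return skills.stream()
                .map(String::trim)                      // strip surrounding whitespace
                .filter(s -> !s.isEmpty())              // drop blank entries
                .map(s -> s.toLowerCase(Locale.ROOT))   // normalize case
                .distinct()                             // remove duplicates, keep order
                .collect(Collectors.toList());
    }
}
```

A job then becomes a short sequence of `run` calls on such classes, and each class can be unit tested on its own with a small in-memory input.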
![Page 21: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/21.jpg)
Thank You
Credits:
[email protected]@capgemini.com
Code available at: https://github.com/jeanbmar/meetup-spark
![Page 22: Capgemini - Project industrialization with apache spark](https://reader036.vdocument.in/reader036/viewer/2022062523/5870be151a28ab0b4a8b67d1/html5/thumbnails/22.jpg)