Big Data Technologies and Physics Analysis with Apache Spark
Inverted CERN School of Computing 2019
4-6 March 2019
Evangelos Motesnitsalis
Learning Goals
Evangelos Motesnitsalis - inverted CERN School of Computing 2019
Important big data concepts
Main architecture characteristics of
big data technologies
Popular big data frameworks such as
Apache Hadoop and Apache Spark
Basic physics analysis concepts and frameworks
Connecting tools between big data and High Energy
Physics
Example of physics analysis with Apache Spark
1. Introduction to Big Data
2. Big Data Systems Architecture:
• Architecture Principles
• Distributed Filesystem
• Cluster Manager
• Processing Framework
3. Popular Big Data Frameworks:
• Apache Hadoop
• Hadoop Distributed Filesystem (HDFS)
• Apache YARN
• Hadoop MapReduce
• Apache Spark
4. Standard Physics Analysis Procedures
5. Big Data Tools and Approaches for HEP
6. Example Workloads
7. Projects beyond Physics Analysis
8. Conclusions
Contents
Who am I?
I am…
Technical Coordinator for 2 months
(still Data Engineer at heart)
Research Fellow in 2017 – today
Technical Student in 2014 – 2015
Summer Student in 2013
5+ years of Big Data experience
Former Big Data Devops Support Engineer
and Escalation Engineer @AWS
MSc Distributed Systems @Imperial College London
Studies at King’s College London and Aristotle
University of Thessaloniki
Introduction to Big Data
Introduction to Big Data
Large Datasets
Strategies to handle
Large Datasets
Big Data
Introduction to Big Data
A bit of history…
18,000 BC: earliest signs of humans
storing and analyzing data
(Ishango bone)
2004: «MapReduce: Simplified Data
Processing on Large Clusters» by
2005: The term “Big Data” emerges
2006: Introduction of Apache Hadoop
2010: «The amount of data from the beginning
of human civilization to 2003 is generated
every 2 days» E. Schmidt
Today: ~15 ZB per year
Big Data Systems Architecture
Architecture Overview
Distributed Filesystem
Cluster Resource Manager
Distributed Processing Frameworks
Top Level Abstractions
Cluster Node
Cluster Node
Cluster Node
Cluster Node
Cluster Node
Cluster Node
Architecture Principles
Resource Pooling: combining
storage space, CPU and memory
for different use cases and
frameworks
High Availability: fault tolerance for
source code execution and hardware
components
Scalability: horizontal with new
machines, vertical with bigger
machines
Data Persistence: high replication factor and
automatic restoration
Data Ingestion: ability to import raw data,
RDBMS data, etc.
Parallelization: follow programming patterns that allow batch processing
Distributed File System
A software framework that allows
users to access and process
distributed data
Same semantics and interfaces as
Local Filesystems
Transparency everywhere: access, location, concurrency, failure, migration
Heterogeneity and Scalability
Usually centralized yet highly available metadata
Typical examples: Hadoop Distributed File System (HDFS), EOS, Windows DFS, etc.
Cluster Manager
A software framework that runs
in a distributed fashion on cluster nodes
Multiple software components,
multiple execution locations
Resource allocation and service
configurations
High Availability, Scalability, and Resource Pooling
are usually handled by Cluster Managers
Usually follows master – slave
architecture principles
Typical open source examples: Apache YARN, Apache Mesos, Kubernetes, etc.
Distributed Processing Framework
A software framework that allows
distributed computations
Usually follows specific programming
models (e.g. MapReduce)
Code is automatically deployed in multiple locations
Heterogeneity and Scalability
Most distributed processing frameworks are highly
interconnected (especially in the Hadoop Ecosystem)
Typical examples: Hadoop MapReduce, Apache Spark, etc.
Big Data Frameworks
HDFS: Hadoop Distributed File System
YARN: Cluster resource manager
MapReduce
Hive: SQL
Pig: Scripting
Sqoop: Data exchange with RDBMS
Flume: Data collector
Oozie: Workflow manager
Zookeeper: Coordination
Impala: SQL
Spark: Large scale data processing
Kafka: Data streaming
HBase: NoSQL columnar store
Hue: Web UI for Hadoop
ElasticSearch: Full-text search, real-time indexing
The Big Data Ecosystem
The Hadoop Ecosystem
Apache Hadoop
Apache Hadoop
A Framework for Large-scale Distributed Data Processing
Apache Hadoop
Hadoop Filesystem
(HDFS)
Apache Hadoop YARN
Hadoop MapReduce
4 Vs: Volume, Variety, Velocity, Veracity
Runs on commodity hardware and
based on Data Locality concepts
Open source
The Hadoop Ecosystem
Hadoop Distributed File System
The Filesystem of Hadoop
Fault tolerant – multiple replicas
Scalable – designed for high throughput
Files cannot be modified – “Write Once – Read Many”
Rack awareness
Minimal data motion and rebalance
Consists of:
1 or 2 NameNodes (2 in HA)
1 DataNode per cluster node
Hadoop Distributed File System
The Filesystem of Hadoop
Hadoop Distributed File System
1) One 1176 MB file to be stored on HDFS
2) Splitting into 256 MB blocks: B1 (256 MB), B2 (256 MB), B3 (256 MB), B4 (256 MB), B5 (152 MB)
3) Ask the NameNode where to put them
4) Blocks with their replicas (3 by default) are distributed across the DataNodes
[Diagram: replicas R1 to R3 of each block B1 to B5 spread over DataNode1, DataNode2, DataNode3 and DataNode4]
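The four steps above can be sketched in a few lines of Python. This is only a toy model: the function names and the round-robin placement are assumptions made for illustration, whereas the real NameNode applies a rack-aware placement policy.

```python
BLOCK_SIZE_MB = 256   # block size used in the slide's example
REPLICATION = 3       # default HDFS replication factor, as stated on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block; only the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Hypothetical round-robin placement (NOT the real NameNode policy).

    Each replica of a block is put on a different DataNode.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for b in range(num_blocks):
        nodes = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        placement[f"B{b + 1}"] = nodes
    return placement


blocks = split_into_blocks(1176)
layout = place_replicas(
    len(blocks), ["DataNode1", "DataNode2", "DataNode3", "DataNode4"]
)
print(blocks)
for block, nodes in layout.items():
    print(block, nodes)
```

Losing any single DataNode in this layout still leaves at least two replicas of every block, which is the point of the replication step.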
Hadoop Distributed File System
Interacting with HDFS
hdfs dfs -ls                     # listing home dir
hdfs dfs -ls /user               # listing user dir
…
hdfs dfs -du -h /user            # space used
hdfs dfs -mkdir newdir           # creating dir
hdfs dfs -put myfile.csv .       # storing a file on HDFS
hdfs dfs -get myfile.csv .       # getting a file from HDFS
Hadoop Distributed File System
Data Flow in HDFS
DATA SOURCE
1. Data Ingestion
1a. Reprocess the data
2. Analytic processing
2a. Visualize (Shell/Notebook)
2b. Low latency store
3. Publish (Graphical UI)
The Hadoop Ecosystem
Apache Hadoop YARN
Yet Another Resource Negotiator
Manages cluster computing
resources in Hadoop
Creates the environment for
Hadoop applications and deploys
them
Negotiates with the applications the CPU and Memory resources that will be assigned to them
Utilizes different user queues and different
schedulers: Fair, Capacity, and FIFO
Each application relies on an Application Master
Consists of:
1 Resource Manager in the master node
1 Node Manager per cluster node
Apache Hadoop YARN
Yet Another Resource Negotiator
Apache Hadoop YARN
Interacting with YARN
yarn application -list           # listing submitted apps
yarn application -status <id>    # details about an app
yarn application -kill <id>      # kill a running app
The Hadoop Ecosystem
Hadoop MapReduce
The First Batch Processing Framework
MapReduce is a programming
model for parallel processing
Executes Java code in parallel
Contains two stages: Map & Reduce
Optimized for local data access
Good for huge data sets and offline analysis
but does not fit every use case
Time consuming and not interactive
Hadoop MapReduce
Hello World – aka « Wordcount »
//MAP method body
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
//REDUCER method body
reduce(String key, Iterator values):
    // key: word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
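The pseudocode above can be tried out locally with a single-process sketch in Python. It is not Hadoop code: the `shuffle` step, which the framework performs between the two stages, is written out by hand, and all function names and sample documents are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(doc_name, contents):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    return [(word, 1) for word in contents.split()]


def shuffle(pairs):
    """Shuffle: group all intermediate values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the list of counts for one word."""
    return word, sum(counts)


def word_count(documents):
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_phase(name, contents))
    return dict(reduce_phase(w, c) for w, c in shuffle(intermediate).items())


# Hypothetical input documents, keyed by document name.
docs = {"a.txt": "big data big clusters", "b.txt": "big physics data"}
result = word_count(docs)
print(result)
```

In a real cluster the map calls run in parallel close to the data blocks, and each reducer receives only the keys assigned to it.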
Hadoop MapReduce
Hello World – aka « Wordcount »
The Hadoop Ecosystem
Apache Spark
Apache Spark
Apache Spark is an open source
cluster computing framework
Consists of multiple components:
Spark SQL
Spark MLlib
Spark GraphX
Spark Structured Streaming
APIs in Python, Scala, Java, R
Compatible with multiple cluster managers:
Apache YARN
Apache Mesos
Kubernetes
Standalone
Brought fast, iterative, near real-time
processing with no strict programming
model – everything that MapReduce
lacked.
Multiple File Formats and Filesystem
Compatibility
Overview
Apache Spark
Spark supports complex
processing patterns based on DAG
Transformations are lazy: they only get
executed when we call an action
Staged data are kept in memory
RDD: Resilient Distributed Dataset, the
basic abstraction of Spark, is a collection of
partitioned data with primitive values
Directed Acyclic Graph: A finite
directed graph with no directed
cycles
Spark supports two types of
operations: transformations and actions
Basic Concepts of Apache Spark
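The difference between lazy transformations and actions can be illustrated without a cluster. The class below is a toy stand-in written for this purpose; `LazyDataset` and its methods are hypothetical and only mimic the behaviour described above, they are not Spark's API or implementation.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)   # pending transformations, in order

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + (("filter", pred),))

    # Actions: trigger execution of the recorded chain.
    def collect(self):
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())


calls = []


def square(x):
    calls.append(x)   # side effect lets us observe when the work happens
    return x * x


ds = LazyDataset(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
pending = len(calls)      # no call to square() has happened yet
result = ds.collect()     # the action runs map, then filter
print(pending, result)
```

The chain of recorded operations plays the role of the DAG: nothing is computed until an action asks for a result.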
Apache Spark
import scala.math.random
val slices = 3
val n = 100000 * slices
val rdd = sc.parallelize(1 to n, slices)
val sample = rdd.map { i =>
  val x = random
  val y = random
  if (x*x + y*y < 1) 1 else 0
}
val count = sample.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
Driver and Executors
[Diagram: the Driver, holding the SparkContext, connects through the Cluster Resource Manager to one Executor on each of Node 1, Node 2 and Node 3 of the cluster]
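The same π estimate can be followed step by step in plain Python. This is a single-process sketch: the list of partial counts stands in for the work done by the executors, the final sum for the reduce gathered by the driver, and the per-partition seeds are an assumption added only to make the run repeatable.

```python
import random
from functools import reduce


def count_hits(partition, seed):
    """'Executor' work: count random points that fall inside the unit quarter-circle."""
    rng = random.Random(seed)   # per-partition generator, seeded for repeatability
    hits = 0
    for _ in partition:
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            hits += 1
    return hits


def estimate_pi(n_per_slice=100_000, slices=3):
    n = n_per_slice * slices
    # 'Driver' side: split the index range 1..n into `slices` partitions.
    partitions = [
        range(i * n_per_slice + 1, (i + 1) * n_per_slice + 1) for i in range(slices)
    ]
    partial = [count_hits(p, seed=i) for i, p in enumerate(partitions)]   # map
    total = reduce(lambda a, b: a + b, partial)                           # reduce
    return 4.0 * total / n


pi_estimate = estimate_pi()
print("Pi is roughly", pi_estimate)
```

Each partition is independent of the others, which is what lets Spark hand the three slices to three different executors.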
Apache Spark
Hello World – aka « Wordcount »
text_file = sc.textFile("/user/emotes/datasets/")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("/user/emotes/outputfolder/")
// defining dataframe with schema from parquet files
val df = spark.read.parquet("/user/emotes/datasets/")
// counting the number of pre-filtered rows with DF API
df.filter($"l1username".contains("emotes")).count
// counting the number of pre-filtered rows with SQL
df.createOrReplaceTempView("my_table")
spark.sql("SELECT count(*) FROM my_table WHERE l1username LIKE '%emotes%'").show
SQL on Spark
Apache Spark
Hello World – aka « Wordcount »
Standard Physics Analysis Procedures
Physics Analysis is typically done with the ROOT Framework which uses physics data that are saved in
ROOT format files.
At CERN these files are stored within the EOS Storage Service.
HEP Data Processing
ROOT Data Analysis Framework
A modular scientific software framework which provides all the functionalities needed to deal with big data processing, statistical analysis, visualization and file storage.
EOS Service
A disk-based, low-latency storage service with a highly-scalable hierarchical namespace, which enables data access through the XRootD protocol.
WLCG
The Worldwide LHC Computing Grid
(WLCG) is a global collaboration of
more than 170 institutions
in 42 countries which provide
resources to store, distribute and
analyse the PBs of LHC Data
Worldwide LHC Computing Grid
LHC Data Flow at CERN
Event Filtering
Raw Data Reconstruction
Reco
Processing, skimming
Analysis Formats
Event Selection
Results
Big Data Tools for High Energy Physics
Bridging the Gap
EOS Storage Service
Physics Analysis is typically done with the ROOT Framework which uses physics data that are saved in
ROOT format files.
At CERN these files are stored within the EOS Storage Service.
1. access data 2. read format 3. visualize
Different Approaches for Physics Analysis with Spark
1: Apache Spark with Hadoop-XRootD Connector, spark-root, SWAN
2: Apache Spark with ROOT RDataFrame, SWAN
The ‘Hadoop – XRootD Connector’ Library
Connecting XRootD-based Storage Systems with Hadoop and Spark
A Java library that connects to the XRootD client via JNI
Reads files from the EOS Storage
Service directly
Makes all Physics Data available
for processing with Spark
Open Source: https://github.com/cerndb/hadoop-xrootd
Supports Kerberos and GRID
Certificate Authentication
[Diagram: EOS Storage Service ↔ XRootD Client (C++) ↔ JNI ↔ Hadoop-XRootD Connector (Java) ↔ Hadoop HDFS API]
The ‘Spark – Root’ Library
A Scala library which implements
DataSource for Apache Spark
Spark can read ROOT TTrees and
infer their schema
ROOT files are imported to
Spark Dataframes/Datasets/RDDs
Developed by DIANA-HEP in
collaboration with CERN openlab
Open Source: https://github.com/diana-hep/spark-root/
ROOT RDataFrame
Implemented in C++,
with a Python
interface
Exploratory work to
parallelize RDataFrame
computations with
multiple backends
Spark Dataframes tailored
for ROOT and HEP
SWAN Service and Spark Integration
Hosted Jupyter Notebooks for Data Analysis
Web-based interactive analysis
using PySpark in the cloud
Direct access to EOS and HDFS
Fully integrated with IT Spark and
Hadoop Clusters
Cover the need for user-friendly
environments that allow collaboration
and sharing between researchers
Combines code, equations, text and
visualisations
No need to install software
https://swan.web.cern.ch/
Overview
[Diagram: components including the EOS Storage Service, linked by edges labelled "runs on", "access data from" and "accessed by"]
Physics Analysis with Apache Spark
The Use Case of the
CMS Data Reduction Facility
CMS Data Reduction and Analysis Facility
Performing Physics Analysis and Data Reduction with Apache Spark
Investigate new ways to analyse physics data and improve resource utilization and time-to-physics
It offers an alternative for ‘ad-hoc’ data reduction
for each research group
Bridge the gap between High Energy Physics and Big
Data communities
Data Reduction refers to event
selection and feature preparation based
on potentially complicated queries
We now have fully functioning
Analysis and Reduction examples
tested over CMS Open Data
Main goal was to be able to reduce 1 PB of data in
5 hours or less
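To get a feeling for what this goal implies, the required average throughput can be worked out directly. The sketch below assumes a decimal petabyte (10^15 bytes), since the slide does not say whether PB or PiB is meant; the function name is hypothetical.

```python
def required_throughput_gb_per_s(data_bytes, hours):
    """Average throughput in GB/s (1 GB = 10**9 bytes) needed to process data_bytes in `hours`."""
    seconds = hours * 3600
    return data_bytes / seconds / 1e9


PB = 10**15   # decimal petabyte; an assumption, the slide does not specify PB vs PiB
rate = required_throughput_gb_per_s(1 * PB, 5)
print(f"{rate:.1f} GB/s")   # prints "55.6 GB/s"
```

Sustaining a rate of this order is only realistic when reading and processing are spread over many nodes in parallel.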
![Page 56: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/56.jpg)
56
CMS Data Reduction and Analysis Facility
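Data reduction, in the sense of event selection followed by feature preparation, can be illustrated with a small Spark-free sketch. Everything here is an assumption made for illustration: the `Muon`, `Event`, and `Reduced` types, the cut values, and the chosen features are hypothetical and are not the facility's actual schema or selection.

```scala
// Spark-free sketch of "data reduction" = event selection + feature preparation.
// All types, cuts and features below are hypothetical.
object ReductionSketch {
  final case class Muon(pt: Double, eta: Double, charge: Int)
  final case class Event(run: Long, id: Long, muons: Vector[Muon])
  final case class Reduced(run: Long, id: Long, nMuons: Int, leadingPt: Double)

  // Event selection: at least two muons above a pT threshold and within |eta| < 2.4.
  def select(e: Event, minPt: Double = 20.0): Boolean =
    e.muons.count(m => m.pt > minPt && math.abs(m.eta) < 2.4) >= 2

  // Feature preparation: keep only a few derived columns per selected event.
  // Safe to call max here because selected events have at least two muons.
  def features(e: Event): Reduced =
    Reduced(e.run, e.id, e.muons.length, e.muons.map(_.pt).max)

  // The reduction is a filter followed by a map, the same shape a Spark job would have.
  def reduce(events: Seq[Event]): Seq[Reduced] =
    events.filter(select(_)).map(features)
}
```

The output is much smaller than the input, both in number of events (selection) and in columns per event (feature preparation).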
![Page 57: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/57.jpg)
57
Examples
![Page 58: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/58.jpg)
58
Example on Physics Analysis
Test Workload Architecture and File-Task Mapping
[Diagram: ROOT files stored on the EOS Storage Service are read by a Spark Cluster made up of a Driver and several Executors; each Executor runs tasks (Task 1, Task 2, ..., Task x-1, Task x) that execute the {Physics Analysis Code}]
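The file-task mapping idea can be pictured with a toy model. The sketch below assumes one task per ROOT file and a round-robin spread of tasks over executors. Both are simplifying assumptions for illustration: the slide does not state the actual policy, and Spark schedules tasks dynamically. The `Task` type and `assign` function are hypothetical names.

```scala
// Toy model of a file-task mapping.
// Assumptions: one task per ROOT file, tasks spread round-robin over executors.
object FileTaskMapping {
  final case class Task(id: Int, file: String, executor: Int)

  def assign(files: Seq[String], executors: Int): Seq[Task] = {
    require(executors > 0, "need at least one executor")
    // Task ids start at 1 to match the "Task 1 ... Task x" labels on the slide.
    files.zipWithIndex.map { case (file, i) => Task(i + 1, file, i % executors) }
  }
}
```

For example, `FileTaskMapping.assign(Seq("a.root", "b.root", "c.root", "d.root"), 2)` places two tasks on each of the two executors.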
![Page 59: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/59.jpg)
59
Example on Physics Analysis with SWAN
![Page 60: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/60.jpg)
60
Example on Physics Analysis with SWAN
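val h = df
  .filter(_.muons.length >= 2)  // keep events with at least two muons
  .flatMap { e: Event =>
    // build a di-muon candidate for every (i, j) index pair of the event's muons
    for (i <- 0 until e.muons.length; j <- 0 until e.muons.length)
      yield buildDiCandidate(e.muons(i), e.muons(j))
  }
  .rdd
  .aggregate(emptyDiCandidate)(new Increment, new Combine)  // fill within each partition, then merge partial results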
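The slide's snippet relies on definitions that are not shown (`Event`, `buildDiCandidate`, `emptyDiCandidate`, `Increment`, `Combine`). As a rough, Spark-free illustration of the same filter, flatMap, aggregate pattern, the sketch below fills a di-muon invariant-mass histogram using plain Scala collections. All types, the binning, and the helper names are assumptions. Unlike the slide, this version pairs each muon only with later muons (`j > i`), so each unordered pair is counted once.

```scala
// Spark-free sketch of the filter -> flatMap -> aggregate pattern.
// Muon, Event, the binning and all helper names are illustrative assumptions.
object DimuonSketch {
  final case class Muon(pt: Double, eta: Double, phi: Double, mass: Double)
  final case class Event(muons: Vector[Muon])

  // Invariant mass of a muon pair from (pt, eta, phi, mass) kinematics.
  def invariantMass(a: Muon, b: Muon): Double = {
    def p4(m: Muon): (Double, Double, Double, Double) = {
      val px = m.pt * math.cos(m.phi)
      val py = m.pt * math.sin(m.phi)
      val pz = m.pt * math.sinh(m.eta)
      val e  = math.sqrt(px * px + py * py + pz * pz + m.mass * m.mass)
      (e, px, py, pz)
    }
    val (e1, x1, y1, z1) = p4(a)
    val (e2, x2, y2, z2) = p4(b)
    val m2 = math.pow(e1 + e2, 2) - math.pow(x1 + x2, 2) -
             math.pow(y1 + y2, 2) - math.pow(z1 + z2, 2)
    math.sqrt(math.max(m2, 0.0))
  }

  // Fixed-width histogram; the all-zero vector is the "zero value" of the aggregation.
  val nBins = 60
  val lo = 60.0
  val hi = 120.0
  val emptyHist: Vector[Long] = Vector.fill(nBins)(0L)

  // Sequential step: add one mass value to a partial histogram.
  def increment(h: Vector[Long], mass: Double): Vector[Long] =
    if (mass < lo || mass >= hi) h
    else {
      val b = math.min(((mass - lo) / (hi - lo) * nBins).toInt, nBins - 1)
      h.updated(b, h(b) + 1)
    }

  // Combine step: merge two partial histograms bin by bin.
  def combine(a: Vector[Long], b: Vector[Long]): Vector[Long] =
    a.zip(b).map { case (x, y) => x + y }

  // Each inner Seq plays the role of one Spark partition.
  def histogram(partitions: Seq[Seq[Event]]): Vector[Long] =
    partitions.map { part =>
      part.filter(_.muons.length >= 2)
        .flatMap { e =>
          for {
            i <- e.muons.indices
            j <- (i + 1) until e.muons.length
          } yield invariantMass(e.muons(i), e.muons(j))
        }
        .foldLeft(emptyHist)(increment)
    }.foldLeft(emptyHist)(combine)
}
```

The two-level fold mirrors how an aggregation over an RDD works: one function updates a partial result inside each partition, and a second function merges the partial results.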
![Page 61: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/61.jpg)
61
Example on Physics Analysis with SWAN
![Page 62: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/62.jpg)
62
Final Result
![Page 63: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/63.jpg)
63
![Page 64: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/64.jpg)
64
OK, OK, one with better graphics
![Page 65: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/65.jpg)
65
Final Result
That is what we will do in the exercises
![Page 66: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/66.jpg)
66
Projects beyond Physics Analysis
![Page 67: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/67.jpg)
67
Next Accelerator Logging Service (NXCALS)
A control system with:
• Streaming
• Online System
• API for Data Extraction
Critical for LHC Operations
Runs on a dedicated cluster
Credits: BE-CO-DS
![Page 68: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/68.jpg)
68
Data Center and WLCG Monitoring Systems
[Diagram: monitoring data pipeline, from data sources to data access]
Data Sources: FTS, Rucio, XRootD, Jobs, …, Lemon, syslog, app log, DB, HTTP feed
Transport: Flume agents (AMQ, DB, HTTP, Log GW for logs, Metric GW for Lemon metrics) with a Flume Kafka sink
Kafka cluster (buffering)
Processing: Data enrichment, Data aggregation, Batch Processing
Flume sinks
Storage & Search: HDFS, Elasticsearch, …, Others (influxdb)
Data Access: CLI, API
Critical for Data Center operations and WLCG
200M events/day
500 GB/day
Credits: IT-CM-MM
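The daily volumes quoted above translate into average rates as sketched below. This is a rough estimate only: it assumes decimal units (1 GB = 10^9 bytes) and a uniform load over the day, and the object and value names are hypothetical.

```scala
// Average rates implied by 200M events/day and 500 GB/day.
// Assumptions: decimal units (1 GB = 1e9 bytes), load spread evenly over 24 hours.
object MonitoringRates {
  val eventsPerDay: Double  = 200e6
  val bytesPerDay: Double   = 500e9
  val secondsPerDay: Double = 24 * 3600.0

  val eventsPerSecond: Double    = eventsPerDay / secondsPerDay
  val megabytesPerSecond: Double = bytesPerDay / secondsPerDay / 1e6
  val bytesPerEvent: Double      = bytesPerDay / eventsPerDay
}
```

Under these assumptions the pipeline averages roughly 2,300 events per second and about 5.8 MB/s, with a mean event size of 2.5 kB. Peak rates are not given on the slide and may be considerably higher.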
![Page 69: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/69.jpg)
69
Computer Security Intrusion Detection
Credits: CERN security team, IT-DI
![Page 70: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/70.jpg)
70
Conclusions
![Page 71: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/71.jpg)
71
Conclusions
There is a broad ecosystem of Big Data frameworks, most of which share the same architecture principles, such as resource pooling, high availability, and fault tolerance.
Tools and services are now available for using these big data technologies to perform analytics on physics, infrastructure, and accelerator data.
Popular Big Data frameworks such as Apache Spark show great potential in bridging the gap between the High Energy Physics community and the Big Data community.
![Page 72: Big Data Technologies and Physics Analysis with Apache Spark€¦ · Hadoop FileSystem (HDFS), EOS, Windows DFS, etc. Evangelos Motesnitsalis - inverted CERN School of Computing 2019](https://reader034.vdocument.in/reader034/viewer/2022051604/6001a6fb4b8284295806ee21/html5/thumbnails/72.jpg)
72
Acknowledgements
My mentors, Sebastian Lopienski and Enric Tejedor Saavedra
The CSC Team, especially Joelma and Nikos
Colleagues at the CERN Hadoop, Spark, and streaming services who kindly helped with material and feedback
CMS members of the Big Data Reduction Facility, DIANA/HEP, Fermilab