analysis languages discussion and ideas · 2019. 6. 4.
ANALYSIS LANGUAGES
DISCUSSION AND IDEAS
G. Watts (UW/Seattle)
Analysis Description Languages Workshop
May 7, 2019
MOTIVATION
From S. Sekmen’s Talk
How can we do analysis:
• Correctly?
• Quickly?
• With a small team?
• Efficient use of resources?
Think Big!
THE LHC
Run 3: 300 fb⁻¹
Run 4: 3 ab⁻¹
Analysis files will be tens of TB
• Laptops for anything other than development?
• What about running at 10,000 meters?
• Will need server/PROOF-like functionality
• Shared resources between groups or countries?
• But you need local editing!!
• Editing a file at CERN from the USA is not acceptable.
Think about scaling as we design these languages!
ENVIRONMENT
How is a user going to start using this?
• Software requirements
• Version of python, gcc, etc.
• Metadata that may be experiment specific
• Frameworks that are used to access data
Personal Opinion
The world is moving to a container-based sandbox method of application distribution, e.g. Docker
• Built into Linux
• Built into Windows (soon)
• macOS via a VM
• No story for Chromebook
By Run 4, make this a zero-level requirement for local development?
WHERE BE THE DATA?
Data Format
• All LHC experiments write out ROOT data
• Many smaller ones are avoiding ROOT
• Most experiments have a custom ROOT format
• Can’t be read without the experiment’s software framework
• TTrees without objects are a common intermediate format
• Non-LHC experiments are moving away from ROOT
Tooling
• Most of our tools expect ROOT format
• Most tools outside HEP expect numpy or a similar format
• Increasing in popularity in the field
• Pandas, hdf5, awkward array
Need bridges!
Where is the data stored?
• For Run 4: data lakes
• Large federated storage
• Distributed across country
• Delivery by cache
• Perhaps basic transform by iDDS?
IDDS
Analysis System
iDDS
• Deliver just the data you want
• Dynamic requests
• Transform by adding new ‘columns’, removing some, reformatting, etc.
• Reduce disk usage!
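The "add columns, remove some" idea can be sketched in a few lines of plain Python. This is only an illustration of the transform, not iDDS's interface: the `transform` helper, the column names, and the sample values are all assumptions made up for the example.

```python
import math

# Hypothetical columnar chunk: one list per column, one entry per jet.
columns = {
    "jet_pt": [55.0, 130.3],
    "jet_eta": [1.2, 0.5],
    "jet_phi": [2.34, -0.7],
}

def transform(cols, add=None, keep=None):
    """Return a new column dict with derived columns added and only `keep` retained."""
    out = dict(cols)  # shallow copy, the input dict is left untouched
    for name, func in (add or {}).items():
        out[name] = func(out)
    if keep is not None:
        out = {name: out[name] for name in keep}
    return out

# Request: add jet_px, deliver only jet_pt and jet_px.
delivered = transform(
    columns,
    add={
        "jet_px": lambda c: [
            pt * math.cos(phi) for pt, phi in zip(c["jet_pt"], c["jet_phi"])
        ]
    },
    keep=["jet_pt", "jet_px"],
)
```

Only the requested columns are delivered, which is where the disk (and network) savings would come from.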
NOT MY DAD’S COMPUTER
Compute power is now in co-processors!
All the supercomputers announced for the start of Run 4 are GPU enhanced!
A21 will have a significant amount of Intel Optane memory
1 Rewrite a benchmark analysis to use co-processors to prove it is faster
2 Write out analysis languages so we can move between physics and computer representations
I suspect they are from numba benchmarks
These co-processors like to crunch columns of data, not rows.
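A small sketch of what "columns, not rows" means in practice, using plain Python lists as a stand-in for real array libraries. The record layout and the `to_columns` helper are assumptions for illustration only.

```python
# Row-wise layout: one record per object (values are illustrative).
rows = [
    {"pt": 55.0, "eta": 1.2},
    {"pt": 130.3, "eta": 0.5},
    {"pt": 85.3, "eta": -1.2},
]

def to_columns(records):
    """Turn a list of identically-keyed records into one list per field."""
    names = records[0].keys()
    return {name: [rec[name] for rec in records] for name in names}

cols = to_columns(rows)

# A cut becomes one pass over a single column, producing a mask
# that can then be applied to any other column.
mask = [pt > 60.0 for pt in cols["pt"]]
selected_eta = [eta for eta, keep in zip(cols["eta"], mask) if keep]
```

Each step touches one contiguous column at a time, which is the access pattern a GPU or vector unit is built for.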
HOW WILL PEOPLE USE IT?
Jupyter Lab interface
• Tutorials
• Quick Examinations
• Easy to present text, code, plots in one
• One level up from TTree::Draw
Would you write an analysis in a notebook?
Would you preserve an analysis in a notebook?
Chromebook with a big enough backend?
HOW WILL PEOPLE USE IT?
Command Line/CI
• Any automation
• Continuous Integration for testing of analyses
• Complex algorithms?
What would output look like?
HOW WILL PEOPLE USE IT?
Full-Fledged IDE/GUI
• IDEs now have language servers
• Syntax checking, type checking
• Compile errors as you type in your editor
• Debugging in your editor
• Automatic completion
• Underused in the field, but huge productivity enhancers
• Custom GUI for the language
• As long as text files remain the lingua franca
• Hard to round-trip
WHAT WILL THEY DO WITH IT?
Preserve the Analysis
• This will only be done in the language that was used to write the analysis originally.
• Too expensive otherwise
• RECAST ‘preserves the mess’ for example
Explore
• Lots of ideas, lots of dead-ends
• Plots, and scripts…
• Lots of stuff thrown away
• A few nuggets kept
• Though we often do not remember to remove them from our code!
Quick-Checks
• Explore funny shape in one distribution
• Often need to reuse complex selection
• But add-on code should be separate
• Otherwise remains as dead code long after it is needed
• Key thing that leads to 5000-line-long C++ macros.
Analysis
• Big Iron
• Systematics, control regions, fitting, etc.
• Carefully tracked and maintained
Can we use the same toolset and
language to do all of this?
A few thoughts
99 LANGUAGES ON THE WALL… TAKE ONE DOWN, PARSE IT AROUND…
Data
Query
Hist, etc.
Limit Plot
ADL
1 Data
• Structured, binary, etc.
• Scalars
• (non-event data)
2 Analysis Description Language
• Control and signal regions
• Fitting
• Systematics, ML control, etc.
3 Query Language
• Per-event language
• Declarative
• Select events, objects
• Calculate ML results
• Histograms or some other aggregate data back
QUERY LANGUAGE
Specifically designed to loop over structured data
NO EVENT (DATABASE) LEFT BEHIND
Run #10, Event #123
Run #10, Event #234
Run #11, Event #501

Event
Jet pT    Jet η    Jet φ    Near Tracks
55.0      1.2      2.34     1, 2, 10
130.3     0.5      -0.7     3, 5, 10

Track pT    Track η    Track φ
55.0        1.2        2.34
130.3       0.5        1.2
85.3        -1.2       0.78
…

Physics: every collision is independent
This has strong effects on our compute approach
• Embarrassingly parallel
• Each event can be its own database
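Because events are independent, a per-event query can be farmed out to workers in any order and the results simply combined. A minimal sketch, assuming a made-up in-memory event list (the jet contents of the second and third events are invented for the example) and a hypothetical `jet_multiplicity` query:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative events; each one is a self-contained little "database".
events = [
    {"run": 10, "event": 123, "jets": [{"pt": 55.0}, {"pt": 130.3}]},
    {"run": 10, "event": 234, "jets": [{"pt": 22.1}]},
    {"run": 11, "event": 501, "jets": []},
]

def jet_multiplicity(event):
    """Per-event query: it only ever looks inside one event."""
    return len(event["jets"])

# Serial reference.
serial = [jet_multiplicity(e) for e in events]

# Same query spread over workers; map() preserves the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(jet_multiplicity, events))
```

The thread pool here stands in for whatever actually does the fan-out (a cluster, the GRID, a GPU); the point is only that no event needs to know about any other.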
CS ALREADY KNOWS
The syntax isn’t awesome
But the set of operations is complete and unambiguous
EACH EVENT IS A DATABASE
events.SelectMany(e => e.Jets).FuturePlot("jet_pt", "Jet p_T", 100, 0.0, 1000.0, j => j.pt).Save(hdir);
Run a query over each event, aggregate in a histogram: plot of all jet pT’s in the sample
events.SelectMany(e => e.Jets).Where(j => j.pt > 40.0).Count()
Run a query over each event, aggregate in a single integer: number of jets in the sample with pT > 40
Though it is clear to us what is meant here, it is a bit tricky to code up crossing the event boundary
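For readers who do not know LINQ, the two queries above can be mimicked in plain Python. This is a rough analogue, not the library from the talk: `select_many` and `histogram` are hypothetical helpers, and the events are invented sample data.

```python
from itertools import chain

# Illustrative sample of three events.
events = [
    {"jets": [{"pt": 55.0}, {"pt": 130.3}]},
    {"jets": [{"pt": 22.1}]},
    {"jets": []},
]

def select_many(seq, selector):
    """Flatten one level: the sequence of all items selected from each element."""
    return chain.from_iterable(selector(item) for item in seq)

def histogram(values, nbins, low, high):
    """Fixed-width binning; values outside [low, high) are dropped."""
    counts = [0] * nbins
    width = (high - low) / nbins
    for value in values:
        if low <= value < high:
            counts[int((value - low) / width)] += 1
    return counts

# events.SelectMany(e => e.Jets) ... histogram of jet pT
jet_pt_hist = histogram(
    (j["pt"] for j in select_many(events, lambda e: e["jets"])),
    100, 0.0, 1000.0,
)

# events.SelectMany(e => e.Jets).Where(j => j.pt > 40.0).Count()
n_hard_jets = sum(
    1 for j in select_many(events, lambda e: e["jets"]) if j["pt"] > 40.0
)
```

The `select_many` step is what crosses the event boundary: after it, the jets form one flat sequence for the whole sample.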
NO EVENT (DATABASE) LEFT BEHIND
• How to reason about nested data structures
• Flatten nested arrays, filter, sorting, matching, multi-object looping, etc.
• Terminals (Count, Aggregate, Max, Min, etc.)
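A sketch of a few of these operations on one nested event: matching jets to their nearby tracks by index, sorting, and a terminal (`max`). The event layout, the 0-based track indices, and the values are assumptions chosen for the example, not the format of any experiment.

```python
# One illustrative event with a jet -> track association stored as indices.
event = {
    "jets": [
        {"pt": 55.0, "near_tracks": [0, 1]},
        {"pt": 130.3, "near_tracks": [2]},
    ],
    "tracks": [{"pt": 20.0}, {"pt": 12.5}, {"pt": 85.3}],
}

def track_pt_sum(jet, tracks):
    """Matching: follow the jet's indices into the track collection and aggregate."""
    return sum(tracks[i]["pt"] for i in jet["near_tracks"])

# Sorting: jets ordered by descending pT.
jets_by_pt = sorted(event["jets"], key=lambda j: j["pt"], reverse=True)

# Multi-object step: one aggregate per jet, in the sorted order.
sums = [track_pt_sum(j, event["tracks"]) for j in jets_by_pt]

# Terminal: a single number for the event (None if there are no jets).
leading_pt = max((j["pt"] for j in event["jets"]), default=None)
```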
ANALYSIS LANGUAGE
ADL
Query
Language
I have always thought of the ADL as the wild west
• Totally wacky manipulations of query results
• Impossible to predict
You need a General Purpose Programming Language
HistFactory: a statistical package that combines the results of queries into limits, etc.
KEEP THEM SEPARATED?
ADL
Query
Language
ADL
Query
Language?
WHY I CHOSE C# ORIGINALLY
It has a query language (SQL) embedded in the GPL
C# is well supported
• Tooling, debuggers, etc. all for free!
• Parser and AST built into language standard
• I just had to implement a library back-end!
LEAKY ABSTRACTIONS
1 Carefully control where the abstraction leaks
2 Especially dangerous in the query language
• Automated optimization is much more difficult
• Limits the type of backend you can run on (GPU, CPU, etc.)
HOW DO YOU CALCULATE Δ𝑅
pip install physics-tenpy
Installs a quantum many-body simulator
pip install scikit-hep
Installs packages for doing HEP work in Python
Ecosystem
• Uniform interface for installing add-ons
• Exist for many programming languages
Why not one for an analysis language?
Reuse one that is out there if possible!!
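As a concrete instance of the kind of helper such a package would supply, here is the conventional ΔR = sqrt(Δη² + Δφ²), with Δφ wrapped into [-π, π). The function names are illustrative and not taken from any particular package.

```python
import math

def delta_phi(phi1, phi2):
    """Difference phi1 - phi2, wrapped into the interval [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi

def delta_r(eta1, phi1, eta2, phi2):
    """Angular separation in the (eta, phi) plane."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))

# Two objects on either side of the phi = +/-pi seam are close, not far apart.
near_seam = delta_r(1.2, 3.0, 1.2, -3.0)
```

The wrap-around is the part people get wrong when they re-implement this by hand, which is one argument for installing it rather than rewriting it.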
LAZY EVALUATION
If the user makes the plot “jet pT”, do not calculate the delta R between jets and tracks that is used for another, unrequested plot.
System should optimize, not the user
This means dataflow!
e.g. An unused control region in the ADL file
Start from the goal and work backwards
This is particularly useful in an analysis group
• Lots of people work on a ‘framework’
• Has lots of regions, algorithms, etc.
• User needs only a small portion of them.
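"Start from the goal and work backwards" can be sketched as a tiny dependency graph where only the nodes reachable from the requested result are evaluated. The graph layout, node names, and the `evaluate` helper are all hypothetical, and the `delta_r_plot` node is a placeholder computation, not a real ΔR.

```python
# Each node: (list of dependency names, function of the dependency values).
graph = {
    "jets": ([], lambda: [130.3, 55.0]),
    "tracks": ([], lambda: [20.0, 12.5, 85.3]),
    "jet_pt_plot": (["jets"], lambda jets: sorted(jets)),
    # Placeholder for an expensive jet-track computation nobody asked for.
    "delta_r_plot": (["jets", "tracks"], lambda jets, tracks: len(jets) * len(tracks)),
}

def evaluate(goal, graph, done=None, log=None):
    """Compute `goal`, pulling in only what it depends on; `log` records what ran."""
    done = {} if done is None else done
    log = [] if log is None else log
    if goal not in done:
        deps, func = graph[goal]
        args = [evaluate(dep, graph, done, log)[0] for dep in deps]
        done[goal] = func(*args)
        log.append(goal)
    return done[goal], log

result, executed = evaluate("jet_pt_plot", graph)
```

Asking for `jet_pt_plot` never touches `tracks` or `delta_r_plot`, even though they sit in the same shared 'framework' graph.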
CAN WE WRITE IT ONCE?
Write the same ADL source file for CMS, ATLAS, etc.?
Superficially: yes.
Usefully: no.
• Detectors are different (muons!)
• Thresholds (and η cuts) for valid objects are different.
• Philosophies are different
• Analysis Techniques
• Systematics Calculations
• Leaf names in flat ntuples are different
• Where corrections are applied (and how) are different
• Theorists are ‘easy’
The ADL and the query language may be opinionated…
But they can’t be too opinionated or they will suffer adoption problems.
Moving from C# to Python
CURRENT WORK
Based on my LINQ work
(first check-in was Dec 11, 2010; 2250 commits)
Query Language
AST
DAG
Backend
CURRENT WORK
Based on my LINQ work
Backend
Run on ATLAS xAODs
Run on awkward arrays
Run on flat TTree with RDataFrame
Second axis:
• Run on the GRID
• Run on a local cluster
Second axis:
• Awkward arrays could run on a GPU or CPU
CURRENT WORK
Based on my LINQ work
AST DAG
AST (or DAG) contains the complete information for a query
• How to manipulate the data
• How to filter the data
• What histogram to calculate
• How to weight the data
• Application of a ML weight
• etc
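To make "the AST contains the complete information" concrete, Python's standard `ast` module can parse a query written as a Python expression and recover the chain of operations without running anything. The query text and the `call_chain` helper are illustrative assumptions, not the syntax of the actual project.

```python
import ast

# A LINQ-style query written as a Python expression (never executed here).
query = "events.SelectMany(lambda e: e.Jets).Where(lambda j: j.pt > 40.0).Count()"
tree = ast.parse(query, mode="eval")

def call_chain(node):
    """Unwind a chain of method calls; return (source name, operations in order)."""
    names = []
    while isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        names.append(node.func.attr)
        node = node.func.value
    source = node.id if isinstance(node, ast.Name) else None
    return source, list(reversed(names))

source, steps = call_chain(tree.body)
```

A backend would walk the same tree in more detail (the lambdas carry the cuts and the plotted quantities) to generate whatever code it needs.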
CURRENT WORK
Based on my LINQ work
AST DAG
Insert a cache between these two
• The AST becomes the cache key
• Can request the same plot, and the second time it should take milliseconds to return
• Not 10 minutes with a compute cluster
• Can re-run the full analysis in a second or two
• Spend time making only the new plots, but have it all together
Cache
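A minimal sketch of using the AST as the cache key, assuming the query is a Python expression: hashing `ast.dump` of the parsed tree means two spellings of the same query share one entry. The `run_query` and `slow_backend` names, the in-memory dict, and the returned value are placeholders invented for the example.

```python
import ast
import hashlib

_cache = {}
backend_calls = []  # records every time the (pretend) slow backend really runs

def ast_key(query):
    """Hash of the parsed tree; formatting differences in the text do not matter."""
    tree = ast.parse(query, mode="eval")
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()

def run_query(query, backend):
    key = ast_key(query)
    if key not in _cache:
        _cache[key] = backend(query)
    return _cache[key]

def slow_backend(query):
    """Stand-in for minutes on a compute cluster."""
    backend_calls.append(query)
    return {"entries": 3}  # placeholder result

first = run_query("events.Count()", slow_backend)
second = run_query("events . Count( )", slow_backend)  # same AST, different text
```

A real cache would also need the dataset identity in the key and persistent storage; both are left out of this sketch.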
STATUS
Based on my LINQ work
Query Language
AST
DAG
Backend
LINQ in Python
• No effort to make it concise
• Text strings (ick!)
AST is Python’s
• Some minor extensions
• Can transform
• To add convenience tuples, for example
• and can be put into an HTTP request
DAG is part of backend
Backend
• xAOD and RDF can run on single files, awkward array can read in many
• xAOD most complete (Z → ℓℓ flat ntuple generator, per-jet training for LLP search)
• Trigger!?
Jupyter Examples
WHAT’S NEXT?
Based on my LINQ work
Query Language
AST
DAG
Backend
Query language
• Use Python with language parsing
• No (or almost no) text strings
• Types?
• Play to Python’s strengths
Convert to the simplified AST that Jim has discussed
DAG is part of backend
Backend
• Turn into web service
• Create cache
• Already have a web service to load GRID data locally
iDDS
THE CONVERSATION
• There is a huge amount of activity around Analysis and Query Languages
• Join the Conversation!
• HSF Data Analysis Forum (home (email list), indico)
• Topical Meetings @IRIS-HEP (sign up for one!)
• CHEP and ACAT conferences
• CHEP deadline is soon! Please submit an abstract!
• IRIS-HEP/Slack channel
• Think Big
• The context for Run 3 and Run 4 is much bigger than we are used to
• Can we do a full analysis with a small team?
• Scalability?
• An Analysis System, not just an ADL!