ANALYSIS LANGUAGES DISCUSSION AND IDEAS G. Watts (UW/Seattle) Analysis Languages Workshop Analysis Description Languages Workshop May 7, 2019

ANALYSIS LANGUAGES: DISCUSSION AND IDEAS

G. Watts (UW/Seattle)

Analysis Description Languages Workshop

May 7, 2019

MOTIVATION

From S. Sekmen’s talk: how can we do analysis

• Correctly?
• Quickly?
• With a small team?
• With efficient use of resources?

Think Big!

THE LHC

Run 3: 300 fb⁻¹

Run 4: 3 ab⁻¹

Analysis files will be tens of TB

• Laptops for anything other than development?
  • What about running at 10,000 meters?
• Will need server/PROOF-like functionality
  • Shared resources between groups or countries?
• But you need local editing!!
  • Editing a file at CERN from the USA is not acceptable.

Think about scaling as we design these languages!

ENVIRONMENT

How is a user going to start using this?
• Software requirements
  • Version of python, gcc, etc.
• Metadata that may be experiment specific
• Frameworks that are used to access data

Personal Opinion

The world is moving to a container-based sandbox method of application distribution, e.g. Docker
• Built into Linux
• Built into Windows (soon)
• macOS VM
• No story for Chromebook

By Run 4, make this a zero-level requirement for local development?

WHERE BE THE DATA?

Data Format
• All LHC experiments write out ROOT data
  • Many smaller ones are avoiding ROOT
• Most experiments have a custom ROOT format
  • Can’t be read without the experiment’s software framework
• TTrees without objects are a common intermediate format
• Non-LHC experiments are moving away from ROOT

Tooling
• Most of our tools expect ROOT format
• Most tools outside HEP expect numpy or a similar format
  • Increasing in popularity in the field
  • Pandas, hdf5, awkward array

Need bridges!

Where is the data stored?
• For Run 4: data lakes
  • Large federated storage
  • Distributed across country
  • Delivery by cache
  • Perhaps basic transform by iDDS?

IDDS

(Diagram labels: Analysis System, iDDS)

• Deliver just the data you want
• Dynamic requests
• Transform by adding new ‘columns’, removing some, reformatting, etc.
• Reduce disk usage!

NOT MY DAD’S COMPUTER

Compute power is now in co-processors!

All the supercomputers announced for the start of Run 4 are GPU enhanced!
• A21 will have a significant amount of Intel Optane memory

1. Rewrite a benchmark analysis to use co-processors to prove it is faster
2. Write out analysis languages so we can move between physics and computer representations

I suspect they are from numba benchmarks.

These co-processors like to crunch columns of data, not rows.
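To make the rows-versus-columns distinction concrete, here is a minimal sketch using only the Python standard library. The field names (`pt`, `eta`), the values, and the 40.0 threshold are assumptions chosen for illustration, not taken from the talk. The same jets are stored once as records (rows) and once as one contiguous typed array per field (columns).

```python
from array import array

# Row-oriented: one record per jet; the fields of a jet sit together.
jets_rows = [
    {"pt": 55.0, "eta": 1.2},
    {"pt": 130.3, "eta": 0.5},
    {"pt": 25.1, "eta": -1.2},
]

# Column-oriented: one contiguous, typed array per field.
jets_columns = {
    "pt": array("d", [55.0, 130.3, 25.1]),
    "eta": array("d", [1.2, 0.5, -1.2]),
}


def count_above_rows(rows, threshold):
    """Visit every record, even though only `pt` is needed."""
    return sum(1 for row in rows if row["pt"] > threshold)


def count_above_columns(columns, threshold):
    """Scan only the contiguous `pt` column; `eta` is never touched."""
    return sum(1 for pt in columns["pt"] if pt > threshold)
```

Both functions give the same count. The column layout is the one that maps naturally onto vectorized or co-processor execution, because a single cut touches a single contiguous buffer.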

HOW WILL PEOPLE USE IT?

Jupyter Lab interface
• Tutorials
• Quick Examinations
• Easy to present text, code, plots in one place
• One level up from TTree::Draw

Would you write an analysis in a notebook?

Would you preserve an analysis in a notebook?

Chromebook with a big enough backend?

HOW WILL PEOPLE USE IT?

Command Line/CI

• Any automation

• Continuous Integration for testing of analyses

• Complex algorithms?

What would output look like?

HOW WILL PEOPLE USE IT?

Full-Fledged IDE/GUI
• IDEs now have language servers
  • Syntax checking, type checking
  • Compile errors as you type in your editor
  • Debugging in your editor
  • Automatic completion
  • Underused in the field, but huge productivity enhancers
• Custom GUI for the language
  • As long as text files remain the lingua franca
  • Hard to round-trip

WHAT WILL THEY DO WITH IT?

Preserve the Analysis
• This will only be done in the language that was used to write the analysis originally.
  • Too expensive otherwise
  • RECAST ‘preserves the mess’ for example

Explore
• Lots of ideas, lots of dead-ends
• Plots, and scripts…
• Lots of stuff thrown away
• A few nuggets kept
• Though we often do not remember to remove them from our code!

Quick-Checks
• Explore funny shape in one distribution
• Often need to reuse complex selection
• But add-on code should be separate
  • Otherwise remains as dead code long after it is needed
  • Key thing that leads to 5000 line long C++ macros.

Analysis
• Big Iron
• Systematics, control regions, fitting, etc.
• Carefully tracked and maintained

Can we use the same toolset and language to do all of this?

A few thoughts

99 LANGUAGES ON THE WALL… TAKE ONE DOWN, PARSE IT AROUND…

(Diagram labels: Data, Query, Hist, etc., Limit Plot, ADL)

1. Data
• Structured, binary, etc.
• Scalars
• (non-event data)

2. Analysis Description Language
• Control and signal regions
• Fitting
• Systematics, ML control, etc.

3. Query Language
• Per-event language
• Declarative
• Select events, objects
• Calculate ML results
• Histograms or some other aggregate data back

QUERY LANGUAGE

Specifically designed to loop over structured data

NO EVENT (DATABASE) LEFT BEHIND

Events:

Run    Event
#10    #123
#10    #234
#11    #501

Event contents (jets):

Jet 𝒑𝑻    Jet 𝜼    Jet 𝝓    Near Tracks
55.0      1.2      2.34     1, 2, 10
130.3     0.5      -0.7     3, 5, 10

Event contents (tracks):

Track 𝒑𝑻    Track 𝜼    Track 𝝓
55.0        1.2        2.34
130.3       0.5        1.2
85.3        -1.2       0.78

Physics: every collision is independent

This has strong effects on our compute approach
• Embarrassingly parallel
• Each event can be its own database
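Because every event is independent, a per-event query can be split across workers and the partial results merged afterwards. The sketch below illustrates that pattern with the standard library only. The run and event numbers echo the slide, while the `jet_pt` values, the 50.0 bin width, and the function names are assumptions made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample: each event is a self-contained record.
events = [
    {"run": 10, "event": 123, "jet_pt": [55.0, 130.3]},
    {"run": 10, "event": 234, "jet_pt": [42.0]},
    {"run": 11, "event": 501, "jet_pt": [12.5, 61.0, 33.3]},
]


def histogram_chunk(chunk, bin_width=50.0):
    """Fill a partial histogram (bin index -> count) from one chunk of events."""
    hist = Counter()
    for ev in chunk:
        for pt in ev["jet_pt"]:
            hist[int(pt // bin_width)] += 1
    return hist


def parallel_histogram(all_events, n_workers=2):
    """Split the events into chunks, process each independently, then merge."""
    chunks = [all_events[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(histogram_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total
```

No chunk needs to know about any other chunk, so the same split-and-merge structure would apply to processes, batch jobs, or a cluster instead of threads.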

CS ALREADY KNOWS

The syntax isn’t awesome

But the set of operations is complete and unambiguous

EACH EVENT IS A DATABASE

Plot of all jet 𝑝𝑇’s in the sample (run a query over each event, aggregate in a histogram):

    events.SelectMany(e => e.Jets)
          .FuturePlot("jet_pt", "Jet p_T", 100, 0.0, 1000.0, j => j.pt)
          .Save(hdir);

Number of jets in the sample with 𝑝𝑇 > 40 (run a query over each event, aggregate in a single integer):

    events.SelectMany(e => e.Jets).Where(j => j.pt > 40.0).Count()

Though it is clear to us what is meant here, it is a bit tricky to code up crossing the event boundary.
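For readers more at home in Python than C#, the two queries above can be mimicked with generators. This is only a rough analogue under stated assumptions: `select_many` and `fill_histogram` are hypothetical helpers written for this example (they are not part of any library named in the talk), and the event contents are invented.

```python
from itertools import chain

# Hypothetical events; each one carries its own list of jets.
events = [
    {"jets": [{"pt": 55.0}, {"pt": 130.3}]},
    {"jets": []},
    {"jets": [{"pt": 35.0}, {"pt": 41.5}]},
]


def select_many(items, selector):
    """Flatten: apply selector to each item and chain the resulting sequences."""
    return chain.from_iterable(selector(item) for item in items)


def fill_histogram(values, nbins, low, high):
    """Aggregate values into fixed-width bins; out-of-range values are dropped."""
    counts = [0] * nbins
    width = (high - low) / nbins
    for v in values:
        if low <= v < high:
            counts[int((v - low) / width)] += 1
    return counts


# Analogue of: events.SelectMany(e => e.Jets).Where(j => j.pt > 40.0).Count()
n_jets = sum(1 for j in select_many(events, lambda e: e["jets"]) if j["pt"] > 40.0)

# Analogue of the histogram query: all jet pT values, 100 bins from 0 to 1000.
jet_pt_hist = fill_histogram(
    (j["pt"] for j in select_many(events, lambda e: e["jets"])), 100, 0.0, 1000.0
)
```

The flattening step is where the event boundary is crossed: after `select_many`, the jets form one sequence and the terminal (a count or a histogram) aggregates over the whole sample.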

NO EVENT (DATABASE) LEFT BEHIND

• How to reason about nested data structures
• Flatten nested arrays, filter, sorting, matching, multi-object looping, etc.
• Terminals (Count, Aggregate, Max, Min, etc.)

ANALYSIS LANGUAGE

(Diagram labels: ADL, Query Language)

I have always thought of the ADL as the wild west
• Totally wacky manipulations of query results
• Impossible to predict

You need a General Purpose Programming Language

HistFactory: a statistical package that combines the results of queries into limits, etc.

KEEP THEM SEPARATED?

(Diagram labels: ADL, Query Language; ADL, Query Language?)

WHY I CHOSE C# ORIGINALLY

It has an SQL-like query language (LINQ) embedded in the general purpose language

C# is well supported
• Tooling, debuggers, etc. all for free!
• Parser and AST built into language standard
• I just had to implement a library back-end!

LEAKY ABSTRACTIONS

1. Carefully control where abstraction leaks
2. Especially dangerous in the query language
  • Automated optimization is much more difficult
  • Limits the type of backend you can run on (GPU, CPU, etc.)

HOW DO YOU CALCULATE Δ𝑅?

pip install physics-tenpy
• Installs a quantum many-body simulator

pip install scikit-hep
• Installs packages for doing HEP work in Python

Ecosystem
• Uniform interface for installing add-ons
• Exist for many programming languages

Why not one for an analysis language?

Reuse one that is out there if possible!!
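For reference, the quantity in the slide title is the usual angular separation, ΔR = sqrt(Δη² + Δφ²), with Δφ wrapped so that it never exceeds π in magnitude. The sketch below is a minimal stand-alone version of what a shared, installable package would provide. The function names are chosen for this example and are not the API of any particular library.

```python
import math


def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into (-pi, pi]."""
    dphi = (phi1 - phi2) % (2.0 * math.pi)  # now in [0, 2*pi)
    if dphi > math.pi:
        dphi -= 2.0 * math.pi
    return dphi


def delta_r(eta1, phi1, eta2, phi2):
    """Angular separation: sqrt(delta_eta**2 + delta_phi**2)."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))
```

The wrap-around in `delta_phi` is the detail that is easy to get wrong when everyone writes their own copy, which is one argument for installing a common implementation instead.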

LAZY EVALUATION

If the user makes the plot “jet pT”, do not calculate the delta R between jet and tracks that is used for another, unrequested plot.
• e.g. An unused control region in the ADL file

System should optimize, not the user

This means dataflow!

Start from the goal and work backwards

This is particularly useful in an analysis group
• Lots of people work on a ‘framework’
• Has lots of regions, algorithms, etc.
• User needs only a small portion of them.
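One way to picture "start from the goal and work backwards" is a small dependency graph in which each quantity lists what it needs, and only the ancestors of the requested plot are ever computed. The sketch below is a toy illustration: the node names and the stand-in calculations are hypothetical and do not reflect any real framework.

```python
# Each node maps to (dependencies, function of the dependency values).
# Node names and the toy calculations are hypothetical.
GRAPH = {
    "jets": ([], lambda: [55.0, 130.3, 35.0]),
    "tracks": ([], lambda: [1.2, 0.5]),
    "jet_pt_plot": (["jets"], lambda jets: sorted(jets)),
    "jet_track_dr": (["jets", "tracks"], lambda jets, tracks: len(jets) * len(tracks)),
    "dr_plot": (["jet_track_dr"], lambda n: [n]),
}


def evaluate(goal, graph, computed=None):
    """Start from the goal and work backwards, computing only what it needs."""
    if computed is None:
        computed = {}
    if goal not in computed:
        deps, func = graph[goal]
        args = [evaluate(dep, graph, computed)[dep] for dep in deps]
        computed[goal] = func(*args)
    return computed
```

Calling `evaluate("jet_pt_plot", GRAPH)` fills in only `jets` and `jet_pt_plot`. The jet-track calculation is never run, even though it is defined in the same graph.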

CAN WE WRITE IT ONCE?

Write the same ADL source file for CMS, ATLAS, etc.?

Superficially: yes.

Usefully: no.
• Detectors are different (muons!)
• Thresholds (and 𝜂 cuts) for valid objects are different.
• Philosophies are different
  • Analysis Techniques
  • Systematics Calculations
• Leaf names in flat ntuples are different
• Where corrections are applied (and how) are different
• Theorists are ‘easy’

The ADL and the query language may be opinionated… But they can’t be too opinionated or they will suffer adoption problems.

Moving from C# to Python

CURRENT WORK

Based on my LINQ work (first check-in was Dec 11, 2010; 2250 commits)

(Diagram labels: Query Language, AST, DAG, Backend)

CURRENT WORK

Backend
• Run on ATLAS xAODs
• Run on awkward arrays
• Run on flat TTree with RDataFrame

Second axis:
• Run on the GRID
• Run on a local cluster

Second axis:
• Awkward arrays could run on a GPU or CPU

CURRENT WORK

(Diagram labels: AST, DAG)

AST (or DAG) contains the complete information for a query
• How to manipulate the data
• How to filter the data
• What histogram to calculate
• How to weight the data
• Application of an ML weight
• etc.
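As a small illustration of how much a syntax tree carries, the sketch below parses a query written as Python source with the standard `ast` module and pulls the operations and cut values back out of the tree. The query string and the method names in it (`SelectMany`, `Where`, `Count`) are assumptions that mirror the LINQ examples. This is not the actual query syntax of the project.

```python
import ast

# Hypothetical query, written as a text string of Python source.
query = "events.SelectMany(lambda e: e.Jets).Where(lambda j: j.pt > 40.0).Count()"

tree = ast.parse(query, mode="eval")


def called_methods(node):
    """Collect the names of all methods called anywhere in the expression."""
    return [
        n.func.attr
        for n in ast.walk(node)
        if isinstance(n, ast.Call) and isinstance(n.func, ast.Attribute)
    ]


def numeric_constants(node):
    """Collect numeric literals, e.g. cut thresholds."""
    return [
        n.value
        for n in ast.walk(node)
        if isinstance(n, ast.Constant) and isinstance(n.value, (int, float))
    ]
```

Everything a backend needs (which collections are flattened, which filters apply, which terminal is requested) can be read off the tree without executing the query.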

CURRENT WORK

(Diagram labels: AST, Cache, DAG)

Insert a cache between these two
• The AST becomes the cache key
• Can request the same plot, and the second time it should take ms to return it
  • Not 10 minutes with a compute cluster
• Can re-run the full analysis in a second or two
  • Spend time making only the new plots, but have it all together
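A minimal sketch of using the AST as a cache key, assuming the query arrives as a string of Python source: the tree is dumped to a canonical string and hashed, so two requests that differ only in whitespace share one cache entry. The in-memory dictionary, the `backend` callable, and the function names are placeholders invented for this example.

```python
import ast
import hashlib

_cache = {}


def cache_key(query_source):
    """Hash the AST dump so formatting differences do not change the key."""
    tree = ast.parse(query_source, mode="eval")
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


def run_query(query_source, backend):
    """Return a cached result if the same AST was seen before; else call backend."""
    key = cache_key(query_source)
    if key not in _cache:
        _cache[key] = backend(query_source)
    return _cache[key]
```

Keying on the tree rather than on the raw text means the expensive backend call happens once per distinct query, which is what lets a repeated plot request come back in milliseconds.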

STATUS

(Diagram labels: Query Language, AST, DAG, Backend)

LINQ in Python
• No effort to make it concise
• Text strings (ick!)

AST is Python’s
• Some minor extensions
• Can transform
  • To add convenience tuples, for example
• and can be put into an http request

DAG is part of backend

Backend
• xAOD and RDF can run on single files, awkward array can read in many
• xAOD most complete (𝑍 → ℓℓ flat ntuple generator, per-jet training for LLP search)
• Trigger!?

Jupyter Examples

WHAT’S NEXT?

(Diagram labels: Query Language, AST, DAG, Backend, iDDS)

Query language
• Use Python with language parsing
• No (or almost no) text strings
• Types?
• Play to Python’s strengths

Convert to the simplified AST that Jim has discussed

DAG is part of backend

Backend
• Turn into web service
• Create cache
• Already have web service to load GRID data local

THE CONVERSATION

• There is a huge amount of activity around Analysis and Query Languages
• Join the Conversation!
  • HSF Data Analysis Forum (home (email list), indico)
  • Topical Meetings @IRIS-HEP (sign up for one!)
  • CHEP and ACAT conferences
    • CHEP deadline is soon! Please submit an abstract!
  • IRIS-HEP/Slack channel
• Think Big
  • The context for Run 3 and Run 4 is much bigger than we are used to
  • Can we do a full analysis with a small team?
  • Scalability?
  • An Analysis System, not just an ADL!