![Page 1: Next-Generation Databases Miguel Branco on behalf of the RAW team](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eb65503460f94bc07cd/html5/thumbnails/1.jpg)
Next-Generation Databases
Miguel Branco on behalf of the RAW team
Trends
• More complex hardware
– Multicores, GPUs, Cloud, NUMA*, PoP+SoC**, …
• More complex questions
– “Last month sales” “Next month sales”
• More complex apps
– Distributed, Service-oriented, Rack-aware, ...
• More data analysts
– Ease-of-use, Interactivity, Collaboration, ...
• More data
– Volume, File Formats, ...
* Non-uniform memory architectures
** Package on Package, System on a Chip
• No data loading
– No “physical” data copy: support existing file formats
• No database tuning
– Instead, self-tuned based on actual usage patterns
• Not restricted to tables
– Add support for trees, vectors, matrices, …
• Not just SQL
– Instead, enable domain-specific languages
Traditional Database
Data adapts to the query engine
DBMS
SQL
CSV XML JSON
RAW
Query engine adapts to the data
DBMS
SQL
CSV XML JSON
RAW lang
“DSL”
How RAW adapts to data
• CSV: … containing “good” run numbers
• ROOT: … containing physics events
[Query plan diagram: join, scan (ROOT), scan (CSV), filter]
• Code Generate the Access Paths
• Code Generate the Query
• Build Position and Data Caches
SELECT event.jet…
FROM csv, root
WHERE csv.RunNumber = root.RunNumber
  AND root.EF_2mu13 == TRUE
  AND …
Adapt to format, file instance and query just-in-time
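The “Build Position and Data Caches” step can be pictured with a small sketch. It only assumes that remembering where each row starts lets a later query jump to that row without re-scanning the file. `PositionCache`, `build_position_cache` and `row_at` are hypothetical names for illustration, not RAW's actual API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical position cache: byte offset of the start of every row
// in a newline-delimited text buffer (e.g. the contents of a CSV file).
struct PositionCache {
    std::vector<std::size_t> row_start;
};

// One pass over the data records where each row begins.
PositionCache build_position_cache(const std::string& data) {
    PositionCache cache;
    if (!data.empty()) cache.row_start.push_back(0);
    for (std::size_t i = 0; i + 1 < data.size(); ++i) {
        if (data[i] == '\n') cache.row_start.push_back(i + 1);
    }
    return cache;
}

// Later queries jump straight to a row instead of re-scanning from the top.
// Throws std::out_of_range if the row does not exist.
std::string row_at(const std::string& data, const PositionCache& cache,
                   std::size_t row) {
    const std::size_t begin = cache.row_start.at(row);
    const std::size_t end = data.find('\n', begin);
    return data.substr(begin,
                       end == std::string::npos ? std::string::npos : end - begin);
}
```

The cache is built once, for example during the first (cold) query, and reused by later queries over the same file instance.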
Adapting to schema & query
[CSV input]

GENERAL-PURPOSE
    col:
      if col needed:
        if col isInt
          readInt();
        if col isFloat
          readFloat();
        if ...
      else:
        skipField();

JUST-IN-TIME
    readInt();
    readInt();
    skipField();
    readFloat();
    skipRestLine();

Remove overhead of generic operators
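A minimal sketch of the contrast, written by hand rather than generated. `parseGeneric` interprets the schema for every field of every row; `parseSpecialized` is what generated code could look like for one schema and query. The `Column`, `Row` and parser names, and the comma-separated layout without quoting, are assumptions for illustration.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

enum class ColType { Int, Float };

struct Column {
    ColType type;
    bool needed;  // does the query touch this column?
};

struct Row {
    std::vector<long> ints;
    std::vector<double> floats;
};

// Cursor helpers over one CSV line (no quoting or escaping handled).
long readInt(const char*& p) {
    char* end = nullptr;
    long v = std::strtol(p, &end, 10);
    p = (*end == ',') ? end + 1 : end;
    return v;
}
double readFloat(const char*& p) {
    char* end = nullptr;
    double v = std::strtod(p, &end);
    p = (*end == ',') ? end + 1 : end;
    return v;
}
void skipField(const char*& p) {
    while (*p != '\0' && *p != ',') ++p;
    if (*p == ',') ++p;
}

// GENERAL-PURPOSE: branches on the schema again for every field of every row.
Row parseGeneric(const std::string& line, const std::vector<Column>& schema) {
    Row row;
    const char* p = line.c_str();
    for (const Column& col : schema) {
        if (!col.needed) { skipField(p); continue; }
        if (col.type == ColType::Int) row.ints.push_back(readInt(p));
        else row.floats.push_back(readFloat(p));
    }
    return row;
}

// JUST-IN-TIME: straight-line code for one known schema and query
// (int, int, skipped field, float, rest of the line ignored).
Row parseSpecialized(const std::string& line) {
    Row row;
    const char* p = line.c_str();
    row.ints.push_back(readInt(p));
    row.ints.push_back(readInt(p));
    skipField(p);
    row.floats.push_back(readFloat(p));
    // skipRestLine(): the remaining fields are simply never touched.
    return row;
}
```

Both functions return the same row for a line such as `7,42,abc,3.5,ignored,9`; the specialized version just has no per-field type or "needed" checks left to evaluate.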
Adapting to format
• Unroll Columns
    col: if col needed: if col isInt ...
    readInt(); skipField(); readFloat(); skipRest();
• Free navigation in files
    - fieldLength: 10
    - tupleLength: 100
    - Need fields 2 & 5 of 2nd row
    moveTo(110); readInt(); moveTo(140); readFloat();
• Embedded indexes/existing APIs
    - Bitmaps, R-Trees etc.
    - readNextField() vs. readField(filename,id)
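The offsets in the free-navigation example follow from simple arithmetic on a fixed-width layout. A minimal sketch, assuming 1-based row and field numbers and a hypothetical helper named `fieldOffset`:

```cpp
#include <cstddef>

// Layout assumed from the slide: every field is fieldLength bytes wide and
// every tuple (row) is tupleLength bytes wide. Rows and fields are 1-based.
constexpr std::size_t fieldOffset(std::size_t row, std::size_t field,
                                  std::size_t fieldLength,
                                  std::size_t tupleLength) {
    return (row - 1) * tupleLength + (field - 1) * fieldLength;
}

// Fields 2 and 5 of the 2nd row, with fieldLength 10 and tupleLength 100.
static_assert(fieldOffset(2, 2, 10, 100) == 110, "matches moveTo(110)");
static_assert(fieldOffset(2, 5, 10, 100) == 140, "matches moveTo(140)");
```

Because the offsets are computable without reading anything, the scan can jump directly to the two needed fields and skip the rest of the file.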
HEP analysis: Data

ROOT - C++
    class Event {
      class Muon {
        float pt, eta;
        …
      }
      class Electron {
        float pt, eta;
        …
      }
      class Jet {
        float pt, eta;
        …
      }
      int runNumber;
      vector<Muon> muons;
      vector<Electron> electrons;
      vector<Jet> jets;
    }

RAW
    Event:    eventID INT, runNumber INT
    Muon:     eventID INT, eta FLOAT, pt FLOAT
    Electron: eventID INT, eta FLOAT, pt FLOAT
    Jet:      eventID INT, eta FLOAT, pt FLOAT
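The relationship between the nested ROOT objects and the flat tables can be sketched as a flattening step in which each particle row carries the ID of its parent event. This is illustrative only: the `eventID` member on the nested `Event`, the `flatten` function and the row struct names are assumptions, and only muons are shown.

```cpp
#include <vector>

// Nested, ROOT-style objects (reduced to the fields shown on the slide;
// eventID is assumed to be available on the event).
struct Muon { float pt, eta; };
struct Event {
    int eventID;
    int runNumber;
    std::vector<Muon> muons;
};

// Flat, table-style rows: each muon row references its parent event.
struct EventRow { int eventID; int runNumber; };
struct MuonRow  { int eventID; float eta; float pt; };

struct Tables {
    std::vector<EventRow> events;
    std::vector<MuonRow> muons;
};

// Turn nested events into an Event table and a Muon table.
Tables flatten(const std::vector<Event>& nested) {
    Tables t;
    for (const Event& e : nested) {
        t.events.push_back({e.eventID, e.runNumber});
        for (const Muon& m : e.muons) {
            t.muons.push_back({e.eventID, m.eta, m.pt});
        }
    }
    return t;
}
```

Electrons and jets would follow the same pattern, which is what makes a join on `eventID` between the tables possible.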
HEP analysis: Queries
“Identify events of interest → Filter out background events → Plot aggregated results in a histogram”

RAW
    SELECT event
    FROM root:/data1/ATLAS/*.root, csv:/data1/ATLAS/events.csv
    WHERE (csv.id = event.id AND event.EF_e24vhi_medium1 OR
           event.EF_e60_medium1 OR event.EF_2e12Tvh_loose1 OR
           event.EF_mu24i_tight OR event.EF_mu36_tight OR event.EF_2mu13)
      AND event.muon.mu_ptcone20 < 0.1 * event.muon.mu_pt
      AND event.muon.mu_pt > 20000.
      AND ABS(event.muon.mu_eta) < 2.4
      AND …..

ROOT - C++ (1000+ lines of C++)
    for (unsigned int imuon = 0; imuon < ((*curr_entries)[jentry].mu_pt)->size(); imuon++) {
      if (((*curr_entries)[jentry].mu_ptcone20)->at(imuon) < 0.1 * ((*curr_entries)[jentry].mu_pt)->at(imuon) &&
          ((*curr_entries)[jentry].mu_pt)->at(imuon) > 20000. &&
          fabs(((*curr_entries)[jentry].mu_eta)->at(imuon)) < 2.4 &&
          …}
    ...
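The three muon cuts that are visible in both versions of the query can be isolated into one small predicate. The sketch below covers only those cuts; the trigger flags and the elided conditions are left out, and `MuonCand`, `passesMuonCuts` and `countSelectedMuons` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One muon candidate with the three quantities used by the visible cuts.
struct MuonCand {
    double pt;        // mu_pt
    double eta;       // mu_eta
    double ptcone20;  // mu_ptcone20
};

// mu_ptcone20 < 0.1 * mu_pt AND mu_pt > 20000. AND |mu_eta| < 2.4
bool passesMuonCuts(const MuonCand& mu) {
    return mu.ptcone20 < 0.1 * mu.pt &&
           mu.pt > 20000. &&
           std::fabs(mu.eta) < 2.4;
}

// Number of muons in one event that survive the cuts.
std::size_t countSelectedMuons(const std::vector<MuonCand>& muons) {
    std::size_t selected = 0;
    for (const MuonCand& mu : muons) {
        if (passesMuonCuts(mu)) ++selected;
    }
    return selected;
}
```

In the declarative version the same conditions are three lines of the WHERE clause, and the engine decides how to loop over events and muons.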
RAW vs. the ROOT framework
[Xeon CPU E7-28867 @ 2.13GHz, 1TB HDD - 7200RPM, 192GB RAM]
[Bar chart: Execution Time (sec), log scale, for Query 1 (Cold) and Query 2 to Query 6; series: RAW, ROOT]
ROOT: 900 GB in 127 files
CSV: 1 “table” of IDs
Declarative queries + up to 90x improvement
RAW for High-Energy Physics
• End-users:
– Performance (JIT, codegen, vectorwise, …)
– Easy-to-use (declarative) query language
• Infrastructure Providers:
– Data kept in original location & file format
– Declarative query language: More optimization opportunities
• “Event” caches

http://dias.epfl.ch/RAW
Thank You!