![Page 1: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/1.jpg)
Myria: Scalable Analytics as a Service
Bill Howe, PhD
University of Washington
XLDB South America 2014
![Page 2: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/2.jpg)
04/11/2023 2/57
This morning
• UW eScience Institute
– A “Data Science Environment”
• SQLShare and High Variety Data
• Myria and “Relational Algorithmics”
Bill Howe, UW
![Page 3: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/3.jpg)
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
![Page 4: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/4.jpg)
The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
![Page 5: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/5.jpg)
“All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data… In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields.”
2005-2008
In other words:
• Data-driven discovery will be ubiquitous
• UW must be a leader in inventing the capabilities
• UW must be a leader in translational activities – in putting these capabilities to work
• It’s about intellectual infrastructure (human capital) and software infrastructure (shared tools and services – digital capital)
![Page 6: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/6.jpg)
A 5-year, US$37.8 million cross-institutional collaboration to create a data science environment
2014
![Page 7: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/7.jpg)
Data Science Kickoff Session: 137 posters from 30+ departments and units
![Page 8: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/8.jpg)
Establish a virtuous cycle
• 6 working groups, each with 3-6 faculty from each institution
![Page 9: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/9.jpg)
UW Data Science Education Efforts
Audiences:
• Students: CS/Informatics (undergrads, grads); Non-Major (undergrads, grads)
• Non-Students: professionals, researchers
Programs:
• UWEO Data Science Certificate
• MOOC Intro to Data Science
• IGERT: Big Data PhD Track
• New CS Courses
• Bootcamps and workshops
• Intro to Data Programming
• Data Science Masters (planned)
• Incubator: hands-on training
![Page 10: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/10.jpg)
Next Session begins June 30, 2014
https://www.coursera.org/course/datasci
![Page 11: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/11.jpg)
MOOC Participation numbers
• “Registered”: 119,517 (totally irrelevant)
• Clicked play in first 2 weeks: 78,589
• Turned in 1st homework: 10,663
• Completed all assignments: ~9000 (typical attrition for a MOOC)
• “Passed”: 7022
• Forum threads: 4661
• Forum posts: 22,900
Fairly consistent with Coursera data across “hard” courses
![Page 12: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/12.jpg)
Educational transformation: A new generation of “Pi-shaped” scientists
PhD πhD
Magda Balazinska
![Page 13: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/13.jpg)
Educational transformation
Education and Research in Data Science
• Ultimate goal: A new PhD program
– Initial goal: A new certificate based on Big Data tracks in all departments
– Education highlights: data science courses, co-advising, and internships
• End-to-End Research Agenda
– Big Data mgmt, analytics, modeling, & collaboration
• Cyberinfrastructure Development
– Big Data analysis service
[Diagram labels: Big Data access and management; Big Data modeling; Big Data analytics; Collaborative Big Data science; Data]
![Page 14: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/14.jpg)
The Data Science Studio
• An open collaborative research space
• A resident data science team
– Permanent staff of ~5 data scientists – applied research and development
– ~15-20 data science fellows (research scientists, visitors, postdocs, students)
• How to Engage:
– Drop-in open workspace
– Studio “Office Hours”
– Incubation Program
![Page 15: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/15.jpg)
6th floor Physics Astronomy Building
A partnership among …
• Provost
• UW Libraries
• Physics, Astronomy, Arts & Sciences
• eScience Institute
![Page 16: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/16.jpg)
Estimated Timeline:
• Design Phase Jan-June
• Construction June – Sep
• Target: October 1, 2014
![Page 17: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/17.jpg)
The rest of this talk…
![Page 18: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/18.jpg)
How can we deliver 1000 little SDSSs to anyone who wants one?
![Page 19: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/19.jpg)
3 V’s of Big Data: Volume, Variety, Velocity
[Figure: # of bytes (vertical axis) vs. # of data sources (horizontal axis), comparing Astronomy and Ocean Sciences]
Astronomy
• LSST (~100PB; images, spectra)
• PanSTARRS (~40PB; images, trajectories)
• SDSS (~400TB; images, spectra, catalogs)
Ocean Sciences
• OOI (~50TB/year; sims, RSN)
• IOOS (~50TB/year; sims, satellite, gliders, AUVs, vessels, more)
• CMOP (~10TB/year; sims, stations, gliders, AUVs, vessels, more)
Data source labels in the figure: telescopes, spectra, n-body sims, models, AUVs, stations, cruises, CTDs, flow cytometry, gliders, ADCP, satellites
![Page 20: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/20.jpg)
How much time do you spend “handling data” as opposed to “doing science”?
Mode answer: “90%”
Key question: How can we reduce this “data overhead”?
![Page 21: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/21.jpg)
Simple Example
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
COGAnnotation_coastal_sample.txt
SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
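As a rough illustration of the join above, the sketch below loads two tiny made-up tables into an in-memory SQLite database and runs the same query. Only the table names and the `COG_hit` and `hit` columns come from the slide; the other columns (`orf`, `abundance`), the row values, and the helper name `join_on_cog_hit` are invented for the example, and SQLite is only a stand-in for whatever engine actually hosts the data.

```python
import sqlite3


def join_on_cog_hit(phaeo_rows, coastal_rows):
    """Join Phaeo_genome with coastal_sample on the COG hit, as in the slide's query."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schemas: only COG_hit and hit appear in the original query.
    conn.execute("CREATE TABLE Phaeo_genome (orf TEXT, COG_hit TEXT)")
    conn.execute("CREATE TABLE coastal_sample (hit TEXT, abundance INTEGER)")
    conn.executemany("INSERT INTO Phaeo_genome VALUES (?, ?)", phaeo_rows)
    conn.executemany("INSERT INTO coastal_sample VALUES (?, ?)", coastal_rows)
    rows = conn.execute(
        "SELECT * FROM Phaeo_genome p, coastal_sample c "
        "WHERE p.COG_hit = c.hit ORDER BY p.orf"
    ).fetchall()
    conn.close()
    return rows


# Made-up sample data for illustration only.
phaeo = [("orf1", "COG0001"), ("orf2", "COG0002"), ("orf3", "COG0009")]
coastal = [("COG0001", 12), ("COG0002", 7), ("COG0005", 3)]
result = join_on_cog_hit(phaeo, coastal)
```

Each result row carries the columns of both inputs, so annotations that share a COG hit end up side by side.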
![Page 22: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/22.jpg)
Data Science Workflow:
1) Preparing to run a model
Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging
“80% of the work” -- Aaron Kimball
2) Running the model
3) Interpreting the results
“The other 80% of the work”
[Diagram labels: DB, ML/Stats, Vis]
![Page 23: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/23.jpg)
“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval, dissection into data blocks and processing steps, order in which steps are performed to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with certain features (e.g. capping ENCODE data), testing features and feature products to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out how to do things (engineering) and 20% getting files and getting them into the right format.
I guess in total [I spent] 6 months [on this project].”
At least 3 months on issues of scale, file handling, and feature engineering.
Martin Kircher, Genome Sciences
Why?
• 3k NSF postdocs in 2010
• $50k / postdoc
• at least 50% overhead
maybe $75M annually at NSF alone?
![Page 24: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/24.jpg)
A typical Computer Science paper….
[Bar chart: Benchmark 1 and Benchmark 2; vertical axis 0 to 120; series: Old system, Your system, Our system]
slide src: Dan Halperin
![Page 25: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/25.jpg)
The reality of the situation….
[Bar chart: Benchmark 1 and Benchmark 2; vertical axis 0 to 12500; series: Old system, Your system, Our system, What people use]
slide src: Dan Halperin
![Page 26: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/26.jpg)
A modest goal:
Expose all the world’s science data through declarative query interfaces
![Page 27: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/27.jpg)
QUERY-AS-A-SERVICE
2010 - present
Version 1
![Page 28: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/28.jpg)
1) Upload data “as is”
Cloud-hosted, secure; no need to install or design a database; no pre-defined schema; schema inference; some integration
2) Write Queries
Right in your browser, writing views on top of views on top of views ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Make them public, tag them, share with specific colleagues – anyone with access can query
http://sqlshare.escience.washington.edu
![Page 29: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/29.jpg)
SELECT x.strain, x.chr, x.region as snp_region,
  x.start_bp as snp_start_bp, x.end_bp as snp_end_bp,
  w.start_bp as nc_start_bp, w.end_bp as nc_end_bp,
  w.category as nc_category,
  CASE
    WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1
    WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1
    WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1
  END AS len_overlap
FROM [[email protected]].[hotspots_deserts.tab] x
INNER JOIN [[email protected]].[table_noncoding_positions.tab] w
  ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
  OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
  OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Non-programmers can write very complex queries (rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results
We see thousands of queries written by non-programmers
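The core of the overlap query is computing how many positions two intervals share. A minimal sketch of that computation in Python follows, assuming inclusive base-pair coordinates (which the `+ 1` terms in the query suggest). The function name `overlap_length` is invented here, and the formula is the general closed-interval one rather than a transcription of the CASE expression: it also covers the situation where the second interval lies entirely inside the first, which the CASE expression does not list as a separate branch.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by closed intervals [a_start, a_end] and [b_start, b_end]."""
    lo = max(a_start, b_start)   # overlap begins at the later of the two starts
    hi = min(a_end, b_end)       # overlap ends at the earlier of the two ends
    return max(0, hi - lo + 1)   # 0 when the intervals do not touch


# Example: a SNP region 10..20 against a noncoding region 15..30 shares 15..20.
example = overlap_length(10, 20, 15, 30)
```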
![Page 30: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/30.jpg)
Howe, et al., CISE 2013
![Page 31: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/31.jpg)
Steven Roberts
SQL as a lab notebook: http://bit.ly/16Xj2JP
Popular service for Bioinformatics Workflows
![Page 32: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/32.jpg)
Halperin, Howe, et al. SSDBM 2013
![Page 33: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/33.jpg)
Two Problems with SQLShare
• No help for truly big datasets
• No help for “algorithmics”
Limitations of SQLShare
![Page 34: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/34.jpg)
Relational Algorithmics-as-a-Service
Version 2
http://myria.cs.washington.edu
![Page 35: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/35.jpg)
Myria is…
• MyriaQ: A compiler framework for multiple iterative RA-based languages and multiple big data back ends
• MyriaX: A parallel, shared-nothing, iterative execution engine
• MyriaWeb: A RESTful Analytics-as-a-Service platform and web-based interface
![Page 36: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/36.jpg)
Magda Balazinska, Bill Howe, and Dan Suciu
Dan Halperin (technical lead), Victor Almeida, Andrew Whitaker
PhD Students: Shumo Chu, Eric Gribkoff, Jeremy Hyrkas, Paris Koutris, Ryan Maas, Dominik Moritz, Laurel Orr, Jennifer Ortiz, Emad Soroush, Jingjing Wang, ShengLiang Xu
Undergraduate Students: Lee Lee Choo, Vaspol Ruamviboonsuk
Myria Team
![Page 37: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/37.jpg)
Myria Architecture
[Architecture diagram. Labeled components, grouped as they appear to be drawn:]
• Front end: Web UI; REST; input languages Datalog, SQL, MyriaL
• MyriaQ (Python): Language Parser, Myria Compiler, Logical Optimizer for RA+While
• Back ends: MyriaX (Java), C Compiler, Grappa, SciDB
• Coordinator: REST Server, Catalog; receives a json query plan
• Workers (…): each with a Worker Catalog and an RDBMS accessed via jdbc; HDFS; connected by netty protocols
![Page 38: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/38.jpg)
MyriaQ
[Diagram, top to bottom:]
• Applications: Oceanography, Astronomy, Biology, Medical Informatics
• Languages: SQL, Datalog, MyriaL, ??
• Relational Algebra + Iteration
• Compilers (one per back end)
• Back ends: Spark, Serial C++, Grappa, MyriaX, SQL
![Page 39: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/39.jpg)
Ex: SeaFlow
[Instrument diagram labels: Laser; Microscope Objective; Pinhole Lens; Nozzle; d1; d2; FSC (Forward scatter); Orange fluo; Red fluo]
Francois Ribalet
Jarred Swalwell
Ginger Armbrust
![Page 40: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/40.jpg)
Ex: SeaFlow
[Scatter plots: d1 / FSC vs. d2 / FSC, and RED fluorescence vs. FSC; labeled clusters: Picoplankton, Nanoplankton, Ultraplankton, Prochlorococcus, IS]
Continuous observations of various phytoplankton groups from 1-20 µm in size
• Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton
• Based on ORANGE fluo: Synechococcus, Cryptophytes
• Based on FSC: Coccolithophores
Francois Ribalet
Jarred Swalwell
Ginger Armbrust
![Page 41: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/41.jpg)
Ex: SeaFlow
Francois Ribalet
Jarred Swalwell
Ginger Armbrust
![Page 42: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/42.jpg)
SeaFlow in Myria
• “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler”
Dan Halperin Sophie Clayton
![Page 43: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/43.jpg)
1) BD experiments are ridiculously labor-intensive
– N systems x M real-world applications
– Big clusters and big datasets
2) No “one size fits all solution”
– Realistic environments will use more than one system
3) A return to distributed, federated databases
– Erase the distinction between ETL and Analytics
Why a big data middleware?
![Page 44: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/44.jpg)
[Timeline figure, 2008 to 2014, with two eras marked “The good old days” and “The age of uncertainty”. Legend: ~100x faster; faster; comparable or inconclusive]
Systems shown: Hadoop, Pregel (Malewicz), HaLoop (Bu), Spark (Zaharia), Vertica (Pavlo), SystemML (Ghoting), Hyracks (Borkar), GraphLab (Low), Cumulon (Huang), Giraph (Tian), Dremel (Melnik), SimSQL (Cai), epiC (Jiang), Impala (Cloudera), Shark (Xin), HIVE (Thusoo)
![Page 45: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/45.jpg)
What can we conclude?
Hadoop was probably just pretty bad
The rest of the story not so clear
![Page 46: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/46.jpg)
Relational Algebra is the Calculus of Big Data
• Hadoopspawn: Pig, HIVE, blah
• Hadoop contemporaries: Cascalog, Flume, blah
• Post-Hadoop: Spark/Shark, Dremel, blah
• etc.
![Page 47: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/47.jpg)
Google Big Data Systems
[Timeline figure, 2004 to 2012. Systems shown: MapReduce, BigTable, Hadoop, HBase, Pregel, Dremel, Tenzing, Megastore, Spanner, BigQuery]
[Legend: non-Google open source implementation; direct influence / shared features; compatible implementation of; SQL-like interface]
![Page 48: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/48.jpg)
Relational Algebra is the Calculus of Small Data
• Galaxy – “bioinformatics workflows”
“…Operate on Genomics Intervals -> Join”
• Pandas (Python)
merge(left, right, on='key')
• dplyr (R)
filter(x), select(x), arrange(x), group_by(x), inner_join(x, y), left_join(x, y), ….
• Manimal, Pyxis/StatusQuo, others
– Extract RA operators implemented manually in Java code
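To make the "small data" point concrete: a call such as `merge(left, right, on='key')` computes, by default, a relational inner equi-join. The sketch below spells that operator out in plain Python over lists of dicts. It is an illustration of the relational idea only, not a description of how pandas implements `merge`; the name `hash_join` is invented, and the sketch assumes the two inputs share no column names other than the key.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner equi-join of two lists of dicts on a shared column name."""
    index = defaultdict(list)
    for row in right:                      # build phase: index the right input by key
        index[row[key]].append(row)
    joined = []
    for lrow in left:                      # probe phase: look up each left row
        for rrow in index.get(lrow[key], []):
            merged = dict(rrow)
            merged.update(lrow)            # combine the columns of both rows
            joined.append(merged)
    return joined


# Made-up rows for illustration.
left_rows = [{"key": 1, "a": "x"}, {"key": 2, "a": "y"}]
right_rows = [{"key": 1, "b": 10}, {"key": 1, "b": 11}, {"key": 3, "b": 12}]
joined_rows = hash_join(left_rows, right_rows, "key")
```

Rows whose key appears on only one side (keys 2 and 3 here) drop out, which is the inner-join behavior.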
![Page 49: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/49.jpg)
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
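The rewrite can be followed one rule at a time. The sketch below writes each intermediate expression as a small Python function, in the order 1, 3, 4, 2 given on the slide, and checks that every step computes the same value. The function names are invented for the illustration. The check uses integers, where the laws hold exactly; with floating-point inputs the rewritten forms may differ in the last bits because of rounding.

```python
def original(z):
    return ((z * 2) + ((z * 3) + 0)) / 1


def after_rule_1(z):
    # Rule 1, (+) identity: (z*3) + 0 becomes z*3
    return ((z * 2) + (z * 3)) / 1


def after_rule_3(z):
    # Rule 3, (*) distributes: z*2 + z*3 becomes z*(2+3)
    return (z * (2 + 3)) / 1


def after_rule_4(z):
    # Rule 4, (*) commutes: z*(2+3) becomes (2+3)*z
    return ((2 + 3) * z) / 1


def after_rule_2(z):
    # Rule 2, (/) identity: x/1 becomes x
    return (2 + 3) * z


STEPS = [original, after_rule_1, after_rule_3, after_rule_4, after_rule_2]


def all_steps_agree(z):
    """True if every rewriting step yields the same value for this z."""
    values = [step(z) for step in STEPS]
    return all(v == values[0] for v in values)
```

A query optimizer applies the same kind of equivalence-preserving rewrites, only over relational operators instead of arithmetic ones.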
![Page 50: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/50.jpg)
A closer look at an example
ROI(id, start, stop) is a set of “regions of interest”
Read(id, start, stop) is a set of “reads” from sequencer
Task: For each region of interest, count the number of reads it contains
[Figure: a region of interest and a read drawn as intervals, each labeled with start and stop]
![Page 51: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/51.jpg)
As a query
SELECT roi.id, count(rd.id)
FROM regions_of_interest roi, reads rd
WHERE roi.start <= rd.start AND rd.[end] <= roi.[end]
GROUP BY roi.id
[Figure labels: “region of interest”, sequence “read”]
![Page 52: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/52.jpg)
Why databases get a bad reputation
SELECT roi.id, count(rd.start)
FROM regions_of_interest roi, reads rd
WHERE roi.start <= rd.start AND rd.[end] <= roi.[end]
GROUP BY roi.id
many minutes (one-sided index scan, plus filter)
SELECT roi.id, count(rd.start) as cnt
FROM regions_of_interest roi, indexed_reads rd
WHERE roi.start <= rd.start AND rd.start <= roi.[end]
  AND roi.start <= rd.[end] AND rd.[end] <= roi.[end]
GROUP BY roi.id
3 seconds! (two-sided index scan)
[Figure labels: roi, read]
The broken promise of declarative query…
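One reading of this slide is that the added predicates are logically redundant (a read's start is never after its end) but give the optimizer a bounded range on the indexed column. The Python sketch below mimics that effect outside a database: a sorted list stands in for an index on the read start, and binary search restricts the scan to reads whose start falls inside the region. Everything here (function names, tuple layouts, sample data) is assumed for illustration, and it says nothing about how any particular database engine plans the query.

```python
from bisect import bisect_left, bisect_right


def count_contained_naive(rois, reads):
    """For each ROI (id, start, end), count reads (start, end) lying fully inside it."""
    return {
        roi_id: sum(
            1 for r_start, r_end in reads
            if roi_start <= r_start and r_end <= roi_end
        )
        for roi_id, roi_start, roi_end in rois
    }


def count_contained_indexed(rois, reads):
    """Same counts, but scan only reads whose start lies in [roi.start, roi.end]."""
    by_start = sorted(reads)                      # stand-in for an index on reads.start
    starts = [r_start for r_start, _ in by_start]
    counts = {}
    for roi_id, roi_start, roi_end in rois:
        lo = bisect_left(starts, roi_start)       # first read with start >= roi.start
        hi = bisect_right(starts, roi_end)        # one past the last read with start <= roi.end
        counts[roi_id] = sum(1 for _, r_end in by_start[lo:hi] if r_end <= roi_end)
    return counts


# Made-up intervals; every read satisfies start <= end.
sample_rois = [("a", 0, 100), ("b", 50, 60)]
sample_reads = [(0, 10), (55, 58), (55, 65), (90, 110), (101, 105)]
```

Both functions return the same counts; the second touches only the slice of reads that can possibly qualify, which is the saving the bounded range makes possible.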
![Page 53: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/53.jpg)
Lowering barrier to entry
![Page 54: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/54.jpg)
Giving users insight
Shumo Chu Dominik Moritz
![Page 55: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/55.jpg)
Diagnosing problems
[Figure axes: Source node, Destination node]
Shumo Chu Dominik Moritz
![Page 56: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/56.jpg)
A = LOAD('points.txt', id:int, x:float, y:float)
E = LIMIT(A, 4);
F = SEQUENCE();
Centroids = [FROM E EMIT (id=F.next, x=E.x, y=E.y)];
Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)]
DO
  I = CROSS(Kmeans, Centroids);
  J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id, $distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];
  K = [FROM J EMIT id, distance=$min(distance)];
  L = JOIN(J, id, K, id)
  M = [FROM L WHERE J.distance <= K.distance EMIT (id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];
  Kmeans' = [FROM M EMIT (id, x, y, $min(cluster_id))];
  Delta = DIFF(Kmeans', Kmeans)
  Kmeans = Kmeans'
  Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
WHILE Delta != {}
K-Means in the language MyriaL
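For readers who do not know MyriaL, the same loop can be sketched in plain Python. This is an ordinary single-machine version of Lloyd's algorithm written to mirror the slide's steps (seed centroids from the first k points, assign each point to its nearest centroid with ties going to the smallest cluster id, recompute centroids, stop when no assignment changes). It is not the Myria implementation; the function name, the `max_iter` safeguard, and the empty-cluster handling are assumptions added here.

```python
from math import dist


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)            # like cluster_id=0 in the slide
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (ties go to the lowest id).
        new_assignment = [
            min(range(len(centroids)), key=lambda c: (dist(p, centroids[c]), c))
            for p in points
        ]
        changed = new_assignment != assignment  # plays the role of Delta != {}
        assignment = new_assignment
        # Recompute each centroid as the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                         # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if not changed:
            break
    return assignment, centroids


# Made-up points forming two obvious groups.
sample_points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
labels, centers = kmeans(sample_points, k=2)
```

The cross product, min-distance aggregate, and join in the MyriaL version correspond to the nearest-centroid search inside the list comprehension here.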
![Page 57: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/57.jpg)
CurGood = SCAN(public:adhoc:sc_points);
DO
  mean = [FROM CurGood EMIT val=AVG(v)];
  std = [FROM CurGood EMIT val=STDEV(v)];
  NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
  CurGood = CurGood - NewBad;
  continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);
Sigma-clipping, V0
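A plain Python sketch of the same loop may help readers unfamiliar with MyriaL: compute the mean and standard deviation, drop every value more than two standard deviations from the mean, and repeat until nothing is dropped. The function name and the `n_sigma` parameter are invented for the example, and the sketch assumes the sample standard deviation (`statistics.stdev`); the slides do not say which variant `STDEV` computes.

```python
from statistics import mean, stdev


def sigma_clip(values, n_sigma=2.0):
    """Repeatedly drop values more than n_sigma sample standard deviations from the mean."""
    good = list(values)
    while len(good) >= 2:                  # stdev needs at least two points
        m, s = mean(good), stdev(good)
        kept = [v for v in good if abs(v - m) <= n_sigma * s]
        if len(kept) == len(good):         # nothing new was clipped: converged
            break
        good = kept
    return good


# Made-up measurements with one obvious outlier.
measurements = [9, 10, 11, 10, 9, 11, 10, 10, 10, 1000]
clipped = sigma_clip(measurements)
```

Each pass recomputes both statistics from scratch over all surviving values, which is the cost the incremental version on the next slide avoids.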
![Page 58: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/58.jpg)
CurGood = P
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)]
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = []
DO
  sum = sum - [FROM NewBad EMIT SUM(val)];
  sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
  cnt = cnt - [FROM NewBad EMIT CNT(*)];
  mean = sum / cnt
  std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
  NewBad = FILTER([ABS(val-mean)>std], CurGood)
  CurGood = CurGood - NewBad
WHILE NewBad != {}
Sigma-clipping, V1: Incremental
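The incremental idea is that the mean and standard deviation can be derived from three running aggregates (sum, sum of squares, count), and those aggregates can be updated by subtracting the contribution of the newly clipped values instead of rescanning everything. A minimal Python sketch under those assumptions follows; the function name and the `n_sigma` parameter (defaulting to 2, as in V0) are invented here, and the variance formula is the one shown on the slide.

```python
from math import sqrt


def sigma_clip_incremental(values, n_sigma=2.0):
    """Sigma clipping that maintains sum, sum of squares, and count incrementally."""
    good = list(values)
    total = sum(good)
    total_sq = sum(v * v for v in good)
    count = len(good)
    while count >= 2:
        m = total / count
        # Sample variance from the aggregates: (n*sumsq - sum^2) / (n*(n-1))
        var = (count * total_sq - total * total) / (count * (count - 1))
        s = sqrt(max(var, 0.0))            # guard against tiny negative rounding error
        new_bad = [v for v in good if abs(v - m) > n_sigma * s]
        if not new_bad:                    # nothing new was clipped: converged
            break
        # Subtract only the clipped values' contributions from the aggregates.
        total -= sum(new_bad)
        total_sq -= sum(v * v for v in new_bad)
        count -= len(new_bad)
        good = [v for v in good if abs(v - m) <= n_sigma * s]
    return good


# Made-up measurements with one obvious outlier.
readings = [9, 10, 11, 10, 9, 11, 10, 10, 10, 1000]
kept_readings = sigma_clip_incremental(readings)
```

The aggregate updates touch only the clipped values, so their cost shrinks as the loop converges; the filter over the surviving values is still a full pass in this sketch.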
![Page 59: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/59.jpg)
Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
  new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
  aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum, sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
  stats = [FROM aggs EMIT mean=_sum/cnt, std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
  newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
  tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v AND v >= bounds.lower EMIT v=Points.v];
  tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v AND v <= bounds.upper EMIT v=Points.v];
  newBad = UNIONALL(tooLow, tooHigh);
  bounds = newBounds;
  continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower AND Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);
Sigma-clipping, V2
![Page 60: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/60.jpg)
• Hypothesis: Loops + RA covers everything anyone wants to do
– and it scales, it’s optimizable, and it’s accessible
• We can smooth the ROI curve for novices
– Start with simple queries…
– …end up working on advanced parallel algorithms
• “White Box Analytics”
– Compose queries, inspect plans, monitoring, debugging, “UDRs” – user-defined optimization rules
• Multiple languages, multiple backends, one data/query model
– Ask me about graph data
– Ask me about array data (or, rather, mesh data)
“Relational Algorithmics”
![Page 61: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/61.jpg)
Takeaways
• We hope to see “Data Science Environments” at universities worldwide
– We try to make our programs and activities reusable
• Software-as-a-service to reach the “long tail” of science
• “Relational Algorithmics”
– The relational algebra is the calculus of big data
– “It’s not just for databases anymore”
– Learn it, use it, teach it
– Myria is a platform for “relational algorithmics”
http://escience.washington.edu
@[email protected]
![Page 62: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/62.jpg)
![Page 63: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/63.jpg)
Maslow’s Needs Hierarchy
“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”
-- Maslow, 1943
![Page 64: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/64.jpg)
A “Needs Hierarchy” of Science Data Management
storage
sharing
query
integration
analytics
“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”
-- Maslow, 1943
![Page 65: XLDB South America Keynote: eScience Institute and Myria](https://reader038.vdocument.in/reader038/viewer/2022110306/554e75e5b4c9054a698b4d86/html5/thumbnails/65.jpg)
A “Needs Hierarchy” of Science Data Management
storage
sharing
integration
query
analytics
“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”
-- Maslow, 1943