www.twosigma.com
Huohua 火花: Distributed Time Series Analysis Framework for Spark
August 28, 2017
Wenbo Zhao
Spark Summit 2016
About Me
Software Engineer @ Two Sigma
Focus on analytics-related tools, libraries, and systems
We view everything as a time series
[Figure: S&P 500, 1/3/1950 to 1/3/2016; y-axis from $0.0 to $2,500.0]
Stock market prices
Temperatures
Sensor logs
Presidential polls
…
[Figure: temperatures in New York and San Francisco; y-axis from 50°F to 100°F]
What is a time series?
A sequence of observations obtained in successive time order
Our goal is to forecast future values given past observations
[Figure: corn price observations of $8.90, $8.95, $8.90, $9.06, and $9.10 on a time axis from 8:00 to 20:00, with the value to forecast marked "corn price?"]
Multivariate time series
We can forecast better by joining multiple time series
Temporal join is a fundamental operation for time series analysis
Huohua enables fast distributed temporal join of large-scale, unaligned time series
[Figure: corn price ($8.90, $8.95, $8.90, $9.06, $9.10) and temperature (75°F, 72°F, 71°F, 72°F, 68°F, 67°F, 65°F) on a shared time axis from 8:00 to 20:00]
What is temporal join?
A join function defined by a matching criterion over time
Examples of criteria
look-backward – find the most recent observation in the past
look-forward – find the closest observation in the future
[Figure: observations of time series 1 matched to observations of time series 2 under the look-forward and look-backward criteria]
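The two criteria can be sketched in a few lines of plain Python. This is only an illustration, not Huohua's API: the function name `temporal_join`, the use of minutes since midnight as timestamps, the sample values, and the choice to treat an observation at exactly the same time as a match are all assumptions.

```python
from bisect import bisect_left, bisect_right


def temporal_join(left, right, direction="backward"):
    """Pair each (time, value) row in `left` with one row from `right`.

    Both inputs must be sorted by time. Rows without a match get None.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, value in left:
        if direction == "backward":
            # Most recent right-hand observation at or before t.
            i = bisect_right(right_times, t) - 1
        else:
            # Closest right-hand observation at or after t.
            i = bisect_left(right_times, t)
        match = right[i][1] if 0 <= i < len(right) else None
        joined.append((t, value, match))
    return joined


# Times are minutes since midnight; all values are made up for illustration.
weather = [(480, 60), (600, 70), (720, 80)]   # 08:00, 10:00, 12:00
corn = [(480, 8.90), (660, 8.95)]             # 08:00, 11:00

backward = temporal_join(weather, corn, "backward")
forward = temporal_join(weather, corn, "forward")
```

With look-backward, the 10:00 weather row picks up the 08:00 corn price; with look-forward it picks up the 11:00 price, and the 12:00 row has no match.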
Temporal join with look-backward criteria
time      weather
08:00 AM  60 °F
10:00 AM  70 °F
12:00 PM  80 °F
time      corn price
08:00 AM
11:00 AM
time      weather  corn price
08:00 AM
10:00 AM
12:00 PM
Temporal join with look-backward criteria
time      weather
08:00 AM  60 °F
10:00 AM  70 °F
12:00 PM  80 °F
time      corn price
08:00 AM
11:00 AM
time      weather  corn price
08:00 AM  60 °F
10:00 AM  70 °F
12:00 PM  80 °F
…
…
Hundreds of thousands of data sources with unaligned timestamps
Thousands of market data sets
We need fast and scalable distributed temporal join
Issues with existing solutions
A single time series may not fit on a single machine
Forecasting may involve hundreds of time series
Existing packages don’t support temporal join or can’t handle large time series
MATLAB, R, SAS, pandas
Even Spark-based solutions fall short
PairRDDFunctions, DataFrame/Dataset, spark-ts
Huohua – a new time series library for Spark
Goal
provide a collection of functions to manipulate and analyze time series at scale
group, temporal join, summarize, aggregate …
How
build a time series aware data structure
extending RDD to TimeSeriesRDD
optimize using temporal locality
reduce shuffling
reduce memory pressure by streaming
What is a TimeSeriesRDD in Huohua?
TimeSeriesRDD extends RDD to represent time series data
associates a time range to each partition
tracks partitions’ time-ranges through operations
preserves the temporal order
[Figure: TimeSeriesRDD, combining operations and time series functions]
TimeSeriesRDD – an RDD representing time series
time      temperature
6:00 AM   60°F
6:01 AM   61°F
…         …
7:00 AM   70°F
7:01 AM   71°F
…         …
8:00 AM   80°F
8:01 AM   81°F
…         …
RDD partitions:
(6:00 AM, 60°F) (6:01 AM, 61°F) …
(7:00 AM, 70°F) (7:01 AM, 71°F) …
(8:00 AM, 80°F) (8:01 AM, 81°F) …
TimeSeriesRDD – an RDD representing time series
TimeSeriesRDD partitions, each with its time range:
range: [6:00 AM, 7:00 AM)   (6:00 AM, 60°F) (6:01 AM, 61°F) …
range: [7:00 AM, 8:00 AM)   (7:00 AM, 70°F) (7:01 AM, 71°F) …
range: [8:00 AM, ∞)         (8:00 AM, 80°F) (8:01 AM, 81°F) …
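As a rough sketch of what "associates a time range to each partition" could look like, the plain-Python structure below keeps one half-open range per partition and uses it to skip partitions that cannot contain a requested time. The class names `RangePartition` and `RangePartitionedSeries` and their methods are hypothetical and are not taken from Huohua; times are minutes since midnight.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class RangePartition:
    begin: int                      # inclusive start, minutes since midnight
    end: float                      # exclusive end; float("inf") if open-ended
    rows: list = field(default_factory=list)  # (time, value), sorted by time


class RangePartitionedSeries:
    """Hypothetical stand-in for a time-range-aware collection."""

    def __init__(self, partitions):
        # Partitions are assumed sorted, with disjoint ranges.
        self.partitions = partitions
        self._begins = [p.begin for p in partitions]

    def partition_for(self, t):
        """Find the partition whose [begin, end) range contains t."""
        i = bisect_right(self._begins, t) - 1
        if i < 0 or t >= self.partitions[i].end:
            raise KeyError(f"no partition covers time {t}")
        return self.partitions[i]

    def rows_between(self, start, stop):
        """Yield rows with start <= time < stop, skipping other partitions."""
        for p in self.partitions:
            if p.end <= start or p.begin >= stop:
                continue  # range does not overlap the query, so it is never scanned
            for t, v in p.rows:
                if start <= t < stop:
                    yield (t, v)


series = RangePartitionedSeries([
    RangePartition(360, 420, [(360, 60), (361, 61)]),           # [6:00, 7:00)
    RangePartition(420, 480, [(420, 70), (421, 71)]),           # [7:00, 8:00)
    RangePartition(480, float("inf"), [(480, 80), (481, 81)]),  # [8:00, ∞)
])
```

Because the ranges are known up front, a lookup such as `series.partition_for(425)` touches only the `[7:00, 8:00)` partition.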
Group function
A group function groups rows with exactly the same timestamps
time      city           temperature  group
1:00 PM   New York       70°F         group 1
1:00 PM   San Francisco  60°F         group 1
2:00 PM   New York       71°F         group 2
2:00 PM   San Francisco  61°F         group 2
3:00 PM   New York       72°F         group 3
3:00 PM   San Francisco  62°F         group 3
4:00 PM   New York       73°F         group 4
4:00 PM   San Francisco  63°F         group 4
Group function
A group function groups rows with nearby timestamps
time city temperature
1:00 PM New York 70°F
1:00 PM San Francisco 60°F
2:00 PM New York 71°F
2:00 PM San Francisco 61°F
3:00 PM New York 72°F
3:00 PM San Francisco 62°F
4:00 PM New York 73°F
4:00 PM San Francisco 63°F
[Figure: the rows above bracketed into two groups, group 1 and group 2]
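Grouping by nearby timestamps can be sketched as bucketing each row into a fixed-width interval. The snippet below is a minimal illustration under assumptions: the function `group_by_interval`, its `origin` parameter, and the two-hour bucket width are invented here, and the slide does not say which rows fall into which group. Times are minutes since midnight, taken from the table above.

```python
from itertools import groupby


def group_by_interval(rows, interval, origin=0):
    """Group time-sorted (time, city, temp) rows into fixed-width intervals.

    Returns a list of (interval_start, rows_in_interval) pairs.
    """
    def bucket(row):
        # Floor the timestamp to the start of its interval, counted from origin.
        return origin + (row[0] - origin) // interval * interval

    return [(start, list(members)) for start, members in groupby(rows, key=bucket)]


rows = [
    (780, "New York", 70), (780, "San Francisco", 60),   # 1:00 PM
    (840, "New York", 71), (840, "San Francisco", 61),   # 2:00 PM
    (900, "New York", 72), (900, "San Francisco", 62),   # 3:00 PM
    (960, "New York", 73), (960, "San Francisco", 63),   # 4:00 PM
]

exact = group_by_interval(rows, 1)                 # one group per distinct minute
nearby = group_by_interval(rows, 120, origin=780)  # two-hour buckets from 1:00 PM
```

With a one-minute interval this reduces to grouping rows with exactly the same timestamp (four groups); with the two-hour interval the same rows collapse into two groups.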
Group in Spark
Groups rows with exactly the same timestamps
[Figure: an RDD holding rows with timestamps 1:00PM, 2:00PM, 2:00PM, 1:00PM, 3:00PM, 3:00PM, 4:00PM, 4:00PM]
Data is shuffled and materialized
[Figure: RDD → groupBy → RDD; rows with equal timestamps are brought together as (1:00PM, 1:00PM), (2:00PM, 2:00PM), (3:00PM, 3:00PM), (4:00PM, 4:00PM)]
Temporal order is not preserved
[Figure: the grouped RDD holds (2:00PM, 2:00PM), (4:00PM, 4:00PM), (1:00PM, 1:00PM), (3:00PM, 3:00PM)]
Another sort is required
[Figure: RDD → groupBy → RDD → sortBy → RDD]
Back to correct temporal order
[Figure: the sorted RDD holds (1:00PM, 1:00PM), (2:00PM, 2:00PM), (3:00PM, 3:00PM), (4:00PM, 4:00PM)]
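The effect described on these slides can be imitated without Spark. The following is a toy simulation, not Spark's actual groupBy implementation: `hash_group`, the modulo partitioner, the input partition layout, and the row labels are all assumptions made for illustration.

```python
def hash_group(partitions, num_outputs):
    """Simulate a hash-partitioned groupBy over timestamped rows.

    Every row moves to the output partition chosen from its key (the
    "shuffle"), and each group is built fully in memory.
    """
    outputs = [{} for _ in range(num_outputs)]
    for partition in partitions:
        for t, value in partition:
            # t % num_outputs is a stand-in for a hash partitioner.
            outputs[t % num_outputs].setdefault(t, []).append(value)
    # Read the output partitions back in partition order.
    return [(t, vs) for out in outputs for t, vs in out.items()]


# Timestamps are hours (1 = 1:00PM, ...); values are made-up labels.
partitions = [
    [(1, "a"), (2, "b"), (2, "c"), (1, "d")],
    [(3, "e"), (3, "f"), (4, "g"), (4, "h")],
]

grouped = hash_group(partitions, num_outputs=2)
ordered = sorted(grouped, key=lambda g: g[0])   # the extra sortBy step
```

Here `grouped` comes back keyed 2, 4, 1, 3, so a second pass (`sorted`) is needed to restore temporal order.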
Group in Huohua
Data is grouped locally as streams
[Figure: a TimeSeriesRDD holding rows with timestamps 1:00PM, 2:00PM, 2:00PM, 1:00PM, 3:00PM, 3:00PM, 4:00PM, 4:00PM; rows with equal timestamps are grouped step by step within their own partitions]
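A minimal sketch of grouping locally as a stream is shown below. It assumes that rows inside a partition are sorted by time and that rows sharing a timestamp never straddle two partitions; the function names `group_partition` and `group_all` are hypothetical and do not come from Huohua.

```python
from itertools import groupby


def group_partition(rows):
    """Lazily yield (timestamp, values) groups from one time-sorted partition.

    Only the current group is held in memory; nothing leaves the partition.
    """
    for t, members in groupby(rows, key=lambda row: row[0]):
        yield t, [value for _, value in members]


def group_all(partitions):
    """Chain per-partition groups.

    The output stays in temporal order because partitions are assumed to be
    time-sorted with disjoint, increasing ranges.
    """
    for partition in partitions:
        yield from group_partition(partition)


# Timestamps are hours (1 = 1:00PM, ...); values are made-up labels.
partitions = [
    [(1, "a"), (1, "d"), (2, "b"), (2, "c")],   # range [1:00PM, 3:00PM)
    [(3, "e"), (3, "f"), (4, "g"), (4, "h")],   # range [3:00PM, ∞)
]
groups = list(group_all(partitions))
```

No row crosses a partition boundary and no sort is needed afterwards, which is the contrast with the shuffle-based approach.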
Benchmark for group
Running time of count after group
16 executors (10G memory and 4 cores per executor)
data is read from HDFS
[Figure: running time (0s to 100s) against input size (20M to 100M) for RDD, DataFrame, and TimeSeriesRDD]
Temporal join
A temporal join function is defined by a matching criterion over time
A typical matching criterion has two parameters
direction – whether it should look backward or look forward
window – how far it should look backward or look forward
[Figure: look-backward temporal join with a window]
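The role of the window parameter can be illustrated with a small look-backward join in plain Python. This is a sketch under assumptions: the function `look_backward_join`, the inclusive window bounds `[t - window, t]`, and the sample data are invented for illustration and are not Huohua's API.

```python
from bisect import bisect_right


def look_backward_join(left, right, window):
    """Match each left row to the most recent right row within the window.

    A right row at time r matches a left row at time t when
    t - window <= r <= t. Left rows with no such observation get None.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, value in left:
        i = bisect_right(right_times, t) - 1
        in_window = i >= 0 and t - right_times[i] <= window
        joined.append((t, value, right[i][1] if in_window else None))
    return joined


# Times are in arbitrary units; values are made-up labels.
left = [(10, "a"), (20, "b"), (30, "c")]
right = [(8, "x"), (19, "y"), (21, "z")]

wide = look_backward_join(left, right, window=5)
narrow = look_backward_join(left, right, window=1)
```

Shrinking the window from 5 to 1 drops the match for the row at time 10, whose nearest earlier observation is 2 units old.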
Temporal join
Temporal join with look-backward criteria and a window of length 1
[Figure: two time series, one with observations at 1:00AM, 2:00AM, 4:00AM, 5:00AM and the other at 1:00AM, 3:00AM, 5:00AM]
How do we do temporal join in TimeSeriesRDD?
Temporal join in Huohua
Temporal join with look-backward criteria and a window of length 1
Partition time space into disjoint intervals
Build dependency graph for the joined TimeSeriesRDD
[Figure: the left TimeSeriesRDD (1:00AM, 2:00AM, 4:00AM, 5:00AM), the right TimeSeriesRDD (1:00AM, 3:00AM, 5:00AM), and the joined TimeSeriesRDD]
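One way to picture the dependency graph is as a map from each interval of the joined output to the input partitions whose time ranges it needs. The sketch below is an assumption about how such a graph could be derived, not a description of Huohua's internals; `overlapping`, `join_dependencies`, and the partition boundaries are hypothetical. Times are hours.

```python
def overlapping(ranges, begin, end):
    """Indices of half-open ranges [b, e) that intersect [begin, end)."""
    return [i for i, (b, e) in enumerate(ranges) if b < end and e > begin]


def join_dependencies(left_ranges, right_ranges, splits, window):
    """Map each joined interval to the input partitions it must read.

    `splits` are the boundaries of the disjoint output intervals. A
    look-backward join reads right-hand rows as far back as begin - window.
    """
    deps = {}
    for begin, end in zip(splits, splits[1:]):
        deps[(begin, end)] = {
            "left": overlapping(left_ranges, begin, end),
            "right": overlapping(right_ranges, begin - window, end),
        }
    return deps


# Hypothetical partition boundaries for the two inputs.
left_ranges = [(1, 3), (3, 6)]     # holds 1:00, 2:00 | 4:00, 5:00
right_ranges = [(1, 4), (4, 6)]    # holds 1:00, 3:00 | 5:00
deps = join_dependencies(left_ranges, right_ranges, splits=[1, 4, 6], window=1)
```

The output interval `[4, 6)` depends on both right-hand partitions, because the row at 4:00 has to look back one hour into the partition that holds 3:00.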
Temporal join in Huohua
Temporal join with look-backward criteria and window 1
Join data as streams per partition
[Figure: the joined TimeSeriesRDD is filled in step by step, pairing left and right rows as (1:00AM, 1:00AM), (2:00AM, 1:00AM), (4:00AM, 3:00AM), (5:00AM, 5:00AM)]
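Joining data as streams can be sketched as a single forward pass over two time-sorted inputs. The generator below is an illustration only: the name `stream_look_backward`, the inclusive window, and the row labels are assumptions, while the timestamps (in hours) are the ones on the slide.

```python
def stream_look_backward(left, right, window):
    """Lazily join two time-sorted iterables in one forward pass.

    Keeps only the latest right-hand row seen so far, so memory use does
    not grow with the input size.
    """
    right_iter = iter(right)
    pending = next(right_iter, None)   # next unread right-hand row
    latest = None                      # most recent right row at or before t
    for t, value in left:
        # Advance the right stream up to and including time t.
        while pending is not None and pending[0] <= t:
            latest = pending
            pending = next(right_iter, None)
        if latest is not None and t - latest[0] <= window:
            yield (t, value, latest[0], latest[1])
        else:
            yield (t, value, None, None)


left = [(1, "l1"), (2, "l2"), (4, "l4"), (5, "l5")]
right = [(1, "r1"), (3, "r3"), (5, "r5")]
pairs = [(lt, rt) for lt, _, rt, _ in stream_look_backward(left, right, window=1)]
```

The resulting `pairs` reproduce the matches on the slide: 1:00 with 1:00, 2:00 with 1:00, 4:00 with 3:00, and 5:00 with 5:00.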
Benchmark for temporal join
Running time of count after temporal join
16 executors (10G memory and 4 cores per executor)
data is read from HDFS
[Figure: running time (0s to 100s) against input size (20M to 100M) for RDD, DataFrame, and TimeSeriesRDD]
Functions over TimeSeriesRDD
group functions such as window, intervalization, etc.
temporal joins such as look-forward, look-backward, etc.
summarizers such as average, variance, z-score, etc. over windows, intervals, and cycles
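As an example of a summarizer over windows, the sketch below computes a look-backward windowed average in one streaming pass. The function `windowed_mean`, the inclusive window `[t - window, t]`, and the sample temperatures are assumptions for illustration and do not reflect Huohua's summarizer API.

```python
from collections import deque


def windowed_mean(rows, window):
    """Average, for each time-sorted (time, value) row, the values whose
    time lies in [t - window, t], using one streaming pass."""
    buffer = deque()   # rows currently inside the window
    total = 0.0
    out = []
    for t, value in rows:
        buffer.append((t, value))
        total += value
        # Evict rows that have fallen out of the window.
        while buffer[0][0] < t - window:
            _, old_value = buffer.popleft()
            total -= old_value
        out.append((t, total / len(buffer)))
    return out


# Times are in arbitrary units; temperatures are made up.
temps = [(0, 60.0), (1, 62.0), (2, 64.0), (5, 70.0)]
means = windowed_mean(temps, window=2)
```

Because rows arrive in temporal order, the running total only needs the rows still inside the window, which fits the streaming style used for group and join.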
Open Source
Not quite yet …
https://github.com/twosigma
Future work
DataFrame / Dataset integration
Speed up
Richer APIs
Python bindings
More summarizers
Key contributors
Christopher Aycock
Jonathan Coveney
Jin Li
David Medina
David Palaitis
Ris Sawyer
Leif Walsh
Wenbo Zhao
Thank you
Q&A
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy
any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for investment
advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”).
Such views reflect significant assumptions and subjective of the author(s) of the document and are subject to change without notice. The
document may employ data derived from third-party sources. No representation is made as to the accuracy of such information and the use of
such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If
so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and
comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any
association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.