Post on 31-Mar-2015
1
The Stream Star Schema
Stephen A. Broeker
2
Conclusion
The Stream Star Schema processes data streams in real time, at rates up to gigabits per second.
Stream Star insert time is O(1): it does not grow with table size.
3
Large Fast Dynamic Data Streams
Phone calls, road traffic, network traffic, website traffic, power supplies, credit card transactions, sensor arrays, and financial markets are data rich, but poor in real-time analysis.
5
Large Fast Dynamic Data Streams
What are the consequences?
6
Large Fast Dynamic Data Streams
Patterns are hard to see, so problems are difficult to detect.
7
Challenge of Network Monitoring
Network monitoring at high speed is difficult:
Bits arrive every nanosecond on a 1 Gbps NIC, so a minimum-size Ethernet frame can arrive about every 672 ns.
Per-packet processing must use SRAM.
The traditional solution, sampling, is inherently inaccurate because data is lost.
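The per-packet time budget on this slide can be estimated with a short calculation. This is a minimal sketch, assuming standard Ethernet framing (a 64-byte minimum frame plus 8 bytes of preamble and a 12-byte inter-frame gap). The function and constant names are illustrative and do not come from the talk.

```python
# Sketch: time budget per packet on an Ethernet link.
# Assumes standard Ethernet framing overhead; all names are illustrative.

PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames

def packet_interval_ns(frame_bytes: int, link_bps: float) -> float:
    """Nanoseconds between back-to-back frames of the given size."""
    bits_on_wire = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return bits_on_wire / link_bps * 1e9

min_frame = packet_interval_ns(64, 1e9)     # smallest Ethernet frame
max_frame = packet_interval_ns(1518, 1e9)   # largest standard frame
print(f"{min_frame:.0f} ns, {max_frame:.0f} ns")
```

Under these assumptions, the worst case leaves well under a microsecond to process each packet, which is why fast memory matters.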
8
Vision
Achieve real-time OLAP for massive data streams.
Achieve cybernetic control for systems that depend on rapid data analysis.
9
Detection
10
Forensics
11
Data Rates versus Data Storage
Data RATES are measured in bits per second (lowercase 'b').
Data STORAGE is measured in Bytes (uppercase 'B').
So, Gigabits (Gb) ≠ Gigabytes (GB).
13
Data Storage based on Data Rate
An Ethernet Network Interface Card transfers data at 1 Gbps.
Data accumulates at 450 GB per hour (125 MB per second).
That's 10.5 TB per day, 73.8 TB per week!
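The storage figures follow from converting the bit rate to bytes and multiplying by time. The sketch below assumes a fully saturated 1 Gbps link and uses 1 TB = 1024 GB, which appears to be the convention that yields the slide's rounded numbers. The variable names are illustrative.

```python
# Sketch: how much data a saturated link accumulates over time.
# Assumes 1 TB = 1024 GB, which reproduces the slide's rounded figures.

def bytes_per_second(link_bps: float) -> float:
    """Convert a data rate in bits per second to bytes per second."""
    return link_bps / 8

rate = bytes_per_second(1e9)            # 1 Gbps -> 125,000,000 bytes/s
gb_per_hour = rate * 3600 / 1e9         # decimal gigabytes per hour
tb_per_day = gb_per_hour * 24 / 1024    # terabytes per day
tb_per_week = tb_per_day * 7

print(f"{gb_per_hour:.0f} GB/hour, {tb_per_day:.1f} TB/day, {tb_per_week:.1f} TB/week")
```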
14
Picturing Orders of Magnitude
What if BYTES were pennies?
Used with permission: © Copyright 2001 Alan Taylor – The Mega Penney Project - KOKOGIAK MEDIA
$10^6 \approx 2^{20}$, $10^9 \approx 2^{30}$, $10^{12} \approx 2^{40}$, $10^{15} \approx 2^{50}$
At 1 Gbps, about 0.3 PB accumulate per month.
18
Picturing Orders of Magnitude
What if BYTES were pennies?
$10^{18} \approx 2^{60}$
19
From Streaming Data to Database
The network stream is segmented into flows, which are inserted into a database.
Observed database input rate for a 1 Gb Ethernet NIC: 700,000 flows per hour.
Existing databases can't keep up!
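The talk does not define how packets are grouped into flows. A common convention, assumed here, is to key each flow by the 5-tuple of source address, destination address, source port, destination port, and protocol. The `Packet` type and `segment_into_flows` function are hypothetical names for illustration only.

```python
# Sketch: segment a packet stream into flows keyed by the 5-tuple.
# The 5-tuple definition of a flow is an assumption, not taken from the talk.
from collections import defaultdict
from typing import NamedTuple

class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    size: int  # bytes carried by this packet

def segment_into_flows(packets):
    """Group packets by 5-tuple and return {flow_key: total_bytes}."""
    flows = defaultdict(int)
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        flows[key] += p.size
    return dict(flows)

packets = [
    Packet("10.0.0.1", "10.0.0.2", 5000, 25, "tcp", 100),
    Packet("10.0.0.1", "10.0.0.2", 5000, 25, "tcp", 250),
    Packet("10.0.0.3", "10.0.0.2", 6000, 80, "tcp", 40),
]
flows = segment_into_flows(packets)
print(len(flows))
```

Each resulting flow record, rather than each packet, is what would be inserted into the database.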
20
Consider 2 Database Schemas
Disk Star Schema
STREAM Star Schema
21
Disk Star Schema
From Fact Table to Dimension Tables
So where's the star?
Fact Table: Content, Destination IP, Sender, Recipient, Subject
Dimension Tables: Content Table, Destination IP Table, Sender Table, Recipient Table, Subject Table
Here's the star.
That's all there is to the "star" concept.
22
Value of the Disk Star Schema
Conserve Disk Space
23
Dimensions
Each Dimension gets a key.
24
Resulting in a Dimension Table
1NF: No Repeating Groups
25
Thus deriving a Fact Table.
Substitute Keys for Facts
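The three steps above (give each dimension value a key, collect the values in a dimension table, substitute keys into the fact table) can be sketched in a few lines. The attribute names follow the slides, but the sample records and the `normalize` helper are made up for illustration. The in-memory dictionaries stand in for on-disk tables and do not model their lookup cost.

```python
# Sketch: build dimension tables and a fact table of keys from raw records.
# Sample data and helper names are hypothetical.

ATTRIBUTES = ["sender", "recipient", "subject"]

def normalize(records):
    """Return (dimension_tables, fact_table) for a list of dict records."""
    dimensions = {attr: {} for attr in ATTRIBUTES}   # per attribute: value -> key
    facts = []
    for record in records:
        row = []
        for attr in ATTRIBUTES:
            table = dimensions[attr]
            value = record[attr]
            if value not in table:
                table[value] = len(table) + 1        # assign the next unused key
            row.append(table[value])                 # substitute key for value
        facts.append(tuple(row))
    return dimensions, facts

records = [
    {"sender": "alice", "recipient": "bob", "subject": "hello"},
    {"sender": "alice", "recipient": "carol", "subject": "hello"},
]
dimensions, facts = normalize(records)
print(facts)
```

Each distinct string is stored once in its dimension table, and the fact table holds only small integer keys, which is how the schema conserves space.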
26
Disk Star Schema = slow data insertion.
Relational databases are normalized to conserve space, and speed is sacrificed.
So real-time analysis is compromised.
Slow. Bottleneck.
27
Disk Star Schema
31
Bottleneck
Dimension table insertion time depends on the table size: it is $O(\log n)$, where $n$ is the number of records in the table.
Disk Star Schema insertion time is the sum of all dimension table insert times, $O\left(\sum_{i=1}^{l} \log n_i\right)$, where $l$ is the number of attributes in the database and $n_i$ is the number of values for attribute $i$.
Can't fill dimension tables fast enough!
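To see how the summed logarithmic cost behaves, the formula can be evaluated for a few table sizes. This sketch assumes base-2 logarithms and treats the result as an approximate count of key comparisons per fact insert. Both choices are illustrative, since big-O notation ignores constant factors.

```python
# Sketch: evaluate sum(log2(n_i)) over all dimension tables.
# The base-2 log and "comparisons" interpretation are assumptions.
import math

def disk_star_insert_cost(dimension_sizes):
    """Approximate comparisons for one fact insert: sum of log2(n_i)."""
    # Tables with 0 or 1 rows are skipped; they would contribute nothing.
    return sum(math.log2(n) for n in dimension_sizes if n > 1)

small = disk_star_insert_cost([1_024] * 5)        # five tables of 2^10 rows
large = disk_star_insert_cost([1_048_576] * 5)    # five tables of 2^20 rows
print(small, large)
```

The cost grows with both the number of attributes and the size of each dimension table, so it keeps rising as the stream delivers new values.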
32
Short Pause to Review Numbers
1,000,000,000 bits per second Ethernet NIC (1 Gbps)
700,000 observed flows per hour
450 GB per hour, 10.5 TB a day
All we can get is a snapshot analysis!
33
Consider 2 Database Schemas
Disk Star Schema
STREAM Star Schema
34
Stream Star Schema
37
Disk Star Schema
Nearly 1:1 correspondence between string attributes and dimension tables.
38
Disk Star Schema
Two kinds of tables: fact, dimension.
All string dimensions have dimension tables.
Minimize disk space.
Dimension tables can be large.
Long insert time = $O\left(\sum_{i=1}^{l} \log n_i\right)$.
No string duplication.
39
Stream Star Schema
Many:1
40
Stream Star Schema
Three kinds of tables: fact, dimension, string.
Few dimension tables.
Dimension tables are small.
Minimizes insertion time.
Insert time is constant.
Allow string duplication.
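The slides do not show how the string table is implemented. The sketch below contrasts the two insert paths under one plausible reading: a disk-star dimension stores each distinct string once and must search for it, while a stream-star string table simply appends every string and accepts duplicates. `SortedDimension` and `StringTable` are hypothetical names. The sorted Python list stands in for an index such as a B-tree.

```python
# Sketch: searched insert (disk-star style) versus append-only insert
# (assumed stream-star style). Class names and layout are hypothetical.
import bisect

class SortedDimension:
    """Each distinct string is stored once and located by binary search."""
    def __init__(self):
        self.values = []   # distinct strings, kept sorted
        self.keys = []     # keys[i] is the key assigned to values[i]

    def insert(self, value: str) -> int:
        i = bisect.bisect_left(self.values, value)    # O(log n) search
        if i < len(self.values) and self.values[i] == value:
            return self.keys[i]                       # already present
        key = len(self.values) + 1
        self.values.insert(i, value)   # a real index would avoid this O(n) shift
        self.keys.insert(i, key)
        return key

class StringTable:
    """Every string is appended; duplicates are allowed, so no search is needed."""
    def __init__(self):
        self.rows = []

    def insert(self, value: str) -> int:
        self.rows.append(value)        # amortized O(1)
        return len(self.rows)          # row number serves as the key

dim, strings = SortedDimension(), StringTable()
subjects = ["hello", "status", "hello"]
dim_keys = [dim.insert(s) for s in subjects]
str_keys = [strings.insert(s) for s in subjects]
print(dim_keys, str_keys)
```

The trade-off matches the slide: the string table uses more space because "hello" is stored twice, but its insert cost does not depend on how many strings are already stored.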
41
Side x Side Comparison
Disk Star Schema: Slow, Old
Stream Star Schema: Fast, New
42
Test Results
43
Test Results
The magnified area is different because I measured the insert time for (1, 10, 100) streams as opposed to (1000, 2000, 3000) streams.
44
Test Results
The magnified area is different because of how MySQL works. I can only present a hypothesis since I don't have the MySQL source code, but I suspect that MySQL is optimized for fewer than 100 streams for this problem.
45
Conclusion
46
Conclusion
The Stream Star Schema processes data streams in real time, at rates up to gigabits per second.
Stream Star insert time is O(1): it does not grow with table size.
47
Hope
Detection
Forensics
RFID
48
There’s data flow
49
And then there’s DATA FLOW!
50
Disk Star Schema handles 3 million flows per hour, about this much.
51
The Stream Star Schema handles 113 million flows per hour!
52
Nearly 40x Faster!
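The two throughput figures can be turned into an average time per flow and a speedup ratio. This is a small sketch using only the numbers quoted on the slides. The function name is illustrative.

```python
# Sketch: convert flows per hour into an average time per flow insert.
# Uses only the throughput figures quoted on the slides.

def per_flow_budget_us(flows_per_hour: float) -> float:
    """Average microseconds per flow at the given sustained throughput."""
    return 3600 / flows_per_hour * 1e6

disk = per_flow_budget_us(3e6)       # Disk Star Schema: 3 million flows/hour
stream = per_flow_budget_us(113e6)   # Stream Star Schema: 113 million flows/hour
speedup = 113e6 / 3e6

print(f"{disk:.0f} us, {stream:.1f} us, {speedup:.1f}x")
```

The ratio comes out just under 38, consistent with the "nearly 40x" claim.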
53
For The Future
Implement the Stream Star Schema in the Cloud.
Use multiple Stream Star Schema computer nodes to handle an infinite stream. Storage could be handled similarly to S3.
54
For The Future
The Stream Star Schema fully supports the analysis of high-speed data streams, thus enabling security applications and forensic processing.
55
END