Data Warehousing
Technology Layout
Two-Level Computing
Large Data (10TB) and Mixed Workloads
Rough Sets
No.  Outlook   Temp.  Humid.  Wind    Sport?
1    Sunny     Hot    High    Weak    No
2    Sunny     Hot    High    Strong  No
3    Overcast  Hot    High    Weak    Yes
4    Rain      Mild   High    Weak    Yes
5    Rain      Cold   Normal  Weak    Yes
6    Rain      Cold   Normal  Strong  No
7    Overcast  Cold   Normal  Strong  Yes
8    Sunny     Mild   High    Weak    No
9    Sunny     Cold   Normal  Weak    Yes
10   Rain      Mild   Normal  Weak    Yes
11   Sunny     Mild   Normal  Strong  Yes
12   Overcast  Mild   High    Strong  Yes
13   Overcast  Hot    Normal  Weak    Yes
14   Rain      Mild   High    Strong  No
Sport? = Yes
Classes of records with the same values on a subset of the attributes
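The idea of approximating the concept Sport? = Yes by classes of indiscernible records can be sketched in a few lines of Python. This is a minimal illustration rather than anything taken from the slides: the function name `approximations` and the choice of attribute subsets are assumptions made here. Records that agree on the chosen attributes form one class; classes lying fully inside the concept make up its lower approximation, and classes that overlap it make up its upper approximation.

```python
from collections import defaultdict

# The 14 records from the table above: (Outlook, Temp., Humid., Wind, Sport?)
RECORDS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cold", "Normal", "Weak", "Yes"),
    ("Rain", "Cold", "Normal", "Strong", "No"),
    ("Overcast", "Cold", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cold", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
COLUMNS = ("Outlook", "Temp.", "Humid.", "Wind")


def approximations(attributes, decision="Yes"):
    """Return (lower, upper) approximations of Sport? = decision.

    Hypothetical helper: records are identified by their row number (1-based).
    """
    idx = [COLUMNS.index(a) for a in attributes]

    # Group record numbers by their values on the chosen attributes.
    classes = defaultdict(set)
    for number, row in enumerate(RECORDS, start=1):
        classes[tuple(row[i] for i in idx)].add(number)

    target = {n for n, row in enumerate(RECORDS, start=1) if row[4] == decision}

    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:   # class fully inside the concept
            lower |= members
        if members & target:    # class overlapping the concept
            upper |= members
    return lower, upper


lower, upper = approximations(("Outlook", "Humid."))
print(sorted(lower))          # [3, 7, 9, 11, 12, 13]
print(sorted(upper - lower))  # boundary region: [4, 5, 6, 10, 14]
```

With Outlook alone, only the Overcast class is pure, so the lower approximation shrinks to records 3, 7, 12, and 13 while the upper approximation covers all 14 records.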
Information Systems
Data-based knowledge models, classifiers...
Database indices, data partitioning, data sorting...
Difficulty with fast updates of structures...
Rough Sets in Infobright
Packs storing the values of records for column Salary
Consider the set of all records relevant to a given query, that is, the records satisfying its SQL filter:
SELECT COUNT(*) FROM Employees WHERE Salary > $
Using the Knowledge Grid, we verify which packs are irrelevant (disjoint from the set), relevant (fully inside the set), and suspect (overlapping the set).
We do not need irrelevant packs. We do not need to decompress relevant ones either: their local COUNT(*) is stored in the corresponding Data Pack Nodes. Only suspect packs have to be decompressed and checked record by record.
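A rough sketch of this pack classification in Python follows. It is only an illustration under stated assumptions: the `DataPackNode` class, its field names, the sample salaries, and the threshold of 50000 are all invented here, the column is assumed to contain no NULLs, and plain lists stand in for compressed packs.

```python
from dataclasses import dataclass


@dataclass
class DataPackNode:
    """Hypothetical per-pack statistics; field names are illustrative."""
    minimum: float
    maximum: float
    count: int  # number of values in the pack (no NULLs assumed)


def classify(node, threshold):
    """Classify a pack against the filter Salary > threshold."""
    if node.maximum <= threshold:
        return "irrelevant"  # no record in the pack can satisfy the filter
    if node.minimum > threshold:
        return "relevant"    # every record in the pack satisfies the filter
    return "suspect"         # some records may satisfy it


def rough_count(nodes, packs, threshold):
    """Return (COUNT(*), number of packs that had to be opened)."""
    total = 0
    opened = 0
    for node, pack in zip(nodes, packs):
        status = classify(node, threshold)
        if status == "relevant":
            total += node.count  # answered from the node alone
        elif status == "suspect":
            opened += 1          # stands in for decompressing the pack
            total += sum(1 for value in pack if value > threshold)
    return total, opened


# Invented sample data: three small packs of Salary values.
PACKS = [[30000, 40000, 45000], [48000, 52000, 70000], [80000, 90000, 95000]]
NODES = [DataPackNode(min(p), max(p), len(p)) for p in PACKS]

total, opened = rough_count(NODES, PACKS, threshold=50000)
print(total, opened)  # 5 1
```

Here the first pack is irrelevant, the third is relevant and contributes its stored count, and only the second, suspect pack is actually read.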
Information Systems in Infobright
[Figure: diagram with the labels Query, OUT, min, max, Nulls, sum, match, pattern, and ???]
SELECT MAX(A) FROM T WHERE B>15;
[Figure: panels labeled STEP 1, STEP 2, STEP 3, and DATA]
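One plausible reading of such a stepwise evaluation, sketched in Python: first classify packs by the filter on B using only their min/max, then derive a lower bound for MAX(A) from the relevant packs, and finally open only those suspect packs whose maximum of A could still beat that bound. The function `rough_max`, the sample packs, and the exact split into three steps are assumptions made for illustration, and NULLs are ignored.

```python
def rough_max(a_stats, b_stats, read_pack, limit=15):
    """Evaluate SELECT MAX(A) FROM T WHERE B > limit using pack statistics.

    a_stats, b_stats: per-pack (min, max) pairs for columns A and B.
    read_pack(i): returns the (a, b) rows of pack i; stands in for decompression.
    Returns (result, list of packs that had to be opened).
    """
    # Step 1: classify packs by the filter on B, using min/max only.
    status = {}
    for i, (b_min, b_max) in enumerate(b_stats):
        if b_max <= limit:
            status[i] = "irrelevant"
        elif b_min > limit:
            status[i] = "relevant"
        else:
            status[i] = "suspect"

    # Step 2: in a relevant pack every row qualifies, so its max(A) is a
    # valid candidate and a lower bound for the final answer.
    best = None
    for i, s in status.items():
        if s == "relevant":
            a_max = a_stats[i][1]
            best = a_max if best is None else max(best, a_max)

    # Step 3: open suspect packs in descending order of max(A); stop as
    # soon as no remaining pack can improve the current bound.
    suspects = sorted(
        (i for i, s in status.items() if s == "suspect"),
        key=lambda i: a_stats[i][1],
        reverse=True,
    )
    opened = []
    for i in suspects:
        if best is not None and a_stats[i][1] <= best:
            break
        opened.append(i)
        for a, b in read_pack(i):
            if b > limit and (best is None or a > best):
                best = a
    return best, opened


# Invented sample data: five packs of (A, B) rows.
PACKS = [
    [(5, 3), (9, 10)],     # B <= 15 everywhere: irrelevant
    [(20, 16), (30, 40)],  # B > 15 everywhere: relevant, max(A) = 30
    [(25, 10), (28, 20)],  # suspect, but max(A) = 28 cannot beat 30
    [(50, 12), (35, 18)],  # suspect, max(A) = 50: must be opened
    [(33, 14), (34, 30)],  # suspect, pruned once the bound reaches 35
]
A_STATS = [(min(a for a, _ in p), max(a for a, _ in p)) for p in PACKS]
B_STATS = [(min(b for _, b in p), max(b for _, b in p)) for p in PACKS]

result, opened = rough_max(A_STATS, B_STATS, lambda i: PACKS[i])
print(result, opened)  # 35 [3]
```

In this example only one of the five packs is read, while the answer matches a full scan.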
Advanced Knowledge Nodes

Order Detail Table – assume many more rows
Order Number  Order Date  Part ID  Quantity  $Amt
005           20070214    234      500       1500.00
005           20070214    334      125       250.25
006           20070215    334      100       212.50

Supplier/Part Table – assume many more rows
Supplier ID  Effective Date  Expiry Date  Part ID  Description
A456         20050315        Null         234      Pre-measured coffee packets – gold blend
A456         20061201        Null         235      Pre-measured coffee packets – silver blend
A456         20060501        Null         334      4-cup Cone coffee filters; quantity 50

        Pack 1  Pack 2
Pack 1  0       1
Pack 2  1       0
Pack 3  0       0
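A 0/1 matrix like the one above can be read as a pack-to-pack node for a join on Part ID: an entry of 1 means the two packs may share a join value, and 0 means the pair can be skipped. The sketch below builds such a matrix in Python. Which table supplies the rows and which the columns is not stated on the slide, so the orientation, the function names, and the pack contents (including the Part IDs 335, 900, and 901) are all invented for illustration.

```python
def pack_to_pack(left_packs, right_packs):
    """Build a 0/1 matrix: 1 if a left pack and a right pack share a value."""
    return [
        [int(bool(set(left) & set(right))) for right in right_packs]
        for left in left_packs
    ]


def pairs_to_join(matrix):
    """List the (left pack, right pack) index pairs that cannot be skipped."""
    return [
        (i, j)
        for i, row in enumerate(matrix)
        for j, flag in enumerate(row)
        if flag
    ]


# Invented Part ID packs: three on one side of the join, two on the other.
LEFT_PACKS = [[334, 335], [234, 235], [900, 901]]
RIGHT_PACKS = [[234, 235], [334]]

MATRIX = pack_to_pack(LEFT_PACKS, RIGHT_PACKS)
for row in MATRIX:
    print(row)
# [0, 1]
# [1, 0]
# [0, 0]
print(pairs_to_join(MATRIX))  # [(0, 1), (1, 0)]
```

With this data, four of the six pack pairs never need to be compared, and the third pack on the left can be skipped by the join entirely.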
Community Inspirations
Count Distinct, Count(*) on Self-Joins, Decision Trees, Contingencies
New Objectives, New Schemas, New Volumes, New Queries, New KNs
New Data Types, SQL Extensions, Feature Extraction, Data Compression
Conclusion
Technology based on the interaction between rough and precise operations, open to new structures
Full product, simple framework, ad-hoc analytics, good load speed, 10:1 "all inclusive" compression
The core technology draws on data mining, rough sets, computing with rough values, and more
Infobright Community Edition (ICE) is available for free use and study, and is open to contributions
References
D. Ślęzak, J. Wróblewski, V. Eastwood, P. Synak: Brighthouse: An Analytic Data Warehouse for Ad-hoc Queries. PVLDB 1(2): 1337-1345 (2008).
M. Wojnarski, C. Apanowicz, V. Eastwood, D. Ślęzak, P. Synak, A. Wojna, J. Wróblewski: Method and System for Data Compression in a Relational Database. US Patent Application, 2008/0071818 A1.
J. Wróblewski, C. Apanowicz, V. Eastwood, D. Ślęzak, P. Synak, A. Wojna, M. Wojnarski: Method and System for Storing, Organizing and Processing Data in a Relational Database. US Patent Application, 2008/0071748 A1.