www.infobright.org   www.infobright.com   slezak@infobright.com

RSCTC 2008

Rough Sets in Data Warehousing

Infobright Community Edition (ICE)

Data Warehousing

Technology Layout

Two-Level Computing

Large Data (10TB) and Mixed Workloads

Rough Sets

#    Outlook    Temp.   Humid.   Wind     Sport?
1    Sunny      Hot     High     Weak     No
2    Sunny      Hot     High     Strong   No
3    Overcast   Hot     High     Weak     Yes
4    Rain       Mild    High     Weak     Yes
5    Rain       Cold    Normal   Weak     Yes
6    Rain       Cold    Normal   Strong   No
7    Overcast   Cold    Normal   Strong   Yes
8    Sunny      Mild    High     Weak     No
9    Sunny      Cold    Normal   Weak     Yes
10   Rain       Mild    Normal   Weak     Yes
11   Sunny      Mild    Normal   Strong   Yes
12   Overcast   Mild    High     Strong   Yes
13   Overcast   Hot     Normal   Weak     Yes
14   Rain       Mild    High     Strong   No

Sport? = Yes

Classes of records with the same values on a subset of the attributes
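The idea on this slide can be sketched in a few lines of Python. The sketch groups the 14 records into classes that share the same values on a chosen attribute subset, then approximates the concept Sport? = Yes from below (classes fully inside it) and from above (classes that overlap it). The function names are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

ATTRS = ("Outlook", "Temp.", "Humid.", "Wind")

# The 14 records from the slide: (Outlook, Temp., Humid., Wind, Sport?)
RECORDS = {
    1: ("Sunny", "Hot", "High", "Weak", "No"),
    2: ("Sunny", "Hot", "High", "Strong", "No"),
    3: ("Overcast", "Hot", "High", "Weak", "Yes"),
    4: ("Rain", "Mild", "High", "Weak", "Yes"),
    5: ("Rain", "Cold", "Normal", "Weak", "Yes"),
    6: ("Rain", "Cold", "Normal", "Strong", "No"),
    7: ("Overcast", "Cold", "Normal", "Strong", "Yes"),
    8: ("Sunny", "Mild", "High", "Weak", "No"),
    9: ("Sunny", "Cold", "Normal", "Weak", "Yes"),
    10: ("Rain", "Mild", "Normal", "Weak", "Yes"),
    11: ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    12: ("Overcast", "Mild", "High", "Strong", "Yes"),
    13: ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    14: ("Rain", "Mild", "High", "Strong", "No"),
}


def indiscernibility_classes(records, attrs):
    """Group record ids by their values on the chosen attribute subset."""
    positions = [ATTRS.index(a) for a in attrs]
    classes = defaultdict(set)
    for rid, row in records.items():
        classes[tuple(row[p] for p in positions)].add(rid)
    return dict(classes)


def approximations(records, attrs, decision="Yes"):
    """Return (lower, upper) approximations of the set Sport? = decision."""
    target = {rid for rid, row in records.items() if row[-1] == decision}
    lower, upper = set(), set()
    for members in indiscernibility_classes(records, attrs).values():
        if members <= target:      # class lies fully inside the concept
            lower |= members
        if members & target:       # class overlaps the concept
            upper |= members
    return lower, upper


lower, upper = approximations(RECORDS, ("Outlook",))
print(sorted(lower))   # records certainly in Sport? = Yes given Outlook only
print(sorted(upper))   # records possibly in Sport? = Yes given Outlook only
```

With Outlook alone, only the Overcast class falls entirely inside Sport? = Yes, so the lower approximation is small while the upper one covers every record. Adding Humid. to the subset tightens both.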

Information Systems

Data-based knowledge models, classifiers...

Database indices, data partitioning, data sorting...

Difficulty with fast updates of structures...

Rough Sets in Infobright

SELECT COUNT(*) FROM Employees WHERE Salary > $

Packs storing the values of records for column Salary

We can imagine the set of all records relevant to the given query, that is, the records satisfying its SQL filter (Salary > $)

Using the Knowledge Grid, we verify which packs are irrelevant (disjoint with the set), relevant (fully inside the set), or suspect (overlapping the set)

We do not need irrelevant packs. We do not need to decompress relevant ones either: their local COUNT(*) is stored in the corresponding Data Pack Nodes. Only suspect packs have to be decompressed
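A minimal Python sketch of this three-way classification follows. It assumes each pack keeps only its minimum, maximum, and row count, that there are no nulls, and that the filter threshold is 50000. The `PackNode` class, the function names, and the salary figures are hypothetical and do not reflect Infobright's actual structures.

```python
from dataclasses import dataclass


@dataclass
class PackNode:
    """Hypothetical per-pack summary; field names are assumptions."""
    min_value: float
    max_value: float
    count: int


def make_node(values):
    """Summarize one pack of (non-null) values."""
    return PackNode(min(values), max(values), len(values))


def classify(node, threshold):
    """Classify a pack against the filter `value > threshold`."""
    if node.max_value <= threshold:
        return "irrelevant"   # no row can satisfy the filter
    if node.min_value > threshold:
        return "relevant"     # every row satisfies the filter
    return "suspect"          # some rows may satisfy it


def rough_count(packs, nodes, threshold):
    """COUNT(*) for `value > threshold`, opening suspect packs only."""
    total, opened = 0, 0
    for values, node in zip(packs, nodes):
        kind = classify(node, threshold)
        if kind == "relevant":
            total += node.count          # answered from the summary alone
        elif kind == "suspect":
            opened += 1                  # stands in for decompression
            total += sum(1 for v in values if v > threshold)
    return total, opened


# Made-up salary packs, three rows each.
SALARY_PACKS = [
    [30000, 42000, 39000],
    [61000, 75000, 68000],
    [48000, 52000, 55000],
]
NODES = [make_node(p) for p in SALARY_PACKS]

labels = [classify(n, 50000) for n in NODES]
total, opened = rough_count(SALARY_PACKS, NODES, 50000)
print(labels, total, opened)
```

In this toy run only the third pack is read row by row. The first is skipped and the second contributes its stored count.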

Information Systems in Infobright

[Figure: a Query evaluated against pack-level information. Labels: min, max, sum, Nulls, pattern, match, ???, OUT]

SELECT MAX(A) FROM T WHERE B>15;

STEP 1   STEP 2   STEP 3   DATA
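The slide does not spell out the three steps, so the following Python sketch is only one plausible reading. Step 1 classifies row packs using min/max of B. Step 2 uses the stored maxima of A to discard suspect packs that cannot improve the answer. Step 3 opens the remaining suspect packs. It assumes packs of A and B cover the same rows and contain no nulls. All names and data are hypothetical.

```python
def rough_max(a_packs, b_packs, threshold=15):
    """Sketch of SELECT MAX(A) FROM T WHERE B > threshold over aligned packs."""
    # Pack-level summaries: (min of B, max of B, max of A).
    stats = [(min(b), max(b), max(a)) for a, b in zip(a_packs, b_packs)]

    # Step 1: classify packs by the filter on B (irrelevant packs are dropped).
    relevant = [i for i, (bmin, _, _) in enumerate(stats) if bmin > threshold]
    suspect = [i for i, (bmin, bmax, _) in enumerate(stats)
               if bmin <= threshold < bmax]

    # Step 2: relevant packs give a guaranteed bound on the answer, so suspect
    # packs whose MAX(A) cannot exceed it need not be opened.
    best = max((stats[i][2] for i in relevant), default=None)
    to_open = [i for i in suspect if best is None or stats[i][2] > best]

    # Step 3: read actual rows only in the remaining suspect packs.
    for i in to_open:
        for a, b in zip(a_packs[i], b_packs[i]):
            if b > threshold and (best is None or a > best):
                best = a
    return best, to_open


# Made-up data: five packs of three rows each.
A_PACKS = [[5, 9, 3], [20, 18, 25], [22, 30, 7], [40, 2, 1], [10, 24, 11]]
B_PACKS = [[1, 10, 12], [16, 20, 30], [14, 17, 3], [10, 20, 5], [9, 40, 2]]

answer, opened = rough_max(A_PACKS, B_PACKS)
print(answer, opened)
```

Here the first pack is irrelevant, the second is relevant and sets a bound of 25, and the last pack is suspect but its MAX(A) of 24 cannot beat that bound, so only two packs are read row by row.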

Advanced Knowledge Nodes

Order Detail Table – assume many more rows

Order Number   Order Date   Part ID   Quantity   $Amt
005            20070214     234       500        1500.00
005            20070214     334       125        250.25
006            20070215     334       100        212.50

Supplier/Part Table – assume many more rows

Supplier ID   Effective Date   Expiry Date   Part ID   Description
A456          20050315         Null          234       Pre-measured coffee packets – gold blend
A456          20061201         Null          235       Pre-measured coffee packets – silver blend
A456          20060501         Null          334       4-cup Cone coffee filters; quantity 50

         Pack 1   Pack 2
Pack 1   0        1
Pack 2   1        0
Pack 3   0        0
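A 0/1 matrix of this kind could be built as in the Python sketch below. It assumes that an entry is 1 when a pack of Part ID values from one table shares at least one value with a pack of Part ID values from the other, so pack pairs marked 0 can be skipped in a join. That reading, the function names, and the pack contents are assumptions. The data was chosen only so that the result has the same layout as the matrix on the slide.

```python
def pack_to_pack(left_packs, right_packs):
    """matrix[i][j] is 1 if left pack i and right pack j share a value."""
    right_sets = [set(pack) for pack in right_packs]
    return [
        [1 if set(left) & right else 0 for right in right_sets]
        for left in left_packs
    ]


def pairs_to_join(matrix):
    """Pack pairs that may contain matching rows and must be examined."""
    return [
        (i, j)
        for i, row in enumerate(matrix)
        for j, flag in enumerate(row)
        if flag
    ]


# Hypothetical Part ID packs: three on one side of the join, two on the other.
LEFT_PACKS = [[334, 334, 500], [234, 235], [900, 901]]
RIGHT_PACKS = [[234, 235, 236], [334, 500]]

matrix = pack_to_pack(LEFT_PACKS, RIGHT_PACKS)
for row in matrix:
    print(row)
print(pairs_to_join(matrix))
```

Only two of the six pack pairs need to be joined in this example. The third left-hand pack matches nothing and can be ignored entirely.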

Community Inspirations

Count Distinct, Count(*) on Self-Joins, Decision Trees, Contingencies

New Objectives, New Schemas, New Volumes, New Queries, New KNs

New Data Types, SQL Extensions, Feature Extraction, Data Compression

Conclusion

Technology based on interaction between rough and precise operations, open to adding new structures

Full product, simple framework, ad-hoc analytics, good load speed, 10:1 "all inclusive" compression

The core technology based on more data mining, rough sets, computing with rough values, et cetera

Infobright Community Edition (ICE) is ready for free use and study, and open to contributions

THANK YOU!!!
