
Information Life Cycle, Information Value and Data Management

Prof. Rudolf Bayer, Ph. D.
Institut für Informatik
Technische Universität München

DEXA 2007, 4.9.07

Some basic facts about data volumes

- Data volumes in industry are growing by a factor of 1.7 per year (generally accepted)
- 1 file server with 10 TB serves 1,000 users, 10 GB of space per user
- Private data volumes growing much faster?
  - Example: MyLifeBits, see http://research.microsoft.com/barc/mediapresence/MyLifeBits.aspx
  - Significant shifts of document types with changing technology, e.g. video, HDTV, 3D-HDTV with 6 cameras for fisheye effect?
- Value of information growing at the same rate?

Some basic facts about storage

- Cost of storage is falling dramatically:
  - 0.5 €/GB
  - 500 €/TB = MyLifeBits of Gordon Bell
  - 500 €/PB = in a few years (Jim Gray)
- Bottom line: capacity and prices of storage are moving faster than we can capture
- Storage is free!

Reclamation of storage?

- To reclaim 1 GB = 1,000 files = 0.5 €
- Bottom line:
  - Deletion of files is a tremendous waste of time and mental effort
  - Limiting the capacity of personal shares in industry (< 5 GB) is a bad idea

What about access?

- Disk speed: 50 MB/s raw, 5 MB/s real
- Network: 100 MB/s, no longer the bottleneck
- Remote disk is feasible
- GREP = brute-force search:
  - 1 MB in < 1 sec
  - 1 GB in ~ 200 sec = 3 min
  - 1 TB in 3,000 min = 2 days
  - 1 PB in 2,000 days = 5.5 years
  - Parallelism?
- Bottom line: access is critical!!
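The brute-force figures above follow from dividing the data volume by the sustained read rate. A minimal sketch of that arithmetic, assuming decimal units and the slide's 5 MB/s "real" rate; the `disks` parameter is a hypothetical addition that models an idealized parallel scan and is not taken from the slides. Without the slide's step-by-step rounding, the results come out slightly higher (about 2.3 days for 1 TB and about 6.3 years for 1 PB).

```python
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15  # decimal units, as on the slide


def scan_seconds(size_bytes, rate_bytes_per_s=5 * MB, disks=1):
    """Seconds needed to read size_bytes sequentially.

    rate_bytes_per_s is the sustained per-disk rate (the slide's 5 MB/s "real").
    disks > 1 models an idealized parallel scan over evenly spread data
    (an assumption for illustration only).
    """
    return size_bytes / (rate_bytes_per_s * disks)


for label, size in [("1 MB", MB), ("1 GB", GB), ("1 TB", TB), ("1 PB", PB)]:
    seconds = scan_seconds(size)
    print(f"{label}: {seconds:,.1f} s = {seconds / 86_400:,.2f} days")
```

Under these assumptions, spreading 1 PB over 1,000 disks brings the scan back to the single-disk time for 1 TB, which is still days rather than milliseconds.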

What is memory?

Memory = Storage + Access

- Access:
  - Remember where you put it in a directory
  - Index it for near-perfect memory
  - DB of metadata: the file system is a large multidimensional data warehouse (DWH)
- Find it fast

State of the art: file directories?

- Hierarchical organization of data
- Many criteria, e.g. by subject, by author, by doc type, by time
- Old problem of libraries, never solved:
  - Physical organization is one-dimensional (shelves), but logical organization is multidimensional (databases)
- Cleanup and reorganization of file directories is needed to avoid complete chaos and lengthy searches
  - Requires discipline and time
- Solution: a multidimensional meta-DB about data instead of file directories?

Part of my File Directory

Does indexing help?

- Index height grows logarithmically, e.g. a B-tree for 1 PB of data has height 4 to 5
- Access to anything in < 100 ms, compared to 5 years for GREP
- Indexing helps: full-text index like Google, Google Desktop, Apple Spotlight?
  - Gigantic result sets; Google's PageRank determines what the world is reading!
- A full-text index is extremely helpful, but not the solution
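The "height 4 to 5" claim can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 8 KB page size, one index entry per data page, the fanouts of 500 and 1,000, and the 10 ms per page access are all assumptions, not figures from the slides.

```python
def btree_height(num_keys, fanout):
    """Smallest height h such that fanout**h >= num_keys (integer arithmetic)."""
    height, reach = 0, 1
    while reach < num_keys:
        reach *= fanout
        height += 1
    return height


PB = 10**15
PAGE_BYTES = 8 * 1024            # assumed page size
NUM_PAGES = PB // PAGE_BYTES     # assumed: one index entry per data page
SEEK_MS = 10                     # assumed cost of one random page access

for fanout in (500, 1000):       # assumed index fanouts
    height = btree_height(NUM_PAGES, fanout)
    print(f"fanout {fanout}: height {height}, about {height * SEEK_MS} ms per lookup")
```

With these assumed parameters the height is 5 for a fanout of 500 and 4 for a fanout of 1,000, which keeps a lookup well below 100 ms.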

Meta-DB?

Solution: a DB for multidimensional properties of data/files to replace file directories?

by subject, by author, by doc type, by time, by GPS, by …??

Metadata must be captured automatically!

Back to industry today

- File servers:
  - serve 1,000 users
  - store 10 million files
  - 5 TB = 5 GB share per user
  - take 17 hours to back up to tape robots with LTO2 technology: 5,000 GB / 80 MB/s = 17.4 h
  - are backed up on weekends, > 95% unchanged
- Consequences of massive file server consolidation in industry: dead end!
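The backup window above is a simple volume-over-throughput calculation. A small sketch, assuming decimal units; the second figure additionally assumes that a backup could be limited to the changed files (at most 5% of the volume), which the slide does not state.

```python
def backup_hours(volume_gb, rate_mb_per_s):
    """Hours needed to stream volume_gb to tape at rate_mb_per_s (decimal units)."""
    return volume_gb * 1000 / rate_mb_per_s / 3600


full = backup_hours(5000, 80)             # full backup of the 5 TB file server
changed = backup_hours(5000 * 0.05, 80)   # assumed: only the <= 5% changed data

print(f"full backup:       {full:.1f} h")
print(f"changed data only: {changed:.1f} h")
```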

Some Hypotheses

1. Bandwidth of a human, without pictures and video:
   - Less than 10 files per day: doc, pdf, xls, ppt, …
   - Less than 3 MB/day

2. Data have a short life cycle!!! Largely ignored!

3. But data are stored for many years on premium storage; that is the intention of MyLifeBits, but it does not make sense for industry

Hypotheses are:
- plausible
- checkable
- quantifiable

Life Cycle of Files

Most files have a surprisingly short life from creation to last access:

- User directories: 71% 2 days, 84% 3 days, 58% 4 days, 50% 1 day (compare results of Meta Group)
- Project directories: 91% 7 days, 100% 1 day, 91% 1 day
- Group directories: 76% 1 day, 39% 1 month, 84% 1 day, 85% 1 month

Life of files is comparable to a daily newspaper?
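A measurement of this kind can be sketched as follows: take the creation and last-access timestamps of each file and compute the share of files whose life span stays within a given number of days. The function names and the four sample records are made up for illustration; they are not the author's tooling or data.

```python
from datetime import datetime, timedelta


def life_span_days(created, last_access):
    """Whole days between creation and last access (never negative)."""
    return max(0, (last_access - created).days)


def share_within(files, max_days):
    """Fraction of (created, last_access) pairs with a life span <= max_days."""
    if not files:
        return 0.0
    short = sum(1 for created, last in files if life_span_days(created, last) <= max_days)
    return short / len(files)


# Hypothetical sample records, for illustration only.
t0 = datetime(2006, 9, 1)
sample = [
    (t0, t0 + timedelta(hours=6)),
    (t0, t0 + timedelta(days=1)),
    (t0, t0 + timedelta(days=3)),
    (t0, t0 + timedelta(days=40)),
]

print(f"share with life span <= 1 day: {share_within(sample, 1):.0%}")
```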

User5: Life Span of Files

[Chart: number of files (0 to 2,000) by life span in days (1 to 43)]

Project Directory: Accesses per Days Back

[Chart: accesses to 19,939 files; files accessed per day (0 to 2,500) by days back from today (1 to 61)]

Message: less than 30% of files touched in last 60 days

User5: Accesses by Volume

[Chart: volume in KB (0 to 80,000) by days back from 1.9.06 (1 to 43)]

Life Cycle of Files

Most files have a surprisingly short life of only one to three days from creation time to last access.

Life of files is comparable to a daily newspaper?

Value of information decreases rapidly, e.g. departure gate of a flight or this PPT presentation

Consequences for storage and data organization?

File servers today

[Diagram: clients A, B, ..., N; LAN; File-Server F with block interface; SAN; Backup-System B]

The file server stores all files

Simple Idea: Split of File Server

[Diagram: clients A, B, ..., N; LAN; Shared File-Cache C = performance disk; LAN or SAN; File-Store S = capacity disk; SAN; Backup-System B]

Multilevel Architecture

[Diagram: clients A, B, ..., N with or without private File-Caches; LAN; File-Caches; SAN or LAN; File-Store S; SAN; Backup-System B]

Performance disk is no longer a critical resource!

Properties of FileCache Architecture 1

- Mirroring of all important data
- True File-Cache: with classical cache-management algorithms; write-through replaces the backup system, e.g. Tivoli TSM
- Backup: only for the File-Store, as a background service; continuous backup is faster than File-Server backup by at least a factor of 10; backup windows disappear
- Failure modes: F-Cache and F-Store have independent failure modes

Properties of FileCache Architecture 2

- Recovery of File-Cache: instant recovery, works as an empty File-Cache
- Recovery of File-Store: by volume, in the background, minimal impact, only for old files
- Storage capacity: < 10% of data volume for File-Caches (32% Meta Group) and 1/2 for File-Store
- Storage classes: FC disks for F-Cache, SATA disks for F-Store
- Cost: lower than File-Servers, modulo SW cost
- Availability: extremely high, comparable to a PLATIN system
- No lost data!

Cache Size and Algorithms

- My measurements show: very small FileCaches, < 10% of stored data volume
- LRU replacement should work perfectly:
  - Only 5-10 days per year with high activity, e.g. collecting literature for a dissertation or a project
  - Very short life cycle
  - LRU could displace files depending on access patterns, e.g. PDF and ZIP different from XLS files
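The combination of write-through and LRU replacement described on the FileCache slides can be sketched in a few lines. This is an illustrative toy, not the author's system: the class name `FileCache`, the dict standing in for the File-Store, and a capacity counted in files rather than bytes are all assumptions.

```python
from collections import OrderedDict


class FileCache:
    """LRU file cache in front of a file store, with write-through."""

    def __init__(self, store, capacity):
        self.store = store          # dict-like stand-in for File-Store S (assumption)
        self.capacity = capacity    # maximum number of cached files (assumption)
        self.cache = OrderedDict()  # least recently used entry comes first

    def _touch(self, name, data):
        self.cache[name] = data
        self.cache.move_to_end(name)
        while len(self.cache) > self.capacity:
            # Eviction loses nothing: the store already holds a copy.
            self.cache.popitem(last=False)

    def write(self, name, data):
        self.store[name] = data     # write-through to the File-Store
        self._touch(name, data)

    def read(self, name):
        if name in self.cache:      # hit: refresh the LRU position
            self.cache.move_to_end(name)
            return self.cache[name]
        data = self.store[name]     # miss: fetch from the File-Store
        self._touch(name, data)
        return data


store = {}
cache = FileCache(store, capacity=2)
for name in ("a.doc", "b.pdf", "c.xls"):
    cache.write(name, name.encode())
print(list(cache.cache))  # the two most recently used files
```

Because every write reaches the store immediately, a failed cache can be replaced by an empty one and simply refills on demand, which matches the "works as an empty File-Cache" recovery property.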

FileCache Architecture for Databases?

- Split relation R into 2 disjoint tables, R = R1 + R2, e.g. R1 = live data, R2 = stale data
  - $R1 := \sigma_{\varphi}(R)$, e.g. $\varphi$ is create_date > last_archive_date
  - $R2 := \sigma_{\neg\varphi}(R)$, e.g. $\neg\varphi$ is create_date <= last_archive_date
- View R = R1 + R2
- Archiving transaction as a cron job to move tuples from R1 to R2

Example of Archiving Transaction 1

create table R1 (order_received datetime, …)
create table R2 (order_received datetime, …)

-- R1 and R2 are disjoint, so union all avoids a needless duplicate elimination
create view R as
    select * from R1 union all select * from R2

create table Archive_Date (last_move datetime, …)

declare @move_date datetime
select @move_date = last_move from Archive_Date

Generalizes to an arbitrary number of table splits, e.g. for

1. orders received,

2. orders in production,

3. orders shipped,

4. orders under warranty,

5. orders in archive

Example of Archiving Transaction 2

Transaction to move stale data from R1 to R2:

begin tran

-- everything received more than 13 days ago counts as stale
select @move_date = DATEADD(DAY, -13, GETDATE())

insert into R2
    select * from R1 where order_received < @move_date

delete from R1 where order_received < @move_date

-- remember the cutoff of the last move
delete Archive_Date
insert into Archive_Date values (@move_date)

commit tran

User Interface?

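- User sees relation R
- Query q(R) = select * from R where $\psi$, i.e. $q(R) = \sigma_{\psi}(R)$
- Automatic rewrite of q(R), with $R1 = \sigma_{\varphi}(R)$ and $R2 = \sigma_{\neg\varphi}(R)$:

  if $\sigma_{\psi \wedge \varphi}(R) = \emptyset$ then $\sigma_{\psi}(R2)$
  else if $\sigma_{\psi \wedge \neg\varphi}(R) = \emptyset$ then $\sigma_{\psi}(R1)$
  else $\sigma_{\psi}(R)$

- Interesting query rewrite and query optimization problem, part of the optimizer, invisible at the API!
- e.g. $\psi$ is order_received < 'April 3, 2007'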
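For the special case of a date-range predicate, the rewrite decision reduces to comparing the range bounds with the archive cutoff. The sketch below is a hypothetical illustration, not an optimizer: the function `target_table` and its half-open range convention are assumptions, and it follows the archiving transaction in treating rows with order_received < move_date as stale (R2) and all others as live (R1).

```python
from datetime import date


def target_table(lo, hi, move_date):
    """Choose the table for the predicate lo <= order_received < hi.

    lo or hi may be None for an open bound. R2 holds rows received before
    move_date, R1 holds the rest (assumed, as in the archiving transaction).
    """
    touches_live = hi is None or hi > move_date    # range can reach into R1
    touches_stale = lo is None or lo < move_date   # range can reach into R2
    if not touches_live:
        return "R2"   # only stale data can match (or the range is empty)
    if not touches_stale:
        return "R1"   # only live data can match
    return "R"        # the range straddles the cutoff: query the view


cutoff = date(2007, 8, 22)
print(target_table(None, date(2007, 4, 3), cutoff))   # order_received < 'April 3, 2007'
```

A general optimizer would have to decide satisfiability of arbitrary predicates; the range comparison here only covers the slide's example shape.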

Integration of FileStore with ILM

- FileStore has very low load
- Stores all data permanently, secured via backup
- Can manage versions
- Has a database of metadata = 0.1% of data volume = 5 GB for a 5 TB file server
- Can obey complex ILM rules according to Sarbanes-Oxley
- Multidimensional database for metadata plus full text:
  - Domain and user
  - Directory path
  - Filename
  - Version number
  - File extension
  - Time of creation
  - Time of last update
  - GPS position
  - Etc.
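Several of the attributes listed above can be read from the file system without user input. A minimal sketch using only the Python standard library; the function name and the dictionary keys are illustrative assumptions. Creation time, domain/user, version number and GPS position are left out because they are not available portably through `os.stat`.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def capture_metadata(path):
    """Collect file-system metadata for one file without user input."""
    p = Path(path).resolve()
    st = p.stat()
    return {
        "directory_path": str(p.parent),
        "filename": p.name,
        "file_extension": p.suffix,
        "size_bytes": st.st_size,
        "last_update": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        "last_access": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
    }


# Usage example on a throwaway file.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "report.pdf")
    with open(sample_path, "wb") as handle:
        handle.write(b"12345")
    print(capture_metadata(sample_path))
```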

UB-tree for multidimensional index

The Quest for Eternity

Companies are forced by law to preserve their data: tax regulations, Sarbanes-Oxley Act

What about people?

Egypt, Unas pyramid in Sakkara, 2400 BC

Trajan’s Column, 98-117

The Medici as Magi by Benozzo Gozzoli, 1460

Gordon Bell: MyLifeBits, 2007, record all

By Data Volume

Number of Items

What are people doing with a complete, perfect life memory?

- Spend the second half of your life watching the first half on TV?
- When do you stop recording?
- Don't watch, record for your children or for an alibi!
- Quest for Eternity

Private Data Volumes

Human life = 100 years
  = 1,200 months = 1.2 TB/life with 1 GB/month (Gordon Bell's life)
  = 36,000 days = 36 TB/life with 1 GB/day (digitizing private videos)

Person Tracking (mobile phones)

Human life = 100 years = 1,200 months = 36,000 days

36,000 days * 60 positions/day = $2 \times 10^6$ positions/life = 50-100 MB/life
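The lifetime figures on this slide and the preceding one are straightforward products. A short sketch that reproduces them, assuming decimal units and the slides' 30-day months; the bytes-per-position range is not stated on the slide and is only derived here from the 50-100 MB total.

```python
MONTHS_PER_LIFE = 100 * 12               # 1,200 months
DAYS_PER_LIFE = MONTHS_PER_LIFE * 30     # 36,000 days, using 30-day months as on the slides

tb_at_1gb_per_month = MONTHS_PER_LIFE / 1000   # 1 GB/month, in TB per life
tb_at_1gb_per_day = DAYS_PER_LIFE / 1000       # 1 GB/day, in TB per life

positions_per_life = DAYS_PER_LIFE * 60        # 60 tracked positions per day

# Implied record size if the track takes 50 to 100 MB per life (derived, not from the slide).
bytes_per_position = (50e6 / positions_per_life, 100e6 / positions_per_life)

print(tb_at_1gb_per_month, tb_at_1gb_per_day, positions_per_life)
print(f"{bytes_per_position[0]:.0f} to {bytes_per_position[1]:.0f} bytes per position")
```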

Size of DB for Metadata

- Number of new objects: 10 per day = 360,000/life, peanuts for a DB
- Automatic collection of metadata is easy:

  Field                        Bytes
  Date created                    10
  Last change                     10
  Last access                     10
  Object name                     50
  Title                          100
  Directory path or URI          200
  Object type                      2
  Author                          50
  Location of creation            50
  People present                 100
  Version number                   2
  Total per document          < 1 KB

- Total per personal life: Meta-DB < 500 MB = 1 USB stick
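The size estimate can be verified by summing the field widths and multiplying by the number of objects per life. A small sketch; the dictionary keys are illustrative names for the fields listed above, and decimal units are assumed.

```python
FIELD_BYTES = {
    "date_created": 10,
    "last_change": 10,
    "last_access": 10,
    "object_name": 50,
    "title": 100,
    "directory_path_or_uri": 200,
    "object_type": 2,
    "author": 50,
    "location_of_creation": 50,
    "people_present": 100,
    "version_number": 2,
}

bytes_per_object = sum(FIELD_BYTES.values())       # stays below the 1 KB bound
objects_per_life = 10 * 36_000                     # 10 new objects per day

meta_db_mb_exact = objects_per_life * bytes_per_object / 1e6
meta_db_mb_bound = objects_per_life * 1000 / 1e6   # using the 1 KB upper bound

print(bytes_per_object, objects_per_life)
print(f"{meta_db_mb_exact:.0f} MB exact, {meta_db_mb_bound:.0f} MB at 1 KB per object")
```

Both values stay below the 500 MB stated on the slide.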

Multidimensional Meta-DB

- The Meta-DB is a multidimensional DWH for all data objects
- Multidimensional indexing allows high-precision recall
- Use a UB-tree as index; it works well up to that dimensionality
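A UB-tree linearizes multidimensional keys with a Z-order (bit-interleaving) address and stores those addresses in a B-tree. The sketch below shows only the address computation, assuming non-negative integer coordinates of a fixed bit width; the function name is illustrative, and the range-query algorithm of a real UB-tree is not covered.

```python
def z_address(coords, bits):
    """Interleave the low `bits` bits of each coordinate into one integer.

    Bit b of dimension d ends up at position b * len(coords) + d, so points
    that are close in every dimension tend to get nearby addresses.
    """
    dims = len(coords)
    z = 0
    for bit in range(bits):
        for d, value in enumerate(coords):
            z |= ((value >> bit) & 1) << (bit * dims + d)
    return z


# All points of a 4 x 4 grid, sorted by Z-address (the order a B-tree would store them in).
points = [(x, y) for y in range(4) for x in range(4)]
points.sort(key=lambda p: z_address(p, bits=2))
print(points[:4])
```

For metadata, each attribute (time, author, document type, ...) would first have to be mapped to such an integer coordinate; that mapping is not shown here.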

Complete, perfect life memory

What will it be used for? (Schäuble)

How will it affect our lives?

Think about it!
