Data Warehousing 101

Upload: ckcalling

Post on 03-Jun-2018

TRANSCRIPT

  • 8/12/2019 Datawarehousing 101

    1/70

    Data Warehousing 101

    Everything you never wanted to know about big databases but were forced to find out anyway

    Josh Berkus

    Open Source Bridge 2011

  • 2/70

    contents

    covering:

    concepts of DW

    some DW techniques

    databases

    not covering:

    hardware

    analytics/reporting tools

  • 3/70

  • 4/70

    BIG DATA

  • 5/70

    190

  • 6/70

    What is a "data warehouse"?

  • 7/70

    Big Data?

  • 8/70

  • 9/70

    OLTP vs DW

    OLTP:

    many single-row writes

    current data

    queries generated by user activity

    < 1s response times

    1000's of users

    DW:

    few large batch imports

    years of data

    queries generated by large reports

    queries can run for minutes/hours

    10's of users

  • 10/70

    OLTP vs DW

    OLTP: big data for many concurrent requests to small amounts of data each

    DW: big data for low concurrency requests to very large amounts of data each
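The contrast between the two workloads can be sketched in a few lines. This is only an illustration under assumptions: the in-memory `orders` table, its columns, and both function names are hypothetical.

```javascript
// Hypothetical in-memory "orders" table; every name here is illustrative.
const orders = [];
for (let id = 1; id <= 1000; id++) {
  orders.push({ id, customer: `c${id % 10}`, year: 2001 + (id % 10), amount: id % 7 });
}

// OLTP-style request: touch one row, found through a primary-key index.
const byId = new Map(orders.map((o) => [o.id, o]));
function getOrder(id) {
  return byId.get(id);
}

// DW-style request: scan every row and aggregate by a dimension.
function totalByYear(rows) {
  const totals = new Map();
  for (const r of rows) {
    totals.set(r.year, (totals.get(r.year) || 0) + r.amount);
  }
  return totals;
}

const oneOrder = getOrder(42);      // many users, tiny amount of data each
const report = totalByYear(orders); // few users, the whole table each
```

The first call is cheap no matter how large the table grows; the second gets slower with every year of data kept, which is why the two workloads are usually tuned differently.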

  • 11/70

    synonyms & subclasses

  • 12/70

    archiving

  • 13/70

    archiving

    WORN data: "write once, read never"

    grows indefinitely

    usually a result of regulatory compliance

    main concern: storage efficiency

  • 14/70

    data mining

  • 15/70

    data mining

    the database where you don't know what's in there, but you want to find out

    lots of data (TB to PB)

    mostly "semi-structured"

    data produced as a side effect of other business processes

    needs CPU-intensive processing

  • 16/70

    BI: Business Intelligence

    DSS: Decision Support

    OLAP: Online Analytical Processing

    Analytics

  • 17/70

  • 18/70

    BI/DSS/OLAP/Analytics

    databases which support visualization of large amounts of data

    data is fairly well understood

    most data can be reduced to categories, geography, and taxonomy

    primarily about indexing

  • 19/70

    What is a "dimension"?

  • 20/70

    dimensions vs. facts

    Fact Table

    customers/accounts

    category

    subcategory

    sub-subcategory

  • 21/70

    dimension examples

    location/region/country/quadrant

    product categorization

    URL

    transaction type

    account hierarchy

    IP address

    OS/version/build

  • 22/70

    dimension synonyms

    facet

    taxonomy

    secondary index

    view
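A minimal sketch of how a fact table relates to a dimension, assuming a made-up product hierarchy. The table names, keys, and amounts are all hypothetical.

```javascript
// Dimension table: small, descriptive, hierarchical (category > subcategory).
const productDim = new Map([
  [1, { category: "books", subcategory: "fiction" }],
  [2, { category: "books", subcategory: "reference" }],
  [3, { category: "music", subcategory: "jazz" }],
]);

// Fact table: one narrow row per event, pointing at dimensions by key.
const salesFacts = [
  { productId: 1, amount: 10 },
  { productId: 2, amount: 25 },
  { productId: 1, amount: 5 },
  { productId: 3, amount: 8 },
];

// Roll the facts up to any level of the dimension hierarchy.
function sumBy(level) {
  const out = {};
  for (const f of salesFacts) {
    const key = productDim.get(f.productId)[level];
    out[key] = (out[key] || 0) + f.amount;
  }
  return out;
}

const byCategory = sumBy("category");       // { books: 40, music: 8 }
const bySubcategory = sumBy("subcategory"); // { fiction: 15, reference: 25, jazz: 8 }
```

The facts never change shape; only the dimension level used for grouping does.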

  • 23/70

    What is ETL?

  • 24/70

    Extract Transform Load

    how you turn external raw data into useful database data

    Apache logs -> web analytics DB

    CSV POS files -> financial reporting DB

    OLTP server -> 10-year data warehouse

    also called ELT when the transformation is done inside the database

  • 25/70

    Purpose of ETL/ELT

    getting data into the data warehouse

    clean up garbage data

    split out attributes

    "normalize" dimensional data

    deduplication

    calculate materialized views / indexes
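The Apache-logs case can be sketched roughly as follows. This assumes log lines in the common log format; the sample lines, the regular expression, and the `warehouse` array are illustrative stand-ins, not a real loader.

```javascript
// Extract: raw lines, including one garbage line and one exact duplicate.
const rawLines = [
  '10.0.0.1 - - [12/Jun/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
  '10.0.0.2 - - [12/Jun/2011:10:00:01 +0000] "GET /about.html HTTP/1.1" 404 0',
  "### corrupted line ###",
  '10.0.0.1 - - [12/Jun/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
];

// Transform: split each line into attributes, dropping lines that do not parse.
const LINE = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+)$/;
function transform(lines) {
  const seen = new Set();
  const rows = [];
  for (const line of lines) {
    const m = LINE.exec(line);
    if (!m) continue;             // clean up garbage data
    if (seen.has(line)) continue; // deduplication
    seen.add(line);
    rows.push({
      ip: m[1], time: m[2], method: m[3], path: m[4],
      status: Number(m[5]), bytes: Number(m[6]),
    });
  }
  return rows;
}

// Load: a single bulk append into the (hypothetical) warehouse table.
const warehouse = [];
warehouse.push(...transform(rawLines));
```

Doing the same cleanup after loading the raw lines into a staging table would make this ELT instead of ETL.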

  • 26/70

    ETL Tools

    K.E.T.T.L.E.

  • 27/70

    ETL Tools

  • 28/70

    Ad-hoc scripting

  • 29/70

    ELT Tips

    think volume

    bulk processing or parallel processing

    no row-at-a-time, document-at-a-time

    insert into permanent storage should be the last step

    no updates
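One way to read these tips, as a sketch only: work on batches in a scratch area and write to permanent storage once, at the end. The `staging` and `permanent` arrays and the batch size are assumptions made for illustration.

```javascript
// Split an array into fixed-size batches (bulk, not row-at-a-time).
function* batches(rows, size) {
  for (let i = 0; i < rows.length; i += size) {
    yield rows.slice(i, i + size);
  }
}

const incoming = Array.from({ length: 10 }, (_, i) => ({ id: i, value: ` v${i} ` }));

const staging = [];      // scratch area, safe to throw away and redo
const permanent = [];    // hypothetical warehouse table
let permanentWrites = 0;

for (const batch of batches(incoming, 4)) {
  // All cleanup happens on the batch in staging, never on stored rows.
  staging.push(...batch.map((r) => ({ id: r.id, value: r.value.trim() })));
}

// Last step: one append-only write of finished rows; nothing is updated later.
permanent.push(...staging);
permanentWrites += 1;
```

If a batch fails halfway, only the staging area needs to be discarded, because nothing in permanent storage has been touched yet.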

  • 30/70

    Queues not Extract

  • 31/70

    What kind of database should I use for DW?

  • 32/70

    5 Types

    1. Standard Relational

    2.

  • 33/70

  • 34/70

    standard relational

    the all-purpose solution for not-that-big data

    adequate for all tasks but not excellent at any of them

    easy to use

    low resource requirements

    well-supported by all software

    familiar

    not suitable for really big data

  • 35/70

  • 36/70

    What's MPP?

  • 37/70

    Massively Parallel Processing

  • 38/70

    appliance software

  • 39/70

    MPP

    CPU-intensive data warehousing

    data mining, some analytics

    supporting complex query logic

    moderately big data (1-200TB)

    drawbacks: proprietary, expensive

    now hybridizes with other types
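The core idea of massively parallel processing can be sketched as partition, aggregate locally, then combine. This toy version runs the "nodes" one after another in a single process; the hashing rule and all names are assumptions, not any product's design.

```javascript
// Distribute rows across N hypothetical nodes by hashing a key.
function partition(rows, nodes) {
  const shards = Array.from({ length: nodes }, () => []);
  for (const r of rows) shards[r.id % nodes].push(r);
  return shards;
}

// Each node computes a partial aggregate over only its own shard.
function partialSum(shard) {
  let sum = 0;
  for (const r of shard) sum += r.amount;
  return { sum, count: shard.length };
}

// The coordinator combines the small partial results.
function combine(partials) {
  return partials.reduce(
    (acc, p) => ({ sum: acc.sum + p.sum, count: acc.count + p.count }),
    { sum: 0, count: 0 }
  );
}

const rows = Array.from({ length: 100 }, (_, i) => ({ id: i, amount: i }));
const shards = partition(rows, 4);
// In a real MPP system the partialSum calls would run at the same time on separate machines.
const result = combine(shards.map(partialSum));
```

Only the small partial results travel to the coordinator, which is what lets CPU-heavy queries scale with the number of nodes.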

  • 40/70

    What's a column store?

  • 41/70

    column store

  • 42/70

    column store

    inversion of a row store:

    indexes become data

    data becomes indexes

  • 43/70

    column stores

  • 44/70

    column stores

    for aggregations and transformations of highly structured data

    good for BI, analytics, some archiving

    moderately big data (0.?-100TB)

    bad for data mining

    slow to add new data / purge data

    usually support compression
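A small sketch of the row-versus-column layout, assuming a three-column table. The run-length encoder is one simple example of the kind of compression a column layout makes easy; it is not a claim about any particular product.

```javascript
// Row store: each record is stored together.
const rowStore = [
  { id: 1, region: "west", amount: 10 },
  { id: 2, region: "east", amount: 20 },
  { id: 3, region: "west", amount: 30 },
];

// Column store: one array per attribute, aligned by position.
function toColumns(rows) {
  const cols = {};
  for (const row of rows) {
    for (const [name, value] of Object.entries(row)) {
      (cols[name] = cols[name] || []).push(value);
    }
  }
  return cols;
}
const columnStore = toColumns(rowStore);

// An aggregate reads only the one column it needs.
const total = columnStore.amount.reduce((a, b) => a + b, 0);

// Repetitive columns compress well, for example with run-length encoding.
function runLengthEncode(values) {
  const runs = [];
  for (const v of values) {
    const last = runs[runs.length - 1];
    if (last && last[0] === v) last[1] += 1;
    else runs.push([v, 1]);
  }
  return runs;
}
const encoded = runLengthEncode(["west", "west", "west", "east", "east"]);
```

Adding one new record means appending to every column array, which hints at why loads and purges tend to be slower than in a row store.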

  • 45/70

    What's map/reduce?

  • 46/70

    map/reduce

  • 47/70

    map/reduce

  • 48/70

    map/reduce

    // map
    function(doc) {
      for (var i in doc.links)
        emit([doc.parent, i], null);
    }
    // reduce
    function(keys, values) {
      return null;
    }

  • 49/70

    map/reduce

    // Map
    function (doc) {
      emit(doc.val, doc.val)
    }
    // Reduce
    function (keys, values, rereduce) {
      // This computes the standard deviation of the mapped results
      var stdDeviation = 0.0;
      var count = 0;
      var total = 0.0;
      var sqrTotal = 0.0;

      if (!rereduce) {
        // This is the reduce phase, we are reducing over emitted values from
        // the map functions.
        for (var i in values) {
          total = total + values[i];
          sqrTotal = sqrTotal + (values[i] * values[i]);
        }
        count = values.length;
      } else {
        // This is the rereduce phase, we are re-reducing previously
        // reduced values.
        for (var i in values) {
          count = count + values[i].count;
          total = total + values[i].total;
          sqrTotal = sqrTotal + values[i].sqrTotal;
        }
      }

      var variance = (sqrTotal - ((total * total) / count)) / count;
      stdDeviation = Math.sqrt(variance);

      // the reduce result. It contains enough information to be rereduced
      // with other reduce results.
      return {"stdDeviation": stdDeviation, "count": count,
              "total": total, "sqrTotal": sqrTotal};
    };
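To see how the reduce and rereduce phases fit together, the standard-deviation view can be run outside the database. The harness below is a hypothetical stand-in: `runView`, the `chunkSize` parameter, and passing `emit` as an argument are assumptions made so the sketch is self-contained.

```javascript
// Map and reduce follow the slide's logic; emit is passed in instead of being a global.
function map(doc, emit) {
  emit(doc.val, doc.val);
}

function reduce(keys, values, rereduce) {
  let count = 0, total = 0, sqrTotal = 0;
  if (!rereduce) {
    // Reduce phase: values are the raw emitted numbers.
    for (const v of values) { total += v; sqrTotal += v * v; }
    count = values.length;
  } else {
    // Rereduce phase: values are earlier reduce results.
    for (const v of values) { count += v.count; total += v.total; sqrTotal += v.sqrTotal; }
  }
  const variance = (sqrTotal - (total * total) / count) / count;
  return { stdDeviation: Math.sqrt(variance), count, total, sqrTotal };
}

// Hypothetical harness: map every doc, reduce in chunks, then rereduce the partials.
function runView(docs, chunkSize) {
  const emitted = [];
  for (const doc of docs) map(doc, (key, value) => emitted.push([key, value]));
  const partials = [];
  for (let i = 0; i < emitted.length; i += chunkSize) {
    const chunk = emitted.slice(i, i + chunkSize);
    partials.push(reduce(chunk.map((e) => e[0]), chunk.map((e) => e[1]), false));
  }
  return reduce(null, partials, true);
}

const docs = [2, 4, 4, 4, 5, 5, 7, 9].map((val) => ({ val }));
const stats = runView(docs, 4);
```

Because each partial result carries `count`, `total`, and `sqrTotal`, the chunks can be reduced on different machines and merged in any order.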

  • 50/70

    map/reduce vs. MPP

    map/reduce:

    open source

    petabytes

    write routines by hand

    inefficient

    generic

    cheap HW / cloud

    DIY tools

    MPP:

    proprietary

    terabytes

    advanced query support

    efficient

    specific

    needs good HW

    integrated tools

  • 51/70

    What's enterprise search?

  • 52/70

    enterprise search

    ElasticSearch

  • 53/70

    enterprise search

    when you need to do DW with a huge pile of partly processed "documents"

    does: light data mining, light BI/analytics

    best "full text" and keyword search

    supports "approximate results"

    lots of special features for web data
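Keyword search over a pile of documents usually rests on an inverted index. The sketch below is a bare-bones version under assumptions: the sample documents, the tokenizer, and the count-of-matching-terms score are all illustrative and far simpler than what a real search engine does.

```javascript
const documents = [
  { id: 1, text: "PostgreSQL is a relational database" },
  { id: 2, text: "CouchDB stores JSON documents" },
  { id: 3, text: "A column store is a database for analytics" },
];

function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Inverted index: term -> set of ids of the documents containing it.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const term of tokenize(doc.text)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(doc.id);
    }
  }
  return index;
}

// Rank documents by how many query terms they contain (a crude score).
function search(index, query) {
  const scores = new Map();
  for (const term of tokenize(query)) {
    for (const id of index.get(term) || []) {
      scores.set(id, (scores.get(id) || 0) + 1);
    }
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1] || a[0] - b[0])
    .map(([id]) => id);
}

const index = buildIndex(documents);
const hits = search(index, "analytics database"); // best match first
```

Documents that match only some of the terms still come back, just ranked lower, which is one simple form of "approximate results".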

  • 54/70

  • 55/70

    What's a windowing query?

  • 56/70

    regular aggregate

  • 57/70

    windowing function

  • 58/70

    CREATE TABLE events (
        event_id INT,
        event_type TEXT,
        start TIMESTAMPTZ,
        duration INTERVAL,
        event_desc TEXT
    );

  • 59/70

    SELECT MAX(concurrent)
    FROM (
        SELECT SUM(tally) OVER (ORDER BY start) AS concurrent
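The running-sum idea behind this query can be imitated in plain code. The names `tally` and `concurrent` come from the slide; everything else is an assumption, including the numeric `start` and `duration` values and the way the +1/-1 rows are built.

```javascript
// Hypothetical events; start and duration are plain numbers (say, minutes).
const events = [
  { event_id: 1, start: 0, duration: 10 },
  { event_id: 2, start: 5, duration: 10 },
  { event_id: 3, start: 8, duration: 2 },
  { event_id: 4, start: 20, duration: 5 },
];

// One +1 row per event start and one -1 row per event end.
const tallies = [];
for (const e of events) {
  tallies.push({ at: e.start, tally: 1 });
  tallies.push({ at: e.start + e.duration, tally: -1 });
}

// ORDER BY time; on ties, process ends before starts.
tallies.sort((a, b) => a.at - b.at || a.tally - b.tally);

// SUM(tally) OVER (ORDER BY ...): a running total carried from row to row.
let running = 0;
const concurrent = tallies.map((t) => (running += t.tally));

// MAX(concurrent) over the windowed result.
const maxConcurrent = Math.max(...concurrent);
```

A regular aggregate would collapse the rows into a single value; the window function keeps one output value per row, which is what makes the running total possible in one pass.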

  • 60/70

    stream processing SQL

    replace multiple queries with a single query

    avoid scanning large tables multiple times

    replace pages of application code and

  • 61/70

    What's a materialized view?

  • 62/70

    query results as table

    calculate once, read many times

    complex/expensive queries

    frequently referenced

    not necessarily a whole query

    often part of a query

    might be manually or automatically updated

    depends on product
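"Calculate once, read many times" can be sketched as storing a query result and reading from the stored copy. The `sales` data, the `scans` counter, and the manual `refreshMatview` function are assumptions for illustration.

```javascript
const sales = [
  { region: "west", amount: 10 },
  { region: "east", amount: 20 },
  { region: "west", amount: 30 },
];

let scans = 0; // counts how often the expensive query actually runs

// The "expensive" query: a full scan with aggregation.
function totalsByRegion(rows) {
  scans += 1;
  const totals = {};
  for (const r of rows) totals[r.region] = (totals[r.region] || 0) + r.amount;
  return totals;
}

// The matview: query results stored as a table, refreshed manually here.
const matview = { rows: null };
function refreshMatview() {
  matview.rows = totalsByRegion(sales);
}

refreshMatview();            // calculate once
const a = matview.rows.west; // read many times, no rescan
const b = matview.rows.east;
const c = matview.rows.west;
```

The stored result goes stale as soon as `sales` changes, so every matview needs a refresh policy.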

  • 63/70

    non-relational matviews

    CouchDB Views

    cache results of map/reduce jobs

    updated on data read

    Solr / Elastic Search "Faceted Search"

    cached indexed results of complex searches

    updated on data change

  • 64/70

    maintaining matviews

    BEST: update matviews at batch load time

    GOOD: update matview according to clock/calendar

    FAIR: update matview on data request

    BAD for DW: update matviews using a trigger

  • 65/70

    matview tips

    matviews should be small

    1/10 to E of RAM on each node

    each matview should support several queries or one really really important one

    truncate and append, don't update

    index matviews like crazy

    if they are not indexes themselves
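A sketch of the "refresh at batch load, truncate and append, then index" advice. The tables are plain arrays and the index is a `Map`; all names and the daily-total query are hypothetical.

```javascript
const facts = [];             // hypothetical base table
let matviewRows = [];         // the matview, stored as rows
let matviewIndex = new Map(); // index on the matview itself

// Rebuild from scratch: truncate, then append freshly computed rows.
function rebuildMatview() {
  const totals = new Map();
  for (const f of facts) totals.set(f.day, (totals.get(f.day) || 0) + f.amount);
  matviewRows = []; // truncate
  for (const [day, total] of totals) matviewRows.push({ day, total }); // append
  matviewIndex = new Map(matviewRows.map((r) => [r.day, r]));          // index
}

// Refresh as part of the batch load, not once per row via a trigger.
function loadBatch(batch) {
  facts.push(...batch);
  rebuildMatview();
}

loadBatch([{ day: "mon", amount: 5 }, { day: "mon", amount: 7 }]);
loadBatch([{ day: "tue", amount: 3 }, { day: "mon", amount: 1 }]);
const monday = matviewIndex.get("mon").total;
```

Rebuilding once per batch costs one pass over the data; a per-row trigger would pay that maintenance cost on every inserted row.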

  • 66/70

    What's OLAP?

  • 67/70

    cubes

    [cube diagram; axis labels: Site, Repeat Visitors, Browser]

  • 68/70

    drill-down

  • 69/70

    OLAP

    OnLine Analytical Processing

    Visualization technique

    all data as a multi-dimensional space

    great for decision support

    CPU and RAM intensive

    hard to do on really big data

    Works well with column stores
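Roll-up and drill-down over a cube can be sketched as the same aggregation run over different sets of dimensions. The dimensions echo the cube labels (site, browser, repeat visitors), but the data, the `cube` function, and its filter argument are assumptions.

```javascript
// Hypothetical page-view facts with three dimensions.
const views = [
  { site: "blog", browser: "firefox", repeat: true, hits: 5 },
  { site: "blog", browser: "chrome", repeat: false, hits: 3 },
  { site: "shop", browser: "firefox", repeat: false, hits: 2 },
  { site: "shop", browser: "firefox", repeat: true, hits: 4 },
];

// Aggregate hits over any subset of dimensions, after optional filtering.
function cube(rows, dims, filter = {}) {
  const cells = {};
  for (const r of rows) {
    if (Object.entries(filter).some(([k, v]) => r[k] !== v)) continue;
    const key = dims.map((d) => r[d]).join("|");
    cells[key] = (cells[key] || 0) + r.hits;
  }
  return cells;
}

// Roll-up: look at one dimension only.
const bySite = cube(views, ["site"]);
// Drill-down: pick one site and split it by the remaining dimensions.
const shopDetail = cube(views, ["browser", "repeat"], { site: "shop" });
```

Every extra dimension multiplies the number of possible cells, which is one reason cubes get CPU and RAM hungry on really big data.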

  • 70/70

    Contact

    Josh Berkus: josh@pgexperts.com

    blog: blogs.ittoolbox.com/database/soup

    twitter: @fuzzychef

    PostgreSQL: www.postgresql.org

    pgexperts: www.pgexperts.com

    This talk is copyright 2011 Josh Berkus and is licensed under the Creative Commons Attribution license.