intro data analysis and explorationtanca.faculty.polimi.it/wp-content/uploads/intro... · (from the...


© Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 ‹#›

Introduction to Data Analysis and Exploration

(from the Lecture Notes of the book: Introduction to Data Mining, by Tan, Steinbach and Kumar)


Analysis of data (from Wikipedia)

• Data analysis is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.
• Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.
• Data exploration: a preliminary exploration of the data to better understand its characteristics. Now developing into richer paradigms.
• Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes.
• Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information.


Why Analyze Data? Commercial Viewpoint

• Lots of data is being collected and warehoused
  – Web data, e-commerce
  – Purchases at department/grocery stores
  – Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  – Provide better, customized services for an edge (e.g. in Customer Relationship Management)

 


Why Analyze Data? Scientific Viewpoint

• Data collected and stored at enormous speeds (GB/hour)
  – remote sensors on a satellite
  – telescopes scanning the skies
  – microarrays generating gene expression data
  – scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
  – in classifying and segmenting data
  – in hypothesis formation


Analyzing Large Data Sets - Motivation

• There is often information “hidden” in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 and number of analysts, plotted for 1995 to 1999 on a vertical axis from 0 to 4,000,000]

From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”


What is data exploration?

A preliminary exploration of the data to better understand its characteristics.

• Key motivations of data exploration include
  – Helping to select the right tool for preprocessing or analysis
  – Making use of humans’ abilities to recognize patterns
    • People can recognize patterns not captured by data analysis tools
• Related to the area of Exploratory Data Analysis (EDA)
  – Created by statistician John Tukey
  – Seminal book is Exploratory Data Analysis by Tukey
  – A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook
    http://www.itl.nist.gov/div898/handbook/index.htm


Techniques Used In Data Exploration

• In EDA, as originally defined by Tukey
  – The focus was on visualization
  – Clustering and anomaly detection were viewed as exploratory techniques
  – In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory
• Basic traditional techniques of data exploration
  – Summary statistics
  – Visualization
  – Online Analytical Processing (OLAP)


Summary Statistics

• Summary statistics are numbers that summarize properties of the data
  – Summarized properties include frequency, location and spread
    • Examples: location - mean; spread - standard deviation
  – Most summary statistics can be calculated in a single pass through the data
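The single-pass claim can be illustrated with a short sketch. It uses Welford's running-update method, which is one well-known way to get the mean and standard deviation in one pass; the slides do not name a method, and the function name and sample values are hypothetical.

```python
import math


def single_pass_summary(values):
    """Return (count, mean, sample standard deviation) in one pass over values."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, math.sqrt(variance)


# Hypothetical sample, not taken from the slides.
count, mean, std = single_pass_summary([2, 4, 4, 4, 5, 5, 7, 9])
```

Each value is read once and only three running quantities are kept, so the data never has to fit in memory at the same time.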


Frequency and Mode

• The frequency of an attribute value is the percentage of time the value occurs in the data set
  – For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data
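As a minimal sketch of these two notions, the following computes relative frequencies and the mode of a categorical attribute. The attribute values are hypothetical and the helper names are chosen only for illustration.

```python
from collections import Counter


def frequencies(values):
    """Map each attribute value to the fraction of records in which it occurs."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def mode(values):
    """Return the most frequent attribute value (ties are broken arbitrarily)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical categorical attribute.
colors = ["red", "blue", "red", "green", "red", "blue"]
color_frequencies = frequencies(colors)
most_common_color = mode(colors)
```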


Percentiles

• For continuous data, the notion of a percentile is more useful.
  Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile $x_p$ is a value of x such that p% of the observed values of x are less than $x_p$.
• For instance, the 50th percentile $x_{50\%}$ is the value such that 50% of all values of x are less than $x_{50\%}$.
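Several conventions exist for computing a percentile from a finite sample. The sketch below uses the nearest-rank convention, which is an assumption and not something the slides specify: it returns the smallest observed value such that at least p% of the observations are less than or equal to it. The data are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Rank is 1-based; p = 0 is mapped to the minimum.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical observations.
data = [7, 1, 9, 3, 5, 10, 2, 8, 4, 6]
median_value = percentile(data, 50)
ninetieth = percentile(data, 90)
```

Other conventions interpolate between neighboring values, so results can differ slightly on small samples.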


Measures of Location: Mean and Median

• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.
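The sensitivity of the mean to outliers is easy to see on a small sample. In this sketch the values are hypothetical, and `trimmed_mean` is an illustrative helper that drops a fixed number of values from each end before averaging.

```python
import statistics


def trimmed_mean(values, trim_count):
    """Mean after dropping the trim_count smallest and largest values."""
    ordered = sorted(values)
    kept = ordered[trim_count:len(ordered) - trim_count]
    if not kept:
        raise ValueError("trim_count removes every value")
    return statistics.mean(kept)


# Hypothetical sample with one extreme value.
sample = [2, 3, 3, 4, 5, 100]
sample_mean = statistics.mean(sample)      # pulled toward the outlier
sample_median = statistics.median(sample)  # barely affected by it
sample_trimmed = trimmed_mean(sample, 1)   # outlier removed before averaging
```

Here the mean is 19.5, while the median (3.5) and the trimmed mean (3.75) stay close to the bulk of the data.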


Measures of Spread: Range and Variance

• Range is the difference between the max and min
• The variance or standard deviation is the most common measure of the spread of a set of points.
• However, this is also sensitive to outliers, so that other measures are often used.
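A small sketch of these measures follows. The slide does not say which "other measures" it has in mind; the median absolute deviation (MAD) is used here as one common robust choice, and that choice is an assumption. The sample values are hypothetical.

```python
import statistics


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (a robust spread measure)."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


# Hypothetical sample with one extreme value.
sample = [2, 3, 3, 4, 5, 100]
value_range = max(sample) - min(sample)
sample_variance = statistics.variance(sample)  # sample variance (n - 1 denominator)
sample_mad = median_absolute_deviation(sample)
```

The single outlier inflates the range to 98 and the variance to about 1556, while the MAD stays at 1.0.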


Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

• Visualization of data is one of the most powerful and appealing techniques for data exploration.
  – Humans have a well developed ability to analyze large amounts of information that is presented visually
  – Can detect general patterns and trends
  – Can detect outliers and unusual patterns

 


Example: Sea Surface Temperature

• The following shows the Sea Surface Temperature (SST) for July 1982
  – Tens of thousands of data points are summarized in a single figure

[Figure: Sea Surface Temperature (SST) for July 1982]

   

 


Representation

• Is the mapping of information to a visual format
• Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
• Example:
  – Objects are often represented as points
  – Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
  – If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.


Selection

• Is the elimination or the de-emphasis of certain objects and attributes
• Selection may involve choosing a subset of attributes
  – Dimensionality reduction is often used to reduce the number of dimensions to two or three
  – Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset of objects
  – A region of the screen can only show so many points
  – Can sample, but want to preserve points in sparse areas


Visualization Techniques: Histograms

• Histogram
  – Usually shows the distribution of values of a single variable
  – Divide the values into bins and show a bar plot of the number of objects in each bin.
  – The height of each bar indicates the number of objects
  – Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
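The binning step can be sketched as follows, assuming equal-width bins between the minimum and maximum. The values are hypothetical (they are not the petal widths from the slide), and the function name is illustrative.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("values must not all be equal")
    width = (highest - lowest) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lowest) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical values with a gap in the middle.
values = [1, 1, 2, 2, 2, 3, 7, 8, 8, 9]
two_bins = histogram_counts(values, 2)
four_bins = histogram_counts(values, 4)
```

With two bins the counts are [6, 4] and the gap is invisible; with four bins they are [5, 1, 0, 4] and the empty region appears, which is the dependence on bin count that the slide points out.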


Two-Dimensional Histograms

• Show the joint distribution of the values of two attributes
• Example: petal width and petal length
  – What does this tell us?


Visualization Techniques: Box Plots

• Box Plots
  – Invented by J. Tukey
  – Another way of displaying the distribution of data
  – Following figure shows the basic part of a box plot

[Figure: basic parts of a box plot, labeled from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]


Example of Box Plots

• Box plots can be used to compare attributes

 


What is Data Mining?

• Many Definitions
  – Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  – Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns


What is (not) Data Mining?

• What is Data Mining?
  – Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
  – Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
• What is not Data Mining?
  – Look up phone number in phone directory
  – Query a Web search engine for information about “Amazon”


Origins of Data Mining

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data

[Figure: Data Mining shown at the overlap of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]


Data Mining Tasks

• Prediction Methods
  – Use some variables to predict unknown or future values of other variables.
• Description Methods
  – Find human-interpretable patterns that describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996


Methods

• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Anomaly Detection [Predictive]


Classification: Definition

• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.


Classification Example

Training Set

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the training set is used to learn a classifier (the model), which is then applied to the test set]
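To make the workflow concrete, the sketch below encodes the two tables and applies a model to them. The model is a hand-written rule chosen for illustration, not a classifier learned by an algorithm, and the 80K income threshold is an assumption. Because the test records have unknown class labels, accuracy can only be measured on the training set here.

```python
# Training records from the table: (refund, marital status, income in K, cheat).
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records from the table; the class is unknown ("?").
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def predict_cheat(refund, marital_status, income_k, income_threshold_k=80):
    """Hand-written rule (an assumption for illustration, not a learned model)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= income_threshold_k else "No"


def accuracy(labeled_records):
    """Fraction of labeled records whose predicted class matches the true class."""
    correct = sum(
        1
        for refund, status, income, label in labeled_records
        if predict_cheat(refund, status, income) == label
    )
    return correct / len(labeled_records)


training_accuracy = accuracy(training_set)
test_predictions = [predict_cheat(*record) for record in test_set]
```

A perfect score on the training set says little about unseen records, which is why the definition slide calls for a separate labeled test set.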


Classification: Application 1

• Fraud Detection
  – Goal: Predict fraudulent cases in credit card transactions.
  – Approach:
    • Use credit card transactions and the information on its account-holder as attributes.
      – When does a customer buy, what does he buy, how often he pays on time, etc
    • Label past transactions as fraud or fair transactions. This forms the class attribute.
    • Learn a model for the class of the transactions.
    • Use this model to detect fraud by observing credit card transactions on an account.


Classification: Application 2

• Customer Attrition/Churn:
  – Goal: To predict whether a customer is likely to be lost to a competitor.
  – Approach:
    • Use detailed record of transactions with each of the past and present customers, to find attributes.
      – How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc.
    • Label the customers as loyal or disloyal.
    • Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997


Classification: Application 3

• Sky Survey Cataloging
  – Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
    – 3000 images with 23,040 x 23,040 pixels per image.
  – Approach:
    • Segment the image.
    • Measure image attributes (features) - 40 of them per object.
    • Model the class based on these features.
    • Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996


Classifying Galaxies

[Figure: galaxy images labeled Early, Intermediate, Late]

Class:
• Stages of Formation

Attributes:
• Image features
• Characteristics of light waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB

Courtesy: http://aps.umn.edu


Clustering Definition

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  – Data points in one cluster are more similar to one another.
  – Data points in separate clusters are less similar to one another.
• Similarity Measures:
  – Euclidean Distance if attributes are continuous.
  – Other Problem-specific Measures.
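The definition does not name an algorithm. As one possible illustration, the sketch below runs a few iterations of k-means (a swapped-in technique, not something the slides prescribe) using Euclidean distance. The points and the initial centroids are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster
            else centroid  # keep an empty cluster's centroid where it was
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
```

With these starting centroids the two groups are recovered after the first iteration. Results on real data depend on the initial centroids and on the choice of similarity measure.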


Illustrating Clustering

Euclidean Distance Based Clustering in 3-D space.

Intracluster distances are minimized

Intercluster distances are maximized


Clustering: Application 1

• Market Segmentation:
  – Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  – Approach:
    • Collect different attributes of customers based on their geographical and lifestyle related information.
    • Find clusters of similar customers.
    • Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.


Clustering: Application 2

• Document Clustering:
  – Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  – Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  – Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
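One way to turn term frequencies into a similarity measure is cosine similarity between term-count vectors. The slides do not specify the measure, so this choice is an assumption, and the three documents are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    shared_terms = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents.
finance_1 = term_frequencies("stocks fell as markets closed")
finance_2 = term_frequencies("markets rallied and stocks rose")
sports = term_frequencies("the team won the game")

finance_similarity = cosine_similarity(finance_1, finance_2)
cross_similarity = cosine_similarity(finance_1, sports)
```

The two finance documents share two of their five terms and score 0.4, while the finance and sports documents share none and score 0.0. A clustering algorithm can then group documents using these pairwise scores.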


Illustrating Document Clustering

• Clustering Points: 3204 Articles of Los Angeles Times.
• Similarity Measure: How many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278


Clustering of S&P 500 Stock Data

• Observe Stock Movements every day.
• Clustering points: Stock-{UP/DOWN}
• Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
  – Association rules can be used to quantify a similarity measure.

Discovered Clusters (Industry Group in brackets)

1 [Technology1-DOWN]: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2 [Technology2-DOWN]: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3 [Financial-DOWN]: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4 [Oil-UP]: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP


Association Rule Discovery: Definition

• Given a set of records, each of which contains some number of items from a given collection;
  – Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
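The strength of such rules is usually quantified with support and confidence. These two measures are standard in association analysis but are not defined on this slide, so their use here is an addition. The sketch evaluates both discovered rules on the five transactions in the table.

```python
# The five transactions from the table, as sets of items.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    matches = sum(1 for t in transactions if itemset <= t)
    return matches / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


milk_coke_support = support({"Milk", "Coke"})
milk_coke_confidence = confidence({"Milk"}, {"Coke"})
diaper_milk_beer_confidence = confidence({"Diaper", "Milk"}, {"Beer"})
```

{Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk (confidence 0.75), and {Diaper, Milk} --> {Beer} holds in 2 of 3 (confidence about 0.67).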


Association Rule Discovery: Application 1

• Marketing and Sales Promotion:
  – Let the rule discovered be {Bagels, … } --> {Potato Chips}
  – Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
  – Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
  – Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!


Association Rule Discovery: Application 2

• Supermarket shelf management.
  – Goal: To identify items that are bought together by sufficiently many customers.
  – Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
  – A classic rule --
    • If a customer buys diaper and milk, then he is very likely to buy beer.
    • So, don’t be surprised if you find six-packs stacked next to diapers!


Association Rule Discovery: Application 3

• Inventory Management:
  – Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
  – Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.


Sequential Pattern Discovery

• Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
• Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

[Figure: the pattern (A B) (C) (D E) annotated with the timing constraints <= ms, <= xg, > ng, <= ws]


Sequential Pattern Discovery: Examples

• In telecommunications alarm logs,
  – (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
• In point-of-sale transaction sequences,
  – Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
  – Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)


Regression

• Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Greatly studied in statistics, neural network fields.
• Examples:
  – Predicting sales amounts of new product based on advertising expenditure.
  – Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  – Time series prediction of stock market indices.
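For the linear case with a single predictor, the model can be fitted with ordinary least squares in closed form. The sketch below shows this under the assumption of one input variable; the advertising and sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure (x) and sales amount (y).
advertising = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]
slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6 + intercept  # prediction for an unseen expenditure
```

Because the hypothetical data lie exactly on a line, the fit recovers slope 2 and intercept 1. Real data would leave residuals, and a nonlinear dependency would need a different model.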


Deviation/Anomaly Detection

• Detect significant deviations from normal behavior

• Applications:

– Credit Card Fraud Detection

– Network Intrusion Detection


Anomaly/Outlier Detection

• What are anomalies/outliers?
– The set of data points that are considerably different from the remainder of the data

• Variants of Anomaly/Outlier Detection Problems
– Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
– Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
– Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D

• Applications:
– Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
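The three problem variants can be sketched with one simple choice of anomaly score. Here f(x) is assumed to be the distance from x to its k-th nearest neighbor in D, on one-dimensional data. This score is only one of many possibilities, and all function names and data values are hypothetical.

```python
def kth_nearest_distance(x, points, k):
    """Distance from x to its k-th nearest neighbor among `points`."""
    return sorted(abs(x - p) for p in points)[k - 1]


def scores(D, k=2):
    """Anomaly score f(x) for every x in D, computed against the rest of D."""
    return [kth_nearest_distance(x, D[:i] + D[i + 1:], k) for i, x in enumerate(D)]


def above_threshold(D, t, k=2):
    """Variant 1: all points in D whose score exceeds the threshold t."""
    return [x for x, s in zip(D, scores(D, k)) if s > t]


def top_n(D, n, k=2):
    """Variant 2: the n points in D with the largest scores."""
    ranked = sorted(zip(scores(D, k), D), reverse=True)
    return [x for _, x in ranked[:n]]


def score_test_point(x, D, k=2):
    """Variant 3: score of a new test point x with respect to D."""
    return kth_nearest_distance(x, D, k)


# Hypothetical database: mostly normal readings plus one unusual value.
D = [10.0, 10.5, 11.0, 9.5, 10.2, 25.0]

flagged = above_threshold(D, t=5.0)
most_anomalous = top_n(D, n=2)
test_score = score_test_point(30.0, D)
```

On this data `flagged` contains only 25.0, and 25.0 also ranks first in `most_anomalous`.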


Importance of Anomaly Detection

Ozone Depletion History

• In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels

• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?

• The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!

Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html


Anomaly Detection Schemes

• General Steps

– Build a profile of the “normal” behavior
  • Profile can be patterns or summary statistics for the overall population

– Use the “normal” profile to detect anomalies
  • Anomalies are observations whose characteristics differ significantly from the normal profile

• Types of anomaly detection schemes
– Graphical & Statistical-based
– Distance-based
– Model-based
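The two general steps can be sketched for the statistical case. The profile is assumed to consist of just the mean and sample standard deviation of the normal data, and "differs significantly" is taken to mean more than three standard deviations from the mean. Both choices, along with the names and data below, are illustrative assumptions.

```python
from statistics import mean, stdev


def build_profile(normal_values):
    """Step 1: summarize normal behavior with summary statistics."""
    return {"mean": mean(normal_values), "std": stdev(normal_values)}


def is_anomaly(value, profile, z_threshold=3.0):
    """Step 2: flag values that lie far from the normal profile."""
    z = abs(value - profile["mean"]) / profile["std"]
    return z > z_threshold


# Hypothetical measurements assumed to represent normal behavior.
normal_readings = [100, 102, 98, 101, 99]
profile = build_profile(normal_readings)

typical = is_anomaly(101, profile)   # close to the profile mean
unusual = is_anomaly(130, profile)   # far outside the profile
```

Flagged observations are candidates for inspection. As the ozone example shows, discarding them automatically can hide a real effect.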


Challenges of Data Mining

• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data


Next Subject: OLAP

• On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database.

• Relational databases put data into tables, while OLAP uses a multidimensional array representation.
– Such representations of data previously existed in statistics and other fields

• There are a number of data analysis and data exploration operations that are easier with such a data representation.
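The contrast between the two representations can be illustrated with a small sketch. It assumes a made-up sales table with three dimensions (product, region, quarter), turns it into a nested-list array, and then aggregates over one dimension. All names and figures are hypothetical, and real OLAP systems use far more elaborate storage.

```python
# Relational view: one row per fact (product, region, quarter, sales amount).
rows = [
    ("laptop", "north", "Q1", 120),
    ("laptop", "south", "Q1", 80),
    ("phone", "north", "Q1", 200),
    ("phone", "north", "Q2", 150),
    ("laptop", "north", "Q2", 90),
]


def build_cube(rows):
    """Build a product x region x quarter array of summed sales."""
    products = sorted({r[0] for r in rows})
    regions = sorted({r[1] for r in rows})
    quarters = sorted({r[2] for r in rows})
    cube = [[[0 for _ in quarters] for _ in regions] for _ in products]
    for product, region, quarter, amount in rows:
        p = products.index(product)
        r = regions.index(region)
        q = quarters.index(quarter)
        cube[p][r][q] += amount
    return cube, (products, regions, quarters)


def sum_over_regions(cube):
    """Aggregate away the region dimension, leaving product x quarter."""
    n_quarters = len(cube[0][0])
    return [
        [sum(region_row[q] for region_row in product_slab) for q in range(n_quarters)]
        for product_slab in cube
    ]


cube, (products, regions, quarters) = build_cube(rows)
by_product_and_quarter = sum_over_regions(cube)
```

Once the data is in array form, summarizing along a dimension is a plain sum over one axis, which is the kind of operation the multidimensional representation makes easy.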