It's happening now: analytics in a world of big data

Post on 02-Jul-2015

DESCRIPTION

Big data? IBM estimates that we generate 2.5 quintillion bytes of data every day. Yes, every day. Or look at it this way: 90% of the data available worldwide was created in the past two years. Staggering. Gartner predicts that by 2015 at least 4.4 million jobs linked to big data and analytics will have been created. Can we call this a trend? We think so. A trend you do not want to miss, because you too can benefit from analysing your data. The workshop covers the following, with numerous examples: a bird's-eye view of the analytics process model; the individual steps (data preprocessing, analytics and post-processing); and recent new applications such as process analytics, social media analytics and fraud analytics.

TRANSCRIPT

Advances in Data Mining and Big Data Analytics

Prof. dr. Bart Goethals
Advanced Database Research & Modelling
Department of Mathematics & Computer Science

Cevora - 19 November 2014

Big Data Analytics or …

• Statistics
• Data Mining
• Knowledge Discovery in Data
• Analytics
• Data Science
• …

Big Data is like teenage sex:

• everyone talks about it,
• nobody really knows how to do it,
• everyone thinks everyone else is doing it,
• so everyone claims they are doing it…

[Dan Ariely]

See also Data News Survey March 2014

The Goal of Big Data

Goal is the same: find useful patterns or models in data

Emphasis changes: Volume, Velocity, Variety, V…

Big Data Volume

[Source: EMC]

Is Big better?

• Yes! But, some fundamental principles: [U. Fayyad]
• Data gains value exponentially when integrated and coalesced. When fragmented: dramatic value loss.
• Fusing data together from disparate or independent sources is difficult and impossible to maintain.
• 80% of the effort of Data Mining goes to getting the right data together.
• Standardisation. Data governance and policy. Data privacy, encryption and masking. Data infrastructure.
• Data is a primary competency and not a side-activity.

Is Big a problem?

• Data can (not) be summarised (sampling)
• Too much information lost for reasonable sizes
• We need to find patterns that are useful and valid for all data
• Personalized Recommendation
• Personalized Advertising
• Rare diseases
• Current analytics methods do not scale or produce satisfactory results
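One way to read the sampling point is as a simple probability question: how likely is it that a random sample contains none of the rare cases at all? The sketch below computes that chance for sampling without replacement. The function name and the numbers (one million records, ten rare cases, a 1% sample) are illustrative assumptions, not figures from the talk.

```python
def prob_sample_misses_all(population, rare, sample_size):
    """Chance that a random sample (without replacement) contains
    none of the `rare` records hidden in `population` records."""
    p = 1.0
    for i in range(sample_size):
        # Probability that the (i+1)-th draw is again a non-rare record.
        p *= (population - rare - i) / (population - i)
    return p


# Hypothetical numbers: 10 rare cases in 1,000,000 records, 1% sample.
p_miss = prob_sample_misses_all(1_000_000, 10, 10_000)
print(f"Chance the sample holds no rare case at all: {p_miss:.3f}")
```

Under these assumptions the sample misses every rare case roughly nine times out of ten, which is why applications such as rare diseases need patterns that are valid on all of the data.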

Big Data Velocity (60s on the internet)

[Source: Qmee]

Big Data Variety

• Data can be
  • structured,
  • semi-structured,
  • text,
  • images,
  • video,
  • time series,
  • click-streams,
  • graphs or (social) networks, …
• …

Big Data Value

• Predict voting behaviour based on Twitter (~1M tweets) [UA Master thesis Christophe Van Gysel]
• Detect fiscal fraud based on a network of ~7M transactions [UA Applied Data Mining, Prof. dr. David Martens]
• Recognise cyberpedophiles [UA Computational Linguistics, Prof. dr. Walter Daelemans]
• e-Health, predict rare diseases [UA Biomina, UZA, Prof. dr. Bart Goethals]
• Mining train delays [UA, Prof. dr. Bart Goethals and Infrabel]
• Personalised Advertising, Recommendation, Cross-selling, Product placement, Distribution planning
• …

What about the methods?

• Association-, Pattern Discovery
• Classification, Prediction, Regression
• Clustering
• Recommendation
• Exploration
• Summarization
• Visualization

Association-, Pattern Discovery

• Imagine a supermarket
• What sets of products are frequently bought together?
• What products influence the sales of each other?
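The supermarket question can be made concrete with a brute-force count of how often each pair of products shares a basket. This is only a minimal sketch: the baskets, the function name `frequent_itemsets` and the support threshold are made up for illustration, and real pattern miners avoid enumerating every combination.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets; not data from the talk.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def frequent_itemsets(baskets, size, min_support):
    """Return the itemsets of `size` products that occur together
    in at least `min_support` baskets, with their counts."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each itemset has one canonical tuple form.
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


frequent_pairs = frequent_itemsets(baskets, size=2, min_support=3)
for pair, n in sorted(frequent_pairs.items()):
    print(pair, n)
```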

Challenge

Number of potentially interesting patterns is larger than the number of particles in the universe
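A rough calculation shows why. If a pattern is any non-empty set of products, a catalogue of n products yields 2^n - 1 candidate patterns. The sketch below finds the catalogue size at which that count passes 10^80, a commonly quoted rough estimate for the number of particles in the observable universe; both that estimate and the function names are assumptions for illustration.

```python
# Rough, commonly quoted order of magnitude; an assumption, not from the talk.
PARTICLES_IN_UNIVERSE = 10 ** 80


def count_itemsets(n_items):
    """Number of non-empty subsets of n_items distinct products."""
    return 2 ** n_items - 1


def smallest_catalogue_exceeding(limit):
    """Smallest number of products whose itemset count exceeds `limit`."""
    n = 1
    while count_itemsets(n) <= limit:
        n += 1
    return n


print(smallest_catalogue_exceeding(PARTICLES_IN_UNIVERSE))
```

Under these assumptions a few hundred products are already enough, far fewer than a real supermarket stocks.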

Association-, Pattern Discovery

• “75% of all customers that buy diapers also buy beer”
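A statement of this form is usually called an association rule, and the percentage is its confidence: among the baskets that contain diapers, the fraction that also contain beer. The sketch below computes it on invented transactions chosen so that the rule comes out at 75%; the data and the function names `support` and `confidence` are illustrative assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing `antecedent`, the fraction
    that also contain `consequent`."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical transactions, constructed for this example only.
transactions = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "beer", "bread"},
    {"diapers", "milk"},
    {"bread", "milk"},
    {"beer", "bread"},
    {"beer"},
    {"bread"},
]

print(confidence(transactions, {"diapers"}, {"beer"}))  # diapers -> beer
print(confidence(transactions, {"beer"}, {"diapers"}))  # beer -> diapers
```

Note that the rule is directional: on this toy data the confidence of "beer implies diapers" is lower than that of "diapers implies beer".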

Different patterns for different data

• Patients, symptoms, diseases
• Movies, ratings, viewers
• Friends, Likes, Status Updates, Interactions
• Routes, Trucks, Packages, Distributors, Locations
• Sequences, spatial, time series, graphs, multi-relations, RDF, …

Classification / Prediction

How to separate two classes of objects from each other

Rare diseases

• Neonatal heel prick used for detection of potential medium-chain acyl-coenzyme A dehydrogenase deficiency
• Classify whether an expensive genetic test is required
• Intensive care, fast prediction of e.g. kidney failure

[UA Biomina]

Fraud detection

[De Standaard, Prof. dr. David Martens, UA Applied Data Mining research group]

Twitter brings counsel (“Twitter brengt raad”)

Voting behaviour prediction on Twitter

[UA Master thesis Christophe Van Gysel]

Classification methods

• Pattern Based Classification
• Nearest Neighbour Classification
• Decision Trees
• Support Vector Machines
• Neural Networks
• Random Forests
• Conditional Random Fields
• …
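Of the methods listed, nearest neighbour classification is probably the simplest to sketch: a new object receives the label of the most similar labelled object. The example below uses Euclidean distance on made-up two-dimensional points; the data, the labels and the function name are assumptions for illustration, and practical systems typically vote over several neighbours.

```python
import math


def nearest_neighbour(train, query):
    """Return the label of the training point closest to `query`
    (Euclidean distance). `train` is a list of (point, label) pairs."""
    _, label = min(train, key=lambda item: math.dist(item[0], query))
    return label


# Hypothetical labelled points, invented for this example.
train = [
    ((1.0, 1.0), "negative"),
    ((1.5, 2.0), "negative"),
    ((5.0, 5.0), "positive"),
    ((6.0, 4.5), "positive"),
]

print(nearest_neighbour(train, (1.2, 1.4)))
print(nearest_neighbour(train, (5.5, 5.0)))
```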

Recommendation methods

• A customer arrives on your web-shop: show her the product she doesn’t know yet, but might be interested in
• For any (online) shop! Famous example: Netflix (pattern mining is even used to produce new series: ‘House of Cards’)
• Recommendation is everywhere.
• Understand user-intent!
• Methods:
  • Collaborative Filtering
  • Matrix Factorisation
  • …
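As a rough illustration of collaborative filtering, the sketch below scores the items a user has not rated yet by the ratings of other users, weighted by how similar those users are. It is a deliberately simplified user-based variant with cosine similarity; the users, items, ratings and function names are all hypothetical, and it does not represent how any particular shop or Netflix implements recommendation.

```python
import math

# Hypothetical users, items and ratings (1 to 5), invented for this example.
ratings = {
    "ann": {"film_a": 5, "film_b": 4, "film_c": 1},
    "bob": {"film_a": 4, "film_b": 5, "film_d": 5},
    "cara": {"film_c": 5, "film_d": 1, "film_e": 4},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(ratings, user, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


print(recommend(ratings, "ann", top_n=2))
```

Here "ann" rates films much like "bob" does, so the film that "bob" liked and "ann" has not seen ranks first.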

Sentiment analysis

Clustering: grouping similar things together

What is a natural grouping of these objects?

Male vs. Female

Young vs. Old

Simpson family vs. Others

Similarity is hard to measure

curse of dimensionality
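One common way to illustrate the curse of dimensionality is to compare the nearest and the farthest neighbour of a query point as the number of attributes grows. The sketch below measures that relative contrast on uniformly random points; the function name, point count and fixed seed are assumptions chosen to keep the example small and repeatable.

```python
import math
import random


def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query
    point to n_points random points in the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

With many dimensions the contrast shrinks: every point ends up at nearly the same distance from the query, so "most similar" carries little information.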

Enough about the methods
What about privacy?

• Most methods function on anonymised data
• Problem solved? No!
• Patterns or predictions themselves can also cause privacy infringement

Privacy Preserving Data Mining and Discrimination Aware Data Mining methods exist!

Conclusion

http://www.uantwerpen.be/bart-goethals
bart.goethals@uantwerp.be