Practical Computing with Chaos

Practical Computing With Chaos, Ted Dunning, June 9, 2015. © 2014 MapR Technologies

Upload: hadoop-summit

Post on 12-Aug-2015

Category: Technology



TRANSCRIPT

  1. © 2014 MapR Technologies
  2. Practical Computing with Chaos. Ted Dunning, Chief Applications Architect, MapR Technologies. Email: [email protected] [email protected] Twitter: @Ted_Dunning
  3. e-book available courtesy of MapR. Also at MapR booth. http://bit.ly/1jQ9QuL A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman, June 2014 (published by O'Reilly)
  4. Practical Machine Learning series (O'Reilly). Machine learning is becoming mainstream. Need pragmatic approaches that take into account real-world business settings: time to value; limited resources; availability of data; expertise and cost of team to develop and to maintain system. Look for approaches with big benefits for the effort expended.
  5. Agenda: Monty Hall; randomized geo-coding; Thompson sampling; Bayesian bandits; targeting; Bayesian ranking; dithering (sound, signals); synthetic data (preview)
  6. Let's Start with Trouble. Monty Hall problem (oops, done). Three doors, one with a fabulous prize. You pick one. Monty shows you that one of the remaining doors is empty. You can switch at this point to the other door or not. Should you switch?
  10. The Real Problem. Doing the math isn't too hard. Convincing somebody you have the right answer is really hard.
  11. Live Coding With REAL Chaos
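One way to settle the argument on the Monty Hall slides is to simulate the game rather than debate the math. The following is a minimal Python sketch, not the code from the live demo (which the transcript does not record); the function names `play` and `win_rate`, the trial count, and the seed are illustrative assumptions.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant ends up with the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the single door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch, trials=100_000, seed=42):
    """Estimate the win probability of a strategy by repeated random play."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

Under these assumptions, staying should win about one time in three and switching about two times in three, which is often more persuasive than the algebra.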
  12. Geo-coding
  13. Geo-coding. Some databases have disk locality that follows key locality. The primary key is totally ordered. Embedding a total ordering of the points in a plane is possible, but it loses some distance information. A line is not a square! We want to do proximity searches. This gets harder in the polar regions for most codings.
  14. Space Filling Curve [figure: cells labeled 0 to 3]
  15. Space Filling Curve [figure: cells labeled 0 to 3]
  16. Z-coding: Interleave Bits [figure: grid of 2-bit cell codes 00, 01, 10, 11]
  17. Neighbors Often Share Prefix [figure: grid of 2-bit cell codes with keys 00.11.11, 10.01.01, 00.11.01]
  18. Often, not always [figure: pairs of cells labeled Close and Far]
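The bit interleaving behind Z-coding, and the "often, not always" caveat about shared prefixes, can be sketched in a few lines. This is an illustrative Python sketch: the helper names `z_encode` and `z_string`, the choice to put x bits in the even positions, and the dotted two-bit rendering are assumptions, and the slides may use a different bit order.

```python
def z_encode(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single integer key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key


def z_string(x, y, bits=3):
    """Render the key as dot-separated 2-bit groups, coarsest level first."""
    key = z_encode(x, y, bits)
    groups = [(key >> (2 * i)) & 3 for i in reversed(range(bits))]
    return ".".join(format(g, "02b") for g in groups)


# Adjacent cells inside one quadrant share a long prefix.
print(z_string(2, 2), z_string(3, 2))  # 00.11.00 00.11.01
# Adjacent cells that straddle a quadrant boundary share no prefix at all.
print(z_string(3, 3), z_string(4, 3))  # 00.11.11 01.10.10
```

The second pair is the failure case: the cells touch, yet their keys differ in the first group, so a single prefix scan cannot cover a neighborhood that crosses a boundary.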
  19. Random Sampling to Derive Keys [figure: grid of 2-bit cell codes]
  20. "00.01.01" "00.01.10" "00.01.11" "00.11.00" "00.11.01" "00.11.10" "00.11.11" "01.00.10" "01.10.00" "01.10.10" [figure: grid of 2-bit cell codes]
  21. "00.01.01" "00.01.10" "00.01.11" "00.11.00" "00.11.01" "00.11.10" "00.11.11" "01.00.10" "01.10.00" "01.10.10" [figure: grid of 2-bit cell codes]
  22. "00.01.10" - "00.01.11"; "00.11.00" - "00.11.11"; "01.00.10"; "01.10.00" - "01.10.10" [figure: grid of 2-bit cell codes]
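The idea of deriving scan keys by random sampling can be sketched as: throw random points into the search region, collect the Z-keys of the cells they land in, then merge consecutive keys into ranges. The Python below is one reading of the slides, not the speaker's code; `sample_keys`, `merge_ranges`, the disc-shaped region, and all parameter values are hypothetical, and the merging rule here (strictly consecutive keys only) may differ from the one used on slide 22.

```python
import math
import random


def z_encode(x, y, bits):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def sample_keys(cx, cy, radius, bits=3, samples=1000, seed=1):
    """Return sorted Z-keys of grid cells hit by random points in a disc."""
    rng = random.Random(seed)
    size = 1 << bits
    keys = set()
    while samples > 0:
        dx = rng.uniform(-radius, radius)
        dy = rng.uniform(-radius, radius)
        if dx * dx + dy * dy > radius * radius:
            continue  # rejection sampling: keep only points inside the disc
        samples -= 1
        x, y = math.floor(cx + dx), math.floor(cy + dy)
        if 0 <= x < size and 0 <= y < size:
            keys.add(z_encode(x, y, bits))
    return sorted(keys)


def merge_ranges(keys):
    """Collapse sorted integer keys into inclusive (start, end) runs."""
    ranges = []
    for k in keys:
        if ranges and k == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], k)  # extend the current run
        else:
            ranges.append((k, k))            # start a new run
    return ranges


keys = sample_keys(cx=4.0, cy=4.0, radius=2.0)
print(merge_ranges(keys))
```

Sampling trades exactness for simplicity: no geometry is needed to find the cells a region touches, but a cell that the region only barely overlaps can be missed if too few points are drawn.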
  23. Dithering
  24. 4-bit sine wave (listen for artifacts as volume decreases). White dithering (artifacts gone, we hear through the noise). Noise shaping (noise is easier to hear through).
  25. [figure: signal plotted against Time]
  26. The Shape of the Noise [figure: Noise plotted against Frequency]
  27. The Effect After Averaging [figure: signal plotted against Time]
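The effect of dithering followed by averaging can be reproduced numerically. The sketch below is a minimal Python illustration under stated assumptions: it uses integer quantization levels rather than the 4-bit audio of the demo, plain uniform (white) dither rather than noise shaping, and hypothetical names (`quantize`, `dithered`, `SIGNAL`).

```python
import math
import random


def quantize(v):
    """Round to the nearest integer quantization level."""
    return math.floor(v + 0.5)


def dithered(v, rng):
    """Add uniform noise one quantization step wide, then quantize."""
    return quantize(v + rng.uniform(-0.5, 0.5))


# A sine wave whose amplitude is below half a quantization step.
SIGNAL = [0.4 * math.sin(2 * math.pi * t / 50) for t in range(200)]

# Without dither every sample rounds to zero: the signal disappears.
plain = [quantize(v) for v in SIGNAL]

# With dither each sample is noisy, but the average over many
# independently dithered copies converges back to the original value.
rng = random.Random(0)
REPEATS = 400
averaged = [sum(dithered(v, rng) for _ in range(REPEATS)) / REPEATS
            for v in SIGNAL]

print("max |plain|      :", max(abs(p) for p in plain))
print("max |avg - signal|:", max(abs(a - s) for a, s in zip(averaged, SIGNAL)))
```

The design point is that noise added before quantization turns a systematic error (everything rounds to zero) into a random one, and random errors average out.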
  28. Thompson Sampling
  29. Learning in the Real World. In the real world we get to pick our training examples. Do we try this restaurant or not? Learning has real and opportunity costs. Not learning has real and opportunity costs as well. Every sub-optimal choice we make incurs regret. We would like to minimize this. But we can't quantify regret without incurring regret!
  30. An Example. Pick one of five options: purple, blue, green, red, yellow. Each has a random payoff. If you pick a bad option, regret = mean(best) - mean(yours). The best known algorithm uses randomization. Best = minimal regret + minimal code complexity.
  31. Demo. The Algorithm
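The demo code is not part of the transcript, so the following is a minimal Python sketch of Thompson sampling for the five-option example, under the assumption that each payoff is a win or a loss (a Bernoulli bandit with a Beta posterior per option). The function name `thompson`, the payoff rates, the step count, and the seed are all hypothetical.

```python
import random


def thompson(true_rates, steps=5000, seed=0):
    """Run Thompson sampling; return (pulls per arm, cumulative regret)."""
    rng = random.Random(seed)
    n = len(true_rates)
    wins = [0] * n
    losses = [0] * n
    best = max(true_rates)
    regret = 0.0
    for _ in range(steps):
        # Draw one plausible payoff rate per arm from its Beta posterior
        # (uniform prior), then play the arm whose draw is highest.
        draws = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n)]
        arm = draws.index(max(draws))
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        # Regret as defined on the slide: mean(best) - mean(yours).
        regret += best - true_rates[arm]
    pulls = [wins[i] + losses[i] for i in range(n)]
    return pulls, regret


# Hypothetical payoff rates for purple, blue, green, red, yellow.
rates = [0.1, 0.2, 0.3, 0.4, 0.6]
pulls, regret = thompson(rates)
print("pulls per option:", pulls)
print("cumulative regret:", round(regret, 1))
```

The randomization is the whole algorithm: uncertain options occasionally produce a high draw and get explored, while options known to be poor are sampled less and less, which keeps both regret and code size small.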
  32. Synthetic Data
  33. Can You See the Problem? select IR.ENC_KEY ,IR.ENCOUNTER_ ,IR.ETYPE ,IR.bill_type ,IR.CONTR_ ,IR.SOURCE_CD ,IR.sub_source_cd ,IR.HP_CD ,IR.LOB_CD ,IR.FDO ,IR.TDOS ,IR.member_Nbr ,IR.HIC_NBR ,IR.MEMBER_SOURCE_CD ,IR.HDR_ERRCD ,IR.HDR_ERRDESC ,IR.PROVIDER_NBR ,IR.provider_type ,IR.PROVIDER_SOURCE_CD ,IR.cms_provider_type ,IR.SPEC_CD ,IR.SPEC_DESC ,IR.rev_cd ,IR.rev_cd_desc ,IR.proc_cd ,IR.diag_cd ,IR.DIAG_CD_KEY ,IR.DIAGNOSIS_KEY ,IR.rec_state_cd ,IR.rec_status_cd ,IR.DG_ERRCD ,IR.DG_ERRDESC FROM (SELECT distinct enc.encounter_key as ENC_KEY, enc.encounter_nbr as ENCOUNTER_, typ.encounter_type_cd as ETYPE, bt.bill_type, cnt.contract_nbr as CONTR_, ds.SOURCE_CD, enc.sub_source_cd, enc.HP_CD, lob.LOB_CD, enc.new_min_dt as FDOS, substr(enc.new_max_dt, 1, 10) as TDOS, enc.member_Nbr, m.HIC_NBR, m.MEMBER_SOURCE_CD, eerr.error_cd as HDR_ERRCD, eerr.ERROR_DESC as HDR_ERRDESC, enc.PROVIDER_NBR, prv.provider_type, prv.PROVIDER_SOURCE_CD, diag.cms_provider_type, sp.specialty_cd as SPEC_CD, sp.specialty_desc as SPEC_DESC, svc.rev_cd, rev.rev_cd_desc, svc.proc_cd, dgcd.diag_cd, dgcd.DIAG_CD_KEY, diag.DIAGNOSIS_KEY, st.rec_state_cd, sts.rec_status_cd, derr.error_cd as DG_ERRCD, derr.error_desc as DG_ERRDESC FROM oicpcuhg.ir_encounter enc
  34. INNER JOIN oicpcuhg.ir_encountertype typ ON (typ.encounter_type_key = enc.encounter_type_key) LEFT OUTER JOIN oicpcuhg.ir_billtype bt ON (bt.bill_type_key = enc.bill_type_key) LEFT OUTER JOIN oicpcuhg.ir_contract cnt ON (cnt.contract_key = enc.contract_key) LEFT OUTER JOIN oicpcuhg.ir_datasource ds ON (ds.source_key = enc.data_source_key) LEFT OUTER JOIN oicpcuhg.ir_lineofbusiness lob ON (lob.lob_key = enc.lob_key) INNER JOIN oicpcuhg.ir_member m ON ( m.hp_cd = enc.hp_cd AND m.member_source_cd = enc.member_source_cd AND m.member_nbr = enc.member_nbr) LEFT OUTER JOIN oicpcuhg.ir_encountererror eerror ON (eerror.encounter_key = enc.encounter_key and eerror.active_flg = 'Y') LEFT OUTER JOIN oicpcuhg.ir_error eerr ON (eerr.error_key = eerror.error_key) LEFT OUTER JOIN oicpcuhg.ir_provider prv ON (prv.hp_cd = enc.hp_cd and prv.provider_source_cd = enc.provider_source_cd and prv.provider_nbr = enc.provider_nbr)
  35. LEFT OUTER JOIN oicpcuhg.ir_encounterspecialty esp ON (esp.encounter_key = enc.encounter_key) LEFT OUTER JOIN oicpcuhg.ir_specialty sp ON (sp.specialty_key = esp.specialty_key) LEFT OUTER JOIN oicpcuhg.ir_service svc ON (svc.encounter_key = enc.encounter_key) LEFT OUTER JOIN oicpcuhg.ir_revenue rev ON (rev.rev_cd = svc.rev_cd) LEFT OUTER JOIN oicpcuhg.ir_diagnosis diag ON (diag.encounter_key = enc.encounter_key) INNER JOIN oicpcuhg.ir_diagcd dgcd ON (dgcd.diag_cd_key = diag.diag_cd_key) INNER JOIN oicpcuhg.ir_recordstate st ON (st.rec_state_key = diag.rec_state_key) INNER JOIN oicpcuhg.ir_recordstatus sts ON (sts.rec_status_key = diag.rec_status_key) LEFT OUTER JOIN oicpcuhg.ir_diagnosiserror derror ON (derror.diagnosis_key = diag.diagnosis_key and derror.active_flg = 'Y') LEFT OUTER JOIN oicpcuhg.ir_error derr ON (derr.error_key = derror.error_key)) IR INNER JOIN oicpcuhg.umr_req_inbound umr ON (trim(umr.member_nbr) = IR.member_Nbr AND trim(umr.hhc_from_ccyymmdd) = IR.TDOS AND trim(umr.sub_mcare_mbr) = IR.HIC_NBR AND trim(umr.diag1) = IR.diag_cd)
  36. One Attack. The customer can't give you the data. They can't trust you, by law. But they can probably summarize the data: how many columns, what types, perhaps statistical summaries.
  37. Bug Replication Without Security Violation [figure: Customer with Data and You with Fake Data, each side labeled x, y]
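The approach of the previous two slides (the customer shares only a summary, and you generate fake data that matches it) can be sketched as follows. This is an illustrative Python sketch, not a tool named in the talk: the `SUMMARY` format, the `fake_rows` helper, and every value in it are hypothetical, with `member_nbr` and `etype` borrowed from the query above and `amount` invented for the example.

```python
import random

# Hypothetical summary a customer could share without exposing any rows:
# column names, types, and simple statistics.
SUMMARY = {
    "member_nbr": {"type": "int", "min": 100000, "max": 999999},
    "etype": {"type": "category", "values": {"IP": 0.2, "OP": 0.7, "ER": 0.1}},
    "amount": {"type": "float", "mean": 250.0, "stdev": 80.0},
}


def fake_rows(summary, n, seed=0):
    """Yield n synthetic rows whose columns follow the given summary."""
    rng = random.Random(seed)
    for _ in range(n):
        row = {}
        for name, spec in summary.items():
            if spec["type"] == "int":
                row[name] = rng.randint(spec["min"], spec["max"])
            elif spec["type"] == "category":
                values = list(spec["values"])
                weights = list(spec["values"].values())
                row[name] = rng.choices(values, weights=weights)[0]
            elif spec["type"] == "float":
                row[name] = rng.gauss(spec["mean"], spec["stdev"])
            else:
                raise ValueError(f"unknown column type: {spec['type']}")
        yield row


for row in fake_rows(SUMMARY, 3):
    print(row)
```

Each column here is drawn independently from a simple distribution, which is only a starting point: reproducing a bug in a query with many joins probably also needs realistic key relationships between tables.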
  38. The Upshot. So random numbers are useful. But simple distributions not so much. How can YOU generate cool data?
  39. e-book available courtesy of MapR. http://bit.ly/1jQ9QuL A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman, June 2014 (published by O'Reilly)
  40. Last October: Time Series Databases by Ted Dunning and Ellen Friedman, Oct 2014 (published by O'Reilly)
  41. Coming in February: Real World Hadoop by Ted Dunning and Ellen Friedman, Feb 2015 (published by O'Reilly)
  42. Thank you for coming today!