
Page 1: Rachel Sholder - Final Presentation

Rachel Sholder
August 5, 2015

Data Science, Analytics, & Visualization Internship

Page 2: Rachel Sholder - Final Presentation

Introduction

Page 3: Rachel Sholder - Final Presentation

• Lehigh University – rising senior
• Majoring in Mathematics with a Probability and Statistics Concentration and an Actuarial Science Minor

Courses Taken:
• Calculus I-III, Differential Equations, Linear Algebra, Principles of Economics, Financial Mathematics, Probability and Statistics, Theory of Probability, Real Analysis, Abstract Algebra, Random Processes & Applications, Fundamentals of Programming

This Upcoming Semester:
• Intro to Data Science, Statistical Computing, Complex Variables

Introduction

Page 4: Rachel Sholder - Final Presentation


• “I hope to see my classroom experiences translate to applications in the real world. After three years of college and three years of classroom learning, I want to see applications of statistical inference, bar graphs, and linear algebra.”

• “I hope to explore a compelling field I would like to work in upon graduation. This internship opportunity will be a great time to confirm my aspirations of becoming a data analyst/scientist.”

Summer Internship Aspirations

Page 5: Rachel Sholder - Final Presentation

• Medicare Stars
• Evaluation of Data Tools
• Tekathon II

Main Three Projects

Page 6: Rachel Sholder - Final Presentation

Medicare Stars

Page 7: Rachel Sholder - Final Presentation

• Take a public data source and turn it into meaningful data
• Evaluate data preparation tools
• Provide intelligence about the different insurances – helpful for healthcare payers
• Provide general insights – helpful for CMS or for consumer evaluation of plans
• Showcase Knowledgent's capabilities

Objectives of this Project

Page 8: Rachel Sholder - Final Presentation


• The Centers for Medicare and Medicaid Services (CMS) developed the Five Star Quality Rating System to help educate Medicare Advantage (MA) organizations on quality and provide transparent Medicare plan information, as well as improve the overall quality of services the Medicare plans provide. 

• In return, MA organizations receive funding from the CMS.

Medicare Stars Overview

Page 9: Rachel Sholder - Final Presentation


• For plans covering health services, the overall score for quality (Part C summary rating) of those services has 36 individual measures which are categorized into five separate domains: staying healthy, managing long-term conditions, ratings of health plan responsiveness and care, health plan member complaints and appeals, and health plan telephone customer service.

• For plans covering drug services, the overall score for quality (Part D summary rating) of those services has 17 individual measures which are categorized into four separate domains: drug plan customer service, drug plan member complaints and Medicare audit findings, member experience with drug plan, and drug pricing and patient safety.

Overall Star Rating
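To make the summary-rating idea concrete, here is a toy Python sketch that averages per-measure stars and rounds to the half-star scale shown on the next slide. The measure names are hypothetical, and CMS's actual method applies per-measure weights and adjustments that are omitted here.

```python
import pandas as pd

# Toy sketch only: the *shape* of a summary rating as an average of
# per-measure stars. Measure names are hypothetical, and CMS's real
# method applies per-measure weights and adjustments not shown here.
measure_stars = pd.Series({
    "breast_cancer_screening": 4,   # hypothetical Part C measures
    "annual_flu_vaccine": 3,
    "plan_customer_service": 5,
})
summary = round(measure_stars.mean() * 2) / 2   # round to nearest half star
print(summary)   # 4.0
```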

Page 10: Rachel Sholder - Final Presentation

• 1 Star – Poor
• 2 Stars – Below Average
• 3 Stars – Average
• 4 Stars – Above Average
• 5 Stars – Excellent

Star Ratings

Page 11: Rachel Sholder - Final Presentation

• Patients are more likely to choose a plan with a higher star rating
• Plans can receive quality bonus payments (QBP) from the federal government when they have good star ratings
• Patients can choose to leave their current plan for a 5-star plan at any point from December 8th to November 30th
• Patients will not be allowed to enroll in a plan if it received low scores for 3 straight years

Why Care?

Page 12: Rachel Sholder - Final Presentation


• A current plan’s star rating is based primarily upon data from two years prior to the current benefit year.

• An improvement from one year to the next might not be seen in a star rating.

Things to Take Note Of

Page 14: Rachel Sholder - Final Presentation


Data Folders

Page 15: Rachel Sholder - Final Presentation

The data is a disaster:
• Metrics change annually
• Companies and plans change names
• The domains the items are in change dramatically over the years (some to the point of being incomparable)
• The thresholds for stars are sometimes predetermined; other times, they are determined by a variable and a SAS clustering procedure on that variable
• The variables in the dataset have different scales and meanings
• The dataset contains 10+ different types of missing entries (e.g., plan too new/too small to be measured, not enough data available, plan not required to report measure)

Data Problems
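Since the missing entries arrive as descriptive text rather than blanks, one natural first cleanup step is mapping them all to true nulls. A minimal pandas sketch, assuming placeholder strings like the examples above (the real CMS files use 10+ variants, so this set and the file name are illustrative, not exact):

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder strings modeled on the examples above.
MISSING_CODES = [
    "Plan too new to be measured",
    "Plan too small to be measured",
    "Not enough data available",
    "Plan not required to report measure",
]

ratings = pd.read_csv("star_measures_2015.csv", dtype=str)  # hypothetical file
ratings = ratings.replace(MISSING_CODES, np.nan)  # map every code to a true null
```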

Page 16: Rachel Sholder - Final Presentation


Object-Oriented Model

Page 17: Rachel Sholder - Final Presentation

Work Flow

1. Append Datasets – Append annual files from 2008 to 2015
2. Remove Nulls – Remove data that is blank or has the word "NULL"
3. Clean Up Fields – Clean up badly formatted fields (e.g. remove quotes)
4. Derive Year Column – Create a column with the correlated year
5. Merge w/ Stars Data – Combine this dataset with the star ratings dataset
6. Merge w/ Threshold Data – Combine this dataset with the threshold data dataset
7. Normalize – Determine the key fields and pivot the other fields
8. Export Dataset – Make dataset available for use
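For illustration, here is a minimal pandas sketch of the eight steps. Every file name, column name, and join key is an assumption, since the actual work was done in a visual data preparation tool, not in code.

```python
import glob
import pandas as pd

# Steps 1 and 4: append the annual files and derive a year column.
frames = []
for path in sorted(glob.glob("stars_measures_*.csv")):   # hypothetical names
    df = pd.read_csv(path, dtype=str)
    df["year"] = path.rsplit("_", 1)[1].removesuffix(".csv")
    frames.append(df)
measures = pd.concat(frames, ignore_index=True)

# Step 2: remove data that is blank or has the word "NULL".
measures = measures.replace(["", "NULL"], pd.NA).dropna(how="any")

# Step 3: clean up badly formatted fields (e.g. strip stray quotes).
for col in measures.columns:
    measures[col] = measures[col].str.strip().str.strip('"')

# Steps 5 and 6: merge with the star-rating and threshold datasets.
stars = pd.read_csv("star_ratings.csv", dtype=str)
thresholds = pd.read_csv("thresholds.csv", dtype=str)
merged = (measures
          .merge(stars, on=["contract_id", "year"], how="left")
          .merge(thresholds, on=["measure_id", "year"], how="left"))

# Step 7: normalize -- keep the key fields, pivot the other fields long.
tidy = merged.melt(id_vars=["contract_id", "year", "measure_id"],
                   var_name="field", value_name="value")

# Step 8: export the dataset for use.
tidy.to_csv("stars_clean.csv", index=False)
```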

Page 18: Rachel Sholder - Final Presentation


Messy Dataset

Page 19: Rachel Sholder - Final Presentation


Cleaned Dataset

Page 20: Rachel Sholder - Final Presentation

• Payer – How is my plan performing?
• Consumer – Of the plans offered in my area, which plan has the best overall performance?
• Industry Analyst – Can we predict how CMS will change their star rating thresholds for next year?
• Prospective Client – How does this analysis showcase your skills?

Have fun, Jack!!

Next Step – Tableau Visualizations

Page 21: Rachel Sholder - Final Presentation

Evaluation of Data Tools

Page 22: Rachel Sholder - Final Presentation

Initial Research

Online Data Tools | Website | Availability
Trifacta | http://www.trifacta.com/trial/ | 14 day free trial
Paxata | http://www.paxata.com/schedule-a-demo | Schedule a demo
Alteryx | http://pages.alteryx.com/FreeTrial-v1.html?sc=Web%20Direct&scd=direct&lsm=Web%20Direct&lsd=direct&src=sc105 | 14 day free trial
Tamr | http://www.tamr.com/schedule-demo/ | Schedule a demo
ClearStory | http://www.clearstorydata.com/demo/ | Request a trial or demo
OpenRefine | https://github.com/OpenRefine/OpenRefine | Free
Data Wrangler | http://vis.stanford.edu/wrangler/ | Free
Lavastorm | http://www.lavastorm.com/ | 30 day trial
Datameer | http://www.datameer.com/ | 14 day free trial
Data Preparator | http://www.datapreparator.com/ | Free

Page 23: Rachel Sholder - Final Presentation


Six Data Preparation Tools

Page 24: Rachel Sholder - Final Presentation

Main Focus on Three

Please note that this is not a formal evaluation of the tools.

Page 25: Rachel Sholder - Final Presentation

• Founded: 2010
• Headquarters: Irvine, CA
• Description: Alteryx is the leader in data blending and advanced analytics.
• Categories: Predictive Analytics, Data Integration, Analytics
• Funding Received: $78M
• Price:
  - Personal/Desktop: $3,995 per user
  - Personal/Desktop with Spatial: $12,995 per user
  - Personal/Desktop with Data (TomTom, Experian, Dun & Bradstreet, US Census): $29,995
  - Server: $58,500 per server

Page 26: Rachel Sholder - Final Presentation

Page 27: Rachel Sholder - Final Presentation

Page 28: Rachel Sholder - Final Presentation

Page 29: Rachel Sholder - Final Presentation

Page 30: Rachel Sholder - Final Presentation

1. Ease of Use – very easy to use
2. Interface Sophistication – not very advanced, but simplicity here isn't a bad thing
3. Customer Support – excellent
4. Self-Service Capability – easy to use on own
5. Learning Curve – very easy to learn
6. Tool Syntax – none
7. Accessibility – easy to download trial and start using
8. Time Spent Using Tool – a few days (about 3)
9. Best Features – simple interface that is easy to navigate and easy to self-learn
10. Shortcomings – encountered none

Page 31: Rachel Sholder - Final Presentation

• Founded: 1999
• Headquarters: Boston, MA
• Description: Business Data Analytics
• Categories: Business Intelligence, Big Data Analytics, Enterprise
• Funding Received: N/A
• Price (unverified):
  - Personal/Desktop: $3,500 per user
  - Server: $150,000+ per server

Lavastorm

Page 32: Rachel Sholder - Final Presentation

Page 33: Rachel Sholder - Final Presentation

Page 34: Rachel Sholder - Final Presentation

Page 35: Rachel Sholder - Final Presentation

Page 36: Rachel Sholder - Final Presentation

Page 37: Rachel Sholder - Final Presentation

1. Ease of Use – definitely not hard to use, but hard to figure out at first
2. Interface Sophistication – complex and advanced
3. Customer Support – superior
4. Self-Service Capability – needed help from Lavastorm experts
5. Learning Curve – definitely a steep learning curve
6. Tool Syntax – a slight programming background is helpful
7. Accessibility – easy to download trial and start using
8. Time Spent Using Tool – about a week total
9. Best Features – ability to see the data flow clearly, see how the data moves through the process flow, and see the data at any point in the process
10. Shortcomings – hard to figure out on one's own

Page 38: Rachel Sholder - Final Presentation

• Founded: 2012
• Headquarters: San Francisco, CA
• Description: Trifacta is a software company developing productivity platforms for data analysis, management, and manipulation.
• Categories: Data Preparation, Data Cleanup, Data Wrangling, Big Data in Hadoop
• Funding Received: $41.3M
• Price: Unavailable online

Page 39: Rachel Sholder - Final Presentation

Page 40: Rachel Sholder - Final Presentation

Page 41: Rachel Sholder - Final Presentation

Page 42: Rachel Sholder - Final Presentation

Page 43: Rachel Sholder - Final Presentation

Page 44: Rachel Sholder - Final Presentation

1. Ease of Use – definitely not as easy as Alteryx, but not as complex as Lavastorm
2. Interface Sophistication – excellent data profiling and predictive capabilities
3. Customer Support – response time averaged about a day
4. Self-Service Capability – needed help frequently from Trifacta experts and our IT department
5. Learning Curve – takes a little getting used to
6. Tool Syntax – laid out for you in the tool
7. Accessibility – easy to access
8. Time Spent Using Tool – weeks
9. Best Features – its interface: data profiling histograms, predictive capabilities, suggestion cards
10. Shortcomings – did a lot of debugging and troubleshooting

Page 45: Rachel Sholder - Final Presentation

1. Ease of Use
2. Interface Sophistication
3. Customer Support
4. Self-Service Capability
5. Learning Curve
6. Tool Syntax
7. Accessibility
8. Time Spent Using Tool
9. Best Features
10. Shortcomings

Evaluation of the Tools

Please note that this is not a formal evaluation with equivalent test cases.

Page 46: Rachel Sholder - Final Presentation

Tekathon II

Page 47: Rachel Sholder - Final Presentation

• Goal: provide real-world use cases using Trifacta

Purpose of Tekathon II

Judging Criteria:

Idea (30%)
• Use Case (does it solve an industry problem?)
• Innovative (new way of doing things, differentiation against competition)
• Timing (is there market interest/readiness for the solution?)

Business Model (20%)
• Target buyers/users/decision makers and stakeholders have been identified
• Source for funding identified (top line and bottom line benefit)

Presentation (20%)
• How well did the team present their solution?
• How well did the team educate the audience on tool capability and their approach?

Execution (30%)
• Leadership and Teamwork – how well did the team work together?
• Efficient – how resourceful was the team?
• Effective – did the team deliver?

Page 48: Rachel Sholder - Final Presentation

The HLS Team

• Coordination – Jeff Evernham, Gaurav Suri
• Demonstration – Rachael Fahey
• SMEs – Ram Mohan, Saar Golde
• Data Wrangling – Rachel Sholder, Dip Kharod
• Visualization – Jose Garcia, Amila Bewtra
• IT Support – Drillon Berisha

Page 49: Rachel Sholder - Final Presentation

• Pharmacovigilance – the detection, assessment, understanding, and prevention of adverse effects of drugs
• Adverse Event – any unfavorable and unintended sign, symptom, or disease associated with the use of a medicinal product
• The FDA collects reports of adverse events – FAERS data
• Over 7 million reports since 1997 – the world's largest database of adverse events

Our Focus

Page 50: Rachel Sholder - Final Presentation

FAERS Data

• Seven different files released quarterly:
  1. Demographics
  2. Drugs
  3. Reaction
  4. Therapy
  5. Indication
  6. Outcome
  7. Reporting Source
• Related by ISR (individual safety report) from 2004 through 2012 Q3 and by primary ID from 2012 Q4 to present
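For illustration, here is a minimal pandas sketch of relating three of these files for one quarter. The dollar-delimited layout and the primaryid key follow the published FAERS ASCII files, but the file names and details should be treated as assumptions.

```python
import pandas as pd

# Hypothetical file names for one quarter of FAERS ASCII data.
demo = pd.read_csv("DEMO15Q1.txt", sep="$", dtype=str)
drug = pd.read_csv("DRUG15Q1.txt", sep="$", dtype=str)
reac = pd.read_csv("REAC15Q1.txt", sep="$", dtype=str)

# One report can list many drugs and many reactions, so these
# left joins fan out to one row per (report, drug, reaction).
cases = (demo
         .merge(drug, on="primaryid", how="left")
         .merge(reac, on="primaryid", how="left"))
```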

Page 51: Rachel Sholder - Final Presentation

Data Problems

• Separate quarters must be appended
• Data is normalized across 7 tables
• Large number of missing fields
• Drug name variations
• New information replaces earlier reports
• Duplicate entries
• Periodic format changes
• Data entry errors and typos
• Units are inconsistent
• Same active ingredient in different drugs

Page 52: Rachel Sholder - Final Presentation

• Preliminary work in Trifacta
• Created clean datasets used for the team's visualizations
• Lots of work in Trifacta:
  - Appended datasets
  - Deduplicated records
  - Joined datasets
  - Cleaned up fields
  - Enriched data with another dataset
  - Exported wrangled dataset
• Lots of troubleshooting:
  - Standardize function didn't work
  - Unable to see job results

My Contributions
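As a sketch of what the dedup step amounts to outside Trifacta: since updated versions of a case replace earlier reports, keep only the newest row per case. The column names follow the FAERS demographics file but should be treated as assumptions here.

```python
import pandas as pd

# Keep only the newest report per case, since updated versions
# of a case replace earlier submissions.
demo = pd.read_csv("demo_all_quarters.csv", dtype=str)  # hypothetical file
latest = (demo
          .sort_values("fda_dt")                    # oldest first
          .drop_duplicates("caseid", keep="last"))  # newest version wins
```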

Page 53: Rachel Sholder - Final Presentation

• Trifacta works on a sample, not on full datasets
• Sophisticated interface
• Software does not always work
• A lot can be accomplished in one day

Lessons Learned

Page 54: Rachel Sholder - Final Presentation

Conclusions

Page 55: Rachel Sholder - Final Presentation

• What data preparation actually is:
  - Data preparation and cleanup take 80% of the time
  - Data is messy
• Real-life applications (ex: exploratory statistics)
• Understand differences and connections between data science and data analytics
• Glimpse into a data-related career

Lessons Learned from Internship

Page 56: Rachel Sholder - Final Presentation

• Free time options (ex: learn R)
• More of a shadowing opportunity
• Someone in person
• Notify the company about interns

Improvements to the Internship Position

Page 57: Rachel Sholder - Final Presentation

• Not treated like an intern
• Inclusion in events (ex: Tekathon)
• Mathematics and statistics applications
• Good preparation for my Intro to Data Science class
• The people
• And, last but not least… Pizza Fridays

Favorite Parts of the Internship Program

Page 58: Rachel Sholder - Final Presentation

• Three Takeaways from my Internship (tomorrow)
• Evaluation of Data Tools
• How to Become a Data Scientist
• Tekathon II Overview
• Tekathon II – HLS Team

If you have not already, make sure you check out:
• Meet the Interns: Rachel Sholder
• Machine Learning Lessons

Blog Posts Coming Soon

Page 59: Rachel Sholder - Final Presentation

Thank you!