rachel sholder - final presentation
Rachel Sholder
August 5, 2015
Data Science, Analytics, & Visualization Internship
Introduction
3
• Lehigh University – rising senior
• Majoring in Mathematics with a Probability and Statistics Concentration and an Actuarial Science Minor
Courses Taken:
• Calculus I-III, Differential Equations, Linear Algebra, Principles of Economics, Financial Mathematics, Probability and Statistics, Theory of Probability, Real Analysis, Abstract Algebra, Random Processes & Applications, Fundamentals of Programming
This Upcoming Semester:
• Intro to Data Science, Statistical Computing, Complex Variables
Introduction
4
• “I hope to see my classroom experiences translate to applications in the real world. After three years of college and three years of classroom learning, I want to see applications of statistical inference, bar graphs, and linear algebra.”
• “I hope to explore a compelling field I would like to work in upon graduation. This internship opportunity will be a great time to confirm my aspirations of becoming a data analyst/scientist.”
Summer Internship Aspirations
5
• Medicare Stars
• Evaluation of Data Tools
• Tekathon II
Main Three Projects
Medicare Stars
7
• Take a public data source and turn it into meaningful data
• Evaluate data preparation tools
• Provide intelligence about the different insurance plans (helpful for healthcare payers)
• Provide general insights (helpful for CMS or for consumers evaluating plans)
• Showcase Knowledgent’s capabilities
Objectives of this Project
8
• The Centers for Medicare and Medicaid Services (CMS) developed the Five Star Quality Rating System to help educate Medicare Advantage (MA) organizations on quality and provide transparent Medicare plan information, as well as improve the overall quality of services the Medicare plans provide.
• In return, MA organizations receive funding from CMS.
Medicare Stars Overview
9
• For plans covering health services, the overall score for quality (Part C summary rating) of those services has 36 individual measures which are categorized into five separate domains: staying healthy, managing long-term conditions, ratings of health plan responsiveness and care, health plan member complaints and appeals, and health plan telephone customer service.
• For plans covering drug services, the overall score for quality (Part D summary rating) of those services has 17 individual measures which are categorized into four separate domains: drug plan customer service, drug plan member complaints and Medicare audit findings, member experience with drug plan, and drug pricing and patient safety.
Overall Star Rating
10
1 Star Poor
2 Star Below Average
3 Star Average
4 Star Above Average
5 Star Excellent
Star Ratings
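As a rough illustration of how a measure value might be turned into one of the five star levels above, here is a minimal Python sketch. The cut points are made up for the example (they are not CMS thresholds), and the sketch assumes a measure where a higher value is better.

```python
from bisect import bisect_right

# Labels taken from the slide above.
STAR_LABELS = {
    1: "Poor",
    2: "Below Average",
    3: "Average",
    4: "Above Average",
    5: "Excellent",
}

def stars_for(value, cut_points):
    """Map a measure value to 1-5 stars.

    cut_points holds four ascending lower bounds for 2, 3, 4 and 5 stars.
    Assumes higher values are better (an assumption, not true of every measure).
    """
    return 1 + bisect_right(cut_points, value)

# Hypothetical cut points, for illustration only.
example_cuts = [40, 55, 70, 85]

rating = stars_for(72, example_cuts)
print(rating, STAR_LABELS[rating])
```

A value that sits exactly on a cut point is given the higher star level in this sketch.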
11
• Patients are more likely to choose a plan with a higher star rating
• Plans can receive quality bonus payments (QBP) from the federal government when they have good star ratings
• Patients can choose to leave their current plan for a 5-star plan at any point from December 8th through November 30th of the following year
• Patients will not be allowed to enroll in a plan if it received low scores for 3 straight years
Why Care?
12
• A current plan’s star rating is based primarily upon data from two years prior to the current benefit year.
• An improvement from one year to the next might not be seen in a star rating.
Things to Take Note Of
13
• All of the data can be found here: http://www.cms.gov/Medicare/Prescription-Drug-Coverage/PrescriptionDrugCovGenIn/PerformanceData.html.
Data Inventory
14
Data Folders
15
The data is a disaster:
• Metrics change annually
• Companies and plans change names
• The domains the items are in change dramatically over the years (some to the point where they are no longer comparable)
• The thresholds for stars are sometimes predetermined. Other times, they are determined by running a SAS clustering procedure on a variable
• The variables in the dataset have different scales and meanings
• The dataset contains 10+ different types of missing entries (ex: plan too new/too small to be measured, not enough data available, plan not required to report measure)
Data Problems
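One way to tame the many kinds of missing entries is to map them all to a single "missing" value while keeping the reason. The Python sketch below is only an illustration: the marker strings are hypothetical stand-ins, and the real files use their own wording.

```python
# Hypothetical marker strings; the actual CMS files use their own wording.
MISSING_MARKERS = {
    "",
    "null",
    "n/a",
    "plan too new to be measured",
    "plan too small to be measured",
    "not enough data available",
    "plan not required to report measure",
}

def parse_measure(raw):
    """Return (value, reason).

    value is a float when the field holds a number, otherwise None.
    reason is None for a number, otherwise a short note on why it is missing.
    """
    text = raw.strip().strip('"')
    if text.lower() in MISSING_MARKERS:
        return None, text.lower() or "blank"
    try:
        return float(text.rstrip("%")), None
    except ValueError:
        # Keep unknown text visible instead of silently dropping it.
        return None, "unrecognized: " + text

print(parse_measure(" 85% "))
print(parse_measure("Plan too new to be measured"))
```

Keeping the reason next to the value means a later analysis can still tell "too new" apart from "not required to report".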
16
Object-Oriented Model
17
Work Flow
Step | Description
Append Datasets | Append annual files from 2008 to 2015
Normalize | Determine the key fields and pivot the other fields
Derive Year Column | Create a column with the corresponding year
Merge w/ Stars Data | Combine this dataset with the star ratings dataset
Remove Nulls | Remove data that is blank or has the word “NULL”
Merge w/ Threshold Data | Combine this dataset with the threshold data dataset
Clean Up Fields | Clean up badly formatted fields (e.g. remove quotes)
Export Dataset | Make dataset available for use
[Figure: flow diagram of the eight workflow steps listed above, numbered 1 through 8]
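The workflow steps above were built in a data preparation tool. As a rough illustration of what the steps do, here is a minimal plain-Python sketch. The file names, the column names (contract_id, C01, overall_stars) and the order in which the steps are chained are assumptions for the example, not the actual workflow.

```python
import re

def derive_year(filename):
    """Derive Year Column: pull a four-digit year out of a name like 'stars_2012.csv'."""
    match = re.search(r"(20\d{2})", filename)
    return int(match.group(1)) if match else None

def append_datasets(files):
    """Append Datasets: files maps a file name to its rows; tag each row with its year."""
    rows = []
    for name, file_rows in files.items():
        for row in file_rows:
            rows.append({**row, "year": derive_year(name)})
    return rows

def clean_field(value):
    """Clean Up Fields: strip whitespace and stray quotes."""
    return value.strip().strip('"').strip()

def normalize(rows, key_fields=("contract_id", "year")):
    """Normalize: keep the key fields and pivot every other field into long form."""
    long_rows = []
    for row in rows:
        keys = {k: row[k] for k in key_fields}
        for field, raw in row.items():
            if field in key_fields:
                continue
            value = clean_field(raw)
            if value == "" or value.upper() == "NULL":  # Remove Nulls
                continue
            long_rows.append({**keys, "measure": field, "value": value})
    return long_rows

def merge(rows, lookup, on):
    """Merge: left-join rows with a lookup dict keyed by the tuple of `on` fields."""
    return [{**row, **lookup.get(tuple(row[k] for k in on), {})} for row in rows]

# Tiny made-up inputs.
files = {
    "stars_2014.csv": [{"contract_id": "H0001", "C01": '"72"', "C02": "NULL"}],
    "stars_2015.csv": [{"contract_id": "H0001", "C01": " 75 ", "C02": "60"}],
}
long_rows = normalize(append_datasets(files))
stars = {("H0001", 2015): {"overall_stars": 4}}
merged = merge(long_rows, stars, on=("contract_id", "year"))
for row in merged:
    print(row)
```

The same merge helper could be reused for the threshold data by passing a lookup keyed on the measure and year.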
18
Messy Dataset
19
Cleaned Dataset
20
• Payer: How is my plan performing?
• Consumer: Of the plans offered in my area, which plan has the best overall performance?
• Industry Analyst: Can we predict how CMS will change their star rating thresholds for next year?
• Prospective Client: How does this analysis showcase your skills?
Have fun, Jack!!
Next Step – Tableau Visualizations
Evaluation of Data Tools
22
Initial Research
Online Data Tools | Website | Availability
Trifacta | http://www.trifacta.com/trial/ | 14 day free trial
Paxata | http://www.paxata.com/schedule-a-demo | Schedule a demo
Alteryx | http://pages.alteryx.com/FreeTrial-v1.html?sc=Web%20Direct&scd=direct&lsm=Web%20Direct&lsd=direct&src=sc105 | 14 day free trial
Tamr | http://www.tamr.com/schedule-demo/ | Schedule a demo
ClearStory | http://www.clearstorydata.com/demo/ | Request a trial or demo
OpenRefine | https://github.com/OpenRefine/OpenRefine | Free
Data Wrangler | http://vis.stanford.edu/wrangler/ | Free
Lavastorm | http://www.lavastorm.com/ | 30 day trial
Datameer | http://www.datameer.com/ | 14 day free trial
Data Preparator | http://www.datapreparator.com/ | Free
23
Six Data Preparation Tools
24
Main Focus on Three
Please note that this is not a formal evaluation of the tools.
25
• Founded: 2010
• Headquarters: Irvine, CA
• Description: Alteryx is the leader in data blending and advanced analytics.
• Categories: Predictive Analytics, Data Integration, Analytics
• Funding Received: $78M
• Price:
- Personal/Desktop: $3,995 per user
- Personal/Desktop with Spatial: $12,995 per user
- Personal/Desktop with Data (TomTom, Experian, Dun & Bradstreet, US Census): $29,995
- Server: $58,500 per server
26
27
28
29
30
1. Ease of Use – very easy to use
2. Interface Sophistication – not very advanced, but simplicity here isn’t a bad thing
3. Customer Support – excellent
4. Self-Service Capability – easy to use on own
5. Learning Curve – very easy to learn
6. Tool Syntax – none
7. Accessibility – easy to download trial and start using
8. Time Spent Using Tool – a few days (about 3)
9. Best Features – simple interface that is easy to navigate and easy to self-learn
10. Shortcomings – encountered none
31
• Founded: 1999
• Headquarters: Boston, MA
• Description: Business Data Analytics
• Categories: Business Intelligence, Big Data Analytics, Enterprise
• Funding Received: N/A
• Price (unverified):
- Personal/Desktop: $3,500 per user
- Server: $150,000+ per server
32
33
34
35
36
37
1. Ease of Use – definitely not hard to use, but hard to figure out at first
2. Interface Sophistication – complex and advanced
3. Customer Support – superior
4. Self-Service Capability – needed help from Lavastorm experts
5. Learning Curve – definitely a strong learning curve
6. Tool Syntax – slight programming background is helpful
7. Accessibility – easy to download trial and start using
8. Time Spent Using Tool – about a week total
9. Best Features – ability to see the data flow clearly, see how the data moves through the process flow, and see the data at any point in the process
10. Shortcomings – hard to figure out on own
38
• Founded: 2012
• Headquarters: San Francisco, CA
• Description: Trifacta is a software company developing productivity platforms for data analysis, management, and manipulation.
• Categories: Data Preparation, Data Cleanup, Data Wrangling, Big Data in Hadoop
• Funding Received: $41.3M
• Price: Unavailable online
39
40
41
42
43
44
1. Ease of Use – definitely not as easy as Alteryx, but not as complex as Lavastorm
2. Interface Sophistication – excellent data profiling and predictive capabilities
3. Customer Support – averaged about a day response time
4. Self-Service Capability – needed help frequently from Trifacta experts and our IT department
5. Learning Curve – takes a little getting used to
6. Tool Syntax – laid out for you in the tool
7. Accessibility – easy to access
8. Time Spent Using Tool – weeks
9. Best Features – its interface: data profiling histograms, predictive capabilities, suggestion cards
10. Shortcomings – did a lot of debugging and troubleshooting
45
1. Ease of Use
2. Interface Sophistication
3. Customer Support
4. Self-Service Capability
5. Learning Curve
6. Tool Syntax
7. Accessibility
8. Time Spent Using Tool
9. Best Features
10. Shortcomings
Evaluation of the Tools
Please note that this is not a formal evaluation with equivalent test cases.
Tekathon II
47
• Goal: provide real-world use cases using Trifacta
Purpose of Tekathon II
Idea (30%)
• Use Case (does it solve an industry problem?)
• Innovative (new way of doing things, differentiation against competition)
• Timing (is there market interest/readiness for the solution?)
Execution (30%)
• Leadership and Teamwork: how well did the team work together?
• Efficient: how resourceful was the team?
• Effective: did the team deliver?
Business Model (20%)
• Target buyers/users/decision makers and stakeholders have been identified
• Source for funding identified (top line and bottom line benefit)
Presentation (20%)
• How well did the team present their solution?
• How well did the team educate the audience on tool capability and their approach?
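To show how the four judging weights (30% / 30% / 20% / 20%) combine into one score, here is a small Python sketch. The 0-10 scale and the example scores are assumptions for illustration; the slides do not say how judges scored each category.

```python
# Weights from the judging criteria above.
WEIGHTS = {
    "idea": 0.30,
    "execution": 0.30,
    "business_model": 0.20,
    "presentation": 0.20,
}

def weighted_score(scores):
    """Combine per-category scores into a single weighted total."""
    # Guard against a weight table that no longer sums to 100%.
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[category] * scores[category] for category in WEIGHTS)

# Hypothetical scores on an assumed 0-10 scale.
example = {"idea": 8, "execution": 7, "business_model": 6, "presentation": 9}
print(round(weighted_score(example), 2))
```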
48
The HLS Team
• Coordination
- Jeff Evernham
- Gaurav Suri
• Demonstration
- Rachael Fahey
• SMEs
- Ram Mohan
- Saar Golde
• Data Wrangling
- Rachel Sholder
- Dip Kharod
• Visualization
- Jose Garcia
- Amila Bewtra
• IT Support
- Drillon Berisha
49
• Pharmacovigilance: the detection, assessment, understanding, and prevention of adverse effects of drugs
• Adverse Event: any unfavorable and unintended sign, symptom, or disease associated with the use of a medicinal product
• The FDA publishes adverse event reports as FAERS (FDA Adverse Event Reporting System) data
• Over 7 million reports since 1997 – world’s largest database of adverse events
Our Focus
50
• Seven different files released quarterly
1. Demographics
2. Drugs
3. Reaction
4. Therapy
5. Indication
6. Outcome
7. Reporting Source
FAERS Data
• The files are related by ISR (individual safety report) number for 2004 through 2012 Q3, and by primary ID from 2012 Q4 to present
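Because the linking field changes in 2012 Q4, any join across the seven files has to pick its key by release. A minimal Python sketch of that idea follows; the column names "isr" and "primaryid" and the sample rows are assumptions for the example.

```python
def link_key(year, quarter):
    """Return the column that links the files for a given quarterly release."""
    # Tuple comparison: (2012, 4) and later use the primary ID.
    return "primaryid" if (year, quarter) >= (2012, 4) else "isr"

def join_on_report(demographics, drugs, year, quarter):
    """Attach each report's drug rows to its demographics row."""
    key = link_key(year, quarter)
    drugs_by_report = {}
    for drug in drugs:
        drugs_by_report.setdefault(drug[key], []).append(drug)
    return [
        {**demo, "drugs": drugs_by_report.get(demo[key], [])}
        for demo in demographics
    ]

# Made-up rows for a release before the switch.
demo_rows = [{"isr": "100", "age": "54"}, {"isr": "101", "age": "37"}]
drug_rows = [
    {"isr": "100", "drugname": "ASPIRIN"},
    {"isr": "100", "drugname": "TYLENOL"},
]
joined = join_on_report(demo_rows, drug_rows, 2011, 2)
print(joined)
```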
51
Data Problems
• Separate quarters must be appended
• Data is normalized across 7 tables
• Large number of missing fields
• Drug name variations
• New information replaces earlier reports
• Duplicate entries
• Periodic format changes
• Data entry errors and typos
• Units are inconsistent
• Same active ingredient in different drugs
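Drug name variations and shared active ingredients are the kind of problem a small normalization step can reduce. The Python sketch below is an illustration only: the alias table is a tiny hand-made example, and a real cleanup would need a proper drug dictionary.

```python
import re

# Tiny hypothetical alias table mapping names to one active ingredient.
ALIASES = {
    "acetaminophen": "acetaminophen",
    "paracetamol": "acetaminophen",
    "tylenol": "acetaminophen",
}

def normalize_drug_name(raw):
    """Lowercase, drop dose strengths and punctuation, then apply the alias table."""
    name = raw.lower()
    # Remove dose strengths such as "500 mg" or "81mg".
    name = re.sub(r"\b\d+(\.\d+)?\s*(mg|mcg|g|ml)\b", " ", name)
    # Remove anything that is not a letter or a space.
    name = re.sub(r"[^a-z ]", " ", name)
    # Collapse repeated spaces.
    name = re.sub(r"\s+", " ", name).strip()
    return ALIASES.get(name, name)

for raw in ["TYLENOL 500 MG.", "Paracetamol", " Aspirin, 81mg"]:
    print(repr(raw), "->", normalize_drug_name(raw))
```

Names that are not in the alias table pass through in their cleaned form, so nothing is lost when the table is incomplete.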
52
• Preliminary work in Trifacta
• Created clean datasets used for team’s visualizations
• Lots of work in Trifacta:
- Appended datasets
- Deduplicated records
- Joined datasets
- Cleaned up fields
- Enriched data with another dataset
- Exported wrangled dataset
• Lots of troubleshooting:
- Standardize function didn’t work
- Unable to see job results
My Contributions
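The deduplication step was done in Trifacta. As an illustration of the idea in plain Python (not Trifacta syntax), the sketch below keeps only the most recent report for each case, which also drops exact duplicates. The field names caseid and report_date are assumptions for the example.

```python
def keep_latest(reports, case_key="caseid", version_key="report_date"):
    """Keep one report per case: the one with the largest version_key.

    Dates are assumed to be YYYYMMDD strings, so string comparison
    orders them correctly.
    """
    latest = {}
    for report in reports:
        current = latest.get(report[case_key])
        if current is None or report[version_key] > current[version_key]:
            latest[report[case_key]] = report
    return list(latest.values())

# Made-up rows: case A has a follow-up report, case B is an exact duplicate.
reports = [
    {"caseid": "A", "report_date": "20130105", "outcome": "hospitalization"},
    {"caseid": "A", "report_date": "20130220", "outcome": "recovered"},
    {"caseid": "B", "report_date": "20130110", "outcome": "other"},
    {"caseid": "B", "report_date": "20130110", "outcome": "other"},
]
deduped = keep_latest(reports)
print(deduped)
```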
53
• Trifacta works on a sample, not on full datasets
• Sophisticated interface
• Software does not always work
• A lot can be accomplished in one day
Lessons Learned
Conclusions
55
• What data preparation actually is
- Data preparation and cleanup takes 80% of the time
- Data is messy
• Real-life applications (ex: exploratory statistics)
• Understand differences and connections between data science and data analytics
• Glimpse into a data related career
Lessons Learned from Internship
56
• Free time options (ex: learn R)
• More of a shadowing opportunity
• Someone in person
• Notify company about interns
Improvements to the Internship Position
57
• Not treated like an intern
• Inclusion in events (ex: Tekathon)
• Mathematics and statistics applications
• Good preparation for my Intro to Data Science class
• The people
• And, last but not least… Pizza Fridays
Favorite Parts of the Internship Program
58
• Three Takeaways from my Internship (tomorrow)
• Evaluation of Data Tools
• How to Become a Data Scientist
• Tekathon II Overview
• Tekathon II – HLS Team
If you have not already, make sure you check out:
• Meet the Interns: Rachel Sholder
• Machine Learning Lessons
Blog Posts Coming Soon
Thank you!