Rachel Sholder - Final Presentation
![Page 1: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/1.jpg)
Rachel Sholder
August 5, 2015
Data Science, Analytics, & Visualization Internship
![Page 2: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/2.jpg)
Introduction
![Page 3: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/3.jpg)
Introduction
• Lehigh University – rising senior
• Majoring in Mathematics with a Probability and Statistics Concentration and an Actuarial Science Minor
Courses Taken:
• Calculus I-III, Differential Equations, Linear Algebra, Principles of Economics, Financial Mathematics, Probability and Statistics, Theory of Probability, Real Analysis, Abstract Algebra, Random Processes & Applications, Fundamentals of Programming
This Upcoming Semester:
• Intro to Data Science, Statistical Computing, Complex Variables
![Page 4: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/4.jpg)
Summer Internship Aspirations
• “I hope to see my classroom experiences translate to applications in the real world. After three years of college and three years of classroom learning, I want to see applications of statistical inference, bar graphs, and linear algebra.”
• “I hope to explore a compelling field I would like to work in upon graduation. This internship opportunity will be a great time to confirm my aspirations of becoming a data analyst/scientist.”
![Page 5: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/5.jpg)
Main Three Projects
• Medicare Stars
• Evaluation of Data Tools
• Tekathon II
![Page 6: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/6.jpg)
Medicare Stars
![Page 7: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/7.jpg)
Objectives of this Project
• Take a public data source and turn it into meaningful data
• Evaluate data preparation tools
• Provide intelligence about the different insurance plans (helpful for healthcare payers)
• Provide general insights (helpful for CMS or for consumers evaluating plans)
• Showcase Knowledgent’s capabilities
![Page 8: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/8.jpg)
Medicare Stars Overview
• The Centers for Medicare and Medicaid Services (CMS) developed the Five Star Quality Rating System to help educate Medicare Advantage (MA) organizations on quality and provide transparent Medicare plan information, as well as improve the overall quality of services the Medicare plans provide.
• In return, MA organizations receive funding from CMS.
![Page 9: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/9.jpg)
Overall Star Rating
• For plans covering health services, the overall score for quality (Part C summary rating) of those services has 36 individual measures which are categorized into five separate domains: staying healthy, managing long-term conditions, ratings of health plan responsiveness and care, health plan member complaints and appeals, and health plan telephone customer service.
• For plans covering drug services, the overall score for quality (Part D summary rating) of those services has 17 individual measures which are categorized into four separate domains: drug plan customer service, drug plan member complaints and Medicare audit findings, member experience with drug plan, and drug pricing and patient safety.
![Page 10: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/10.jpg)
Star Ratings
• 1 Star: Poor
• 2 Stars: Below Average
• 3 Stars: Average
• 4 Stars: Above Average
• 5 Stars: Excellent
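The five levels above can be held in a simple lookup. The sketch below also shows one way a measure score might be mapped to a star level from cut points. The cut points, the function name `stars_for`, and the assumption that higher scores are better are all illustrative; they are not taken from CMS or from the slides.

```python
from bisect import bisect_right
from typing import Sequence

# Labels as listed on the slide.
STAR_LABELS = {
    1: "Poor",
    2: "Below Average",
    3: "Average",
    4: "Above Average",
    5: "Excellent",
}


def stars_for(score: float, cut_points: Sequence[float]) -> int:
    """Map a score to 1-5 stars.

    cut_points holds four ascending lower bounds for 2, 3, 4, and 5 stars
    (hypothetical values; assumes a higher score is better).
    """
    if len(cut_points) != 4 or list(cut_points) != sorted(cut_points):
        raise ValueError("expected four ascending cut points")
    # Count how many lower bounds the score meets or exceeds.
    return 1 + bisect_right(cut_points, score)


cuts = [50, 60, 70, 80]  # made-up thresholds for illustration
for score in (45, 70, 95):
    level = stars_for(score, cuts)
    print(score, level, STAR_LABELS[level])
```

A measure where a lower value is better (for example, a complaint rate) would need the comparison reversed.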
![Page 11: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/11.jpg)
Why Care?
• Patients are more likely to choose a plan with a higher star rating
• Plans can receive quality bonus payments (QBP) from the federal government when they have good star ratings
• Patients can choose to leave their current plan for a 5-star plan at any point from December 8th to November 30th
• Patients will not be allowed to enroll in a plan if it received low scores for 3 straight years
![Page 12: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/12.jpg)
Things to Take Note Of
• A current plan’s star rating is based primarily upon data from two years prior to the current benefit year.
• An improvement from one year to the next might not be seen in a star rating.
![Page 13: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/13.jpg)
Data Inventory
• All of the data can be found here: http://www.cms.gov/Medicare/Prescription-Drug-Coverage/PrescriptionDrugCovGenIn/PerformanceData.html
![Page 14: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/14.jpg)
Data Folders
![Page 15: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/15.jpg)
Data Problems
The data is a disaster:
• Metrics change annually
• Companies and plans change names
• The domains the items are in change dramatically over the years (some to the point where they are incomparable)
• The thresholds for stars are sometimes predetermined; other times they are determined by applying a SAS clustering procedure to a variable
• The variables in the dataset have different scales and meanings
• The dataset contains 10+ different types of missing entries (e.g., plan too new/too small to be measured, not enough data available, plan not required to report measure)
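The many kinds of missing entries can be handled by mapping each sentinel phrase to a reason code before any numeric parsing. This is a minimal sketch: the phrases in `MISSING_REASONS` paraphrase the examples on the slide and are assumptions, since the real CMS files use their own wording, and `parse_measure` is a hypothetical helper name.

```python
from typing import Optional, Tuple

# Hypothetical sentinel phrases; the actual CMS files word these differently.
MISSING_REASONS = {
    "plan too new to be measured": "too_new",
    "plan too small to be measured": "too_small",
    "not enough data available": "insufficient_data",
    "plan not required to report measure": "not_required",
    "null": "blank",
    "": "blank",
}


def parse_measure(raw: str) -> Tuple[Optional[float], Optional[str]]:
    """Return (value, missing_reason); one of the two is always None."""
    text = raw.strip().strip('"').strip()
    reason = MISSING_REASONS.get(text.lower())
    if reason is not None:
        return None, reason
    try:
        # Accept plain numbers and percentages such as "85%".
        return float(text.rstrip("%")), None
    except ValueError:
        # Keep unknown text visible instead of silently dropping it.
        return None, "unrecognized"


print(parse_measure("85%"))
print(parse_measure("Not enough data available"))
```

Keeping the reason alongside the value preserves the distinction between, say, a plan that is too new and one that was not required to report.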
![Page 16: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/16.jpg)
Object-Oriented Model
![Page 17: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/17.jpg)
Work Flow

| Step | Description |
| --- | --- |
| Append Datasets | Append annual files from 2008 to 2015 |
| Normalize | Determine the key fields and pivot the other fields |
| Derive Year Column | Create a column with the corresponding year |
| Merge w/ Stars Data | Combine this dataset with the star ratings dataset |
| Remove Nulls | Remove data that is blank or has the word “NULL” |
| Merge w/ Threshold Data | Combine this dataset with the threshold dataset |
| Clean Up Fields | Clean up badly formatted fields (e.g. remove quotes) |
| Export Dataset | Make dataset available for use |

[Workflow diagram of the eight steps: Append Datasets, Remove Nulls, Clean Up Fields, Merge w/ Threshold, Derive Year Column, Merge w/ Stars Data, Normalize, Export Dataset]
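Several of the workflow steps above can be sketched in plain Python. This is an illustration only, not the tool-based flow used in the project: the column names (`contract_id`, `C01`, `C02`), the sample values, and the step order are assumptions, and the threshold merge and export steps are left out for brevity.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def append_years(files: Dict[int, List[Row]]) -> List[Row]:
    """Append annual files and derive a year column from each file's year."""
    out: List[Row] = []
    for year, rows in sorted(files.items()):
        for row in rows:
            out.append({**row, "year": str(year)})
    return out


def clean_field(value: str) -> str:
    """Strip whitespace and stray quotes from a badly formatted field."""
    return value.strip().strip('"').strip()


def normalize(rows: List[Row],
              key_fields: Sequence[str] = ("contract_id", "year")) -> List[Row]:
    """Pivot wide rows to long form, dropping blank or NULL values."""
    long_rows: List[Row] = []
    for row in rows:
        keys = {k: clean_field(row[k]) for k in key_fields}
        for field, raw in row.items():
            if field in key_fields:
                continue
            value = clean_field(raw)
            if value == "" or value.upper() == "NULL":
                continue
            long_rows.append({**keys, "measure": field, "value": value})
    return long_rows


def merge_stars(rows: List[Row], stars: Dict[Tuple[str, str], str]) -> List[Row]:
    """Left-join the star rating onto each row by (contract_id, year)."""
    return [
        {**r, "stars": stars.get((r["contract_id"], r["year"]), "")}
        for r in rows
    ]


# Made-up sample data standing in for two annual files.
files = {
    2014: [{"contract_id": "H0001", "C01": '"72"', "C02": "NULL"}],
    2015: [{"contract_id": "H0001", "C01": "75", "C02": " 64 "}],
}
stars = {("H0001", "2014"): "3.5", ("H0001", "2015"): "4"}

result = merge_stars(normalize(append_years(files)), stars)
for row in result:
    print(row)
```

Pivoting to one row per (plan, year, measure) is what lets measures that appear or disappear between years coexist in a single table.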
![Page 18: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/18.jpg)
Messy Dataset
![Page 19: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/19.jpg)
Cleaned Dataset
![Page 20: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/20.jpg)
Next Step – Tableau Visualizations
• Payer
- How is my plan performing?
• Consumer
- Of the plans offered in my area, which plan has the best overall performance?
• Industry Analyst
- Can we predict how CMS will change their star rating thresholds for next year?
• Prospective Client
- How does this analysis showcase your skills?
Have fun, Jack!!
![Page 21: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/21.jpg)
Evaluation of Data Tools
![Page 22: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/22.jpg)
Initial Research

| Online Data Tools | Website | Availability |
| --- | --- | --- |
| Trifacta | http://www.trifacta.com/trial/ | 14 day free trial |
| Paxata | http://www.paxata.com/schedule-a-demo | Schedule a demo |
| Alteryx | http://pages.alteryx.com/FreeTrial-v1.html?sc=Web%20Direct&scd=direct&lsm=Web%20Direct&lsd=direct&src=sc105 | 14 day free trial |
| Tamr | http://www.tamr.com/schedule-demo/ | Schedule a demo |
| ClearStory | http://www.clearstorydata.com/demo/ | Request a trial or demo |
| OpenRefine | https://github.com/OpenRefine/OpenRefine | Free |
| Data Wrangler | http://vis.stanford.edu/wrangler/ | Free |
| Lavastorm | http://www.lavastorm.com/ | 30 day trial |
| Datameer | http://www.datameer.com/ | 14 day free trial |
| Data Preparator | http://www.datapreparator.com/ | Free |
![Page 23: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/23.jpg)
Six Data Preparation Tools
![Page 24: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/24.jpg)
Main Focus on Three
Please note that this is not a formal evaluation of the tools.
![Page 25: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/25.jpg)
• Founded: 2010
• Headquarters: Irvine, CA
• Description: Alteryx is the leader in data blending and advanced analytics.
• Categories: Predictive Analytics, Data Integration, Analytics
• Funding Received: $78M
• Price:
- Personal/Desktop: $3,995 per user
- Personal/Desktop with Spatial: $12,995 per user
- Personal/Desktop with Data (TomTom, Experian, Dun & Bradstreet, US Census): $29,995
- Server: $58,500 per server
![Page 26: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/26.jpg)
![Page 27: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/27.jpg)
![Page 28: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/28.jpg)
![Page 29: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/29.jpg)
![Page 30: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/30.jpg)
1. Ease of Use – very easy to use
2. Interface Sophistication – not very advanced, but simplicity here isn’t a bad thing
3. Customer Support – excellent
4. Self-Service Capability – easy to use on own
5. Learning Curve – very easy to learn
6. Tool Syntax – none
7. Accessibility – easy to download trial and start using
8. Time Spent Using Tool – a few days (about 3)
9. Best Features – simple interface that is easy to navigate and easy to self-learn
10. Shortcomings – encountered none
![Page 31: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/31.jpg)
• Founded: 1999
• Headquarters: Boston, MA
• Description: Business Data Analytics
• Categories: Business Intelligence, Big Data Analytics, Enterprise
• Funding Received: N/A
• Price (unverified):
- Personal/Desktop: $3,500 per user
- Server: $150,000+ per server
![Page 32: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/32.jpg)
![Page 33: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/33.jpg)
![Page 34: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/34.jpg)
![Page 35: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/35.jpg)
![Page 36: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/36.jpg)
![Page 37: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/37.jpg)
1. Ease of Use – definitely not hard to use, but hard to figure out at first
2. Interface Sophistication – complex and advanced
3. Customer Support – superior
4. Self-Service Capability – needed help from Lavastorm experts
5. Learning Curve – definitely a strong learning curve
6. Tool Syntax – slight programming background is helpful
7. Accessibility – easy to download trial and start using
8. Time Spent Using Tool – about a week total
9. Best Features – ability to see the data flow clearly, see how the data moves through the process flow, and see the data at any point in the process
10. Shortcomings – hard to figure out on own
![Page 38: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/38.jpg)
• Founded: 2012
• Headquarters: San Francisco, CA
• Description: Trifacta is a software company developing productivity platforms for data analysis, management, and manipulation.
• Categories: Data Preparation, Data Cleanup, Data Wrangling, Big Data in Hadoop
• Funding Received: $41.3M
• Price: Unavailable online
![Page 39: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/39.jpg)
![Page 40: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/40.jpg)
![Page 41: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/41.jpg)
![Page 42: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/42.jpg)
![Page 43: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/43.jpg)
![Page 44: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/44.jpg)
1. Ease of Use – definitely not as easy as Alteryx, but not as complex as Lavastorm
2. Interface Sophistication – excellent data profiling and predictive capabilities
3. Customer Support – averaged about a day response time
4. Self-Service Capability – needed help frequently from Trifacta experts and our IT department
5. Learning Curve – takes a little getting used to
6. Tool Syntax – laid out for you in the tool
7. Accessibility – easy to access
8. Time Spent Using Tool – weeks
9. Best Features – its interface: data profiling histograms, predictive capabilities, suggestion cards
10. Shortcomings – did a lot of debugging and troubleshooting
![Page 45: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/45.jpg)
Evaluation of the Tools
1. Ease of Use
2. Interface Sophistication
3. Customer Support
4. Self-Service Capability
5. Learning Curve
6. Tool Syntax
7. Accessibility
8. Time Spent Using Tool
9. Best Features
10. Shortcomings
Please note that this is not a formal evaluation with equivalent test cases.
![Page 46: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/46.jpg)
Tekathon II
![Page 47: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/47.jpg)
Purpose of Tekathon II
• Goal: provide real-world use cases using Trifacta
Idea (30%)
• Use Case (does it solve an industry problem?)
• Innovative (new way of doing things, differentiation against competition)
• Timing (is there market interest/readiness for the solution?)
Execution (30%)
• Leadership and Teamwork (how well did the team work together?)
• Efficient (how resourceful was the team?)
• Effective (did the team deliver?)
Business Model (20%)
• Target buyers/users/decision makers and stakeholders have been identified
• Source for funding identified (top line and bottom line benefit)
Presentation (20%)
• How well did the team present their solution?
• How well did the team educate the audience on tool capability and their approach?
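The four category weights on this slide (Idea 30%, Execution 30%, Business Model 20%, Presentation 20%) imply a weighted total. The sketch below works one example; the 0-10 scoring scale and the sample scores are assumptions, since the slides do not say how each category was scored.

```python
from typing import Dict

# Category weights as given on the slide.
WEIGHTS = {
    "idea": 0.30,
    "execution": 0.30,
    "business_model": 0.20,
    "presentation": 0.20,
}


def weighted_score(scores: Dict[str, float]) -> float:
    """Combine per-category scores into one total using the slide's weights."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores on an assumed 0-10 scale.
team = {"idea": 8, "execution": 7, "business_model": 6, "presentation": 9}
print(round(weighted_score(team), 2))
```

Because Idea and Execution together carry 60% of the weight, a team that is strong there can outscore one with a more polished presentation.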
![Page 48: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/48.jpg)
The HLS Team
• Coordination
- Jeff Evernham
- Gaurav Suri
• Demonstration
- Rachael Fahey
• SMEs
- Ram Mohan
- Saar Golde
• Data Wrangling
- Rachel Sholder
- Dip Kharod
• Visualization
- Jose Garcia
- Amila Bewtra
• IT Support
- Drillon Berisha
![Page 49: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/49.jpg)
Our Focus
• Pharmacovigilance
- The detection, assessment, understanding, and prevention of adverse effects of drugs
• Adverse Event
- Any unfavorable and unintended sign, symptom, or disease associated with the use of a medicinal product
• FDA reports adverse events – FAERS data
• Over 7 million reports since 1997 – world’s largest database of adverse events
![Page 50: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/50.jpg)
FAERS Data
• Seven different files released quarterly
1. Demographics
2. Drugs
3. Reaction
4. Therapy
5. Indication
6. Outcome
7. Reporting Source
• Related by ISR (individual safety report) in 2004-2012 Q3 and primary ID in 2012 Q4 to present
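Because the linking key changes partway through the data, any join across the seven files has to pick the key by release period. The sketch below shows one way to do that for the demographics and drug files. The column names `isr`, `primaryid`, and `drugname` are assumptions about the file layout, and the rows are made up.

```python
from typing import Dict, List

Row = Dict[str, str]


def key_field(year: int, quarter: int) -> str:
    """ISR links the files through 2012 Q3; primary ID from 2012 Q4 onward."""
    return "primaryid" if (year, quarter) >= (2012, 4) else "isr"


def join_drugs(demo: List[Row], drugs: List[Row], year: int, quarter: int) -> List[Row]:
    """Attach the drug names for each report in one quarterly release."""
    field = key_field(year, quarter)
    names_by_report: Dict[str, List[str]] = {}
    for row in drugs:
        names_by_report.setdefault(row[field], []).append(row["drugname"])
    return [
        {
            "report": row[field],
            "quarter": f"{year}Q{quarter}",
            "drugs": "; ".join(names_by_report.get(row[field], [])),
        }
        for row in demo
    ]


# Made-up rows for one quarter on each side of the key change.
old = join_drugs([{"isr": "100"}],
                 [{"isr": "100", "drugname": "ASPIRIN"}], 2012, 3)
new = join_drugs([{"primaryid": "200"}],
                 [{"primaryid": "200", "drugname": "ASPIRIN"},
                  {"primaryid": "200", "drugname": "WARFARIN"}], 2012, 4)
combined = old + new  # quarters are appended after the key is unified
print(combined)
```

Writing the key into a single `report` column before appending quarters avoids carrying two half-empty key columns through the rest of the preparation.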
![Page 51: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/51.jpg)
Data Problems
• Separate quarters must be appended
• Data is normalized across 7 tables
• Large number of missing fields
• Drug name variations
• New information replaces earlier reports
• Duplicate entries
• Periodic format changes
• Data entry errors and typos
• Units are inconsistent
• Same active ingredient in different drugs
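Two of the problems above, drug name variations and newer information replacing earlier reports, lend themselves to a short sketch. The field names `caseid` and `caseversion` are assumptions about the data layout, the normalization rule is deliberately crude, and none of this reflects how the team actually did the work in Trifacta.

```python
import re
from typing import Dict, List

Row = Dict[str, str]


def normalize_drug_name(name: str) -> str:
    """Upper-case, drop punctuation, and collapse whitespace in a drug name."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", name)
    return " ".join(cleaned.upper().split())


def latest_reports(rows: List[Row]) -> List[Row]:
    """Keep only the highest version of each case (assumed fields)."""
    latest: Dict[str, Row] = {}
    for row in rows:
        case = row["caseid"]
        current = latest.get(case)
        if current is None or int(row["caseversion"]) > int(current["caseversion"]):
            latest[case] = row
    return list(latest.values())


# Made-up reports: case 1 has a follow-up that supersedes the first version.
reports = [
    {"caseid": "1", "caseversion": "1", "drugname": " Lipitor. (tablets)"},
    {"caseid": "1", "caseversion": "2", "drugname": "LIPITOR tablets"},
    {"caseid": "2", "caseversion": "1", "drugname": "aspirin"},
]
for row in latest_reports(reports):
    print(row["caseid"], normalize_drug_name(row["drugname"]))
```

Text cleanup like this only handles spelling and punctuation variants; matching different brand names that share an active ingredient would need a reference dataset.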
![Page 52: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/52.jpg)
My Contributions
• Preliminary work in Trifacta
• Created clean datasets used for team’s visualizations
• Lots of work in Trifacta:
- Appended datasets
- Deduplicated
- Joined datasets
- Cleaned up fields
- Enriched data with another dataset
- Exported wrangled dataset
• Lots of troubleshooting:
- Standardize function didn’t work
- Unable to see job results
![Page 53: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/53.jpg)
Lessons Learned
• Trifacta works on a sample, not on full datasets
• Sophisticated interface
• Software does not always work
• A lot can be accomplished in one day
![Page 54: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/54.jpg)
Conclusions
![Page 55: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/55.jpg)
Lessons Learned from Internship
• What data preparation actually is
- Data preparation and cleanup takes 80% of the time
- Data is messy
• Real-life applications (e.g., exploratory statistics)
• Understand differences and connections between data science and data analytics
• Glimpse into a data-related career
![Page 56: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/56.jpg)
Improvements to the Internship Position
• Free time options (e.g., learn R)
• More of a shadowing opportunity
• Someone in person
• Notify company about interns
![Page 57: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/57.jpg)
Favorite Parts of the Internship Program
• Not treated like an intern
• Inclusion in events (e.g., Tekathon)
• Mathematics and statistics applications
• Good preparation for my Intro to Data Science class
• The people
• And, last but not least… Pizza Fridays
![Page 58: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/58.jpg)
Blog Posts Coming Soon
• Three Takeaways from my Internship (tomorrow)
• Evaluation of Data Tools
• How to Become a Data Scientist
• Tekathon II Overview
• Tekathon II – HLS Team
If you have not already, make sure you check out:
• Meet the Interns: Rachel Sholder
• Machine Learning Lessons
![Page 59: Rachel Sholder - Final Presentation](https://reader036.vdocument.in/reader036/viewer/2022062515/55d19b2fbb61eba2418b4651/html5/thumbnails/59.jpg)
Thank you!