Data Analysis on a Divan:
Let’s talk about our problems...
Dr Grazziela Figueredo
2
Data Analyst
• Who is that?
• What does it do?
• Why are there so many people talking about it?
• http://www.nottingham.ac.uk/adac/meet-the-team/meet-the-team.aspx
3
The Hype Cycle - 2014
4
2015
5
2016
6
Data Analyst – What is expected
• Technical expertise
  • Stats
  • Machine learning
  • HPC, MPI, …
  • Hadoop (Scala, Spark, Storm, the zoo…)
  • R, Matlab, SQL, Python, Tableau, Google graphs, …
  • Sentiment Analysis
  • Bioinformatics
  • Maths…
• Interpersonal skills
  • Communication (spoken, written)
  • Salesperson
  • Creative thinking
  • Management skills
  • Teamwork
  • Fluctuate between different disciplines
  • Eager to learn
  • Etc etc…
InfoGrazzphics (outdated already)
8
Common Data Analysis Phases
Problem definition
Agreement
Planning
Pre-processing
Analysis
Verification/Validation
Results Report
9
Talking to the Client
• The client speaks as if you were an expert in their field…
• Multidisciplinary contexts
• New jargon
• If you don’t understand: ask questions
• Ask for literature
• Interaction is the key!
Problem definition Agreement
10
Work Plan
• Difficult to determine the time required for the analysis
• Prepare the data for analysis
• Define deliverables
• Depends on the data
• Type of analysis
• Amount of money to pay for the analysis
• Availability of the team
• Technical expertise available
• Assessment of the data
• Infrastructure available
Agreement Planning
11
Data Formats
• Different sources, different formats
• Same data, different formats of files
• Fusion, consistency
• Selection of relevant data
Pre-processing
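To make the "same data, different formats" and "fusion, consistency" points concrete, here is a minimal sketch that reads the same records from a CSV export and a JSON export, maps both to one common shape, and lists the records that disagree. The field names (`patient_id`, `patientId`, `age`, `Age`) and the sample values are hypothetical and chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records exported as CSV and as JSON.
# Note the different field names and the string-typed age in the JSON export.
CSV_TEXT = "patient_id,age\n1,34\n2,51\n"
JSON_TEXT = '[{"patientId": 1, "Age": "34"}, {"patientId": 2, "Age": "52"}]'


def read_csv_records(text):
    """Parse CSV text into {patient_id: age} with integer values."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["patient_id"]): int(row["age"]) for row in reader}


def read_json_records(text):
    """Parse JSON text into the same {patient_id: age} shape."""
    return {int(item["patientId"]): int(item["Age"]) for item in json.loads(text)}


def find_conflicts(a, b):
    """Return the ids present in both sources whose values disagree."""
    return sorted(key for key in a.keys() & b.keys() if a[key] != b[key])


csv_records = read_csv_records(CSV_TEXT)
json_records = read_json_records(JSON_TEXT)
conflicts = find_conflicts(csv_records, json_records)
print(conflicts)
```

Normalising every source to one shape first keeps the consistency check independent of how many file formats turn up later.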
12
And suddenly your import script is not working anymore… why is that?
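The slide does not say why the script stopped working. One common cause, assumed here purely for illustration, is that a column was renamed in a newer export. A minimal sketch of an import step that checks the header up front and fails with a readable message (the column names `sample_id` and `value` are hypothetical):

```python
import csv
import io

# Hypothetical schema: the columns the import script was written against.
EXPECTED_COLUMNS = ["sample_id", "value"]


def load_rows(text, expected=EXPECTED_COLUMNS):
    """Read CSV text, raising a descriptive error if expected columns are gone."""
    reader = csv.DictReader(io.StringIO(text))
    found = reader.fieldnames or []
    missing = [column for column in expected if column not in found]
    if missing:
        raise ValueError(f"missing columns {missing}; file has {found}")
    return list(reader)


old_export = "sample_id,value\nA,1.5\n"
new_export = "SampleID,value\nA,1.5\n"  # header renamed upstream (assumed scenario)

rows = load_rows(old_export)
try:
    load_rows(new_export)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Failing at the header turns a puzzling downstream `KeyError` into a message that names the missing column.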
13
Large data, short memory/few resources? What infrastructure do you need? Who are you going to share it with? What are the team’s priorities for resource allocation?
Pre-processing Analysis
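When the data does not fit in memory, one option is to process it as a stream. The sketch below uses Welford's online algorithm to compute the mean and sample variance one value at a time, so memory use stays constant. The class name `RunningStats`, the helper `stream_values` and the in-memory list of lines are illustrative assumptions; in practice the lines would come from iterating over an open file.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); NaN for fewer than two values."""
        return self._m2 / (self.n - 1) if self.n > 1 else math.nan


def stream_values(lines):
    """Yield one float per non-empty line without loading everything at once."""
    for line in lines:
        line = line.strip()
        if line:
            yield float(line)


# Hypothetical input standing in for a large file read line by line.
LINES = ["2.0\n", "4.0\n", "4.0\n", "4.0\n", "5.0\n", "5.0\n", "7.0\n", "9.0\n"]

stats = RunningStats()
for value in stream_values(LINES):
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```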
14
Incomprehensible errors… back to programming life…
Analysis
15
Torture the data until it confesses?
• Large data does not always mean useful data
• The more the merrier?
• Difficulties of dealing with small data
• Generalisation
• Models without robustness
• Missing values
• Data with no detectable patterns
• Was the data collected correctly?
• Was the correct data collected?
Analysis Verification/Validation Results Report
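For the "missing values" point, a quick audit before modelling shows how much of each column is actually there. This is a minimal sketch: the set of tokens treated as missing and the sample columns (`id`, `weight`, `height`) are assumptions for illustration, and a real data set may encode missing values differently.

```python
import csv
import io

# Assumption: these tokens mark a missing value in the hypothetical data set.
MISSING_TOKENS = {"", "NA", "NaN", "null"}


def missing_fraction(text):
    """Return {column: fraction of rows whose value is missing}."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            value = row[name]
            # DictReader fills short rows with None, so treat that as missing too.
            if value is None or value.strip() in MISSING_TOKENS:
                counts[name] += 1
    return {name: (count / total if total else 0.0) for name, count in counts.items()}


SAMPLE = "id,weight,height\n1,70,NA\n2,,180\n3,65,NA\n4,80,175\n"
report = missing_fraction(SAMPLE)
print(report)
```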
16
Clients
• As in any area of CS:
  • Unrealistic deadlines
    • Even when the client doesn’t know what they want from the analysis
  • Unrealistic expectations (e.g. a major analysis breakthrough with 12 data points)
  • Disappointment when the analysis does not produce what was expected (i.e. a major breakthrough)
• Get discouraged and stop believing in data analysis
• You need them to validate your results (mostly)
• Complicated solutions with high performance vs simple solutions with lower performance (different clients, different preferences)
• Interactive/iterative process is always very useful
• Data scientists need love and validation ;)
Planning Verification/Validation Results Report
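One way to show a client why 12 data points rarely support a breakthrough is to put an interval around the result. The sketch below computes a Wilson score interval for a proportion. The counts (9 correct out of 12, and 900 out of 1200) are hypothetical, and `z=1.96` assumes an approximately 95% interval.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical results: the same 75% success rate on 12 and on 1200 cases.
small = wilson_interval(9, 12)
large = wilson_interval(900, 1200)
print(small, large)
```

With 12 points the interval runs from roughly 0.47 to 0.91, so the same observed 75% is compatible with anything from near chance to excellent. With 1200 points it is far narrower.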
17
Disappointing Results? Says who? According to whom? Is one scientist’s trash another scientist’s gold?
                         Before VIP selection            After VIP selection
Classifier  Measure      Cross validation  Accuracy      Cross validation  Accuracy
NB          Sensitivity  0.63±0.18         62.12±12.11   0.60±0.18         57.59±12.86
            Specificity  0.60±0.21                       0.54±0.21
SVMs        Sensitivity  0.93±0.08         67.55±8.75    0.80±0.13         64.00±10.98
            Specificity  0.24±0.18                       0.35±0.19
RF          Sensitivity  0.86±0.13         65.12±10.16   0.86±0.12         61.86±9.73
            Specificity  0.28±0.19                       0.19±0.18
RBF         Sensitivity  0.83±0.13         65.29±11.10   0.68±0.18         60±11.57
            Specificity  0.35±0.20                       0.45±0.22
MLP         Sensitivity  0.75±0.17         66.33±12.28   0.76±0.14         66.83±11.57
            Specificity  0.51±0.23                       0.51±0.21
Results Report
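As a reminder of what the sensitivity, specificity and accuracy columns measure, here is a minimal sketch that derives all three from confusion-matrix counts. The counts are hypothetical and are not taken from the study behind the slide. They are chosen only to show how accuracy can look moderate while specificity is poor.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of actual positives that were detected
    specificity = tn / (tn + fp)  # share of actual negatives that were rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
    }


# Hypothetical counts: 60 positive cases and 40 negative cases.
metrics = classification_metrics(tp=56, fn=4, tn=10, fp=30)
print(metrics)
```

Whether such a result is disappointing depends on what the client values. A screening use case may accept low specificity in exchange for high sensitivity, while another client may not.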