Data Analysis on a Divan: Let’s talk about our problems... Dr Grazziela Figueredo

Upload: gpfigueredo

Post on 08-Feb-2017

Page 1: Data Analysis in a Divan

Data Analysis on a Divan:

Let’s talk about our problems...

Dr Grazziela Figueredo

Page 2: Data Analysis in a Divan

Data Analyst

• Who is that?

• What does it do?

• Why are there so many people talking about it?

• http://www.nottingham.ac.uk/adac/meet-the-team/meet-the-team.aspx

Page 3: Data Analysis in a Divan

The Hype Cycle - 2014

Page 4: Data Analysis in a Divan

2015

Page 5: Data Analysis in a Divan

2016

Page 6: Data Analysis in a Divan

Data Analyst – What is expected

• Technical expertise
  • Stats
  • Machine learning
  • HPC, MPI, …
  • Hadoop (Scala, Spark, Storm, the zoo…)
  • R, MATLAB, SQL, Python, Tableau, Google graphs, …
  • Sentiment analysis
  • Bioinformatics
  • Maths…

• Interpersonal skills
  • Communication (spoken, written)
  • Salesperson
  • Creative thinking
  • Management skills
  • Teamwork
  • Move between different disciplines
  • Eager to learn
  • Etc.

Page 7: Data Analysis in a Divan

InfoGrazzphics (outdated already)

Page 8: Data Analysis in a Divan

Common Data Analysis Phases

Problem definition

Agreement

Planning

Pre-processing

Analysis

Verification/Validation

Results Report

Page 9: Data Analysis in a Divan

Talking to the Client

• The client speaks as if you were an expert in their field…
• Multidisciplinary contexts
• New jargon
• If you don’t understand: ask questions
• Ask for literature
• Interaction is the key!

Problem definition Agreement

Page 10: Data Analysis in a Divan

Work Plan

• Difficult to determine the time required for the analysis
• Prepare the data for analysis
• Define deliverables

• Depends on the data
• Type of analysis
• Amount of money to pay for the analysis
• Availability of the team
• Technical expertise available
• Assessment of the data
• Infrastructure available

Agreement Planning

Page 11: Data Analysis in a Divan

Data Formats

• Different sources, different formats
• Same data, different file formats
• Fusion, consistency
• Selection of relevant data

Pre-processing
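As a rough illustration of the fusion and consistency points above, the sketch below loads the same kind of record from a CSV source and a JSON source, converts both to common types, and flags records that disagree. The field names (`patient_id`, `age`), the sample data and the helper functions are hypothetical, chosen only for the example.

```python
import csv
import io
import json

# Hypothetical inputs: the same kind of record arriving as CSV and as JSON.
CSV_TEXT = "patient_id,age\n1,34\n2,51\n"
JSON_TEXT = '[{"patient_id": 2, "age": 51}, {"patient_id": 3, "age": 47}]'


def load_csv(text):
    """Read CSV rows and convert fields to the common types."""
    return [
        {"patient_id": int(row["patient_id"]), "age": int(row["age"])}
        for row in csv.DictReader(io.StringIO(text))
    ]


def load_json(text):
    """Read JSON records and convert fields to the common types."""
    return [
        {"patient_id": int(rec["patient_id"]), "age": int(rec["age"])}
        for rec in json.loads(text)
    ]


def fuse(*sources):
    """Merge records by patient_id and report ids whose values disagree."""
    merged, conflicts = {}, []
    for records in sources:
        for rec in records:
            key = rec["patient_id"]
            if key in merged and merged[key] != rec:
                conflicts.append(key)
            # Keep the first version seen; later conflicts are only reported.
            merged.setdefault(key, rec)
    return merged, conflicts


merged, conflicts = fuse(load_csv(CSV_TEXT), load_json(JSON_TEXT))
print(sorted(merged), conflicts)
```

Keeping the first record and only reporting conflicts is one possible policy; which source wins is usually a question for the client.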

Page 12: Data Analysis in a Divan

And suddenly your import script is not working anymore… why is that?
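One common cause is that the layout of the incoming file changed upstream. A minimal sketch of a header check that could run before the import, assuming a CSV input and a hypothetical list of expected column names:

```python
import csv
import io

# Hypothetical schema: the columns the import script was written against.
EXPECTED_COLUMNS = ["sample_id", "measurement", "date"]


def check_header(text, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names for a CSV given as text."""
    reader = csv.reader(io.StringIO(text))
    header = [name.strip() for name in next(reader, [])]
    missing = [c for c in expected if c not in header]
    unexpected = [c for c in header if c not in expected]
    return missing, unexpected


# A new export where one column was renamed upstream (made-up data).
new_export = "sample_id,value,date\n7,0.4,2017-01-01\n"
missing, unexpected = check_header(new_export)
if missing or unexpected:
    print(f"Header changed: missing={missing}, unexpected={unexpected}")
```

Failing early with a message like this tends to be easier to debug than a type error several steps into the analysis.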

Page 13: Data Analysis in a Divan

Large data, limited memory/few resources?
What infrastructure do you need?
Who are you going to share it with?
What are the team priorities for resource allocation?

Pre-processing Analysis

Page 14: Data Analysis in a Divan

Incomprehensible errors… back to programming life…

Analysis

Page 15: Data Analysis in a Divan

Torture the data until it confesses?

• Large data does not always mean useful data
• The more the merrier?
• Difficulties of dealing with small data
• Generalisation
• Models without robustness
• Missing values

• Data with no detectable patterns
• Was the data collected correctly?
• Was the correct data collected?

Analysis Verification/Validation Results Report
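For the missing-values point, a quick audit before any modelling can show how much of each field is actually usable. The sketch below assumes records held as dictionaries, with `None` and the empty string both treated as missing; the records and field names are made up for illustration.

```python
# Hypothetical records; None and "" both stand for a missing value here.
records = [
    {"age": 34, "weight": 70.5, "group": "A"},
    {"age": None, "weight": 82.0, "group": "B"},
    {"age": 51, "weight": None, "group": ""},
    {"age": 47, "weight": None, "group": "A"},
]


def missing_fraction(rows):
    """Return, per field, the fraction of rows where the value is missing.

    Assumes rows is non-empty.
    """
    fields = sorted({name for row in rows for name in row})
    return {
        name: sum(row.get(name) in (None, "") for row in rows) / len(rows)
        for name in fields
    }


for name, fraction in missing_fraction(records).items():
    print(f"{name}: {fraction:.0%} missing")
```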

Page 16: Data Analysis in a Divan

Clients

• As in any area of CS:
  • Unrealistic deadlines
    • Even when the client doesn’t know what to get from the analysis
  • Unrealistic expectations (e.g. a major analysis breakthrough with 12 data points)
  • Disappointment when the analysis does not produce what was expected (e.g. a major breakthrough)
• They get discouraged and stop believing in data analysis
• You need them to validate your results (mostly)
• Complicated solutions with high performance vs simple solutions with lower performance (different clients, different preferences)
• Interactive/iterative process is always very useful
• Data scientists need love and validation ;)

Planning Verification/Validation Results Report
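One way to show a client why 12 data points rarely support a breakthrough is to put an interval around the headline number. The sketch below uses the Wilson score interval for a proportion, which is my choice of illustration and not something the slides prescribe. The counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical outcome: 9 of 12 cases classified correctly (75%).
low, high = wilson_interval(9, 12)
print(f"12 points:   {low:.2f} to {high:.2f}")

# The same 75% hit rate measured on 1200 cases.
low, high = wilson_interval(900, 1200)
print(f"1200 points: {low:.2f} to {high:.2f}")
```

With 12 points the interval spans most of the range between chance and near-perfect performance, which is a useful starting point for the expectations conversation.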

Page 17: Data Analysis in a Divan

Disappointing Results? Says who? According to whom?
Is one scientist’s trash another scientist’s gold?

                         Before VIP selection             After VIP selection
Classifier  Measure      Cross validation  Accuracy       Cross validation  Accuracy
NB          Sensitivity  0.63±0.18         62.12±12.11    0.60±0.18         57.59±12.86
            Specificity  0.60±0.21                        0.54±0.21
SVMs        Sensitivity  0.93±0.08         67.55±8.75     0.80±0.13         64.00±10.98
            Specificity  0.24±0.18                        0.35±0.19
RF          Sensitivity  0.86±0.13         65.12±10.16    0.86±0.12         61.86±9.73
            Specificity  0.28±0.19                        0.19±0.18
RBF         Sensitivity  0.83±0.13         65.29±11.10    0.68±0.18         60±11.57
            Specificity  0.35±0.20                        0.45±0.22
MLP         Sensitivity  0.75±0.17         66.33±12.28    0.76±0.14         66.83±11.57
            Specificity  0.51±0.23                        0.51±0.21

Results Report
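When reading a results table of this kind, it helps to recall how sensitivity, specificity and accuracy relate, since a reasonable accuracy can hide a very low specificity. The sketch below computes all three from confusion-matrix counts. The counts are hypothetical and are not taken from the table.

```python
def summarise(tp, fn, tn, fp):
    """Return sensitivity, specificity and accuracy from confusion counts."""
    sensitivity = tp / (tp + fn)  # share of positives correctly found
    specificity = tn / (tn + fp)  # share of negatives correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts: 60 positive and 40 negative cases,
# scored by a model that labels almost everything positive.
sens, spec, acc = summarise(tp=56, fn=4, tn=10, fp=30)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.0%}")
```

Whether such a trade-off is trash or gold depends on what a false positive and a false negative cost in the client's field.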