Stats 170A: Project in Data Science

Data Visualization and Exploratory Data Analysis

Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine


Post on 20-May-2020


Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018

Overview

• Lectures/Homeworks up to this point
  – Data management (relational DBs, query languages, PostgreSQL)
  – Data manipulation in Python (Pandas)
  – Data formats (JSON, XML)
  – Practical experience with Twitter data, IMDB data

• Next 2 weeks
  – Review of data visualization and exploration
  – Basic principles of machine learning (and some statistics)
  – Machine learning with text data

How this Course will Work

• Q1: Weeks 1 to 6: Lectures and Assignments
  – Review general principles of data science
  – Weeks 1 to 3: databases, data extraction, data cleaning
  – Weeks 4 to 6: text analysis, data exploration, machine learning
  – Combination of lectures, assignments, and background reading

• Q1: Weeks 7 to 10: Project Proposals
  – Project proposals from student teams
  – Feedback from instructors, refine proposal, oral presentation at end of quarter

• Q2: Work on Projects
  – Build and use a prototype system/pipeline
  – Develop ideas, implement algorithms, make use of libraries and packages
  – Conduct experiments with real data sets
  – Test and evaluate your system in a systematic manner
  – Communicate your results (presentations and reports)

Assignment 5

Refer to the Wiki page

Due noon on Monday February 12th to EEE dropbox

Note change: due before class (by 2pm)


Types of Data

Types of Data for a Single Variable

• Real-valued, continuous
  – e.g., a person’s weight or income
  – values may be discretized and bounded, but we will think of them as on the real line

• Integer
  – e.g., year of birth, number of years in college
  – Could be a real-valued variable that is quantized (age in years)

• Ordinal
  – e.g., education level = {kindergarten, high school, college, grad school, …}

• Categorical
  – e.g., {red, blue, yellow} or {CA, MA, NY, AZ, …} or text strings

(Note that many visualization and machine learning techniques implicitly assume real-valued data, and other data types are converted to reals or represented via grouping, colors, or icons)
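
A minimal sketch of how ordinal and categorical values might be converted to numbers. The helper names `encode_ordinal` and `one_hot`, the level ordering, and the example values are illustrative assumptions, not part of the slides.

```python
# Hypothetical helpers for illustration; names and example values are assumptions.

EDUCATION_LEVELS = ["kindergarten", "high school", "college", "grad school"]

def encode_ordinal(value, levels):
    """Map an ordinal value to its rank (0, 1, 2, ...), preserving the order."""
    return levels.index(value)

def one_hot(value, categories):
    """Map a categorical value to a binary indicator vector (no order implied)."""
    return [1 if value == c else 0 for c in categories]

colors = ["red", "blue", "yellow"]
ranks = [encode_ordinal(v, EDUCATION_LEVELS) for v in ["college", "kindergarten"]]
blue_vector = one_hot("blue", colors)
print(ranks)
print(blue_vector)
```

Ranks keep the ordering of an ordinal variable, while the indicator vector avoids implying any order among categories.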

Multiple Variables

• More than 1 variable, often referred to as multivariate or multidimensional

• Often interested in relationships between variables and geometric structure of the data (for real-valued data), e.g., is it clustered?

• For small numbers of variables, we can plot the data and look at relationships

• For large numbers we use exploratory techniques
  – E.g., clustering and dimension reduction

• Note that many visualization and machine learning techniques implicitly assume real-valued data … categorical data types are often converted to reals (e.g., binary) or represented via grouping, colors, or icons

Data with Context

• Time-series data
  – A variable whose values are indexed by time
  – We can also have multidimensional time-series

• Sequence data
  – A variable indexed by position
  – E.g., words (categorical) in text, or DNA sequences

• Spatial data
  – Data whose values are indexed spatially, e.g., by lat/lon or by city
  – Can also have multidimensional time-series

• Spatio-temporal
  – Indexed by both space and time, e.g., storm tracks, vehicle trajectories, etc.

Stock Market Indices last week

Night Lights from North and South Korea

From https://www.vox.com

Where People Run

From: https://flowingdata.com/2014/02/05/where-people-run/#jp-carousel-33695

Relational Data

• N entities, i = 1, …, N

• N x N relations:
  – can be represented as an array y(i,j) = 1 if i is connected to j, 0 otherwise
  – Example: a social network

• Can combine with other data, e.g.,
  – Each relation could have metadata, e.g., text
  – Each relation could be time-dependent, y(i,j,t) is a time series over time t
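
A small sketch of the array representation y(i,j), using a made-up 4-entity network. The edge list is an assumption for illustration, and entities are numbered from 0 here rather than from 1 as on the slide.

```python
# Hypothetical 4-entity network; the edge list is made up for illustration.
N = 4
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]  # (i, j) means i is connected to j

# y[i][j] = 1 if i is connected to j, 0 otherwise
y = [[0] * N for _ in range(N)]
for i, j in edges:
    y[i][j] = 1

out_degree = [sum(row) for row in y]                            # connections leaving i
in_degree = [sum(y[i][j] for i in range(N)) for j in range(N)]  # connections arriving at j

for row in y:
    print(row)
print("out-degree:", out_degree, "in-degree:", in_degree)
```

For large, sparse networks an edge list or adjacency list is usually more economical than a full N x N array.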

Visualization of an email network using 2-dimensional graph drawing or “embedding”

Data from 500 researchers at Hewlett-Packard over approximately 1 year.

Various structural elements of the network are apparent

Philosophy behind this Class

• Provide an experience of how data science works in the real world
  – Defining a problem
  – Identifying, understanding, exploring relevant data
  – Extracting, cleaning, management of data
  – Exploration and analysis of data
  – Building models from data (e.g., via machine learning)
  – Evaluating models: how well do they predict
  – Communicating your results to others

• Tie together ideas from different courses you have taken and give you experience in applying these ideas to real-world data
  – Databases, software, algorithms, machine learning, statistics

Data Science: from Data to Actions

[Diagram; the grouping below is inferred from the extracted labels]
Stages: Raw Data, Data Management, Data Wrangling, Exploratory Data Analysis, Predictive Modeling
Audiences: Consumers, External Business Customers, Internal Business Customers, Scientists, Government
Skills: Databases, Algorithms, Software Engineering; Machine Learning, Statistics; Domain knowledge, Business knowledge

Why Visualization and Exploration?

• People are good at pattern recognition
  – At spotting clusters, trends, outliers, structure … that computers may miss

• Usually two types of users
  1. The data scientist who wants to explore/analyze/understand
     - For the data scientist, visualization and exploration are part of an iterative process
  2. The person who needs a quick summary to make a decision
     - For the consumer we want to communicate information quickly and clearly
     - e.g., for a medical doctor, for a policy-maker, for a consumer

• For data scientists … it's always a good idea to look at your data
  – Helps to understand the semantics of the data … what the measurements actually mean

What is Exploratory Data Analysis?

• Broader than just visualization

• EDA = {visualization, clustering, dimension reduction, …}

• For small numbers of variables, EDA = visualization

• For large numbers of variables, we need to be cleverer
  – Clustering, dimension reduction, embedding algorithms
  – These are techniques that essentially reduce high-dimensional data to something we can look at

• Pioneered by John Tukey (statistician at Bell Labs, Princeton) in the 1960s
  – “let the data speak”

Exploratory Data Analysis: Single Variables

Summary Statistics

Mean: “center of data”
Mode: location of highest data density
Variance: “spread of data”
Skew: indication of non-symmetry

Range: max - min
Median: 50% of values below, 50% above
Quantiles: e.g., values such that 25%, 50%, 75% are smaller

Note that some of these statistics can be misleading
E.g., mean for data with 2 clusters may be in a region with zero data
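
A short sketch of the "mean in a region with zero data" point, using Python's standard `statistics` module on synthetic data. The cluster locations (0 and 20), the sample sizes, and the seed are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Two well-separated clusters (synthetic data; parameters are assumptions)
data = [random.gauss(0, 1) for _ in range(500)] + \
       [random.gauss(20, 1) for _ in range(500)]

mean = statistics.mean(data)
median = statistics.median(data)
variance = statistics.pvariance(data)
value_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # 25%, 50%, 75% cut points

# How many points actually lie near the mean?
near_mean = sum(1 for x in data if abs(x - mean) < 2)
print(f"mean={mean:.2f} median={median:.2f} variance={variance:.2f} range={value_range:.2f}")
print(f"quartiles: {q1:.2f}, {q2:.2f}, {q3:.2f}")
print(f"points within 2 units of the mean: {near_mean}")
```

The mean lands between the two clusters, where there are essentially no observations, which is why a histogram is worth a look before trusting the summary.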

Histogram of Unimodal Data

[Figure: histogram]

1000 data points simulated from a Normal distribution, mean 10, variance 1, 30 bins

Histograms: Unimodal Data

[Figure: three histograms]

100 data points from a Normal, mean 10, variance 1, with 5, 10, 30 bins
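
A sketch of how the number of bins changes what a histogram shows, for 100 points drawn from a Normal with mean 10 and variance 1. The `histogram` helper is a hypothetical, hand-rolled illustration, and the seed is an assumption.

```python
import random

def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    return counts

random.seed(1)
data = [random.gauss(10, 1) for _ in range(100)]  # mean 10, variance 1

for n_bins in (5, 10, 30):
    print(n_bins, "bins:", histogram(data, n_bins))
```

With few bins the shape is smooth but coarse; with many bins and only 100 points, the counts per bin become small and noisy.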

Histogram of Multimodal Data

[Figure: histogram]

15000 data points simulated from a mixture of 3 Normal distributions, 300 bins

Histogram of Multimodal Data

[Figure: four histogram panels]

15000 data points simulated from a mixture of 3 Normal distributions, 300 bins

Skewed Data

[Figure: histogram]

5000 data points simulated from an exponential distribution, 100 bins

Another Skewed Data Set

[Figure: histogram]

10000 data points simulated from a mixture of 2 exponentials, 100 bins

Same Skewed Data after taking Logs (base 10)

[Figure: histogram]

10000 data points simulated from a mixture of 2 exponentials, 100 bins
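
A sketch of the log transform on a simulated mixture of 2 exponentials. The component means (1 and 30), the 50/50 mix, the seed, and the `skewness` helper are assumptions for illustration; the slides do not give the mixture parameters.

```python
import math
import random
import statistics

random.seed(2)
# Mixture of 2 exponentials; the means (1 and 30) and 50/50 mix are assumptions.
# random.expovariate takes the rate, i.e. 1 / mean.
data = [random.expovariate(1 / 1.0) for _ in range(5000)] + \
       [random.expovariate(1 / 30.0) for _ in range(5000)]

log_data = [math.log10(x) for x in data if x > 0]  # log is undefined at 0

def skewness(values):
    """Skewness: third central moment divided by the standard deviation cubed."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

print(f"skew before: {skewness(data):.2f}, after log10: {skewness(log_data):.2f}")
```

On the raw scale most of the mass is squeezed near zero; on the log scale the two components spread out and become easier to tell apart.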

What will the mean or median tell us about this data?

[Figure: histogram]

Histogram with Outliers

[Figure: histogram; x-axis: X values, y-axis: Number of Individuals]

Pima Indians Diabetes Data, From UC Irvine Machine Learning Repository

Histogram with Outliers

[Figure: histogram; x-axis: Diastolic Blood Pressure, y-axis: Number of Individuals; annotation: blood pressure = 0?]

Pima Indians Diabetes Data, From UC Irvine Machine Learning Repository
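
One plausible reading of the "blood pressure = 0?" annotation is that 0 is a placeholder for a missing measurement. The sketch below shows how such values distort a summary; the readings are made up for illustration and are not the actual Pima data values.

```python
# Made-up readings for illustration only (not the actual Pima data values).
diastolic_bp = [72, 66, 0, 64, 80, 0, 74, 70, 88, 62]

# A diastolic pressure of 0 is not physically plausible, so treat 0 as "missing".
cleaned = [v if v > 0 else None for v in diastolic_bp]
observed = [v for v in cleaned if v is not None]

n_missing = cleaned.count(None)
mean_with_zeros = sum(diastolic_bp) / len(diastolic_bp)
mean_observed = sum(observed) / len(observed)
print(f"missing: {n_missing}, mean with zeros: {mean_with_zeros:.1f}, "
      f"mean of observed: {mean_observed:.1f}")
```

Whether to drop, flag, or impute such values depends on the analysis; the point is that the histogram revealed them in the first place.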

Box Plots: Diabetes Data

[Figure: box plots of Body Mass Index for Healthy Individuals and Diabetic Individuals]

Two side-by-side box-plots of individuals from the Pima Indians Diabetes Data Set

Note: significant overplotting here that could easily be missed

Box Plots: Diabetes Data

[Figure: box plots of Body Mass Index for Healthy Individuals and Diabetic Individuals, annotated as follows]
• Box = middle 50% of data, from Q1 to Q3, with Q2 (median) marked inside
• Upper Whisker and Lower Whisker extend up to 1.5 x (Q3 - Q1) beyond the box
• Plots all data points outside “whiskers”

Two side-by-side box-plots of individuals from the Diabetes Data Set
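
A sketch of the quantities a box plot draws, assuming the common convention that each whisker ends at the most extreme data point within 1.5 x (Q3 - Q1) of the box. The `box_plot_stats` helper and the example values are hypothetical, and plotting libraries may compute quartiles slightly differently.

```python
import statistics

def box_plot_stats(values):
    """Quartiles, whisker ends, and the points plotted outside the whiskers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": q2, "q3": q3,
        "lower_whisker": min(inside),  # most extreme point within the fence
        "upper_whisker": max(inside),
        "outside": sorted(v for v in values if v < low_fence or v > high_fence),
    }

# Made-up body-mass-index-like values for illustration.
bmi = [22.0, 24.5, 25.1, 26.3, 27.8, 28.4, 29.9, 31.2, 33.0, 58.0]
stats = box_plot_stats(bmi)
print(stats)
```

Only the value 58.0 falls beyond the upper fence here, so it would be drawn as an individual point.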

Multiple Box Plots: Diabetes Data

[Figure: box plots comparing healthy and diabetic individuals; panels: Diastolic Blood Pressure, 24-hour Serum Insulin, Plasma Glucose Concentration, Body Mass Index]

Horizontal Box Plot for Planet Data

From: https://seaborn.pydata.org/examples/horizontal_boxplot.html

Exploring Pairs of Variables

Relationships between Pairs of Variables

• Say we have a variable Y we want to predict and many variables X that we could use to predict Y

• In exploratory data analysis we may be interested in quickly finding out if a particular X variable is potentially useful at predicting Y

• Options?
  – Linear correlation
  – Scatter plot: plot Y values versus X values

Linear Dependence between Pairs of Variables

• Covariance and correlation measure linear dependence

• Assume we have two variables or attributes X and Y and n objects taking values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is:

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr)$$

  where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y.

• The covariance is a measure of how X and Y vary together.
  – large and positive if large values of X are associated with large values of Y and small X ⇒ small Y

• (Pearson Linear) Correlation = scaled covariance, varies between -1 and 1

$$\rho(X, Y) = \frac{\sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr)}{\left( \sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)^2 \, \sum_{i=1}^{n} \bigl(y(i) - \bar{y}\bigr)^2 \right)^{1/2}}$$
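
A direct transcription of the sample covariance (with the 1/n normalization used on the slide) and the Pearson correlation into plain Python. The function names and the example values are assumptions for illustration.

```python
import math

def covariance(x, y):
    """Sample covariance with the 1/n normalization used on the slide."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

def correlation(x, y):
    """Pearson linear correlation: covariance scaled to lie in [-1, 1]."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]  # roughly y = 2x (made-up values)
print(f"cov = {covariance(x, y):.3f}, rho = {correlation(x, y):.4f}")
```

Note that covariance(x, x) is the variance of x, so the correlation is unchanged if either variable is rescaled by a positive constant.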

Data Set on Housing Prices in Boston

 #   Name      Description
 1   CRIM      per capita crime rate by town
 2   ZN        proportion of residential land zoned for lots over 25,000 sq. ft.
 3   INDUS     proportion of non-retail business acres per town
 4   NOX       nitrogen oxide concentration (parts per 10 million)
 5   RM        average number of rooms per dwelling
 6   AGE       proportion of owner-occupied units built prior to 1940
 7   DIS       weighted distances to five Boston employment centres
 8   RAD       index of accessibility to radial highways
 9   TAX       full-value property-tax rate per $10,000
10   PTRATIO   pupil-teacher ratio by town
11   MEDV      median value of owner-occupied homes in $1000's

(widely used data set in regression (prediction) research)

Matrix of Pairwise Linear Correlations

[Figure: color-coded correlation matrix, color scale from -1 through 0 to +1]
Variables shown: Industry, Nitrous oxide, Percentage of large residential lots, Crime Rate, Average # rooms, Median house value, Proportion of old houses, Distance to employment centers, Highway accessibility, Property tax rate, Student-teacher ratio

Data on characteristics of Boston housing

Examples of X-Y plots and linear correlation values

[Figures: two slides of example X-Y plots]

[Figure: two panels, “Linear Dependence” and “Non-Linear Dependence”]

Lack of linear correlation does not imply lack of dependence
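
A small sketch of this point: Y = X² is completely determined by X, yet its linear correlation with X is essentially zero when X is symmetric around 0. The `pearson` helper and the chosen grid of X values are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [i / 10 for i in range(-50, 51)]  # symmetric around 0
y_linear = [2 * v + 1 for v in x]     # linear dependence
y_quadratic = [v ** 2 for v in x]     # fully determined by X, but not linearly

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)
print(f"linear: r = {r_linear:.3f}, quadratic: r = {r_quadratic:.3f}")
```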

Summary Statistics for Anscombe’s 4 Data Sets

Data Set   N    Mean of X   Mean of Y
1          11   9.0         7.5
2          11   9.0         7.5
3          11   9.0         7.5
4          11   9.0         7.5

4 data sets, each with 2 variables X and Y, with the same summary statistics (imagine that Python reports these summaries and we have not yet looked at the data)

Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.

Anscombe’s 4 Data Sets

Guess the Linear Correlation Values for each Data Set

Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.

Summary Statistics for each Data Set

Data Set   N    Mean of X   Mean of Y   Intercept   Slope   Correlation
1          11   9.0         7.5         3           0.5     0.82
2          11   9.0         7.5         3           0.5     0.82
3          11   9.0         7.5         3           0.5     0.82
4          11   9.0         7.5         3           0.5     0.82

Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.

Lesson: summary statistics can be misleading
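
A sketch that recomputes these summaries from the published values of Anscombe's four data sets, using a least-squares line fit. The `summarize` helper is a hypothetical name; the data values are the standard ones from Anscombe (1973).

```python
import math

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # data sets 1-3 share the same X
anscombe = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """N, means, least-squares intercept and slope, and Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return {"n": n, "mean_x": mx, "mean_y": my,
            "intercept": my - slope * mx, "slope": slope,
            "correlation": sxy / math.sqrt(sxx * syy)}

summaries = {k: summarize(x, y) for k, (x, y) in anscombe.items()}
for k, s in summaries.items():
    print(k, {name: round(value, 2) for name, value in s.items()})
```

The four sets agree on every one of these numbers to about two decimal places, yet their scatter plots look nothing alike.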

Dangers of searching for correlations in high-dimensional data

Simulated 50 random Gaussian/normal data vectors, each with 100 variables
Results in a 50 x 100 data matrix

Below is a histogram of the (100 choose 2) = 4950 pairwise correlation coefficients

Even if data are entirely random (no dependence) there is a very high probability some variables will appear dependent just by chance.
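
A sketch of this experiment: 50 rows of 100 independent Gaussian variables, with the correlation computed for every pair of variables. The seed, the `pearson` helper, and the 0.3 cutoff used to count "large" correlations are assumptions for illustration.

```python
import math
import random

random.seed(3)
n_rows, n_vars = 50, 100
# Each column is an independent Gaussian variable: no true dependence anywhere.
columns = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_vars)]

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# One correlation for each unordered pair of variables.
correlations = [pearson(columns[a], columns[b])
                for a in range(n_vars) for b in range(a + 1, n_vars)]

largest = max(abs(r) for r in correlations)
n_large = sum(1 for r in correlations if abs(r) > 0.3)
print(f"{len(correlations)} pairs, largest |r| = {largest:.2f}, "
      f"pairs with |r| > 0.3: {n_large}")
```

With only 50 rows, each estimated correlation is noisy, and with thousands of pairs some of that noise is bound to look like signal.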

Correlations in a Large Random Data Set

From: https://seaborn.pydata.org/examples/many_pairwise_correlations.html

Conclusions so far?

• Summary statistics are useful … up to a point

• Linear correlation measures can be misleading

• There really is no substitute for plotting/visualizing the data

Scatter Plots

• Plot the value of one variable against the other

• Simple … but can be very informative, can reveal more than summary statistics

• For example, we can …
  – See if variables are dependent on each other (beyond linear dependence)
  – Detect if outliers are present
  – Can color-code to overlay group information (e.g., color points by class label for classification problems)

[Figure: scatter plot of Median Per Capita Income and Median Household Income]

(from US Zip code data: each point = 1 Zip code)

units = dollars

Constant Variance versus Changing Variance

[Figure: two scatter plots]
• Constant variance: variation in Y does not depend on X
• Changing variance: variation in Y changes with the value of X, e.g., Y = annual tax paid, X = income
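
A sketch contrasting the two cases on simulated data. The true line (Y = 2X), the noise levels, the seed, and the `residual_spread` helper are assumptions; the residuals use the known slope only because the data are simulated.

```python
import random
import statistics

random.seed(4)
x = [random.uniform(1, 10) for _ in range(2000)]
y_constant = [2 * v + random.gauss(0, 1) for v in x]        # noise spread is fixed
y_changing = [2 * v + random.gauss(0, 0.5 * v) for v in x]  # noise spread grows with X

def residual_spread(x, y, lo, hi):
    """Standard deviation of y - 2x over the points with lo <= x < hi."""
    return statistics.pstdev([yi - 2 * xi for xi, yi in zip(x, y) if lo <= xi < hi])

low_c, high_c = residual_spread(x, y_constant, 1, 3), residual_spread(x, y_constant, 8, 10)
low_v, high_v = residual_spread(x, y_changing, 1, 3), residual_spread(x, y_changing, 8, 10)
print(f"constant variance: spread {low_c:.2f} (small X) vs {high_c:.2f} (large X)")
print(f"changing variance: spread {low_v:.2f} (small X) vs {high_v:.2f} (large X)")
```

In a scatter plot the second case shows up as a fan shape that widens as X increases.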

Scatter-Plot Matrices: Example for Diabetes Data

Using Color to Show Group Information in Scatter Plots

Figure from www.originlab.com

Iris classification data set, 3 classes

Another Example with Grouping by Color

Figure from hci.stanford.edu

Outlier Detection

• Definition of an outlier?
  – No precise definition
  – Generally … “A data point that is significantly different from the rest of the data”
  – But how do we define “significantly different”? (many answers to this …)
  – Typically assumed to mean that the point was measured in error, or is not a true measurement in some sense

[Figure: two panels, “Outliers in 1 dimension” and “Outlier in 2 dimensions”; axes labeled X values and Y values]
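
One of the many possible answers to "significantly different", sketched for 1-dimensional data: a robust z-score based on the median and the median absolute deviation (MAD). The `flag_outliers` helper, the 0.6745 scaling with a 3.5 cutoff (a common rule of thumb), and the example values are assumptions, not a definition given in the slides.

```python
import statistics

def flag_outliers(values, threshold=3.5):
    """Return points whose robust z-score (median/MAD based) exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])  # median absolute deviation
    if mad == 0:
        return []  # no spread to compare against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Made-up measurements with one suspicious value.
values = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 9.6]
print(flag_outliers(values))
```

Using the median and MAD rather than the mean and standard deviation keeps the outlier itself from inflating the yardstick it is measured against.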
