pandas plotting capabilities -...

27
Pandas plotting capabilities Pandas built-in capabilities for data visualization it's built-off of matplotlib, but it's baked into pandas for easier usage. It provides the basic statistic plot types. Let's take again the mortgage example in the file train.csv (http://bit.do/train-csv) import pandas as pd import numpy as np loan_data = pd.read_csv("train.csv", index_col="Loan_ID") loan_data.head() Now say that you want to make a histogram of the column 'LoanAmount'. You can find all the possible matplotlib output on the gallery page http://matplotlib.org/gallery.html Call the appropriate method with: loan_data['LoanAmount'].plot.hist() As you can see the default binning it's not very descriptive in our case, so let's change some parameter

Upload: others

Post on 15-Nov-2019

21 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

Pandasplottingcapabilities

Pandasbuilt-incapabilitiesfordatavisualizationit'sbuilt-offofmatplotlib,butit'sbakedintopandasforeasierusage.Itprovidesthebasicstatisticplottypes.Let'stakeagainthemortgageexampleinthefiletrain.csv(http://bit.do/train-csv)

import pandas as pdimport numpy as nploan_data = pd.read_csv("train.csv", index_col="Loan_ID")loan_data.head()

Nowsaythatyouwanttomakeahistogramofthecolumn'LoanAmount'.Youcanfindallthepossiblematplotlib outputonthegallerypagehttp://matplotlib.org/gallery.htmlCalltheappropriatemethodwith:

loan_data['LoanAmount'].plot.hist()

Asyoucanseethedefaultbinningit'snotverydescriptiveinourcase,solet'schangesomeparameter

Page 2: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

loan_data['LoanAmount'].plot.hist(bins=40)

Ifthedefaultappearanceseemstoyouquitedull,youcanchangeitusingastylesheet

import matplotlib.pyplot as pltplt.style.use('ggplot')loan_data['LoanAmount'].plot.hist(bins=40)

Page 3: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

NowwewanttocomparethedistributionofApplicantIncome byLoan_Status:

loan_data.boxplot(column="ApplicantIncome",by="Loan_Status")loan_data.hist(column="ApplicantIncome",by="Loan_Status",bins=30)

ItseemsthattheIncomeisnotabigdecidingfactoronitsown,asthereisnoappreciabledifferencebetweenthepeoplewhoreceivedandweredeniedtheloan

Page 4: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

Seaborn• Ifyouwanttouseadvancedplottingfeaturesyoucanimportseaborn inyour

code.Seaborn isaPythonvisualizationlibrarybasedonmatplotlib.Itprovidesahigh-levelinterfacefordrawingattractivestatisticalgraphics.

• Here:

http://seaborn.pydata.org/examples/index.html

youcanfindanextensiveexamplesgalleryaboutwhatyoucanobtainfromit.

• Generallyspeaking:1. scatterplotshelpyoutounderstandrelationbetweentwocontinuousvariables.2. box-plothelpsyoutofindoutliersinyourdata.3. Histogramhelpsyoufindspreadofthedatawhencontinuous.4. Correlationplothelpsyoulearncorrelationwithvariables.5. Barplot helpsyoutounderstandrelationbetweenyourcontinous and

categoricalvariable.

Page 5: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

import seaborn as snsimport pandas as pdimport numpy as nploan_data = pd.read_csv("train.csv", index_col="Loan_ID")

impute_grps = loan_data.pivot_table(values=["LoanAmount"], index=["Gender","Married","Self_Employed"], aggfunc=np.mean)

loan_data['LoanAmount'].fillna(value=(impute_grps.loc[tuple([loan_data['Gender'][0], loan_data['Married'][0], loan_data['Self_Employed'][0]])][0]), inplace=True)

loan_data['Gender'].fillna(value=(loan_data['Gender'].mode())[0], inplace=True)loan_data['Married'].fillna(value=(loan_data['Married'].mode())[0], inplace=True)loan_data['Self_Employed'].fillna(value=(loan_data['Self_Employed'].mode())[0], inplace=True)

sns.countplot(x='Loan_Status',data=loan_data)

Page 6: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

sns.countplot(x='Loan_Status',hue='Credit_History',data=loan_data)

Thisisasimplecountofthecategoricalcolumn'Loan_Status'.Butlet'sinsertadistinctionbetweenpeoplewithorwithoutacredithistory:theplotwilltellusthatit'shighadvisabletohaveacredithistoryifyouwantaloan…

Page 7: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

sns.heatmap(loan_data.corr(),cmap='coolwarm',annot=True)

Withthecorr() functionyoucanobtainamatrixformedbyapairwisecorrelationofnumericalcolumns,excludingnanvalues,thatcangiveauseful"firstlook"aboutthedata:

Page 8: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

FacetGrid isthegeneralwaytocreategridsofplotsbasedoffofafeature:

import matplotlib.pyplot as pltg = sns.FacetGrid(loan_data, col="Married", row="Education",hue="Gender")g = g.map(plt.hist, "LoanAmount")

Page 9: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

Nowlet'sreloadthekaggle datasetsalaries.csv:wewillseeothertypesofseaborn datavisualization(http://bit.do/salaries-csv)salaries=pd.read_csv("Salaries.csv", index_col='Id')salaries.head()sns.jointplot(x='BasePay',y='Benefits',data=salaries,kind='scatter')#try reg,hex, kde

Page 10: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

sns.rugplot(salaries["TotalPayBenefits"]) sns.kdeplot(salaries["TotalPayBenefits"])

rugplots justdrawadashmarkforeverypointonaunivariatedistribution.TheyarethebuildingblockofaKDE(KernelDensityEstimator)plot.TheseKDEplotsreplaceeverysingleobservationwithaGaussian(Normal)distributioncenteredaroundthatvalue.

Page 11: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

sns.boxplot(x="Year", y="TotalPayBenefits",data=salaries,palette='rainbow')

Theboxshowsthequartilesofthedatasetwhilethewhiskersextendtoshowtherestofthedistribution,exceptforpointsthataredeterminedtobe“outliers”usingamethodthatisafunctionoftheinter-quartilerange.

Page 12: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

sns.violinplot(x="Year", y="TotalPayBenefits",data=salaries,palette='magma')

Aviolinplotshowsthedistributionofquantitativedataacrossseverallevelsofone(ormore)categoricalvariablessuchthatthosedistributionscanbecompared.Unlikeaboxplot,inwhichalloftheplotcomponentscorrespondtoactualdatapoints,theviolinplotfeaturesakerneldensityestimationoftheunderlyingdistribution.

Page 13: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

lm = sns.lmplot(x='ApplicantIncome',y='LoanAmount',data=loan_data,palette='coolwarm')axes=lm.axesaxes[0,0].set_xlim(0,20000)

lmplot allowsyoutodisplaylinearmodels,butitalsoconvenientlyallowsyoutosplitupthoseplotsbasedoffoffeatures,aswellascoloringthehuebasedoffoffeatures.

Page 14: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

Plotly andCufflinks• Plotly isalibrarythatallowsyoutocreateinteractiveplots

thatyoucanuseindashboardsorwebsites(youcansavethemashtmlfilesorstaticimages).

• Cufflinksisalibrarythatallowsyoutocallplotsdirectlyoffofapandasdataframe.

• ToseedirectlythegeneratedplotswewilluseJupyternotebook,thatAnacondahasinstalledforyou

• Theselibrariesarenotcurrentlyavailablethroughcondabutareavailablethroughpip.SoyouhavetoinstallpipinAnaconda,anduseittoinstallthem:

conda install pippip install plotlypip install cufflinks

Page 15: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

UseaterminalCLItowritethefollowingcommand:

jupyter notebook

Thedefaultbrowserwillbeopenedwithafilebrowserofthedirectoryfromwhichyouhaveissuedthecommand.WiththisbrowseryoucanopenapreviouslysavedJupiternotebookfile,oryoucancreateanewnotebookwiththe"New"buttonontheright:

SelectthePython[conda root]andanewtabwillbeopenedwiththefamiliarIPythonprompt

Page 16: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

IfyoutypeacommandandpressEnterthecodecellwillenteranewline:toexecutethecodeyoumustpressShift+Enter

Plotly isopen-source,butyouneedtotellitthatyou'reworkingoffline.Asyoucanseeonthesitehttp://plot.ly,thereisacorrespondingcompanythatmakesmoneyhostingyourvisualizationsanddashboards.Thesitehostsagreatgalleryofexamples.

Page 17: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for
Page 18: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for
Page 19: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for
Page 20: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for
Page 21: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

IfyouwanttousethislibraryinIPython youcansavetheinteractiveplotsonafileandthenuseplot()

fig = dataframe.iplot(kind='bar',asFigure=True)plot(fig)

Page 22: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

Exercises!Seaborn comeswiththepossibilitytoloadadatasetfromanonlinerepository(requiresinternet),withthefunctionload_dataset(string).Youcanobtainthelistofavailabledatasetswithget_dataset_names()Youcanfindinthereferencesthepageongithub thathoststhem.

import seaborn as sns%matplotlib inlinesns.get_dataset_names()tips = sns.load_dataset('tips')tips.head()

Feelfreetoexperimentwithalltheavailabledatasets.Forourexercisesyoumustusethetipsdataset,thatgives244recordaboutthecustomersofarestaurant.

Yourjobwillbetoreproducetheplotshowninthefollowingpages:

Page 23: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for
Page 24: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for
Page 25: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for
Page 26: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

ExerciseOnKaggle youcanfindadatabasethatcontainsthe911callsoftheMontgomeryCounty,PA.Thefileisdownloadedforyouandyoucanfindithere:

http://bit.do/911-csv

Explorethedata,andbuildaheatmap ofthenumberofcallsperHourandDayofWeek,showingclearlythatbetween16:00and17:00thereisamaximum.

MontgomeryCounty,locallyalsoreferredtoasMontco,isacountylocatedintheCommonwealthofPennsylvania.Asofthe2010census,thepopulationwas799,874,makingitthethird-mostpopulouscountyinPennsylvania,afterPhiladelphiaandAlleghenyCounties.ThecountyseatisNorristown.MontgomeryCountyisverydiverse,rangingfromfarmsandopenlandinUpperHanovertodenselypopulatedrowhouse streetsinCheltenham.

Page 27: Pandas plotting capabilities - Dottoratophdphysics.cloud.ba.infn.it/wp-content/uploads/2017/06/8-Diacono-3Maggio.pdf · Pandas plotting capabilities Pandas built-in capabilities for

• http://matplotlib.org/users/pyplot_tutorial.html• http://matplotlib.org/gallery.html• http://www.labri.fr/perso/nrougier/teaching/matplotlib/• http://matplotlib.org/api/pyplot_api.html• http://scipy-lectures.github.io/matplotlib/matplotlib.html• http://seaborn.pydata.org/• http://seaborn.pydata.org/examples/index.html• https://plot.ly/python/reference/• http://web.quant-platform.com/trial/yves/Plotly_Cufflinks.html• http://images.plot.ly/plotly-

documentation/images/python_cheat_sheet.pdf• https://plot.ly/ipython-notebooks/cufflinks/• https://github.com/mwaskom/seaborn-data