pandas plotting capabilities -...

Post on 15-Nov-2019

22 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Pandasplottingcapabilities

Pandasbuilt-incapabilitiesfordatavisualizationit'sbuilt-offofmatplotlib,butit'sbakedintopandasforeasierusage.Itprovidesthebasicstatisticplottypes.Let'stakeagainthemortgageexampleinthefiletrain.csv(http://bit.do/train-csv)

import pandas as pdimport numpy as nploan_data = pd.read_csv("train.csv", index_col="Loan_ID")loan_data.head()

Nowsaythatyouwanttomakeahistogramofthecolumn'LoanAmount'.Youcanfindallthepossiblematplotlib outputonthegallerypagehttp://matplotlib.org/gallery.htmlCalltheappropriatemethodwith:

loan_data['LoanAmount'].plot.hist()

Asyoucanseethedefaultbinningit'snotverydescriptiveinourcase,solet'schangesomeparameter

loan_data['LoanAmount'].plot.hist(bins=40)

Ifthedefaultappearanceseemstoyouquitedull,youcanchangeitusingastylesheet

import matplotlib.pyplot as pltplt.style.use('ggplot')loan_data['LoanAmount'].plot.hist(bins=40)

NowwewanttocomparethedistributionofApplicantIncome byLoan_Status:

loan_data.boxplot(column="ApplicantIncome",by="Loan_Status")loan_data.hist(column="ApplicantIncome",by="Loan_Status",bins=30)

ItseemsthattheIncomeisnotabigdecidingfactoronitsown,asthereisnoappreciabledifferencebetweenthepeoplewhoreceivedandweredeniedtheloan

Seaborn• Ifyouwanttouseadvancedplottingfeaturesyoucanimportseaborn inyour

code.Seaborn isaPythonvisualizationlibrarybasedonmatplotlib.Itprovidesahigh-levelinterfacefordrawingattractivestatisticalgraphics.

• Here:

http://seaborn.pydata.org/examples/index.html

youcanfindanextensiveexamplesgalleryaboutwhatyoucanobtainfromit.

• Generallyspeaking:1. scatterplotshelpyoutounderstandrelationbetweentwocontinuousvariables.2. box-plothelpsyoutofindoutliersinyourdata.3. Histogramhelpsyoufindspreadofthedatawhencontinuous.4. Correlationplothelpsyoulearncorrelationwithvariables.5. Barplot helpsyoutounderstandrelationbetweenyourcontinous and

categoricalvariable.

import seaborn as snsimport pandas as pdimport numpy as nploan_data = pd.read_csv("train.csv", index_col="Loan_ID")

impute_grps = loan_data.pivot_table(values=["LoanAmount"], index=["Gender","Married","Self_Employed"], aggfunc=np.mean)

loan_data['LoanAmount'].fillna(value=(impute_grps.loc[tuple([loan_data['Gender'][0], loan_data['Married'][0], loan_data['Self_Employed'][0]])][0]), inplace=True)

loan_data['Gender'].fillna(value=(loan_data['Gender'].mode())[0], inplace=True)loan_data['Married'].fillna(value=(loan_data['Married'].mode())[0], inplace=True)loan_data['Self_Employed'].fillna(value=(loan_data['Self_Employed'].mode())[0], inplace=True)

sns.countplot(x='Loan_Status',data=loan_data)

sns.countplot(x='Loan_Status',hue='Credit_History',data=loan_data)

Thisisasimplecountofthecategoricalcolumn'Loan_Status'.Butlet'sinsertadistinctionbetweenpeoplewithorwithoutacredithistory:theplotwilltellusthatit'shighadvisabletohaveacredithistoryifyouwantaloan…

sns.heatmap(loan_data.corr(),cmap='coolwarm',annot=True)

Withthecorr() functionyoucanobtainamatrixformedbyapairwisecorrelationofnumericalcolumns,excludingnanvalues,thatcangiveauseful"firstlook"aboutthedata:

FacetGrid isthegeneralwaytocreategridsofplotsbasedoffofafeature:

import matplotlib.pyplot as pltg = sns.FacetGrid(loan_data, col="Married", row="Education",hue="Gender")g = g.map(plt.hist, "LoanAmount")

Nowlet'sreloadthekaggle datasetsalaries.csv:wewillseeothertypesofseaborn datavisualization(http://bit.do/salaries-csv)salaries=pd.read_csv("Salaries.csv", index_col='Id')salaries.head()sns.jointplot(x='BasePay',y='Benefits',data=salaries,kind='scatter')#try reg,hex, kde

sns.rugplot(salaries["TotalPayBenefits"]) sns.kdeplot(salaries["TotalPayBenefits"])

rugplots justdrawadashmarkforeverypointonaunivariatedistribution.TheyarethebuildingblockofaKDE(KernelDensityEstimator)plot.TheseKDEplotsreplaceeverysingleobservationwithaGaussian(Normal)distributioncenteredaroundthatvalue.

sns.boxplot(x="Year", y="TotalPayBenefits",data=salaries,palette='rainbow')

Theboxshowsthequartilesofthedatasetwhilethewhiskersextendtoshowtherestofthedistribution,exceptforpointsthataredeterminedtobe“outliers”usingamethodthatisafunctionoftheinter-quartilerange.

sns.violinplot(x="Year", y="TotalPayBenefits",data=salaries,palette='magma')

Aviolinplotshowsthedistributionofquantitativedataacrossseverallevelsofone(ormore)categoricalvariablessuchthatthosedistributionscanbecompared.Unlikeaboxplot,inwhichalloftheplotcomponentscorrespondtoactualdatapoints,theviolinplotfeaturesakerneldensityestimationoftheunderlyingdistribution.

lm = sns.lmplot(x='ApplicantIncome',y='LoanAmount',data=loan_data,palette='coolwarm')axes=lm.axesaxes[0,0].set_xlim(0,20000)

lmplot allowsyoutodisplaylinearmodels,butitalsoconvenientlyallowsyoutosplitupthoseplotsbasedoffoffeatures,aswellascoloringthehuebasedoffoffeatures.

Plotly andCufflinks• Plotly isalibrarythatallowsyoutocreateinteractiveplots

thatyoucanuseindashboardsorwebsites(youcansavethemashtmlfilesorstaticimages).

• Cufflinksisalibrarythatallowsyoutocallplotsdirectlyoffofapandasdataframe.

• ToseedirectlythegeneratedplotswewilluseJupyternotebook,thatAnacondahasinstalledforyou

• Theselibrariesarenotcurrentlyavailablethroughcondabutareavailablethroughpip.SoyouhavetoinstallpipinAnaconda,anduseittoinstallthem:

conda install pippip install plotlypip install cufflinks

UseaterminalCLItowritethefollowingcommand:

jupyter notebook

Thedefaultbrowserwillbeopenedwithafilebrowserofthedirectoryfromwhichyouhaveissuedthecommand.WiththisbrowseryoucanopenapreviouslysavedJupiternotebookfile,oryoucancreateanewnotebookwiththe"New"buttonontheright:

SelectthePython[conda root]andanewtabwillbeopenedwiththefamiliarIPythonprompt

IfyoutypeacommandandpressEnterthecodecellwillenteranewline:toexecutethecodeyoumustpressShift+Enter

Plotly isopen-source,butyouneedtotellitthatyou'reworkingoffline.Asyoucanseeonthesitehttp://plot.ly,thereisacorrespondingcompanythatmakesmoneyhostingyourvisualizationsanddashboards.Thesitehostsagreatgalleryofexamples.

IfyouwanttousethislibraryinIPython youcansavetheinteractiveplotsonafileandthenuseplot()

fig = dataframe.iplot(kind='bar',asFigure=True)plot(fig)

Exercises!Seaborn comeswiththepossibilitytoloadadatasetfromanonlinerepository(requiresinternet),withthefunctionload_dataset(string).Youcanobtainthelistofavailabledatasetswithget_dataset_names()Youcanfindinthereferencesthepageongithub thathoststhem.

import seaborn as sns%matplotlib inlinesns.get_dataset_names()tips = sns.load_dataset('tips')tips.head()

Feelfreetoexperimentwithalltheavailabledatasets.Forourexercisesyoumustusethetipsdataset,thatgives244recordaboutthecustomersofarestaurant.

Yourjobwillbetoreproducetheplotshowninthefollowingpages:

ExerciseOnKaggle youcanfindadatabasethatcontainsthe911callsoftheMontgomeryCounty,PA.Thefileisdownloadedforyouandyoucanfindithere:

http://bit.do/911-csv

Explorethedata,andbuildaheatmap ofthenumberofcallsperHourandDayofWeek,showingclearlythatbetween16:00and17:00thereisamaximum.

MontgomeryCounty,locallyalsoreferredtoasMontco,isacountylocatedintheCommonwealthofPennsylvania.Asofthe2010census,thepopulationwas799,874,makingitthethird-mostpopulouscountyinPennsylvania,afterPhiladelphiaandAlleghenyCounties.ThecountyseatisNorristown.MontgomeryCountyisverydiverse,rangingfromfarmsandopenlandinUpperHanovertodenselypopulatedrowhouse streetsinCheltenham.

• http://matplotlib.org/users/pyplot_tutorial.html• http://matplotlib.org/gallery.html• http://www.labri.fr/perso/nrougier/teaching/matplotlib/• http://matplotlib.org/api/pyplot_api.html• http://scipy-lectures.github.io/matplotlib/matplotlib.html• http://seaborn.pydata.org/• http://seaborn.pydata.org/examples/index.html• https://plot.ly/python/reference/• http://web.quant-platform.com/trial/yves/Plotly_Cufflinks.html• http://images.plot.ly/plotly-

documentation/images/python_cheat_sheet.pdf• https://plot.ly/ipython-notebooks/cufflinks/• https://github.com/mwaskom/seaborn-data

top related