pre processing big data

Pre-Processing Big Data Techniques to improve quality of big data analysis Maloy Manna


Post on 22-Jan-2018


Pre-Processing Big Data
Techniques to improve quality of big data analysis

Maloy Manna

Abstract

• Data in the real world is almost always dirty, incomplete, scattered or inconsistent. For data scientists, ‘janitor work’ is a key hurdle to data insights.
• Whether you use big data for analytics or data science, with the increasing variety and velocity of big data, the data pre-processing step can be the most time-consuming step in your data pipeline.
• With feature engineering concepts and practical examples in Python and R, this webinar will focus on technical considerations and data engineering techniques to optimize data preparation and get the most value from your big data pipeline.

Speaker profile

Maloy Manna
Engineering, Data Innovation Lab

• Building data-driven products and services for over 15 years

• Worked at: insurance leader AXA, information leader Thomson Reuters, data science startup Saama, consulting firms Infosys & TCS

linkedin.com/in/maloy @itsmaloy biguru.wordpress.com

Agenda

• Data engineering patterns
• Data preparation | pre-processing
• Exploratory data analysis
• Data cleaning techniques
• Data reduction techniques
• Data transformation
• Data integration

Data strategy
• Start with the business question(s)
• Define goals… and metrics
• Initial hypothesis... data needs
• Experiment… gain insight
• Take actions... refine
• Prioritize… build roadmap

Data science lifecycle

• Pre-cycle | Data strategy

Permanent POC

• Data strategy
• Lifecycle
• ... Infrastructure

Data pipeline

• Integrations

Analytics

• Insights from analytics

Data preparation

Why pre-process data?

• Errors in data collection
• Measurement error
• Human errors
• Naming conventions
• Duplicate records
• Incomplete data
• Inconsistent data
• “Noise” in data

Data preparation

• Data acquisition
• Data preparation
• Data integration
• Data transformation
• Data cleaning
• Data reduction

• Key factor in model quality
• Insights based on “trusted” data

Data pre-processing

Data preparation
• Exploratory Data Analysis
• Data cleaning
• Data reduction
• Data transformation
• Data integration

• Key factor in model quality
• Insights based on “trusted” data

Exploratory data analysis

Data visualization
• Example: Anscombe’s quartet
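Anscombe's quartet makes the case for plotting before modeling: its four small datasets share nearly identical summary statistics yet look completely different on a scatter plot. The sketch below uses the published quartet values and only the Python standard library; the helper name `summarize` is introduced here for illustration and is not from the slides.

```python
from statistics import mean, pvariance

# Anscombe's quartet (Anscombe, 1973). Datasets I-III share the same x values.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean of x, mean of y, Pearson correlation and least-squares slope."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    corr = cov / (pvariance(x) * pvariance(y)) ** 0.5
    slope = cov / pvariance(x)
    return {"mean_x": mx, "mean_y": my, "corr": corr, "slope": slope}


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    # The four printed rows are almost indistinguishable.
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Because the numbers agree so closely, only a plot of each dataset reveals the curve in II, the outlier in III and the single high-leverage point in IV.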

Exploratory data analysis

Goodness of fit
• R-squared [explained variation / total variation]
  • Not sufficient until residual plots are examined for bias
• Adjusted R-squared
  • Adjusts for the number of explanatory variables in a model relative to the number of data points
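The two measures above can be computed directly from observed and fitted values. This is a minimal sketch using the standard definitions; the function names and the sample numbers are hypothetical and chosen only for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot.

    For least-squares fits with an intercept this equals
    explained variation / total variation.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R^2 for the number of explanatory variables."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def residuals(y_true, y_pred):
    """Residuals to plot against fitted values when checking for bias."""
    return [yt - yp for yt, yp in zip(y_true, y_pred)]


# Hypothetical observations and predictions from a one-variable model.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.9, 4.9]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_obs=len(y_true), n_predictors=1)
print(round(r2, 4), round(adj_r2, 4))
```

Adjusted R-squared is always at or below R-squared, and the gap widens as predictors are added without a matching gain in fit.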

Data cleaning

• Reformat data values or layout
• Standardize data [common units]
• Correct erroneous values
• Fill in / exclude missing values
• Validating data (e.g. dates / addresses)
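Two of the cleaning steps above, validating dates and standardizing units, can be sketched with the standard library alone. The accepted date formats and the unit table are assumptions made for this example; a real pipeline would take them from the source systems' documentation.

```python
from datetime import datetime

# Assumed input formats, tried in order. Day-first is an assumption:
# "03/04/2018" is ambiguous without knowing the source convention.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%b-%Y")

# Conversion factors to a common unit (metres).
TO_METRES = {"m": 1.0, "cm": 0.01, "km": 1000.0, "ft": 0.3048}


def parse_date(text):
    """Return an ISO-8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # wrong format or impossible date; try the next format
    return None


def to_metres(value, unit):
    """Convert a length to metres; raises KeyError for an unknown unit."""
    return value * TO_METRES[unit.strip().lower()]


print(parse_date("22-Jan-2018"))  # reformatted to ISO-8601
print(parse_date("31/02/2018"))   # impossible date, reported as None
print(to_metres(250, "cm"))
```

Returning `None` for unparseable dates keeps the bad value visible so that a later step can decide whether to correct, fill in or exclude the record.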

Data cleaning

Handling missing data [tactics]
• Ignore records with missing data
• Fill in values (if known / available)
• Use a global constant, e.g. NULL, unknown
• Use the attribute value mean
• Infer the most probable value
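The tactics above might look as follows on a small list of records. The records, field names and helper functions are hypothetical. Using the most frequent value is only a crude stand-in for "infer the most probable value", which in practice often means a model-based imputation.

```python
from statistics import mean, mode

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "city": "Paris"},
    {"id": 2, "age": None, "city": "Lyon"},
    {"id": 3, "age": 46, "city": None},
    {"id": 4, "age": 40, "city": "Paris"},
]


def drop_incomplete(rows):
    """Ignore records that have any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def fill_with(rows, field, value):
    """Return copies of rows with missing `field` entries set to `value`."""
    return [{**r, field: value if r[field] is None else r[field]} for r in rows]


def observed(rows, field):
    """Non-missing values of one field."""
    return [r[field] for r in rows if r[field] is not None]


complete_only = drop_incomplete(records)
with_constant = fill_with(records, "city", "unknown")
with_mean = fill_with(records, "age", mean(observed(records, "age")))
with_mode = fill_with(records, "city", mode(observed(records, "city")))

print(len(complete_only), with_mean[1]["age"], with_mode[2]["city"])
```

Each helper returns new records instead of mutating the input, so the tactics can be compared side by side on the same raw data.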

Data cleaning

Smoothing noise
• Regression
• Using class intervals or “binning”
• Clustering and removing outliers
  • K-means clustering [n observations, k clusters]
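Of the smoothing options above, binning is the simplest to show. The sketch below sorts the values, splits them into equal-frequency bins and replaces each value with its bin mean. The price list is a small illustrative sample and the function name is an assumption of this example.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: replace each sorted value by its bin's mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)  # three bins with means 9, 22 and 29
```

A larger bin size smooths more aggressively but also hides more real variation, so the size is a tuning choice and not a fixed rule.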

Data reduction

Dimensionality reduction | Why reduce?

• Too many variables
• Multi-collinearity [highly correlated multiple predictor variables]
• Less computation
• Reduces noise, improves model performance
• Compress data, reduce storage

Data reduction

• Dimensionality reduction
• Numerization: [non-numeric attributes to numeric]
  • Useful for SVM [support vector machine] and neural networks
• Categorization: [non-categorical attributes to categorical]
  • e.g. dummy variable (binary states)
  • Useful for Naive Bayes and Bayesian networks
• Feature extraction, e.g. PCA [Principal Component Analysis]
• Feature reduction
  • Uses removal of low / almost-zero variance and highly correlated variables
  • Reduces computation costs
  • Improves model interpretability
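Feature reduction as described above can be sketched as a two-step filter: drop near-constant columns, then drop any column that is highly correlated with one already kept. The thresholds, column names and data below are hypothetical, and keeping the first column of a correlated pair is an arbitrary choice made for this example.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pvariance(x) * pvariance(y)) ** 0.5


def reduce_features(columns, min_variance=1e-8, max_abs_corr=0.95):
    """Return the names of columns kept after both filters.

    columns maps a feature name to its list of numeric values.
    """
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # near-constant column carries almost no information
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue  # nearly redundant with a column that is already kept
        kept.append(name)
    return kept


# Hypothetical feature table.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],   # same quantity, other unit
    "country_code": [1.0, 1.0, 1.0, 1.0, 1.0],     # zero variance
    "weight_kg": [70.0, 52.0, 81.0, 60.0, 75.0],
}

kept_features = reduce_features(features)
print(kept_features)
```

The variance check runs first on purpose: a constant column has zero variance, which would otherwise cause a division by zero in the correlation.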

Data reduction

• PCA: Principal Component Analysis
  • Goal is to project a d-dimensional dataset onto a k-dimensional subspace (where k <= d) to increase computational efficiency
  • In essence, the original variables are reduced to a new set of variables, each a linear combination of the originals, called principal components.
  • Data needs to be standardized beforehand
  • scikit-learn provides an implementation
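To make the mechanics concrete, the sketch below standardizes a small dataset and finds only the first principal component (k = 1) by power iteration on the covariance matrix. This is a dependency-free illustration, not how scikit-learn computes PCA; in practice the scikit-learn implementation mentioned above would normally be used. The data and function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def covariance_of_centered(rows):
    """Population covariance matrix; assumes every column already has zero mean."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]


def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def first_principal_component(cov, iterations=200):
    """Approximate the dominant eigenvector and eigenvalue by power iteration.

    The asymmetric start vector only makes an unlucky start (exactly
    orthogonal to the answer) unlikely; it does not rule it out.
    """
    v = [1.0 / (i + 1) for i in range(len(cov))]
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue


# Hypothetical data: the first two features are strongly correlated.
data = [
    [2.5, 2.4, 1.0], [0.5, 0.7, 3.0], [2.2, 2.9, 2.0], [1.9, 2.2, 5.0],
    [3.1, 3.0, 4.0], [2.3, 2.7, 1.5], [2.0, 1.6, 3.5], [1.0, 1.1, 2.5],
]

z = standardize(data)
cov = covariance_of_centered(z)
pc1, variance_pc1 = first_principal_component(cov)
scores = [sum(a * b for a, b in zip(row, pc1)) for row in z]  # 1-D projection
explained_ratio = variance_pc1 / len(cov)  # total variance is d after scaling
print([round(x, 3) for x in pc1], round(explained_ratio, 3))
```

Standardizing first matters because PCA follows variance: without it, a feature measured in large units would dominate the first component regardless of how informative it is.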

Data transformation
• Modify data to a form suitable for analysis and modeling
• Standard transformation functions / needs:
  • Reshape data (sort, append | feature generation, pivot)
  • Join data (union, intersection, join, match)
  • Subset data (filter, drop, distinct)
  • Aggregate (group, windowing)
  • Mathematical operations
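The transformation families listed above can be chained on plain Python records as a rough illustration. The tables, field names and the 30.0 threshold are hypothetical. At scale the same steps map onto the R reshape2 functions or Spark transformations cited in the references.

```python
from collections import defaultdict

# Hypothetical input tables.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 120.0},
    {"order_id": 2, "customer_id": "c2", "amount": 35.5},
    {"order_id": 3, "customer_id": "c1", "amount": 80.0},
    {"order_id": 3, "customer_id": "c1", "amount": 80.0},  # duplicate row
    {"order_id": 4, "customer_id": "c3", "amount": 45.0},  # unknown customer
    {"order_id": 5, "customer_id": "c2", "amount": 12.0},  # below threshold
]
customer_country = {"c1": "FR", "c2": "DE"}

# Subset: distinct on order_id, then filter on amount.
distinct_orders = list({o["order_id"]: o for o in orders}.values())
large_orders = [o for o in distinct_orders if o["amount"] >= 30.0]

# Join: inner join on customer_id, which also generates a new feature.
joined = [
    {**o, "country": customer_country[o["customer_id"]]}
    for o in large_orders
    if o["customer_id"] in customer_country
]

# Aggregate: total amount grouped by country.
totals = defaultdict(float)
for row in joined:
    totals[row["country"]] += row["amount"]

# Reshape: sort the aggregated result by total, largest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Order 4 disappears at the join because this sketch uses an inner join; a left join would keep it with a missing country, which is then a data cleaning question.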

Data integration
• Data pipeline level | Individual dataset level
• Standardizing schema
• Metadata management crucial
• Automation is key
• Tools automate several machine learning tasks
  • Deduplication, online entity resolution, data enrichment / geocoding
  • Reference metadata catalog, tagging and search
  • REST API microservices for integration with analytics
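Schema standardization and deduplication, two of the integration tasks above, can be sketched for a pair of small sources. The alias table, field names and records are hypothetical. Matching on a lower-cased e-mail is a deliberately simple stand-in for entity resolution, which real tools handle with fuzzy matching.

```python
# Hypothetical mapping from source-specific field names to a canonical schema.
FIELD_ALIASES = {
    "e-mail": "email",
    "email_address": "email",
    "full_name": "name",
    "customer": "name",
}


def standardize_schema(record):
    """Rename fields to canonical names and trim string values."""
    out = {}
    for key, value in record.items():
        field = key.strip().lower()
        canonical = FIELD_ALIASES.get(field, field)
        out[canonical] = value.strip() if isinstance(value, str) else value
    return out


def integrate(*sources):
    """Merge sources, deduplicating on a case-insensitive e-mail key."""
    merged = {}
    for source in sources:
        for raw in source:
            record = standardize_schema(raw)
            key = record["email"].lower()
            # The first record seen wins; later ones only fill missing fields.
            existing = merged.setdefault(key, record)
            for field, value in record.items():
                existing.setdefault(field, value)
    return list(merged.values())


crm = [{"Full_Name": "Ada Lovelace", "E-mail": "Ada@Example.com "}]
billing = [
    {"customer": "A. Lovelace", "email_address": "ada@example.com", "country": "UK"},
    {"customer": "Alan Turing", "email_address": "alan@example.com", "country": "UK"},
]

integrated = integrate(crm, billing)
print(integrated)
```

Keeping the alias table as data instead of code is one small form of the metadata management the slide calls crucial: adding a source then means adding entries, not editing logic.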

References & further reading

• PCA (Principal Component Analysis): https://en.wikipedia.org/wiki/Principal_component_analysis
• Data science lifecycle: http://www.datasciencecentral.com/profiles/blogs/the-data-science-project-lifecycle
• R-squared concepts: http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
• Dimensionality reduction: https://en.wikipedia.org/wiki/Dimensionality_reduction
• R reshape2 package reference: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf
• Spark transformations: http://spark.apache.org/docs/latest/programming-guide.html#transformations
• The totally managed analytics pipeline: https://segment.com/blog/the-totally-managed-analytics-pipeline/