Merge multiple CSV files into a single data frame using R

Merge Multiple files into single dataframe using R Yogesh Khandelwal

Upload: yogesh-khandelwal

Post on 09-Jan-2017


Page 1: Merge Multiple CSV in single data frame using R

Merge Multiple files into single dataframe using R

Yogesh Khandelwal


Problem Description

• The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor, and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file "200.csv".

• Data Source: http://spark-public.s3.amazonaws.com/compdata/data/specdata.zip



Variables in file

• Date: the date of observation in YYYY-MM-DD format (year-month-day). Datatype: factor (character in R 4.0 and later, where read.csv() no longer converts strings to factors by default)

• sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter). Datatype: num

• nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter). Datatype: num

• ID: the location (monitor) ID. Datatype: int


Before we start we should know

• Functions in R

• How to merge data files


Functions in R


Functions are created using the function() directive and are stored as R objects just like anything else. In particular, they are R objects of class "function".

f <- function(<arguments>) {
    ## Do something interesting
}

• Functions in R are "first class objects", which means that they can be treated much like any other R object. Importantly:
• Functions can be passed as arguments to other functions.
• Functions can be nested, so that you can define a function inside of another function.
• The return value of a function is the last expression in the function body to be evaluated.
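The three properties listed above can be sketched in a few lines. The names apply_twice, square and make_adder are made up for this illustration and do not come from the slides.

```r
# Hypothetical helper names, for illustration only.

# 1. A function passed as an argument to another function
apply_twice <- function(f, x) {
  f(f(x))
}
square <- function(x) x^2
result_twice <- apply_twice(square, 3)   # square(square(3))

# 2. A function defined inside another function
make_adder <- function(n) {
  add_n <- function(x) x + n   # nested function; it can see n
  add_n                        # 3. last expression evaluated, so it is the return value
}
add_five <- make_adder(5)
result_add <- add_five(10)

class(square)   # "function"
```

Note that no explicit return() is needed in either function; the value of the last evaluated expression is returned.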


Functions, contd.

• For example: [Figure: a code example annotated with the function name, the function definition, and the function call]
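As a stand-in for the annotated example, here is a minimal sketch that marks the same three parts. The function column_mean and the readings data frame are hypothetical and are not the example from the original slide.

```r
# Function name: column_mean (hypothetical, for illustration)
# Function definition: the arguments in parentheses plus the body in braces
column_mean <- function(df, column, remove_na = TRUE) {
  mean(df[[column]], na.rm = remove_na)
}

# Function call: pass actual values for the arguments
readings <- data.frame(sulfate = c(2, 4, NA), nitrate = c(1, NA, 3))
sulfate_mean <- column_mean(readings, "sulfate")
sulfate_mean
```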


Our objective

• How can we merge a number of files into a single data frame?

• How can we apply the same function to different files in an efficient way?


How to merge two different files?


• A number of options are available, such as:

1. Use the merge() function
2. Use rbind(), cbind(), etc.
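To make the difference between these options concrete, the sketch below applies each one to two tiny, made-up data frames (the values are invented for illustration and are not taken from the pollution data).

```r
# Toy data frames, invented for this illustration
a <- data.frame(ID = c(1, 2, 3), sulfate = c(3.1, 2.7, 4.0))
b <- data.frame(ID = c(2, 3, 4), nitrate = c(0.8, 1.2, 0.5))

# merge(): join on a shared key column; by default only matching IDs are kept
joined <- merge(a, b, by = "ID")

# rbind(): stack rows; both data frames need the same column names
more_a <- data.frame(ID = 4, sulfate = 5.2)
stacked <- rbind(a, more_a)

# cbind(): place columns side by side; both sides need the same number of rows
side_by_side <- cbind(a, flag = c("ok", "ok", "check"))
```

For the monitor files, every file has the same columns, so stacking rows with rbind() is the natural fit; merge() is the choice when two tables share a key but hold different columns.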


How to merge a number of files into a single data frame

• Approach 1

files <- list.files("specdata", full.names = TRUE)
dat <- NULL
for (i in 1:332) {
    dat <- rbind(dat, read.csv(files[i]))
}

• We can then run various commands on the merged object as needed, for example:

1. str(dat)
2. head(dat)
3. tail(dat), etc.

Notes: full.names is a logical value. If TRUE, the directory path is prepended to the file names to give a relative file path. If FALSE, the file names (rather than paths) are returned.
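A possible alternative to growing dat inside a loop is to read all files into a list and bind them in one call. This is a sketch, not the approach from the slides. So that it runs without the real download, it first writes three small made-up files into a temporary folder; the folder name specdata_demo and the values are assumptions for the demo.

```r
# Build a small stand-in for the "specdata" folder (made-up values)
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (id in 1:3) {
  monitor <- data.frame(
    Date = c("2003-01-01", "2003-01-02"),
    sulfate = c(id, NA),
    nitrate = c(NA, id / 2),
    ID = id
  )
  write.csv(monitor, file.path(demo_dir, paste0(id, ".csv")), row.names = FALSE)
}

# Read every file, then bind all rows in a single step
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv)   # a list with one data frame per file
dat <- do.call(rbind, tables)       # one rbind() call over the whole list

str(dat)
```

Calling rbind() once on the full list avoids copying the accumulated data frame on every iteration, which tends to matter as the number of files grows.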


How to handle missing values in R?


contd.

• In R, NA is used to represent any value that is 'not available' or 'missing' (in the statistical sense).
• Missing values play an important role in statistics and data analysis. Often, missing values must not be ignored, but rather they should be carefully studied to see if there's an underlying pattern or cause for their missingness.
• For example:

> X <- c(1, 2, NA, 4)
> Y <- c(NA, 2, 3, 1)
> X + Y
[1] NA  4 NA  5

• Multiple options are available in R to handle NA values, such as:
• is.na()
• Setting na.rm = TRUE as a function argument

> mean(X)
[1] NA
> mean(X, na.rm = TRUE)
[1] 2.333333
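The is.na() option can be sketched as follows, together with complete.cases(), a base R function that is not mentioned in the slides and is included here as a related option. The readings data frame is made up for the illustration.

```r
X <- c(1, 2, NA, 4)

is.na(X)                      # FALSE FALSE  TRUE FALSE
n_missing <- sum(is.na(X))    # count of missing values
observed <- X[!is.na(X)]      # keep only the observed values

# For a data frame, complete.cases() flags rows with no NA in any column
readings <- data.frame(sulfate = c(2.1, NA, 3.4), nitrate = c(NA, 0.6, 1.1))
complete <- readings[complete.cases(readings), ]
```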


Apply what we learned to our dataset

Function definition


Function call

> pollutantmean('specdata', 'nitrate', 1:10)
[1] 0.7976266
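The function definition itself is not preserved in this transcript. The following is one way such a function might be written, combining the file merging and NA handling covered earlier. It is a sketch under assumptions, not the author's code: the argument names (directory, pollutant, id) are inferred from the call above, the monitor column is assumed to be named ID, and the demo folder with its values is made up so the example runs without the real data.

```r
# A possible pollutantmean(); argument names and the ID column are assumptions
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  dat <- do.call(rbind, lapply(files, read.csv))   # merge all monitor files
  selected <- dat[dat$ID %in% id, pollutant]       # keep the requested monitors
  mean(selected, na.rm = TRUE)                     # ignore missing values
}

# Made-up stand-in data so the sketch is runnable
demo_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (monitor_id in 1:3) {
  monitor <- data.frame(
    Date = c("2003-01-01", "2003-01-02", "2003-01-03"),
    sulfate = c(NA, monitor_id, NA),
    nitrate = c(monitor_id, NA, monitor_id + 1),
    ID = monitor_id
  )
  write.csv(monitor, file.path(demo_dir, paste0(monitor_id, ".csv")), row.names = FALSE)
}

demo_mean <- pollutantmean(demo_dir, "nitrate", 1:2)   # mean of 1, 2, 2, 3
```

Reading every file before filtering keeps the sketch short; reading only the files for the requested monitors would do less work on the full 332-file dataset.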


Thank You!!