UNIT 3 Data Wrangling
Introduction to Data Wrangling
Data pre-processing (aka "wrangling") can be defined as the preparation of data for analysis
with data mining and visualization tools. There are many problems which can interfere with a
successful analysis; some of them can be readily addressed with simple pre-processing
techniques, which we will explore in this unit.
A range of data issues can be avoided by early planning of data collection. If data scientists can anticipate that a study of customer satisfaction will need customer income levels, for example, then in organizing a survey they can arrange to ask about income; without that foresight, the question is never asked and the resulting data is poor. In practice, however, data scientists generally have no say in the original collection of data and are simply handed a data set. In that case there are only two options: (1) wrangle the data to reduce or eliminate the problems; (2) report on the problems and how to avoid them in future data collection.
It is often said that data scientists spend about 70 per cent of their time in data wrangling.
Only after that does it really make sense to do any analysis. If data scientists do not take the
time to ensure that data is in good shape before doing any analysis, they often run a big risk
of wasting a lot of time later on, or worse, losing the faith of their project stakeholders.
The most important thing to keep in mind about data cleaning is that it is an iterative process: first detect bad records, then correct them, and repeat. For example, one might find text where numeric data is expected, such as the word ‘two’ instead of the number 2. Some data items might not conform to a pre-defined specification: they might be missing entire fields, or they might have extra fields.
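As an illustration of the detect-then-correct cycle, the following sketch flags entries in a supposedly numeric column that cannot be parsed as numbers; the toy vector and the hand correction are purely illustrative.
x <- c("23", "two", "41", "17")             # a column that should hold numbers
parsed <- suppressWarnings(as.numeric(x))   # entries that cannot be parsed become NA
x[is.na(parsed)]                            # detect the bad records: "two"
parsed[is.na(parsed)] <- 2                  # correct them once the intended value is known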
In measuring the quality of data, data scientists measure validity, the degree to which entries in a data set conform to a defined schema or to other constraints. They also look at accuracy, the degree to which entries conform to gold-standard data. Completeness of data is straightforward: do we have all the records we should have? Data consistency is another important aspect of data quality; data scientists need to ensure that fields representing the same data are consistent across systems. Finally, data uniformity asks whether values, for example distances, use the same units throughout: are they miles, or are they kilometres?
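A minimal sketch of what some of these checks might look like in R, using a toy data frame; the column name distance_km is hypothetical.
df <- data.frame(distance_km = c(12, 5, NA, 300))   # toy data for illustration
colSums(is.na(df))                        # completeness: missing values per column
all(df$distance_km >= 0, na.rm = TRUE)    # validity: values satisfy the constraint distance >= 0
range(df$distance_km, na.rm = TRUE)       # uniformity: an extreme value may signal mixed units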
Introduction to R
Installing R and Packages
R is a programming environment that uses a simple programming language and allows for rapid development of new tools according to user demand. These tools are distributed as packages, which any user can download to customize the R environment. Base R and most R packages are available for download from the Comprehensive R Archive Network (CRAN) at the following web address:
cran.r-project.org
R packages are the fuel that drives the growth and popularity of R. R packages are bundles of
code, data, documentation, and tests that are easy to share with others.
Before one can use a package, one first has to install it. Some packages, like the base package, are installed automatically. Other packages, such as the ‘ggplot2’ package, do not come with the base R installation and need to be installed separately.
Many (but not all) R packages are organized and available from CRAN, a network of servers
around the world that store identical, up-to-date, versions of code and documentation for R.
Using the ‘install.packages’ function data scientists can easily install these packages from
inside R. CRAN also maintains a set of Task Views that identify all the packages associated
with a particular task.
In addition to CRAN, data scientists also have Bioconductor, which provides packages for the analysis of high-throughput genomic data, as well as the GitHub and Bitbucket repositories of R package developers. Packages can easily be installed from these repositories using the ‘devtools’ package, as shown below.
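For example, a package hosted on GitHub or Bitbucket can be installed as follows; the repository names are placeholders.
install.packages("devtools")                          # devtools itself is on CRAN
devtools::install_github("username/packagename")     # placeholder GitHub repository
devtools::install_bitbucket("username/packagename")  # placeholder Bitbucket repository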
R comes with several basic data management, analysis, and graphical tools. R's power and flexibility lie in its array of packages (currently more than 6,000).
Data scientists can work directly in R, but most prefer a graphical interface. For starters:
• RStudio, an Integrated Development Environment (IDE)
• Deducer, a Graphical User Interface (GUI)
RStudio
R is the name of the programming language itself, and RStudio is a convenient interface to it. There are several fundamental building blocks of R and RStudio: the interface, running code, and basic commands. When you first launch RStudio, you will be greeted by an interface that looks like this:
Insert diagram here.
The panel in the upper right contains the workspace as well as a history of the commands that
are entered. Any plots that you generate will show up in the panel in the lower right corner.
The panel on the left is the console. Each time RStudio is launched, it will have the same text
at the top of the console telling you the version of R. Below that information is the prompt
where R commands are entered. Interacting with R is all about typing commands and
interpreting the output. These commands and their syntax are the window to access data,
organize, describe, and perform statistical computations.
For the purposes of this lesson, we will be using the following packages frequently:
• ‘foreign’ package to read data files from other stats packages
• ‘readxl’ package for reading Excel files
• ‘dplyr’ package for various data management tasks
• ‘reshape2’ package to easily melt data to long form
• ‘ggplot2’ package for elegant data visualization using the Grammar of Graphics
• ‘GGally’ package for scatter plot matrices
• ‘vcd’ package for visualizing and analyzing categorical data
• ‘lattice’ package, a powerful and elegant high-level data visualization system
Installing R Packages
To use packages in R, let’s install them using the ‘install.packages’ function, which typically
downloads the package from CRAN.
#install.packages("foreign")
#install.packages("readxl")
#install.packages("dplyr")
#install.packages("reshape2")
#install.packages("ggplot2")
#install.packages("GGally")
#install.packages("vcd")
Loading R Packages
When data scientists need an R package for R sessions, the specific packages must be loaded
into the R environment using the ‘library’ or ‘require’ functions.
library(foreign)
library(readxl)
library(dplyr)
library(reshape2)
require(ggplot2)
require(GGally)
require(vcd)
To get a description of the version of R and the packages attached in the current session, one can use the ‘sessionInfo’ function:
sessionInfo()
Essential features of R programming
• R code can be entered into the command line directly or saved to a script, which can be
run inside a session using the source function,
• Commands are separated either by a semicolon (;) or by a newline,
• R is case sensitive,
• The # character at the beginning of a line signifies a comment, which is not executed,
• Help files for R functions are accessed by preceding the name of the function with ?
(e.g. ?require),
• R stores both data and output from data analysis in objects,
• Things are assigned to and stored in objects using the <- assignment operator, as illustrated below.
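A few lines illustrating these features; the object names are arbitrary.
x <- c(2, 4, 6)       # assign a vector to the object x using <-
y <- 10; z <- x * y   # two commands on one line, separated by a semicolon
# this line is a comment and is not executed
?mean                 # open the help file for the mean function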
To read data files from other statistics packages, the ‘foreign’ package can be used. For example, an SPSS file can be read into a data frame as follows; the file path is a placeholder.
require(foreign)
# SPSS files
dat.spss <- read.spss("path/to/datafile.sav", to.data.frame = TRUE)
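The indexing examples that follow use a data frame named dat.csv, which was presumably read from a comma-separated file in a similar way; again the path is a placeholder.
# CSV files
dat.csv <- read.csv("path/to/studentdata.csv")   # placeholder path to the student data set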
The most commonly used concept in R is the bracket notation object[row, column]. Let's review a few examples of this notation.
dat.csv[2, 3] produces a single cell value: ‘[1] 4’.
The command dat.csv[, 3] omits the row index, which implies all rows of the data frame (here, all rows in column 3), as shown below:
[1] 4 4 4 4 4 4 3 1 4 3 4 4 4 4 3 4 4 4 4 4 4 4 3 1 1 3 4 4 4 2 4 4 4 4 4
[36] 4 4 4 1 4 4 4 4 3 4 4 3 4 4 1 2 4 1 4 4 1 4 1 4 1 4 4 4 4 4 4 4 4 4 1
[71] 4 4 4 4 4 1 4 4 4 1 4 4 4 1 4 4 4 4 4 4 2 4 4 1 4 4 4 4 1 4 4 4 3 4 4
[106] 4 4 4 3 4 4 1 4 4 1 4 4 4 4 3 1 4 4 4 3 4 4 2 4 3 4 2 4 4 4 4 4 3 1 3
[141] 1 4 4 1 4 4 4 4 1 3 3 4 4 1 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 1 3 2 3
[176] 4 4 4 4 4 4 4 4 4 2 2 4 2 4 3 4 4 4 2 4 2 4 4 4 4
Similarly, omitting the column index implies all columns. Ranges can also be used: the following syntax displays the values of the student data frame for rows 2 and 3 and columns 2 and 3:
dat.csv[2:3, 2:3]
Here is the result of the syntax above;
## female race
## 2 1 4
## 3 0 4
Variables in R can be accessed directly by using their names, with either the object["variable"] notation or the object$variable notation, for example:
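dat.csv["female"]   # returns a one-column data frame
dat.csv$female      # returns the values of female as a vector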
Activity 1
R vectors
R’s ‘c’ function combines values of a common type into a vector. It can be used to access non-sequential rows and columns from a data frame. For instance, to get column 1 for rows 1, 3 and 5, run the following syntax:
dat.csv[c(1,3,5), 1]
## [1] 70 86 172
Additionally, to get the row 1 values for the variables female, prog and socst, data scientists can use syntax such as:
dat.csv[1,c("female", "prog", "socst")]
## female prog socst
## 1 0 1 57
Modifying Variable Names in R
The ‘colnames’ function enables data scientists to manipulate R variable names. The approach described below first stores the data frame's variable names in an R vector using the colnames function, and then uses indexing to change the variable name ‘ID’ to ‘ID2’.
colnames(dat.csv)
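A minimal sketch of that renaming, assuming the ID variable is the first column of dat.csv:
cnames <- colnames(dat.csv)    # store the current variable names in a vector
cnames[1] <- "ID2"             # assuming the ID variable is the first column
colnames(dat.csv) <- cnames    # write the modified names back to the data frame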
Exploring Data with R
Let's read some data into R and store it in an object, ‘d’. Then let's explore and get to know these data, which contain a number of school, test, and demographic variables for 200 students.
d <- read.csv("path/to/studentdata.csv")   # placeholder path to the 200-student data set
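A few commonly used commands for getting to know a new data frame:
dim(d)       # number of rows and columns
str(d)       # structure: variable names, types, and example values
head(d)      # first six rows
summary(d)   # basic summary statistics for every variable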
Data Wrangling with R
The R package ‘dplyr’ is a widely used package for modifying data. The package has five main functions, each of which we will use in detail later in the unit.
Let's begin by reading the dataset and storing it in the object ‘d’. Then, sort the data using the ‘arrange’ function from the ‘dplyr’ package.
d <- read.csv("path/to/studentdata.csv")   # placeholder path
d <- arrange(d, id)                        # assuming the rows are sorted by the id variable
Below, a total score is created, and the ‘cut’ function is used to recode a continuous range into categories.
library(dplyr)
d <- mutate(d, total = read + write + math + science)      # assuming four test-score variables
d$grade <- cut(d$total, breaks = c(0, 150, 200, 250, 400), # illustrative break points
               labels = c("low", "medium", "high", "very high"))
dboth
Introduction to Data Wrangling with R
Data scientists often must deal with untidy or incomplete data: the raw data obtained from different data sources is usually unusable at the start of a data science project. The work data scientists perform on raw data to make it usable as input to statistical modelling and machine learning algorithms is called data wrangling, or data munging.
Similarly, to create an efficient ETL (extract, transform and load) pipeline or create data
visualizations, data scientists should be prepared to do a lot of data wrangling.
Data wrangling is a process of data manipulation and transformation that enables analysis. In
other words, it is the process of manually converting or mapping data from one raw form into
another format that allows for more convenient consumption of the data with the help of semi-
automated tools.
Data wrangling is an important part of any data science project. By dropping null values, filtering and selecting the right data, and working with time series, you can ensure that any machine learning model or statistical treatment you apply to the cleaned-up data is fully effective.
It is important to remember the three steps of data wrangling when working with data. These are:
1. Figure out what you need to do,
2. Describe those tasks in the form of a computer program (e.g. in R),
3. Execute the program.
Data Wrangling with R: the ‘dplyr’ package
Remember that the ‘dplyr’ package makes the steps involved in data wrangling effective: it helps data scientists think about data manipulation challenges, it provides simple functions that correspond to the most common data manipulation tasks, and it uses efficient backends, so data scientists spend less time waiting for the computer. This section introduces dplyr's basic set of tools and shows how to apply them to data frames.
In data wrangling, there are a few tasks that any data science project needs to deal with. Some of these tasks are:
• Filtering rows in data,
• Selecting columns of data,
• Adding new variables to data,
• Sorting data, and
• Aggregating data.
In the introduction to R above, we explored some of these tasks with base R. In this section, we will explore these and other tasks with a powerful R package, the ‘dplyr’ package. The package gives data scientists tools to do these tasks in a way that streamlines the analytics workflow; it may be said that ‘dplyr’ is almost perfectly suited to data science work as it is performed in practice.
As mentioned earlier in the unit, the package ‘dplyr’ has five main commands: filter, select, mutate, arrange, and summarize. Below, we will explore each of these in greater detail.
1. Filter ()
The function ‘filter’ subsets data by keeping rows that meet specified conditions. An example
of the function is provided below.
library(dplyr)
library(ggplot2)
head(diamonds)
df.diamonds_ideal <- filter(diamonds, cut == "Ideal")
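2. Select ()
The function ‘select’ keeps only the named columns of a data frame. The five-column output shown below suggests that a step along the following lines was applied to ‘df.diamonds_ideal’ before printing; this reconstruction is an assumption.
df.diamonds_ideal <- select(df.diamonds_ideal, carat, cut, color, price, clarity)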
head(df.diamonds_ideal)
carat cut color price clarity
0.23 Ideal E 326 SI2
0.23 Ideal J 340 VS1
0.31 Ideal J 344 SI2
0.30 Ideal I 348 SI2
0.33 Ideal I 403 SI2
0.33 Ideal I 403 SI2
3. Mutate ()
The ‘mutate’ function enables users to add variables to a dataset. For example, let's add a new variable, ‘price_per_carat’, to the data frame ‘df.diamonds_ideal’.
df.diamonds_ideal <- mutate(df.diamonds_ideal, price_per_carat = price/carat)
4. Arrange ()
The function ‘arrange’ sorts the rows of a data frame. The syntax arrange(df.disordered_data, num_var) orders the rows of the data frame by the values of num_var, whilst arrange(df.disordered_data, desc(num_var)) sorts the data in descending order, as sketched below.
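A minimal sketch of these two calls, using a small hypothetical data frame; the names ‘df.disordered_data’ and ‘num_var’ are taken from the paragraph above.
df.disordered_data <- data.frame(num_var = c(3, 1, 2))   # toy data for illustration
arrange(df.disordered_data, num_var)                     # ascending order: 1, 2, 3
arrange(df.disordered_data, desc(num_var))               # descending order: 3, 2, 1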
5. Summarize ()
The function ‘summarize’ enables data scientists to compute summary statistics of the data. Looking at summary statistics allows data scientists to understand the distributional features of the data, which we will explore further in the unit on data exploration and visualisation.
summarize(df.diamonds_ideal, avg_price = mean(price, na.rm = TRUE) )
avg_price
3457.542
As you may have noticed, the syntax of all these verbs is very similar:
• The first argument is always a data frame,
• The subsequent arguments describe what to do with the data frame. You can refer to
columns in the data frame directly without using $, and
• The result is always a new data frame.
Together these properties make it easy to chain multiple simple steps together to achieve a complex result. These five functions provide the basis of a language of data manipulation with ‘dplyr’. In summary, data scientists alter an untidy or incomplete data frame in five useful ways:
1. Reorder the rows (arrange()),
2. Pick observations of interest (filter()),
3. Pick variables of interest (select()),
4. Add new variables that are functions of existing variables (mutate()), and finally,
5. Collapse many values down to a summary (summarise()).
Activity 3.
Chaining in ‘dplyr’
Moving beyond the examples above, the real power of the ‘dplyr’ package comes when data scientists chain different commands together (or chain ‘dplyr’ commands together with commands and functions from other packages).
In the ‘dplyr’ syntax, data scientists use the ‘%>%’ operator to connect one command to
another. The output of one command becomes the input for the next command. For example;
df.diamonds_ideal_chained <- diamonds %>%
filter(cut=="Ideal") %>% select(carat, cut, color, price, clarity) %>%
mutate (price_per_carat = price/carat)
head(df.diamonds_ideal_chained)
carat cut color price clarity price_per_carat
0.23 Ideal E 326 SI2 1417.391
0.23 Ideal J 340 VS1 1478.261
0.31 Ideal J 344 SI2 1109.677
0.30 Ideal I 348 SI2 1160.000
0.33 Ideal I 403 SI2 1221.212
0.33 Ideal I 403 SI2 1221.212
The code above created a new, reshaped dataset. More specifically, we chained together multiple R commands and directed the output of that set of commands into a new data frame called ‘df.diamonds_ideal_chained’. In other words, what we did was:
1. take the diamonds dataset,
2. then filter it, keeping only the rows where ‘cut’ equals ‘Ideal’,
3. then select specific variables: ‘carat’, ‘cut’, ‘color’, ‘price’, ‘clarity’, and
4. then create a new variable, ‘price_per_carat’, using ‘mutate’.
Let's explore a slightly more complex example of chaining by combining the ‘dplyr’ and ‘ggplot2’ packages together.
diamonds %>%
filter(cut == "Ideal") %>%
ggplot(aes(x=color,y=price)) + geom_boxplot()
Insert chart here.
In this data exploration example, we took the ‘diamonds’ data frame, filtered down to the rows where cut == "Ideal", and then plotted the data with ggplot to create the boxplot shown above. Lastly, let's create a histogram of ‘ideal cut’ diamonds in a small-multiple layout (small multiples are called ‘facets’ in ‘ggplot2’ terminology).
diamonds %>%
filter(cut == "Ideal") %>%
ggplot(aes(price)) + geom_histogram() + facet_wrap(~ color)
Insert chart here.
Four crucial data wrangling tasks with R
There is an array of data wrangling tasks in any dataset in practice. Each dataset is unique and requires customised data wrangling tasks to be thought through and implemented; this is because each data generation process collects and produces data that, given the business goals, needs to be dealt with at the data wrangling stage. Having said that, there are four widely used data wrangling tasks that belong in the toolset of every data scientist. These are:
1. Adding a column to an existing dataframe,
2. Getting data summaries by subgrouping,
3. Sorting the results, and
4. Reshaping the dataframe.
Before going into detail on each of these main data wrangling tasks in R, let's create a hypothetical data frame by entering the vectors fy, company, revenue, and profit into R, and store this dataset as ‘CompanyData’. A sketch of this step follows.
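A sketch of entering such a data frame, with three companies across three fiscal years; only the first few revenue and profit figures are echoed in the str() output below, so the later values here are illustrative fill-ins.
fy <- c(2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012)
company <- c(rep("Apple", 3), rep("Google", 3), rep("Microsoft", 3))
revenue <- c(65225, 108249, 156508, 29321, 37905, 50175, 62484, 69943, 73723)
profit  <- c(14013, 25922, 41733, 8505, 9737, 10737, 18760, 23150, 16978)
CompanyData <- data.frame(fy, company, revenue, profit,
                          stringsAsFactors = TRUE)   # keep company as a factor, as described below
str(CompanyData)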
$ revenue: num 65225 108249 156508 29321 37905 ...
$ profit : num 14013 25922 41733 8505 9737 ...
As can be seen, the data frame has 9 observations and 4 variables. The variable ‘fy’ holds numeric data that in fact represents the year (essentially a date field); for the group-by analyses in this unit it is more convenient to represent it as a factor (a categorical variable). The variable ‘company’ is a factor with three levels (Apple, Google, and Microsoft), whereas ‘revenue’ and ‘profit’ are numeric variables. To perform group-by analyses by year, one can change the ‘fy’ column of numbers into a column that contains R categories (i.e. factors) with the following command:
CompanyData$fy <- factor(CompanyData$fy)
CompanyData
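The next paragraph refers to the ‘ddply’ function; a minimal sketch of the first two tasks in the list above (adding a column, then summarising by subgroup), assuming the ‘plyr’ package is used, might look like this:
library(plyr)
# Task 1: add a profit-margin column to the existing data frame
CompanyData$margin <- CompanyData$profit / CompanyData$revenue
# Task 2: summarise revenue and profit by company
ddply(CompanyData, "company", summarize,
      totalRevenue = sum(revenue), totalProfit = sum(profit))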
Remember, ‘ddply’ can apply more than one function at a time, for example (the summary columns below are illustrative):
myResults <- ddply(CompanyData, "company", summarize,
                   avgRevenue = mean(revenue), maxProfit = max(profit))
Wide format on the other hand means that the dataframe has multiple measurement columns
across each row, such as;
Insert table here.
Let's use R's ‘reshape2’ package and its ‘melt’ function to reshape data frames, in particular to reshape data frames from wide format into long format. The ‘melt’ call below assigns its result to a variable named ‘longData’; the data frame and id variables shown follow the CompanyData example and are an assumption.
longData <- melt(CompanyData, id.vars = c("fy", "company"))
Dealing with Missing Data
When a data set contains incomplete records, data scientists face the dilemma of whether to use only those rows with complete information or to impute a plausible value for the missing observations.
For instance, the values of gender in rows 13 to 18 of the following dataset are missing. There are also other missing values in this dataset: some rows of the AnnouncementsView and ParentAnsweringSurvey variables also have missing values. These missing values are a function of the data generation process, for example data entry errors or incomplete answers to the questionnaire.
Insert diagram here.
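A quick way to quantify this kind of missingness in R is to count the NA values per column; the data frame and variable values below are illustrative.
survey_data <- data.frame(gender = c("F", NA, "M"),
                          ParentAnsweringSurvey = c(1, 0, NA))  # toy data for illustration
colSums(is.na(survey_data))         # number of missing values in each column
sum(!complete.cases(survey_data))   # number of rows with at least one missing value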
Missing completely at random (MCAR) describes data where the complete cases are a
random sample of the originally identified set of cases. Since the complete cases are
representative of the originally identified sample, inferences based on only the complete
cases are applicable to the larger sample and the target population. Missing at random (MAR) describes data that are missing for reasons related to completely observed variables in the data set (Rubin, 1976).
In this lesson, we will learn how to apply multiple imputation models (a commonly applied imputation technique among data scientists in industry) using R. There are several imputation packages in R; however, we will be using the package ‘Amelia’ to master the imputation of datasets with incomplete data.
There are two versions of Amelia in R. First, Amelia II exists as a package for the R statistical software: data scientists can use their knowledge of the R language to run Amelia II at the command line, or to create scripts that will run Amelia II and preserve the commands for future use. Alternatively, data scientists can use AmeliaView, an interactive Graphical User Interface (GUI) that enables them to set options and run Amelia without any knowledge of the R programming language. We will be practising with AmeliaView.
AmeliaView Menu Guide
Below is a guide to the AmeliaView menus, with references back to the user's guide. The same principles from the user's guide apply to AmeliaView; the only difference is how you interact with the program. Whether you use the GUI or the command-line version, the same underlying code is being called, so you can read the command-line-oriented discussion even if you intend to use the GUI.
Loading AmeliaView
The way to load AmeliaView is to open an R session and type the following two commands:
library(Amelia)
AmeliaView()
This will bring up the AmeliaView window on any platform. On the Windows operating
system, there is an alternative way to start AmeliaView from the Desktop. Once installed, there
should be a desktop icon for AmeliaView. Simply double-click this icon and the AmeliaView
window should appear. If, for some reason, this approach does not work, simply open an R
session, and use the approach above.
Insert screenshot here.
Loading a data set into AmeliaView
AmeliaView loads with a welcome screen that has buttons for loading data in many common formats. Each of these will bring up a window for choosing your dataset. Note that these buttons are only a subset of the possible ways to load data in AmeliaView. Under the File menu, you will find more options, including the datasets included in the package (africa and freetrade). You will also find import commands for Comma-Separated Values (.CSV), Tab-Delimited Text (.TXT), Stata v.5-10 (.DTA), SPSS (.DAT), and SAS Transport (.XPORT). Note that when using a CSV file, AmeliaView assumes that your file has a header.
Insert screenshot here.
Variable dashboard
Once a dataset is loaded, AmeliaView will show the variable dashboard.
In this mode, you will see a table of variables, with the current options for each of them shown,
along with a few summary statistics. You can reorder this table by any of these columns by
clicking on the column headings. This might be helpful to, say, order the variables by mean or
amount of missingness.
Insert screenshot here.
You can set options for individual variables through the right-click context menu or through the Variables menu. For instance, clicking "Set as Time-Series Variable" will set the currently selected variable in the dashboard as the time-series variable. Certain options are disabled until other options are enabled; for instance, you cannot add a lagged variable to the imputation until you have set the time-series variable. Note that any factor in the data is marked as an ID variable by default, since a factor cannot be included in the imputation without being set as an ID variable, a nominal variable, or the cross-section variable. If there is a factor that fails to meet one of these conditions, a red flag will appear next to the variable name. Here are some of the commonly used functionalities in AmeliaView:
• Set as Time-Series Variable - Sets the currently selected variable to the time-series
variable. The time-series variable will have a clock icon next to it.
• Set as Cross-Section Variable - Sets the currently selected variable to the cross-section
variable. The cross-section variable will have a person icon next to it.
• Unset as Time-Series Variable - Removes the time-series status of the variable.
• Unset as Cross-Section Variable - Removes the cross-section status of the variable.
• Add Lag/Lead - Adds versions of the selected variables either lagged back ("lag") or forward ("lead").
• Remove Lag/Lead - Removes any lags or leads on the selected variables.
• Plot Histogram of Selected - Plots a histogram of the selected variables. This command
will attempt to put all of the histograms on one page, but if more than nine histograms
are requested, they will appear on multiple pages.
• Add Transformation - Adds a transformation setting for the selected variables. Note that each variable can only have one transformation, and the time-series and cross-section variables cannot be transformed.
• Remove Transformation - Removes any transformation for the selected variables.
• Add or Edit Bounds - Opens a dialog box to set logical bounds for the selected variable.
The Variables menu and the variable dashboard are the places to set variable-level options, but global options are set in the Options menu. Under the global options menu, data scientists can change advanced settings of AmeliaView. In this lesson, we will not be exploring these advanced options.
Imputing and checking diagnostics
Once data scientists have set all the relevant options in AmeliaView, they then can impute data
by clicking the ‘Impute!’ button in the toolbar. Once the imputations are complete, “Successful
Imputation!" message appears at the bottom bar. By clicking on this message data scientists
can open the folder containing the imputed datasets and explore the results.
If there was an error during the imputation, the output log will pop-up and provides the error
message along with some information about how to fix the problem. Once the problem is fixed
simply click “Impute!" again. to understand how AmeliaView ran, simply click the “Show
Output Log" button. The log also shows the call to the ‘amelia’ function in R.
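For reference, an equivalent command-line call to the ‘amelia’ function, using the bundled freetrade data set, might look like the following; the choices of m, ts and cs are illustrative.
library(Amelia)
data(freetrade)                                                  # example data set shipped with Amelia
a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country")   # produce five imputed data sets
summary(a.out)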
Insert chart here.
Upon the successful completion of an imputation, the diagnostics menu will become available.
• Compare Plots - This will display the relative densities of the observed (red) and
imputed (black) data. The density of the imputed values are the average You will have
to replace the x argument in the amelia call to the name of you dataset in the R session.
imputations across all of the imputed datasets.
Insert screenshot here.
• Overimpute - This will run Amelia on the full data with one cell of the chosen variable artificially set to missing, and then check the result of that imputation against the truth. The resulting plot shows average imputations against true values, along with 90% confidence intervals. These are plotted over a y = x line for visual inspection of the imputation model.
• Number of overdispersions - When running the overdispersion diagnostic, you need to run the imputation algorithm from several overdispersed starting points in order to get a clear idea of how the chains are converging. Enter the number of imputations here.
• Number of dimensions - The overdispersion diagnostic must reduce the dimensionality
of the paths of the imputation algorithm to either one or two dimensions due to graphical
restraints.
Data wrangling with SAS Enterprise Guide
Data Wrangling with Python