1 UNIT 3 Data Wrangling

Author: others

Post on 22-May-2020






  • 1

    UNIT 3 Data Wrangling

  • 2

    Introduction to Data Wrangling

    Data pre-processing (aka "wrangling") can be defined as the preparation of data for analysis

    with data mining and visualization tools. There are many problems which can interfere with a

    successful analysis; some of them can be readily addressed with simple pre-processing

    techniques, which we will explore in this unit.

    A range of data issues can be avoided by early planning of data collection. If data scientists

    can anticipate that a study of customer satisfaction will need customer income levels, for

    example, then in organizing a survey they can arrange to ask about income, whereas without

    anticipating this need, they would never think to ask and would end up with poor data. Often,

    however, data scientists have no say in the original collection of data and are simply handed a

    data set. They then have only two options: (1) wrangling the data to reduce or eliminate

    problems; (2) reporting on the problems and how to avoid them in future data collection.

    It is often said that data scientists spend about 70 per cent of their time in data wrangling.

    Only after that does it really make sense to do any analysis. If data scientists do not take the

    time to ensure that data is in good shape before doing any analysis, they often run a big risk

    of wasting a lot of time later on, or worse, losing the faith of their project stakeholders.

    The most important thing to keep in mind about data cleaning is that it is an iterative process:

    first detect bad records, then correct them, and repeat. For example, one might find text where

    numeric data is expected, such as the word "two" instead of the number 2. Some data items

    might not conform to the pre-defined specification: they might be missing entire fields or

    they might have extra fields.
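
    The detect-then-correct loop described above can be sketched in base R as follows. This is a
    minimal illustration only: the vector `raw_counts` and the lookup table `word_to_number` are
    hypothetical and not part of the unit's data.

```r
# Hypothetical raw field that should be numeric but contains a word.
raw_counts <- c("1", "two", "3", "4")

# Detect: values that fail numeric conversion become NA.
parsed <- suppressWarnings(as.numeric(raw_counts))
bad_rows <- which(is.na(parsed) & !is.na(raw_counts))

# Correct: map known number words to digits (illustrative lookup only).
word_to_number <- c(one = 1, two = 2, three = 3)
parsed[bad_rows] <- word_to_number[raw_counts[bad_rows]]

# Iterate: re-check until no bad records remain.
remaining_bad <- sum(is.na(parsed))
```

    In real data the correction step is rarely this tidy, so the check is usually repeated after
    each fix.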

    In measuring the quality of data, data scientists measure the degree to which entries in a data

    set conform to a defined schema or to other constraints. They also look at accuracy, which is

    the degree to which entries conform to gold standard data. Completeness of data is

    straightforward: do we have all the records we should have? Data consistency is also an

    important aspect of data quality. Data scientists need to ensure that there is consistency

    among the fields that represent the same data across systems. Finally, there is data uniformity, which

    means whether values for distance, for example, use the same units; Is it miles, or is it


  • 3

    Introduction to R

    Installing R and Packages

    R is a programming environment that uses a simple programming language and allows for rapid

    development of new tools according to user demand. These tools are distributed as packages,

    which any user can download to customize the R environment. Base R and most R packages

    are available for download from the Comprehensive R Archive Network (CRAN).


    R packages are the fuel that drives the growth and popularity of R. R packages are bundles of

    code, data, documentation, and tests that are easy to share with others.

    Before one can use a package, one first has to install it. Some packages, like the base

    package, are automatically installed. Other packages, such as the ‘ggplot2’ package,

    do not come bundled with the R installation and need to be installed separately.

    Many (but not all) R packages are organized and available from CRAN, a network of servers

    around the world that store identical, up-to-date, versions of code and documentation for R.

    Using the ‘install.packages’ function data scientists can easily install these packages from

    inside R. CRAN also maintains a set of Task Views that identify all the packages associated

    with a particular task.

    In addition to CRAN, data scientists also have Bioconductor, which has packages for the

    analysis of high-throughput genomic data, as well as, for example, the GitHub and Bitbucket

    repositories of R package developers. You can easily install packages from these repositories

    using the devtools package.

    R comes with several basic data management, analysis, and graphical tools. R's power and

    flexibility lie in its array of packages (around 6,000 at the time of writing).

    Data scientists can work directly in R, but most prefer a graphical interface. For starters:

    • RStudio, an Integrated Development Environment (IDE)

    • Deducer, a Graphical User Interface (GUI)


    R is the name of the programming language itself and RStudio is a convenient interface. There

    are several fundamental building blocks of R and RStudio: the interface, running code, and

    basic commands. When you first launch RStudio, you will be greeted by an

    interface that looks like this:


  • 4

    Insert diagram here.

    The panel in the upper right contains the workspace as well as a history of the commands that

    are entered. Any plots that you generate will show up in the panel in the lower right corner.

    The panel on the left is the console. Each time RStudio is launched, it will have the same text

    at the top of the console telling you the version of R. Below that information is the prompt

    where R commands are entered. Interacting with R is all about typing commands and

    interpreting the output. These commands and their syntax are the means to access, organize,

    and describe data and to perform statistical computations.

    For the purposes of this lesson, we will be using the following packages frequently:

    • ‘foreign’ package to read data files from other stats packages

    • ‘readxl’ package for reading Excel files

    • ‘dplyr’ package for various data management tasks

    • ‘reshape2’ package to easily melt data to long form

    • ‘ggplot2’ package for elegant data visualization using the Grammar of Graphics


    • ‘GGally’ package for scatter plot matrices

    • ‘vcd’ package for visualizing and analyzing categorical data

    • ‘lattice’ is a powerful and elegant high-level data visualization system

    Installing R Packages

    To use packages in R, let’s install them using the ‘install.packages’ function, which typically

    downloads the package from CRAN.
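
    As a hedged sketch of this step, the snippet below installs only the packages that are not
    yet present. The helper name `missing_packages` is hypothetical, and the package list is
    assumed from the packages named in this unit; the install itself is skipped in
    non-interactive runs because it needs an internet connection.

```r
# Packages used in this unit (list assumed from the text above).
pkgs <- c("foreign", "readxl", "dplyr", "reshape2",
          "ggplot2", "GGally", "vcd", "lattice")

# Helper (hypothetical name): which of these are not yet installed?
missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
}

to_install <- missing_packages(pkgs)

# Install only what is missing, and only in an interactive session.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```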








    Loading R Packages

    When data scientists need an R package for R sessions, the specific packages must be loaded

    into the R environment using the ‘library’ or ‘require’ functions.
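
    The difference between the two loading functions can be sketched as below. The example uses
    the built-in ‘stats’ package so that it runs anywhere; the package name
    "notARealPackage123" is deliberately fictitious.

```r
# library() attaches a package and stops with an error if it is missing.
library(stats)

# require() does the same job but returns TRUE/FALSE instead of failing,
# which is handy for a check inside a script.
has_stats <- require(stats)
has_fake  <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE)
)

if (!has_fake) {
  message("Package not available; install it before loading.")
}
```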

  • 5








    To get a description of the version of R and its attached packages used in the current session,

    one can use the ‘sessionInfo’ function;
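
    A short sketch of how the result might be inspected; the object names `info`, `r_version`
    and `attached_base` are chosen here for illustration.

```r
# Capture the session description rather than only printing it.
info <- sessionInfo()

# The R version string for the running session.
r_version <- info$R.version$version.string

# Names of the base packages attached in this session.
attached_base <- info$basePkgs

print(info)
```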


    Essential features of R programming

    • R code can be entered into the command line directly or saved to a script, which can be

    run inside a session using the source function,

    • Commands are separated either by a ; or by a newline,

    • R is case sensitive,

    • The # character at the beginning of a line signifies a comment, which is not executed,

    • Help files for R functions are accessed by preceding the name of the function with ?

    (e.g. ?require),

    • R stores both data and output from data analysis in objects,

    • Things are assigned to and stored in objects using the <- or = operator.
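
    Several of the features listed above can be seen together in a few lines. This is a small
    illustrative sketch; all object names are invented.

```r
# Assignment: store a value in an object with <- (or =).
x <- 5
y = 10

# Two commands on one line, separated by a semicolon.
score_total <- x + y; doubled <- score_total * 2

# R is case sensitive: score_total exists, Score_Total does not.
lower_exists <- exists("score_total")
upper_exists <- exists("Score_Total")

# Help for a function is opened with ? in an interactive session:
# ?mean
```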

  • 6


    # SPSS files


  • 7

    The most commonly used concept in R is the notation object[row,column]. Let’s review a few

    examples of this notation.

    dat.csv[2,3] returns the single cell value in row 2, column 3, printed as ‘[1] 4’.

    dat.csv[,3] omits the row value, which implies all rows of the dataframe (here all rows

    in column 3), as shown below;

    [1] 4 4 4 4 4 4 3 1 4 3 4 4 4 4 3 4 4 4 4 4 4 4 3 1 1 3 4 4 4 2 4 4 4 4 4

    [36] 4 4 4 1 4 4 4 4 3 4 4 3 4 4 1 2 4 1 4 4 1 4 1 4 1 4 4 4 4 4 4 4 4 4 1

    [71] 4 4 4 4 4 1 4 4 4 1 4 4 4 1 4 4 4 4 4 4 2 4 4 1 4 4 4 4 1 4 4 4 3 4 4

    [106] 4 4 4 3 4 4 1 4 4 1 4 4 4 4 3 1 4 4 4 3 4 4 2 4 3 4 2 4 4 4 4 4 3 1 3

    [141] 1 4 4 1 4 4 4 4 1 3 3 4 4 1 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 1 3 2 3

    [176] 4 4 4 4 4 4 4 4 4 2 2 4 2 4 3 4 4 4 2 4 2 4 4 4 4

    Omitting the column value likewise implies all columns. Rows and columns can also be given as

    ranges: for instance, the following syntax displays the values of the student dataframe for

    rows 2 and 3 and columns 2 and 3;

    dat.csv[2:3, 2:3]

    Here is the result of the syntax above;

    ## female race

    ## 2 1 4

    ## 3 0 4

    Variables in R can be accessed directly by using their names, either with object["variable"]

    notation or object$variable notation.
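
    The indexing rules above can be tried on a small stand-in data frame. The data frame
    `students` and its values are invented for illustration and are not the unit's dat.csv.

```r
# Hypothetical stand-in for the unit's student data (values invented).
students <- data.frame(
  id     = c(70, 121, 86),
  female = c(0, 1, 0),
  race   = c(4, 4, 4),
  socst  = c(57, 61, 31)
)

cell      <- students[2, 3]        # row 2, column 3: a single value
third_col <- students[, 3]         # all rows of column 3: a vector
block     <- students[2:3, 2:3]    # rows 2-3, columns 2-3: a data frame
second    <- students[2, ]         # all columns of row 2

# Access a variable by name: $ returns the vector itself,
# while ["name"] keeps a one-column data frame.
by_dollar  <- students$socst
by_bracket <- students["socst"]
```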

    Activity 1

    R vectors

    R’s ‘c’ function is used to combine values of common type together to form a vector. It can be

    used to access non-sequential rows and columns from a data frame. For instance, to get column

    1 for rows 1, 3 and 5, let’s run the following syntax;

    dat.csv[c(1,3,5), 1]

    ## [1] 70 86 172

    Additionally, to get row 1 values for variables female, prog and socst, data scientists can

    employ an R syntax such as;

    dat.csv[1,c("female", "prog", "socst")]

    ## female prog socst

    ## 1 0 1 57

  • 8

    Modifying Variable Names in R

    The function ‘colnames’ enables data scientists to manipulate R variable names. The syntax

    below first retrieves the variable names of the dataframe as an R vector with the colnames

    function and then uses indexing to change the variable name ‘ID’ to ‘ID2’.
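
    A minimal sketch of this renaming, assuming a small hypothetical data frame `scores` with
    an ‘ID’ column (the unit's own data frame is not reproduced here):

```r
# Hypothetical data frame with an 'ID' column (values invented).
scores <- data.frame(ID = 1:3, read = c(57, 68, 44), write = c(52, 59, 33))

# colnames() returns the names as a character vector ...
vars <- colnames(scores)

# ... and indexing into it lets one name be replaced in place.
colnames(scores)[colnames(scores) == "ID"] <- "ID2"
```

    Matching on the name rather than on a fixed position keeps the code correct if the column
    order changes.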


  • 9

    Exploring Data with R

    Let’s read some data into R and store it in our object, ‘d’. Then, let’s explore and get to know

    these data, which contain a number of school, test, and demographic variables for 200 students.
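
    Since the unit's data file is not reproduced here, the sketch below builds a six-row
    hypothetical miniature of `d` and applies a few common base R exploration functions to it.

```r
# Hypothetical miniature of the student data (the unit's file has 200 rows).
d <- data.frame(
  id     = 1:6,
  female = c(0, 1, 0, 1, 1, 0),
  read   = c(57, 68, 44, 63, 47, 44),
  math   = c(41, 53, 54, 47, 57, 51)
)

n_rows_cols <- dim(d)    # number of rows and columns
str(d)                   # type and first values of each variable
head(d, 3)               # first three rows
summary(d$read)          # quartiles, minimum, maximum and mean
read_mean <- mean(d$read)
```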


  • 10

    Data Wrangling with R

    The R package ‘dplyr’ is widely used to modify data. The package has five main

    functions, each of which we will use in detail later in the unit.

    Let's begin by reading the dataset and storing it in object d. Then, sort data using the ‘arrange’

    function from the ‘dplyr’ package.


  • 11

    Below, a total score is created and the ‘cut’ function is used to recode continuous ranges into categories.
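
    A minimal sketch of this recoding, using invented scores and arbitrary break points (both
    are assumptions for illustration, not the unit's values):

```r
# Hypothetical test scores (values invented for illustration).
d <- data.frame(
  read  = c(57, 68, 44),
  write = c(52, 59, 33),
  math  = c(41, 53, 54)
)

# Total score across the three tests.
d$total <- d$read + d$write + d$math

# cut() maps continuous ranges to labelled categories (a factor).
# The break points below are arbitrary assumptions.
d$grade <- cut(
  d$total,
  breaks = c(0, 140, 170, Inf),
  labels = c("low", "medium", "high")
)
```

    By default each interval is open on the left and closed on the right, so a total of exactly
    140 would fall in the "low" group.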




  • 12


  • 13

    Introduction to Data Wrangling with R

    Data scientists often must deal with untidy or incomplete data. The raw data obtained from

    different data sources is often unusable at the beginning of every data science project. The

    activity that data scientists perform on the raw data to make it usable to input to statistical

    modelling and machine learning algorithms is called data wrangling or data munging.

    Similarly, to create an efficient ETL (extract, transform and load) pipeline or create data

    visualizations, data scientists should be prepared to do a lot of data wrangling.

    Data wrangling is a process of data manipulation and transformation that enables analysis. In

    other words, it is the process of manually converting or mapping data from one raw form into

    another format that allows for more convenient consumption of the data with the help of semi-

    automated tools.

    Data wrangling is an important part of any data science project. By dropping null values,

    filtering, and selecting the right data, and working with time series, you can ensure that any

    machine learning or treatment you apply to your cleaned-up data is fully effective.

    It is important to remember the three steps of data wrangling when working with data. These are:


    1. Figure out what you need to do,

    2. Describe those tasks in the form of a computer program (i.e. R),

    3. Execute the program.

    Data Wrangling with R ‘dplyr’ package

    Remember that the ‘dplyr’ package makes the steps involved in data wrangling more effective: it

    helps data scientists think about data manipulation challenges, it provides simple functions that

    correspond to the most common data manipulation tasks, and it uses efficient backends, so data

    scientists spend less time waiting for the computer. This section

    introduces dplyr’s basic set of tools, and shows how to apply them to data frames.

    In data wrangling, there are a few tasks that any data science project needs to deal with. Some of

    these tasks are:

    • Filtering rows in data,

    • Selecting columns of data,

    • Adding new variables in data,

    • Sorting data, and

    • Aggregating data.

  • 14

    In the introduction to R, we explored some of these tasks with R. In this section, we will

    explore these tasks and others with a powerful R package, ‘dplyr’. The package

    gives data scientists tools to do these tasks, and it does so in a way that streamlines the

    analytics workflow. It may be said that ‘dplyr’ is almost perfectly suited to data science work

    as it is actually performed.

    As mentioned earlier in the unit, the package ‘dplyr’ has five main commands: filter, select,

    mutate, arrange, and summarize. Below, we will explore each of

    these commands in greater detail.

    1. Filter ()

    The function ‘filter’ subsets data by keeping rows that meet specified conditions. An example

    of the function is provided below.
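
    As a hedged sketch, the snippet below applies ‘filter’ to a small stand-in for the diamonds
    data and then narrows the columns with ‘select’, the second of the five verbs. The data
    frame `df.diamonds` is built by hand: its Ideal rows echo values shown in this unit, while
    the other rows are invented so that the filter has something to remove.

```r
library(dplyr)

# Small stand-in for the diamonds data (non-Ideal rows are invented).
df.diamonds <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31),
  cut   = c("Ideal", "Premium", "Ideal", "Good", "Ideal"),
  color = c("E", "E", "J", "I", "J"),
  price = c(326, 326, 340, 334, 344)
)

# Keep only the rows whose cut is "Ideal".
df.diamonds_ideal <- filter(df.diamonds, cut == "Ideal")

# Keep only some of the columns.
df.selected <- select(df.diamonds_ideal, carat, cut, price)
```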





  • 15


    carat cut color price clarity

    0.23 Ideal E 326 SI2

    0.23 Ideal J 340 VS1

    0.31 Ideal J 344 SI2

    0.30 Ideal I 348 SI2

    0.33 Ideal I 403 SI2

    0.33 Ideal I 403 SI2

    3. Mutate ()

    The ‘mutate’ function enables users to add variables to a dataset. For example, let's add a new

    variable, ‘price_per_carat’, to the data frame ‘df.diamonds_ideal’.
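
    A minimal sketch of this step, assuming a two-row version of ‘df.diamonds_ideal’ built by
    hand from values shown elsewhere in this unit:

```r
library(dplyr)

# Two Ideal-cut rows with values shown elsewhere in this unit.
df.diamonds_ideal <- data.frame(
  carat = c(0.23, 0.31),
  cut   = c("Ideal", "Ideal"),
  price = c(326, 344)
)

# mutate() returns a new data frame with the extra column appended.
df.diamonds_ideal <- mutate(df.diamonds_ideal, price_per_carat = price / carat)
```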


  • 16

    The syntax arrange(df.disordered_data, num_var) orders the rows of the dataframe by num_var in ascending order,

    whilst the syntax arrange(df.disordered_data, desc(num_var)) sorts the data in descending order.
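
    Both forms can be tried on a small hypothetical data frame; the names mirror the syntax
    above and the values are invented.

```r
library(dplyr)

# Hypothetical unsorted data; the names mirror the syntax above.
df.disordered_data <- data.frame(
  id      = c("a", "b", "c", "d"),
  num_var = c(3, 1, 4, 2)
)

ascending  <- arrange(df.disordered_data, num_var)        # smallest first
descending <- arrange(df.disordered_data, desc(num_var))  # largest first
```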


    5. Summarize ()

    The function ‘summarize’ is a very useful function which enables data scientists to compute

    summary statistics of the data. Having a look at the summary statistics of the data allows data

    scientists to understand the distributional features of the data which we will explore further in

    the unit data exploration and visualisation.

    summarize(df.diamonds_ideal, avg_price = mean(price, na.rm = TRUE) )



    As you may have noticed, the syntax and function of all these verbs are very similar:

    • The first argument is always a data frame,

    • The subsequent arguments describe what to do with the data frame. You can refer to

    columns in the data frame directly without using $, and

    • The result is always a new data frame.

    Together these properties make it easy to chain together multiple simple steps to

    achieve a complex result. These five functions provide the basis of a language of data manipulation

    with ‘dplyr’. In summary, data scientists alter an untidy or incomplete data frame in five useful

    ways:

    1. Reorder the rows with the function arrange(),

    2. Pick observations of interest with the function filter(),

    3. Pick variables of interest with the function select(),

    4. Add new variables that are functions of existing variables with the function mutate(),

    and finally,

    5. Collapse many values to a summary with the function summarise().

    Activity 3.

    Chaining in ‘dplyr’

    Moving beyond the examples above, the real power of the ‘dplyr’ package comes when data

    scientists chain different commands together (or chain ‘dplyr’ commands together

    with commands and functions from other packages).

  • 17

    In the ‘dplyr’ syntax, data scientists use the ‘%>%’ operator to connect one command to

    another. The output of one command becomes the input for the next command. For example;

    df.diamonds_ideal_chained <- diamonds %>%

    filter(cut == "Ideal") %>%

    select(carat, cut, color, price, clarity) %>%

    mutate(price_per_carat = price / carat)


    carat cut color price clarity price_per_carat

    0.23 Ideal E 326 SI2 1417.391

    0.23 Ideal J 340 VS1 1478.261

    0.31 Ideal J 344 SI2 1109.677

    0.30 Ideal I 348 SI2 1160.000

    0.33 Ideal I 403 SI2 1221.212

    0.33 Ideal I 403 SI2 1221.212

    The code above created a new, reshaped dataset. More specifically, we chained together

    multiple R commands and directed the output of that set of commands into a new data frame

    called ‘df.diamonds_ideal_chained’. In other words, what we did was:

    1. take the diamonds dataset,

    2. then filter it, keeping only the rows where ‘cut’ equals ‘Ideal’,

    3. then select specific variables, ‘carat’, ‘cut’, ‘color’, ‘price’, ‘clarity’, and

    4. then create a new variable, ‘price_per_carat’ using ‘mutate’.

    Let’s explore a slightly more complex chain by combining the ‘dplyr’ and ‘ggplot2’

    packages.

    diamonds %>%

    filter(cut == "Ideal") %>%

    ggplot(aes(x=color,y=price)) + geom_boxplot()

    Insert chart here.

    In this data exploration example, we took the 'diamonds' data frame,

    filtered it down to the rows where ‘cut == "Ideal"’, and then plotted the data with ggplot to create a

    boxplot chart as shown above. Lastly, let’s create a histogram of ‘ideal cut’ diamonds in a

    small multiple layout (small multiples are called ‘facets’ in ‘ggplot2’ terminology).


  • 18

    diamonds %>%

    filter(cut == "Ideal") %>%

    ggplot(aes(price)) + geom_histogram() + facet_wrap(~ color)

    Insert table here.

    Four crucial data wrangling tasks with R

    In practice, any dataset can call for an array of data wrangling tasks. Each dataset is unique and

    requires customised data wrangling tasks to be thought through and implemented. This is due

    to the fact that each data generation process collects and generates data that, given the

    business goals, needs to be dealt with at the data wrangling stage. Having said that, there are four

    widely used data wrangling tasks that are in the toolset of data scientists. These are:

    1. Adding a column to an existing dataframe,

    2. Getting data summaries by subgrouping,

    3. Sorting the results, and

    4. Reshaping the dataframe.

    Before going into detail for each of these main data wrangling tasks by using R, let’s create a

    hypothetical data frame by entering values into R. Then, store this dataset as ‘CompanyData’.
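
    One way to enter such a data frame is sketched below. The column names and the first five
    revenue and profit values follow what is shown in this unit; the fiscal years, the
    assignment of rows to companies, and the last four values in each numeric column are
    placeholders assumed purely for illustration.

```r
# fy values and the row-to-company assignment are assumptions.
fy      <- c(2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012)
company <- factor(rep(c("Apple", "Google", "Microsoft"), each = 3))

# First five values as shown in this unit; the last four are placeholders.
revenue <- c(65225, 108249, 156508, 29321, 37905, 50175, 62484, 69943, 73723)
profit  <- c(14013, 25922, 41733, 8505, 9737, 10737, 18760, 23150, 16978)

CompanyData <- data.frame(fy, company, revenue, profit)

# Inspect the structure: 9 observations of 4 variables.
str(CompanyData)
```

    The company column is created with factor() explicitly, because recent versions of R no
    longer convert character columns to factors by default.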


  • 19

    $ revenue: num 65225 108249 156508 29321 37905 ...

    $ profit : num 14013 25922 41733 8505 9737 ...

    As can be seen, the dataframe has 9 observations and 4 variables. The variable ‘fy’ holds numeric

    data, which in fact represents the year (essentially a date field). For grouping purposes, such a

    field is better represented as a factor. The variable ‘company’ is a factor variable with three levels

    (Apple, Google, and Microsoft), whereas the variables ‘revenue’ and ‘profit’ are numeric

    variables. To perform group-by analyses by year, one can change the ‘fy’ column of numbers

    into a column that contains R categories (i.e. factors) with the following command;
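
    A hedged sketch of such a conversion, on a five-row stand-in for the company data. The
    revenue values are those shown in this unit; the fiscal years attached to them are assumed.

```r
# Minimal stand-in for the company data (fiscal years assumed).
CompanyData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011),
  revenue = c(65225, 108249, 156508, 29321, 37905)
)

# Convert the numeric year into an ordered factor so it acts as a category.
CompanyData$fy <- factor(CompanyData$fy, ordered = TRUE)

# A simple group-wise summary by year using base R.
revenue_by_year <- tapply(CompanyData$revenue, CompanyData$fy, sum)
```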


  • 20


  • 21

    Remember, ‘ddply’ can apply more than one function at a time, for example;
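
    As an illustration, the sketch below uses ‘ddply’ from the ‘plyr’ package with ‘summarise’
    to compute three statistics per company in one call. The data frame is a five-row stand-in
    whose company labels are assumed, and the output column names are invented.

```r
library(plyr)

# Stand-in for the company data; the company labels are assumed.
CompanyData <- data.frame(
  company = c("Apple", "Apple", "Apple", "Google", "Google"),
  revenue = c(65225, 108249, 156508, 29321, 37905),
  profit  = c(14013, 25922, 41733, 8505, 9737)
)

# One call, several summary functions per company.
byCompany <- ddply(
  CompanyData, "company", summarise,
  max_revenue = max(revenue),
  min_revenue = min(revenue),
  avg_profit  = mean(profit)
)
```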


  • 22

    Wide format on the other hand means that the dataframe has multiple measurement columns

    across each row, such as;

    Insert table here.

    Let’s use R’s ‘reshape2’ package and its ‘melt’ function to reshape dataframes, in particular

    to reshape dataframes that are in wide format into long format. The function ‘melt’ uses the

    following format to assign results to a variable named ‘longData’; longData
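
    A minimal sketch of a wide-to-long reshape with ‘melt’. The data frame `wideData` is a
    hand-built stand-in, and the names "financialCategory" and "amount" are hypothetical labels
    chosen for the new columns.

```r
library(reshape2)

# Wide stand-in data: one row per company-year, two measurement columns.
wideData <- data.frame(
  fy      = c(2010, 2011, 2010, 2011),
  company = c("Apple", "Apple", "Google", "Google"),
  revenue = c(65225, 108249, 29321, 37905),
  profit  = c(14013, 25922, 8505, 9737)
)

# melt() stacks the measurement columns into variable/value pairs.
longData <- melt(
  wideData,
  id.vars       = c("fy", "company"),
  variable.name = "financialCategory",
  value.name    = "amount"
)
```

    Each of the four wide rows becomes two long rows, one for revenue and one for profit.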

  • 23

    data scientists face the dilemma of whether to use only those rows with complete information

    or to impute a plausible value for the missing observations.

    For instance, the values of gender in rows 13 to 18 are missing in the following

    dataset. There are also other missing values in this dataset: some rows of the

    AnnouncementsView and ParentAnsweringSurvey variables also have missing values.

    These missing values are a function of the data generation process, such as data entry errors or

    incomplete answers to the questionnaire.
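
    Before choosing between the two options, it helps to measure how much is missing. The
    sketch below does this in base R on a hypothetical five-row extract; the values are
    invented and only the column names echo the dataset described above.

```r
# Hypothetical survey extract with gaps (values invented).
survey <- data.frame(
  gender                = c("M", NA, "F", NA, "M"),
  AnnouncementsView     = c(2, 3, NA, 5, 12),
  ParentAnsweringSurvey = c("Yes", "No", "Yes", NA, "Yes")
)

# Count missing values per variable.
missing_per_column <- colSums(is.na(survey))

# Option 1: keep only rows with complete information.
complete_rows <- survey[complete.cases(survey), ]
```

    Here only two of the five rows survive, which shows how quickly complete-case analysis can
    shrink a dataset.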

    Insert diagram here.

    Missing completely at random (MCAR) describes data where the complete cases are a

    random sample of the originally identified set of cases. Since the complete cases are

    representative of the originally identified sample, inferences based on only the complete

    cases are applicable to the larger sample and the target population. Missing at random

    (MAR) describes data that are missing for reasons related to completely observed variables

    in the data set (Rubin, 1976).

    In this lesson, we will learn how to apply multiple imputation models (an imputation technique

    commonly applied by data scientists in industry) by using R. There are several imputation

    packages in R. However, we will be using the package ‘Amelia’ to master the imputation of

    datasets with incomplete data.

    There are two ways to use Amelia in R. First, Amelia II exists as a package for the R statistical

    software. Data scientists can utilize their knowledge of the R language to run Amelia

    II at the command line or to create scripts that will run Amelia II and preserve the commands

    for future use. Alternatively, data scientists can use AmeliaView, where an interactive

    Graphical User Interface (GUI) enables setting options and running the Amelia package without any

    knowledge of the R programming language. We will be practising with AmeliaView.

    AmeliaView Menu Guide

    Below is a guide to the AmeliaView menus with references back to the user's guide. The same

    principles from the user's guide apply to AmeliaView. The only difference is how you interact

    with the program. Whether you use the GUI or the command line versions, the same underlying

    code is being called, and so you can read the command line-oriented discussion above even if

    you intend to use the GUI.

  • 24

    Loading AmeliaView

    The way to load AmeliaView is to open an R session and type the following two commands:
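
    A hedged sketch of those two commands, assuming the ‘Amelia’ package is installed. The
    guard variable `can_launch` is an addition for illustration, so that the snippet does
    nothing in a non-interactive session or where the package is absent.

```r
# Guarded so the snippet is harmless where Amelia or a display is absent.
can_launch <- interactive() && requireNamespace("Amelia", quietly = TRUE)

if (can_launch) {
  library(Amelia)   # load the package
  AmeliaView()      # open the graphical interface
}
```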



    This will bring up the AmeliaView window on any platform. On the Windows operating

    system, there is an alternative way to start AmeliaView from the Desktop. Once installed, there

    should be a desktop icon for AmeliaView. Simply double-click this icon and the AmeliaView

    window should appear. If, for some reason, this approach does not work, simply open an R

    session, and use the approach above.

    Insert screenshot here.

    Loading a data set into AmeliaView

    AmeliaView loads with a welcome screen that has buttons which can load data in many of the

    common formats. Each of these will bring up a window for choosing your dataset. Note that

    these buttons are only a subset of the possible ways to load data in AmeliaView. Under the File

    menu, you will find more options, including the datasets included in the package (africa and

    freetrade). You will also find import commands for Comma-Separated Values (.CSV), Tab-

    Delimited Text (.TXT), Stata v.5-10 (.DTA), SPSS (.DAT), and SAS Transport (.XPORT).

    Note that when using a CSV file, AmeliaView assumes that your file has a header.

    Insert screenshot here.

    Variable dashboard

    Once a dataset is loaded, AmeliaView will show the variable dashboard.

    In this mode, you will see a table of variables, with the current options for each of them shown,

    along with a few summary statistics. You can reorder this table by any of these columns by

    clicking on the column headings. This might be helpful to, say, order the variables by mean or

    amount of missingness.

    Insert screenshot here.

  • 25

    You can set options for individual variables by the right-click context menu or through the

    Variables menu. For instance, clicking “Set as Time-Series Variable” will set the currently

    selected variable in the dashboard as the time-series variable. Certain options are disabled until

    other options are enabled. For instance, you cannot add a lagged variable to the imputation

    until you have set the time-series variable. Note that any factor in the data is marked as an ID

    variable by default, since a factor cannot be included in the imputation without being set as an

    ID variable, a nominal variable, or the cross-section variable. If there is a factor that fails to

    meet one of these conditions, a red flag will appear next to the variable name. Here are some of

    the commonly used functionalities in AmeliaView:

    • Set as Time-Series Variable - Sets the currently selected variable to the time-series

    variable. The time-series variable will have a clock icon next to it.

    • Set as Cross-Section Variable - Sets the currently selected variable to the cross-section

    variable. The cross-section variable will have a person icon next to it.

    • Unset as Time-Series Variable - Removes the time-series status of the variable.

    • Unset as Cross-Section Variable - Removes the cross-section status of the variable.

    • Add Lag/Lead - Adds versions of the selected variables either lagged back (“lag”) or

    forward (“lead”).


    • Remove Lag/Lead - Removes any lags or leads on the selected variables.

    • Plot Histogram of Selected - Plots a histogram of the selected variables. This command

    will attempt to put all of the histograms on one page, but if more than nine histograms

    are requested, they will appear on multiple pages.

    • Add Transformation. - Adds a transformation setting for the selected variables. Note

    that each variable can only have one transformation and the time-series and cross-

    section variables cannot be transformed.

    • Remove Transformation - Removes any transformation for the selected variables.

    • Add or Edit Bounds - Opens a dialog box to set logical bounds for the selected variable.

    The Variables menu and the variable dashboard are the place to set variable-level options, but

    global options are set in the Options menu. Under the global options menu, data scientists can

    change the advanced settings of AmeliaView. In this lesson, we will not be exploring

    these advanced options.

  • 26

    Imputing and checking diagnostics

    Once data scientists have set all the relevant options in AmeliaView, they then can impute data

    by clicking the “Impute!” button in the toolbar. Once the imputations are complete, a “Successful

    Imputation!” message appears in the bottom bar. By clicking on this message, data scientists

    can open the folder containing the imputed datasets and explore the results.
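
    Because the GUI and the command line call the same underlying code, a comparable imputation
    can be sketched directly in R. The example below assumes the ‘Amelia’ package is installed
    and uses its bundled freetrade dataset; the number of imputations and the choice of the
    year and country columns as time-series and cross-section variables are illustrative
    settings rather than recommendations.

```r
library(Amelia)

# freetrade is one of the example datasets shipped with the package.
data(freetrade)

# Five imputed datasets; year and country identify the panel structure.
# p2s = 0 suppresses the progress output.
a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country", p2s = 0)

n_imputed     <- length(a.out$imputations)
still_missing <- sum(is.na(a.out$imputations[[1]]$tariff))
```

    Each element of a.out$imputations is a completed copy of the data, which corresponds to the
    imputed datasets that AmeliaView writes to the output folder.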

    If there was an error during the imputation, the output log will pop up and provide the error

    message along with some information about how to fix the problem. Once the problem is fixed,

    simply click “Impute!” again. To understand how AmeliaView ran, simply click the “Show

    Output Log” button. The log also shows the call to the ‘amelia’ function in R. To rerun that call

    from the R command line, you will have to replace the x argument in the amelia call with the

    name of your dataset in the R session.

    Insert chart here.

    Upon the successful completion of an imputation, the diagnostics menu will become available.

    • Compare Plots - This will display the relative densities of the observed (red) and

    imputed (black) data. The density of the imputed values is the average of the

    imputations across all of the imputed datasets.

    Insert screenshot here.

    • Overimpute - This will run Amelia on the full data with one cell of the chosen variable

    artificially set to missing and then check the result of that imputation against the truth.

    The resulting plot will plot average imputations against true values along with 90%

    confidence intervals. These are plotted over a y = x line for visual inspection of the

    imputation model.

    • Number of overdispersions - When running the overdispersion diagnostic, you need to

    run the imputation algorithm from several overdispersed starting points in order to get

    a clear idea of how the chains are converging. Enter the number of imputations here.

    • Number of dimensions - The overdispersion diagnostic must reduce the dimensionality

    of the paths of the imputation algorithm to either one or two dimensions due to graphical


  • 27

    Data wrangling with SAS Enterprise Guide

  • 28

    Data Wrangling with Python