Foundations of Data Analytics
Tools for Data Analysis
University of Warwick
Objectives
Introduce a broad collection of tools used in data analytics
Outline the capabilities and uses of each tool
– Provide examples of tool usage
Allow you to select the appropriate tools to work with
– Based on your preferences, e.g. GUI or command line
Very quick introduction to each tool
– More information on the web, in the library
CS910 Foundations of Data Analytics
The philosophy of tool choice
A huge array of tools is available, with overlapping functionality
A good data analyst knows a good selection of tools
– Can pick the right one for the job
Many people know only one or two (often unsuitable) tools
– Hence, much of the world’s analytics is performed in spreadsheets
Knowing that a tool exists and what it can do is often enough
– Can decide if learning to use it is time-effective
– Will introduce some options here
– Will not give formal training
No single answer to tool selection
– Often a matter of personal choice
Unix Tools
“Unix tools” covers many simple tools developed as part of the Unix operating system
– They manipulate data files represented as lines of text: flat files, comma separated value (CSV) files
– Allow simple analysis and data preparation
– Widely available in Linux, MacOS, Windows
“I use all these nearly every day. The best part is, once you know they exist, these tools are available on every unix machine you will ever use. Nothing else (except maybe perl) is as universal – you don’t have to worry about versions or anything. Being comfortable with these tools means you can get work done anywhere – any EC2 instance you boot up will have them, as will any unix server you ssh into.”
Unix History in a nutshell
Developed at AT&T Bell Labs from 1969, first on a PDP-7 and soon after on the PDP-11
– Made available in the mid-1970s
– Developed and sold by AT&T in the 1980s
– Commercial variants emerged: Solaris, SCO…
Standardized via POSIX in 1989
– POSIX: Portable Operating System Interface based on Unix
The GNU Project launched free implementations of the Unix tools in the 1980s
– Linux started in 1991 as a free POSIX-compliant OS kernel
– Many Linux distributions available: Ubuntu, Fedora, Debian…
Tool availability
Available on any Unix machine
Available on any Linux machine
– Such as those in DCS, e.g. joshua
Available on any modern Mac
– Kernel derived in part from BSD Unix
– Open the Terminal application and type away
On Windows:
– Various ports of individual tools or collections of tools
– Cygwin, open source port of many Linux tools to Windows: http://cygwin.com/install.html
Command line tools
These are command line tools – no fancy GUI
Each tool performs a single simple function
– Additional functionality has crept in over time
– Now some are more like a swiss army knife
Can be combined via scripts, piping
Information available on each tool:
– Via ‘man’ command: e.g. man cat
– Via program itself: e.g. sort --help
– Via the web: many instructions/examples online
Short course on unix tools from Cambridge:
– http://www.cl.cam.ac.uk/teaching/1213/UnixTools/materials.html
nmap.org/movies/
Example Data Set
Show examples using the “adult census data”
– http://archive.ics.uci.edu/ml/machine-learning-databases/adult/
– File: adult.data
~32K individuals, one per line
– Age, Gender, Employment Type, Years of Education…
Widely studied in Machine Learning community
– Prediction task: is income > 50K?
Standard input, output and Piping
Unix commands can read and write files
– Special case: standard input (stdin) and standard output (stdout)
– By default, a command reads from stdin, writes to stdout
Some commonly used tools are ‘wc’ and ‘cat’
– wc does a simple wordcount
– cat reads a file, writes it to stdout
– Pipe ‘|’ connects the stdout of one command to stdin of next
Examples:
– cat adult.data | wc
◼ Output: 32562 488415 3974305 [lines words characters]
– cat adult.data | wc | wc
◼ Output: 1 3 24
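The second example can be checked by hand. The sketch below is a rough Python imitation of wc (the helper name `wc` is only illustrative, not the real tool). It assumes the first wc pads each count to a seven-character column, as GNU wc does when reading from a pipe; other implementations may format their output differently and so give a different character count.

```python
def wc(text):
    """Return (lines, words, characters) for a string, like a simplified wc."""
    lines = text.count("\n")    # wc counts newline characters
    words = len(text.split())   # words are whitespace-separated tokens
    chars = len(text)           # characters (same as bytes for ASCII text)
    return lines, words, chars

# Assumed output format of the first wc: three right-aligned 7-character columns.
first_output = "%7d %7d %7d\n" % (32562, 488415, 3974305)

print(wc(first_output))  # (1, 3, 24)
```

The first wc emits a single line holding three numbers, so the second wc sees one line, three words and 24 characters including the padding and the newline.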
Redirection
Can use < to redirect a file to stdin, and > to redirect stdout
– >> appends to an existing file
Examples:
– wc < adult.data
◼ 32562 488415 3974305
– wc < adult.data > wordcount
– cat wordcount
◼ 32562 488415 3974305
– cat adult.data | wc >> wordcount
wc options:
– -l / -w / -c : print number of lines / words / characters
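The difference between > and >> corresponds to opening a file for writing versus appending. A minimal Python sketch of that distinction, using a temporary file whose name is chosen only for illustration:

```python
import os
import tempfile

# A scratch file in a fresh temporary directory (the name is illustrative).
path = os.path.join(tempfile.mkdtemp(), "wordcount")

with open(path, "w") as f:   # like "> wordcount": create or overwrite
    f.write("first\n")
with open(path, "w") as f:   # a second ">" discards the earlier contents
    f.write("second\n")
with open(path, "a") as f:   # like ">> wordcount": add to the end
    f.write("third\n")

with open(path) as f:
    contents = f.read()
print(contents)
```

Only "second" and "third" remain in the file, because the second write replaced the file and the third appended to it.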
Basic Commands
ls: list files in a directory
– ls adult
◼ adult.data adult.names adult.test
Options to commands are often single letters preceded by -
– ls -l adult
◼ total 5852
-rwx------ 1 grahamc dcsstaff 3974305 Oct 8 18:03 adult.data
-rwx------ 1 grahamc dcsstaff 5229 Oct 8 18:03 adult.names
-rwx------ 1 grahamc dcsstaff 2003153 Oct 8 18:04 adult.test
– ls -la public_html
◼ total 5860
drwx------ 2 grahamc dcsstaff 4096 Oct 8 18:04 .
drwx------ 39 grahamc dcsstaff 4096 Oct 8 18:04 ..
-rwx------ 1 grahamc dcsstaff 3974305 Oct 8 18:03 adult.data
-rwx------ 1 grahamc dcsstaff 5229 Oct 8 18:03 adult.names
-rwx------ 1 grahamc dcsstaff 2003153 Oct 8 18:04 adult.test
Viewing files: cat, head, tail
cat file shows contents of file
head shows first few lines of a file
– head adult.data
◼ 39, State-gov, 77516, Bachelors, 13, Never-married,
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse,
38, Private, 215646, HS-grad, 9, Divorced,
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners,
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty,
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial,
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service,
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse,
31, Private, 45781, Masters, 14, Never-married, Prof-specialty,
42, Private, 159449, Bachelors, 13, Married-civ-spouse, Exec-managerial,
tail shows last few lines of a file
– tail -n 5 adult.data
◼ 40, Private, 154374, HS-grad, 9, Married-civ-spouse, Machine-op-inspct,
58, Private, 151910, HS-grad, 9, Widowed, Adm-clerical, Unmarried, White,
22, Private, 201490, HS-grad, 9, Never-married, Adm-clerical, Own-child,
52, Self-emp-inc, 287927, HS-grad, 9, Married-civ-spouse, Exec-managerial,
Viewing files: more or less
more lets you page through a file
– Page down/space to advance
less is a more flexible replacement
– Can page up to go back
– Q to quit
The sort command
sort: sorts the input
Default: sort lines by alphabetic order
– sort adult.data
Configurable
– -r: reverse sort
– -n: numeric sort
– -k: column on which to sort (assume space separates fields)
◼ sort adult.data -k5 | less
◼ sort adult.data -n -k5 | less
– -f: ignore (upper/lower) case
– -m: merge multiple sorted files together
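The difference between the default and the -n option is easiest to see on a few values. This Python sketch uses made-up values and plain string comparison as a stand-in for the default ordering (the real sort command also takes the locale into account):

```python
values = ["9", "13", "7", "14", "10"]   # made-up sample values

# Default sort compares strings character by character, so "10" comes before "7".
alphabetic = sorted(values)

# Numeric sort (like sort -n) compares the values as numbers.
numeric = sorted(values, key=int)

# Reverse numeric sort (like sort -rn).
reverse_numeric = sorted(values, key=int, reverse=True)

print(alphabetic)       # ['10', '13', '14', '7', '9']
print(numeric)          # ['7', '9', '10', '13', '14']
print(reverse_numeric)  # ['14', '13', '10', '9', '7']
```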
Cut
cut: select certain columns from the file
– Default: assume tab separates columns
– -f: specify which fields to select
– -c: specify which character positions in each line to select
◼ cut -c1-9 adult.data | head
◼ cut -c1,3,5,7,9 adult.data | head
– -d: specify the field delimiter
◼ cut -f1 adult.data | head
◼ cut -f1,2 -d, adult.data | head
◼ cut -f1,2 -d' ' adult.data | head
◼ cut -f1,3,5,7,9 -d, adult.data | head
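As a rough illustration of what -f, -d and -c select, here is a small Python sketch. The helpers `cut_fields` and `cut_chars` are hypothetical names for this example and cover only the simple cases above. In particular, it mirrors the fact that a line containing no delimiter is printed whole, which is why cut -f1 without -d shows entire lines of this comma-separated file.

```python
def cut_fields(line, fields, delimiter="\t"):
    """Keep the 1-based fields listed in `fields`, like cut -f ... -d ..."""
    parts = line.split(delimiter)
    # If the delimiter does not occur, cut prints the whole line unchanged.
    if len(parts) == 1:
        return line
    chosen = [parts[i - 1] for i in sorted(fields) if i <= len(parts)]
    return delimiter.join(chosen)

def cut_chars(line, positions):
    """Keep the 1-based character positions listed, like cut -c ..."""
    return "".join(line[i - 1] for i in sorted(positions) if i <= len(line))

line = "39, State-gov, 77516, Bachelors, 13, Never-married"
print(cut_fields(line, [1, 2], ","))   # 39, State-gov
print(cut_fields(line, [1]))           # no tab in the line: printed unchanged
print(cut_chars(line, range(1, 10)))   # 39, State
```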
Uniq
uniq: omit (or report) repeated lines
– Only adjacent repeated lines are detected, so sort the input first
◼ cut -f1 -d, adult.data | uniq | head
◼ cut -f1 -d, adult.data | sort -n | uniq | head
– Count the number of occurrences with -c
◼ cut -f1 -d, adult.data | sort -n | uniq -c | head
◼ cut -f1 -d, adult.data | sort -n | uniq -c | sort -rn | head
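The pipelines above can be mimicked in a few lines of Python to show why the sort step matters. This is only a sketch on made-up ages; itertools.groupby, like uniq, merges only runs of adjacent equal items.

```python
from itertools import groupby

ages = ["39", "39", "50", "38", "39", "50"]   # made-up sample column

# uniq alone only merges adjacent duplicates, so later repeats survive.
adjacent_only = [key for key, _ in groupby(ages)]

# sort -n | uniq -c: sort first so equal values are adjacent, then count runs.
counts = [(len(list(group)), key)
          for key, group in groupby(sorted(ages, key=int))]

# ... | sort -rn: most frequent value first.
by_frequency = sorted(counts, key=lambda pair: pair[0], reverse=True)

print(adjacent_only)  # ['39', '50', '38', '39', '50']
print(counts)         # [(1, '38'), (3, '39'), (2, '50')]
print(by_frequency)   # [(3, '39'), (2, '50'), (1, '38')]
```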
grep
grep: search for lines that match some text
◼ grep Masters adult.data | head
◼ grep Masters adult.data | wc -l
– -i: ignore case
– -v: invert behaviour, select non-matching lines
– -An, -Bn: print n lines of context appearing After / Before the match
◼ grep -A1 -B2 Hungary adult.data | less
Can handle regular expressions for flexible matching (quote them so the shell does not expand * itself)
◼ grep 'Married.*England' adult.data | less
◼ grep '^90' adult.data | less
Grep + regular expressions
Grep’s regular expression syntax:
– ^ : start of line
– $ : end of line
– \ : “escape” next character: \$ to match a $ sign
– [abc] : match any character of abc
– [a-z] : match any character in range a to z
– . (dot) : match any character
– * : match 0 or more occurrences of preceding expression
– \{n\} : match n instances of preceding expression
Example: grep "\(21\)\{2\}" adult.data
egrep for “extended” regular expressions:
◼ egrep "England|Mexico" adult.data | head
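The same patterns can be tried out in Python's re module, which uses a syntax close to egrep's: grouping and repetition are written (21){2} rather than \(21\)\{2\}. The sketch below is illustrative only; the four sample lines are shortened, made-up records in the style of adult.data, and `grep` is a hypothetical helper rather than the real command.

```python
import re

# Shortened, made-up lines in the style of adult.data.
lines = [
    "39, Private, 212121, Masters, 14, Married-civ-spouse, England, <=50K",
    "90, Private, 51744, HS-grad, 9, Never-married, United-States, <=50K",
    "28, Private, 338409, Bachelors, 13, Married-civ-spouse, Cuba, <=50K",
    "45, Private, 190088, Masters, 14, Divorced, Mexico, >50K",
]

def grep(pattern, lines):
    """Return the lines containing a match for pattern, like grep."""
    return [line for line in lines if re.search(pattern, line)]

married_england = grep(r"Married.*England", lines)  # .* spans any text between
starts_with_90 = grep(r"^90", lines)                # ^ anchors to start of line
doubled_21 = grep(r"(21){2}", lines)                # "21" twice in a row: 2121
either_country = grep(r"England|Mexico", lines)     # egrep-style alternation

print(len(married_england), len(starts_with_90),
      len(doubled_21), len(either_country))          # 1 1 1 2
```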
sed
sed: stream editor
– Most commonly used to substitute some text for others
– sed 's/expression/replacement/g'
◼ sed 's/Private/Secret/g' adult.data | head
◼ sed 's/, /\t/g' adult.data | head
◼ sed 's/, /\n/g' adult.data | head
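For comparison, a substitution of this kind looks as follows in Python, where re.sub plays the role of sed's s command. This is a sketch on a single shortened sample line; replacing every match corresponds to the trailing g flag, and count=1 corresponds to leaving the flag off.

```python
import re

line = "39, Private, 77516, Bachelors"   # shortened sample line

secret = re.sub(r"Private", "Secret", line)     # like s/Private/Secret/g
tabbed = re.sub(r", ", "\t", line)              # like s/, /\t/g
first_only = re.sub(r", ", "\t", line, count=1) # like s/, /\t/ without g

print(secret)            # 39, Secret, 77516, Bachelors
print(repr(tabbed))      # '39\tPrivate\t77516\tBachelors'
print(repr(first_only))  # '39\tPrivate, 77516, Bachelors'
```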
join
join: do a database-style join on two sorted text files
– -1 n -2 m: try to match n’th field of first file with m’th field of second
– Output all combinations of matches
– e.g. join list of people + postcodes with average income in postcode
Example:
◼ grep United-States -v adult.data | head -n 20 | cut -f 4,14 -d, | sort -k 2 > adult.join1
◼ grep United-States -v adult.data | head -n 20 | cut -f 1,14 -d, | sort -k 2 > adult.join2
◼ join -1 2 -2 2 adult.join1 adult.join2
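To make "output all combinations of matches" concrete, here is a minimal Python sketch of a join on one key field, following the people and postcodes idea above. The function name, the rows and the income figures are all invented for illustration. Unlike the join command, this version looks rows up in a dictionary, so it does not need sorted input.

```python
from collections import defaultdict

def join(left, right, left_key, right_key):
    """Join two lists of rows on the given 0-based key positions.

    Every matching pair of rows is output, as with the join command.
    """
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    result = []
    for row in left:
        for match in index.get(row[left_key], []):
            # Output the key first, then the remaining fields of each row.
            rest_left = [v for i, v in enumerate(row) if i != left_key]
            rest_right = [v for i, v in enumerate(match) if i != right_key]
            result.append([row[left_key]] + rest_left + rest_right)
    return result

people = [["Alice", "CV4"], ["Bob", "CV4"], ["Carol", "B15"]]  # made-up rows
income = [["CV4", "31000"], ["B15", "28000"]]                  # made-up rows

joined = join(people, income, 1, 0)
for row in joined:
    print(row)
```

Both people with the postcode CV4 are paired with the single income row for CV4, so the result has three rows.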
Editors: nano, pico, emacs
Unix editors were once notoriously unfriendly
– vi, vim, and ed all required memorizing complex commands
Modern editors are now much more usable
– pico and nano are easy to pick up and use
– emacs is very powerful and configurable
If working on a GUI based system, many options
– Local text editors in Windows, Macs, Linux
http://xkcd.com/378/
Scripting
Don’t have to write ever longer command lines
Can put sequences of commands into scripts
– With loop controls: automate processing, reduce errors
◼ #!/bin/bash
  for i in 1 2
  do
    wc adult.join$i
    for ((j=1; j<=2; j++))
    do
      echo $((i+j))
    done
  done
  date
Programming
Can write programs in your language of choice
– Java: powerful, general purpose language
– Python: popular general-purpose language, widely used for numerical and data work
– Perl: popular for processing text
Teaching a language is definitely out of scope of this module
– Foundations (CS917) module gives crash course in Java
– You can use any language you know for homeworks, project
◼ Data Analytics is about getting an answer, less about how
Will give a brief introduction to R, a statistical tool/language
Tools for working with statistical data: R
R: flexible language with a lot of support for statistical operations
– Successor to ‘S’ language
– Open-source, available in Windows, Mac, Linux, Cygwin
Inbuilt support for many data manipulation operations
– Read in data from CSV (comma-separated values) format
– Compute sample mean, variance, quantiles
– Find line of best fit (linear regression)
– Flexible plotting tools, output to screen or file
– Lots more statistical tools available as libraries
Steep learning curve, but GUIs and help are available
– Will use the RStudio GUI: https://www.rstudio.com/products/rstudio/download/
Quick example in R
data <- read.csv("adult.test", header=F)   # read in data in comma-separated value format
summary(data)                              # show a summary of all attributes
summary(data[5])                           # show a summary of years of education
d <- table(data[5])                        # tabulate the data
plot(d)                                    # plot the frequency distribution
plot(ecdf(data[5]$V5))                     # plot the (empirical) CDF
data2 <- read.csv("adult.data", header=F)
qqplot(data[5]$V5, data2[5]$V5, type="l")  # make a quantile-quantile plot of two (empirical) distributions
pdf(file="qq.pdf")                         # send output to a PDF file
qqplot(data[5]$V5, data2[5]$V5, type="l")
dev.off()                                  # close the file!
quit()                                     # quit!
Spreadsheets
Many options: Excel, OpenOffice, Google Spreadsheets
Great for quick viewing, exploration and plotting of small data
– Excel 2003: 65536 rows
– Excel 2007, 2010, 2013, 2016: 1M rows
– Google sheets: up to 256 columns, or up to 200,000 cells
Quick plotting tools:
– Select data to plot, hit ‘plot’ button, fiddle with options
– Sometimes takes a long time to make plots how you want
– Tricky to get multiple plots with the same formatting
[Figure: plot of adult.test years of education (y-axis) against adult.data years of education (x-axis), both axes 0 to 20]
Data Processing in Spreadsheets
Decent data manipulation functionality
– Sort, selection, reformatting
– Some tasks more difficult within the spreadsheet metaphor
Limitations of data processing in spreadsheets
– Capacity limits (row limits, cell limits)
– Can’t always keep a record of what was done (repeatability)
◼ Can put sequence of unix tool commands in a script
– Prone to errors: may select wrong range of cells etc.
  theconversation.com/economists-an-excel-error-and-the-misguided-push-for-austerity-13584
◼ An economics paper argued in favour of austerity measures
◼ Missed out Australia, Austria, Belgium, Canada, and Denmark from calculations, skewing the conclusion
Data Processing in Spreadsheets
Sort: select data and click on ‘sort’
Aggregation:
– =sum(range), =count(range), =average(range), =median(range)
=if(test, [value if true], [value if false])
– “Smart filling” lets you drag to extend
=countif(range, condition)
Pivot tables let you explore the data cube
Exercise: compute the number of people from each country in adult.data
– Compare to the effort to do this with unix tools (cut, sort, uniq)
Plotting in Excel
Scatter plot of age vs years of education
– Select columns
– Insert - ‘scatter plot’
Bar chart of gender breakdown
– Derive necessary counts
– Insert - ‘Column’
[Figure: scatter plot of age (x-axis, 0 to 100) against years of education (y-axis, 0 to 18)]
[Figure: bar chart of counts for Male and Female, y-axis 0 to 25000]
Gnuplot
Powerful plotting tool, driven by a script
– Easier to generate multiple, consistent plots
– Write script as a text file
– Call gnuplot scriptname
Pros and cons:
– Flexible output: create PDF, JPG, PNG, EPS, EMF…
– Plot data and functions
– Configure almost every aspect of the output
– Sometimes arcane commands, cryptic abbreviations
Gnuplot function plotting
– set term emf enhanced font "Calibri,18" size 600,400
  set output "pareto.emf"
  set log y
  set log x
  set xrange [1: 1e6]
  set yrange [1e-6: 1]
  set format y "10^{%L}"
  set format x "10^{%L}"
  unset key
  plot x**(-1.0)
– set output "exp.emf"
  plot x**(-1.0)*exp(-0.0001*x)
– cdf_lognormal(x)=0.5+0.5*erf((x)/sqrt(2.0))
  set output "lognorm.emf"
  plot 1.0-cdf_lognormal(0.5*log(0.01*x))
[Figures: three log-log plots (pareto.emf, exp.emf, lognorm.emf); x-axis from 10^0 to 10^6, y-axis from 10^-6 to 10^0]
Gnuplot data plotting
Scatter plot of age versus years of education:
– set term emf enhanced font "Calibri,18"
  set output "ageeducation.emf"
  set title "Age versus Education"
  set xlabel "Age"
  set ylabel "Years of Education"
  set key under
  plot "adult/adult.data" using 1:5 \
       with points title 'Adult data'
Add a line of best fit:
– y(x)=a*x+b
  fit y(x) "adult/adult.data" using 1:5 via a,b
  plot "adult/adult.data" u 1:5 w p t 'Adult', y(x) w l t 'Fit'
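For a straight line y = a*x + b, a least-squares fit has a simple closed form, which the Python sketch below computes directly. It is meant only to show what the fitted a and b represent: gnuplot's fit command uses an iterative least-squares procedure instead, and the data points here are made up rather than taken from adult.data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b

ages = [20, 30, 40, 50]    # made-up values
years = [9, 11, 13, 15]    # lie exactly on y = 0.2*x + 5

a, b = fit_line(ages, years)
print(round(a, 6), round(b, 6))  # 0.2 5.0
```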
[Figure: scatter plot titled "Age versus Education"; x-axis Age (10 to 90), y-axis Years of Education (0 to 16); key: Adult data]
Gnuplot data plotting
Bar chart of gender breakdown:
– Process data to generate sums:
◼ cut -f 10 -d, adult/adult.data | sort | uniq -c > gendercount.txt
– Gnuplot script:
◼ set term emf enhanced font "Calibri,18"
  set output "gender.emf"
  set style data histograms
  set style histogram cluster gap 1
  set style fill solid border -1
  set yrange [0:]
  plot "gendercount.txt" using 1:xticlabel(2) title " "
[Figure: bar chart of counts for Female and Male, y-axis 0 to 25000]
Report writing: Word processors
Many options: MS Word, OpenOffice Writer, Google Docs
Adequate for report writing (e.g. project report)
– Nice GUI, configurable
– Can be difficult if you have many figures
– 3rd party support for bibliographic data (EndNote)
Report writing: LaTeX
LaTeX: a scientific document preparation system
Describe how you want your document to be, and compile it
More of a learning curve, but very powerful
– Stops you getting too involved in fine details
– Support for producing beautiful mathematical formulae
– Produce PDF output easily from LaTeX (text) source file:
◼ pdflatex myfile.tex
– Support automatic bibliography creation via bibtex
– Automatically updated cross-references via \label and \ref
Covered in more detail in CS908 Research Methods
Is this on the test?
From 2014 exam:
Many acceptable answers for each question (and also poor/wrong answers…)
Background reading
Warwick past papers http://www2.warwick.ac.uk/services/exampapers?q=cs910&department=Any&year=Any
http://www.cl.cam.ac.uk/teaching/1213/UnixTools/materials.html
LaTeX example
\documentclass{article}
\usepackage[margin=2cm]{geometry}
\usepackage{graphicx}
\title{This is my report}
\author{Your name}
\begin{document}
\maketitle
\begin{abstract}
This is an abstract for the document
\end{abstract}
\section{Introduction}
This is the introduction to my document
\begin{figure}
\includegraphics{figure.pdf}
\caption{This is a figure}
\label{fig:first}
\end{figure}
Please see figure~\ref{fig:first}.
\end{document}