Introduction to data cleaning with spreadsheets
Post on 21-Aug-2014
An Introduction to data cleaning with spreadsheets
Anders Pedersen, @anpe
School of Data
Spreadsheets: The beginning of each and every data story
• Which were the top growth sectors in this quarter?
• What was the crime in the capital region in 2013 compared to 2012?
• Is there a housing bubble waiting around the corner?
It is time for journalists themselves to tame this beast called spreadsheets!
Spreadsheets: Excel or Google Docs
Some basic terminology
• Data is organized in rows and columns (rows go across the page, columns go top down)
• Each field holding data is called a cell
• Rows are numbered
• Columns are referred to by letters
• Each cell has a column and a row, which together give it a specific code (e.g. A1 is the top left cell)
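The cell codes described above can be taken apart with a few lines of Python. This is only a sketch for illustration: the function names `parse_cell` and `column_index` are made up here and are not part of any spreadsheet program.

```python
import re

def parse_cell(ref):
    """Split a cell code such as 'A1' into (column letters, row number)."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    letters, row = match.groups()
    return letters.upper(), int(row)

def column_index(letters):
    """Turn column letters into a 1-based position: A -> 1, Z -> 26, AA -> 27."""
    index = 0
    for letter in letters.upper():
        index = index * 26 + (ord(letter) - ord("A") + 1)
    return index

column, row = parse_cell("A1")
print(column, row, column_index(column))  # the top left cell: column A, row 1
```

For example, `parse_cell("H2")` gives column H and row 2, and `column_index("H")` says H is the eighth column.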
Some key features to explore today
• Sorting and filtering
• Basic formulas
• Pivot tables
Tricky bits:
- don’t include summaries in the pivot table
- pivot tables cannot remember when you change your data
Data sources for exercise
• Education: Secondary school enrollment for 2012 from Data.gov.ph http://data.gov.ph/catalogue/dataset/sy-2012-enrollment-data-secondary
Sorting - finding the best and the worst
• The 10 best paid sectors
• The 10 oldest cities
• The 10 poorest countries
• …
• If Excel is a toolbox for journalists, sorting is the hammer!
How to sort
• 1) Mark all your data
• 2) In the Data tab go to Sort range
Sorting...
• 3) Check the “Data has header row” check box
• 4) Select the column you want to sort by
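The same sorting idea can be sketched outside the spreadsheet in Python. The table below is invented for illustration (it is not the real enrollment data), and `sort_range` is a hypothetical helper name. The first row is treated as the header row and kept on top, as the header check box does.

```python
# Hypothetical table: the first row is the header, as in a spreadsheet.
table = [
    ["school", "enrollment"],
    ["School A", 420],
    ["School B", 1310],
    ["School C", 875],
]

def sort_range(table, column, descending=False):
    """Sort the data rows by one column and keep the header row first."""
    header, rows = table[0], table[1:]
    position = header.index(column)
    ordered = sorted(rows, key=lambda row: row[position], reverse=descending)
    return [header] + ordered

largest_first = sort_range(table, "enrollment", descending=True)
top_10 = largest_first[1:11]  # up to ten data rows, header excluded

for row in largest_first:
    print(row)
```

Sorting in descending order and reading the first ten rows gives a "top 10" list; sorting in ascending order gives the bottom of the table instead.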
Filtering - getting a better sense of your data
• 1) Turn on filtering via the Data tab (Data → Filter)
Filtering...
• 2) Filter options now appear at the top
Filtering...
• 3) Now click on the blue triangular arrow
Filtering...
• 4) Select the section you wish to filter
Filtering...
• 5) A green arrow will now appear on top of the column
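Filtering keeps only the rows whose value in one column matches your selection. A minimal Python sketch of that idea follows; the rows and the helper name `filter_rows` are assumptions made up for this example.

```python
# Hypothetical table with a header row, invented for illustration.
table = [
    ["region", "school", "enrollment"],
    ["North", "School A", 420],
    ["South", "School B", 1310],
    ["North", "School C", 875],
]

def filter_rows(table, column, keep):
    """Keep the header plus the rows whose value in `column` is in `keep`."""
    header, rows = table[0], table[1:]
    position = header.index(column)
    return [header] + [row for row in rows if row[position] in keep]

north_only = filter_rows(table, "region", {"North"})

for row in north_only:
    print(row)
```

As in the spreadsheet, the original table is left unchanged; the filter only decides which rows are shown.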
Moving forward!
• Sorting and filtering - check!
• Basic formulas
• Pivot tables
Basic formulas
• Let us now try to sum up some of the values in the dataset…
• What it is good for: doing your own analysis, and checking whether calculations by your colleagues are right
Basic formulas
• Go to column H: in the second row (cell H2), type “=sum(f2:g2)”
Basic formulas
• We now have a sum
• Now try calculating the average of the same cells: “=average(f2:g2)”
Basic formulas
• You can also copy your calculations across cells
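To see what the sum and average formulas compute, and what copying them down a column amounts to, here is a small Python sketch. The numbers are invented stand-ins for columns F and G, and placing the average in a column called I is an assumption made only for this example.

```python
# Hypothetical values standing in for columns F and G in rows 2 to 4.
rows = [
    {"F": 120, "G": 80},
    {"F": 300, "G": 150},
    {"F": 45, "G": 55},
]

for row in rows:
    values = [row["F"], row["G"]]
    row["H"] = sum(values)                # total of F and G, as the sum formula in column H
    row["I"] = sum(values) / len(values)  # mean of F and G, as the average formula

# Applying the same calculation to every row is what copying a formula down does.
for number, row in enumerate(rows, start=2):
    print(number, row["H"], row["I"])
```

Recomputing a total like this is also a quick way to check a figure someone else has already calculated.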
Now only pivot tables to go
• Sorting and filtering - check!
• Basic formulas - check!
• Pivot tables
Pivot tables
• Finding stories inside datasets
• Particularly well suited to organised datasets with clear categories and sub-categories
Pivot tables
• Mark the full area of the dataset
• Go to Data → Pivot table report
Pivot tables
• Pivot tables allow you to work on rows, column values and filters
• We start by dropping a column header into Rows
• Then we drop one of our value columns into Values
Pivot tables
• We now have a nice summary of the budget for each department
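Under the hood, a pivot table with one field in Rows and one in Values groups the rows by category and adds up the values in each group. The Python sketch below shows that idea; the budget lines, department names and the helper `pivot_sum` are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical budget lines, invented for illustration.
budget_lines = [
    {"department": "Education", "line": "Salaries", "budget": 500},
    {"department": "Education", "line": "Buildings", "budget": 200},
    {"department": "Health", "line": "Salaries", "budget": 300},
    {"department": "Health", "line": "Medicines", "budget": 150},
]

def pivot_sum(records, row_field, value_field):
    """Add up value_field for each distinct entry in row_field."""
    totals = defaultdict(int)
    for record in records:
        totals[record[row_field]] += record[value_field]
    return dict(totals)

summary = pivot_sum(budget_lines, "department", "budget")

for department, total in summary.items():
    print(department, total)
```

Here "department" plays the part of the column header dropped into Rows, and "budget" the value column dropped into Values.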
Filtering pivot tables
• We can now go ahead and filter the pivot table
• Add the column you wish to filter by
Filtering pivot tables
• Then select one or more categories within the column you wish to keep
Pivot tables
• We can finally add several value columns to the pivot table
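Filtering a pivot table and adding several value columns can be sketched in the same spirit: rows outside the chosen categories are skipped, and each value column gets its own total. The data, the sector and department names, and the helper `pivot` below are assumptions made up for this example.

```python
# Hypothetical budget lines with two value columns, invented for illustration.
budget_lines = [
    {"sector": "Social", "department": "Education", "2013": 450, "2014": 500},
    {"sector": "Social", "department": "Health", "2013": 300, "2014": 330},
    {"sector": "Economic", "department": "Transport", "2013": 250, "2014": 240},
    {"sector": "Social", "department": "Education", "2013": 150, "2014": 200},
]

def pivot(records, row_field, value_fields, filter_field=None, keep=None):
    """Total each value field per row category, skipping filtered-out rows."""
    summary = {}
    for record in records:
        if filter_field is not None and record[filter_field] not in keep:
            continue  # this row is hidden by the filter
        totals = summary.setdefault(record[row_field], dict.fromkeys(value_fields, 0))
        for field in value_fields:
            totals[field] += record[field]
    return summary

social = pivot(budget_lines, "department", ["2013", "2014"],
               filter_field="sector", keep={"Social"})

for department, totals in social.items():
    print(department, totals)
```

Leaving out `filter_field` gives the unfiltered summary with every department included.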
Exercises
• Find the sectors of the national budget that grew the most in percentage terms
• Identify the budget lines that had the biggest absolute increase in the budget
• Generate a pivot table based on the national budget comparing 2014 and 2013 in specific sectors
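The first two exercises rest on two different measures of growth: the absolute increase (new minus old) and the percentage change (the increase divided by the old value, times 100). The Python sketch below shows both on made-up figures; the sector names and amounts are hypothetical and are not taken from the national budget.

```python
# Hypothetical figures as (2013, 2014) pairs, invented for illustration.
budget = {
    "Education": (200, 250),
    "Health": (1000, 1100),
    "Transport": (400, 380),
}

def absolute_change(old, new):
    """Increase in the same units as the budget itself."""
    return new - old

def percent_change(old, new):
    """Increase relative to the old value; undefined when the old value is 0."""
    if old == 0:
        return None
    return (new - old) * 100 / old

changes = [
    (sector, absolute_change(old, new), percent_change(old, new))
    for sector, (old, new) in budget.items()
]

by_percent = sorted((c for c in changes if c[2] is not None),
                    key=lambda c: c[2], reverse=True)
by_absolute = sorted(changes, key=lambda c: c[1], reverse=True)

print("Biggest percentage growth:", by_percent[0])
print("Biggest absolute increase:", by_absolute[0])
```

In this invented example the two rankings disagree: the smaller sector grows fastest in percentage terms, while the larger one gains the most in absolute terms, which is why the exercises ask for both.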