Getting it the rightest you can
Thomas Hargrove, Scripps News
John Perry, Atlanta Journal-Constitution
Janet Roberts, Reuters
Jennifer LaFleur, Reveal | The Center for Investigative Reporting
IRE 2015 CAR Conference, Atlanta
Beware duplicates
Every time Saint Paul, Minn., housing inspectors made follow-up visits to check on violations, all of the data entries from the previous visit were logged again. So every violation was listed in the database multiple times.
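One way to surface this kind of repetition is to count how often each combination of identifying fields occurs. The sketch below uses made-up records, and the field names (address, violation_code, cited_date) are hypothetical; the real Saint Paul file would have its own layout.

```python
from collections import Counter

# Hypothetical inspection records: the follow-up visit re-logged
# the violation that was first entered on 2014-03-01.
rows = [
    {"address": "12 Elm St", "violation_code": "H101", "cited_date": "2014-03-01"},
    {"address": "12 Elm St", "violation_code": "H101", "cited_date": "2014-03-01"},
    {"address": "12 Elm St", "violation_code": "H205", "cited_date": "2014-04-15"},
    {"address": "40 Oak Ave", "violation_code": "H101", "cited_date": "2014-03-09"},
]

def find_duplicates(records, key_fields):
    """Return {key: count} for every key combination seen more than once."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return {key: n for key, n in counts.items() if n > 1}

dupes = find_duplicates(rows, ["address", "violation_code", "cited_date"])
for key, n in dupes.items():
    print(key, "appears", n, "times")
```

Which fields define "the same violation" is a reporting decision, so it is worth trying more than one key combination.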
Do integrity checks from your desk
Beware dates
Did 592,000 people in Ohio really vote before they registered?
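A date-order check like this can be run in a few lines. The sketch below is illustrative only: the records are invented and the field names (voter_id, registered, first_vote) are assumptions, not the layout of the Ohio voter file.

```python
from datetime import date

# Hypothetical voter records; the field names are illustrative only.
voters = [
    {"voter_id": 1, "registered": date(1998, 5, 4), "first_vote": date(2000, 11, 7)},
    {"voter_id": 2, "registered": date(2004, 9, 30), "first_vote": date(1996, 11, 5)},
    {"voter_id": 3, "registered": date(2012, 10, 1), "first_vote": date(2012, 11, 6)},
]

def voted_before_registering(records):
    """Return the records whose first vote predates the registration date."""
    return [r for r in records if r["first_vote"] < r["registered"]]

suspect = voted_before_registering(voters)
suspect_ids = [r["voter_id"] for r in suspect]
share = len(suspect) / len(voters)
print(suspect_ids, f"{share:.0%} of records")
```

A large share of impossible orderings is a cue to ask the data keepers what the date field actually records before drawing any conclusion.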
Does it make sense?
“We select things for publication just to make available a wide scope of data to the public ... There is some burden on the public to decide whether or not to use the material.”
--Kathleen McGuire, Sourcebook of Criminal Justice Statistics
(a/k/a: The case of the disappearing lifers)
Do the data conform to the real world?
Are half of the records male, half female?
In a national data set, are about 13 percent of the records from California?
Are racial minorities adequately represented?
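These questions amount to comparing each category's share of the records with a known benchmark. A minimal sketch, assuming a made-up list of values and a tolerance of 5 percentage points (the tolerance is an arbitrary choice, not a rule):

```python
from collections import Counter

def shares(values):
    """Return each category's share of all records."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

def flag_off_benchmark(observed, expected, tolerance=0.05):
    """Return categories whose share differs from the benchmark by more than tolerance."""
    return {
        k: (observed.get(k, 0.0), exp)
        for k, exp in expected.items()
        if abs(observed.get(k, 0.0) - exp) > tolerance
    }

# Hypothetical data set: 80 male records, 20 female records.
sexes = ["M"] * 80 + ["F"] * 20
observed = shares(sexes)
flags = flag_off_benchmark(observed, {"M": 0.5, "F": 0.5})
for category, (got, expected) in flags.items():
    print(f"{category}: {got:.0%} of records, expected about {expected:.0%}")
```

The same function works for the California question by passing a benchmark such as {"CA": 0.13} against the state field. A flagged category is not necessarily an error; some data sets are legitimately lopsided, and the flag is a prompt to find out why.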
Check for patterns in missing data.
Do patterns render estimates inaccurate?
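One simple way to look for a pattern is to compute the missing rate of a field separately for each group. The sketch below uses invented records and hypothetical field names (county, race), and treats both None and an empty string as missing, which is an assumption about how blanks are stored.

```python
from collections import defaultdict

# Hypothetical records; county "B" leaves the race field blank most of the time.
records = [
    {"county": "A", "race": "White"},
    {"county": "A", "race": "Black"},
    {"county": "B", "race": None},
    {"county": "B", "race": ""},
    {"county": "B", "race": "White"},
]

def missing_rate_by_group(rows, group_field, value_field):
    """Return {group: share of rows in that group where value_field is blank}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for r in rows:
        group = r[group_field]
        totals[group] += 1
        if r[value_field] in (None, ""):
            missing[group] += 1
    return {g: missing[g] / totals[g] for g in totals}

rates = missing_rate_by_group(records, "county", "race")
for county, rate in sorted(rates.items()):
    print(county, f"{rate:.0%} missing")
```

If the blanks cluster in certain places, years or agencies, any estimate built on the remaining records may be skewed toward the groups that report fully.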
Think like a statistician
a/k/a: How George Will became the darling of statistics teachers
"In 1992-93, none of the five states with the highest teachers' salaries were among the 15 states with the highest SAT scores. And the 10 states with the lowest per pupil spending included four . . . among the 10 states with the highest SAT scores."
--George Will, 1993
Statistical checks: From the simple to the sophisticated
R-squared = .82
ss2 = 43 + 0.95(ss1)
Descriptive statistics:
Frequency
Average
Mode
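Both ends of that range can be sketched with the Python standard library. The example below computes the three descriptive statistics listed, then fits a straight line of the same general form as the equation on the slide (intercept plus slope times x) and reports R-squared. The numbers are made up, and the hand-written least-squares function is an illustration, not the method behind the slide's figures.

```python
from collections import Counter
from statistics import mean, mode

def describe(values):
    """Frequency table, average and mode for a list of values."""
    return {"frequency": Counter(values), "average": mean(values), "mode": mode(values)}

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R-squared."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot

summary = describe([2, 3, 3, 5])
print(summary["average"], summary["mode"], dict(summary["frequency"]))

# Made-up points that fall exactly on y = 1 + 2x.
intercept, slope, r_squared = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope, r_squared)
```

Records that sit far from the fitted line are candidates for a closer look, whether they turn out to be errors or stories.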
Beware the documentation
Do integrity checks with other sources
Yes, that’s Harold Spaeth’s view and mostly I think he’s right, though I’d substitute the word more “efficient” for more “accurate.”
--Lee Epstein
(Find a power user, and compare notes.)
What’s missing?
An estimated 30 percent of felony convictions are missing from the Minnesota public convictions file.
(ask the keepers of the data)
Check those codes
(a/k/a: The codes are not what they seem)
Data spanned six years. Sometime in those six years, the violation codes changed. No one in the Housing Violations Bureau knew when the switch was made, and no one had definitions for the previous codes.
(a/k/a: Why to pull some paper records)
Beware elements of change
The “feename” (the name of the property owner) in the Saint Paul Housing Bureau’s code violations database is pulled in from property tax rolls. It shows the current owner. That person may not have owned the property at the time of the violation.
(a/k/a: Why to pull some paper records)
Summarize cases by institutions, then spot check results.
Is it true only 6 percent of hospital emergency cases are transferred from other hospitals?
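A summary by institution can show whether an overall figure like this is spread evenly or driven by places that never report the field at all. The sketch below uses invented cases, and both the field names (hospital, transferred_in) and the hospital names are hypothetical.

```python
from collections import defaultdict

# Hypothetical case-level records; 'transferred_in' is an assumed flag.
cases = (
    [{"hospital": "General", "transferred_in": i < 3} for i in range(10)]
    + [{"hospital": "St. Mary", "transferred_in": False} for _ in range(10)]
)

def transfer_share_by_hospital(rows):
    """Return {hospital: share of its cases flagged as transfers}."""
    totals = defaultdict(int)
    transfers = defaultdict(int)
    for r in rows:
        totals[r["hospital"]] += 1
        if r["transferred_in"]:
            transfers[r["hospital"]] += 1
    return {h: transfers[h] / totals[h] for h in totals}

by_hospital = transfer_share_by_hospital(cases)
overall = sum(r["transferred_in"] for r in cases) / len(cases)
never_reports = sorted(h for h, s in by_hospital.items() if s == 0)

print(f"overall: {overall:.0%}")
print("hospitals with no transfers recorded:", never_reports)
```

Hospitals that show zero are the ones to spot check by phone or with paper records: they may truly receive no transfers, or they may simply not fill in the field.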
Beware nulls!
Technology bites
Null scariness from the FDA’s MAUDE database
http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm
We want to explore reports involving Promus heart stents, but NOT the Promus Element devices.
First, let’s see what’s in there for Promus.
There are 50 records that mention Promus. We can see by scrolling that four are the Promus Element that we wish to exclude.
You’re supposed to have 46 records, but you got 30. What are the missing 16 records?
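The slide does not spell out the cause, but one common way records vanish like this is a NOT LIKE filter applied to a field that is sometimes null: in SQL, a comparison against NULL is neither true nor false, so those rows are silently dropped. The sketch below reproduces that behavior in SQLite with an invented table; the table and column names (reports, brand_name, model) are assumptions and do not reflect the MAUDE layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (id INTEGER, brand_name TEXT, model TEXT)")
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?)",
    [
        (1, "PROMUS", "STENT X"),
        (2, "PROMUS", "ELEMENT"),
        (3, "PROMUS", None),      # model left blank
        (4, "PROMUS", None),      # model left blank
        (5, "OTHER", "STENT Z"),
    ],
)

# Naive exclusion: rows with a NULL model fail the NOT LIKE test and disappear.
naive = conn.execute(
    "SELECT id FROM reports "
    "WHERE brand_name LIKE '%PROMUS%' AND model NOT LIKE '%ELEMENT%' "
    "ORDER BY id"
).fetchall()

# Null-aware exclusion: keep rows where the model is unknown.
safe = conn.execute(
    "SELECT id FROM reports "
    "WHERE brand_name LIKE '%PROMUS%' "
    "AND (model NOT LIKE '%ELEMENT%' OR model IS NULL) "
    "ORDER BY id"
).fetchall()

naive_ids = [row[0] for row in naive]
safe_ids = [row[0] for row in safe]
print("naive:", naive_ids)
print("null-aware:", safe_ids)
```

Comparing the filtered count with the figure expected from simple subtraction, as the slide does with 46 versus 30, is what exposes the gap.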
Beware false joins in “encrypted” data.
Medicare 5 percent sample: Doctors’ IDs were encrypted in some files, not in others.
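Before joining two files on an ID, it can help to profile the keys on each side and measure how many actually match. This is a rough sketch with made-up IDs; it does not describe how the Medicare files are encoded, and a real check would depend on the documentation for each file.

```python
from collections import Counter

def key_profile(keys):
    """Count keys by shape: (length, is every character a digit?)."""
    return Counter((len(k), k.isdigit()) for k in keys)

def match_rate(left_keys, right_keys):
    """Share of distinct left-hand keys that also occur on the right."""
    left, right = set(left_keys), set(right_keys)
    return len(left & right) / len(left) if left else 0.0

# Made-up IDs: one file looks scrambled, the other holds plain digits.
claims_ids = ["A1B2C3D4E5", "Z9Y8X7W6V5", "Q4W5E6R7T8"]
provider_ids = ["1234567890", "2345678901", "3456789012"]

claims_shape = dict(key_profile(claims_ids))
provider_shape = dict(key_profile(provider_ids))
rate = match_rate(claims_ids, provider_ids)

print(claims_shape, provider_shape, f"match rate: {rate:.0%}")
```

Keys with different shapes, or a match rate far from what you expect, suggest the two ID fields are not comparable even if some values happen to line up.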
Don’t alter original data.
As you report and just before you publish
Make a copy of the original data file. Put it somewhere and don’t touch it again.
Don’t edit an original column or field. Make a copy and edit that.
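The first rule can be scripted so it happens the same way every time. A minimal sketch, assuming a hypothetical file name (violations.csv) and folder layout; the checksum is an added safeguard so you can later show that the untouched original still matches what you started from.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def make_working_copy(original, workdir):
    """Copy the original into workdir and return (copy path, original's digest)."""
    original, workdir = Path(original), Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    copy_path = workdir / original.name
    shutil.copy2(original, copy_path)
    return copy_path, sha256_of(original)

# Demonstration in a throwaway directory with a tiny made-up file.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw" / "violations.csv"
    raw.parent.mkdir()
    raw.write_text("id,code\n1,H101\n")
    work, digest = make_working_copy(raw, Path(tmp) / "work")
    copy_matches = sha256_of(work) == digest
    work_name = work.name
    print(work_name, digest[:12], copy_matches)
```

Record the digest in your notes; all cleaning then happens on the working copy, in new columns rather than over the originals.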
Document as you go
Keep track of all of your queries so you can retrace your steps or find where you went wrong.
As you integrity-check your data, annotate the queries to remember what you learned.
Cross check
If you summed data in SQL, can you reproduce the results in a pivot table?
If you’re summing, do a list. Make sure there’s nothing wacky in that list that would cause your count to be wrong, e.g., duplicates.
If you have various data sources that should yield the same conclusions, do they?
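The same idea works with any two independent tools. In the sketch below, a SQL GROUP BY total is recomputed with a plain Python loop standing in for the pivot table, and the two results are compared. The table name, column names and amounts are invented.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows: (region, amount).
rows = [("North", 100), ("North", 50), ("South", 70), ("South", 70)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fines (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO fines VALUES (?, ?)", rows)

# Method 1: let SQL do the summing.
sql_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM fines GROUP BY region")
)

# Method 2: sum the same rows independently in Python.
py_totals = defaultdict(int)
for region, amount in rows:
    py_totals[region] += amount
py_totals = dict(py_totals)

mismatches = {
    region
    for region in set(sql_totals) | set(py_totals)
    if sql_totals.get(region) != py_totals.get(region)
}
print(sql_totals, py_totals, "mismatches:", sorted(mismatches))
```

Matching totals only show the arithmetic agrees. Listing the underlying rows is still needed to catch problems both methods share, such as the two identical South entries here, which may or may not be a duplicate.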
Beware the single case
Never report on one data record without pulling the paper report or talking to the person in question.
What if it was a data entry error?
What if there are circumstances you don’t understand?
Recreate the wheel
For every fact, number, finding in your story, write an original query or formula to support it.
Go back to your original data.
Try to arrive at the same conclusion in different ways.