Data Journalism 101 - Day 1 by Michael J. Berens
TRANSCRIPT
Donald W. Reynolds National Center for Business Journalism at ASU
Michael J. Berens – The Seattle Times
Data Journalism 101
Skills – rooted in past
Skills – lost in space
He said. She said. Now I’m going to tell you who’s telling the truth.
Poll Question: Have you ever been denied public data?
1) Yes 2) No
Finding a serial killer
Finding deadly germs and dirty hospitals
Tracking elephant deaths inside America’s zoos
Tracking fraudulent medical devices and profiteers
Tracking the exploitation of vulnerable seniors
Cops who own crack houses
Secret release of fugitives
Sexual misconduct in health care
Jailing the poor
Nursing errors
Unsanitary hospitals
Most dangerous highway
Most dangerous intersection
Number of deadly police chases
Most dangerous area for crime
Most unsanitary restaurants
“Quantitative”
Poll Question: Why were you denied data?
• Too expensive
• Agency claimed info was not a public record.
• Agency claimed the request was a burden.
Negotiating for data
• Delay – we’re working on it.
• Deny – it’s proprietary software.
• Divert – yours for just $12,000.
“If you don’t know who I am, then maybe your best course of action would be to tread lightly.”
- Walter White in “Breaking Bad”
Step One: File layout
(secret weapon to finding stories)
Fields, position, type, length
Field Number  Variable  Type  Format  Label
(comments and code values are indented beneath each field)

1   SEQ_NO    Char  $10.  Sequence Number
      Unique sequence number assigned to each record within a year.
      First four digits are the year of discharge.
2   REC_KEY   Num   11.   Record Key
      Unique number assigned to each CHARS record. Added in 2003.
3   STAYTYPE  Char  $1    Type of Stay
      1 = Inpatient
      2 = Observation patient
4   HOSPITAL  Char  $4    Hospital Number
      DOH assigned hospital number.
      Fourth character describes the Medicare certified unit type with:
      blank = acute care
      R = Rehabilitation unit
      P = Psychiatric unit
      S = Swing bed unit
      - - - - - - - - - - - - - - - - - - - - - - - - - - -
      A = Alcohol (discontinued after 1992)
      B = Bone marrow transplants (discontinued after 2000)
      E = Extended care (discontinued after 2001)
      H = Tacoma General & Group Health combined (discontinued after 1992)
      I = Group Health only at Tacoma General (discontinued after 1992)
5   LINENO    Num   3.    Number of Reported Revenue Items Codes
6   ZIPCODE   Char  $5    Patient's Zip Code
      99999 indicates the zip code is unknown.
      99998 indicates homelessness (some homeless patients may have a zip code for a shelter or other temporary location).
      Blanks indicate non-U.S. residence.
7   STATERES  Char  $2    State of Residence
      State abbreviation used by U.S. Postal Service.
      This is assigned from the zip code.
      Residents with zip code 99998 are assigned to Washington.
      XX = invalid zip code or a non-U.S. residence.
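The special values in the layout above can be turned into small lookup helpers so they are not mistaken for real data. This is a minimal sketch: the function names and the sample record (including the hospital number "029R") are invented for illustration, and only the code values come from the layout.

```python
# Lookup tables copied from the file layout above.
STAY_TYPES = {"1": "Inpatient", "2": "Observation patient"}
UNIT_TYPES = {
    " ": "acute care",
    "R": "Rehabilitation unit",
    "P": "Psychiatric unit",
    "S": "Swing bed unit",
}


def describe_zip(zipcode: str) -> str:
    """Translate the special ZIPCODE values documented in the layout."""
    if zipcode.strip() == "":
        return "non-U.S. residence"
    if zipcode == "99999":
        return "unknown"
    if zipcode == "99998":
        return "homeless"
    return zipcode


def describe_unit(hospital: str) -> str:
    """Read the fourth character of HOSPITAL as the Medicare unit type."""
    code = hospital.ljust(4)[3]  # pad so a 3-character number reads as blank
    return UNIT_TYPES.get(code, "other or discontinued code: " + code)


# Hypothetical record, invented for illustration only.
record = {"STAYTYPE": "1", "HOSPITAL": "029R", "ZIPCODE": "99998"}
print(
    STAY_TYPES[record["STAYTYPE"]],
    "|", describe_unit(record["HOSPITAL"]),
    "|", describe_zip(record["ZIPCODE"]),
)
```

Counting ZIP code 99998 as an ordinary ZIP code would place every homeless patient in one imaginary neighborhood, which is the kind of detail the layout exists to prevent.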
Code keys
Finding stories that lurk in code keys
Stories that hide in plain sight
E9220 HANDGUN ACCIDENT
E9221 SHOTGUN ACCIDENT
E9222 HUNTING RIFLE ACCIDENT
E9223 MILITARY FIREARM ACCID
E9224 ACCIDENT - AIR GUN
E9225 ACCIDENT-PAINTBALL GUN
E9228 FIREARM ACCIDENT NEC
E9229 FIREARM ACCIDENT NOS
E9230 FIREWORKS ACCIDENT
E9231 BLASTING MATERIALS ACCID
E9232 EXPLOSIVE GASES ACCIDENT
E9238 EXPLOSIVES ACCIDENT NEC
E9239 EXPLOSIVES ACCIDENT NOS
E9240 ACC-HOT LIQUID & STEAM
E9241 ACCID-CAUSTIC SUBSTANCE
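A code key like the one above becomes useful once it is attached to the coded records and counted. The sketch below assumes each record carries one injury code; the list of records is invented for illustration, and only the code labels come from the key.

```python
from collections import Counter

# A few entries copied from the code key above.
CODE_KEY = {
    "E9220": "HANDGUN ACCIDENT",
    "E9221": "SHOTGUN ACCIDENT",
    "E9222": "HUNTING RIFLE ACCIDENT",
    "E9225": "ACCIDENT-PAINTBALL GUN",
    "E9230": "FIREWORKS ACCIDENT",
}

# Hypothetical injury codes pulled from records, invented for illustration.
records = ["E9220", "E9230", "E9220", "E9225", "E9999"]


def tally(codes, key):
    """Count each code and attach its label, flagging codes missing from the key."""
    counts = Counter(codes)
    return [
        (code, key.get(code, "NOT IN CODE KEY"), n)
        for code, n in counts.most_common()
    ]


for code, label, n in tally(records, CODE_KEY):
    print(code, label, n)
```

Codes that do not appear in the key are flagged rather than dropped, since an unexplained code is a question to take back to the agency.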
Secret release of fugitives – code in court data
Rising tide of innocent people killed in police chases – code in NHTSA data
How many people contracted a hospital-acquired infection during heart surgery – code in hospital data
----------------------
Power of two – combining data
Death certificates – list of adult family homes
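Combining two datasets usually means finding a field they share. One possible approach, sketched below, matches the place of death on a death certificate to the address of a licensed home. Every field name and record here is hypothetical, and real address matching needs far more cleanup than this normalization step.

```python
def normalize(address: str) -> str:
    """Uppercase, drop periods, and collapse spaces so similar addresses compare equal."""
    return " ".join(address.upper().replace(".", "").split())


# Hypothetical records, invented for illustration only.
homes = [
    {"name": "Home A", "address": "12 Elm St."},
    {"name": "Home B", "address": "400 Pine Ave."},
]
deaths = [
    {"decedent_id": "D1", "place_of_death": "12 elm st", "cause": "sepsis"},
    {"decedent_id": "D2", "place_of_death": "77 Oak Rd", "cause": "pneumonia"},
]


def match_deaths_to_homes(deaths, homes):
    """Return (home name, decedent id, cause) for deaths at a listed home address."""
    by_address = {normalize(h["address"]): h for h in homes}
    matches = []
    for d in deaths:
        home = by_address.get(normalize(d["place_of_death"]))
        if home:
            matches.append((home["name"], d["decedent_id"], d["cause"]))
    return matches


print(match_deaths_to_homes(deaths, homes))
```

A match is a lead, not a finding: each one still has to be checked against the underlying documents.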
Tip: Know the rules of the data. No detail is too small.
Step Two: File format
Every computer file has an extension:
.txt   Text file
.csv   Comma-separated values
.dbf   Database format
.html  HyperText Markup Language
.mdb   Microsoft database (Access file)
.pdf   Portable Document Format
Rule of thumb: Always request comma-delimited text if Excel format is unavailable
Two database structures: 1) Fixed length 2) Delimited
Fixed-length file
Berens    2312     Columbus  blue
Anderson  4563625  Seattle   violet
Becker    45453    New York  light brown
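A fixed-length file is read by position, using the start and length of each field from the file layout. The sketch below uses the three sample records; the field names and column widths are assumptions made for illustration, since a real layout would come from the agency.

```python
# Hypothetical layout: (field name, start position, length), positions are 0-based.
LAYOUT = [("name", 0, 10), ("id", 10, 9), ("city", 19, 10), ("color", 29, 12)]

lines = [
    "Berens    2312     Columbus  blue",
    "Anderson  4563625  Seattle   violet",
    "Becker    45453    New York  light brown",
]


def parse_fixed(line, layout):
    """Slice one fixed-length record into named fields and trim the padding."""
    return {
        name: line[start:start + length].strip()
        for name, start, length in layout
    }


for line in lines:
    print(parse_fixed(line, LAYOUT))
```

Splitting on spaces would break "New York" and "light brown" into separate fields, which is why the positions in the layout matter.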
Delimited file
berens,272464,Seattle,blue
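A delimited file can be read with Python's standard csv module, which also copes with commas inside quoted fields. In this sketch the field names are assumptions, and the second record is invented to show the quoting case.

```python
import csv
import io

# Hypothetical field names; a real file layout would supply them.
FIELDS = ["name", "id", "city", "color"]

# The first record is the sample above; the second is invented for illustration.
raw = 'berens,272464,Seattle,blue\nbecker,45453,"New York, N.Y.",light brown\n'

rows = list(csv.DictReader(io.StringIO(raw), fieldnames=FIELDS))

for row in rows:
    print(row["name"], "|", row["city"], "|", row["color"])
```

Splitting each line on commas by hand would cut the quoted city in two, so a CSV reader is the safer choice.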
Poll Question: In general, how long do you wait for public data?
1) Quickly – within a few weeks at most
2) Slowly – often takes a month or more
3) Never – there’s always some issue
Tip: Talk first. File a request last.
Blank canvas - importing
Go to “Data” tab, then look for “Text” icon
CASE  DATE      TIME  COUNTY       AREA    WOUND   INJURY  TYPE  CAUSE
1     11/21/87  645   Sauk         south   neck    minor         victim in car-stray bullet
2     11/21/87  730   Marathon     centrl  arm     major   sp    loaded firearm in vehicle
3     11/21/87  930   Oneida       north   chest   fatal   si    careless handling-tree involvd
4     11/21/87  945   Juneau       south   chest   major         victim in line of fire
5     11/21/87  950   Buffalo      centrl  leg     major   sp    victim out of sight of shooter
6     11/21/87  1000  Portage      centrl  foot    major   si    careless handling-tree involvd
7     11/21/87  1000  Portage      centrl  chest   major   sp    careless handling-tree invovld
8     11/21/87  1135  Rock         south   head    fatal         victim in line of fire
9     11/21/87  1235  Columbia     south   head    major   si    careless handling-tree involvd
10    11/21/87  1300  Columbia     south   abdomn  fatal   si    victim fell from tree
11    11/21/87  1440  Shawano      centrl  chest   fatal         victim out of sight of shooter
12    11/21/87  1445  Trempealeau  centrl  neck    major         ricochet-off gun
13    11/21/87  1445  Columbia     south   leg     major   sp    gun hammer struck an object
14    11/21/87  1630  Langlade     north   arm     minor         victim out of sight of shooter
15    11/22/87  815   Trempealeau  centrl  head    major         ricochet-bullet thru deer
16    11/22/87  900   Oconto       centrl  toe     major   si    careless handling-tree involvd
17    11/22/87  900   Trempealeau  centrl  leg     major   sp    victim in line of fire
18    11/22/87  1130  Buffalo      centrl  head    minor   sp    victim out of sight of shooter
19    11/22/87  1143  Door         north   hand    major   si    unloading firearm-defective
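Once rows like these are imported, the first questions are simple counts. The sketch below parses a handful of the rows above and counts fatal injuries by area. It assumes the TYPE column only ever holds "sp" or "si" (the two values seen in the sample) and that county names are single words, so it is a rough stand-in for a proper fixed-length import rather than a general parser.

```python
from collections import Counter

# A few rows copied from the table above, with single spaces between fields.
SAMPLE = """\
1 11/21/87 645 Sauk south neck minor victim in car-stray bullet
2 11/21/87 730 Marathon centrl arm major sp loaded firearm in vehicle
3 11/21/87 930 Oneida north chest fatal si careless handling-tree involvd
8 11/21/87 1135 Rock south head fatal victim in line of fire
10 11/21/87 1300 Columbia south abdomn fatal si victim fell from tree
11 11/21/87 1440 Shawano centrl chest fatal victim out of sight of shooter
"""

TYPE_CODES = {"sp", "si"}  # assumption: the only TYPE values, as seen in the sample


def parse_row(line):
    """Split one row into named fields; TYPE may be blank, CAUSE is the remainder."""
    parts = line.split()
    case, date, time, county, area, wound, injury = parts[:7]
    rest = parts[7:]
    type_code = rest.pop(0) if rest and rest[0] in TYPE_CODES else ""
    return {
        "case": case, "date": date, "time": time, "county": county,
        "area": area, "wound": wound, "injury": injury,
        "type": type_code, "cause": " ".join(rest),
    }


rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]
fatal_by_area = Counter(r["area"] for r in rows if r["injury"] == "fatal")
print(fatal_by_area)
```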
Tip: Make a copy of the database. Call it “master file” and never touch it. Always work from a copy.
Hint: Keep a log of everything.
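The master-file habit can be partly automated. The sketch below marks the master read-only, makes a writable working copy, and appends a line to a log. The function name, file names, and log format are all hypothetical, and the demonstration runs in a temporary folder so it touches no real data.

```python
import os
import shutil
import stat
import tempfile
from datetime import datetime


def make_working_copy(master_path, work_dir, log_path):
    """Protect the master file, copy it for analysis, and log the step."""
    os.chmod(master_path, stat.S_IREAD)  # mark the master read-only
    work_path = os.path.join(work_dir, "working_" + os.path.basename(master_path))
    shutil.copyfile(master_path, work_path)  # copies contents only, so the copy stays writable
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} copied {master_path} -> {work_path}\n")
    return work_path


# Demonstration in a throwaway folder with a one-line stand-in for the master file.
with tempfile.TemporaryDirectory() as tmp:
    master = os.path.join(tmp, "master_file.csv")
    with open(master, "w", encoding="utf-8") as f:
        f.write("berens,272464,Seattle,blue\n")
    work = make_working_copy(master, tmp, os.path.join(tmp, "log.txt"))
    print(os.path.basename(work))
```

The log matters as much as the copy: when a source challenges a number, it shows how the result was reached.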
Importing a fixed-length file
Tip: Always show your results to the sources in your story. Remember: You’re one keystroke away from a career-ending error.
What (and where) is your favorite source of Web-based data?
Answer in the chat box
Searching for Microsoft
Instant database – 17,583 records
http://www.fda.gov/
Look for the entire download
Code key
http://ire.org/nicar
Don’t be obsolete.
Unleash your inner watchdog