Wednesday, September 3, 2014

How we do this: The inverted pyramid of data journalism


Lecture notes: 2014-09-03

This lecture is inspired by this article by Paul Bradshaw: The inverted pyramid of data journalism.

OK, more than inspired. My lecture is the article. We’ll go over the story above together and talk about how it relates to what I hope we get out of this course.

Here are my additions:


  • I love the phrase “Data journalism begins in one of two ways: either you have a question that needs data, or a dataset that needs questioning.” That’s my new signature phrase.
  • Getting the data is probably the hardest, and often the most time-consuming, part of the process. If you are requesting data from an organization, don’t be surprised if you have to go back to them to ask for more. Here are some tips to avoid that.
    • Ask for a record layout of the data before you submit your request. The record layout tells you what fields are available and what they mean. This lets you know what to ask for.
    • Talk to someone on the phone (yes, call them) who works with the data. If you are able, ask them how they would approach the questions you are asking.
    • If you cover an agency, ask for their data expiry calendar. This not only tells you how often they shed/delete data, it also tells you all the data they have!
  • Ask for CSV or Excel, but don’t give up if you get PDFs. There are tools for extracting the data: Cometdocs and Tabula are two.
  • Some Google search techniques:
    • Search a specific site with the site: operator.
    • Quotes give you an exact phrase: “crime statistics” matches the two words together.
    • Wildcard: “crime statistics * 2013” helps you narrow without being exact.
    • Exclude with minus. To get crime statistics, but not FBI, use “crime statistics” -FBI
    • Get specific file types with filetype:xls
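
If you want to reuse the search operators above, here is a minimal sketch that assembles them into one query string and a search URL. The function names, the `example.gov` domain and the URL pattern are my own assumptions for illustration, not anything from Bradshaw's article.

```python
from urllib.parse import urlencode


def build_query(phrase=None, exclude=(), site=None, filetype=None):
    """Combine the search operators from the list above into one string."""
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')           # quotes: exact phrase
    for word in exclude:
        parts.append(f"-{word}")              # minus: leave this term out
    if site:
        parts.append(f"site:{site}")          # restrict to one site
    if filetype:
        parts.append(f"filetype:{filetype}")  # only this file type
    return " ".join(parts)


def search_url(query):
    """Turn the query into a URL (assumed pattern: /search?q=...)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Hypothetical search: spreadsheets of crime statistics, minus FBI pages.
query = build_query(phrase="crime statistics", exclude=["FBI"], filetype="xls")
print(query)
print(search_url(query))

# Same idea, limited to a placeholder domain.
print(build_query(phrase="crime statistics", site="example.gov"))
```

Paste the printed string into the search box, or open the printed URL in a browser.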


If compiling is the most time-consuming, cleaning is the most labor-intensive. You’ll spend more time cleaning data than using it, I assure you. We’ll cover data cleaning later in the course, but here are some tools in addition to what Bradshaw has mentioned.


Here is where the critical mind of a reporter comes in.
  • Numbers alone can’t tell a story. Or not a very interesting one. Find the people!
  • There will be errors in the data. You will have to make subjective decisions, like creating categories or leaving out bad data. Explain yourself.
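
One way to "explain yourself" is to keep a record of every row you leave out and why. This is only a sketch: the sample rows, the column names and the two rejection rules are made up for illustration, and your own data will need its own rules.

```python
import csv
import io

# Hypothetical sample data; column names and values are invented.
RAW = """agency,year,incidents
Agency A,2013,1520
Agency B ,2013,n/a
Agency C,2013,-4
Agency D,2013,310
"""


def clean(rows):
    """Split rows into kept and dropped, recording a reason for each drop."""
    kept, dropped = [], []
    for row in rows:
        # Trim stray whitespace from every field.
        row = {key: value.strip() for key, value in row.items()}
        try:
            count = int(row["incidents"])
        except ValueError:
            dropped.append((row, "incidents is not a number"))
            continue
        if count < 0:
            dropped.append((row, "negative incident count"))
            continue
        row["incidents"] = count
        kept.append(row)
    return kept, dropped


kept, dropped = clean(csv.DictReader(io.StringIO(RAW)))
print(f"kept {len(kept)} rows, dropped {len(dropped)}")
for row, reason in dropped:
    print(f"dropped {row['agency']}: {reason}")
```

The list of dropped rows and reasons is what you would hand to an editor, or publish in a methodology note.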


  • As Bradshaw says, find the data set to compare. Variance in the data or change over time is the key component of your story, almost guaranteed.
  • Visualization is often how you combine data sets, and it can help you find the story as much as it can explain it. I build data visualizations first for the reporter to help them find the story. We’ll do this later in the course, using Tableau.
  • We’ll try to get to correlation: seeing how two different variables relate. But remember: Correlation does not mean Causation!!!!
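
To make "change over time" and "correlation" concrete, here is a small sketch of the arithmetic behind each. The numbers are invented for illustration, and I am assuming the Pearson coefficient as the correlation measure, which is only one of several.

```python
import math


def percent_change(old, new):
    """Change from old to new, as a percentage of the old value."""
    return (new - old) / old * 100


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical counts for two years.
print(percent_change(200, 250))

# Hypothetical paired measurements that move together perfectly.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

A coefficient near 1 or -1 says the two columns move together; it says nothing about whether one causes the other.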


Bradshaw continues in “Six ways of communicating data journalism”. Read it, because telling the story is what sets you apart from any fool who can use Excel.

And specifically on the “Visualize” part: Take my class in the Spring.

