Saturday, January 14, 2017

Tools we use in Data-Driven Reporting

Sixteen weeks is not really enough time to cover all the skills you might need in data reporting, but neither is a lifetime. New tools and methods to munge and make sense of data crop up every day, and that's one of the reasons this is such an interesting segment of journalism today. I learn something new every day, not just with my stories, but in my technical skillset.

For this class in Spring 2017, we'll use a number of software packages as we learn the technical skills behind data journalism. Most of them are free, but some are paid packages for which I have arranged educational licenses.

Microsoft Excel and Google Spreadsheets

The workhorse of data. While Microsoft Office is not free, as a student you can get it for next to nothing at the Campus Computer Store. If you don't own this software already, go buy it the first week of class. Google Spreadsheets is a free alternative and we will use it, too.

Tableau

Tableau is data visualization software for business analytics, but it is also a wonderful tool for journalists to explore and present data. While it is expensive on the retail market, we have a student license for you for the duration of the class. You can also join IRE for $25 as a student and get a license with your registration. Lastly, there is Tableau Public, which is free, but you have to save all your work publicly online.

OpenRefine and Trifacta

Once called Google Refine, OpenRefine is a tool for cleaning and shaping data. It is free and open source, though it hasn't been updated in a while. Trifacta Wrangler is another tool that can do similar things and is probably more intuitive, but I have less experience with it. It is worth your time to check it out.

Regular expressions

Regular expressions are more a skill than a tool: a small language for matching patterns in text. We use regex101.com as a tool to learn and create them.
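To give a flavor of what a pattern looks like, here is a minimal sketch in Python that pulls phone-like strings out of text. The sample lines and the `find_phones` helper are invented for illustration, and the pattern only handles one simple format (three digits, three digits, four digits, separated by hyphens).

```python
import re

# Hypothetical sample lines, invented for this illustration.
lines = [
    "Call the clerk at 512-555-0134 before Friday.",
    "No phone number here.",
    "Main office: 512-555-0199, fax 512-555-0100.",
]

# \d{3} means "exactly three digits"; \b marks a word boundary so we
# don't match the middle of a longer run of digits.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def find_phones(text):
    """Return every phone-like string found in text."""
    return PHONE_PATTERN.findall(text)


phones = [find_phones(line) for line in lines]
for found in phones:
    print(found)
```

The same pattern can be pasted into regex101.com to see, piece by piece, what each part matches.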

MySQL, MySQL Workbench and Navicat

MySQL Community Server is probably the world's most popular database software, but don't quote me on that. It is free and technically open source (the Community edition is released under the GPL), though Oracle controls the software. Because it is so widely used, there are tons of tutorials, documentation and questions asked and answered on Stack Overflow. That wealth of third-party documentation gives it a slight edge over PostgreSQL in learnability, but both are based on Structured Query Language (SQL).

We need a client to talk to the database and ask it questions. This semester there are two options: MySQL Workbench and Navicat Essentials. Workbench is free for Mac and PC, but there is a heinous conflict between the current version of Workbench and the macOS Sierra operating system. As such, I've arranged for licenses to Navicat Essentials for MySQL, a stripped-down version of the popular Navicat, which is my go-to tool in my professional job.
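As a rough idea of what "asking the database a question" looks like, here is a small sketch of a SQL query. It assumes a made-up `inspections` table with invented rows, and it uses Python's built-in sqlite3 module as a stand-in for MySQL so the example runs without a server. A basic SELECT like this one reads the same way in MySQL, though the two databases differ in other details.

```python
import sqlite3

# An in-memory SQLite database stands in for a MySQL server here.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for this illustration.
conn.execute(
    "CREATE TABLE inspections (restaurant TEXT, city TEXT, score INTEGER)"
)
rows = [
    ("Taco Hut", "Austin", 92),
    ("Burger Barn", "Austin", 78),
    ("Noodle Nook", "Dallas", 85),
    ("Pie Place", "Dallas", 95),
]
conn.executemany("INSERT INTO inspections VALUES (?, ?, ?)", rows)

# The question: how many inspections per city, and what is the
# average score, with the highest average first?
query = """
    SELECT city, COUNT(*) AS inspections, AVG(score) AS avg_score
    FROM inspections
    GROUP BY city
    ORDER BY avg_score DESC
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)

conn.close()
```

In class the query text would be typed into a client such as Workbench or Navicat rather than run from Python.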

Anaconda for Python, csvkit

We will just scratch the surface of how the Python programming language can help you solve data challenges with aplomb. Anaconda (and Miniconda) makes installing and using Python modules easier no matter your operating system, and it comes with Jupyter Notebooks, which make running, explaining and sharing your code easier. Csvkit is a Python module that you use on the command line or in Python scripts to clean, combine and prepare data for analysis.
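Here is a minimal sketch of the kind of cleanup a short Python script can do. To stay self-contained it uses only Python's standard csv module rather than csvkit itself, and the sample data, column names and `clean_rows` helper are all invented for illustration.

```python
import csv
import io

# Hypothetical messy data: stray spaces and inconsistent capitalization.
raw_csv = """name , city,amount
 Alice Smith ,austin, 100
Bob Jones,DALLAS ,250
"""


def clean_rows(text):
    """Trim whitespace, normalize city names and convert amounts to integers."""
    reader = csv.reader(io.StringIO(text))
    # Clean the header row first so the column names are predictable.
    header = [column.strip().lower() for column in next(reader)]
    cleaned = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, (value.strip() for value in row)))
        record["city"] = record["city"].title()
        record["amount"] = int(record["amount"])
        cleaned.append(record)
    return cleaned


cleaned = clean_rows(raw_csv)
for record in cleaned:
    print(record)
```

With a real file you would open it with `open()` instead of wrapping a string in `io.StringIO`.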

Texas A&M Geoservices

We'll often need to find the latitude and longitude of addresses in our data to allow for easier mapping. Texas A&M Geoservices has an excellent platform that includes a batch geocoding service. They are kind enough to extend those services for educational purposes like this class.

I'll probably add to this post as I think of other tools worth mentioning.
