Monday, November 3, 2014

Introduction to Tableau

(Postdated lecture/handout)

Lecture - Introduction to Tableau

This lecture is designed to introduce students to Tableau so they can see how it can be used to explore a data set.

Versions of Tableau

There are many flavors of Tableau, but they are essentially the same program.
  • Tableau Desktop is the full version. It can connect to files and databases. You can save your work to your computer at any time and later publish it online on Tableau Public. It can be quite expensive if buying directly, but you can get a free educational license as a student or a free professional license through an IRE membership.
  • Tableau Public is both a website and software.  The software is essentially the same as Tableau Desktop, but you can't save files to your local computer; you can only publish files online on the Tableau Public website. Anyone (including yourself) can download workbooks from Tableau Public and continue to work on them.
  • Tableau Server is a business class version of the software. Not really applicable to us.

Getting started

Launch Tableau and you are first taken to the screen to connect to your data. Choose your file type and open the file. For this we'll choose Microsoft Excel and then go find the Travis Gun Deaths.

t-check data types.png


  • We'll use a data set about gun deaths in Travis County. I'll talk about and show the original data in class, and explain how it was cleaned up for use in Tableau.
  • Download this data file for this lesson: 11 Data - Travis Gun Deaths
  • The next thing that comes up is a view of your data. This is the time to check that all your data types are correct … that numbers, strings, dates and locations are noted properly. Once all is checked (Travis Gun Deaths should be fine) click on the big red Go to Worksheet button.
  • Note on the left panel the Dimensions and Measures shelves. All the categorical data like Death Type, ZIP and such are Dimensions, while all the quantitative values that you can add up have been put on the Measures. (It's actually a bit more complicated than that, but that will do for now.)
  • In our case, Age has been counted as a Measure, but we'll never add ages together, so we'll drag and drop it up to the Dimensions area.

Exploring through Show me

Tableau can be both very easy to use and quite complicated. The Show Me palette can help guide you on ways to explore your data.
  • If you don't see the whole Show Me palette at the top-right of the screen, click on Show Me so it drops down.

t-show me.png

  • Click on the Date dimensions, and the hold down the Command key and click on the Number of Records measure. (I'm assuming Mac use throughout … use Ctrl where applicable.)
  • As you click on measures and dimensions together, the Show Me palette will light up the chart types that might work with your data. Since we are looking at a date, let's choose the bar graph. (Left side, 3rd down.)
  • What you get (probably) is a simple bar chart that shows the number of records by year, probably showing horizontal.
  • When I look at date graphics, I like the dates to show at the bottom, a more vertical alignment, so I can see the years go from left-to-right. You can flip the access on the chart by clicking on the flip access button, which looks like this: Screen Shot 2014-11-01 at 9.57.51 AM.png
  • We can change that to see it by month by using the dropdown in the Date field that is now on the columns shelf and pulling down to the Month that also shows the year. (It shows May 2011: see screenshot.)
Screen Shot 2014-11-01 at 9.49.16 AM.png
  • Now we could use some help reading the number of deaths for each month, so let's add some labels. Take the Number of Records measures and move it to Marks palette and drop it right on top of the square called Label. This should add the number of records -- i.e., number of deaths -- to each bar.
  • If you have trouble discerning which exact month a bar represents, you can hover over that bar with your cursor to see a tooltip that gives details about that data. Now we can see April 2010 had that most deaths with 14.
  • I wonder if all Aprils are that bad. We can check by changing the Date field on the Columns shelf to the other Month listed there, where it just shows "May". Now we are looking at deaths by month for all three years, and over that period April was fairly normal, but November is pretty high.
  • Could this November trend be holiday stress? Why might explore a little further by dragging the Death Type dimension to the Marks shelf and drop it on Color. This now divides our bars so we can see all the suicides, homicides and accidents. November is showing the most suicides, so maybe that is a lead to talk to mental health professionals.
  • If you find the bars difficult to discern all three types of death together like this, you can change the chart type on the Marks shelf by using the dropdown there to move from from Bar to Line.
  • If you don't like that move, you can use the back button at top left to undo that move and any many more previously. You can then use the forward button to return. Stop where you wish, and then change the name of the sheet at the bottom to something useful by double-clicking on Sheet 1 and renaming. It works similar to Excel sheets.
  • Down by the sheet names, click on icon that show a bar chart and + sign on it to start a new blank sheet. Screen Shot 2014-11-01 at 1.36.23 PM.png(The other one starts a dashboard, which we'll talk about later.)
  • Let's do one last thing with dates. Click on Date and Number of records and choose the Line graph from the Show Me tab.
  • Go to Date in the columns shelf to change from YEAR and pull down to combined years set under More and choose Weekday. This gives you the most deaths by day of the week. Drop your Number of Records measure on the Label tab on Marks and then rename your sheet Weekday.
  • Last thing with this section: We've done a bit of work, let's save the workbook so you don't lose it. Under File > Save or click on the disc icon. Save it where you can find it again.

Another bar chart with Age

Bar charts are pretty telling. Let's make another one.
  • Create a new sheet
  • Click on Age. Hold down Command and click on Death Type and Number of Records to add them to the selection.
  • Click on the stacked bar chart in Show Me. (3rd row in the middle).
  • This shows you the deaths by type, by year. In this case, I think it is more readable as horizontal bars, so use Screen Shot 2014-11-01 at 9.57.51 AM.png to flip the axis.
  • Drag Number of Records measure to the Label tab on Marks.

Screen Shot 2014-11-01 at 12.14.40 PM.png


This gives us a good view by date, but when we use color to show the death type, we lose the total number of deaths for each age. Let's add that back.
  • On the bottom axis is the label Number of Records. Right-click on that to get a pop-up menu and choose "Add Reference Line, Band or Box."
  • Set the following:
    • Scope: Per cell
    • Value: Number of Records by SUM.
    • Label: Value
  • This adds the total number to the end of the bar. Sometimes it looks funky if there is only one color bar, but I don't know how to fix that ;-(
  • Save your sheet as "Age, type of death"

Null values


Note the Null value for age. We could exclude this to get rid of it, but it's better to first understand why it is there. Let's find out what record it is.
  • Click on or hover over the Null record until the tooltip comes up.
  • Roll over the the table looking icon on the far right of the tooltip and click on it.
  • This brings up a "View Data" that shows you the Summary of all the records used in that bar.
  • Click on Underlying and that will show you the individual records that make up the bar. In this case, that records has a blank where age is. We would want to go back to our original data source to find out why. (We simply could not determine the age of that victim.)

Screen Shot 2014-11-01 at 12.25.28 PM.png

Maps

Tableau has a lot of map functionality built into it. If your data set has geography like state, city, zip and such, it will try it's best to map that data. Sometimes you have to help it. (When you get outside of common shapes like that to say, school districts or oil fields, it is much harder to use Tableau.)
  • We'll start with a new sheet, click on Zip and Number of Records and go to the Show Me pallet and choose the Map with the blue dots. This draws a dot by ZIP code where the dot is bigger based on the number of records.
Screen Shot 2014-11-01 at 11.17.02 AM.png
  • You'll notice there is a dot way out in California. This is a case where someone was shot in Austin but died years later out of state. We will exclude it from the map as an exercise to show how. Click on the dot (or draw a sqare around it with your cursor) to select it, then in the tooltip click the Exclude link. It can be challenging to get that tooltip to come up, but you'll get the hang of it.
  • Once that record is excluded (it's not deleted, only hidden from view) the map will refocus to Travis County, though it's kinda hard to discern that with the way the map looks, so we'll work on that.
  • In the menus across the very top of the screen, go to Map > Map Options.
  • This will bring up a palette on the left where you can change a number of things.
    • Change the Style to "Normal."
    • Click on Streets and Highways
  • Now we can understand what we are looking at a bit more. Let's see how the type of death plays out on the map. Drag the Death Type onto the Color mark and you'll see how they fall upon the map.
  • We can isolate those Death Types by creating a filter. If you click on the dropdown for Death Type dimension and choose Show Quick Filter, then you will get a palette on the right-hand side of the screen. (You might have to close the Show Me palette to see it).
  • Now you can choose the death types individually and see how they fall upon the map. Let's add some things to make it easier to see:
    • Let's add the Number of Records measure to the Label mark so you can more easily see some values, too.
    • Click on the Size tab in Marks and make the side larger. This makes all of the dots bigger so the smaller ones are easier to see.
    • You might see some interesting trends worth pursuing … what part of the cities see more homicides? What is up with 19 suicides up by the Arboretum?
  • You can add some demographic data that Tableau stores into the map with a few steps.
    • Under the Map menu, go to Background Maps and choose Tableau Classic.
    • Make sure Map Options are up (Map > Map Options).
    • Click on Data Layer toward the bottom of the Map Options, and look under US Households to "Household Income (median)". This colors the background map by income, but default to By State. Change By to Zip Code. Change to Block Group to see the difference.
    • Zip is probably more appropriate. You can clearly see how the homicides cluster in areas of lower median income. Not sure you can call it causation, but clearly worth further reporting.
  • Set your filter to show all the Death Types, then name your sheet "Map."

Tables

Sometimes a table of numbers is still the best way to show data, but Tableau can add some visual cues to help you read it.
  • Start a new sheet.
  • Click on Date, then hold down the Command key to add Gun Type and Number of Records to your selection.
  • In the Show Me palette, choose the colored table at top right.
  • You should get a table that shows deaths by year, by guy type. Clearly handgun use is highest, as shown by the darker shade of green.
  • Save that sheet as "Gun type"

Dashboards

You can put your various sheets together on a dashboard to show a common thread. You can use filters to build interactivity within the sheets. Dashboards can be tied together as a Story. Tableau can become a powerful presentation tool. We'll cover some bare bones basics and Dashboards here, but we'll do more with Tableau in Data Visualization.
  • Create a new Dashboard by clicking on the tab at the bottom that looks like squares and a + sign: Screen Shot 2014-11-01 at 1.37.07 PM.png
  • Click on the sheet name "Map" and drag and drop it onto the dashboard.Screen Shot 2014-11-01 at 1.46.11 PM.png
  • Click on the sheet name "Age, type of death" and drag it toward the bottom of the dashboard like you are dropping it on the bottom part of the map. You'll see areas turn grey where the sheet will land when you drop it. If you drop at the bottom, the Map part should split in half and the bar chart should take up the bottom part.
  • Take "Gun type" sheet and drop it on the left half of the bar chart.
  • Take the "Weekday" sheet and drop it into the space under "Gun type". You'll have something like the screenshot in this section.
Screen Shot 2014-11-01 at 1.52.15 PM.png



Global filters

This is just a start on a dashboard. You can do a lot to make the dashboard prettier and display more information, but we'll save those nuances for another lesson. We'll show one last thing with the dashboard … using a filter across multiple sheets.
  • Click on the Death Type filter at the top right to see the dropdown for that filter.
  • Click on the dropdown and choose Apply to Worksheets > All Using This Data Source.
  • Now, when you check and uncheck the boxes in the Death Type filter, all the data on the dashboard filters according to your selections.


Tableau Desktop activation information

Tableau is loaded on lab machines, but this license below allows students in Fall 2014 Data-Driven Reporting to use Tableau on their own machines during the length of the class.

Go to the landing page to download Tableau and enter the key noted below. This key will activate enough licenses for your entire class for the duration of the course.

  • Desktop Key: TD5H-6033-9200-234C-FA7B
  • Instructions: Click on the link above and fill out the form on the right hand side of the page. Under "Job Title", mark Student; and under "Organization", please input "UT-Austin Journalism".

Monday, October 20, 2014

Using SQL: Group by join

(Postdated lecture/handout on SQL. This uses a Navicat Essentials and a dataset that was loaded on lab machines.)



SQL - Aggregation, grouping and joining


We're going to move to using MySQL on a local computer now. The data we are going to use should already be loaded onto the lab machines (please use lab machines today.) If that fails for whatever reason the data is in Mode as "utdata.msa2013" though the column names are a little different.


Throughout these handouts SQL syntax has been written out so you have a record of what different things do. I HEAVILY ENCOURAGE YOU TO TYPE IT OUT AND NOT COPY/PASTE. You will learn more if you make typos and mistakes along the way and have to figure out why and correct it.

Alcohol tax receipts

Our next section is going to use a set of data from the Texas Comptroller's office called the "mixed beverage gross receipts tax." Basically, the state takes a cut of every alcoholic drink sold over the counter by a bartender or waiter. (This does not include beer or liquor you buy at a package or grocery store, which is instead taxed under normal sales taxes, and not definable against any other food stuff you buy.) With the gross receipts, we can find out who sells the most alcohol by the drink.


The table "receipts" is our main data, but we are going to start with a subset "msa2013" which is a table of just the records for one year from Hays, Bastrop, Travis and Williamson counties.


  • Launch Navicat
  • If you don't see "localhost" in the left-hand column
    • Go to Connection, choose MySQL
    • Click OK on the window that opens
  • Double-click on "localhost"
    • leave the password blank and hit OK. These machines have the root password set as blank.
    • If you downloaded MySQL in class, your root password was saved on the desktop of your machine
  • Double-click on "utdata" to open that database.


Double-click on the msa2013 table and it should open so you can see the columns we can take a look at it. (We'll talk about what each one means.)

How many records are in msa2013?

If you ever think you have a 1000 records on a query, be suspicious. Remember the row limit in Mode? It is present in Navicat as well. You can turn off or change the row limit in Preferences if you need to, but we won't.

Build a query

Click on Queries under "alcohol", and then the + button at the bottom of the screen. Now we can type in our query:


SELECT count(*)
FROM msa2013


You are counting how many records match *, which is everything.


SAVE THIS SEARCH AS "msa2013count"  and take a screenshot of your SQL and results and submit them with the answer to this question in the assignment. How many records are in the msa2013 table?


That would be a lot of records to sort or filter in Excel, and that is just four counties for one year. The full "receipts" table has 10 years of data for all counties in Texas. Care to guess how many rows that might be? Well, don't. Figure it out for sure and tell me!

How many restaurants do we have? Using DISTINCT.

We can get a list of all the restaurant names by searching for DISTINCT names.


SELECT DISTINCT locname
FROM msa2013


It should look something like this:




And you should get 584 names.


If you didn't need to see the actual names but just wanted to know how many there were, we could count for just distinct locnames:


SELECT count(DISTINCT locname)
FROM msa2013


And your return would just be the number: 584.


Now … looking at your data, do you think there are 584 different restaurant locations? How might you find out how many unique locations there are?


SELECT DISTINCT
locname,
locaddress
FROM
msa2013
ORDER BY
locname


Now we have 694 records.




I ordered them by locname so you can see how the names might be the same for different addresses. In our previous query, we were counting different restaurant locations as the same because they had the same distinct locname. This is an important concept to understand and it means you need to understand what fields in your database are unique to each record.

AGGREGATE FUNCTIONS: SUM to add values together

OK, now we get to some power out of this database. Let's find out which locations (name & address) sell the most alcohol!


SELECT DISTINCT
locname,
locaddress,
SUM(grossreceipts)
FROM
msa2013
GROUP BY
locname,
locaddress
ORDER BY
SUM(grossreceipts) DESC
Let's break this down:
  • with grossreceipts inside the SUM statement, we are adding all the values of grossreceipts for this query.
  • with the GROUP BY statement, we are grouping all our returns where locname and locaddress are the same. This is important. You must always have a GROUP BY statement if you ask for any other field with an aggregation function like SUM or AVG, MIN, MAX, COUNT.
  • with the ORDER BY statement, we are ordering the returns by the SUM in DESCENDING order (DESC for short). The default order is ASC or smallest at top.


It looks like this:


An aside: Using an alias

Now we are going to simplify this statement just a little, and introduce a nice concept called an alias. I hate retyping SUM(whatever) each time, so I'm going to create an alias for that value called "TotalSales", but I don't have to put it in quotes because I'm not using any spaces. I can use this alias to refer to that value later, like in the ORDER BY field below.


SELECT DISTINCT
locname as Location,
locaddress as Address,
SUM(grossreceipts) as TotalSales
FROM
msa2013
GROUP BY
locname,
locaddress
ORDER BY
totalsales DESC


You can do this with almost anything, and it comes in handy with more complicated queries. If you don't have spaces, you can just write the word. If you add a space, you would then have to put it in quotes and this can do crazy stuff depending on your databases. (If you did it with MySQL/Navicat like we have here, it would change the value of that field to a string instead of a number and your order would be different. So … don't alias with spaces!)


SAVE THIS QUERY AS "top locations" and take a screenshot of the SQL and results for the following quiz question: What is the name, location and amount of the 2nd highest sales location? You'll submit that in the assignment.

INNER JOIN brings multiple tables together

Our data set includes establishments from multiple counties, but if you look at the data the "loccounty" is just a number. Luckily, this data comes with a lookup table to find the county name that matches this number.


Double-click on the table "counties" and take a look. Note the "code" column … the number here matches the "loccounty" column in the msa2013 table. We can bring these two tables together with a JOIN. (Reference on MySQL, or  W3, which I think is clearer.) We'll make the join, and add fields to the SELECT statement for the code and county from "counties."


SELECT
msa2013.locname,
msa2013.locaddress,
msa2013.loccounty,
counties.code,
counties.county
FROM
msa2013
INNER JOIN
counties ON (msa2013.loccounty = counties.code)


When you call fields from more than one table, it is good practice to refer to both the table and the field to avoid conflicts. (You might have fields of the same name in both tables.)


Voila! Now we have the county name of each record.




Now, DOUBLECHECK YOUR NUMBER OF RECORDS. There should be 7915, the same as your total records. Joins can be tricky. If your lookup table has more than one match, then you might get more records than you bargained for. We'll deal with that in a bit.

More SUMs!

Now let's find out which county sold the most booze.


SELECT
counties.County,
sum(msa2013.grossreceipts) as TotalSales
FROM
msa2013
INNER JOIN
counties ON (msa2013.loccounty = counties.code)
GROUP BY
counties.county
ORDER BY
totalsales DESC


Let's break it down:
  • We are asking for the "counties.County" field. Remember the table is listed first, then the field. I capitalized County to make it pretty in the results.
  • In our second field, we are summing the "msa2013.grossreceipts" and calling it TotalSales.
  • Our main table is msa2013, so that's our FROM
  • We are using an INNER JOIN to connect to the table "counties". An INNER JOIN finds only the records that are a match in both tables.
  • We are matching on the "msa2013.loccounty" in our first table, and "counties.code" in the second.
  • Since we have a SUM, we must also GROUP BY all the non-aggregate fields in the SELECT statement. That's counties.County in our case.
  • And we order by TotalSales descending to get the most sales on the top.


The result:




Travis county sells the most. No surprise there, since they have the highest population. But do they sell the most alcohol per person? To find out, we would have to know the populations of those counties. We have a table for that with "populations." Double-click on it and take a look. This also has a "code" field for the county, and the name, and the population! We can use this table, and then do some awesome math. First, let's get the population in our query. Try this:


SELECT
population.County,
population.Population,
sum(msa2013.grossreceipts) as TotalSales
FROM
msa2013
INNER JOIN
population ON (msa2013.loccounty = population.code)
GROUP BY
population.County,
population.Population
ORDER BY
totalsales DESC


What is wrong with these results?




Since our "population" table has a Travis record for each year from 2004 to 2013, we are getting a result for each of those. Before we go off and fix it, let's look at all the Travis county records in the population table. Start a new query:


SELECT *
FROM
population
WHERE
county = "Travis"



You can see there is a record for each year's population. We can fix or original statement with a WHERE statement to to get just the year 2013-01-01.


SELECT
population.County,
population.population,
sum(msa2013.grossreceipts) as TotalSales
FROM
msa2013
INNER JOIN
population ON (msa2013.loccounty = population.code)
WHERE
population.year = "2013-01-01"
GROUP BY
population.county, population.population
ORDER BY
totalsales DESC


This is the result you want:




An aside: Dates in SQL

At the risk of confusing things, there are lots of things you can do with dates. You can look at references at W3 and the MySQL site. One that I use a lot is to extract the year from a date, which allows me to get all the records within a given year, even if the dates fall throughout that year. So you might write the WHERE part of the above query like this:


WHERE
YEAR(population.year) = "2013"


Back to sales per population

OK, now that we have the total sales and the population, we can do some match within the query to get the amount sold per person:


SELECT
population.County,
population.Population,
sum(msa2013.grossreceipts) as TotalSales,
(sum(msa2013.grossreceipts)/population.population) as SalesPerPerson
FROM
msa2013
INNER JOIN
population ON (msa2013.loccounty = population.code)
WHERE
YEAR(population.year) = "2013"
GROUP BY
county, population
ORDER BY
SalesPerPerson desc


SAVE THIS QUERY AS "county2013" and use it to answer the in-class assignment question: What is the sales per person in Travis county for 2013?. Take screenshot of the query and the result.


ON-YOUR-OWN QUEST:
The only difference between the table "msa2013" and the table "receipts" is msa2013 is for only the Austin area. You should be able to use a very similar query to the one above against the "receipts" table to find the sales per person in 2013 for every county in the state.


So … what are the top five counties in the state for sales per person in 2013? Submit a screenshot of the results that shows the county names, their values (rounded to nearest penny) and a screenshot to the homework assignment. Submit the SQL query and a screenshot of the results.


To get the rounding, you will have to search the internet on how to do that in MySQL. Think about how you might write the google question, and/or think about what sources we've used above when talking about specific MySQL syntax.

-- 30 --