Monday, October 20, 2014

Using SQL: Group by join

(Postdated lecture/handout on SQL. This uses a Navicat Essentials and a dataset that was loaded on lab machines.)



SQL - Aggregation, grouping and joining


We're going to move to using MySQL on a local computer now. The data we are going to use should already be loaded onto the lab machines (please use lab machines today.) If that fails for whatever reason the data is in Mode as "utdata.msa2013" though the column names are a little different.


Throughout these handouts SQL syntax has been written out so you have a record of what different things do. I HEAVILY ENCOURAGE YOU TO TYPE IT OUT AND NOT COPY/PASTE. You will learn more if you make typos and mistakes along the way and have to figure out why and correct it.

Alcohol tax receipts

Our next section is going to use a set of data from the Texas Comptroller's office called the "mixed beverage gross receipts tax." Basically, the state takes a cut of every alcoholic drink sold over the counter by a bartender or waiter. (This does not include beer or liquor you buy at a package or grocery store, which is instead taxed under normal sales taxes, and not definable against any other food stuff you buy.) With the gross receipts, we can find out who sells the most alcohol by the drink.


The table "receipts" is our main data, but we are going to start with a subset "msa2013" which is a table of just the records for one year from Hays, Bastrop, Travis and Williamson counties.


  • Launch Navicat
  • If you don't see "localhost" in the left-hand column
    • Go to Connection, choose MySQL
    • Click OK on the window that opens
  • Double-click on "localhost"
    • leave the password blank and hit OK. These machines have the root password set as blank.
    • If you downloaded MySQL in class, your root password was saved on the desktop of your machine
  • Double-click on "utdata" to open that database.


Double-click on the msa2013 table and it should open so you can see the columns we can take a look at it. (We'll talk about what each one means.)

How many records are in msa2013?

If you ever think you have a 1000 records on a query, be suspicious. Remember the row limit in Mode? It is present in Navicat as well. You can turn off or change the row limit in Preferences if you need to, but we won't.

Build a query

Click on Queries under "alcohol", and then the + button at the bottom of the screen. Now we can type in our query:


SELECT count(*)
FROM msa2013


You are counting how many records match *, which is everything.


SAVE THIS SEARCH AS "msa2013count"  and take a screenshot of your SQL and results and submit them with the answer to this question in the assignment. How many records are in the msa2013 table?


That would be a lot of records to sort or filter in Excel, and that is just four counties for one year. The full "receipts" table has 10 years of data for all counties in Texas. Care to guess how many rows that might be? Well, don't. Figure it out for sure and tell me!

How many restaurants do we have? Using DISTINCT.

We can get a list of all the restaurant names by searching for DISTINCT names.


SELECT DISTINCT locname
FROM msa2013


It should look something like this:




And you should get 584 names.


If you didn't need to see the actual names but just wanted to know how many there were, we could count for just distinct locnames:


SELECT count(DISTINCT locname)
FROM msa2013


And your return would just be the number: 584.


Now … looking at your data, do you think there are 584 different restaurant locations? How might you find out how many unique locations there are?


SELECT DISTINCT
locname,
locaddress
FROM
msa2013
ORDER BY
locname


Now we have 694 records.




I ordered them by locname so you can see how the names might be the same for different addresses. In our previous query, we were counting different restaurant locations as the same because they had the same distinct locname. This is an important concept to understand and it means you need to understand what fields in your database are unique to each record.

AGGREGATE FUNCTIONS: SUM to add values together

OK, now we get to some power out of this database. Let's find out which locations (name & address) sell the most alcohol!


SELECT DISTINCT
locname,
locaddress,
SUM(grossreceipts)
FROM
msa2013
GROUP BY
locname,
locaddress
ORDER BY
SUM(grossreceipts) DESC
Let's break this down:
  • with grossreceipts inside the SUM statement, we are adding all the values of grossreceipts for this query.
  • with the GROUP BY statement, we are grouping all our returns where locname and locaddress are the same. This is important. You must always have a GROUP BY statement if you ask for any other field with an aggregation function like SUM or AVG, MIN, MAX, COUNT.
  • with the ORDER BY statement, we are ordering the returns by the SUM in DESCENDING order (DESC for short). The default order is ASC or smallest at top.


It looks like this:


An aside: Using an alias

Now we are going to simplify this statement just a little, and introduce a nice concept called an alias. I hate retyping SUM(whatever) each time, so I'm going to create an alias for that value called "TotalSales", but I don't have to put it in quotes because I'm not using any spaces. I can use this alias to refer to that value later, like in the ORDER BY field below.


SELECT DISTINCT
locname as Location,
locaddress as Address,
SUM(grossreceipts) as TotalSales
FROM
msa2013
GROUP BY
locname,
locaddress
ORDER BY
totalsales DESC


You can do this with almost anything, and it comes in handy with more complicated queries. If you don't have spaces, you can just write the word. If you add a space, you would then have to put it in quotes and this can do crazy stuff depending on your databases. (If you did it with MySQL/Navicat like we have here, it would change the value of that field to a string instead of a number and your order would be different. So … don't alias with spaces!)


SAVE THIS QUERY AS "top locations" and take a screenshot of the SQL and results for the following quiz question: What is the name, location and amount of the 2nd highest sales location? You'll submit that in the assignment.

INNER JOIN brings multiple tables together

Our data set includes establishments from multiple counties, but if you look at the data the "loccounty" is just a number. Luckily, this data comes with a lookup table to find the county name that matches this number.


Double-click on the table "counties" and take a look. Note the "code" column … the number here matches the "loccounty" column in the msa2013 table. We can bring these two tables together with a JOIN. (Reference on MySQL, or  W3, which I think is clearer.) We'll make the join, and add fields to the SELECT statement for the code and county from "counties."


SELECT
msa2013.locname,
msa2013.locaddress,
msa2013.loccounty,
counties.code,
counties.county
FROM
msa2013
INNER JOIN
counties ON (msa2013.loccounty = counties.code)


When you call fields from more than one table, it is good practice to refer to both the table and the field to avoid conflicts. (You might have fields of the same name in both tables.)


Voila! Now we have the county name of each record.




Now, DOUBLECHECK YOUR NUMBER OF RECORDS. There should be 7915, the same as your total records. Joins can be tricky. If your lookup table has more than one match, then you might get more records than you bargained for. We'll deal with that in a bit.

More SUMs!

Now let's find out which county sold the most booze.


SELECT
counties.County,
sum(msa2013.grossreceipts) as TotalSales
FROM
msa2013
INNER JOIN
counties ON (msa2013.loccounty = counties.code)
GROUP BY
counties.county
ORDER BY
totalsales DESC


Let's break it down:
  • We are asking for the "counties.County" field. Remember the table is listed first, then the field. I capitalized County to make it pretty in the results.
  • In our second field, we are summing the "msa2013.grossreceipts" and calling it TotalSales.
  • Our main table is msa2013, so that's our FROM
  • We are using an INNER JOIN to connect to the table "counties". An INNER JOIN finds only the records that are a match in both tables.
  • We are matching on the "msa2013.loccounty" in our first table, and "counties.code" in the second.
  • Since we have a SUM, we must also GROUP BY all the non-aggregate fields in the SELECT statement. That's counties.County in our case.
  • And we order by TotalSales descending to get the most sales on the top.


The result:




Travis county sells the most. No surprise there, since they have the highest population. But do they sell the most alcohol per person? To find out, we would have to know the populations of those counties. We have a table for that with "populations." Double-click on it and take a look. This also has a "code" field for the county, and the name, and the population! We can use this table, and then do some awesome math. First, let's get the population in our query. Try this:


SELECT
population.County,
population.Population,
sum(msa2013.grossreceipts) as TotalSales
FROM
msa2013
INNER JOIN
population ON (msa2013.loccounty = population.code)
GROUP BY
population.County,
population.Population
ORDER BY
totalsales DESC


What is wrong with these results?




Since our "population" table has a Travis record for each year from 2004 to 2013, we are getting a result for each of those. Before we go off and fix it, let's look at all the Travis county records in the population table. Start a new query:


SELECT *
FROM
population
WHERE
county = "Travis"



You can see there is a record for each year's population. We can fix or original statement with a WHERE statement to to get just the year 2013-01-01.


SELECT
population.County,
population.population,
sum(msa2013.grossreceipts) as TotalSales
FROM
msa2013
INNER JOIN
population ON (msa2013.loccounty = population.code)
WHERE
population.year = "2013-01-01"
GROUP BY
population.county, population.population
ORDER BY
totalsales DESC


This is the result you want:




An aside: Dates in SQL

At the risk of confusing things, there are lots of things you can do with dates. You can look at references at W3 and the MySQL site. One that I use a lot is to extract the year from a date, which allows me to get all the records within a given year, even if the dates fall throughout that year. So you might write the WHERE part of the above query like this:


WHERE
YEAR(population.year) = "2013"


Back to sales per population

OK, now that we have the total sales and the population, we can do some match within the query to get the amount sold per person:


SELECT
population.County,
population.Population,
sum(msa2013.grossreceipts) as TotalSales,
(sum(msa2013.grossreceipts)/population.population) as SalesPerPerson
FROM
msa2013
INNER JOIN
population ON (msa2013.loccounty = population.code)
WHERE
YEAR(population.year) = "2013"
GROUP BY
county, population
ORDER BY
SalesPerPerson desc


SAVE THIS QUERY AS "county2013" and use it to answer the in-class assignment question: What is the sales per person in Travis county for 2013?. Take screenshot of the query and the result.


ON-YOUR-OWN QUEST:
The only difference between the table "msa2013" and the table "receipts" is msa2013 is for only the Austin area. You should be able to use a very similar query to the one above against the "receipts" table to find the sales per person in 2013 for every county in the state.


So … what are the top five counties in the state for sales per person in 2013? Submit a screenshot of the results that shows the county names, their values (rounded to nearest penny) and a screenshot to the homework assignment. Submit the SQL query and a screenshot of the results.


To get the rounding, you will have to search the internet on how to do that in MySQL. Think about how you might write the google question, and/or think about what sources we've used above when talking about specific MySQL syntax.

-- 30 --