Wednesday, October 15, 2014

Using SQL: The Select statement, using Mode Analytics

(Postdated lecture notes/handout)

SQL: SELECT FROM WHERE ORDER BY

Throughout these handouts SQL has been written out so you have a record of what different things do. I HEAVILY ENCOURAGE YOU TO TYPE IT OUT AND NOT COPY/PASTE. You will learn more if you make typos and mistakes along the way and have to figure out why and correct it.

There aren’t that many commands in SQL, and you can guess what they do by looking at them.

SQL commands
  • SELECT
    • used to find data
  • INSERT
    • used to add data
  • UPDATE
    • used to change data
  • DELETE
    • used to remove data

SELECT is your workhorse. You’ll build your SELECT statements with some basic keywords:
  • SELECT
    • followed by the fields you are looking for
  • FROM
    • the table (or tables) where your are looking
  • WHERE
    • certain conditions exists.
  • GROUP BY
    • deals with how you combine data
  • HAVING
    • allows you to filter data after it is grouped
  • ORDER BY
    • the sorting order your want your data

SELECT and FROM are required.

The syntax is a lot like English, so at it's basic level it is simple to understand, but it can do a lot.

SELECT
 name,
 address,
 hometown
FROM
 customers
WHERE
 hometown = 'Austin';

That says give me the name, address and hometown of all the customers that have a hometown of Austin. Simple, right?

The SELECT basics

We are going to use ModeAnalytics to learn basic SQL. Log into Mode. (You should've gotten an invitation from me already and already have created a login.) Click on New Query.

For the following exercise, we are going to use a table called "est2013_48" in the schema "utdata." It sometimes helps to have the table open to reference, so in a new browser tab go to this url:


Using SELECT to view everything

We are going to query some population estimate data that is already uploaded into Mode. Type in the following, and then click Run.

SELECT *
FROM utdata.est2013_48;

What it means

  • SELECT means to find, and it is followed by what fields you want. In this case, we want every field, so we are using the wildcard *
  • FROM utdata.est2013_48 means to look in the schema “utdata” for the table “est2013_48”.
  • The semicolon at the end of the statement says that it is ended. It’s like -30- at the end of a story. (Wait, you youngsters probably don’t get that.)

(I’ll take a minute to explain what each column means in this data set because I know the data, but I’m not going to write it out here.)

Using LIMIT

Did anyone notice how many results we got? Do you think there are only 100 records in this data set?

LIMIT saves you time by only returning a limited number of records. SQL managers like Mode and MySQL Workbench set a base limit by default, but you can override it.

Click off the “Limit 100” button at the top of the Mode window and see how many records you get. Should be 3102.

You can also set the limit within your query:

SELECT *
FROM utdata.est2013_48
LIMIT 2000;

Adding the LIMIT to your query will override Mode’s “Limit” button. For now, you can turn that button back on and set it to 100.

Use SELECT for just some information

But maybe we don’t want to show every columns. Maybe we only want the fields for the city name and estimates for each year. So the fields we need are: name, estimatesbase2010, popestimate2011, popestimate2012, popestimate2013. So we will select those by name, separating them with a comma.

SELECT
 name,
 estimatesbase2010,
 popestimate2011,
 popestimate2012,
 popestimate2013
FROM
utdata.est2013_48;

You’ll see that I’ve indented some of the lines. This is good coding practice because it helps you discern what goes where and why.

Setting column header names

Your data may have long or hard to understand names with not spaces, etc, but you can rename them by using as "New Name" in your query. Use double quotes for these names:

SELECT
 name as "City",
 estimatesbase2010 as "2010",
 popestimate2011 as "2011",
 popestimate2012 as "2012",
 popestimate2013 as "2013"
FROM
utdata.est2013_48;
What we are doing here is just changing the name of the column that is displayed.

Syntax differences

I may have mentioned somewhere that SQL syntax may differ depending on the database and client you use, and that comes into play here. Mode wants us to use double quotes when when we are referencing column names, and single quotes when we are representing values within the data, which we do below. This is different than MySQL/Navicat, where you can call values in double quotes. You just have to deal with the nuances. (For instance, I sent a support ticket into Mode when I wanted to fine the name value 'CHILI'S GRILL & BAR' because the query got confused with the apostrophe in CHILI'S. The answer was I used two apostrophes inside the quotes: 'CHILI''S GRILL & BAR' …which looks like a doublequote, but isn't. Confused? Yeah, me, too.)

Let's add some ORDER (BY)

ORDER BY is the term for sorting in SQL. You'll notice that our cities are in no particular order at all, and some are repeated but not together. It's kind of a mess. Let's ORDER BY name.

SELECT
 name as "City",
 estimatesbase2010 as "2010",
 popestimate2011 as "2011",
 popestimate2012 as "2012",
 popestimate2013 as "2013"
FROM
 utdata.est2013_48
ORDER BY
 name;

Using WHERE to filter data

What if we don’t want all the data? What if we only want certain data where a condition is true?

Did anyone notice that some cities are listed more than once in our data? This data includes population estimates not only for "places" but it also breaks out where different places may cross boundaries, like how parts of Austin are in both Travis and Williamson counties. It’s quite a mess. Just believe me when I say if we want just the populations of a whole "place", we want to find the records where sumlev = 162, but first we're going to add the sumlev and county fields so you can see the filtering.

SELECT
 sumlev,
 county,
 name as "City",
 estimatesbase2010 as "2010",
 popestimate2011 as "2011",
 popestimate2012 as "2012",
 popestimate2013 as "2013"
FROM
utdata.est2013_48
ORDER BY
 name;

Now you can at least see how the sumlev and county relate to each other. Now let's use the WHERE keyword to filter these to just the rows with a sumlev of 162

SELECT
 sumlev,
 county,
 name as "City",
 estimatesbase2010 as "2010",
 popestimate2011 as "2011",
 popestimate2012 as "2012",
 popestimate2013 as "2013"
FROM
 utdata.est2013_48
WHERE
 sumlev = '162'
ORDER BY
 name

A couple of notes about this line:
  • We are putting '162' in single quotes because this field is considered a string instead of a number. This is a quirk of census data. If this were really a number, we wouldn't put it in quotes. (Numbers come next.) But sometimes these codes start with 0, which would be dropped if this field were made into a number, and that would cause problems for us later on. But notice the use of single quotes here when you are referencing a value in the table.
  • We are using an equals sign here like a math operator. And yes you can actually do math here, too. (Also next)
  • Also note the syntax difference here with Mode. In MySQL/Navicat, I would've used sumlev = "162".

Compound WHERE and operators

So, now we are going to add on another filter. We want just the cities (sumlev = '162'), but only with a 2013 population greater than 50,000.

SELECT
 sumlev,
 county,
 name as "City",
 estimatesbase2010 as "2010",
 popestimate2011 as "2011",
 popestimate2012 as "2012",
 popestimate2013 as "2013"
FROM
 utdata.est2013_48
WHERE
 sumlev = '162' AND popestimate2013 > 50000
ORDER BY
 name;

QUIZ QUESTION
Take a screen shot of your SQL and the result and use it to answer the question in today's assignment: What were the estimated populations of Abilene and Austin for the years 2010 and 2013?

-- 30 --

0 comments:

Post a Comment