(Postdated)
The following are the lecture notes used during a Data Cleaning module of the class.
The following are the lecture notes used during a Data Cleaning module of the class.
Handout - Data Cleaning
Mr. People
Mr. People is an online tool to help you parse a list of names. We’re covering it first because it is quick and easy to show. It helps you take names that are all in one field and split them into first name, last name, etc. so you can better work with the data.
- Use the example data there to parse into a Table.
- Review the results. Note where it was able to parse and where it couldn’t.
- If you parse to tab-delimited text, you can then copy ‘n’ paste it Excel.
More tools
- Mr. Data Converter converts CSV files to JSON and other programming languages.
- Tabelizer creates an HTML table out of your CSV. Not really cleaning, but a sometimes useful tool.
Open Refine
Open Refine used to be called Google Refine, as it was a project by a Google engineer. It’s now free and open source.
Getting started
- Go to openrefine.org. Find Download, and download the Mac Kit to your desktop. Double-click on the .dmg file to open.
- Open Refine is available for Mac, Windows and Linux. It is free.
- Instead of dragging to Applications (you probably don’t have access), drag it to your own Documents folder.
- Double-click the Refine gem icon in your Documents folder to launch the app. This will open it in Chrome at http://127.0.0.1:3333/. You can use Firefox or Safari if you have to, but stay away from Internet Explorer.
Start a project
- Open and inspect it in Excel. What might be wrong with it? Close it
- In Refine, go to Create Project. Use Choose Files to find your worksheet and then click Next.
- Review the import settings. We don’t need to change anything this time, but you might in other cases. Name the project at the top and Create.
Getting around
- Change the number of rows you can see
- Page through the data using next, last, previous, first.
Fill down
We want the city to be present in all our data. We “know” that the city was entered only for the first value, and that the other rows get that same value until it reaches a new city. So we’ll fix that.
- Under the Area column, click the drop down to Edit cells > Fill down.
- Page through your data to see what it did further down when the city changed to Hereford.
Create a new column
You usually want to keep your original data so you can refer back to it, so we’ll make a new column before we change our data significantly.
- On the Street column, use the dropdown to Edit Column > Add a column based on this column.
- Name the new column “New Street.”
- Note the Expression statement. It simply has “value” which is saying just to take the current value of the old column and put it in the new column. The preview shows you what the old and new column will be. While we won’t do so right now, it is possible to make change to the values (the content) as you create the new column using GREL, a programming language specific to Refine.
- Click OK to create the new column.
Check out Undo
- Click on the Undo/Redo column and note that you can back out of the changes you’ve made. This is handy … allowing you to experiment in Refine.
- Try the Undo/Redo steps.
Search and replace
We’re going to fix the funky character that is there instead of spaces. We are going to use the GREL programming function “replace”. GREL is very powerful, and there is lots to learn, but we’ll stick with this one string statement today.
- In your New Street column, go choose the dropdown and go to Edit Cells > Transform.
- Now we’re going to use the Expression replace(value,”search”,”replace”) to make a change. Type** this into the Expression: replace(value,“”,””).
- ** Be sure to type the expression into Refine and DO NOT copy and paste from this doc. If you copy and paste you will get a parse error because it won’t understand the double quote marks.
- Remember that “value” means whatever is already in the cell, so we’ll keep that as it is. Since the thing we are searching for is the funky å character, copy and paste one of those from the preview window into the first set of quote. We are replacing it with a space, so put a space between the second pair of quotes.
- You can see the before and after in the preview window. Work on it until it looks right.
- Click OK. Note that Refine tells you how many cells were edited. Sometimes that is helpful to know if you got everything right.
Change case
Now we’ll fix the title case on all the street names.
- Go to the dropdown for NewStreets and choose Edit cells > Common transformations > To title case.
- This changes all the words in these cells. Sometimes title case won’t work if the contents aren’t all proper names, so you might have to go UPPERCASE.
Text facets
Now we’ll fix some of the street names.
- Choose the dropdown for New Streets and choose Facet > Text facet.
- This groups all the the cells that are the same together in the pane on the left, and tells you how many rows match that value.
- You can pick through that list and fix multiple cells at once by hovering over the item, choosing Edit, and then making your changes. Edit both the records for Armarda Pub, Gravelly Hill and compare them to see the different. Are there any common transformations that might fix that?
- On New Streets go to Edit cells > Common transformations > Collapse consecutive whitespace. Do the same for Trim leading and trailing whitespace.
Clustering
There are still 200+ choices for this field, and that is quite a lot to go through. There is a tool that can help you find similar fields.
- In the New Street facet pane on the left, click the Cluster button.
- This will bring up a window that finds clusters of cells based on an algorithm. There are a couple of different Methods and several Keying functions for each. You’ll use these to find likely matches and merge them. There is a link there if you want to learn more about the different algorithms.
- If you click on a value in the “Values in Cluster” column, it will activate the Merge checkbox and and set the New Cell Value column to that value. Sometimes you want to edit that to a new value.
- Make this first value “High Street, Erdington”.
- Choose the best value for the other matches (note you can skip a group if needed.)
- Click Merge Selected & Re-Cluster.
- Change the Keying Function to “ngram-fingerprint”.
- Note you can change the Ngram size. Change it to 10 to see what happens. Make your changes and then Merge Selected & Re-Cluster.
- Change the Keying Function to “metaphone 3”. With this one, change the sliders to the right of the values and see what happens. (I haven’t used these much, but could see how it might be useful to focus on certain things.)
- Continue on with the other Keying Functions, then try changing the Method.
Exporting
Once you’ve done the cleaning you want to do, you can export your file back out of Refine in one of several formats.
- Go under the Export link at the top-right of the page to Excel (or whatever format you want).
- This will download the file to our Downloads folder.
Practice with Refine
- Open the file in Excel and poke around at the data to see potential problems.
- What features in Refine might help you clean up this file?
- Do them!
More Clustering practice
- What might you be able to clean in this file in Refine?
- What methods might you use?
Resources to help with Refine
- You can Google just about anything about Refine.
Excel
Sometimes you are in Excel and you might be able to do some cleaning there instead of going back and forth to Refine. These are just some of the formulas that you might find useful. (Here also is a list of all the text functions for Excel.)
LEFT, RIGHT
Sometimes you want to pull just part of a field out, like the Zip code from an address. If what you want is consistent, you can pull just that from a cell into a new column. In this example, I wanted to make a real date out of separate month and year columns.
- Note the date column. It wasn’t there when I started. I had to build it from the date_of_incident (really a month) and year_of_incident columns. So delete that “date” column now.
- Create a new column to the left of the “time_of_incident” column. We are going to put parts of C and D there, but in a mmm-yyyy format, so Feb-2009. Then I’ll convert that later to a real date.
- The formula to get part of a string from a direction is: =direction(cell,number_of_chars)
- In the new E2 cell, enter this (don’t copy/paste): =LEFT(C2,3)&”-”&D2
- Copy that formula all the way down.
- Now click on the heading for that column and try to change the format of that column (command-1) to a different date format, 3/14/01. Nothing happens. Excel can’t change it to a date because it is text. This is a difficult thing to fix, but we will. We will create a new column, set the date value to be mmm-yyyy, and then copy and paste the values of our strings into that field.
- Create another new column to the left of “time_of_incident”.
- Click on the column header and do command-1 to get Format.
- Click on Category > Custom, and Type in mmm-yyyy. We’ve prepared this column to accept our new date format, but no matter how hard we try we can’t copy and paste from Excel into this column and make it work. Excel will always thing we are pasting just text. So we’ll fake it out by copying the contents of the column into Text Wrangler, then copying from there back into Excel.
- Click on the header of your created column with your built dates and copy that column.
- Launch Text Wrangler and paste the text into a new window.
- Make sure everything is selected again, and copy it.
- Now go to the first cell of your new column and paste it in. Now the dates should line up on the right instead of the left of the cell, which is an indication that it is a date.
- Click on the header of that column, do command-1 to get your formatting window, and change the date to *3/14/01. Make sure the dates are right.
Text to columns
We want the time_of_incident column to have a separate start time and end time. We will separate the content based on the hyphen.
- Create two empty columns to the left of subject_ethnicity. (you can do this by highlighting two columns at once, then right-clicking on the first of those and choosing Insert.)
- Copy and paste the time_of_incident into the left-most of the new columns. (We are preserving the original.)
- Click on the time_of_incident column. Go to the Data ribbon to Text to Columns.
- Choose Delimited, then hit Next, then for Delimiters, check the Other box and put in a hyphen. This will split the column at the hyphen. Hit Next.
- On the next screen, you’ll see a preview of what will happen. Hit Finish.
- You now have two columns. Name the first one Start Time and the other End Time.
Trim
Now that we created these two time columns, one of them is kinda whacked. If you edit one of the cells for End Time, you’ll see there is a leading pace. We can fix that.
- Create a new column to the right of End Time. Your End Time should be column H and the blank one column I. (If not, that’s OK, but you’ll need to adjust the formulas below.)
- The formula for to remove leading and trailing whitespace is: =TRIM(cell)
- In cell I2, put in this formula: =TRIM(H2)
- Copy that formula down the column.
- Now we need to copy the values of these cells back into this column, so we have the text instead of the formula.
- Click on the I2 column header and copy the contents.
- Right-click on the I2 column to select it then choose Paste Special, and then choose “Values”.
- Now you have your new, trimmed End Time column. You can delete the bad column.
Filtering to cluster
You can manually cluster and fix fields in Excel using Filter.
- Go to the loc_name field and highlight the column.
- Go to the Data ribbon and then click the Filter button (it looks like a funnel).
- Click on the dropdown that was created for that column, and then look at the values. We’re going to fix Berkman Dr.
- Click on the “Select All” button so no items are selected, and then choose the two Berkman values. Close the filter window.
- Now you can copy the first cell in the list (which is correct) and then use the magical copy point to apply it to the rest of the values there.
- If yuo have a lot of data, sometimes the magical copy point won’t work all the way to the bottom of the list. You might have to copy the value of the correct cell into your clipboard, and then select all the cells, and then copy the value into them.
Other tools to help
- Fuzzy Lookup add-in only works on PCs. It’s good for matching names or terms that aren’t quite the same. We’ll look at this one closer in our next segment.
Regular Expressions
Regular expressions (or regex for short) is a programming language used match patterns and is very useful to search and parse data. This lesson is just an introduction to a very powerful programming skill. We’ll be using the Regex Cheat Sheet. Download a copy for yourself.
- This is a list of drugs and their dosage that were in a single column in Excel. We want to split this into three different fields: Drug, Amount, Type.
- You couldn’t do a simple search and replace to separate these … the combinations are too complex. We need MORE POWA!
- We’ll do this by creating three groups of patterns using Regex, and then we will replace those patterns putting a tab in between so we have have different columns.
- Copy the contents of OpiodDrugs.
- We will put our expression in the top field. As we create groups, they will be displayed in the Match Groups below.
- Capturing the Drug
- We know that the drug is always at the beginning of the string, so we will put in the character for that: ^
- We know we want our drug name as a group so we will put in parenthesis to hold them: ^()
- We know we want to grab certain types of characters, mainly letters, so we are going to build an expression inside brackets to say why type of characters we want. In fact, we want everything until we find a number, which starts the Amount part of the string. So, we’ll fill that bracket with the kinds of characters we need. Start with the brackets: ^([])
- We know we only have uppercase letters, so we’ll start with A-Z. That means anything between the uppercase A and the uppercase Z: ^([A-Z])
- Well, that got the first character. We need more of them, so let’s use the * to get "zero or more of".: ^([A-Z]*)
- Take a minute to look down at the Match Groups. This is where you’ll see our progress in making our groups.
- That got the first word of each drug, but some drugs have a slash between them, and we need them both. Slashes are special characters in Regex because they can be a command, too. As such we have to “escape” the slash to let Regex know we are talking about the piece of text and not a command. You do that with a backslash. So, we’ll add the escaped slash to our brackets: ^([A-Z\/]*)
- We’re almost there. We have one drug, the FENTANYL TRANSDERMAL PATCH that has spaces. In fact, the TRANSDERMAL PATCH part of that is probably a “Type” like TABLET, but I think we’ll have to deal with that later. We’re going to add a space to our character brackets so we can keep this in our Drug group: ^([A-Z\/ ]*)
- Capturing the Amount
- Looking at this, our amount always have numbers, and then some characters that follow it, until you get to a space. So, if we can grab all those combinations to a space, we’ll be golden. We can also deal with the space between the groups so we don't include them in our group.
- We’ll start by creating our new group with new parenthesis: ^([A-Z\/ ]*) ()
- Now we’ll add our first character class, which are numbers. It could be any number from 0-9. We put them in the character brackets: ^([A-Z\/ ]*) ([0-9])
- OK, with the last one, we got a single number, but we need more than one, we we’ll add the * to catch “zero or more”: ^([A-Z\/ ]*) ([0-9]*)
- OK, we now we need to grap the slash, which we have to escape again with the backslash. Let’s add it: ^([A-Z\/ ]*) ([0-9\/]*)
- Now we need to get the series of letters for the amount. We could put in just the characters we need - MCGHR - but we’ll go ahead and grab all uppercase letters: ^([A-Z\/ ]*) ([0-9\/A-Z]*)
- We got all the amounts!
- Get the rest: The type
- Now we need to account for the space between the Amount and the Type, so add your space after. When you do, look what happens to Match 30 and 31. We lost our Amount there ... why? Because that group does not have a space after it. We can account for that with the * command, which is for "zero or more of".
- OK, we’re going to start our third group by adding some parentheses.: ^([A-Z\/ ]*) ([0-9\/A-Z]*) *()
- We know we need just uppercase letters, so let’s add those in our character brackets: ^([A-Z\/ ]*) ([0-9\/A-Z]*) ?([A-Z])
- Make sure to check all your groups, because Match 30 & 31 are problematic again. Add our * and we are good: ^([A-Z\/ ]*)([0-9\/A-Z]*)([A-Z]*)
- We’re almost there. We need to account for the spaces and commas, so we'll add both of those in the brackets: ^([A-Z\/ ]*)([0-9\/A-Z]*)([A-Z ,]*)
- So now we’ve captured all our content and we are storing them into three groups. We’ll be able to reference that group in a search and replace in Text Wrangler.
- Copy your Regular Expression from rubular and then go over Text Wrangler and open the Opiods file if you haven’t already.
- Do Apple-F to open the find and replace window. In the Find section, paste in your Regular Expression.
- In your paste section we will reference our groups using \1 for the first group, and \2 for the second, etc. In between these groups, we have to put a tab, which is a command character written as \t. So the replace field will look like this: \1\t\2\t\3
- Once last thing, on matching you have to click the GREP button so the search and replace will recognize you are using Regular Expressions.
- Hit Replace All, and you’ll end up with tabs where you want them. Copy and paste it into Excel so you can see.
- So, there you have it … your data in separate fields. You will need to run TRIM on those columns as we didn't get quite all spaces, but it's a start.
There is another Regex helper: http://regex101.com/, which is pretty good as well. It has lots of helper items to help you. You'll need to add "gm" into the modifier field to the right of the expression to make it work with this example.