Monday, January 13, 2014

KO Your Data with DQ

The decisions you make using data are only as good as the data themselves. When you look at a score in your gradebook, review a report from the state about test scores, or access a national database with discipline data, you're depending on quality data.

There are lots of things that go into building data quality (DQ). Some aspects are more technical---does the student's name use alphabetic characters (no numbers or symbols)? Others are more utilitarian---is the student's name in correct column? Is it spelled correctly? Is "John Smith" the same as "Johnathan Smith" or "Johnny Smith" or "J. Smith"? And some attributes are more about what the data represents---is the score for John Smith really his (and not some other student's)? Is the score in the box from the most recent math assessment (and not the reading assessment)?

Data quality isn't just one person's responsibility. It's everyone's job, whether you're collecting, entering, or using data. It's also a far bigger issue than I can cover in one post. If you're looking for additional resources, check out the Data Quality Campaign. I also advise you to develop your own bullshit detector, because a lot of what going into using data sets is questioning whether or not they make sense. For example, if one school is reporting 60 suspensions a year...and another school in the same district that is the same size is reporting only 2...you might want to ask a few questions.

For today, however, let's talk about a few things you can do in Excel to help with the technical components of data quality.

Excel has several options for cell formatting, including all sorts of customizations you can design. Sometimes, it even tries to be helpful by deciding what format you meant. Perhaps you, too, have been the victim of adding data like 9 - 12 only to have Excel automagically change it to 12-Sep. And when you change the format to specify Text, it further mangles your original entry to 41894 so you have to start all over again. Things like this show you how easy it is for data quality errors to creep into your workbook.

Other times, you may inherit or download a workbook where data quality has not been a priority. At this point, looking for data quality issues becomes a lot like data cleaning.

Now is the time to whip out a few of those functions that you always wondered why they were there: ISTEXT, ISNUMBER, ISNONTEXT, ISLOGICAL, ISBLANK. I know, these seem like "duh" formulas---why do you need a function to tell you whether or not there is text in the cell? Because like our sample above and our list of options, looks can be deceiving.

Here are a couple of examples of how they work. For a full list of IS functions, definitions, and examples, visit this page. Using a formula like ISTEXT or ISNUMBER returns a value of TRUE (Yes, the information in the cell is text/number.) or FALSE (No, the data in the cell isn't text/number.) They are written like this (evaluating whether "Excel Rocks" is text):





And look like this after you hit Enter:





Here's how things look for a FALSE statement using ISNUMBER:





And after:





But let's face it, you're not likely to create an an entirely new table of true/false values and then identify things from there. It's more likely that you would combine these with conditional formatting to highlight the types of data you have.

If you have a data table like this one (data source here):


You can combine these simple statements with some conditional formatting to get a quick visual for your data and easily spot problems.


I've applied two rules to the section of the table with the data. The first identifies numbers and the second text. I chose some fill for the cells, but you could just as easily pick something else.


Remember, Excel can be conflicted about things. For example, the cells with the dashes are formatted as numbers, but Excel is still reading them as text. This is why it's always important to do a quick check for data quality issues either before you submit or before you use a data set.



Depending upon your role, you have some decisions to make. We have cells in our data table that are blue---meaning that they contain text. If these were data I was going to upload to a database or otherwise share, then I need to address these DQ issues. If I'm the database administrator, I might wonder why these cells aren't calculated like the others and need to investigate some code. If these are data that I am going to use for making decisions about where to direct resources for next year, then I'm going to need to find additional information.

Keep in mind that Excel functions a bit like a spell checker in these situations. It can tell you if the right sort of data is in the right spot---but it can't tell you if the names and numbers are the correct ones for those cells. You'll need additional validations for that...or a really good eye and careful work.

Data quality starts with you---from how you enter attendance, label your gradebook, or even type a student's name. But you can use Excel to help you along the way. Building a culture in your school that includes DQ at all stages can lead to better decisions and outcomes.

No comments:

Post a Comment