Wednesday, June 27, 2012

Much Ado About Nothing


Surprisingly, one of the most complex topics in statistics is dealing with nothing.  "Nothing" in our data can be anything from a survey question that was accidentally left blank all the way to an actual data value of zero.  The trick to handling “nothings” correctly is to understand why they are there in the first place.

The simplest case is when we know we have a legitimate value of zero in our data.  For example, if we asked people how many times they used a park in the last year, we would expect some of them to really not have used the park at all.  These nothings are real measurements, so we do not need to take any special action.

Another common case is when a data point is missing.  For example, say we ask for a rating of service and there is no response.  When this happens we should leave these responses blank (or, if using a more advanced software package, use its missing-data code).  We absolutely do not want to replace these missing values with 0's, or any other value, since this will bias our results.  See the table for an example.
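As a rough illustration of that bias, the sketch below averages a set of made-up service ratings on a 1 to 5 scale in two ways: once skipping the blanks, and once with the blanks replaced by 0.  The ratings and the names `ratings`, `mean_skipping_missing`, and `mean_zero_filled` are hypothetical, invented only for this example.

```python
from statistics import mean

# Hypothetical service ratings on a 1-5 scale; None marks a blank response.
ratings = [4, 5, None, 3, None, 4]


def mean_skipping_missing(values):
    """Average only the responses that were actually given."""
    answered = [v for v in values if v is not None]
    return mean(answered)


def mean_zero_filled(values):
    """Average after (incorrectly) replacing every blank with 0."""
    filled = [0 if v is None else v for v in values]
    return mean(filled)


print("Skipping blanks:", mean_skipping_missing(ratings))
print("Zero-filled:    ", round(mean_zero_filled(ratings), 2))
```

With these made-up numbers, skipping the blanks gives an average of 4, while zero-filling pulls it down to about 2.67, which is lower than any rating a respondent actually gave.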



Now we move on to trickier situations.  There are times when we have no response to a question, but the lack of a response still tells us something.  For example, say we hastily made a survey and included a question that read:

Check the political party that you consider yourself:

  • Republican
  • Democrat


What if we get a survey back that has neither option checked?  Did the respondent skip the question, or is the person a member of a different political party (Unaffiliated, Libertarian, etc.)?  We can't tell for sure.  Careful design of the survey could have prevented this by including a third “other” option.
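One way to picture the fix is to record "other" and "left blank" as two different things when the data is coded.  The sketch below is only an assumption about how such responses might be stored: the `Party` enum, the `tally` helper, and the sample responses are all hypothetical.

```python
from collections import Counter
from enum import Enum


class Party(Enum):
    REPUBLICAN = "Republican"
    DEMOCRAT = "Democrat"
    OTHER = "Other"  # the third option the hasty survey lacked


# Hypothetical responses; None means the question was left blank.
responses = [
    Party.REPUBLICAN,
    Party.OTHER,
    None,
    Party.DEMOCRAT,
    Party.OTHER,
    None,
]


def tally(answers):
    """Count each choice, keeping blanks in their own 'No response' bucket."""
    labels = ["No response" if a is None else a.value for a in answers]
    return dict(Counter(labels))


print(tally(responses))
```

Because "Other" is now an explicit choice, a blank can only mean that the question was skipped, and the two kinds of nothing no longer collapse into one.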

An interesting example of nothing meaning much more than zero is the case when we have a “limit of detection”.  Say we want to learn how much people speed on a certain road by looking at records of speeding tickets.  When we look at this data we see that there are almost no tickets issued for 1 to 3 miles per hour over the speed limit.  Does this mean that no one drives at these speeds?  Surely not; it is most likely that people who drive only a couple of miles per hour over the speed limit are just not ticketed.  This is a limit of detection problem: we know there should be values in our data that aren't showing up because they are too small.  To avoid this, we could use a less biased data source that does not have this problem.
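The effect can be sketched with a small simulation.  Everything in it is assumed for illustration: the "true" speeds are drawn uniformly between 1 and 20 miles per hour over the limit, and the hypothetical `TICKET_THRESHOLD` says that nobody below 4 miles per hour over is ticketed.  Real speeds and real enforcement will look different.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical "true" amounts over the limit (mph) for 1,000 speeding drivers.
true_over = [rng.uniform(1, 20) for _ in range(1000)]

# Assumed limit of detection: slower speeders never receive a ticket.
TICKET_THRESHOLD = 4

# The ticket records only contain drivers at or above the threshold.
ticketed = [s for s in true_over if s >= TICKET_THRESHOLD]

true_mean = mean(true_over)
ticket_mean = mean(ticketed)

print(f"Drivers speeding:      {len(true_over)}")
print(f"Drivers in ticket data: {len(ticketed)}")
print(f"True average over limit:   {true_mean:.2f} mph")
print(f"Ticket-data average:       {ticket_mean:.2f} mph")
```

Since every driver dropped from the records was going slower than every driver kept, the average taken from the ticket data is necessarily higher than the true average.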

These four issues just scratch the surface of possible problems with nothing in our data.  While this may seem intimidating, most cases do fall under these general categories.  So take some time to think about what the nothing means in your data.  You may be able to glean more information from it than you initially thought.