Thursday, April 26, 2012

More doesn't mean better


Say we would like to know the average income of households in an elementary school district.  We don't have much money for this survey, so we could either randomly select 10 adults from the school district and survey them, or ask all 400 kindergartners through 4th graders what their parents make.  Which of the two samples should get us closer to the truth about the average income in the area?  Most likely those 10 adults will give us a better figure than the 400 kids.  Why is this?  Isn't more data better?  Well, yes and no.

All else being equal, the more data we can collect, the better the picture we are going to have.  But often, when comparing alternative methods for collecting data, all else isn't equal.  Usually the options boil down to collecting either a lot of poor-quality data or a smaller amount of higher-quality data.  Unfortunately there isn't a single rule that lets us pick which is right for our data collection needs, but there are some general rules to consider.

1) Quality of data usually trumps quantity of data. We can often learn just as much from a little bit of good data as from a lot of poor-quality data, with the added benefit of spending less time and money collecting it. The example above seems silly because we know how wrong kids can be about their parents' income. However, many times we collect equally poor data because it is easy.  Quality always needs to be considered.

2) If our data is biased, it doesn't matter how much we collect.  If the data we collect does not reflect the truth, then collecting more of it only gives us a more precise estimate of the wrong answer; it will always lead us in the wrong direction.  Care needs to be taken that the way the data is collected allows for accurate results.  For example, if the police would like to learn about teen drug use, they will not get accurate answers if a uniformed officer personally asks each kid when they last used drugs.

3) Randomizing protects us from potential problems.  One of the reasons randomization is so powerful is that it reduces the risk of getting an unrepresentative sample.  For instance, if we wanted to accurately gauge support for each candidate during an election, it would be much better to randomly select 100 people and ask whom they support than to post the question on one of the candidates' web pages and collect 10,000 responses.  People who visit a candidate's web page are far more likely to favor that candidate than the general public is.  So our web page sample doesn't correctly represent our population.
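
To make the election example concrete, here is a small simulation sketch. All of the numbers in it are made up for illustration: a population of 100,000 voters, true support of 45% for the candidate, and the assumption that supporters are four times as likely as everyone else to answer a poll on the candidate's web page. The function name simulate_polls is hypothetical too. It is only meant to show the general pattern, not to model any real election.

```python
import random
import statistics


def simulate_polls(seed=0):
    """Compare a small random sample with a large self-selected one.

    Every number here is an illustrative assumption, not real data.
    """
    rng = random.Random(seed)

    # Hypothetical electorate: 1 = supports the candidate, 0 = does not.
    # True support is 45% by construction.
    population = [1] * 45_000 + [0] * 55_000
    true_support = statistics.mean(population)

    # Option A: randomly select 100 people (without replacement).
    random_sample = rng.sample(population, 100)

    # Option B: 10,000 web page responses. We assume supporters are
    # 4x as likely to respond, which is what makes the sample biased.
    weights = [4 if voter == 1 else 1 for voter in population]
    web_sample = rng.choices(population, weights=weights, k=10_000)

    return (
        true_support,
        statistics.mean(random_sample),
        statistics.mean(web_sample),
    )


if __name__ == "__main__":
    truth, random_est, web_est = simulate_polls()
    print(f"true support:          {truth:.3f}")
    print(f"random sample (100):   {random_est:.3f}")
    print(f"web sample (10,000):   {web_est:.3f}")
```

Under these assumptions the 100-person random sample bounces around the true figure from run to run, while the 10,000 web responses settle very precisely on a number that is far too high. Raising the web sample to 100,000 or a million responses would not help, because the extra data carries the same bias.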

The main thing to remember is that quality and quantity of data need to be balanced.  More data is better, but poor-quality data can only tell us part of the picture at best.  If questions arise about methods for collecting good data, or how much data is needed to get an accurate picture, then your local statistician can lend a helping hand.  Remember, the whole point of collecting data is to make sure that we are pointed toward the truth.
