Thursday, April 26, 2012

More doesn't mean better


Say we would like to know the average income of households in an elementary school district. We don't have much money for this survey, so we could either randomly select 10 adults from the school district and survey them, or ask all 400 kindergartners through 4th graders what their parents make. Which of the two samples should get us closer to the truth about the average income in the area? Most likely those 10 adults will give us a better figure than the 400 kids. Why is this? Isn't more data better? Well, yes and no.

All else being equal, the more data we can collect, the better the picture we are going to have. But when we compare alternative methods for collecting data, all else usually isn't equal. The options often boil down to collecting either a lot of poor quality data or a smaller amount of higher quality data. Unfortunately there isn't a single rule that lets us pick which is right for our data collection needs, but there are some general rules to consider.
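To make the school district example concrete, here is a small simulation sketch. Every number in it is made up for illustration: the household incomes, the size of the district, and especially the assumption that a child's guess lands somewhere between 20% and 120% of the true income. It simply repeats both surveys many times and compares how far each one lands from the true average.

```python
import random
import statistics


def simulate_once(rng):
    """Run both surveys once and return each one's distance from the truth."""
    # Hypothetical district: 2,000 households with invented incomes.
    incomes = [max(10_000, rng.gauss(60_000, 20_000)) for _ in range(2_000)]
    truth = statistics.mean(incomes)

    # Option A: 10 randomly selected adults who report their income accurately.
    adults = rng.sample(incomes, 10)
    adult_error = abs(statistics.mean(adults) - truth)

    # Option B: 400 children guessing. Assumption: each guess is the true
    # income times a random factor between 0.2 and 1.2 (too low on average).
    kids = [income * rng.uniform(0.2, 1.2) for income in rng.sample(incomes, 400)]
    kid_error = abs(statistics.mean(kids) - truth)

    return adult_error, kid_error


def average_errors(trials=200, seed=0):
    """Average the two survey errors over many simulated districts."""
    rng = random.Random(seed)
    results = [simulate_once(rng) for _ in range(trials)]
    adult_avg = statistics.mean(r[0] for r in results)
    kid_avg = statistics.mean(r[1] for r in results)
    return adult_avg, kid_avg


if __name__ == "__main__":
    adult_avg, kid_avg = average_errors()
    print(f"Average error, 10 adults: ${adult_avg:,.0f}")
    print(f"Average error, 400 kids:  ${kid_avg:,.0f}")
```

Under these assumptions the 400 guesses average out the noise very well, but they settle on the wrong number, because the guesses are systematically low. The 10 adults are noisier but centered on the truth.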

1) Quality of data usually trumps quantity of data. We can often learn just as much from a little good data as from a lot of poor quality data, with the added benefit of spending less time and money on collection. The example above seems silly because we know how wrong kids can be about their parents' income. However, many times we collect equally poor data simply because it is easy. Quality always needs to be considered.

2) If our data is biased, it doesn't matter how much we collect. If the collected data does not reflect the truth, then more of it will only lead us further in the wrong direction. Care needs to be taken so that the way the data is collected supports accurate results. For example, if the police would like to learn about teen drug use, they will not get accurate answers if a uniformed officer personally asks each kid when they last used drugs.

3) Randomizing protects us from potential problems. One of the reasons randomization is so powerful is that it reduces the risk of getting an unrepresentative sample. For instance, if we wanted to accurately gauge support for each candidate during an election, it would be much better to randomly select 100 people and ask who they support than to post the question on one candidate's web page and collect 10,000 responses. Visitors to a candidate's web page are far more likely to favor that candidate than the general public is, so the web page sample doesn't correctly represent our population.
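The election example in rule 3 can be sketched in a few lines of Python. The support levels below are assumptions chosen only for illustration: 50% of the general public backs the candidate, while 80% of the people who visit the candidate's web page do.

```python
import random


def poll(support_rate, n, rng):
    """Ask n people drawn from a group in which support_rate back the candidate."""
    supporters = sum(rng.random() < support_rate for _ in range(n))
    return supporters / n


rng = random.Random(1)

TRUE_SUPPORT = 0.50     # assumed support in the general public
WEBSITE_SUPPORT = 0.80  # assumed support among the candidate's web page visitors

random_estimate = poll(TRUE_SUPPORT, 100, rng)      # small random sample
web_estimate = poll(WEBSITE_SUPPORT, 10_000, rng)   # large self-selected sample

print(f"True support:            {TRUE_SUPPORT:.1%}")
print(f"Random sample of 100:    {random_estimate:.1%}")
print(f"Web page poll of 10,000: {web_estimate:.1%}")
```

The web page poll gives a very precise answer to the wrong question: it measures support among visitors, not among voters. The random sample of 100 wobbles by a few points, but it wobbles around the right number.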

The main thing to remember is that quality and quantity of data need to be balanced. More data is better, but poor quality data can tell us only part of the picture at best. If questions arise about methods for collecting good data, or about how much data is needed to get an accurate picture, your local statistician can lend a helping hand. Remember, the whole point of collecting data is to make sure that we are pointed toward the truth.

Thursday, April 5, 2012

Picking Lottery Numbers for Profit


Recently the Mega Millions lottery topped an expected payout of $656,000,000 (more than halfway to a billion dollars). And even though picking the winning numbers is pure chance, there is still some strategy to maximizing the expected return from playing the lottery.

Wait... didn't we just say that picking the winning numbers is pure chance? How can we improve our situation when the results are drawn at random? Well, the first thing we need to do is separate the idea of picking the winning numbers from the idea of receiving money. While picking the right numbers is entirely a matter of chance, the amount a winner is paid is not. This is because if two (or more) people pick the same winning numbers, the pot is split among them. You can think of it this way: every other person who picks the same numbers as you reduces your possible earnings. In the record Mega Millions drawing, three people had the winning numbers, so the most each of them could have won was about $219 million (still a hefty sum, but not $656 million).
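The arithmetic behind that split is simple enough to sketch. This assumes the pot is divided evenly among the winners and ignores taxes and lump-sum versus annuity options; the function name is made up for this example.

```python
def payout_per_winner(jackpot, winners):
    """Even split of the jackpot among everyone holding the winning numbers."""
    if winners < 1:
        raise ValueError("there must be at least one winner")
    return jackpot / winners


JACKPOT = 656_000_000  # the record jackpot mentioned above

for winners in (1, 2, 3):
    share = payout_per_winner(JACKPOT, winners)
    print(f"{winners} winner(s): ${share:,.0f} each")
```

Each extra person holding your numbers takes an equal share, so with three winners each share is roughly a third of the jackpot.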

So if the winning numbers are random, doesn't that mean the numbers people pick are also random? It turns out that isn't the case. People often base their lottery picks on numbers such as important dates, ages of children, and street addresses. This means that people tend to pick the lower lottery numbers more often than the larger ones (dates only use 1 through 12 for months, 1 through 31 for days, etc.). So to minimize the number of people selecting the same numbers as you, consider focusing on the larger ones. It won't help your chances of winning, but it will help your payout if you do win.
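Here is a rough sketch of how date-based picking crowds the low numbers. All of the settings are assumptions for illustration: a pool of numbers from 1 to 56, five numbers per ticket, and half of all players restricting themselves to 1 through 31 while the other half pick uniformly. Real player behavior is surely messier.

```python
import random
from collections import Counter

POOL_MAX = 56     # assumed size of the number pool
DATE_MAX = 31     # date-inspired picks stay at or below 31
DATE_SHARE = 0.5  # assumed fraction of players who pick from dates


def count_picks(players=10_000, seed=0):
    """Count how many simulated tickets include each number."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(players):
        # Date pickers only use 1..31; everyone else uses the full pool.
        top = DATE_MAX if rng.random() < DATE_SHARE else POOL_MAX
        counts.update(rng.sample(range(1, top + 1), 5))
    return counts


counts = count_picks()
low = sum(counts[n] for n in range(1, DATE_MAX + 1)) / DATE_MAX
high = sum(counts[n] for n in range(DATE_MAX + 1, POOL_MAX + 1)) / (POOL_MAX - DATE_MAX)

print(f"Average tickets per number, 1 to {DATE_MAX}: {low:.0f}")
print(f"Average tickets per number, {DATE_MAX + 1} to {POOL_MAX}: {high:.0f}")
```

Under these assumptions each low number shows up on noticeably more tickets than each high number, which is exactly why a winning ticket built from high numbers is less likely to be shared.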

This illustrates a common problem when dealing with data and "random selection". Even if the subjects in our study are selected at random, it does not mean that there aren't other underlying factors needing consideration. For example, a random sample of people from Kansas will tend to have a different composition than a random sample of people from Maine. We need to keep in mind that random does not mean we can ignore the specifics of the sample. Knowing this will allow us to keep ourselves pointed toward the truth.