Monday, July 30, 2012

Percents Count


Two farmers entered a contest to see who could grow a nicer stand of wheat with fewer weeds.  One farmer was very lazy and let his field overgrow with weeds, while the other, known to be a good farmer, kept a meticulous eye on his field.  After the harvest both farmers sent grain samples in to be examined for how much weed material they contained.  Surprisingly, when the results came back they showed that the good farmer had more weed material in his grain sample.  Thinking that something was strange, the good farmer stopped by to see the test results for himself.  Sure enough, the lazy farmer had 1 pound of weed material in his sample while the good farmer had 1.5 pounds.  But then something caught his eye: the total sample weight was 5 pounds for the lazy farmer and 150 pounds for the good farmer.  Seeing this, the good farmer knew he had really won, since his sample contained only 1% weed material while the lazy farmer's contained 20%.  The difference in total weed weight was due to the different sizes of the grain samples.

Often when we want to compare two groups we would like to compare them on equal footing, but it is frequently difficult (or impossible) to get samples of the same size.  In these cases we should consider reporting an overall percent (or proportion, or rate) rather than counts, so that the comparison is not influenced by the size of the sample taken from each group.

Say we are looking at a new curriculum to teach elementary school children awareness about gambling.  We take two classes and pilot a different curriculum in each one.  After the program is over we test the students on their knowledge.

A blue dot means that a student passed the test, while a red dot means a student didn't.  If we just count the number of passes, we see Classroom A had 10 students pass while Classroom B had 9.  However, Classroom A has 40 students and Classroom B has 18.  Because of the unequal sample sizes we should not compare straight counts here; instead we should look at the percent of students that passed (the number that passed divided by the total number of students in each class, multiplied by 100).  For Classroom A we have 10/40 = 0.25, meaning 25% of the students passed, and for Classroom B we have 9/18 = 0.5, or 50%.  Even though more students passed in Classroom A, a higher percent of students passed in Classroom B.  So we have evidence that the curriculum used in Classroom B is better overall.
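The percent calculation is easy to script; here is a minimal Python sketch using the class counts above:

```python
# Pass rates for the two classrooms: raw counts mislead because
# the class sizes differ, so convert each to a percent.
passed = {"Classroom A": 10, "Classroom B": 9}
enrolled = {"Classroom A": 40, "Classroom B": 18}

pass_rate = {room: 100 * passed[room] / enrolled[room] for room in passed}

for room, rate in pass_rate.items():
    print(f"{room}: {passed[room]} of {enrolled[room]} passed ({rate:.0f}%)")
```

Classroom A comes out at 25% and Classroom B at 50%, matching the arithmetic above.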

Remember that when we make comparisons between groups we should not let the size of each group skew the result.  Taking percentages helps us compare what is actually important and keeps us pointed toward the truth.

Wednesday, June 27, 2012

Much Ado About Nothing


Surprisingly, one of the most complex topics in statistics is dealing with nothing.  A “nothing” in our data can be anything from a survey question that was accidentally left blank to an actual data value of zero.  The trick to correctly handling “nothings” in our data is to understand why they are there in the first place.

The simplest case is when we know we have a legitimate value of zero in our data.  For example, if we asked people how many times they used a park in the last year, we would expect some people really did not use the park at all.  In these cases of nothing we do not need to take any special action.

Another common case is when a data point is missing.  For example, say we ask for a rating of service and get no response.  When this happens we should leave the response blank (or, if using a more advanced software package, use the missing-data code).  We absolutely do not want to replace these missing values with 0's, or any other value, since this will bias our results.  See the table for an example.
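A minimal Python sketch shows why filling blanks with zeros biases a result; the ratings here are hypothetical, with None marking a skipped question:

```python
# Hypothetical service ratings on a 1-5 scale; None = no response.
responses = [4, 5, None, 3, None, 4]

# Correct approach: average only the observed responses.
observed = [r for r in responses if r is not None]
mean_observed = sum(observed) / len(observed)

# Wrong approach: treat blanks as zeros, dragging the average down.
filled = [0 if r is None else r for r in responses]
mean_filled = sum(filled) / len(filled)

print(mean_observed, mean_filled)
```

The observed mean is 4.0, while the zero-filled mean drops to about 2.7 even though no one actually rated the service that low.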



Now we move on to trickier situations.  There are times when we get no response to a question, yet the blank still tells us something.  For example, say we hastily made a survey and included a question that read:

Check the political party that you consider yourself:

  • Republican
  • Democrat


What if we get a survey back that has neither option checked?  Did they skip the question or is the person a member of a different political party (Unaffiliated, Libertarian, etc.)?  We can't tell for sure.  Careful design of the survey could have prevented this by including a third “other” option.

An interesting example of nothing meaning much more than zero is the case when we have a “limit of detection”.  Say we want to learn about how much people speed on a certain road by looking at records of speeding tickets.  When we look at this data we see that there are almost no tickets issued for 1 to 3 miles per hour over the speed limit.  Does this mean that no one drives at these speeds?  Surely not; it is most likely that people who speed a couple of miles over the limit are simply not ticketed.  In this case we have a limit of detection problem: we know there should be values in our data that aren't showing up because they are too small.  To avoid this, we could use a less biased data source that does not have this problem.
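The effect of a detection limit on a summary statistic can be sketched with simulated data (all numbers here are invented for illustration):

```python
import random

# Suppose actual speeds range from 1 to 15 mph over the limit,
# but tickets are only written at 4 mph over or more.
random.seed(1)
over_limit = [random.randint(1, 15) for _ in range(1000)]

ticketed = [s for s in over_limit if s >= 4]  # small violations never recorded

mean_actual = sum(over_limit) / len(over_limit)
mean_ticketed = sum(ticketed) / len(ticketed)

# The ticket records overstate typical speeding because the
# small values are missing from the data source.
print(mean_actual, mean_ticketed)
```

The ticketed mean sits well above the true mean, so conclusions drawn only from the ticket data would exaggerate how fast people drive.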

These four issues just scratch the surface of possible problems with nothing in our data.  While it may seem intimidating, most cases do fall under these general categories.  So take some time to think about what the nothing means in your data.  You may be able to glean more information from it than initially thought.


Wednesday, May 23, 2012

Point of View


The Swiss artist Felice Varini is a master of using point of view. If you view his work from the wrong spot it seems to just be a random collection of lines and colors, but if you are looking from the correct location it all comes together to form a stunning design. Below is one of his pieces titled ‘Huit carrés’.


If we didn't know what to look for when viewing ‘Huit carrés’, would we take the time to find the correct spot to view it from?  Or would we think that we see the whole picture when we see the room as in the left images above, and then move on?

This is similar to the way we often look at our data. When we take quick “high level” glances over our data, with overall averages, general trends, or simple charts, we may be missing the true picture it is painting. Often, when a quick glance at our data shows something interesting, that should be a tip-off that there may be even more interesting trends to be found. We just need to know where to look for them.

For example, if we look at the overall counts for the violence data from the school district, we see these numbers:

High School: 72
Middle School: 74
Intermediate School: 69
Elementary: 117

Male: 225
Female: 107

The picture we see from this analysis is that males engage in more acts of violence, and that there are about twice as many acts in elementary schools, with similar numbers across the rest of the school levels.

But what if we take one more step in this analysis to get a different point of view? Let's look at gender and school level at the same time:

School Violence      Male   Female
High School           23      49
Middle School         50      24
Int. School           55      14
Elementary            97      20
Aha! We have learned something new here: males commit more than twice as many acts of violence at every school level except high school, where females committed about twice as many. Now we know something else about our data. It seems there may be different times when violence issues arise for each gender, and so specific times when males and females each need to be focused upon. We never could have seen this with just our first glance through the data.
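The table above is just a cross-tabulation, and the reversal it reveals can be checked in a few lines of Python using the counts from the table:

```python
# Counts by school level and gender, straight from the table.
violence = {
    "High School":   {"Male": 23, "Female": 49},
    "Middle School": {"Male": 50, "Female": 24},
    "Int. School":   {"Male": 55, "Female": 14},
    "Elementary":    {"Male": 97, "Female": 20},
}

# Percent of incidents involving males at each school level.
pct_male = {
    school: 100 * c["Male"] / (c["Male"] + c["Female"])
    for school, c in violence.items()
}

for school, pct in pct_male.items():
    print(f"{school}: {pct:.0f}% of incidents involved males")
```

Every level comes out well above 50% male except high school, where the percentage drops to about 32%, which is the pattern the overall totals completely hid.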

We need to remember to take a couple of steps around our data to make sure that we have the correct point of view, because there are times when viewing it from a side angle does not give the same picture as viewing it straight on.

Thursday, April 26, 2012

More doesn't mean better


Say we would like to know the average income of households in an elementary school district.  We don't have much money for this survey, so we could either randomly select 10 adults from the school district and survey them, or ask all 400 kindergartners through 4th graders what their parents make.  Which of the two samples should get us closer to the truth about the average income in the area?  Most likely those 10 adults will give us a better figure than the 400 kids.  Why is this?  Isn't more data better?  Well, yes and no.

All else being equal, the more data we can collect the better picture we are going to have.  But when comparing alternative methods for collecting data, everything usually isn't equal.  The options often boil down to collecting either a lot of poor quality data or a smaller amount of higher quality data.  Unfortunately there isn't a single rule that tells us which is right for our data collection needs, but there are some general rules to consider.

1) Quality of data usually trumps quantity of data. We can often learn just as much from a little good data as from a lot of poor quality data, with the possible benefit of spending less time and money collecting the smaller, higher quality sample. The example above seems silly because we know how wrong kids can be about their parents' income. However, many times we collect equally poor data just because it is easy.  Quality always needs to be considered.

2) If our data is biased, it doesn't matter how much we collect.  If the collected data does not reflect the truth, more of it will only lead us further in the wrong direction.  Care needs to be taken so that the data collection ensures accurate results.  For example, if the police would like to learn about teen drug use, they will not get accurate answers if a uniformed officer personally asks each kid when they last used drugs.

3) Randomizing protects us from potential problems.  One of the reasons randomization is so powerful is that it reduces the chance of getting an unrepresentative sample.  For instance, if we wanted to accurately gauge support for each candidate during an election, it would be much better to randomly select 100 people and ask who they support than to post the question on one candidate's web page and collect 10,000 responses.  Obviously, people who favor that candidate are more likely to visit the web page than the general public is, so our web page sample doesn't correctly represent our population.
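Rule 3 can be illustrated with a quick simulation; the support percentages here are invented for the sketch, and the point is that a small random sample lands near the truth while a huge self-selected one does not:

```python
import random

# Suppose 50% of the population supports candidate X, but visitors
# to X's web page support X 90% of the time.
random.seed(7)
population_support = 0.50
webpage_support = 0.90

# 100 randomly selected respondents vs 10,000 self-selected ones.
random_sample = [random.random() < population_support for _ in range(100)]
web_sample = [random.random() < webpage_support for _ in range(10_000)]

random_estimate = sum(random_sample) / len(random_sample)  # near 0.50
web_estimate = sum(web_sample) / len(web_sample)           # near 0.90

print(random_estimate, web_estimate)
```

The hundred-fold larger web sample estimates the web-page audience very precisely, but that audience is not the population, so the estimate is precisely wrong.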

The main thing to remember is that the quality and quantity of data need to be balanced.  More data is better, but poor quality data can tell us only part of the picture at best.  If questions arise about methods for collecting good data, or about how much data is needed to get an accurate picture, your local statistician can lend a helping hand.  Remember, the whole point of collecting data is to make sure that we are pointed toward the truth.

Thursday, April 5, 2012

Picking Lottery Numbers for Profit


Recently the Mega Millions lottery jackpot topped $656,000,000 (more than halfway to a billion dollars).  And even though picking the winning numbers is pure chance, there is still some strategy to maximizing the expected returns from playing the lottery.

Wait...  didn't we just say that picking the winning numbers was pure chance?  So how can we improve our situation when the results are picked at random?  The first thing we need to do is separate the idea of picking the winning numbers from the idea of receiving money.  While picking the right numbers is completely chance based, the amount a winner is paid is not.  This is because if two (or more) people pick the same numbers, the pot is split amongst them.  You can think of it this way: every other person who picks the same numbers as you reduces your possible earnings.  In the record Mega Millions drawing, three people had the winning numbers, so the most each could have won was about $219 million (still a hefty sum, but not $656 million).

So if the winning numbers are random, doesn't that mean the numbers people pick are also random?  It turns out that isn't the case.  People tend to use numbers such as important dates, ages of children, and street addresses when making lottery picks.  This means they pick the lower lottery numbers far more often than the larger ones (months in important dates run 1 through 12, days 1 through 31, etc.).  So to minimize the number of people selecting the same numbers as you, consider focusing on the larger ones.  It won't help your chances of winning, but it will help your payout if you do win.
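The payout logic is simple arithmetic; a short Python sketch (the holder counts are hypothetical) shows why an unpopular combination is worth more:

```python
# If the jackpot is split among everyone holding the winning
# numbers, a popular combination pays less than an unpopular one,
# even though both are equally likely to be drawn.
jackpot = 656_000_000

def payout_if_win(jackpot, other_holders):
    """Your share of the jackpot when `other_holders` people picked the same numbers."""
    return jackpot / (1 + other_holders)

popular_pick = payout_if_win(jackpot, 2)    # e.g. a birthday-based combination
unpopular_pick = payout_if_win(jackpot, 0)  # high numbers, rarely chosen

print(popular_pick, unpopular_pick)
```

With two other holders the popular pick pays about $219 million, matching the record drawing, while a combination no one else chose would have kept the full $656 million.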

This illustrates a common problem when dealing with data and “random selection”.  Even if the subjects in our study are selected at random, it does not mean there aren't other underlying factors needing consideration.  For example, a random sample of people from Kansas will tend to have a different composition than a random sample from Maine.  We need to keep in mind that random selection does not mean we can ignore the specifics of the sample.  Knowing this will allow us to keep ourselves pointed toward the truth.

Tuesday, March 27, 2012

When Yes and No are Not Enough


Imagine a world where all the people inhabiting it viewed everything as “black-or-white”.  A person is very satisfied with life until, at some point, they instantly become very unsatisfied.  Children in school either have no knowledge of a subject or know everything about it.  And neighborhoods are either so safe that the residents have never even heard of a lock, or so dangerous that armed guards must be hired for each trip to the store.  Indeed this would be a strange world to live in, so it may be surprising how often the way we collect data makes it appear that we live in this sort of world.

Data of this type is called “binary”, meaning only a yes or no type answer is recorded.  While this sort of data is useful for questions like “Do you live in Kansas?”, “Did you vote for Bayes last election?”, or “Are you male or female?”, it is not as appropriate when there are varying degrees to the answer, such as “Were you satisfied with the seminar?”.  Instead we should use a rating scale when we want to collect this type of data.  Let's take a look at an example to see why.

Say we are presenting a seminar on technology and we want to assess whether the participants have learned about using Excel.  We will ask the people taking the class two questions before and after the seminar to measure how much they learned.  The questions will be “Could you use Excel if needed?” and “On a scale of 1-10, rate how well you could use Excel if needed (10 being better)”.  After the seminar we take a look at our results (it seems the turnout was poor, since there were only three participants).


Looking at just the yes/no responses we may be disheartened: only one person improved from a No to a Yes, and it also seems the class wasn't useful to another participant, since they could already use Excel.  Looking at the rating scale, however, we can see that the seminar was actually a success: every participant learned, with an average improvement of 3 points on the scale.  The seminar was the same, yet just because of the way we collected our data our conclusions could be quite different.

So why is a rating scale better for data like this?  Because it makes slight changes in the data easier to measure.  It is relatively difficult for a response to switch from a No to a Yes, but much easier for a rating to move up or down one or two points.  Further, we can always convert our rating-scale data back to a binary type by grouping the responses (say 1-5 = No and 6-10 = Yes), but we cannot turn binary data into rating data.
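The one-way conversion is easy to sketch in Python (the cutoff and the ratings below are hypothetical):

```python
# Grouping rating-scale responses back into binary: 1-5 -> "No",
# 6-10 -> "Yes". The reverse conversion is impossible because the
# detail is already gone.
def to_binary(rating, cutoff=5):
    return "Yes" if rating > cutoff else "No"

ratings = [2, 7, 4, 9]  # hypothetical 1-10 responses
answers = [to_binary(r) for r in ratings]

print(answers)
```

Every rating maps to exactly one binary answer, but a single "No" could have come from any rating between 1 and 5, so the original detail cannot be recovered.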

When designing surveys or intake forms, take a minute to think about what kind of data you are collecting, and consider whether a rating scale can reasonably be used.  The extra “shades of gray” that you will find in your data will make it easier to keep pointed toward the truth.

Wednesday, March 7, 2012

Switch now and save!


It seems that every time I turn on the TV some insurance company is saying something to the effect of “People who switch from Company A to Company B save on average $500 a year for the same coverage”, and then the next ad says “People who switch from Company B to Company A save on average $500 a year for the same coverage.”  How is this possible?  Does it mean that if everyone switched insurance companies, everyone's rates would go down?  Sadly, no (if it were true I would just switch companies enough times to get free insurance), and yet interestingly both ads are correct in their statements.  So what is going on?

First let's take a hypothetical sample:
Each shaded square is the quote from the insurance company each person is currently using, while the unshaded squares are the quoted prices for the same coverage with the other company.  So Person 1 is paying Company A $1,320 for insurance when he could get the same policy from Company B for $845.


The first thing to note is that the rates for the two companies are the same on average ($1,455.88 for both Company A and Company B).  So the insurance companies offer the same coverage for the same price overall, and yet both correctly claim that “people who switch from them to us save $500”.

Next let's just see what happens if everyone switches policies:
Here are the savings if everyone switched insurance policies.  Now we see something strange: if everyone were to switch policies, each person would expect to “save” -$500 on their insurance; that is, everyone would pay $500 more overall.  This is quite the opposite of what the ads seem to claim.  So where are the savings?
  
Well, we need to pay close attention to the wording of the insurance companies' statement: “Those who switch save on average $500.”  Now, who would switch?  Only the people whose quotes are lower (and arguably significantly lower) than their current rate.  So of course the average switcher will save money.

So if we go back to our data now and have the people switch that received lower quotes we see:
Here the light green cells are customers who switched and the dark green cells are customers who stayed with their old policy.  Now when we calculate the money saved by those who switched, we see an average savings of $500 per customer, no matter which policy they had or switched to.  Case closed.
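The selection effect can be reproduced with simulated quotes (all figures below are invented): give both companies the same average price, and the switchers still save money on average.

```python
import random

# Both companies quote the same price on average, with some spread.
random.seed(3)
n = 10_000
quotes_a = [random.gauss(1456, 300) for _ in range(n)]  # current company
quotes_b = [random.gauss(1456, 300) for _ in range(n)]  # competitor's quote

# Savings for each customer if they switched from A to B.
savings_if_switch = [a - b for a, b in zip(quotes_a, quotes_b)]

# Only people quoted a lower price actually switch.
switchers = [s for s in savings_if_switch if s > 0]

avg_everyone = sum(savings_if_switch) / n          # near $0
avg_switchers = sum(switchers) / len(switchers)    # well above $0

print(avg_everyone, avg_switchers)
```

The population as a whole saves nothing by switching, yet the self-selected group of switchers shows a substantial average savings, which is exactly the number the ads report.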

This insurance example highlights a common problem we have in interpreting numerical results.  Statements are crafted in ways that are true but can be misleading (even if they are not meant to be).  It is the job of the person reporting the results to be as clear as possible, but it is also the responsibility of those who rely on the results to make sure they fully understand the statements.  Comments such as “the samples were randomly selected”, “outliers were thrown out”, or “out of our initial testing cohort these three results were found to be highly significant (p-value<0.01)” seem run of the mill, but without fully understanding what actually happened they may render the analysis of little practical use.  When something doesn't make sense, or is not spelled out clearly, we need to be sure to ask questions so that we can keep ourselves pointed toward the truth.