Introduction to Statistics (without math)
A non-mathematical introduction to basic statistical concepts
This short essay introduces basic statistical concepts without using much math. This is not a cookbook for working out statistics problems. This is a guide to understanding statistical concepts when you read the news, press releases, or scientific reports.
Warning: The following text requires the use of brain cells, contains no anecdotes and no quotes from "experts". Reader discretion advised.
Normal Distribution
If a population under study has a "normal distribution", then when we "sample" the population (other terms: survey, poll) by measuring only a subset of it, we expect our sample measurements to follow the "normal distribution" as well.
In other words, if we sample, say, 100 people out of 1,200, we should get a cross section of the total population. We can use this sample to estimate, say, how the whole group of 1,200 might have voted.
If we graph our samples, we will get a result that should approximate the normal distribution. In the normal distribution, most results fall in the broad middle, with fewer results at either end.
The P(x) curve here illustrates an expected normal distribution. Other terms for the normal distribution are the "Bell curve" or Gaussian distribution.
(Not all distributions are normal. There are other statistical methods for dealing with those situations that are not described here. Because of this, when conducting a survey or other research, a simple plot of the raw data (for example, a histogram) is typically drawn first just to look at its shape. If it does not resemble a normal distribution, then other statistical procedures should be used.)
Example
Suppose there are 1,500 students in a school. We conduct a survey and ask 200 randomly selected students some questions to learn something about the students. There is a small chance that we could randomly select a biased sample, say lots more boys than girls. Because of this, the sample size affects our ability to confidently draw conclusions. With any size sample, it is possible that we did not draw a representative set - although with larger sample sizes, up to a point, we can be more confident that our sample really does reflect the whole population.
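The effect of sample size on a lopsided draw can be sketched with a small simulation. This is only an illustration under made-up assumptions: the school is taken to be exactly half boys and half girls, and the names `students`, `share_of_boys` and `worst_imbalance` are invented for this sketch.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Assumption: a hypothetical school of 1,500 students, exactly half boys.
students = ["boy"] * 750 + ["girl"] * 750

def share_of_boys(sample_size):
    """Draw one random sample (without replacement) and return its fraction of boys."""
    sample = rng.sample(students, sample_size)
    return sample.count("boy") / sample_size

def worst_imbalance(sample_size, surveys=1000):
    """Largest distance from a 50/50 split seen across many repeated surveys."""
    return max(abs(share_of_boys(sample_size) - 0.5) for _ in range(surveys))

# Compare small, medium and large samples.
for n in (20, 200, 1000):
    print(n, round(worst_imbalance(n), 3))
```

Running it should show the worst-case imbalance shrinking as the sample grows, which is the point made above: bigger samples are less likely to be badly lopsided.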
Estimating the Sample Average
Most everyone knows that to calculate the average of a set of values you merely sum all the values and divide by the number of values. That is simple enough.
By averaging the samples from the overall population you do not, however, calculate the actual population mean. Instead, you have produced an estimate of the population mean. Depending on your sample size, your estimated average may be off (is almost certainly off) by some amount from the actual population mean.
You could see this by running the survey several times, each time randomly selecting 200 students. Each survey would produce a slightly different estimate of the population mean. Yet if you went and sampled each and every member of the population - all 1,500 students - you'd arrive at the actual population mean, and it would probably not be exactly the same as any of your sample estimates.
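Here is a rough sketch of running the survey several times. The population of 1,500 scores is invented (roughly normal around 80 with a spread of 7), and `survey_mean` is a hypothetical helper name; nothing here comes from real data.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Assumption: a made-up population of 1,500 test scores.
population = [rng.gauss(80, 7) for _ in range(1500)]
population_mean = statistics.mean(population)

def survey_mean(sample_size=200):
    """Average of one random sample drawn without replacement."""
    return statistics.mean(rng.sample(population, sample_size))

# Five separate surveys of 200 students each.
estimates = [survey_mean() for _ in range(5)]

print("actual population mean:", round(population_mean, 2))
print("five survey estimates: ", [round(e, 2) for e in estimates])
```

Each survey lands near the actual mean but not exactly on it. Only "sampling" all 1,500 students reproduces the population mean.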
How far might our sample average be off from the actual mean? We will come back to that in a moment.
Standard Deviation
A reasonable question is "How random are our samples?" Some sample values will be less than the average, some close to the average, and some above the average. So how much do our sample values deviate from the average? That concept is captured in a calculation known as the standard deviation. We might have a set of say, student test scores, on a 100 point test. The average score might be 80, with a standard deviation of say, 7.
What does that mean? Without explaining why, we expect that in a normal distribution, about two-thirds (68%) of our scores will fall within + or - one standard deviation unit (in this case one unit = 7 so 80 +/- 7). About 95% of all the scores should fall within + or - two standard deviation units (80 +/- 14) and about 99.7% of all values should fall within +/- 3 standard deviations.
There are two kinds of standard deviations: (1) sample standard deviation and (2) population standard deviation. If you did a survey of a subset you'll want to use (1). If you actually surveyed everyone, you'll use (2).
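As a small sketch, Python's standard `statistics` module offers both kinds: `stdev` for a sample and `pstdev` for a whole population. The ten scores below are made up for illustration, and `share_within` is a hypothetical helper.

```python
import statistics

scores = [68, 72, 75, 78, 80, 80, 82, 85, 88, 92]  # made-up test scores

mean = statistics.mean(scores)
sample_sd = statistics.stdev(scores)       # (1) scores are a subset of a larger group
population_sd = statistics.pstdev(scores)  # (2) scores are the entire group

def share_within(values, center, sd, k):
    """Fraction of values lying within k standard deviations of the center."""
    return sum(abs(v - center) <= k * sd for v in values) / len(values)

print(mean, round(sample_sd, 2), round(population_sd, 2))
print(share_within(scores, mean, sample_sd, 1))

# Theoretical shares for a perfectly normal distribution, for comparison.
std_normal = statistics.NormalDist()
for k in (1, 2, 3):
    print(k, round(std_normal.cdf(k) - std_normal.cdf(-k), 4))
```

The sample version comes out slightly larger than the population version because it divides by one less than the number of values. The last loop reproduces the roughly 68%, 95% and 99.7% figures for one, two and three standard deviations.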
Confidence Intervals
From above, we know that our sample average (average of all subset samples) is not exactly the population mean (if we could sample every member of the population). Our estimated average will probably fall within some range around the exact population mean.
How wide is the interval about the real mean? That depends on how confident we want to be that our estimate is good or bad. Let's say we know that the values in the overall population lie between 0 and 100. We can say, with 100% confidence, that the population mean will therefore lie somewhere between 0 and 100! Thus, to achieve 100% confidence (in a worst case scenario) our confidence interval would be as wide as all possible values.
If we are willing to give up some confidence that our estimate is correct, we can narrow the confidence interval.
Without explaining the math, we can say with 95% confidence that the actual population mean (if we sampled every individual) lies within about + or - two "standard errors" of our estimated average. The standard error is the standard deviation divided by the square root of the sample size, so it shrinks as the sample grows. (Plus or minus two standard deviations, by contrast, describes where about 95% of the individual values fall, not how far off our average might be.)
Let's go back to the school example. Suppose all 1,500 students took a test and we sampled 200 to learn that we have an estimated average test score of 80, with a standard deviation of 7. The standard error is 7 divided by the square root of 200, or about 0.5. We can say that we are 95% confident that if we calculated a mean of all 1,500 test scores, the actual mean will lie within the range of (about) 80-1 to 80+1 or a range of 79 to 81.
What if we wanted to be only 80% confident that the real mean is within our interval? Since we are willing to be less confident, we can narrow the size of our interval to about 80 - 0.6 to 80 + 0.6 or 79.4 to 80.6.
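For readers who want to see the arithmetic, here is a minimal sketch of an interval for the school example. It assumes the usual normal-approximation formula, in which the margin is a multiple of the standard error (the standard deviation divided by the square root of the sample size), and it ignores any adjustment for sampling a large share of a small population. The function name `mean_confidence_interval` is invented for this sketch.

```python
import math
import statistics

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Normal-approximation interval: mean +/- z * (sd / sqrt(n))."""
    # z is the multiplier for the chosen confidence level (about 1.96 for 95%).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# School example: average 80, standard deviation 7, sample of 200.
for level in (0.80, 0.95, 0.99):
    low, high = mean_confidence_interval(80, 7, 200, level)
    print(f"{level:.0%}: {low:.2f} to {high:.2f}")
```

Under these assumptions the 95% interval comes out at roughly 79 to 81, and asking for more confidence widens it while asking for less narrows it.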
Another way to look at the confidence interval is as follows:
* At the 99% confidence level, there is a 1 in 100 chance we are wrong
* At the 95% confidence level, there is a 1 in 20 chance that our estimated average is wrong (outside the interval range)
* At the 90% confidence level, there is a 1 in 10 chance that our estimated average is wrong (outside the interval)
* At the 80% confidence level, there is a 1 in 5 chance we are wrong
* At the 66% confidence level, there is a 1 in 3 chance we are wrong
* At the 50% confidence level, there is a 1 in 2 chance that we are wrong (equal odds)
What Confidence Level Should Be Used?
By statistical convention, the most commonly used confidence level is 95%. There are some situations where you might choose less (90%, although rarely) or more: 99% (fairly common in health care) and sometimes even 99.9%. In some situations where the result is not too important - like, say, a market research survey - someone may have reason to choose an 80% confidence level. However, levels below 90% are considered pretty worthless for real results.
WHAT TO LOOK FOR
Look for the confidence interval in the report. If you do not see it mentioned, either ignore the report or look for a more detailed report.
If the confidence level is less than 95%, try to understand why. A common sleight of hand to make a finding sound important by claiming statistical significance is to change the confidence level to 90% or even 80%. This may or may not be mentioned. What was not significant at the 95% or 90% level might be significant at the 80% level - but that's a worthless result since there is a 1 in 5 chance we are completely wrong.
Survey results are often presented as "Candidate X polls 45% +/- 4.5 percentage points". Without knowing the confidence level, this is not useful information. Was this at the 95% confidence level? 80%?
What is the sample size? In some fields, like health care, numerous studies are completed with very small sample sizes ranging from n=1 to 30. Be very careful about interpreting these studies. Typically they are used to justify funding for further studies - their results are usually of no value in making future health care decisions.
(There are also two kinds of confidence intervals - one-sided and two-sided - which will not really be dealt with in this introduction.)
Hypothesis Testing
When many people hear the word "hypothesis" they may think "science", but the concept of a hypothesis is not restricted to science. A marketing manager might form a hypothesis: Our new ad campaign resulted in an increase in sales - or not. The marketing manager may want to know whether a change in sales was merely random or due to the ad campaign.
Hypothesis testing is used to make tests about what we think we know. A hypothesis is either true or not true. A hypothesis is never probably true or probably false - a hypothesis is only rejected or accepted.
A hypothesis contains two parts:
(1) The null hypothesis - this is true unless we have convincing evidence it is not true. Example - the increase in sales was not due to the ad campaign.
(2) The research hypothesis or alternative hypothesis - this will be accepted if we have convincing evidence. Example - the increase in sales was due to the ad campaign.
We can accept the null hypothesis or reject the null hypothesis and accept the research hypothesis. (We do not reject the research hypothesis - we only have the chance to accept it.) The goal of an experiment is to reject the null hypothesis.
Hypothesis Example
We've made a change to our classes at the school and we hope this leads to an increase in test scores.
Null hypothesis - the change made no difference, the mean equals 80.
Research hypothesis - the change made a difference, the mean is not equal to 80.
We conduct a new survey of a sample of 200 students, as before, and we calculate a new average of 80.5. Do we accept or reject the null hypothesis?
Let's assume that our standard deviation is still 7. An average of 200 scores then has a standard error of about 0.5 (7 divided by the square root of 200), so the 95% confidence interval around 80 runs from about 79 to 81. Since the new estimate lies within the 95% confidence interval we accept the null hypothesis. There is not a statistically meaningful difference.
Suppose instead our new average was 85. Do we accept or reject the null hypothesis? Since the new estimate lies outside the 95% confidence interval, we reject the null hypothesis and accept the research hypothesis.
The reality: a lot of people would assume that the new average of 80.5 was meaningful, but it could be entirely by chance of how we randomly sampled the population. The statistical tests help us to understand the impacts of chance on our results and to make statistically valid conclusions.
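The accept-or-reject decision can be sketched in a few lines. This is a simplified illustration, not a full testing recipe: it assumes a normal approximation, treats 80 as the known baseline, and uses a sample of 200 with a standard deviation of 7. The function names and the trial averages of 80.5 and 85 are chosen for this sketch.

```python
import math
import statistics

def two_sided_p_value(sample_mean, null_mean, sd, n):
    """Chance of a sample mean at least this far from null_mean if the null is true."""
    standard_error = sd / math.sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    return 2 * (1 - statistics.NormalDist().cdf(abs(z)))

def decide(sample_mean, null_mean=80, sd=7, n=200, alpha=0.05):
    """Reject the null hypothesis only when the p-value falls below alpha."""
    p = two_sided_p_value(sample_mean, null_mean, sd, n)
    return "reject null" if p < alpha else "accept null"

for new_average in (80.5, 85):
    print(new_average, decide(new_average))
```

With these assumptions a new average of 80.5 is within what chance alone could produce, while 85 is far outside it.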
The p-value
In hypothesis testing, the confidence level (e.g. 95%) is typically referenced by its complement - that is, 5%, called the significance level. The p-value itself is the probability of getting a result at least as extreme as ours if the null hypothesis were true. We end up saying "we accept the null hypothesis at the p > 0.05 level" (or p > 5%) or similar. Accepting the null hypothesis is the same as saying that the results were "not statistically significant".
The p-values are, in turn, translated into English statements as follows:
| p > 5% | Not significant |
| p < 5% | Significant at the 5% level |
| p < 1% | Highly significant at the 1% level |
| p < 0.1% | Very highly significant at the 0.1% level |
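
The translation in the table above is mechanical enough to write as a tiny function. The name `significance_label` is invented for this sketch, and a p-value exactly on a boundary is treated as not reaching that level.

```python
def significance_label(p_value):
    """Map a p-value (as a fraction, e.g. 0.03 for 3%) to the conventional wording."""
    if p_value < 0.001:
        return "Very highly significant at the 0.1% level"
    if p_value < 0.01:
        return "Highly significant at the 1% level"
    if p_value < 0.05:
        return "Significant at the 5% level"
    return "Not significant"

for p in (0.20, 0.03, 0.004, 0.0002):
    print(p, "->", significance_label(p))
```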
IMPORTANT POINT
The statistical words "significant", "highly significant" and "very highly significant" do NOT mean that the conclusion is important, profound, will change the world, or whatever hype-sounding adjective you want to use. The words provide us with information about the confidence of our rejecting the null hypothesis. That is all. This is a critically important point - in news stories in the press you will frequently see a statistically "significant" finding translated into "important breakthrough" or "significant finding" or similar, when it is neither.
WHAT TO LOOK FOR
Occasionally, you may spot the use of a 90% (or worse, 80%) confidence level in order to claim significance in a research finding. If that is a scientific finding, you should normally expect at least 95% confidence - and in health care, you should expect at least 95% and often 99% confidence.
When you read a report that says "the researchers are 80% confident" in the result, you should understand that this means, by normal science standards, no one else has any confidence in the result because this means there is a 1 in 5 chance the researchers are wrong. Watch those confidence levels!
Other Statistical Tests
Another common test is to compare two populations to each other. For example, suppose we have two factories that produce robots. Factory A has a defect rate of 2.3% and Factory B has a defect rate of 2.8%. Are these differences statistically significant?
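Whether 2.3% versus 2.8% is significant depends heavily on how many robots were inspected, which the example does not say. The sketch below uses one common approach (a pooled two-proportion z-test with a normal approximation) and entirely hypothetical inspection counts. The function name is also invented.

```python
import math
import statistics

def two_proportion_p_value(defects_a, n_a, defects_b, n_b):
    """Two-sided p-value for the difference between two defect rates (pooled z-test)."""
    rate_a, rate_b = defects_a / n_a, defects_b / n_b
    pooled = (defects_a + defects_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / standard_error
    return 2 * (1 - statistics.NormalDist().cdf(abs(z)))

# Hypothetical counts that both give 2.3% and 2.8% defect rates.
small = two_proportion_p_value(23, 1000, 28, 1000)          # 1,000 robots each
large = two_proportion_p_value(2300, 100000, 2800, 100000)  # 100,000 robots each

print("1,000 per factory:  ", round(small, 3))
print("100,000 per factory:", large)
```

With only 1,000 robots per factory the difference is not significant at the 5% level; with 100,000 per factory the very same rates are very highly significant.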
Other statistical techniques are used to make sense of sequences of data. A very common type of simple data analysis is the regression. Suppose we have sales over ten years - what sort of sales might we expect next year and the year after? We can draw a trend line through the years on a graph and see where the trend looks to be going. Using statistical methods, we can develop an equation that enables us to predict a future value, along with a confidence interval for that prediction, which gives us an idea of how reliable our future predictions are. The reliability or "closeness" of the fit is often measured with a statistic called the "R-squared" value. (You may see references to it from time to time.) An R^2 value close to zero means our regression - and hence prediction - is worthless. An R^2 value close to 1 is really good. There are also other measures sometimes used.
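As a rough sketch of the sales example, the code below fits a straight trend line by ordinary least squares and reports R-squared. The ten yearly sales figures are made up, and `fit_line` and `r_squared` are names chosen for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys that the fitted line accounts for."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

years = list(range(1, 11))
sales = [102, 108, 115, 117, 126, 131, 134, 142, 149, 153]  # made-up figures

slope, intercept = fit_line(years, sales)
print("predicted sales in year 11:", round(intercept + slope * 11, 1))
print("R-squared:", round(r_squared(years, sales, slope, intercept), 3))
```

Because the made-up sales rise almost in a straight line, R-squared comes out close to 1. Noisier data would pull it toward zero.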
Many types of regression are simple, such as the sales problem.
Some, though, are more complex. Suppose back at the school, we decide to make many changes to our school to improve test scores. We change the number of class periods, we offer a new 7 am class option, we hire three extra tutors, and we create new entrance requirements for students wishing to take pre-calculus classes, plus we change textbooks in the English class. Now, we look at our test results over a few years. Do we see a correlation between our changes and the test results?
We may wish to write a formula to help us predict test scores - something like:
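Test scores = b0 + (b1 * NumPeriods) + (b2 * 7AMOption) + (b3 * MoreTutors) + (b4 * Requirements) + (b5 * Textbooks)
(Here b0 through b5 are the weights that the regression estimates from the data.)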
This becomes a multiple regression problem rather than a simple regression problem. Statistical methods help us to identify how much weight to give to individual components of this equation (maybe the textbook is not relevant to the test scores?)
The above regression methods are designed to work with data that shows linear relationships (e.g. a trend line that sort of looks like it follows a line on a graph). When the data is non-linear, then other methods must be used. Non-linear data would be, for example, a trend that goes up and down over time.
When reviewing studies showing a trend, keep a close eye on the use of linear regression. A convenient sleight of hand is to select the beginning and ending points on non-linear datasets - thus cherry picking a selection of data that just happens to show an up or down trend and performing linear regression. If the rest of the data before and after is not disclosed, someone is probably trying to be deceptive. This deception is quite common too. (More on this in a moment.)
In another example, suppose we would like to know if there is a relationship between the Federal budget and the Dow 30 Industrials stock market tally. We are asking, "Is there a correlation between the budget and the stock market?"
To answer that question we use statistical methods to evaluate the correlation between the two data series, arriving at a correlation coefficient ranging from -1 to 1. A value near zero means little or no correlation. A value closer to 1 means a good correlation between the two series. A value closer to -1 means they are inversely correlated (e.g. as the budget goes up, the Dow goes down).
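The correlation coefficient described here is usually the Pearson coefficient, and it is short enough to compute by hand. The sketch below uses tiny made-up series rather than real budget or Dow figures, and the function name `correlation` is chosen for this illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length series (-1 to 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print(correlation([1, 2, 3, 4], [10, 20, 30, 40]))  # move together
print(correlation([1, 2, 3, 4], [40, 30, 20, 10]))  # move in opposite directions
print(correlation([1, 2, 3, 4], [1, -1, -1, 1]))    # no linear relationship
```

The three made-up pairs illustrate the cases in the text: a value at 1, a value at -1, and a value at zero.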
When you read about correlations you should wonder about the value of the correlation coefficient. A low value is essentially worthless, but activists may tell you the data is correlated (which it is, sort of, but not very well.)
Some correlations are nonsensical. I once read in my economics text that someone had correlated the price of butter in India with the Dow 30 over a significant time period. A totally meaningless correlation. Unfortunately, there are a great many papers written that identify statistically significant - but meaningless - correlations. Watch out for these. Gary Taubes' book "Good Calories, Bad Calories" describes a lot of meaningless correlations being used to establish government nutrition policies (e.g. Food Pyramid) that may now be incorrect due to their lack of meaningful evidence.
Correlations can be drawn between many items but it does not mean there is a connection between them unless you can identify a mechanism that links them together. For example, there is a nearby semiconductor manufacturing plant and slightly more people in the neighborhood have toenail fungus than in other neighborhoods. Statistics can identify the correlation - but that does not mean the semiconductor plant is causing toenail fungus unless you can identify a possible direct source of causation. It may just be an interesting correlation between data series and no more.
The increase in bloggers correlates very well with estimated annual global temperature, for example, but blogging is probably not a real cause other than that it's often connected to "hot air" (that is a joke!)
Some data does not follow nice trend lines. For example, sales of downhill ski equipment are much stronger during the winter. Thus, sales records of ski equipment go up and down every year. This type of data - that basically repeats on a periodic basis - is a seasonal time series. Special techniques are used for analyzing time-series data.
There are many types of data that do follow trend lines - as in linear trend lines. Some data, if drawn on a graph, is wavy in nature, or rises or falls steeply. This requires non-linear trend analysis methods.
Sometimes it is not clear if there is a wave - or even multiple wave-like properties - in the data. For example, those who try to look at trends in Atlantic hurricanes have about 130 years of data to look at (good data after WW II and poor data prior). Hurricanes are thought to have various cycles ranging over long periods of time. If we look at only a sample of a long period, depending on where we select the beginning and end points of our data, we may be looking at an increase or a decrease. For this reason, observing trends on many types of data requires careful consideration and recognition of possible sources of misinterpretation.
Be very careful when you see a linear trend or linear regression used to make future forecasts when the data being reviewed may not be linear in nature.
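To see how the choice of start and end points can manufacture a trend, here is a small sketch using an invented series that simply cycles every 40 years. The numbers are purely illustrative and are not hurricane data. The `slope` helper is a plain least-squares trend line.

```python
import math

def slope(xs, ys):
    """Least-squares slope of a straight trend line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Assumption: made-up yearly values that only cycle (peak at year 10, trough at year 30).
years = list(range(120))
values = [10 + 3 * math.sin(2 * math.pi * y / 40) for y in years]

rising = slope(years[30:50], values[30:50])   # window picked from trough to peak
falling = slope(years[10:30], values[10:30])  # window picked from peak to trough
overall = slope(years, values)                # all 120 years

print("trough-to-peak window:", round(rising, 3))
print("peak-to-trough window:", round(falling, 3))
print("all 120 years:        ", round(overall, 3))
```

The same series shows a clear "increase" or a clear "decrease" depending on the 20-year window chosen, while the slope over the full record is much flatter.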
Consider another problem. We have 4 factories producing cereals. We sample 100 boxes of cereal from each factory for quality and taste. We'd like to check for meaningful differences. To do this, we would use a technique known as "Analysis of Variance".
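The heart of Analysis of Variance is a single ratio, the F statistic, which compares how much the factory averages differ from each other with how much boxes vary within each factory. The sketch below computes only that ratio (turning it into a p-value needs the F distribution, which is left out). The quality scores are made up, and only 5 boxes per factory are used instead of 100 to keep it short.

```python
def anova_f_statistic(groups):
    """One-way ANOVA F: variation between group means over variation within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Made-up quality scores for 4 factories, 5 boxes each.
factories = [
    [7.1, 6.8, 7.4, 7.0, 6.9],
    [7.2, 7.0, 6.7, 7.3, 7.1],
    [6.9, 7.2, 7.0, 6.8, 7.3],
    [8.4, 8.1, 8.6, 8.2, 8.5],  # this factory scores noticeably higher
]
print(round(anova_f_statistic(factories), 1))
```

A large F value suggests that at least one factory differs from the others by more than box-to-box variation would explain. A value near 1 or below suggests no meaningful difference.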
Finally, there are a variety of advanced statistical methods, plus special methods used in specific fields. For example, in health care, a statistic called the "odds ratio" is often used. Another is "Number Needed to Treat" or NNT. For example, for a specific drug, you may need to treat 20 people for each patient that shows a positive outcome. That would be an NNT of 20, and you should interpret that as meaning the drug shows value for about 5% of the patients who take it and no value to the others. (You might be surprised to learn that most drugs do not work for most patients who take them.)
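The NNT arithmetic is simple enough to sketch. NNT is commonly computed as one divided by the difference in success rates between treated and untreated patients. The rates below (15% versus 10%) are hypothetical numbers chosen to reproduce an NNT of 20, and the function name is invented.

```python
def number_needed_to_treat(treated_success_rate, control_success_rate):
    """NNT = 1 / (absolute difference in success rates)."""
    absolute_benefit = treated_success_rate - control_success_rate
    if absolute_benefit <= 0:
        raise ValueError("treatment shows no benefit over control")
    return 1 / absolute_benefit

# Hypothetical trial: 15% improve on the drug, 10% improve without it.
nnt = number_needed_to_treat(0.15, 0.10)
print("NNT:", round(nnt))
print("share of patients who benefit:", round(1 / nnt, 2))
```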
Useful Resources
I did not use this in writing the above, but it may be of interest to others who would like some of the mathematical background which I intentionally omitted.
HyperStat Online Statistics Book (free!)

