Monday, October 19, 2009

Statistical Significance versus Statistical Power

This title is based on a subchapter in Multivariate Data Analysis (Joseph F. Hair, Jr.; William C. Black; Barry J. Babin; Rolph E. Anderson; Ronald L. Tatham, Pearson Education International, 2006, Singapore).

A census of the entire population makes statistical inference unnecessary, because any difference or relationship, however small, is “true” and does exist. Rarely, if ever, is a census conducted, however. Therefore, the researcher is forced to draw inferences from a sample.

Types of Statistical Error and Statistical Power

Power: the probability of correctly rejecting the null hypothesis when it is false, that is, of correctly finding a hypothesized relationship when it exists.

Power is determined as a function of:
1. The statistical significance level set by the researcher for type I error (alpha)
2. The sample sized used in the analysis
3. The effect size being examined

Interpreting statistical inferences requires that the researcher specify the acceptable level of statistical error due to using a sample (known as sampling error). The most common approach is to specify the level of type I error, also known as alpha. The type I error is the probability of rejecting the null hypothesis when it is actually true, or in simple terms, the chance of the test showing statistical significance when it is actually not present – the case of a “false positive”. By specifying an alpha level, the researcher sets the allowable limits for error and indicates the probability of concluding that significance exists when it really does not.

When specifying the level of type I error, the researcher also determines an associated error, termed the type II error or beta. The type II error is the probability of failing to reject the null hypothesis when it is actually false. An even more interesting probability is 1 – beta, termed the power of the statistical inference test. Power is the probability of correctly rejecting the null hypothesis when it should be rejected. Thus power is the probability that statistical significance will be indicated if it is present. The relationship of the different error probabilities in the hypothetical setting of testing for the difference in two means is shown here:

                    H0 true (no difference)     H0 false (difference exists)
Accept H0           1 – alpha                   beta (type II error)
Reject H0           alpha (type I error)        1 – beta (power)
Although specifying alpha establishes the level of acceptable statistical significance, it is the level of power that dictates the probability of success in finding differences if they actually exist. Then why not set both alpha and beta at acceptable levels? Because the type I and type II errors are inversely related: as the type I error becomes more restrictive (moves closer to zero), the probability of a type II error increases. Reducing the type I error therefore reduces the power of the statistical test. Thus, the researcher must strike a balance between the level of alpha and the resulting power.
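
To make the trade-off concrete, here is a minimal sketch (my own illustration, not from the book) that approximates the power of a two-sided, two-sample test of means; the effect size of .5 and the group size of 50 are assumed purely for illustration:

```python
# Minimal sketch: power of a two-sided, two-sample test of means under a
# normal approximation. The effect size (d = .5) and group size (n = 50)
# are illustrative assumptions, not values from the book.
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha):
    """Approximate power: P(reject H0 | true standardized difference d)."""
    z_crit = norm.ppf(1 - alpha / 2)               # two-tailed critical value
    noncentrality = d * (n_per_group / 2) ** 0.5   # expected z under H1
    return norm.cdf(noncentrality - z_crit)        # ignores the far tail

for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: power = {power_two_sample(0.5, 50, alpha):.2f}")
# Tightening alpha from .05 to .01 lowers power (raises the type II error).
```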

Impact on Statistical Power
But why can’t high levels of power always be achieved? Power is not solely a function of alpha. It is actually determined by three factors:

1. Effect size: The probability of achieving statistical significance is based not only on statistical considerations but also on the actual magnitude of the effect of interest (e.g., a difference of means between two groups or the correlation between variables) in the population, termed the effect size. As one would expect, a larger effect is more likely to be found than a smaller effect, and thus more likely to impact the power of the statistical test. To assess the power of any statistical test, the researcher must first understand the effect being examined. Effect sizes are defined in standardized terms for ease of comparison. Mean differences are stated in terms of standard deviations, so that an effect size of .5 indicates that the mean difference is one-half of a standard deviation. For correlations, the effect size is based on the actual correlation between the variables.

2. Alpha: As noted earlier, as alpha becomes more restrictive, power decreases. Therefore, as the researcher reduces the chance of incorrectly saying an effect is significant when it is not, the probability of correctly finding an effect also decreases. Conventional guidelines suggest alpha levels of .05 or .01. The researcher must consider the impact of this decision on the power before selecting the alpha, however.

3. Sample size: At any given alpha level, increased sample sizes always produce greater power for the statistical test. A potential problem then becomes too much power. By “too much” we mean that with increasing sample size, smaller and smaller effects will be found to be statistically significant, until at very large sample sizes almost any effect is significant (as the sketch after this list illustrates). The researcher must always be aware that sample size can affect the statistical test by either making it insensitive (at small sample sizes) or overly sensitive (at very large sample sizes).
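
A quick way to see the “too much power” problem is to hold a tiny effect size fixed and let the sample size grow. The sketch below uses the same normal approximation as the earlier sketch, with an assumed effect size of .05:

```python
# Sketch: how sample size alone drives power. A tiny effect (d = .05,
# an illustrative assumption) becomes detectable almost surely at huge n.
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(d * (n_per_group / 2) ** 0.5 - z_crit)

for n in (50, 500, 5000, 50000):
    print(f"n per group = {n:>6}: power = {power_two_sample(0.05, n):.2f}")
# Power climbs from roughly .04 toward 1.0 as n grows, so at very large n
# even a trivial effect will be declared statistically significant.
```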

The relationships among alpha, sample size, effect size, and power are quite complicated, and a number of sources of guidance are available. To achieve desired power levels, all three factors – alpha, sample size, and effect size – must be considered simultaneously.
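
One such source of guidance is software that solves the relationship directly. As a sketch (again my own illustration; the medium effect size of .5 and the conventional .80 power target are assumptions), the statsmodels library can solve for any one of the four quantities given the other three:

```python
# Sketch: solving the alpha / effect size / sample size / power relationship
# with statsmodels. The target power (.80) and effect size (.5) are
# illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

n_needed = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                       power=0.80, ratio=1.0,
                                       alternative='two-sided')
print(f"n per group for 80% power at alpha = .05, d = .5: {n_needed:.1f}")
# Roughly 64 cases per group under these assumptions.
```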

Hypothesis Testing

The objective of statistics is to make inferences about unknown population parameters based on information contained in sample data. These inferences are phrased in two ways: as estimates of the respective parameters or as tests of hypotheses about their values.

In many ways the formal procedure for hypothesis testing is similar to the scientific method. The scientist observes nature, formulates a theory, and then tests this theory against observations. The scientist poses a theory concerning one or more population parameters – that they equal specified values – then samples the population and compares observations with theory. If the observations disagree with the theory, the scientist rejects the hypothesis. If not, the scientist concludes either that the theory is true or that the sample did not detect the difference between the real and hypothesized values of the population parameters.

Hypothesis tests are conducted in all fields in which theory can be tested against observation. Hypotheses can be subjected to statistical verification by comparing the hypotheses with observed sample data.

The objective of a statistical test is to test a hypothesis concerning the values of one or more population parameters, called the research hypothesis. For example, suppose that a political candidate, Jones, claims that he will gain more than 50% of the votes in a city election and thereby emerge as the winner. If we don’t believe Jones’s claim, we might seek to support the research hypothesis that Jones is not favored by more than 50% of the electorate. Support for this research hypothesis, also called the alternative hypothesis, is obtained by showing (using sample data as evidence) that the converse of the alternative hypothesis, the null hypothesis, is false. Thus support for one theory is obtained by showing lack of support for its converse – in a sense, a proof by contradiction. Since we seek support for the alternative hypothesis that Jones’s claim is false, our alternative hypothesis is that p, the probability of selecting a voter favoring Jones, is less than .5. If we can show that the data support the rejection of the null hypothesis, p = .5 (the minimum value needed for a plurality), in favor of the alternative hypothesis, p < .5, we have achieved our research objective. Although it is common to speak of testing a null hypothesis, keep in mind that the research objective is usually to show support for the alternative hypothesis.
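
As a sketch of how the Jones test could actually be run (the poll counts are hypothetical, and the normal approximation to the binomial is my choice of method, not the book’s):

```python
# Sketch of the Jones example: H0: p = .5 versus H1: p < .5 using a
# normal approximation. The poll counts are hypothetical.
from scipy.stats import norm

n, favoring = 400, 180                      # hypothetical: 180 of 400 favor Jones
p_hat = favoring / n
z = (p_hat - 0.5) / (0.5 * 0.5 / n) ** 0.5  # standard error computed under H0
p_value = norm.cdf(z)                       # lower-tail test for H1: p < .5
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# A small p-value supports rejecting H0 (p = .5) in favor of H1 (p < .5).
```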

The elements of a statistical test:
1. Null hypothesis, H0
2. Alternative hypothesis, H1
3. Test statistic
4. Rejection region

The functioning parts of a statistical test are the test statistic and the associated rejection region. The test statistic is a function of the sample measurements upon which the statistical decision will be based. The rejection region specifies the values of the test statistic for which the null hypothesis is rejected. If, for a particular sample, the computed value of the test statistic falls in the rejection region, we reject the null hypothesis H0 and accept the alternative hypothesis H1. If the value of the test statistic does not fall in the rejection region, we accept H0.
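
A minimal sketch of these mechanics for a one-sample t-test (the data and the hypothesized mean are made up for illustration):

```python
# Sketch: test statistic and rejection region for a one-sample t-test.
# The data and the hypothesized mean mu0 are illustrative assumptions.
import numpy as np
from scipy import stats

data = np.array([9.8, 10.2, 10.4, 9.9, 10.6, 10.1, 10.3, 9.7])
mu0, alpha = 10.0, 0.05

t_stat = (data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(len(data)))
t_crit = stats.t.ppf(1 - alpha / 2, df=len(data) - 1)  # two-tailed boundary

print(f"t = {t_stat:.3f}, rejection region: |t| > {t_crit:.3f}")
print("reject H0" if abs(t_stat) > t_crit else "accept H0")
```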

Decisions must often be made based on sample data. The statistical procedures that guide the decision-making process are known as tests of hypotheses. Sample observations of the characteristic under consideration are made and descriptive statistics are calculated. These sample statistics are then analyzed, and the question is answered based on the results of the analysis. Because the data used to answer the question are sample data, there is always a chance that the answer will be wrong. If the sample is not truly representative of the population from which it was taken, type I and type II errors can occur. Thus, when a test of hypothesis is performed, it is essential that the confidence level – the probability that the statement is correct – be stated.

Methodology
1. Stating the Hypothesis
When tests of hypothesis are to be used to answer questions, the first step is to state what is to be proved.

The statement that is to be proved is known as the null hypothesis, or H0.

A second hypothesis, inconsistent with the null hypothesis, is called the alternative hypothesis, or H1.
This statement is what the data analysis will attempt to prove or disprove. If the analysis shows that the statement is true, fine. But if the analysis indicates that the statement is not true, a fallback position is needed.

It is strongly recommended that the null hypothesis always be stated as an equality. Although this isn’t necessary for statistical purposes, it does make later analysis much easier. The alternative hypothesis is then expressed either as a directional (less than or greater than) inequality or as a nondirectional inequality. The wording of the initial question determines the nature of the inequality used in the statement of the alternative hypothesis. A question involving “better than”, “faster than”, “stronger than”, or similar terminology would require a directional inequality. The phrase “same as” or “not any different than” would imply a nondirectional inequality. The statement of the alternative hypothesis must be consistent with the observed sample data.

When the alternative hypothesis is stated as a directional inequality, the procedure is called a one-tailed test of hypothesis.

A nondirectional inequality in the alternative hypothesis signifies a two-tailed test of hypothesis.
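
The choice matters for the eventual decision, as this small sketch shows (the test statistic value and degrees of freedom are assumed for illustration):

```python
# Sketch: one-tailed versus two-tailed p-values for the same evidence.
# The test statistic and degrees of freedom are illustrative assumptions.
from scipy import stats

t_stat, df = 1.90, 24
p_one = stats.t.sf(t_stat, df)            # one-tailed, H1: mu > mu0
p_two = 2 * stats.t.sf(abs(t_stat), df)   # two-tailed, H1: mu != mu0
print(f"one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")
# Here the result is significant at alpha = .05 one-tailed but not two-tailed.
```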

2. Specifying the Confidence Level
After both the null and the alternative hypotheses have been stated, the second step is to specify the confidence level. Usually the selection is arbitrary. However, there may be organizational guidelines that specify the confidence level. Common confidence levels are 90 percent, 95 percent, and 99 percent. A brief statement or an equation defining the confidence level in terms of alpha is usually sufficient; for example, the notation alpha = 0.05 might appear after the hypothesis. This would designate 95 percent confidence.

3. Collecting Sample Data
The third step in testing a hypothesis is the collection of sample data. After the null hypothesis has been identified – the equality of means, proportions, standard deviations, or whatever – the nature of the required data can be specified. The data must then be collected, and the appropriate sample descriptive statistics must be calculated.

4. Calculating Test Statistics
After the sample descriptive statistics have been calculated, the appropriate test statistic must be calculated. There are many test statistics that may be calculated; the specific test statistic used will depend on the nature of the null and alternative hypotheses.

5. Identifying Table Statistics or Using P-value
After the test statistic is calculated, the table statistic is determined. The nature of the alternative hypothesis, the sample size, and the specific statistic being tested will determine which of the standard distribution tables, such as the normal curve, Student's t, or chi-square, should be used.

6. Decision Making
The following rules will govern the decision, provided common sense is applied (a short sketch follows the rules).
-. If the absolute value of the test statistic is less than or equal to the table statistic, or if the p-value is greater than alpha, then there is not sufficient evidence to reject the null hypothesis – the null hypothesis is accepted as being true.
-. If the absolute value of the test statistic is greater than the table statistic, or if the p-value is less than alpha, then there is sufficient evidence to reject the null hypothesis as being true – this would imply that the alternative hypothesis must be true.
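
A sketch of this decision rule using scipy's built-in one-sample t-test (the sample values and the hypothesized mean are hypothetical):

```python
# Sketch of step 6's decision rule via the p-value. Data are hypothetical.
from scipy import stats

sample = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.4, 4.7]
alpha = 0.05
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)  # H0: mu = 5.0

print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
if p_value < alpha:
    print("p < alpha: sufficient evidence to reject H0")
else:
    print("p >= alpha: not sufficient evidence to reject H0")
```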


Source:
-. Mathematical Statistics with Applications, William Mendenhall, Richard L. Scheaffer, Dennis D. Wackerly
-. Fundamentals of Industrial Quality Control, 3rd edition, Lawrence S. Aft, St. Lucie Press, London, 1998

Confidence - Confidence Interval

Inferential statistical analysis is the process of sampling characteristics from large populations, summarizing those characteristics or attributes, and drawing conclusions or making predictions from the summary or descriptive information. When inferences are made based on sample data, there is always a chance that a mistake will be made. The probability that the inference will be correct is referred to as the degree of confidence with which the inference can be stated. There are two types of mistakes that can occur: type I error and type II error.

Type I error:
-. It is made if H0 is rejected when H0 is true. The probability of a type I error is denoted by alpha.
-. The probability of incorrectly rejecting the null hypothesis – in most cases, it means saying a difference or correlation exists when it actually does not. Also termed alpha. Typical levels are 5 or 1 percent, termed the .05 or .01 level.
-. Stating that the results of sampling are unacceptable when in reality the population from which the sample was taken meets the stated requirements

The probability of a type I error – rejecting what should be accepted – is known as the alpha risk, or level of significance. A level of significance of 5 percent corresponds to a 95 percent chance of accepting what should be accepted. In such an instance, the analyst would have 95 percent confidence in the conclusions drawn or the inferences made. Another interpretation would be that there is a 95 percent chance that the statements made are correct.

Type II error:
-. It is made if H0 is accepted when H1 is true. The probability of a type II error is denoted by beta.
-. The probability of incorrectly failing to reject the null hypothesis – in simple terms, the chance of not finding a correlation or mean difference when one does exist. Also termed beta, it is inversely related to the type I error. The value of 1 minus the type II error (1 – beta) is defined as power.
-. Stating that the results of sampling are acceptable when in reality the population from which the sample was taken does not meet the stated requirements

The probability of a type II error – accepting what should be rejected – is known as the beta risk. It is important when acceptance sampling plans are developed and used.

It is possible to estimate population parameters, such as the mean or the standard deviation, based on sample values. How good the predictions are depends on how accurately the sample values reflect the values for the entire population. If a high level of confidence in the inferences is desired, a large proportion of the population should be observed. In fact, in order to achieve 100 percent confidence, one must sample the entire population. Because of the economic considerations typically involved in inspection, the selection of an acceptable confidence level is usually seen as a trade-off between cost and confidence. Typically, 90, 95, and 99 percent confidence levels are used, with the 99.73 percent level used in certain quality control applications.

A confidence interval is a range of values that has a specified likelihood of including the true value of a population parameter. It is calculated from sample estimates of the parameter.

There are many types of population parameters for which confidence intervals can be established. Those important in applications include means, proportions (percentages), and standard deviations.

Confidence intervals are generally presented in the following format:
Point estimate of population parameter +/- (Confidence factor) x (Measure of variability) x (Adjusting factor)
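
For a mean estimated from a simple random sample the adjusting factor is 1, so the format reduces to the familiar t-interval. A sketch with made-up data:

```python
# Sketch: 95% confidence interval for a mean in the format above, with
# adjusting factor 1 (simple random sample). The data are made up.
import numpy as np
from scipy import stats

data = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])
conf = 0.95

point_estimate = data.mean()
variability = data.std(ddof=1) / np.sqrt(len(data))            # standard error
conf_factor = stats.t.ppf(1 - (1 - conf) / 2, df=len(data) - 1)

half_width = conf_factor * variability
print(f"{conf:.0%} CI: {point_estimate:.3f} +/- {half_width:.3f}")
```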


Source:
-. Fundamentals of Industrial Quality Control, 3rd edition, Lawrence S. Aft, St. Lucie Press, London, 1998
-. Mathematical Statistics with Applications, William Mendenhall, Richard L. Scheaffer, Dennis D. Wackerly
-. Multivariate Data Analysis, Joseph F. Hair, Jr.; William C. Black; Barry J. Babin; Rolph E. Anderson; Ronald L. Tatham, Pearson Education International, 2006, Singapore

Obtrusive and Unobtrusive Measurement

Definition of unobtrusive (adjective):
1. Not obtrusive; not blatant, aggressive, or arresting
2. Inconspicuous

Why use unobtrusive methods?
-. Access
-. Unique opportunity

Observation techniques are methods for gathering data by watching test subjects without interacting with them.
Direct Observation: Researchers watch a behavior as it occurs and report what they see.
Indirect Observation: Researchers observe the results of a behavior.
Unobtrusive or Disguised Observation: Subject does not know he/she is being observed.
Obtrusive or Undisguised Observation: Subject knows he/she is being observed.

Six Different Ways Of Classifying Observation Methods:
1. Participant vs. non-participant observation.
2. Obtrusive vs. unobtrusive (including physical trace observation).
3. Observation in natural vs. contrived settings.
4. Disguised vs. non-disguised observation.
5. Structured vs. unstructured observation.
6. Direct vs. indirect observation.

Social scientists distinguish between obtrusive and unobtrusive measurement. In obtrusive or reactive measurement, research subjects are aware that they are being studied. In unobtrusive measurement, the subjects are not aware.

Behavioral observation can be either an obtrusive or an unobtrusive measurement. This distinction refers to the extent to which the respondent or subject is aware that he or she is being evaluated. This awareness can affect both the internal validity and the external validity of a study. Awareness can produce sensitization to the experimental manipulation, enhanced memory effects, reactivity to the research setting, and a host of other artificial effects which will obscure true relationships.
It is almost always the goal of a communication researcher to make observation as unobtrusive as possible. This can be done with careful design of the research setting or by choosing a measurement method that is inherently unobtrusive.

Reducing Obtrusiveness. Research settings can often be constructed so that the observer is inconspicuous or completely hidden.

For example, the children were observed through a one-way mirror, which prevents the observed person from seeing the observer. The children may not have been aware of the purpose of a one-way mirror, but for older research participants the presence of a one-way mirror will be a dead giveaway that they are being observed. This realization may affect behavior in unknown ways. But even if they realize that they are being observed from behind a mirror, there is a tendency to ignore the observer after a time, because there are no movements or noises from the observer to remind the subject that she is being observed.

If the subject suspects that he is being surreptitiously observed, he may actually react more strongly than if he is told that someone is behind the one-way mirror. The presence of a passive mirror or a small video camera in a discreet box is easily ignored after the novelty wears off, so it is often better to inform subjects that they are being observed than it is to allow them to have unconfirmed suspicions. Even if it is impossible to completely hide the observer, the obtrusive effect can be reduced by placing the observer in an out-of-the-way corner of the room and instructing him to remain as motionless and quiet as possible, to avoid rustling the coding sheets, etc.
There is a privacy issue involved with unobtrusive measurement.

Naturally Unobtrusive Measurement. Some types of observational measurement are inherently unobtrusive. These data are collected with little or no awareness, on the part of the sources of the data, that communication research is being conducted.

For example, the mean income or number of telephones in urban census tracts could be useful variables for a telecommunications researcher. The U.S. Commerce Department also collects detailed data about business organizations that can be used for similar aggregate analysis purposes. Governmental data are available at many public libraries and at most university libraries.

For the mass communication researcher, these archives are particularly useful when their information is combined with data from media archives, which collect and preserve newspaper and magazine stories, television newscasts, television and radio commercials, and other media messages. Most large libraries carry the New York Times Index, which can be used to summarize the frequency with which newspaper stories about selected issues or topics appear. The Vanderbilt Television Archives publishes an index of network television story coverage and can provide videotapes of stories about selected topics. The researcher can use a media archive to provide the material for a content analysis (described in more detail later in this chapter). The data from the content analysis, combined with data from a public opinion archive, can be used to track the relationship between media messages and aggregate audience response.
Archives of original documents like letters and manuscripts can also be a source of unobtrusive data to the researcher interested in analyzing messages.
For example, the organizational researcher might gain access to electronic mail messages in a sample of corporations, and use this information to study communication patterns within different types of organizations. She might also collect all the interoffice mail envelopes and code the sender’s and recipient’s departments to unobtrusively measure interdepartmental communication. This kind of measurement produces no demand characteristics and no sensitization of research subjects.

Reusing the data collected by other researchers (secondary analysis) is often a very efficient way to collect information. This measurement may or may not be considered obtrusive. For example, an interpersonal communication researcher might be able to gain access to interviews and transcripts at a research center for family communication and therapy. Since the research subjects were probably aware that their responses were being recorded, the data will be subject to some sensitization and social demand contamination. But if the subject of the interviews was, for example, the establishment of rules for adolescents, and the communication researcher is interested in the dominance of the conversation by the mother or father, he can consider his dominance measurement as being unobtrusive.

There are many, many other sources of data for secondary analysis. Commercial research organizations often maintain databases that can be made available to academic researchers after their business value has disappeared.


Source:
-. http://www.slideshare.net/sladner/week08-unobtrusive-presentation
-. Webb, E.J, Campbell, D.T., Schwartz, R.D., & Sechrest, L. (1972). Unobtrusive measures: Nonreactive research in the social sciences. Chicago: Rand McNally.