Monday, October 19, 2009

Statistical Significance versus Statistical Power

This title is based on a subchapter of Multivariate Data Analysis (Joseph F. Hair, Jr.; William C. Black; Barry J. Babin; Rolph E. Anderson; Ronald L. Tatham, Pearson Education International, 2006, Singapore).

A census of the entire population makes statistical inference unnecessary, because any difference or relationship, however small, is “true” and does exist. Rarely, if ever, is a census conducted, however. Therefore, the researcher is forced to draw inferences from a sample.

Types of Statistical Error and Statistical Power

Power: the probability of correctly rejecting the null hypothesis when it is false, that is, of correctly finding a hypothesized relationship when it exists.

Determined as a function of:
1. The statistical significance level set by the researcher for type I error (alpha)
2. The sample size used in the analysis
3. The effect size being examined

Interpreting statistical inferences requires that the researcher specify the acceptable level of statistical error that results from using a sample (known as sampling error). The most common approach is to specify the level of type I error, also known as alpha. The type I error is the probability of rejecting the null hypothesis when it is actually true, or in simple terms, the chance of the test showing statistical significance when it is actually not present – the case of a “false positive”. By specifying an alpha level, the researcher sets the allowable limits for error and indicates the probability of concluding that significance exists when it really does not.

When specifying the level of type I error, the researcher also determines an associated error, termed the type II error or beta. The type II error is the probability of failing to reject the null hypothesis when it is actually false. An even more interesting probability is 1 – beta, termed the power of the statistical inference test. Power is the probability of correctly rejecting the null hypothesis when it should be rejected. Thus power is the probability that statistical significance will be indicated if it is present. In the hypothetical setting of testing for the difference in two means, the error probabilities relate as follows:

-. Accept H0 when H0 is true: correct decision (probability 1 – alpha)
-. Accept H0 when H0 is false: type II error (beta)
-. Reject H0 when H0 is true: type I error (alpha)
-. Reject H0 when H0 is false: correct decision, i.e., power (1 – beta)

Although specifying alpha establishes the level of acceptable statistical significance, it is the level of power that dictates the probability of success in finding the differences if they actually exist. Then why not set both alpha and beta at acceptable levels? Because the type I and type II errors are inversely related: as the type I error becomes more restrictive (moves closer to zero), the probability of a type II error increases. Reducing the type I error therefore reduces the power of the statistical test. Thus, the researcher must strike a balance between the level of alpha and the resulting power.

Impact on Statistical Power
But why can’t high levels of power always be achieved? Power is not solely a function of alpha. It is actually determined by three factors:

1. Effect size: The probability of achieving statistical significance is based not only on statistical considerations but also on the actual magnitude of the effect of interest (e.g., a difference of means between two groups or the correlation between variables) in the population, termed the effect size. As one would expect, a larger effect is more likely to be found than a smaller effect, and thus more likely to impact the power of the statistical test. To assess the power of any statistical test, the researcher must first understand the effect being examined. Effect sizes are defined in standardized terms for ease of comparison. Mean differences are stated in terms of standard deviations, so that an effect size of .5 indicates that the mean difference is one-half of a standard deviation. For correlations, the effect size is based on the actual correlation between the variables.

2. Alpha: As noted earlier, as alpha becomes more restrictive, power decreases. Therefore, as the researcher reduces the chance of incorrectly saying an effect is significant when it is not, the probability of correctly finding an effect also decreases. Conventional guidelines suggest alpha levels of .05 or .01. The researcher must consider the impact of this decision on the power before selecting the alpha, however.

3. Sample size: At any given alpha level, increased sample sizes always produce greater power for the statistical test. A potential problem then becomes too much power. By “too much” we mean that with increasing sample size, smaller and smaller effects will be found to be statistically significant, until at very large sample sizes almost any effect is significant. The researcher must always be aware that sample size can affect the statistical test by either making it insensitive (at small sample sizes) or overly sensitive (at very large sample sizes).

The relationships among alpha, sample size, effect size, and power are quite complicated, and a number of sources of guidance are available. To achieve the desired power level, all three factors – alpha, sample size, and effect size – must be considered simultaneously.
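
To make these three factors concrete, here is a minimal Python sketch (my own illustration, not from the Hair et al. text) that approximates the power of a two-sided, two-sample t-test from alpha, the per-group sample size, and a standardized effect size. It assumes SciPy and NumPy are installed; the function name power_two_sample_t and the example numbers are hypothetical.

# Approximate power of a two-sided, two-sample t-test (equal group sizes).
from scipy import stats
import numpy as np

def power_two_sample_t(effect_size, n_per_group, alpha=0.05):
    # effect_size: standardized mean difference (Cohen's d)
    # n_per_group: sample size in each group
    # alpha:       type I error rate
    df = 2 * n_per_group - 2                      # degrees of freedom
    ncp = effect_size * np.sqrt(n_per_group / 2)  # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)       # two-sided critical value
    # Power = P(|T| > t_crit) when T follows a noncentral t distribution
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

print("d=0.5, n=30,  alpha=.05:", round(power_two_sample_t(0.5, 30), 3))
print("d=0.5, n=30,  alpha=.01:", round(power_two_sample_t(0.5, 30, alpha=0.01), 3))
print("d=0.5, n=100, alpha=.05:", round(power_two_sample_t(0.5, 100), 3))
print("d=0.2, n=100, alpha=.05:", round(power_two_sample_t(0.2, 100), 3))

Comparing the printed values shows the trade-offs described above: a stricter alpha lowers power, while a larger sample or a larger effect raises it.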

Hypothesis Testing

The objective of statistics is to make inferences about unknown population parameters based on information contained in sample data. These inferences are phrased in two ways: as estimates of the respective parameters or as tests of hypotheses about their values.

In many ways the formal procedure for hypothesis testing is similar to the scientific method. The scientist observes nature, formulates a theory, and then tests this theory against observations. The scientist poses a theory concerning one or more population parameters – that they equal specified values – then samples the population and compares observation with theory. If the observations disagree with the theory, the scientist rejects the hypothesis. If not, the scientist concludes either that the theory is true or that the sample did not detect the difference between the real and hypothesized values of the population parameters.

Hypothesis tests are conducted in all fields in which theory can be tested against observation. Hypotheses can be subjected to statistical verification by comparing the hypotheses with observed sample data.

The objective of a statistical test is to test a hypothesis concerning the values of one or more population parameters, called the research hypothesis. For example, suppose that a political candidate, Jones, claims that he will gain more than 50% of the votes in a city election and thereby emerge as the winner. If we don’t believe Jones’s claim, we might seek to support the research hypothesis that Jones is not favored by more than 50% of the electorate. Support for this research hypothesis, also called the alternative hypothesis, is obtained by showing (using sample data as evidence) that the converse of the alternative hypothesis, the null hypothesis, is false. Thus support for one theory is obtained by showing lack of support for its converse, in a sense a proof by contradiction. Since we seek support for the alternative hypothesis that Jones’s claim is false, our alternative hypothesis is that p, the probability of selecting a voter favoring Jones, is less than .5. If we can show that the data support the rejection of the null hypothesis, p = .5 (the minimum value needed for a plurality), in favor of the alternative hypothesis, p < .5, we have achieved our research objective. Although it is common to speak of testing a null hypothesis, keep in mind that the research objective is usually to show support for the alternative hypothesis.
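
As a hedged illustration of the Jones example (the vote counts below are invented, and the large-sample z-test is only one common way to test a proportion), a short Python sketch assuming SciPy is available:

# Test H0: p = .5 against H1: p < .5 with a large-sample z-test for a proportion.
import math
from scipy.stats import norm

n = 400            # hypothetical number of sampled voters
favor_jones = 180  # hypothetical number favoring Jones
p_hat = favor_jones / n

se = math.sqrt(0.5 * 0.5 / n)   # standard error under H0: p = .5
z = (p_hat - 0.5) / se
p_value = norm.cdf(z)           # lower-tail p-value, matching H1: p < .5

alpha = 0.05
print("z =", round(z, 2), " p-value =", round(p_value, 4))
if p_value < alpha:
    print("Reject H0: the data support H1 (p < .5), casting doubt on Jones's claim.")
else:
    print("Fail to reject H0: not enough evidence against Jones's claim.")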

The elements of a statistical test:
1. Null hypothesis, H0
2. Alternative hypothesis, H1
3. Test statistic
4. Rejection region

The functioning parts of a statistical test are the test statistic and the associated rejection region. The test statistic is a function of the sample measurements upon which the statistical decision will be based. The rejection region specifies the values of the test statistic for which the null hypothesis is rejected. If, for a particular sample, the computed value of the test statistic falls in the rejection region, we reject the null hypothesis H0 and accept the alternative hypothesis H1. If the value of the test statistic does not fall into the rejection region, we accept H0.

Decisions must often be made based on sample data. The statistical procedures that guide the decision-making process are known as tests of hypotheses. Sample observations of the characteristic under consideration are made and descriptive statistics are calculated. These sample statistics are then analyzed, and the question is answered based on the results of the analysis. Because the data used to answer the question are sample data, there is always a chance that the answer will be wrong. If the sample is not truly representative of the population from which it was taken, type I and type II errors can occur. Thus, when a test of hypothesis is performed, it is essential that the confidence level – the probability that the statement is correct – be stated.

Methodology
1. Stating the Hypothesis
When tests of hypothesis are to be used to answer questions, the first step is to state what is to be proved.

The statement that is to be proved is known as the null hypothesis, or H0.

A second hypothesis, inconsistent with the null hypothesis, is called the alternative hypothesis, or H1.
The statement is what the data analysis will attempt to prove or disprove. If the analysis shows that the statement is true, fine. But if the analysis indicates that the statement is not true, a fallback position is needed.

It is strongly recommended that the null hypothesis always be stated as an equality. Although this isn’t necessary for statistical purposes, it does make later analysis much easier. The alternative hypothesis is then expressed either as a directional (less than or greater than) inequality or as a nondirectional inequality. The wording of the initial question determines the nature of the inequality used in the statement of the alternative hypothesis. A question involving “better than”, “faster than”, ”stronger than”, or similar terminology would require a directional inequality. The phrase “same as” or “not any different than” would imply a nondirectional inequality. The statement of the alternative hypothesis must be consistent with the observed sample data.

When the alternative hypothesis is stated as a directional inequality, the procedure is called a one-tailed test of hypothesis.

A nondirectional inequality in the alternative hypothesis signifies a two-tailed test of hypothesis.

2. Specifying the Confidence Level
After both the null and the alternative hypotheses have been stated, the second step is to specify the confidence level. Usually the selection is arbitrary. However, there may be organizational guidelines that specify the confidence level. Common confidence levels are 90 percent, 95 percent, and 99 percent. A brief statement or an equation defining the confidence level in terms of alpha is usually sufficient; for example, the notation alpha = 0.05 might appear after the hypothesis. This would designate 95 percent confidence.

3. Collecting Sample Data
The third step in testing a hypothesis is the collection of sample data. After the null hypothesis has been identified – the equality of means, proportions, standard deviations, or whatever – the nature of the required data can be specified. The data must then be collected, and the appropriate sample descriptive statistics must be calculated.

4. Calculating Test Statistics
After the sample descriptive statistics have been calculated, the appropriate test statistic must be calculated. There are many test statistics that may be calculated. The specific test statistic used will depend on the nature of the null and alternative hypotheses.

5. Identifying Table Statistics or Using P-value
After the test statistic is calculated, the table statistic is determined. The nature of the alternative hypothesis, the sample size, and the specific statistic being tested will determine which of the standard distribution tables, such as the normal curve, Student's t, or chi-square, should be used.
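
For illustration only (not part of the source), a small Python sketch showing how such table statistics can be obtained from SciPy's percent-point functions instead of printed tables; the alpha value and degrees of freedom are arbitrary examples:

# Critical ("table") values for alpha = 0.05.
from scipy import stats

alpha = 0.05
z_crit    = stats.norm.ppf(1 - alpha / 2)      # two-tailed normal-curve value (about 1.96)
t_crit    = stats.t.ppf(1 - alpha / 2, df=15)  # two-tailed Student's t with 15 df
chi2_crit = stats.chi2.ppf(1 - alpha, df=15)   # upper-tail chi-square with 15 df
print(round(z_crit, 3), round(t_crit, 3), round(chi2_crit, 3))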

6. Decision Making
The following rules will govern the decision, provided common sense is applied.
-. If the absolute value of the test statistic is less than or equal to the table statistic, or if the p-value is greater than alpha, then there is not sufficient evidence to reject the null hypothesis – the null hypothesis is accepted as being true.
-. If the absolute value of the test statistic is greater than the table statistic, or if the p-value is less than alpha, then there is sufficient evidence to reject the null hypothesis as being true – this would imply that the alternative hypothesis must be true.
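
To make the decision rule concrete, here is a minimal Python sketch (my own, not from the sources) of a one-sample t-test in which both comparisons are applied; the measurements and the hypothesized mean are hypothetical:

# Step 6 for a one-sample t-test: compare |t| with the table statistic, or p with alpha.
from scipy import stats

data = [9.8, 10.2, 10.4, 9.9, 10.6, 10.1, 10.3, 9.7]  # hypothetical measurements
mu0 = 10.0                                            # H0: mean = 10
alpha = 0.05                                          # 95 percent confidence

t_stat, p_value = stats.ttest_1samp(data, mu0)        # two-tailed test
df = len(data) - 1
t_table = stats.t.ppf(1 - alpha / 2, df)              # two-tailed table statistic

print("|t| =", round(abs(t_stat), 3), " table t =", round(t_table, 3), " p =", round(p_value, 3))
if abs(t_stat) > t_table:      # equivalently: p_value < alpha
    print("Sufficient evidence to reject the null hypothesis.")
else:
    print("Not sufficient evidence to reject the null hypothesis.")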


Source:
-. Mathematical Statistics with Applications, William Mendenhall, Richard L. Scheaffer, Dennis D. Wackerly
-. Fundamentals of Industrial Quality Control, 3rd edition, Lawrence S. Aft, St. Lucie Press, London, 1998

Confidence - Confidence Interval

Inferential statistical analysis is the process of sampling characteristics from large populations, summarizing those characteristics or attributes, and drawing conclusions or making predictions from the summary or descriptive information. When inferences are made based on sample data, there is always a chance that a mistake will be made. The probability that the inference will be correct is referred to as the degree of confidence with which the inference can be stated. There are two types of mistakes that can occur: type I error and type II error.

Type I error:
-. It is made if H0 is rejected when H0 is true. The probability of a type I error is denoted by alpha.
-. The probability of incorrectly rejecting the null hypothesis – in most cases, it means saying a difference or correlation exists when it actually does not. Also termed alpha. Typical levels are 5 or 1 percent, termed the .05 or .01 level.
-. Stating that the results of sampling are unacceptable when in reality the population from which the sample was taken meets the stated requirements.

The probability of a type I error – rejecting what should be accepted – is known as the alpha risk, or level of significance. A level of significance of 5 percent corresponds to a 95 percent chance of accepting what should be accepted. In such an instance, the analyst would have 95 percent confidence in the conclusions drawn or the inferences made. Another interpretation would be that there is a 95 percent chance that the statements made are correct.

Type II error:
-. It is made if H0 is accepted when H1 is true. The probability of a type II error is denoted by beta.
-. The probability of incorrectly failing to reject the null hypothesis – in simple terms, the chance of not finding a correlation or mean difference when it does exist. Also termed beta, it is inversely related to the type I error. The value of 1 minus the type II error (1 – beta) is defined as power.
-. Stating that the results of sampling are acceptable when in reality the population from which the sample was taken does not meet the stated requirements.

The probability of a type II error – accepting what should be rejected – is known as the beta risk. It is important when acceptance sampling plans are developed and used.

It is possible to estimate population parameters, such as the mean or the standard deviation, based on sample values. How good the predictions are depends on how accurately the sample values reflect the values for the entire population. If a high level of confidence in the inferences is desired, a large proportion of the population should be observed. In fact, in order to achieve 100 percent confidence, one must sample the entire population. Because of the economic considerations typically involved in inspection, the selection of an acceptable confidence level is usually seen as a trade-off between cost and confidence. Typically, 90, 95, and 99 percent confidence levels are used, with the 99.73 percent level used in certain quality control applications.

A confidence interval is a range of values that has a specified likelihood of including the true value of a population parameter. It is calculated from sample estimates of the parameter.

There are many types of population parameters for which confidence intervals can be established. Those important in applications include means, proportions (percentages), and standard deviations.

Confidence intervals are generally presented in the following format:
Point estimate of the population parameter +/- (Confidence factor) x (Measure of variability) x (Adjusting factor)
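
As a hedged example of this format for a population mean (the data are invented and the adjusting factor is taken as 1), a short Python sketch using the t distribution as the confidence factor and the standard error of the mean as the measure of variability; it assumes NumPy and SciPy are available:

# 95 percent confidence interval for a mean from a small sample.
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])  # hypothetical data
n = len(sample)
mean = sample.mean()                   # point estimate of the population mean
sem = sample.std(ddof=1) / np.sqrt(n)  # measure of variability (standard error)

confidence = 0.95
t_factor = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)  # confidence factor

lower, upper = mean - t_factor * sem, mean + t_factor * sem
print("95% CI:", round(mean, 2), "+/-", round(t_factor * sem, 2),
      "=", (round(lower, 2), round(upper, 2)))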


Source:
-. Fundamentals of Industrial Quality Control, 3rd edition, Lawrence S. Aft, St. Lucie Press, London, 1998
-. Mathematical Statistics with Applications, William Mendenhall, Richard L. Scheaffer, Dennis D. Wackerly
-. Multivariate Data Analysis, Joseph F. Hair, Jr.; William C. Black; Barry J. Babin; Rolph E. Anderson; Ronald L. Tatham, Pearson Education International, 2006, Singapore

Obtrusive and Unobtrusive Measurement

Definition of unobtrusive (adjective):
1. Not obtrusive; Not blatant, aggressive or arresting
2. Inconspicuous

Why use unobtrusive methods?
-. Access
-. Unique opportunity

Observation techniques are methods for gathering data by watching test subjects without interacting with them.
Direct Observation: Researchers watch a behavior as it occurs and report what they see.
Indirect Observation: Researchers observe the results of a behavior.
Unobtrusive or Disguised Observation: Subject does not know he/she is being observed.
Obtrusive or Undisguised Observation: Subject knows he/she is being observed.

Six Different Ways Of Classifying Observation Methods:
1. Participant vs. Non participant observation.
2. Obtrusive vs. Unobtrusive (including physical trace observation).
3. Observation in natural vs. contrived settings.
4. Disguised vs. non-disguised observation.
5. Structured vs. unstructured observation, and
6. Direct vs. indirect observation

Social scientists distinguish between obtrusive and unobtrusive measurement. In obtrusive or reactive measurement, research subjects are aware that they are being studied. In unobtrusive measurement, the subjects are not aware.

Behavioral observation can be either obtrusive or unobtrusive measurement. This distinction refers to the extent to which the respondent or subject is aware that he or she is being evaluated. This awareness can affect both the internal validity and external validity of a study. Awareness can produce sensitization to the experimental manipulation, enhanced memory effects, reactivity to the research setting, and a host of other artificial effects which will obscure true relationships.
It is almost always the goal of a communication researcher to make observation as unobtrusive as possible. This can be done with careful design of the research setting or by choosing a measurement method that is inherently unobtrusive.

Reducing Obtrusiveness. Research settings can often be constructed so that the observer is inconspicuous or completely hidden.

For example, children might be observed through a one-way mirror, which prevents the observed person from seeing the observer. Young children may not be aware of the purpose of a one-way mirror, but for older research participants the presence of a one-way mirror will be a dead giveaway that they are being observed. This realization may affect behavior in unknown ways. But even if they realize that they are being observed from behind a mirror, there is a tendency to ignore the observer after a time, because there are no movements or noises from the observer to remind the subject that she is being observed.

If the subject suspects that he is being surreptitiously observed, he may actually react more strongly than if he is told that someone is behind the one-way mirror. The presence of a passive mirror or a small video camera in a discreet box is easily ignored after the novelty wears off, so it is often better to inform subjects that they are being observed than it is to allow them to have unconfirmed suspicions. Even if it is impossible to completely hide the observer, the obtrusive effect can be reduced by placing the observer in an out-of-the-way corner of the room and instructing him to remain as motionless and quiet as possible, to avoid rustling the coding sheets, and so on.
There is a privacy issue involved with unobtrusive measurement.

Naturally Unobtrusive Measurement. Some types of observational measurement are inherently unobtrusive. This data is collected with little or no awareness by the sources of the data that communication research is being conducted.

For example, the mean income or number of telephones in urban census tracts could be useful variables for a telecommunications researcher. The U.S. Commerce department also collects detailed data about business organizations that can be used for similar aggregate analysis purposes. Governmental data is available at many public libraries and at most university libraries.

For the mass communication researcher, these archives are particularly useful when their information is combined with data from media archives which collect and preserve newspaper and magazine stories, television newscasts, television and radio commercials, and other media messages. Most large libraries carry the New York Times Index which can be used to summarize the frequency that newspaper stories about selected issues or topics appear. The Vanderbilt Television Archives publish an index of network television story coverage and can provide videotapes of stories about selected topics. The researcher can use a media archive to provide the material for a content analysis (described in more detail later in this chapter). The data from the content analysis, combined with data from a public opinion archive, can be used to track the relationship between media messages and aggregate audience response.
Archives of original documents like letters and manuscripts can also be a source of unobtrusive data to the researcher interested in analyzing messages.
For example, the organizational researcher might gain access to electronic mail messages in a sample of corporations, and use this information to study communication patterns within different types of organizations. She might also collect all the interoffice mail envelopes and code the sender’s and recipient’s departments to unobtrusively measure interdepartmental communication. This kind of measurement produces no demand characteristics and no sensitization of research subjects.

Reusing the data collected by other researchers (secondary analysis) is often a very efficient way to collect information. This measurement may or may not be considered obtrusive. For example, an interpersonal communication researcher might be able to gain access to interviews and transcripts at a research center for family communication and therapy. Since the research subjects were probably aware that their responses were being recorded, the data will be subject to some sensitization and social demand contamination. But if the subject of the interviews was, for example, establishment of rules for adolescents, and the communication researcher is interested in the dominance of the conversation by the mother or father, he can consider his dominance measurement as being unobtrusive.

There are many, many other sources of data for secondary analysis. Commercial research organizations often maintain databases that can be made available to academic researchers after their business value has disappeared.


Source:
-. http://www.slideshare.net/sladner/week08-unobtrusive-presentation
-. Webb, E.J, Campbell, D.T., Schwartz, R.D., & Sechrest, L. (1972). Unobtrusive measures: Nonreactive research in the social sciences. Chicago: Rand McNally.

Tuesday, September 1, 2009

Hypothesis - 1

The general objective in exploratory research is to gain insights and ideas. The exploratory study is particularly helpful in breaking broad, vague problem statements into smaller, more precise subproblem statements, hopefully in the form of specific hypotheses.

A hypothesis is:
1. A statement that specifies how two or more measurable variables are related. A good hypothesis carries clear implications for testing stated relationship.

2. A hypothesis is a proposed explanation for an observable phenomenon.

3. An assumption or concession made for the sake of argument, or an interpretation of a practical situation or condition taken as the ground for action

4. A tentative assumption made in order to draw out and test its logical or empirical consequences

5. The antecedent clause of a conditional statement

6. A preliminary or tentative explanation or postulate by the researcher of what the researcher considers the outcome of an investigation will be.

7. Statement postulating a possible relationship between two or more phenomena or variables. (Mouton's (1990: Chapter 6))

8. A statement describing a phenomenon or which specifies a relationship between two or more phenomena. (Guy's (1987: 116))

9. A tentative statement that proposes a possible explanation for some phenomenon or event. A useful hypothesis is a testable statement which may include a prediction. A hypothesis should not be confused with a theory. Theories are general explanations based on a large amount of data. For example, the theory of evolution applies to all living things and is based on a wide range of observations. However, there are many things about evolution that are not fully understood, such as gaps in the fossil record. Many hypotheses have been proposed and tested.

10. A proposition formulated to be tested empirically; it is tentative or conjectural.

11. A theoretical answer to the formulation of a research problem, not yet an empirical answer. Hypotheses are used in quantitative research, whereas qualitative research does not start from a formulated hypothesis but rather aims to generate one.


In early stages of research, we usually lack sufficient understanding of the problem to formulate a specific hypothesis.

In statistics, a hypothesis is tested using sample data to draw a conclusion, which is why such research works with confidence intervals, significance levels, confidence levels, margins of error, and so on. This is because decisions about the population are made from sample data, which provide an estimate of the population rather than a direct picture of its real condition.

Functions of a hypothesis:
1. Guiding the direction of the research
2. Defining what will and will not be examined
3. Indicating the most appropriate research design
4. Providing a framework for organizing the conclusions that will result
5. Besides serving to test the truth of a theory, a hypothesis can also be used to generate new ideas for developing a theory and to extend the researcher's knowledge of the phenomenon being studied.


Source:
-. Marketing Research, Methodological Foundations, 5th edition, The Dryden Press International Edition, author Gilbert A. Churchill, Jr.
-. http://en.wikipedia.org/wiki/Hypothesis
-. http://www.merriam-webster.com/dictionary/hypothesis
-. http://malayresearchfoundation.blogspot.com/2008/08/hypothesis.html
-. http://www.accessexcellence.org/LC/TL/filson/writhypo.php

Wednesday, August 26, 2009

Degree of freedom

Meaning of degree of freedom (df):

1. In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.[1]

2. Mathematically, degrees of freedom is the dimension of the domain of a random vector, or essentially the number of 'free' components: how many components need to be known before the vector is fully determined.

3. The number of degrees of freedom in a problem, distribution, etc., is the number of parameters which may be independently varied.

4. The concept of degrees of freedom is central to the principle of estimating statistics of populations from samples of them. "Degrees of freedom" is commonly abbreviated to df. In short, think of df as a mathematical restriction that we need to put in place when we calculate an estimate of one statistic from an estimate of another.

5. In statistics, the number of degrees of freedom (d.o.f.) is the number of independent pieces of data being used to make a calculation. It is usually denoted with the Greek letter nu, ν. The number of degrees of freedom is a measure of how certain we are that our sample is representative of the entire population - the more degrees of freedom, usually the more certain we can be that we have accurately sampled the entire population. For statistics in analytical chemistry, this is usually the number of observations or measurements N made in a certain experiment.

6. For a set of data points in a given situation (e.g. with the mean or another parameter specified, or not), degrees of freedom is the minimal number of values which should be specified to determine all the data points.

7. In statistics, the term degrees of freedom (df) is a measure of the number of independent pieces of information on which the precision of a parameter estimate is based.

Estimates of statistical parameters can be based upon different amounts of information or data. The number of independent pieces of information that go into the estimate of a parameter is called the degrees of freedom (df). In general, the degrees of freedom of an estimate is equal to the number of independent scores that go into the estimate minus the number of parameters estimated as intermediate steps in the estimation of the parameter itself.

The df can be viewed as the number of independent parameters available to fit a model to data. Generally, the more parameters you have, the more accurate your fit will be. However, for each estimate made in a calculation, you remove one degree of freedom. This is because each assumption or approximation you make puts one more restriction on how many parameters are used to generate the model. Put another way, for each estimate you make, your model becomes less accurate.

Another way of thinking about the restriction principle behind degrees of freedom is to imagine contingencies. For example, imagine you have four numbers (a, b, c and d) that must add up to a total of m; you are free to choose the first three numbers at random, but the fourth must be chosen so that it makes the total equal to m - thus your degrees of freedom are three. Essentially, degrees of freedom are a count of the number of pieces of independent information contained within a particular analysis.

The maximum number of quantities or directions whose values are free to vary before the remainder of the quantities are determined, or an estimate of the number of independent categories in a particular statistical test or experiment. Degrees of freedom (df) for a sample are defined as df = n - 1, where n is the number of scores in the sample.

The degrees of freedom for an estimate equal the number of observations (values) minus the number of additional parameters estimated for that calculation. As we have to estimate more parameters, the degrees of freedom available decrease. Degrees of freedom can also be thought of as the number of observations (values) which remain free to vary given the additional parameters estimated. They can be thought of in two ways: in terms of sample size and in terms of dimensions and parameters.

Degrees of freedom are often used to characterize various distributions. See, for example, the chi-square distribution, the t-distribution, and the F distribution.

In this case, the df is n - 1, because an estimate was made that the sample mean is a good estimate of the population mean, so we have one less df than the number of independent observations.
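
A small Python sketch (my own illustration, not from the sources) of the same idea: once the sample mean has been used, only n - 1 of the deviations are free to vary, which is why the sample variance divides by n - 1 (ddof=1 in NumPy); the scores are hypothetical:

import numpy as np

x = np.array([4.0, 7.0, 6.0, 9.0, 4.0])    # hypothetical scores, n = 5
n = len(x)
deviations = x - x.mean()

print(round(deviations.sum(), 10))          # always 0: the last deviation is not free
print(x.var(ddof=0))                        # divides by n (population formula)
print(x.var(ddof=1))                        # divides by n - 1 = df (sample estimate)
print((deviations ** 2).sum() / (n - 1))    # same value as var(ddof=1)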

In many statistical calculations you will do, such as linear regression, outliers, and t-tests, you will need to know or calculate the number of degrees of freedom. Degrees of freedom for each test will be explained in the section for which it is required.


Source:
http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)
http://mathworld.wolfram.com/DegreeofFreedom.html
http://www.statsdirect.com/help/basics/degrees_of_freedom.htm
http://www.chem.utoronto.ca/coursenotes/analsci/StatsTutorial/DegFree.html
http://www.statistics.com/resources/glossary/d/degsfree.php
http://www.statemaster.com/encyclopedia/Degrees-of-freedom-(statistics)

Thursday, August 20, 2009

Difference Quantitative and Qualitative Method - 3


3. Research characteristics



Main Points

a. Qualitative research involves analysis of data such as words (e.g., from interviews), pictures (e.g., video), or objects (e.g., an artifact).
b. Quantitative research involves analysis of numerical data.
c. The strengths and weaknesses of qualitative and quantitative research are a perennial, hot debate, especially in the social sciences. The issues invoke a classic 'paradigm war'.
d. The personality / thinking style of the researcher and/or the culture of the organization is under-recognized as a key factor in preferred choice of methods.
e. Overly focusing on the debate of "qualitative versus quantitative" frames the methods in opposition. It is important to focus also on how the techniques can be integrated, such as in mixed methods research. More good can come of social science researchers developing skills in both realms than debating which method is superior.

Difference Quantitative and Qualitative Method - 2


2. Research process

a. Quantitative method

The process in the quantitative method is linear, with clear steps: problem formulation, theory, hypothesis, data collection, data analysis, and conclusions. Some methods within the quantitative approach are: the survey method, ex post facto, experiment, evaluation, action research, etc.

Flow in the quantitative method:
Source of problem (empirical or theoretical) -> problem formulation -> relevant theoretical concepts or relevant findings -> proposing a hypothesis (a presumption about the relationship between variables) -> compiling research instruments and the method/strategy of the research -> collecting and analyzing data -> conclusions

b. Qualitative method
The process is divided into several steps:
1. Description step
In this step, the researcher describes what he sees, hears, and feels. The researcher still knows little about the information obtained from the data. The data are plentiful, varied, and not yet clearly organized.
2. Reduction step
In this step, the researcher reduces the information gathered in the previous step and focuses only on certain problems. The researcher selects the data that are interesting, important, useful, and new. Only the selected data will be used.
3. Selection step
In this step, the researcher analyzes the data more deeply than in the previous step to obtain new information, so that themes can be found and the data can be constructed into a hypothesis.

The end result of the qualitative method is not only information that cannot be obtained with the quantitative method, but also new and meaningful information, as well as new hypotheses for solving the problem.

Differences in the research process between the quantitative and qualitative methods:
-. The quantitative method is deductive: it starts from a theoretical framework (formal theory, middle-range theory, substantive theory), formulates hypotheses, tests them, and arrives at empirical social reality.
-. The qualitative method is inductive: it starts from observation of empirical social reality, develops substantive theory, then middle-range theory and formal theory, and the result becomes a theoretical framework.

*. Formal theory is developed for a broad conceptual area of general theory
*. Substantive theory is developed for a specific area of social concern
*. Middle-range theory can be formal or substantive; middle-range theories are slightly more abstract than empirical generalizations or specific hypotheses.

Tuesday, August 18, 2009

Difference Quantitative and Qualitative Method - 1

The differences between the methods cover three things: differences in axioms, in the research process, and in research characteristics.

1. Difference in axioms
An axiom is a basic view or assumption. The axioms of the quantitative and qualitative approaches cover assumptions about reality, the relationship between the researcher and the research object, variables, the possibility of generalization, and the role of values.


Source:
http://uk.geocities.com/balihar_sanghera/ipsrmehrigiulqualitativequantitativeresearch.html
http://wilderdom.com/research/QualitativeVersusQuantitativeResearch.html

Sunday, August 9, 2009

For Today’s Graduate, Just One Word: Statistics

from: http://www.nytimes.com/2009/08/06/technology/06stats.html?_r=2&scp=1&sq=today%27s%20graduate&st=cse


by STEVE LOHR
Published: August 5, 2009

MOUNTAIN VIEW, Calif. — At Harvard, Carrie Grimes majored in anthropology and archaeology and ventured to places like Honduras, where she studied Mayan settlement patterns by mapping where artifacts were found. But she was drawn to what she calls “all the computer and math stuff” that was part of the job.

“People think of field archaeology as Indiana Jones, but much of what you really do is data analysis,” she said.

Now Ms. Grimes does a different kind of digging. She works at Google, where she uses statistical analysis of mounds of data to come up with ways to improve its search engine.

Ms. Grimes is an Internet-age statistician, one of many who are changing the image of the profession as a place for dronish number nerds. They are finding themselves increasingly in demand — and even cool.

“I keep saying that the sexy job in the next 10 years will be statisticians,” said Hal Varian, chief economist at Google. “And I’m not kidding.”

The rising stature of statisticians, who can earn $125,000 at top companies in their first year after getting a doctorate, is a byproduct of the recent explosion of digital data. In field after field, computing and the Web are creating new realms of data to explore — sensor signals, surveillance tapes, social network chatter, public records and more. And the digital data surge only promises to accelerate, rising fivefold by 2012, according to a projection by IDC, a research firm.

Yet data is merely the raw material of knowledge. “We’re rapidly entering a world where everything can be monitored and measured,” said Erik Brynjolfsson, an economist and director of the Massachusetts Institute of Technology’s Center for Digital Business. “But the big problem is going to be the ability of humans to use, analyze and make sense of the data.”

The new breed of statisticians tackle that problem. They use powerful computers and sophisticated mathematical models to hunt for meaningful patterns and insights in vast troves of data. The applications are as diverse as improving Internet search and online advertising, culling gene sequencing information for cancer research and analyzing sensor and location data to optimize the handling of food shipments.

Even the recently ended Netflix contest, which offered $1 million to anyone who could significantly improve the company’s movie recommendation system, was a battle waged with the weapons of modern statistics.

Though at the fore, statisticians are only a small part of an army of experts using modern statistical techniques for data analysis. Computing and numerical skills, experts say, matter far more than degrees. So the new data sleuths come from backgrounds like economics, computer science and mathematics.

They are certainly welcomed in the White House these days. “Robust, unbiased data are the first step toward addressing our long-term economic needs and key policy priorities,” Peter R. Orszag, director of the Office of Management and Budget, declared in a speech in May. Later that day, Mr. Orszag confessed in a blog entry that his talk on the importance of statistics was a subject “near to my (admittedly wonkish) heart.”

I.B.M., seeing an opportunity in data-hunting services, created a Business Analytics and Optimization Services group in April. The unit will tap the expertise of the more than 200 mathematicians, statisticians and other data analysts in its research labs — but that number is not enough. I.B.M. plans to retrain or hire 4,000 more analysts across the company.

In another sign of the growing interest in the field, an estimated 6,400 people are attending the statistics profession’s annual conference in Washington this week, up from around 5,400 in recent years, according to the American Statistical Association. The attendees, men and women, young and graying, looked much like any other crowd of tourists in the nation’s capital. But their rapt exchanges were filled with talk of randomization, parameters, regressions and data clusters. The data surge is elevating a profession that traditionally tackled less visible and less lucrative work, like figuring out life expectancy rates for insurance companies.

Ms. Grimes, 32, got her doctorate in statistics from Stanford in 2003 and joined Google later that year. She is now one of many statisticians in a group of 250 data analysts. She uses statistical modeling to help improve the company’s search technology.

For example, Ms. Grimes worked on an algorithm to fine-tune Google’s crawler software, which roams the Web to constantly update its search index. The model increased the chances that the crawler would scan frequently updated Web pages and make fewer trips to more static ones.

The goal, Ms. Grimes explained, is to make tiny gains in the efficiency of computer and network use. “Even an improvement of a percent or two can be huge, when you do things over the millions and billions of times we do things at Google,” she said.

It is the size of the data sets on the Web that opens new worlds of discovery. Traditionally, social sciences tracked people’s behavior by interviewing or surveying them. “But the Web provides this amazing resource for observing how millions of people interact,” said Jon Kleinberg, a computer scientist and social networking researcher at Cornell.

For example, in research just published, Mr. Kleinberg and two colleagues followed the flow of ideas across cyberspace. They tracked 1.6 million news sites and blogs during the 2008 presidential campaign, using algorithms that scanned for phrases associated with news topics like “lipstick on a pig.”

The Cornell researchers found that, generally, the traditional media leads and the blogs follow, typically by 2.5 hours. But a handful of blogs were quickest to quotes that later gained wide attention.

The rich lode of Web data, experts warn, has its perils. Its sheer volume can easily overwhelm statistical models. Statisticians also caution that strong correlations of data do not necessarily prove a cause-and-effect link.

For example, in the late 1940s, before there was a polio vaccine, public health experts in America noted that polio cases increased in step with the consumption of ice cream and soft drinks, according to David Alan Grier, a historian and statistician at George Washington University. Eliminating such treats was even recommended as part of an anti-polio diet. It turned out that polio outbreaks were most common in the hot months of summer, when people naturally ate more ice cream, showing only an association, Mr. Grier said.

If the data explosion magnifies longstanding issues in statistics, it also opens up new frontiers.

“The key is to let computers do what they are good at, which is trawling these massive data sets for something that is mathematically odd,” said Daniel Gruhl, an I.B.M. researcher whose recent work includes mining medical data to improve treatment. “And that makes it easier for humans to do what they are good at — explain those anomalies.”

Andrea Fuller contributed reporting.

Quantitative and Qualitative Method

Many labels have been used to distinguish between traditional research methods and these new methods: positivistic versus post-positivistic research; scientific versus artistic research; confirmatory versus discovery-oriented research; quantitative versus interpretive research; quantitative versus qualitative research. The quantitative-qualitative distinction seems most widely used. Quantitative researchers and qualitative researchers go about inquiry in different ways (Borg and Gall, 1989).

Other names for quantitative methods are traditional, positivistic, scientific, and discovery methods; qualitative methods are also called new, post-positivistic, artistic, and interpretive research methods.

Quantitative research is the systematic scientific investigation of quantitative properties and phenomena and their relationships. The objective of quantitative research is to develop and employ mathematical models, theories and/or hypotheses pertaining to natural phenomena. The process of measurement is central to quantitative research because it provides the fundamental connection between empirical observation and mathematical expression of quantitative relationships.

Quantitative research is widely used in both the natural sciences and social sciences, from physics and biology to sociology and journalism. It is also used as a way to research different aspects of education. The term quantitative research is most often used in the social sciences in contrast to qualitative research.

A quantitative attribute is one that exists in a range of magnitudes, and can therefore be measured. Measurements of any particular quantitative property are expressed as a specific quantity, referred to as a unit, multiplied by a number. Examples of physical quantities are distance, mass, and time. Many attributes in the social sciences, including abilities and personality traits, are also studied as quantitative properties and principles.

Quantitative research is research involving the use of structured questions where the response options have been predetermined and a large number of respondents is involved.

By definition, measurement must be objective, quantitative and statistically valid. Simply put, it’s about numbers, objective hard data.

The sample size for a survey is calculated by statisticians using formulas to determine how large a sample will be needed from a given population in order to achieve findings with an acceptable degree of accuracy. Generally, researchers seek sample sizes which yield findings with at least a 95% confidence level (which means that if you repeated the survey 100 times, 95 times out of a hundred you would get essentially the same result) and a plus/minus 5 percentage point margin of error. Many surveys are designed to produce a smaller margin of error.
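
As a rough illustration (not from the source), the commonly used sample-size formula for estimating a proportion, n = z^2 * p(1 - p) / e^2, reproduces the familiar figure of roughly 385 respondents for 95 percent confidence and a plus/minus 5 point margin of error. The Python sketch below assumes SciPy is available and uses the worst-case p = 0.5:

import math
from scipy.stats import norm

confidence = 0.95
margin_of_error = 0.05            # plus/minus 5 percentage points
p = 0.5                           # most conservative assumption about the proportion

z = norm.ppf(1 - (1 - confidence) / 2)              # about 1.96 for 95 percent confidence
n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
print(math.ceil(n))                                 # about 385 respondents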

Quantitative methods are based on the philosophy of positivism; they are used to do research on populations and samples, collect data by random sampling, and analyze data to test particular hypotheses.

The positivist philosophy views a symptom/reality/phenomenon as something that is classifiable, relatively constant, observable, measurable, and causally related.

Quantitative research is, in general, conducted on a representative sample or population. The quantitative process is deductive: to answer the problem formulation, theories or concepts are used until a hypothesis can be formulated. The hypothesis is then tested through field data collection and analyzed quantitatively with descriptive or inferential statistics, so that a conclusion can be drawn about whether the formulated hypothesis is supported or not. Because quantitative studies use random samples, the results can be generalized to the population from which the sample was taken.

The qualitative method is also known as the naturalistic research method, because the research is conducted in natural conditions, and as the ethnographic method. The data collected and their analysis are qualitative in character.

Qualitative research is a field of inquiry applicable to many disciplines and subject matters. Qualitative researchers aim to gather an in-depth understanding of human behavior and the reasons that govern such behavior. The qualitative method investigates the why and how of decision making, not just the what, where, and when. Hence, smaller but focused samples are more often needed, rather than large random samples.

Qualitative research is collecting, analyzing, and interpreting data by observing what people do and say. Whereas quantitative research refers to counts and measures of things, qualitative research refers to the meanings, concepts, definitions, characteristics, metaphors, symbols, and descriptions of things.

Qualitative research is much more subjective than quantitative research and uses very different methods of collecting information, mainly individual, in-depth interviews and focus groups. The nature of this type of research is exploratory and open-ended. Small numbers of people are interviewed in-depth and/or a relatively small number of focus groups are conducted.

Participants are asked to respond to general questions, and the interviewer or group moderator probes and explores their responses to identify and define people’s perceptions, opinions, and feelings about the topic or idea being discussed and to determine the degree of agreement that exists in the group. The quality of the findings from qualitative research is directly dependent upon the skills, experience, and sensitivity of the interviewer or group moderator.

This type of research is often less costly than surveys and is extremely effective in acquiring information about people’s communications needs and their responses to and views about specific communications.

The qualitative research method is based on the philosophy of post-positivism and is used to study natural settings (as opposed to experiments). In naturalistic conditions the researcher is the key instrument; samples are taken with purposive and snowball techniques; data are gathered with triangulation (combined) methods; data analysis is inductive or qualitative; and the results of qualitative research emphasize meaning rather than generalization.

The post-positivist philosophy is also conceived of as the interpretive and constructivist paradigm, which views social reality as something whole, complex, dynamic, and reciprocal in character. Research is done on natural objects, that is, objects that are not manipulated by the researcher and whose dynamics are not much influenced by the researcher's presence.

In the qualitative method, the instrument is the researcher himself or herself. The researcher therefore has to have deep and broad knowledge of the problem being studied. The data-collection technique is triangulation, that is, combining various data-collection techniques simultaneously. Data analysis is inductive: a theory or hypothesis is constructed from the facts that are found. The qualitative method is used to obtain in-depth data, that is, the meaning of the data. The qualitative method does not emphasize generalization, but meaning.

Source:
-. http://en.wikipedia.org/
-. http://uk.geocities.com/balihar_sanghera/ipsrmehrigiulqualitativequantitativeresearch.html

Sunday, July 26, 2009

Reliability & Validity

We often think of reliability and validity as separate ideas but, in fact, they're related to each other. Here, I want to show you two ways you can think about their relationship.

One of my favorite metaphors for the relationship between reliability and validity is that of the target. Think of the center of the target as the concept that you are trying to measure. Imagine that for each person you are measuring, you are taking a shot at the target. If you measure the concept perfectly for a person, you are hitting the center of the target. If you don't, you are missing the center. The more you are off for that person, the further you are from the center.

Therefore, a scale can be reliable without being valid (i.e., it measures something accurately and consistently, but not what it was intended to measure), but it cannot be valid without being reliable (i.e., if the measures themselves are so inconsistent, they could not be accurate).

There are four possible situations. In the first one, you are hitting the target consistently, but you are missing the center of the target. That is, you are consistently and systematically measuring the wrong value for all respondents. This measure is reliable, but not valid (that is, it's consistent but wrong). The second shows hits that are randomly spread across the target. You seldom hit the center of the target but, on average, you are getting the right answer for the group (but not very well for individuals). In this case, you get a valid group estimate, but you are inconsistent. Here, you can clearly see that reliability is directly related to the variability of your measure. The third scenario shows a case where your hits are spread across the target and you are consistently missing the center. Your measure in this case is neither reliable nor valid. Finally, we see the "Robin Hood" scenario -- you consistently hit the center of the target. Your measure is both reliable and valid (I bet you never thought of Robin Hood in those terms before).

Another way we can think about the relationship between reliability and validity is with a 2x2 table. The columns of the table indicate whether you are trying to measure the same or different concepts. The rows show whether you are using the same or different methods of measurement. Imagine that we have two concepts we would like to measure, student verbal and math ability. Furthermore, imagine that we can measure each of these in two ways. First, we can use a written, paper-and-pencil exam (very much like the SAT or GRE exams). Second, we can ask the student's classroom teacher to give us a rating of the student's ability based on their own classroom observation. The four cells are:

-. Same concept, same method (upper left cell): reliability
-. Same concept, different methods (lower left cell): convergent validity
-. Different concepts, same method (upper right cell): discriminant validity
-. Different concepts, different methods (lower right cell): very discriminant validity

The first cell on the upper left shows the comparison of the verbal written test score with the verbal written test score. But how can we compare the same measure with itself? We could do this by estimating the reliability of the written test through a test-retest correlation, parallel forms, or an internal consistency measure. What we are estimating in this cell is the reliability of the measure.

The cell on the lower left shows a comparison of the verbal written measure with the verbal teacher observation rating. Because we are trying to measure the same concept, we are looking at convergent validity.

The cell on the upper right shows the comparison of the verbal written exam with the math written exam. Here, we are comparing two different concepts (verbal versus math) and so we would expect the relationship to be lower than a comparison of the same concept with itself (e.g., verbal versus verbal or math versus math). Thus, we are trying to discriminate between two concepts and we would consider this discriminant validity.

Finally, we have the cell on the lower right. Here, we are comparing the verbal written exam with the math teacher observation rating. Like the cell on the upper right, we are also trying to compare two different concepts (verbal versus math) and so this is a discriminant validity estimate. But here, we are also trying to compare two different methods of measurement (written exam versus teacher observation rating). So, we'll call this very discriminant to indicate that we would expect the relationship in this cell to be even lower than in the one above it.

The four cells incorporate the different values that we examine in the multitrait-multimethod approach to estimating construct validity.

When we look at reliability and validity in this way, we see that, rather than being distinct, they actually form a continuum. On one end is the situation where the concepts and methods of measurement are the same (reliability) and on the other is the situation where concepts and methods of measurement are different (very discriminant validity).


-. http://www.socialresearchmethods.net
-. Managerial Applications of Multivariate Analysis in Marketing, James H. Myers and Gary M. Mullet, 2003, American Marketing Association, Chicago

Saturday, July 4, 2009

Comparison of Reliability Estimators

Each of the reliability estimators has certain advantages and disadvantages. Inter-rater reliability is one of the best ways to estimate reliability when your measure is an observation. However, it requires multiple raters or observers. As an alternative, you could look at the correlation of ratings of the same single observer repeated on two different occasions. For example, let's say you collected videotapes of child-mother interactions and had a rater code the videos for how often the mother smiled at the child. To establish inter-rater reliability you could take a sample of videos and have two raters code them independently. To estimate test-retest reliability you could have a single rater code the same videos on two different occasions. You might use the inter-rater approach especially if you were interested in using a team of raters and you wanted to establish that they yielded consistent results. If you get a suitably high inter-rater reliability you could then justify allowing them to work independently on coding different videos. You might use the test-retest approach when you only have a single rater and don't want to train any others. On the other hand, in some studies it is reasonable to do both to help establish the reliability of the raters or observers.
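
As a hedged sketch of these two estimators (the smile counts below are invented, and a simple Pearson correlation is only one of several ways to quantify agreement), inter-rater reliability can be estimated as the correlation between two raters coding the same videos, and test-retest reliability as the correlation of one rater's codes across two occasions:

import numpy as np

rater_a        = np.array([5, 3, 8, 6, 2, 7, 4, 9])  # rater A, occasion 1 (hypothetical)
rater_b        = np.array([6, 3, 7, 6, 3, 8, 4, 9])  # rater B, same videos
rater_a_retest = np.array([5, 4, 8, 5, 2, 7, 5, 9])  # rater A, occasion 2

inter_rater = np.corrcoef(rater_a, rater_b)[0, 1]         # two raters, one occasion
test_retest = np.corrcoef(rater_a, rater_a_retest)[0, 1]  # one rater, two occasions

print("inter-rater reliability (Pearson r):", round(inter_rater, 2))
print("test-retest reliability (Pearson r):", round(test_retest, 2))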

The parallel forms estimator is typically only used in situations where you intend to use the two forms as alternate measures of the same thing. Both the parallel forms and all of the internal consistency estimators have one major constraint -- you have to have multiple items designed to measure the same construct. This is relatively easy to achieve in certain contexts like achievement testing (it's easy, for instance, to construct lots of similar addition problems for a math test), but for more complex or subjective constructs this can be a real challenge. If you do have lots of items, Cronbach's Alpha tends to be the most frequently used estimate of internal consistency.

The test-retest estimator is especially feasible in most experimental and quasi-experimental designs that use a no-treatment control group. In these designs you always have a control group that is measured on two occasions (pretest and posttest). The main problem with this approach is that you don't have any information about reliability until you collect the posttest and, if the reliability estimate is low, you're pretty much sunk.

Each of the reliability estimators will give a different value for reliability. In general, the test-retest and inter-rater reliability estimates will be lower in value than the parallel forms and internal consistency ones because they involve measuring at different times or with different raters. Since reliability estimates are often used in statistical analyses of quasi-experimental designs (e.g., the analysis of the nonequivalent group design), the fact that different estimates can differ considerably makes the analysis even more complex.


-. www.socialresearchmethods.net

Monday, June 22, 2009

Internal Consistency Reliability

What is Internal Consistency Reliability ?

1. A procedure for studying reliability when the focus of the investigation is on the consistency of scores on the same occasion and on similar content, but when conducting repeated testing or alternate forms testing is not possible. The procedure uses information about how consistent the examinees' scores are from one item (or one part of the test) to the next to estimate the consistency of examinees' scores on the entire test.

2. The internal consistency reliability of survey instruments (e.g., psychological tests) is a measure of the reliability of the different survey items intended to measure the same characteristic.

3. Internal consistency reliability evaluates individual questions in comparison with one another for their ability to give consistently appropriate results.

4. In internal consistency reliability estimation we use our single measurement instrument administered to a group of people on one occasion to estimate reliability. In effect we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results. We are looking at how consistent the results are for different items for the same construct within the measure.

Example: there are 5 different questions (items) related to anxiety level. Each question implies a response with 5 possible values on a Likert scale, e.g., scores -2, -1, 0, 1, 2. Responses from a group of respondents have been obtained. In reality, answers to different questions vary for each particular respondent, although the items are intended to measure the same aspect or quantity. The smaller this variability (or the stronger the correlation), the greater the internal consistency reliability of this survey instrument.

There are a wide variety of internal consistency measures that can be used.

1. Average Inter-item Correlation
Average inter-item correlation compares correlations between all pairs of questions that test the same construct by calculating the mean of all paired correlations. The average inter-item correlation uses all of the items on our instrument that are designed to measure the same construct. We first compute the correlation between each pair of items, as illustrated in the figure. For example, if we have six items we will have 15 different item pairings (i.e., 15 correlations). The average inter-item correlation is simply the average or mean of all these correlations. In the example, we find an average inter-item correlation of .90 with the individual correlations ranging from .84 to .95.
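
To make the arithmetic concrete, here is a minimal Python sketch of the average inter-item correlation, using numpy and an invented 8-respondent by 6-item score matrix (the data and the resulting numbers are purely illustrative, not the ones from the figure):

```python
import numpy as np

# Hypothetical scores: 8 respondents x 6 items intended to measure the same construct.
scores = np.array([
    [4, 5, 4, 5, 4, 5],
    [2, 2, 3, 2, 2, 3],
    [5, 5, 5, 4, 5, 5],
    [3, 3, 2, 3, 3, 2],
    [1, 2, 1, 1, 2, 1],
    [4, 4, 5, 4, 4, 4],
    [2, 3, 2, 2, 3, 2],
    [5, 4, 5, 5, 4, 5],
])

# Correlation matrix of the six items (columns treated as variables).
r = np.corrcoef(scores, rowvar=False)

# Upper triangle above the diagonal: with six items there are 15 unique pairs.
pair_correlations = r[np.triu_indices_from(r, k=1)]

print("Number of item pairs:", pair_correlations.size)  # 15
print("Average inter-item correlation:", round(pair_correlations.mean(), 2))
```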

[Figure: correlation matrix of the six items, showing the 15 inter-item correlations]

2. Average Item-total Correlation
Average item-total correlation also starts from the inter-item correlations, but in addition we compute a total score for the six items and use that as a seventh variable in the analysis. Each item is then correlated with the total score, and these item-to-total correlations are averaged. The figure shows the six item-to-total correlations at the bottom of the correlation matrix. They range from .82 to .88 in this sample analysis, with the average of these at .85.
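
Continuing with the same hypothetical six-item data, a sketch of the item-to-total calculation might look like this (again, the values are invented for illustration):

```python
import numpy as np

scores = np.array([          # the same hypothetical 8 x 6 item matrix as above
    [4, 5, 4, 5, 4, 5], [2, 2, 3, 2, 2, 3],
    [5, 5, 5, 4, 5, 5], [3, 3, 2, 3, 3, 2],
    [1, 2, 1, 1, 2, 1], [4, 4, 5, 4, 4, 4],
    [2, 3, 2, 2, 3, 2], [5, 4, 5, 5, 4, 5],
])

# The "seventh variable": the total score across the six items for each respondent.
total = scores.sum(axis=1)

# Correlate each item with the total score, then average those correlations.
item_total_correlations = [np.corrcoef(scores[:, i], total)[0, 1]
                           for i in range(scores.shape[1])]

print("Item-to-total correlations:", np.round(item_total_correlations, 2))
print("Average item-total correlation:", round(float(np.mean(item_total_correlations)), 2))
```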

[Figure: correlation matrix with the six item-to-total correlations shown along the bottom row]

3. Split-half Correlation
Split-half correlation divides the items that measure the same construct into two tests, which are applied to the same group of people, and then calculates the correlation between the two total scores. In split-half reliability we randomly divide all items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate, as shown in the figure, is simply the correlation between these two total scores. In the example it is .87.

It is often not feasible to obtain two or more measures of the same item by the same person at different points in time. This involves dividing a single survey measuring instrument into two parts and then correlating responses (scores) from one half with responses from the other half. If all items are supposed to measure the same basic idea, the resulting correlation should be high.
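
A minimal sketch of the split-half idea with the same hypothetical six-item data: randomly split the items into two sets of three, sum each half, and correlate the two totals. (In practice the split-half correlation is often stepped up with the Spearman-Brown formula to estimate the reliability of the full-length test; that adjustment is not shown here.)

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the random split is reproducible

scores = np.array([              # the same hypothetical 8 x 6 item matrix
    [4, 5, 4, 5, 4, 5], [2, 2, 3, 2, 2, 3],
    [5, 5, 5, 4, 5, 5], [3, 3, 2, 3, 3, 2],
    [1, 2, 1, 1, 2, 1], [4, 4, 5, 4, 4, 4],
    [2, 3, 2, 2, 3, 2], [5, 4, 5, 5, 4, 5],
])

# Randomly divide the six items into two halves of three items each.
shuffled_items = rng.permutation(scores.shape[1])
half_a, half_b = shuffled_items[:3], shuffled_items[3:]

# Total score on each half for every respondent.
total_a = scores[:, half_a].sum(axis=1)
total_b = scores[:, half_b].sum(axis=1)

split_half_r = np.corrcoef(total_a, total_b)[0, 1]
print("Split-half correlation:", round(split_half_r, 2))
```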

[Figure: the two randomly split half-scores and the correlation between them]

4. Cronbach's Alpha
Cronbach's alpha calculates an equivalent to the average of all possible split-half correlations. Imagine that we compute one split-half reliability, then randomly divide the items into another set of split halves and recompute, and keep doing this until we have computed all possible split-half estimates of reliability. Cronbach's Alpha is mathematically equivalent to the average of all possible split-half estimates, although that's not how we compute it. Notice that when I say we compute all possible split-half estimates, I don't mean that each time we go and measure a new sample! That would take forever. Instead, we calculate all split-half estimates from the same sample. Because we measured all of our sample on each of the six items, all we have to do is have the computer analysis do the random subsets of items and compute the resulting correlations. The figure shows several of the split-half estimates for our six-item example and lists them as SH with a subscript. Just keep in mind that although Cronbach's Alpha is equivalent to the average of all possible split-half correlations we would never actually calculate it that way. Some clever mathematician (Cronbach, I presume!) figured out a way to get the mathematical equivalent a lot more quickly.

Coefficient alpha provides a summary measure of the inter-correlations among a set of items in any scale used in marketing research (Churchill 1995; Nunnally 1978). Churchill (1995, p. 498; emphasis in original) observes that “Coefficient alpha routinely should be calculated to assess the quality of measure.” Coefficient alpha is generally considered the best estimate of the true reliability of any multiple-item scale that is intended to measure some basic idea or construct useful to market researchers or planners.
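
The quicker route alluded to above is the usual computational formula, alpha = (k / (k - 1)) * (1 - (sum of the item variances) / (variance of the total score)), where k is the number of items. Here is a minimal sketch with the same hypothetical data (values invented for illustration):

```python
import numpy as np

def cronbach_alpha(scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)        # sample variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)    # sample variance of the total score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

scores = np.array([              # the same hypothetical 8 x 6 item matrix
    [4, 5, 4, 5, 4, 5], [2, 2, 3, 2, 2, 3],
    [5, 5, 5, 4, 5, 5], [3, 3, 2, 3, 3, 2],
    [1, 2, 1, 1, 2, 1], [4, 4, 5, 4, 4, 4],
    [2, 3, 2, 2, 3, 2], [5, 4, 5, 5, 4, 5],
])

print("Cronbach's alpha:", round(cronbach_alpha(scores), 2))
```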

[Figure: several of the possible split-half estimates (SH with subscripts) for the six-item example]

Source:
-. Managerial Applications of Multivariate Analysis in Marketing, James H. Myers and Gary M. Mullet, 2003, American Marketing Association, Chicago
-. www.changingminds.org
-. www.statistics.com
-. www.socialresearchmethods.net

Friday, June 19, 2009

Parallel-Forms Reliability


One problem with questions or assessments is knowing what questions are the best ones to ask. A way of discovering this is to do two tests in parallel, using different questions. Parallel-forms reliability evaluates different questions and question sets that seek to assess the same construct. Parallel-forms evaluation may be done in combination with other methods, such as split-half, which divides items that measure the same construct into two tests and applies them to the same group of people.

Parallel-forms reliability is gauged by comparing two different tests that were created using the same content. This is accomplished by creating a large pool of test items that measure the same quality and then randomly dividing the items into two separate tests. The two tests should then be administered to the same subjects at the same time.

In parallel forms reliability you first have to create two parallel forms. One way to accomplish this is to create a large set of questions that address the same construct and then randomly divide the questions into two sets. You administer both instruments to the same sample of people. The correlation between the two parallel forms is the estimate of reliability.
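
A minimal sketch of that last step, assuming the two forms have already been built and administered to the same sample (the form totals below are invented for illustration):

```python
import numpy as np

# Hypothetical total scores of the same eight respondents on the two parallel forms.
form_a_totals = np.array([42, 35, 48, 39, 30, 44, 36, 47])
form_b_totals = np.array([40, 36, 47, 40, 31, 45, 34, 48])

parallel_forms_r = np.corrcoef(form_a_totals, form_b_totals)[0, 1]
print("Parallel-forms reliability estimate:", round(parallel_forms_r, 2))
```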

For instance, we might be concerned about a testing threat to internal validity. If we use Form A for the pretest and Form B for the posttest, we minimize that problem. It would be even better if we randomly assigned individuals to receive Form A or B on the pretest and then switched them on the posttest. With split-half reliability we have an instrument that we wish to use as a single measurement instrument and only develop randomly split halves for purposes of estimating reliability.


Source:
-. www.about.com
-. www.changingminds.org
-. www.socialresearchmethods.net

Monday, June 15, 2009

Test-Retest Reliability

What is Test-Retest Reliability ?

1. Test-retest is a statistical method used to examine how reliable a test is: a test is performed twice, e.g., the same test is given to a group of subjects at two different times. Each subject may score differently from the other subjects, but if the test is reliable then each subject should score the same on both tests. (Valentin Rousson, Theo Gasser, and Burkhardt Seifert (2002), "Assessing intrarater, interrater and test–retest reliability of continuous measurements," Statistics in Medicine 21:3431-3446)

2. A measure of the ability of a psychologic testing instrument to yield the same result for a single subject at 2 different test periods, which are closely spaced so that any variation detected reflects the reliability of the instrument rather than changes in status.

3. The test-retest reliability of a survey instrument, like a psychological test, is estimated by performing the same survey with the same respondents at different moments of time. The closer the results, the greater the test-retest reliability of the survey instrument. The correlation coefficient between such two sets of responses is often used as a quantitative measure of the test-retest reliability. (www.statistics.com)

4. Because a scale is considered reliable if it consistently produces the same measurement for a given amount or type of a response, one obvious way to assess reliability is to take two or more measures at different points in time using the same respondents. This is known as test-retest reliability. These measures must be taken using exactly the same measuring instrument and under conditions that are as similar as possible. Reliability is usually measured in terms of the correlation coefficient between the first and second measures, or among all measures if more than two are taken. The higher the correlation, the more similar the measurements are and therefore the greater the test-retest reliability.



Example:

1. A group of respondents is tested for IQ scores: each respondent is tested twice - the two tests are, say, a month apart. Then, the correlation coefficient between the two sets of IQ scores is a reasonable measure of the test-retest reliability of this test. In the ideal case, both scores coincide for each respondent and, hence, the correlation coefficient is 1.0. In reality, a correlation coefficient of 1.0 is almost never observed - the scores produced by a respondent would vary if the test were carried out several times. Normally, correlation values of about 0.7 to 0.8 are considered satisfactory or good.

2. Various questions for a personality test are tried out with a class of students over several years. This helps the researcher determine those questions and combinations that have better reliability.

3. In the development of national school tests, a class of children is given several tests that are intended to assess the same abilities. A week and a month later, they are given the same tests. With allowances for learning, the variation in the test and retest results is used to assess which tests have better test-retest reliability.


Test-retest reliability is the most popular indicator of survey reliability. A shortcoming of test-retest reliability is the "practice effect": respondents "learn" to answer the same questions in the first test, and this affects their responses in the next test. For example, the IQ scores may tend to be higher in the next test.

Reliability can vary with the many factors that affect how a person responds to the test, including their mood, interruptions, time of day, etc. A good test will largely cope with such factors and give relatively little variation. An unreliable test is highly sensitive to such factors and will give widely varying results, even if the person re-takes the same test half an hour later.

This method is particularly used in experiments that use a no-treatment control group that is measured at pre-test and post-test.

We estimate test-retest reliability when we administer the same test to the same sample on two different occasions. This approach assumes that there is no substantial change in the construct being measured between the two occasions. The amount of time allowed between measures is critical. We know that if we measure the same thing twice, the correlation between the two observations will depend in part on how much time elapses between the two measurement occasions. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because the two observations are related over time -- the closer in time we get, the more similar the factors that contribute to error. Since this correlation is the test-retest estimate of reliability, you can obtain considerably different estimates depending on the interval.
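
As a simple illustration of the calculation behind the IQ example above, a minimal sketch (scores invented; as just noted, the interval between the two administrations will influence the estimate):

```python
import numpy as np

# Hypothetical IQ scores for the same eight respondents, tested a month apart.
test_scores   = np.array([101, 118,  95, 130, 108, 124,  99, 112])
retest_scores = np.array([104, 115,  98, 127, 110, 121, 101, 114])

test_retest_r = np.corrcoef(test_scores, retest_scores)[0, 1]
print("Test-retest reliability estimate:", round(test_retest_r, 2))
```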


Source:
-. Managerial Applications of Multivariate Analysis in Marketing, James H. Myers and Gary M. Mullet, 2003, American Marketing Association, Chicago
-. http://dx.doi.org/10.1002/sim.1253
-. www.statistics.com
-. www.socialresearchmethods.net
-. www.changingminds.org

Wednesday, June 10, 2009

Inter-Rater or Inter-Observer Reliability

Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.

So how do we determine whether two observers are being consistent in their observations? You probably should establish inter-rater reliability outside of the context of the measurement in your study. After all, if you use data from your study to establish reliability, and you find that reliability is low, you're kind of stuck. Probably it's best to do this as a side study or pilot study. And, if your study goes on for a long time, you may want to reestablish inter-rater reliability from time to time to assure that your raters aren't changing.

This type of reliability is assessed by having two or more independent judges score the test. The scores are then compared to determine the consistency of the raters' estimates. One way to test inter-rater reliability is to have each rater assign each test item a score.

Inter-rater or inter-observer reliability is an estimation method that is used when your measurement procedure is applied by people. We are all subject to distractions, tiredness and a whole host of other effects upon our consistency, and if you are using people as, say, observers, you will want to have some estimation of the reliability and consistency of the people doing the observing.

Two major ways in which inter-rater reliability is used are
(a) testing how similarly people categorize items, and
(b) how similarly people score items.

This is the best way of assessing reliability when you are using observation, as observer bias very easily creeps in. It does, however, assume you have multiple observers, which is not always the case.

Inter-rater reliability is also known as inter-observer reliability or inter-coder reliability.

There are two major ways to actually estimate inter-rater reliability.

• those where observers are checking off which category an observation falls into (categorization). If your measurement consists of categories -- the raters are checking off which category each observation falls in -- you can calculate the percent of agreement between the raters. For instance, let's say you had 100 observations that were being rated by two raters. For each observation, the rater could check one of three categories. Imagine that on 86 of the 100 observations the raters checked the same category. In this case, the percent of agreement would be 86%. OK, it's a crude measure, but it does give an idea of how much agreement exists, and it works no matter how many categories are used for each observation.

• those where observers are ranking their observations against a continuous scale, such as a Likert scale. The other major way to estimate inter-rater reliability is appropriate when the measure is a continuous one. There, all you need to do is calculate the correlation between the ratings of the two observers. For instance, they might be rating the overall level of activity in a classroom on a 1-to-7 scale. You could have them give their rating at regular time intervals (e.g., every 30 seconds). The correlation between these ratings would give you an estimate of the reliability or consistency between the raters.
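
A minimal sketch of both estimates, with invented ratings (a fuller analysis of the categorical case might use a chance-corrected statistic such as Cohen's kappa, which is not shown here):

```python
import numpy as np

# (a) Categorical case: two raters place each of ten observations into category 1, 2, or 3.
rater1 = np.array([1, 2, 2, 3, 1, 1, 2, 3, 3, 2])
rater2 = np.array([1, 2, 3, 3, 1, 1, 2, 3, 2, 2])
percent_agreement = (rater1 == rater2).mean() * 100
print(f"Percent agreement: {percent_agreement:.0f}%")        # 80% here

# (b) Continuous case: the two raters score classroom activity on a 1-to-7 scale
#     at regular intervals; the correlation between their ratings is the estimate.
ratings1 = np.array([3, 5, 4, 6, 2, 5, 4, 7, 3, 6])
ratings2 = np.array([4, 5, 4, 6, 2, 6, 3, 7, 3, 5])
inter_rater_r = np.corrcoef(ratings1, ratings2)[0, 1]
print("Inter-rater correlation:", round(inter_rater_r, 2))
```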

You might think of this type of reliability as "calibrating" the observers. There are other things you could do to encourage reliability between observers, even if you don't estimate it. For instance, I used to work in a psychiatric unit where every morning a nurse had to do a ten-item rating of each patient on the unit. Of course, we couldn't count on the same nurse being present every day, so we had to find a way to assure that any of the nurses would give comparable ratings. The way we did it was to hold weekly "calibration" meetings where we would go over all of the nurses' ratings for several patients and discuss why they chose the specific values they did. If there were disagreements, the nurses would discuss them and attempt to come up with rules for deciding when they would give a "3" or a "4" for a rating on a specific item. Although this was not an estimate of reliability, it probably went a long way toward improving the reliability between raters.

Examples
Two people may be asked to categorize pictures of animals as being dogs or cats. A perfectly reliable result would be that they both classify the same pictures in the same way.

Observers being used in assessing prisoner stress are asked to assess several 'dummy' people who are briefed to respond in a programmed and consistent way. The variation in results from a standard gives a measure of their reliability.

In a test scenario, an IQ test applied to several people with a true score of 120 should result in a score of 120 for everyone. In practice, there will usually be some variation between people.


Source:
-. www.about.com
-. www.socialresearchmethods.net
-. www.changingminds.org

Monday, June 8, 2009

Reliability Measurement

A topic about reliability was published in this blog in July 2008, and this post on the same theme completes that earlier topic.

The similarity of results provided by independent but comparable measures of the same object, trait, or construct is called reliability. Data are said to be reliable if they consistently produce the same measurement time after time for a given amount or type of response, regardless of who does the measurement or when. The relation between reliability and validity is that data can be reliable without being valid, but cannot be valid without being reliable.

Some definitions of reliability:
1. In general, reliability (systemic def.) is the ability of a person or system to perform and maintain its functions in routine circumstances, as well as hostile or unexpected circumstances.

2. In statistics, reliability is the consistency of a set of measurements or measuring instrument, often used to describe a test. This can either be whether the measurements of the same instrument give or are likely to give the same measurement (test-retest), or in the case of more subjective instruments, such as personality or trait inventories, whether two independent assessors give similar scores (inter-rater reliability). Reliability is inversely related to random error.

3. In experimental sciences, reliability is the extent to which the measurements of a test remain consistent over repeated tests of the same subject under identical conditions. An experiment is reliable if it yields consistent results of the same measure. It is unreliable if repeated measurements give different results. It can also be interpreted as the lack of random error in measurement.

4. Reliability has to do with the quality of measurement. Reliability is the "consistency" or "repeatability" of your measures.

5. In research, the term reliability means "repeatability" or "consistency". A measure is considered reliable if it would give us the same result over and over again.

6. A scale is said to be reliable if it consistently produces the same measurement or category time after time for a given amount or type of a response, regardless of who does the measurement or when.

7. Reliability refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly.

8. 'Reliability' of any research is the degree to which it gives an accurate score across a range of measurement. It can thus be viewed as being 'repeatability' or 'consistency'.

9. Reliability means "repeatability" or "consistency". A measure is considered reliable if it would give us the same result over and over again.


Reliability does not imply validity. That is, a reliable measure is measuring something consistently, but not necessarily what it is supposed to be measuring. It is distinguished from validity in that validity is represented in agreement between two attempts to measure the same trait through maximally different methods, whereas reliability is the agreement between two efforts to measure the same trait through maximally similar methods.

If a measure were valid, there would be little need to worry about its reliability. If a measure is valid, it reflects the characteristic that it is supposed to measure and is not distorted by other factors, either systematic or transitory. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance. In terms of accuracy and precision, reliability is precision, while validity is accuracy.

There are four general classes of reliability estimates, each of which estimates reliability in a different way. They are:
1. Inter-Rater or Inter-Observer Reliability
Used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon.
Inter-rater: Different people, same test.

2. Test-Retest Reliability
Used to assess the consistency of a measure from one time to another.
Test-retest: Same people, different times.

3. Parallel-Forms Reliability
Used to assess the consistency of the results of two tests constructed in the same way from the same content domain.
Parallel-forms: Different people, same time, different test.

4. Internal Consistency Reliability
Used to assess the consistency of results across items within a test.
Internal consistency: Different questions, same construct.

Although lack of reliability provides negative evidence of the validity of a measure, the mere presence of reliability does not mean that the measure is valid. Reliability is a necessary, but not a sufficient, condition for validity. Reliability is more easily measured than validity.


Source:
-. Marketing Research, Methodological Foundations, 5th edition, The Dryden Press International Edition, author Gilbert A. Churchill, Jr.
-. Managerial Applications of Multivariate Analysis in Marketing, James H. Myers and Gary M. Mullet, 2003, American Marketing Association, Chicago
-. www.wikipedia.com
-. www.socialresearchmethods.net
-. www.changingminds.org