Monday, June 22, 2009

Internal Consistency Reliability

What is Internal Consistency Reliability?

1. A procedure for studying reliability when the focus of the investigation is on the consistency of scores on the same occasion and on similar content, but when conducting repeated testing or alternate forms testing is not possible. The procedure uses information about how consistent the examinees' scores are from one item (or one part of the test) to the next to estimate the consistency of examinees' scores on the entire test.

2. The internal consistency reliability of survey instruments (e.g., psychological tests) is a measure of the reliability of different survey items intended to measure the same characteristic.

3. Internal consistency reliability evaluates individual questions in comparison with one another for their ability to give consistently appropriate results.

4. In internal consistency reliability estimation we use our single measurement instrument administered to a group of people on one occasion to estimate reliability. In effect we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results. We are looking at how consistent the results are for different items for the same construct within the measure.

Example: there are 5 different questions (items) related to anxiety level. Each question implies a response with 5 possible values on a Likert scale, e.g. scores -2, -1, 0, 1, 2. Responses from a group of respondents have been obtained. In reality, answers to different questions vary for each particular respondent, although the items are intended to measure the same aspect or quantity. The smaller this variability (or, equivalently, the stronger the correlation between items), the greater the internal consistency reliability of this survey instrument.

There are a wide variety of internal consistency measures that can be used.

1. Average Inter-item Correlation
Average inter-item correlation compares correlations between all pairs of questions that test the same construct by calculating the mean of all paired correlations. The average inter-item correlation uses all of the items on our instrument that are designed to measure the same construct. We first compute the correlation between each pair of items, as illustrated in the figure. For example, if we have six items we will have 15 different item pairings (i.e., 15 correlations). The average inter-item correlation is simply the average or mean of all these correlations. In the example, we find an average inter-item correlation of .90, with the individual correlations ranging from .84 to .95.
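A minimal sketch of this computation in Python, using invented Likert-style scores (rows are respondents, columns are the six items; the data are hypothetical, not taken from the example above):

```python
import numpy as np

# Hypothetical responses: 8 respondents x 6 items, scored -2..2
items = np.array([
    [ 2,  1,  2,  2,  1,  2],
    [ 0,  0,  1,  0,  0,  1],
    [-1, -2, -1, -1, -2, -1],
    [ 1,  1,  0,  1,  1,  1],
    [ 2,  2,  2,  1,  2,  2],
    [-2, -1, -2, -2, -1, -2],
    [ 0,  1,  0,  0,  0,  1],
    [-1, -1,  0, -1, -1, -1],
])

r = np.corrcoef(items, rowvar=False)       # 6 x 6 inter-item correlation matrix
pairs = r[np.triu_indices_from(r, k=1)]    # the 15 unique item pairings
print("average inter-item correlation:", round(pairs.mean(), 2))
```

With six items, the upper triangle of the correlation matrix holds the 15 pairwise correlations mentioned above; their mean is the average inter-item correlation.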

2. Average Item-Total Correlation
Average item-total correlation computes a total score across the items that measure the same construct, correlates each item with that total, and then averages these item-total correlations. This approach also uses the inter-item correlations. In addition, we compute a total score for the six items and use that as a seventh variable in the analysis. The figure shows the six item-to-total correlations at the bottom of the correlation matrix. They range from .82 to .88 in this sample analysis, with the average of these at .85.
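A sketch of the same idea, again on invented respondents-by-items data (the resulting values will not match the .82-.88 figures quoted above):

```python
import numpy as np

# Hypothetical responses: 8 respondents x 6 items
items = np.array([
    [ 2,  1,  2,  2,  1,  2],
    [ 0,  0,  1,  0,  0,  1],
    [-1, -2, -1, -1, -2, -1],
    [ 1,  1,  0,  1,  1,  1],
    [ 2,  2,  2,  1,  2,  2],
    [-2, -1, -2, -2, -1, -2],
    [ 0,  1,  0,  0,  0,  1],
    [-1, -1,  0, -1, -1, -1],
])

total = items.sum(axis=1)   # the "seventh variable": each person's total score
item_total = [np.corrcoef(items[:, j], total)[0, 1] for j in range(items.shape[1])]
print("item-total correlations:", np.round(item_total, 2))
print("average item-total correlation:", round(np.mean(item_total), 2))
```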

3. Split-Half Correlation
Split-half correlation divides the items that measure the same construct into two half-tests, administers them to the same group of people, and then calculates the correlation between the two total scores. In split-half reliability we randomly divide all items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate, as shown in the figure, is simply the correlation between these two total scores. In the example it is .87.

It is often not feasible to obtain two or more measures of the same item from the same person at different points in time. Split-half reliability instead divides a single survey measuring instrument into two parts and then correlates the responses (scores) from one half with the responses from the other half. If all items are supposed to measure the same basic idea, the resulting correlation should be high.
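A sketch of a split-half estimate under the same hypothetical data layout (the seed, the invented scores, and the three-item halves are all assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed so the random split is reproducible

# Hypothetical responses: 8 respondents x 6 items
items = np.array([
    [ 2,  1,  2,  2,  1,  2],
    [ 0,  0,  1,  0,  0,  1],
    [-1, -2, -1, -1, -2, -1],
    [ 1,  1,  0,  1,  1,  1],
    [ 2,  2,  2,  1,  2,  2],
    [-2, -1, -2, -2, -1, -2],
    [ 0,  1,  0,  0,  0,  1],
    [-1, -1,  0, -1, -1, -1],
])

cols = rng.permutation(items.shape[1])     # randomly divide the six items into two sets
half_a = items[:, cols[:3]].sum(axis=1)    # total score on the first half
half_b = items[:, cols[3:]].sum(axis=1)    # total score on the second half
print("split-half correlation:", round(np.corrcoef(half_a, half_b)[0, 1], 2))
```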

4. Cronbach's Alpha
Cronbach's alpha calculates an equivalent to the average of all possible split-half correlations. Imagine that we compute one split-half reliability and then randomly divide the items into another set of split halves and recompute, and keep doing this until we have computed all possible split-half estimates of reliability. Cronbach's Alpha is mathematically equivalent to the average of all possible split-half estimates, although that's not how we compute it. Notice that when I say we compute all possible split-half estimates, I don't mean that each time we go and measure a new sample! That would take forever. Instead, we calculate all split-half estimates from the same sample. Because we measured all of our sample on each of the six items, all we have to do is have the computer analysis do the random subsets of items and compute the resulting correlations. The figure shows several of the split-half estimates for our six-item example and lists them as SH with a subscript. Just keep in mind that although Cronbach's Alpha is equivalent to the average of all possible split-half correlations, we would never actually calculate it that way. Some clever mathematician (Cronbach, I presume!) figured out a way to get the mathematical equivalent a lot more quickly.

Coefficient alpha provides a summary measure of the inter-correlations among a set of items in any scale used in marketing research (Churchill 1995; Nunnally 1978). Churchill (1995, p. 498; emphasis in original) observes that “Coefficient alpha routinely should be calculated to assess the quality of measure.” Coefficient alpha is generally considered the best estimate of the true reliability of any multiple-item scale that is intended to measure some basic idea or construct useful to market researchers or planners.
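The usual computational shortcut works from the item variances and the variance of the total score rather than from averaging split halves. A minimal sketch with hypothetical data:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for a respondents-by-items array of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 8 respondents x 6 items
items = np.array([
    [ 2,  1,  2,  2,  1,  2],
    [ 0,  0,  1,  0,  0,  1],
    [-1, -2, -1, -1, -2, -1],
    [ 1,  1,  0,  1,  1,  1],
    [ 2,  2,  2,  1,  2,  2],
    [-2, -1, -2, -2, -1, -2],
    [ 0,  1,  0,  0,  0,  1],
    [-1, -1,  0, -1, -1, -1],
])
print("Cronbach's alpha:", round(cronbach_alpha(items), 2))
```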

Source:
-. Managerial Applications of Multivariate Analysis in Marketing, James H. Myers and Gary M. Mullet, 2003, American Marketing Association, Chicago
-. www.changingminds.org
-. www.statistics.com
-. www.socialresearchmethods.net

Friday, June 19, 2009

Parallel-Forms Reliability


One problem with questions or assessments is knowing which questions are the best ones to ask. A way of discovering this is to do two tests in parallel, using different questions. Parallel-forms reliability evaluates different questions and question sets that seek to assess the same construct. Parallel-forms evaluation may be done in combination with other methods, such as split-half, which divides items that measure the same construct into two tests and applies them to the same group of people.

Parallel-forms reliability is gauged by comparing two different tests that were created using the same content. This is accomplished by creating a large pool of test items that measure the same quality and then randomly dividing the items into two separate tests. The two tests should then be administered to the same subjects at the same time.

In parallel forms reliability you first have to create two parallel forms. One way to accomplish this is to create a large set of questions that address the same construct and then randomly divide the questions into two sets. You administer both instruments to the same sample of people. The correlation between the two parallel forms is the estimate of reliability.
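A minimal sketch of that final step, with invented total scores for the same ten people on two hypothetical forms:

```python
import numpy as np

# Hypothetical total scores for the same ten people on Form A and Form B
form_a = np.array([12, 18,  9, 15, 20,  7, 14, 16, 11, 19])
form_b = np.array([13, 17, 10, 14, 19,  8, 15, 15, 12, 18])

# The correlation between the two parallel forms is the reliability estimate
print("parallel-forms reliability:", round(np.corrcoef(form_a, form_b)[0, 1], 2))
```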

For instance, we might be concerned about a testing threat to internal validity. If we use Form A for the pretest and Form B for the posttest, we minimize that problem. It would be even better if we randomly assign individuals to receive Form A or Form B on the pretest and then switch them on the posttest. With split-half reliability we have an instrument that we wish to use as a single measurement instrument and only develop randomly split halves for purposes of estimating reliability.


Source:
-. www.about.com
-. www.changingminds.org
-. www.socialresearchmethods.net

Monday, June 15, 2009

Test-Retest Reliability

What is Test-Retest Reliability ?

1. Test-retest is a statistical method used to examine how reliable a test is: a test is performed twice, e.g., the same test is given to a group of subjects at two different times. Each subject may score differently from the other subjects, but if the test is reliable then each subject should score about the same on both tests. (Valentin Rousson, Theo Gasser, and Burkhardt Seifert (2002), "Assessing intrarater, interrater and test-retest reliability of continuous measurements," Statistics in Medicine 21:3431-3446.)

2. A measure of the ability of a psychological testing instrument to yield the same result for a single subject at two different test periods, which are closely spaced so that any variation detected reflects the reliability of the instrument rather than changes in the subject's status.

3. The test-retest reliability of a survey instrument, like a psychological test, is estimated by performing the same survey with the same respondents at different moments of time. The closer the results, the greater the test-retest reliability of the survey instrument. The correlation coefficient between such two sets of responses is often used as a quantitative measure of the test-retest reliability. (www.statistics.com)

4. Because a scale is considered reliable if it consistently produces the same measurement for a given amount or type of response, one obvious way to assess reliability is to take two or more measures at different points in time using the same respondents. This is known as test-retest reliability. These measures must be taken using exactly the same measuring instrument and under conditions that are as similar as possible. Reliability is usually measured in terms of the correlation coefficient between the first and second measures, or among all measures if more than two are taken. The higher the correlation, the more similar the measurements are and therefore the greater the test-retest reliability.



Example:

1. A group of respondents is tested for IQ scores: each respondent is tested twice - the two tests are, say, a month apart. Then, the correlation coefficient between the two sets of IQ scores is a reasonable measure of the test-retest reliability of this test. In the ideal case, both scores coincide for each respondent and, hence, the correlation coefficient is 1.0. In reality this is almost never the case - the scores produced by a respondent would vary if the test were carried out several times. Normally, correlation values of about 0.7 to 0.8 are considered satisfactory or good.

2. Various questions for a personality test are tried out with a class of students over several years. This helps the researcher determine those questions and combinations that have better reliability.

3. In the development of national school tests, a class of children is given several tests that are intended to assess the same abilities. A week and a month later, they are given the same tests. With allowances for learning, the variation between the test and retest results is used to assess which tests have better test-retest reliability.


Test-retest reliability is the most popular indicator of survey reliability. A shortcoming of test-retest reliability is the "practice effect": respondents "learn" to answer the same questions during the first test, and this affects their responses in the next test. For example, IQ scores may tend to be higher in the next test.

Reliability can vary with the many factors that affect how a person responds to the test, including their mood, interruptions, time of day, etc. A good test will largely cope with such factors and give relatively little variation. An unreliable test is highly sensitive to such factors and will give widely varying results, even if the person re-takes the same test half an hour later.

This method is particularly used in experiments that use a no-treatment control group that is measured at pre-test and post-test.

We estimate test-retest reliability when we administer the same test to the same sample on two different occasions. This approach assumes that there is no substantial change in the construct being measured between the two occasions. The amount of time allowed between measures is critical. We know that if we measure the same thing twice, the correlation between the two observations will depend in part on how much time elapses between the two measurement occasions. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because the two observations are related over time -- the closer in time we get, the more similar the factors that contribute to error. Since this correlation is the test-retest estimate of reliability, you can obtain considerably different estimates depending on the interval.
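A minimal sketch of the estimate itself, with invented IQ scores for the same ten respondents measured a month apart:

```python
import numpy as np

# Hypothetical scores for the same respondents on the two occasions
test   = np.array([101, 115,  98, 122, 109,  95, 130, 104, 117, 108])
retest = np.array([103, 113, 100, 120, 111,  96, 128, 106, 115, 110])

# The test-retest reliability estimate is the correlation between the two administrations
print("test-retest reliability:", round(np.corrcoef(test, retest)[0, 1], 2))
```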


Source:
-. Managerial Applications of Multivariate Analysis in Marketing, James H. Myers and Gary M. Mullet, 2003, American Marketing Association, Chicago
-. http://dx.doi.org/10.1002/sim.1253
-. www.statistics.com
-. www.socialresearchmethods.net
-. www.changingminds.org

Wednesday, June 10, 2009

Inter-Rater or Inter-Observer Reliability

Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.

So how do we determine whether two observers are being consistent in their observations? You probably should establish inter-rater reliability outside of the context of the measurement in your study. After all, if you use data from your study to establish reliability, and you find that reliability is low, you're kind of stuck. Probably it's best to do this as a side study or pilot study. And, if your study goes on for a long time, you may want to reestablish inter-rater reliability from time to time to assure that your raters aren't changing.

This type of reliability is assessed by having two or more independent judges score the test. The scores are then compared to determine the consistency of the raters' estimates. One way to test inter-rater reliability is to have each rater assign each test item a score.

Inter-rater or inter-observer reliability is an estimation method that is used when your measurement procedure is applied by people. We are all subject to distractions, tiredness and a whole host of other effects upon our consistency, and if you are using people as, say, observers, you will want to have some estimation of the reliability and consistency of the people doing the observing.

Two major ways in which inter-rater reliability is used are
(a) testing how similarly people categorize items, and
(b) how similarly people score items.

This is the best way of assessing reliability when you are using observation, as observer bias very easily creeps in. It does, however, assume you have multiple observers, which is not always the case.

Inter-rater reliability is also known as inter-observer reliability or inter-coder reliability.

There are two major ways to actually estimate inter-rater reliability.

• those where observers are checking off which category an observation falls into (categorization). If your measurement consists of categories -- the raters are checking off which category each observation falls in -- you can calculate the percent of agreement between the raters. For instance, let's say you had 100 observations that were being rated by two raters. For each observation, the rater could check one of three categories. Imagine that on 86 of the 100 observations the raters checked the same category. In this case, the percent of agreement would be 86%. OK, it's a crude measure, but it does give an idea of how much agreement exists, and it works no matter how many categories are used for each observation.

• those where observers are ranking their observations against a continuous scale, such as a Likert scale. The other major way to estimate inter-rater reliability is appropriate when the measure is a continuous one. There, all you need to do is calculate the correlation between the ratings of the two observers. For instance, they might be rating the overall level of activity in a classroom on a 1-to-7 scale. You could have them give their rating at regular time intervals (e.g., every 30 seconds). The correlation between these ratings would give you an estimate of the reliability or consistency between the raters.
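A minimal sketch of both estimates, using invented ratings from two hypothetical raters:

```python
import numpy as np

# (a) Categorical ratings: percent agreement over ten observations
rater1 = ["dog", "cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog", "dog"]
rater2 = ["dog", "cat", "dog", "dog", "dog", "cat", "dog", "cat", "cat", "dog"]
agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
print(f"percent agreement: {agreement:.0%}")   # 8 of 10 observations match -> 80%

# (b) Continuous ratings: correlation between two raters on a 1-to-7 activity scale
scores1 = np.array([3, 5, 4, 6, 2, 7, 4, 5, 3, 6])
scores2 = np.array([3, 4, 4, 6, 3, 7, 5, 5, 2, 6])
print("inter-rater correlation:", round(np.corrcoef(scores1, scores2)[0, 1], 2))
```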

You might think of this type of reliability as "calibrating" the observers. There are other things you could do to encourage reliability between observers, even if you don't estimate it. For instance, I used to work in a psychiatric unit where every morning a nurse had to do a ten-item rating of each patient on the unit. Of course, we couldn't count on the same nurse being present every day, so we had to find a way to assure that any of the nurses would give comparable ratings. The way we did it was to hold weekly "calibration" meetings where we would review all of the nurses' ratings for several patients and discuss why they chose the specific values they did. If there were disagreements, the nurses would discuss them and attempt to come up with rules for deciding when they would give a "3" or a "4" for a rating on a specific item. Although this was not an estimate of reliability, it probably went a long way toward improving the reliability between raters.

Examples
Two people may be asked to categorize pictures of animals as being dogs or cats. A perfectly reliable result would be that they both classify the same pictures in the same way.

Observers being used in assessing prisoner stress are asked to assess several 'dummy' people who are briefed to respond in a programmed and consistent way. The variation in results from a standard gives a measure of their reliability.

In a test scenario, an IQ test applied to several people with a true score of 120 should result in a score of 120 for everyone. In practice, there will usually be some variation between people.


Source:
-. www.about.com
-. www.socialresearchmethods.net
-. www.changingminds.org

Monday, June 8, 2009

Reliability Measurement

A topic about reliability was published on this blog in July 2008; this post, on the same theme, completes that earlier topic.

The similarity of results provided by independent but comparable measures of the same object, trait, or construct is called reliability. Data are said to be reliable if they consistently produce the same measurement time after time for a given amount or type of response, regardless of who does the measurement or when it is done. The relation between reliability and validity is that data can be reliable without being valid, but cannot be valid without being reliable.

Some definitions of reliability:
1. In general, reliability (systemic def.) is the ability of a person or system to perform and maintain its functions in routine circumstances, as well as hostile or unexpected circumstances.

2. In statistics, reliability is the consistency of a set of measurements or measuring instrument, often used to describe a test. This can either be whether the measurements of the same instrument give or are likely to give the same measurement (test-retest), or in the case of more subjective instruments, such as personality or trait inventories, whether two independent assessors give similar scores (inter-rater reliability). Reliability is inversely related to random error.

3. In experimental sciences, reliability is the extent to which the measurements of a test remain consistent over repeated tests of the same subject under identical conditions. An experiment is reliable if it yields consistent results of the same measure. It is unreliable if repeated measurements give different results. It can also be interpreted as the lack of random error in measurement.

4. Reliability has to do with the quality of measurement. Reliability is the "consistency" or "repeatability" of your measures.

5. In research, the term reliability means "repeatability" or "consistency". A measure is considered reliable if it would give us the same result over and over again.

6. A scale is said to be reliable if it consistently produces the same measurement or category time after time for a given amount or type of a response, regardless of who does the measurement or when.

7. Reliability refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly.

8. 'Reliability' of any research is the degree to which it gives an accurate score across a range of measurement. It can thus be viewed as being 'repeatability' or 'consistency'.

9. Reliability means "repeatability" or "consistency". A measure is considered reliable if it would give us the same result over and over again.


Reliability does not imply validity. That is, a reliable measure is measuring something consistently, but not necessarily what it is supposed to be measuring. It is distinguished from validity in that validity is represented in agreement between two attempts to measure the same trait through maximally different methods, whereas reliability is the agreement between two efforts to measure the same trait through maximally similar methods.

If a measure were valid, there would be little need to worry about its reliability. If a measure is valid, it reflects the characteristic that it is supposed to measure and is not distorted by other factors, either systematic or transitory. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance. In terms of accuracy and precision, reliability is precision, while validity is accuracy.

There are four general classes of reliability estimates, each of which estimates reliability in a different way. They are:
1. Inter-Rater or Inter-Observer Reliability
Used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon.
Inter-rater: Different people, same test.

2. Test-Retest Reliability
Used to assess the consistency of a measure from one time to another.
Test-retest: Same people, different times.

3. Parallel-Forms Reliability
Used to assess the consistency of the results of two tests constructed in the same way from the same content domain.
Parallel-forms: Different people, same time, different test.

4. Internal Consistency Reliability
Used to assess the consistency of results across items within a test.
Internal consistency: Different questions, same construct.

Although lack of reliability provides negative evidence of the validity of a measure, the mere presence of reliability does not mean that the measure is valid. Reliability is a necessary, but not a sufficient, condition for validity. Reliability is more easily measured than validity.


Source:
-. Marketing Research, Methodological Foundations, 5th edition, The Dryden Press International Edition, author Gilbert A. Churchill, Jr.
-. Managerial Applications of Multivariate Analysis in Marketing, James H. Myers and Gary M. Mullet, 2003, American Marketing Association, Chicago
-. www.wikipedia.com
-. www.socialresearchmethods.net
-. www.changingminds.org

Thursday, June 4, 2009

Validity Types (2)

These are the validity types that are typically mentioned in texts and research papers when talking about the quality of measurement. Validity concerns whether the translation from concept to operationalization accurately represents the underlying concept: does it measure what you think it measures?

1. Translation validity
a. Face validity
b. Content validity

2. Criterion-related validity
a. Predictive validity
b. Concurrent validity
c. Convergent validity
d. Discriminant validity

In essence, both of those validity types are attempting to assess the degree to which you accurately translated your construct into the operationalization, and hence the choice of name.



2. Criterion-Related Validity
In criterion-related validity, you check the performance of your operationalization against some criterion. How is this different from content validity? In content validity, the criteria are the construct definition itself -- it is a direct comparison. In criterion-related validity, we usually make a prediction about how the operationalization will perform based on our theory of the construct. The differences among the criterion-related validity types lie in the criteria they use as the standard for judgment.


a. Predictive Validity
In predictive validity, we assess the operationalization's ability to predict something it should theoretically be able to predict. For instance, we might theorize that a measure of math ability should be able to predict how well a person will do in an engineering-based profession. We could give our measure to experienced engineers and see if there is a high correlation between scores on the measure and their salaries as engineers. A high correlation would provide evidence for predictive validity -- it would show that our measure can correctly predict something that we theoretically think it should be able to predict.


b. Concurrent Validity
In concurrent validity, we assess the operationalization's ability to distinguish between groups that it should theoretically be able to distinguish between. For example, if we come up with a way of assessing manic-depression, our measure should be able to distinguish between people who are diagnosed with manic-depression and those diagnosed as paranoid schizophrenic. If we want to assess the concurrent validity of a new measure of empowerment, we might give the measure to both migrant farm workers and to the farm owners, theorizing that our measure should show that the farm owners are higher in empowerment. As in any discriminating test, the results are more powerful if you are able to show that you can discriminate between two groups that are very similar.


c. Convergent Validity
In convergent validity, we examine the degree to which the operationalization is similar to (converges on) other operationalizations that it theoretically should be similar to. For instance, to show the convergent validity of a Head Start program, we might gather evidence that shows that the program is similar to other Head Start programs. Or, to show the convergent validity of a test of arithmetic skills, we might correlate the scores on our test with scores on other tests that purport to measure basic math ability, where high correlations would be evidence of convergent validity.

Convergent validity is the degree to which an operation is similar to (converges on) other operations that it theoretically should also be similar to. For instance, to show the convergent validity of a test of mathematics skills, the scores on the test can be correlated with scores on other tests that are also designed to measure basic mathematics ability. High correlations between the test scores would be evidence of a convergent validity.
Convergent validity shows that the assessment is related to what it should theoretically be related to.

Ideally, scales should also rate high in discriminant validity, which, unlike convergent validity, measures the extent to which a given scale differs from other scales designed to measure a different conceptual variable. Discriminant validity and convergent validity are two good ways to assess construct validity.

d. Discriminant Validity
In discriminant validity, we examine the degree to which the operationalization is not similar to (diverges from) other operationalizations that it theoretically should not be similar to. For instance, to show the discriminant validity of a Head Start program, we might gather evidence that shows that the program is not similar to other early childhood programs that don't label themselves as Head Start programs. Or, to show the discriminant validity of a test of arithmetic skills, we might correlate the scores on our test with scores on tests of verbal ability, where low correlations would be evidence of discriminant validity.


Campbell and Fiske (1959) introduced the concept of discriminant validity within their discussion on evaluating test validity. They stressed the importance of using both discriminant and convergent validation techniques when assessing new tests. A successful evaluation of discriminant validity shows that a test of a concept is not highly correlated with other tests designed to measure theoretically different concepts.

In showing that two scales do not correlate, it is necessary to correct for attenuation in the correlation due to measurement error. It is possible to calculate the extent to which the two scales overlap by using the following formula, where rxy is the correlation between x and y, rxx is the reliability of x, and ryy is the reliability of y:

corrected rxy = rxy / sqrt(rxx * ryy)

Although there is no standard value for discriminant validity, a result less than .85 tells us that discriminant validity likely exists between the two scales. A result greater than .85, however, tells us that the two constructs overlap greatly and they are likely measuring the same thing. Therefore, we cannot claim discriminant validity between them.
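A minimal sketch of this check with hypothetical values (an observed correlation of .60 between the two scales and reliabilities of .80 and .75; these numbers are assumptions for illustration):

```python
import math

def disattenuated_correlation(r_xy, r_xx, r_yy):
    """Observed correlation between two scales, corrected for their unreliability."""
    return r_xy / math.sqrt(r_xx * r_yy)

corrected = disattenuated_correlation(r_xy=0.60, r_xx=0.80, r_yy=0.75)
print(round(corrected, 2))   # 0.77 -> below the .85 rule of thumb, so discriminant validity is plausible
```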


Source:
-. http://www.socialresearchmethods.net
-. http://www.wikipedia.com

Monday, June 1, 2009

Validity Types (1)

These are the validity types that are typically mentioned in texts and research papers when talking about the quality of measurement. Validity concerns whether the translation from concept to operationalization accurately represents the underlying concept: does it measure what you think it measures? A scale is said to be valid if it measures what it is intended to measure. Physical measurements such as height and weight can be measured reliably (and they are also valid measures of how tall or heavy someone is), but they may not relate in any meaningful way to mental abilities, etc.

1. Translation validity
a. Face validity
b. Content validity

2. Criterion-related validity
a. Predictive validity
b. Concurrent validity
c. Convergent validity
d. Discriminant validity

In essence, both of those validity types are attempting to assess the degree to which you accurately translated your construct into the operationalization, and hence the choice of name.


1. Translation Validity
Is the operationalization a good reflection of the construct?
This approach is definitional in nature: it assumes you have a good, detailed definition of the construct and that you can check the operationalization against it.

a. Face Validity
In face validity, you look at the operationalization and see whether "on its face" it seems like a good translation of the construct. This is probably the weakest way to try to demonstrate construct validity. For instance, you might look at a measure of math ability, read through the questions, and decide that yep, it seems like this is a good measure of math ability (i.e., the label "math ability" seems appropriate for this measure). Or, you might observe a teenage pregnancy prevention program and conclude that, "Yep, this is indeed a teenage pregnancy prevention program." Of course, if this is all you do to assess face validity, it would clearly be weak evidence because it is essentially a subjective judgment call. (Note that just because it is weak evidence doesn't mean that it is wrong. We need to rely on our subjective judgment throughout the research process. It's just that this form of judgment won't be very convincing to others.) We can improve the quality of face validity assessment considerably by making it more systematic. For instance, if you are trying to assess the face validity of a math ability measure, it would be more convincing if you sent the test to a carefully selected sample of experts on math ability testing and they all reported back with the judgment that your measure appears to be a good measure of math ability.

b. Content Validity
In content validity, you essentially check the operationalization against the relevant content domain for the construct. This approach assumes that you have a good detailed description of the content domain, something that's not always true. For instance, we might lay out all of the criteria that should be met in a program that claims to be a "teenage pregnancy prevention program." We would probably include in this domain specification the definition of the target group, criteria for deciding whether the program is preventive in nature (as opposed to treatment-oriented), and lots of criteria that spell out the content that should be included like basic information on pregnancy, the use of abstinence, birth control methods, and so on. Then, armed with these criteria, we could use them as a type of checklist when examining our program. Only programs that meet the criteria can legitimately be defined as "teenage pregnancy prevention programs." This all sounds fairly straightforward, and for many operationalizations it will be. But for other constructs (e.g., self-esteem, intelligence), it will not be easy to decide on the criteria that constitute the content domain.
Check the operationalization against the relevant content domain for the construct. This assumes that a well-defined concept is being operationalized, which may not be true. For example, a depression measure should cover the checklist of depression symptoms.



Source:
-. Managerial Applications of Multivariate Analysis in Marketing, James H. Myers and Gary M. Mullet, 2003, American Marketing Association, Chicago
-. http://www.socialresearchmethods.net
-. http://www.wikipedia.com