Monday, May 25, 2009

Construct Validity

Construct validity is most directly concerned with the question of what the instrument is, in fact, measuring. What construct, concept, or trait underlies the performance or score achieved on the test? Does a measure of attitude actually measure attitude, or some other underlying characteristic of the individual that affects his or her score? Construct validity lies at the very heart of scientific progress. Scientists need constructs with which to communicate. Thus, in marketing we speak of people’s socioeconomic class, their personality, their attitudes, and so on. These are all constructs that we use as we try to explain marketing behavior. And although vital, they are also unobservable. We can observe behavior related to these constructs, but we cannot observe the constructs themselves. Rather, we operationally define the constructs in terms of a set of observables. When we agree on the operational definitions, precision in communication is advanced. Instead of having to say that what is measured by these items is the person’s brand loyalty, we can simply speak of the notion of brand loyalty.

Construct validity seeks agreement between a theoretical concept and a specific measuring device or procedure. For example, a researcher inventing a new IQ test might spend a great deal of time attempting to "define" intelligence in order to reach an acceptable level of construct validity.

In social science and psychometrics, construct validity refers to whether a scale measures or correlates with a theorized psychological construct (such as "fluid intelligence"). It is related to the theoretical ideas behind the personality trait under consideration; a non-existent concept in the physical sense may be suggested as a method of organising how personality can be viewed. The unobservable idea of a unidimensional easier-to-harder dimension must be "constructed" in the words of human language and graphics.

A construct is not restricted to one set of observable indicators or attributes. It is common to a number of sets of indicators. Thus, "construct validity" can be evaluated by statistical methods that show whether or not a common factor can be shown to exist underlying several measurements using different observable indicators. This view of a construct rejects the older operationist view that a construct is neither more nor less than the operations used to measure it.

Construct validity is the approximate truth of the conclusion that your operationalization accurately reflects its construct. All of the other validity terms address this general issue in different ways. A distinction can be drawn between two broad types: translation validity and criterion-related validity.

In translation validity, the focus is on whether the operationalization is a good reflection of the construct. This approach is definitional in nature -- it assumes you have a good, detailed definition of the construct and that you can check the operationalization against it. In criterion-related validity, you examine whether the operationalization behaves the way it should, given your theory of the construct. This is a more relational approach to construct validity: it assumes that your operationalization should function in predictable ways in relation to other operationalizations based upon your theory of the construct.

We need to ensure, through the plans and procedures used in constructing the instrument, that we have adequately sampled the domain of the construct and that there is internal consistency among the items of the domain. The assumption about the internal consistency of a set of items is that “if a set of items is really measuring some underlying trait or attitude, then the underlying trait causes the covariation among the items. The higher the correlations, the better the items are measuring the same underlying construct.” We saw that internal consistency was also at issue in determining content validity, and, as a matter of fact, negative evidence about the content validity of a measure also provides negative evidence about its construct validity. A measure possessing construct validity must be internally consistent insofar as the construct is internally consistent. On the other hand, it is not true that a consistent measure is necessarily a construct-valid measure. In other words, consistency is a necessary but not sufficient condition for construct validity.
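To make the internal-consistency idea concrete, here is a minimal Python sketch (my own illustration, not from Churchill's text) that computes Cronbach's alpha, a common summary of how strongly a set of items covaries; the item scores below are entirely hypothetical.

```python
# Minimal sketch of internal consistency via Cronbach's alpha (hypothetical data).
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-item brand-loyalty scale answered by 6 respondents
scores = np.array([[4, 5, 4, 4, 5],
                   [2, 2, 3, 2, 2],
                   [5, 4, 5, 5, 4],
                   [3, 3, 2, 3, 3],
                   [4, 4, 4, 5, 4],
                   [1, 2, 1, 2, 1]])
print(round(cronbach_alpha(scores), 2))  # values near 1 suggest the items covary strongly
```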

Evaluation of construct validity requires examining the correlation of the measure being evaluated with variables that are known to be related to the construct purportedly measured by the instrument, or for which there are theoretical grounds for expecting a relationship. This is consistent with the multitrait-multimethod matrix approach to examining construct validity described in Campbell and Fiske's landmark paper (1959). Correlations that fit the expected pattern contribute evidence of construct validity. Construct validity is a judgment based on the accumulation of correlations from numerous studies using the instrument being evaluated.
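As a rough illustration of this correlational evidence, the following sketch uses simulated (entirely hypothetical) data to check that a new measure correlates strongly with a variable that theory says should be related to it, and only weakly with an unrelated one.

```python
# Sketch of the expected correlation pattern for construct validity (simulated data).
import numpy as np

rng = np.random.default_rng(0)
n = 200
related = rng.normal(size=n)                      # variable theory says is related
new_measure = related + rng.normal(scale=0.7, size=n)
unrelated = rng.normal(size=n)                    # variable theory says is unrelated

r_related = np.corrcoef(new_measure, related)[0, 1]
r_unrelated = np.corrcoef(new_measure, unrelated)[0, 1]
print(f"correlation with related variable:   {r_related:.2f}")   # should be high
print(f"correlation with unrelated variable: {r_unrelated:.2f}") # should be near zero
```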



Source:
-. Marketing Research, Methodological Foundations, 5th edition, The Dryden Press International Edition, author Gilbert A. Churchill, Jr.
-. http://www.colostate.edu/
-. www.wikipedia.com
-. http://www.socialresearchmethods.net

Building a Career in Statistics

by David L. Banks, Department of Statistical Science, Duke University
publishing.yudu.com/Library/Auxjn/AmstatNews/resources/3.htm


This article is based on a talk I gave at the JSM 2007 meeting for the ASA Committee on Career Development. But I should confess at the outset that I have no particular qualifications or expertise on this topic, aside from having had a lot of jobs (which ought to raise questions about my suitability in the first place).

Years ago, I was involved in drafting the New Researchers Survival Guide (available at www.imstat.org/publications). Reading over it again, we were awfully earnest and a bit naïve, but I think it has a lot of value for people who are beginning an academic career. So I refer new faculty members to that and, in this article, shall focus on topics that apply to everyone, not just recent PhDs and not just academics.

Although statisticians are relatively homogenous in our training, we have the usual range of talents, personalities, and utility functions. This creates many career paths and many ways to be successful. It also means you can be miserable if you get caught on a path that doesn’t fit your personal strengths and values.

All careers have a stochastic component, so we should look to dynamic programming as a model for continual reappraisal of our situations and of ways that may better them. This implies a portfolio analysis perspective: We each have a different mix of strengths and weaknesses, and we should try to adaptively invest our energy in combinations that seem most likely to pay off. Some skills that apply to employment in all sectors are the following:

1. Technical Strength.
This is the foundation when you are starting out, but it often becomes less important as you advance. Especially in business and government, one needs breadth more than depth at higher levels.

2. Computational Ability.
Anyone who can do solid statistical programming will never miss a meal. It is a blue chip skill and a way of thinking that has a unique value. But, it is hard to become rich or famous on this alone.

3. Public Speaking.
Every member of the ASA has survived at least 1.5 decades of dull lectures in school, which is why it amazes me that so many of us have not learned enough from that experience to avoid giving bad talks. Good presentations are a key component of almost any success story, and whatever you can do to build strength in this area will repay your efforts.

4. Writing.
It is crucial to be able to write clearly, correctly, and briefly. This is a lifelong learning process – anyone who writes well is constantly studying how to write and attending to their process.

5. Social Networking.
This is crucially important, and it sometimes seems that statisticians study it more, while learning less, than those in other fields. You need diverse networking; having a lot of friends who work on local asymptotic minimaxity is not as helpful as having friends with complementary strengths.

6. Organization.
This sounds mundane, but it is very hard for a manager to promote you if you are sloppy or slow about paperwork. And the discipline of quick turnaround on such items (phone calls, email, appointments, referee reports) helps in other aspects of one’s career.

7. Time Management.
Don’t waste time feeling guilty about wasting time, just be efficient when you actually get down to work.

Someone else would probably generate a slightly different list, but these are all key areas to cultivate.

For those who need to stay in their current job, there are still ways to advance. Personality counts for a lot. Try to be, or at least appear, happy and productive. Read the newspaper so you have a wealth of conversation topics and aren’t stereotypically dull or narrow. You should avoid doomed projects, those that do not build new professional assets, and those for which you are not central. I’d recommend looking for projects that cross division boundaries – it helps to have a broad base of good opinion, and you can build unique collaborations the organization needs. Try to differentiate yourself. Think of at least one new idea a week, but be properly skeptical of its value.

Wednesday, May 20, 2009

Content Validity

Content Validity is based on the extent to which a measurement reflects the specific intended domain of content (Carmines & Zeller, 1991, p.20). In psychometrics, content validity (also known as logical validity) refers to the extent to which a measure represents all facets of a given social construct.

Content validity is illustrated using the following examples: Researchers aim to study mathematical learning and create a survey to test for mathematical skill. If these researchers only tested for multiplication and then drew conclusions from that survey, their study would not show content validity, because it excludes other mathematical functions. Although the establishment of content validity for placement-type exams seems relatively straightforward, the process becomes more complex as it moves into the more abstract domain of socio-cultural studies. For example, a researcher needing to measure an attitude like self-esteem must decide what constitutes a relevant domain of content for that attitude. For socio-cultural studies, content validity forces the researchers to define the very domains they are attempting to study.

Content validity focuses on the adequacy with which the domain of the characteristic is captured by the measure. Content validity, sometimes known as “face validity,” is assessed by examining the measure with an eye toward ascertaining the domain being sampled. If the included domain is decidedly different from the domain of the variable as conceived, the measure is said to lack content validity.

How can we ensure that our measure will possess content validity?
We can never guarantee it, because it is partly a matter of judgment. We may feel quite comfortable with the items included in a measure, for example, while a critic may argue that we have failed to sample from some relevant domain of the characteristic. Although we can never guarantee the content validity of a measure, we can severely diminish the objections of critics. The key to content validity lies in the procedures that are used to develop the instrument.

One widely used method of measuring content validity was developed by C. H. Lawshe. It is essentially a method for gauging agreement among raters or judges regarding how essential a particular item is. Lawshe (1975) proposed that each of the subject matter expert raters (SMEs) on the judging panel respond to the following question for each item: "Is the skill or knowledge measured by this item 'essential,' 'useful, but not essential,' or 'not necessary' to the performance of the construct?" According to Lawshe, if more than half the panelists indicate that an item is essential, that item has at least some content validity. Greater levels of content validity exist as larger numbers of panelists agree that a particular item is essential. Using these assumptions, Lawshe developed a formula termed the content validity ratio:

CVR = (ne - N/2)/(N/2)

CVR= content validity ratio,
ne = number of SME panelists indicating "essential",
N = total number of SME panelists.

This formula yields values which range from +1 to -1; positive values indicate that at least half the SMEs rated the item as essential. The mean CVR across items may be used as an indicator of overall test content validity.
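A short sketch of how the ratio might be computed in practice follows; the panel size and ratings are made up purely for illustration.

```python
# Sketch of Lawshe's content validity ratio, CVR = (ne - N/2) / (N/2), with hypothetical ratings.
def content_validity_ratio(n_essential, n_panelists):
    """n_essential: SMEs rating the item 'essential'; n_panelists: total SMEs."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 SMEs rating three items
ratings_essential = [9, 6, 3]   # panelists calling each item "essential"
cvrs = [content_validity_ratio(ne, 10) for ne in ratings_essential]
print(cvrs)                     # [0.8, 0.2, -0.4]
print(sum(cvrs) / len(cvrs))    # mean CVR as a rough indicator of overall test content validity
```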

One of the most critical elements in generating a content-valid instrument is conceptually defining the domain of the characteristic. The researcher has to specify what the variable is and what it is not. The task of definition is expedited by examining the literature to determine how the variable has been defined and used. Because it is unlikely that all the definitions will agree, the researcher must specify which elements in the definitions underlie his or her use of the term. The researcher needs to be quite careful to include items from all the relevant dimensions of the variable. Again, a literature search may be productive in indicating the various dimensions or strata of a variable. At this stage, the researcher may wish to include items with slightly different shades of meaning, since the original list of items will be refined to produce the final measure.

The collection of items must be large enough so that, after refinement, the measure still contains enough items to adequately sample each of the variable’s domains. In the example cited previously, a measure of a sales representative’s job satisfaction would need to include items about each of the components of the job if it is to be content valid. The process of refinement, the essence of which is the internal consistency exhibited by the items within the test, is statistical in nature.


Source:
-. Marketing Research, Methodological Foundations, 5th edition, The Dryden Press International Edition, author Gilbert A. Churchill, Jr.
-. http://www.colostate.edu/
-. Wikipedia.com

Monday, May 18, 2009

Pragmatic Validity

The pragmatic approach to validation focuses on the usefulness of the measuring instrument as a predictor of some other characteristic or behavior of the individual; it is thus sometimes called predictive validity or criterion-related validity. Pragmatic validity is ascertained by how well the measure predicts the criterion, be it another characteristic or a specific behavior. An example would be the Graduate Management Admission Test. The fact that this test is required by most major schools of business attests to its pragmatic validity; it has proved to be useful in predicting how well a student with a particular score on the exam will do in an accredited MBA program. The test score is used to predict the criterion of performance. An example involving an attitude scale might be using the scores that sales representatives achieved on an instrument designed to assess their job satisfaction to predict who might quit. The attitude score would again be used to predict a behavior - the likelihood of quitting. Both of these examples illustrate predictive validity in the true sense of the word – that is, use of the score to predict some future occurrence.

Another type of pragmatic validity is concurrent validity. Concurrent validity is concerned with the relationship between the predictor variable and the criterion variable when both are assessed at the same point in time. For example, a pregnancy test administered to a woman to ascertain whether she is pregnant provides an example of concurrent validity. The interest here is not in forecasting whether the woman will become pregnant in the future but in determining whether she is pregnant now.

Concurrent validity is accurate measurement of the current condition or state. Most physical measuring instruments have excellent concurrent validity (e.g., thermometers, weighing scales, oil dipsticks). Most tests of mental ability also have this type of validity, but to a lesser extent.

Empirical evidence used to test validity -> Compare measure to other indicators, pragmatic (criterion) validity
1. Concurrent validity
Does a measure predict a simultaneous criterion?
Validating a new measure by comparing it to an existing measure
E.g., does a new intelligence test correlate with an established test?

2. Predictive validity
Does a measure predict a future criterion?
E.g., SAT scores: Do they predict college GPA?

Pragmatic validity in research looks to a different paradigm than more traditional, positivistic research approaches. It tries to ameliorate problems associated with the rigour-relevance debate, and it is applicable to all kinds of research streams. Simply put, pragmatic validity looks at research from a prescription-driven perspective.

Validity in prescription-driven research is approached in different ways than in descriptive research. The first difference deals with what some researchers call ‘messy situations’ (Brown 1992; Collins, Joseph, and Bielaczyc 2004). A messy situation is a real-life, highly multivariate one in which independent variables can be neither minimized nor completely accounted for.

The phrase “pragmatic validity” was first discussed in Worren, Moore & Elliott (2002), who contrasted it with scientific validity. This idea has been taken up in the management literature to a considerable degree.

Cook (1983) actually questions the validity of causal explanations generated in a context-free setting (the goal of positivistic, explanatory research). Causal relationships in pragmatic research are looked at somewhat differently, which is apparent in the wording alone.

In pragmatic science, the goal is to develop knowledge that can be used to improve a situation. This we can call prescriptive knowledge. Prescriptive knowledge, according to van Aken (2004, 2004b, 2005), can take the form of a technological rule. A technological rule is “...a chunk of general knowledge linking an intervention or artifact with an expected outcome or performance in a certain field of application” (van Aken, 2005: p. 23). Such a rule can be formulated in much the same way as the earlier example of a causal statement: ‘if you perform action X to subject Y, then Z happens’ (note the cause-and-effect formulation).

Pragmatic validity is determined strictly by the correlation between the two measures; if the correlation is high, the measure is said to have pragmatic validity. Pragmatic validity is relatively easy to assess. It requires, to be sure, a reasonably valid measure of the criterion with which the scores on the measuring instrument are to be compared. All that the researcher needs to do is to establish the degree of relationship, usually in the form of some kind of correlation coefficient, between the scores on the measuring instrument and the criterion variable. Although easy to assess, pragmatic validity is rarely the most important kind of validity. We are often concerned with “what the measure in fact measures” rather than simply whether it predicts accurately or not.
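As a minimal sketch of this procedure, the following Python fragment correlates instrument scores with criterion values; the numbers are invented purely to illustrate computing the validity coefficient.

```python
# Sketch of assessing pragmatic validity: correlate instrument scores with the criterion (hypothetical data).
import numpy as np

test_scores = np.array([520, 610, 580, 700, 640, 490, 660, 710, 550, 600])  # predictor scores
criterion   = np.array([3.0, 3.4, 3.2, 3.8, 3.5, 2.8, 3.6, 3.9, 3.1, 3.3])  # e.g., later performance

r = np.corrcoef(test_scores, criterion)[0, 1]
print(f"validity coefficient r = {r:.2f}")  # a high r is taken as evidence of pragmatic validity
```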


Source:
-. Managerial Application of Multivariate Analysis in Marketing, James H. Myers and Gary M. Mullet, 2003, American Marketing Association, Chicago
-. Marketing Research, Methodological Foundations, 5th edition, The Dryden Press International Edition, author Gilbert A. Churchill, Jr.
-. Wikipedia.com

Tuesday, May 12, 2009

Validity of Measurement

The term “valid” means that a question or a scale measures what it is intended to measure. Validity is confidence in measures and design. Physical measurements such as height and weight can be measured reliably (and they are also valid measures of how tall or heavy someone is), but they may not relate in any way to various artistic or athletic achievements. Therefore, they are not valid measurements for those purposes or objectives. We need to ensure that respondent ratings of various kinds are valid as well as reliable. We normally do not know the true score of an object with respect to a given characteristic.

In psychological and educational testing, “Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests”. Although classical models divided the concept into various "validities," such as content validity, criterion validity, and construct validity, the modern view is that validity is a single unitary construct.

Validity also extends to:
Internal Validity: Precision in the design of the study – the ability to isolate causal agents while controlling other factors
External Validity: The ability to generalize from unique and idiosyncratic settings, procedures, and participants to other populations and conditions

Types of Validity
Unfortunately, the concept of validity is not a simple one, because there are several possible meanings of this term. The most common ones include the following:

1. Concurrent validity:
accurate measurement of the current condition or state. Most physical measuring instruments have excellent concurrent validity (e.g., thermometers, weighing scales, oil dipsticks). Most tests of mental ability also have this type of validity, but to a lesser extent.
In concurrent validity, we assess the operationalization's ability to distinguish between groups that it should theoretically be able to distinguish between. For example, if we come up with a way of assessing manic depression, our measure should be able to distinguish between people who are diagnosed with manic depression and those diagnosed with paranoid schizophrenia. If we want to assess the concurrent validity of a new measure of empowerment, we might give the measure to both migrant farm workers and farm owners, theorizing that our measure should show that the farm owners are higher in empowerment. As in any discriminating test, the results are more powerful if you are able to show that you can discriminate between two groups that are very similar.

2. Predictive validity: accurate prediction of some future events, such as success in academic or athletic achievement or future customer purchase or loyalty.
In predictive validity, we assess the operationalization's ability to predict something it should theoretically be able to predict. For instance, we might theorize that a measure of math ability should be able to predict how well a person will do in an engineering-based profession. We could give our measure to experienced engineers and see if there is a high correlation between scores on the measure and their salaries as engineers. A high correlation would provide evidence for predictive validity -- it would show that our measure can correctly predict something that we theoretically think it should be able to predict.

3. Face validity: the measuring scales appear to measure what they are intended to measure.
In face validity, you look at the operationalization and see whether "on its face" it seems like a good translation of the construct. This is probably the weakest way to try to demonstrate construct validity. For instance, you might look at a measure of math ability, read through the questions, and decide that yep, it seems like this is a good measure of math ability (i.e., the label "math ability" seems appropriate for this measure). Or, you might observe a teenage pregnancy prevention program and conclude that, "Yep, this is indeed a teenage pregnancy prevention program." Of course, if this is all you do to assess face validity, it would clearly be weak evidence because it is essentially a subjective judgment call. (Note that just because it is weak evidence doesn't mean that it is wrong. We need to rely on our subjective judgment throughout the research process. It's just that this form of judgment won't be very convincing to others.) We can improve the quality of face validity assessment considerably by making it more systematic. For instance, if you are trying to assess the face validity of a math ability measure, it would be more convincing if you sent the test to a carefully selected sample of experts on math ability testing and they all reported back with the judgment that your measure appears to be a good measure of math ability.

4. Construct validity: accurate measurement of some basic underlying idea or concept.
For example, customer satisfaction measurement involves all four types of validity. Ratings must accurately reflect how customers now feel about various aspects of the company (concurrent validity). They should also enable us to predict future loyalty and customer retention (predictive validity). They must appear to measure factors or aspects that are meaningful and important to company management and employees (face validity). And they should accurately reflect the basic idea of satisfaction (construct validity). This is a tall order, and it means that careful analysis should be done in the early stages of constructing an instrument for measuring customer satisfaction on an ongoing basis.

Different from the types above, Marketing Research, Methodological Foundations, 5th edition, The Dryden Press International Edition, by Gilbert A. Churchill, Jr., divides validity measurement into three types:
1. Pragmatic Validity
2. Content Validity
3. Construct Validity

It is worth repeating the earlier caveat that a measuring instrument cannot be valid unless it is reliable. More specifically, reliability puts an upper limit on validity. Loosely speaking, if a measuring scale is only about 50% reliable, it can be only about 50% valid. This is why it is so important for companies to test their own questionnaires periodically and to make every effort to improve both the reliability and the validity of ratings gathered for any particular purpose.
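One standard way to state this upper limit precisely comes from classical test theory rather than from the sources cited below:

r_xy ≤ sqrt(r_xx * r_yy) ≤ sqrt(r_xx),

where r_xy is the validity (criterion) correlation, r_xx is the reliability of the measure, and r_yy is the reliability of the criterion. Squaring both sides shows that the proportion of criterion variance a scale can explain never exceeds the scale's own reliability, which is the sense in which a 50%-reliable scale can be at most about 50% valid.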

In spite of the importance of both reliability and validity, it is likely that even today most market research studies proceed without giving a thought to either of these topics. This may be due to ignorance, time pressures, budgetary constraints, or other factors. Yet management usually just assumes that the data presented are both accurate and relevant in terms of the project objective. It is the responsibility of research managers and outside suppliers to ensure that, whenever possible, evidence is gathered to support the integrity of research findings.

Example of validity
Many recreational activities of high school students involve driving cars. A researcher, wanting to measure whether recreational activities have a negative effect on grade point average in high school students, might conduct a survey asking how many students drive to school and then attempt to find a correlation between these two factors. Because many students might use their cars for purposes other than or in addition to recreation (e.g., driving to work after school, driving to school rather than walking or taking a bus), this research study might prove invalid. Even if a strong correlation was found between driving and grade point average, driving to school in and of itself would seem to be an invalid measure of recreational activity.


Source:
-. Managerial Application of Multivariate Analysis in Marketing, James H. Myers and Gary M. Mullet, 2003, American Marketing Association, Chicago
-. Marketing Research, Methodological Foundations, 5th edition, The Dryden Press International Edition, author Gilbert A. Churchill, Jr.
-. http://www.colostate.edu/
-. http://www.wikipedia.edu/
-. http://www.socialresearchmethods.net

Wednesday, May 6, 2009

Classification and Assessment of Error

I have taken this topic from Marketing Research, Methodological Foundations, 5th edition, The Dryden Press International Edition, by Gilbert A. Churchill, Jr. The idea in measurement is to generate a score that reflects true differences in the characteristic one is attempting to measure and nothing else. What we in fact obtain is something else. A measurement, call it X0 (for what is observed), can be written as a function of several components:

X0 = Xt + Xs + Xr;

Where:
Xt = the true score of the characteristic being measured
Xs = systematic error
Xr = random error
The total error of measurement is given by the sum of Xs and Xr.

Xs is systematic error, also known as constant error, because it affects the measurement in a constant way. An example would be the measurement of a man’s height with a poorly calibrated wooden yardstick. Differences in other stable characteristics of the individual, which affect the person’s score, are another source of systematic error.

Xr is random error. Random error is not constant but, rather, is due to transient aspects of the person or the measurement situation. Random error manifests itself in the lack of consistency of repeated or equivalent measurements when the measurements are made on the same object or person. An example would be the use of an elastic ruler to measure a man’s height. It is unlikely that on two successive measurements the observer would stretch the elastic ruler to the same degree of tautness, and therefore the two measures would not agree even though the man’s height had not changed. Differences resulting from transient personal factors are an example of this type of error in psychological measurement.
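The decomposition X0 = Xt + Xs + Xr can be illustrated with a toy simulation (my own sketch, not from Churchill's text): a constant systematic error shifts every reading by the same amount, while random error varies from reading to reading and tends to average out.

```python
# Toy simulation of X0 = Xt + Xs + Xr with hypothetical values.
import numpy as np

rng = np.random.default_rng(1)
true_score = 175.0            # Xt: the man's true height in cm
systematic = -2.0             # Xs: a mis-calibrated yardstick reads 2 cm short every time
random_err = rng.normal(0, 1.5, size=5)   # Xr: transient error on each of 5 measurements

observed = true_score + systematic + random_err   # X0 for five repeated measurements
print(observed.round(1))
print("mean observed:", round(observed.mean(), 1))  # still biased by Xs even after averaging
```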

The distinction between systematic error and random error is critical because of the way the validity of a measure is assessed. Validity is synonymous with accuracy or correctness. The validity of a measurement is defined as “the extent to which differences in scores on it reflect true differences among individuals on the characteristic we seek to measure, rather than constant or random errors.” When a measurement is perfectly valid, X0 = Xt, since there is no error.

The problem is to develop measures in which the score we observe and record actually represents the true score of the object on the characteristic we are attempting to measure. This is much harder to do than to say. It is not accomplished simply by making up a set of questions or statements. The relationship between the measured score and the true score is never established unequivocally but is always inferred. There are two bases for such inferences:
1. Direct assessment employing validity
2. Indirect assessment via reliability


Monday, May 4, 2009

Comparison Among the Designs and Methods

Simple random sampling is the basic building block and point of reference for all other designs discussed in this text. However, few large-scale surveys use only simple random sampling, because other designs often provide greater accuracy or efficiency or both. The sample is chosen from the entire population, using a random number generator. Each member of the population has an equal chance of being selected, and the selection of any particular individual does not affect the chances of any other individual being chosen. Choosing the sample randomly reduces the risk that the selected members will not be representative of the whole population. You could select the sample by drawing names randomly or by assigning each member of the population a unique number and then using a random number generator to determine which members to include.
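A minimal sketch of this procedure in Python, using the standard library's random module as the random number generator (the population frame is hypothetical):

```python
# Simple random sampling: every member has an equal chance, no replacement.
import random

population = list(range(1, 1001))          # frame: members numbered 1..1000
sample = random.sample(population, k=50)   # select 50 members at random
print(sorted(sample)[:10])
```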

Stratified random sampling produces estimators with smaller variance than those from simple random sampling, for the same sample size, when the measurements under study are homogeneous within strata but the stratum means vary among themselves. The ideal situation for stratified random sampling is to have all measurements within any one stratum equal but have differences occur as we move from stratum to stratum. Sometimes a population includes groups of members who share common characteristics, such as gender, age, or educational level. Such groups are called strata. A stratified sample has the same proportion of members from each stratum as the population does.

A stratified sample is obtained by taking samples from each stratum or sub-group of a population. When we sample a population with several strata, we generally require that the proportion of each stratum in the sample should be the same as in the population.

Stratified sampling techniques are generally used when the population is heterogeneous, or dissimilar, where certain homogeneous, or similar, sub-populations can be isolated (strata). Simple random sampling is most appropriate when the entire population from which the sample is taken is homogeneous. Some reasons for using stratified sampling over simple random sampling are:

a) the cost per observation in the survey may be reduced;
b) estimates of the population parameters may be wanted for each sub-population;
c) increased accuracy at a given cost.

Example
Suppose a farmer wishes to work out the average milk yield of each cow type in his herd which consists of Ayrshire, Friesian, Galloway and Jersey cows. He could divide up his herd into the four sub-groups and take samples from these.
(Definition and example taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)
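A rough sketch of proportional stratified sampling along the lines of the cow example above, with invented herd sizes: each stratum is sampled in proportion to its share of the population.

```python
# Proportional stratified random sampling (hypothetical herd sizes).
import random

herd = {"Ayrshire": 40, "Friesian": 100, "Galloway": 30, "Jersey": 30}  # stratum sizes
total = sum(herd.values())
sample_size = 20

for breed, size in herd.items():
    n_stratum = round(sample_size * size / total)        # proportional allocation
    cows = [f"{breed}-{i}" for i in range(1, size + 1)]  # frame for this stratum
    chosen = random.sample(cows, n_stratum)              # simple random sample within the stratum
    print(breed, n_stratum, chosen[:3])
```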

Systematic sampling is used most often simply as a convenience; it is relatively easy to carry out. But this form of sampling may actually be better than simple random sampling, in terms of bounds on the error of estimation, if the correlation between pairs of elements within the same systematic sample is negative. The stratified and the systematic sample both force the sampling to be carried out along the whole set of data, but the stratified design offers more random selection and often produces a smaller bound on the error of estimation. A random starting point is chosen, using a random number generator. The sample is then chosen by going through the population sequentially and selecting members at regular intervals, e.g., every fifth person. The sample size and the population size determine the sampling interval:
Interval = population size / sample size,
for example, if you wanted the sample to be a tenth of the population, you would select every tenth member of the population, starting with one chosen randomly from among the first ten in the sequence.
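A small sketch of systematic selection with a hypothetical frame of 100 members and a sample of 10:

```python
# Systematic sampling: random start within the first interval, then every k-th member.
import random

population = list(range(1, 101))            # members numbered 1..100
sample_size = 10
k = len(population) // sample_size          # sampling interval = population size / sample size

start = random.randint(0, k - 1)            # random starting point within the first interval
sample = population[start::k]               # take every k-th member from the start
print(sample)
```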

Stratified Cluster Sampling
• Combines elements of stratification and clustering
• First you define the clusters
• Then you group the clusters into strata of clusters, putting similar clusters together in a stratum
• Then you randomly pick one (or more) cluster from each of the strata of clusters
• Then you sample the subjects within the sampled clusters (either all the subjects, or a simple random sample of them)

Cluster sampling is generally employed because of cost effectiveness or because no adequate frame for elements is available. However, cluster sampling may be better than either simple or stratified random sampling if the measurements within clusters are heterogeneous and the cluster means are nearly equal. This condition is in contrast to that for stratified random sampling in which strata are to be homogeneous but stratum means are to differ.


Stratification vs. Clustering

Stratification
• Divide population into groups different from each other: sexes, races, ages
• Sample randomly from each group
• Less error compared to simple random
• More expensive to obtain stratification information before sampling
The population is divided into groups that share a common characteristic. From each group a simple random sample of the members is taken. The size of each sample from each group is proportional to the size of each group. There may often be factors which divide up the population into sub-populations (groups / strata) and we may expect the measurement of interest to vary among the different sub-populations. This has to be accounted for when we select a sample from the population in order that we obtain a sample that is representative of the population. This is achieved by stratified sampling.

Clustering
• Divide population into comparable groups: schools, cities
• Randomly sample some of the groups
• More error compared to simple random
• Reduces costs to sample only some areas or organizations
The population is divided into groups. A random sample of groups is chosen. All members from the chosen groups are surveyed.

Multi-stage Random Sampling
The population is organized into groups. A random sample of groups is chosen. From each group a random sample is chosen. This method uses several levels of random sampling. Multi-stage sampling is like cluster sampling, but involves selecting a sample within each chosen cluster, rather than including all units in the cluster. Thus, multi-stage sampling involves selecting a sample in at least two stages.
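A minimal two-stage sketch of this idea (hypothetical schools and students): clusters are sampled first, then units are sampled within each chosen cluster.

```python
# Multi-stage (two-stage) sampling: sample clusters, then sample units within each chosen cluster.
import random

# Hypothetical frame: 10 schools, each with 30 students
schools = {f"school_{s}": [f"student_{s}_{i}" for i in range(1, 31)] for s in range(1, 11)}

chosen_schools = random.sample(list(schools), k=3)        # stage 1: sample clusters
for school in chosen_schools:
    students = random.sample(schools[school], k=5)        # stage 2: sample within the cluster
    print(school, students)
```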


Source:
-. Richard L. Scheaffer, William Mendenhall, and Lyman Ott, Elementary Survey Sampling, 4th edition, PWS-Kent Publishing Company, Boston, 1990
-. Mugo Fridah W, Sampling in Research
-. SamplingBigSlides.pdf
-. St. Paul Mathematics Department