**Education Development Center, Inc.
Center for Children and Technology**

*Prepared by:*

Chip Bruce

Scientists once sought a deterministic understanding of phenomena,
one which had no place for variability and uncertainty. Today,
across fields as diverse as quantum mechanics, genetics, epidemiology,
cognitive psychology, education, economics, and astrophysics,
scientists not only expect stochastic processes, but incorporate
probabilistic and statistical concepts into their theories. This
change in science has been called the "probabilistic revolution"
(Gigerenzer & Murray, 1987). As a result of the probabilistic
revolution, statistical reasoning has become indispensable for
interpreting scientific statements, making inferences, and engaging
in scientific inquiry. Similarly, the everyday world, as represented
in the daily newspaper, is one which demands statistical literacy.
In order to understand environmental hazards, economic conditions,
tests of new drugs, or political surveys, the reader must be able
to assess quantitative data in terms of variability, sample size,
bias, measures of central tendency, and other statistical concepts.

Recognizing the growing importance of statistics, educators added
data analysis, probability, and statistics to the mathematics
curriculum. The National Council of Teachers of Mathematics curriculum
and evaluation standards (NCTM, 1987) call for statistics and
probability in all grades, K-12, with particular emphasis on
data exploration, analysis, and interpretation. Supporting this
call, a joint committee of the American Statistical Association
and NCTM has developed *Quantitative Literacy (QL)* (Landwehr
& Watkins, 1987; Landwehr, Swift & Watkins, 1987; Newman,
Obremski & Scheaffer, 1987; Gnanadesikan, Scheaffer &
Swift, 1987), a set of materials on statistics and probability
for middle school students.

At the core of statistical reasoning lies an understanding of
sampling processes. In order to make inferences about a population,
students must understand what information a sample contains and
what it can or cannot reveal about a population. But sampling
can be complex and difficult to understand. For many students,
it may be the first time they are asked to think of the world
in terms of estimates and probabilities rather than in terms of
knowable, quantifiable facts. The ability to conceptualize a problem
as a question of confidence in a method rather than as a question
of identifying the appropriate formula for calculation requires
students to revise their mental models of mathematics in basic
ways. This revision is one reason why statistics is difficult
to learn. While many students may be able to state useful definitions
for "sample" and "population" or manipulate
the formula for a confidence interval, they exhibit confusion
about the conceptual bases of statistical inference, even after
completion of a course on statistical reasoning. We chose sampling
as one focus within the *Reasoning Under Uncertainty (RUU)* project
(Rubin, Bruce, Conant, DuMouchel, Goodman, Horwitz, Lee, Mesard,
Pringle, Rosebery, Snyder, Tenney, & Warren, 1990; Rubin,
Rosebery, & Bruce, 1988) because of its importance within
statistical reasoning and because of the difficulty many students
have in mastering basic sampling concepts.

This report documents a program (*Sampling Laboratory*) with
which students can explore the processes of sampling and making
inferences from samples. It also describes a curriculum built
around the *Sampling Laboratory*, a field test of its use
in high school classrooms, and studies of the learning of statistical
reasoning related to sampling. The report is intended as a tool for
those interested in the teaching and learning of reasoning from
samples and, in particular, in the *Sampling Laboratory*.

Section 1 presents background on the *Sampling Laboratory*, including
previous research on the learning of statistical reasoning and
earlier curricula, such as *QL* and *RUU*. The *Sampling
Laboratory* software and information on how to use it are described
in Section 2. A module for teaching about sampling as realized
in high school classrooms is presented in Section 3. Results of
the implementation of this module, including a study of students'
learning of statistical concepts, are given in Section 4. Future
directions are discussed in Section 5.

**1 Background**

Most of the research on statistical reasoning has compared student
models for statistical reasoning with what we call the "standard
model" of statistical reasoning. Although there is an active
debate among statisticians (see for example the historical analysis
in Gigerenzer & Murray, 1987) on underlying inference models,
there is general agreement about the central aspects of statistically based
reasoning. The "standard model" runs as follows: A set
of data can be represented pictorially in a number of ways. In
particular, a sample can be represented as a histogram. The relationship
between a sample and its pictorial representation is notational,
or definitional. Thus, one can talk about (a) correctness - does
the picture accurately represent the data according to the definition
of the graph type? (b) usefulness - within the allowable parameters
(e.g., bin size), does the picture show the data in a clear and
productive way for some purpose?

The relationship between a sample and a population is one of contingent
similarity; that is, if the sample is unbiased, it will tend to
have similar shape, spread, central tendency, etc. to the population,
and this tendency will be greater for larger samples. Thus, the
appropriate value terms are (a) randomness - is the sample in
fact unbiased with respect to the population of interest? and
(b) goodness - is the sample large enough to merit the appropriate
level of confidence in any conclusions drawn from it?

Some other aspects of the standard model are the following:

(a) There is a clear separation between the real world, conceptualized through the sample and the population, and abstractions, such as graphical representations, measures of central tendency, confidence judgments, etc.

(b) Samples are not more or less "right." It is not "wrong" to have a sample that looks very unlike the population. Following good statistical practice in drawing a sample does not ensure that the sample will look like the population; it merely allows one to specify precisely the likelihood of similarity.

(c) One expects samples to vary, not because of bad design or bias (although these can have a large effect), but because of randomness inherent in the sampling process.

(d) A histogram is supposed to *represent* a sample accurately;
a sample is supposed to *represent* a population. These representations
have radically different epistemological status. The first is definitional;
the second is probabilistic.

(e) Reliability of estimation by sampling is dependent on sample size, but relatively independent of population size.

(f) The size of a *confidence interval* for a given sample
and a given confidence level is directly proportional to the sample
spread, increases with the confidence level, and is inversely
proportional to the square root of the *sample size*.

(g) The process of inferring population parameters from sample
statistics is critically dependent upon the assumption that the
sample is *unbiased*.
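Items (e) and (f) can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the textbook normal-approximation interval for a proportion, which is not necessarily the procedure used in the curricula discussed in this report, and the function name `proportion_ci_half_width` is hypothetical.

```python
import math
from statistics import NormalDist


def proportion_ci_half_width(p_hat, n, confidence=0.90):
    """Approximate half-width of a confidence interval for a proportion.

    Assumes the normal approximation: z * sqrt(p_hat * (1 - p_hat) / n).
    """
    # Critical value for a two-sided interval at the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # sqrt(p_hat * (1 - p_hat)) plays the role of the sample spread.
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Each fourfold increase in sample size halves the interval.
for n in (10, 40, 160):
    print(n, round(proportion_ci_half_width(0.3, n), 3))
```

Population size does not appear anywhere in the calculation, which is the sense of item (e). The half-width grows with the confidence level through the critical value `z`, but not in direct proportion to it.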

**1.1 Research on Statistical Reasoning**

In this section we mention some of the background
research for our work on the *Sampling Laboratory*; we do not
attempt a complete review of research on statistical
reasoning. Many researchers have studied people's statistical
heuristics and judgments of subjective probability. One heuristic
proposed is the use of *representativeness* (Kahneman &
Tversky, 1974) as a measure of the likelihood of a sample being
drawn from a population. Kahneman and Tversky define representativeness
for a sample as "the degree to which it is (i) similar in
essential properties to its parent population and (ii) reflects
the salient features of the process by which it is generated"
(p. 431). If the sample is unordered, this definition reduces
to saying that the closer the sample statistic is to the population
parameter, the more representative the sample. Bar-Hillel (1982)
added the notion of "accuracy" to that of representativeness.
In her experiments, subjects described as "accurate"
those samples whose sample statistic exactly matched the population
parameter.

Kahneman and Tversky (1982) also found that people use an *availability*
heuristic to judge the frequency of a sample "by assessing
the ease with which the relevant mental operation of retrieval,
construction, or association can be carried out" (p. 164).
In examples where availability guided people's thinking, they
found, the problem was often stated so that the mechanism of constructing
the sample was emphasized, rather than its final composition.
Thus, there is some evidence from their research that the way
a problem is stated influences the representation subjects use
to explore it. Finally, research has shown that the sequence of
questions can influence final estimates because initial estimates
tend to have an effect on the entire sequence of answers subjects
offer. Slovic and Lichtenstein (1971) report that subjects often
construct a final estimate by small adjustments from their initial
estimate; the implication is that different orders of questions
can influence subjects to follow different reasoning routes and
arrive at different answers.

Rubin, Bruce, and Tenney (1990) report a study of students reasoning
from samples, which shows students struggling with dual aspects
of the central idea of statistical inference: that a sample gives
us *some* information about a population - not nothing, not
everything, but something. In practice, this allows us to put
bounds on the value of a characteristic of the population - usually
either a proportion or a measure of center (mean or median), but
not to know precisely what that characteristic is.

Under this view, *sample representativeness* is the
idea that a sample taken from a population will have characteristics
similar to those of its parent population. Thus, the proportion
of girls in a classroom is likely to be close to the proportion
of girls in the entire school. *Sample variability* is
the contrasting idea that samples from a single population
are not all the same and thus do not all match the population.
Thus, some classrooms in a school are likely to have many more
girls than boys, even if the school population is evenly divided.

One of the keys to mastering statistical inference is balancing
these two ideas, interpreting more precisely the meaning of "likely"
in each. Because they are contradictory when seen in a deterministic
framework, students may over-respond to one or the other depending
on the context. Over-reliance on sample representativeness is
likely to lead to the notion that a sample tells us *everything*
about a population; over-reliance on sample variability implies
that a sample tells us *nothing*. Finding the appropriate
point on the continuum between the two extremes is complex and
needs to take into account confidence level, population variance
and sample size. For a given confidence level and population variance,
the effect of sample size relates closely to the representativeness/variability
continuum: the larger the sample, the more likely it is to be
representative of the population. Smaller samples are more likely
to vary.
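The classroom example can be quantified. The following sketch assumes that each class is a simple random sample from a large, evenly divided school (an idealization, since real classes are not formed this way) and uses the exact binomial distribution. The function name and the 10-percentage-point tolerance are illustrative choices, not taken from the study.

```python
from math import comb


def prob_far_from_population(n, p=0.5, tolerance=0.1):
    """Exact probability that a sample proportion misses p by more than tolerance."""
    total = 0.0
    for k in range(n + 1):
        # Small epsilon so that a miss of exactly `tolerance` is not counted.
        if abs(k / n - p) > tolerance + 1e-12:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


# Chance that a class is more than 60% girls or more than 60% boys.
for class_size in (10, 30, 100):
    print(class_size, round(prob_far_from_population(class_size), 4))
```

Under these assumptions a lopsided class of 10 is common, while a lopsided group of 100 is rare: both representativeness and variability hold, and sample size determines which one dominates.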

The analysis of student responses indicated that most students
have inconsistent models of the relationship between samples and
populations, even for problems in which the underlying mathematical
models are isomorphic. In some situations, the notions of sample
representativeness hold sway; in others, those of sample variability
do. Sample size does not seem to operate appropriately to separate
the two; in fact, of the three problems analyzed, sample representativeness
appears to be a stronger guiding factor in the problem with the
smallest sample size.

In related work, Snyder (1989) conducted an ethnography of a high
school classroom, focusing on student learning in a statistics
course (*Reasoning Under Uncertainty*; see next section).
He found that students did not pick up the connection between
science and statistical reasoning, which is prominent in experts'
discussion of statistics. In interviews, students were unanimous
in noting scant use of statistical concepts in their science courses.
Some students came away with the idea that the world is full of
fuzzy variables, and that statistical techniques are often helpless
in the face of this chaos.

This course required students to go beyond manipulating equations
to make connections between mathematics and the real world, and
to grasp underlying concepts such as the distinction between populations
and samples. Generally speaking, the students had little trouble
mastering the few equations in the course, but had difficulty
with conceptual distinctions. They were particularly confused
about when to apply the formula for proportions versus the formula
for means. Students were hampered in their conceptual grasp of
sampling by not being clear about the status of unknown population
parameters and about how randomness produces variability, even
in unbiased samples.

In a study reported in Rubin, et al. (1990), Bruce and Snyder
interviewed all the students in a high school statistics class.
They also interviewed two teachers, a statistician, a physicist,
a demographer, two experimental psychologists, and a computer
scientist. They organized the interview around a problem that
appears, in various forms, among the problems and examples in Moore (1985). It was
selected because it was complex enough to elicit a variety of
responses and because it called for interpretation and policy
judgments that went beyond simple calculations (see Appendix E).

Analyses of the interviews revealed that students had strengths
in several areas, especially in the area of descriptive statistics.
They also had difficulties in several areas related to more inferential
thinking, a few of which we mention below:

*Sample = population.* One problem reflects what appears
to be a conflation of the standard model for inference so that
the relation between sample and population is almost an equality
relation. Thus, several students said that a sample is supposed
to *represent* the population; further questioning revealed
that they meant that it was supposed to look like the population
in terms of location, shape and spread. To the extent that it
did not, students thought the person doing the sampling had made
an error. They did not distinguish sample-to-population *representation*
from that in the statement: The histogram is supposed to *represent*
the sample.

*Sampling variability.* Related to the first idea is students'
notion that samples should not vary. If the work is done correctly,
they think, there should be no sampling error. Here we may be
seeing confusion in part traceable to the unfortunate choice of
"error" as the term within standard statistics for the
effect of random sampling variation. Other problematic terms are
"normal," "bias," "random," "standard,"
"population," "individual," and "confidence."

*Data and process.* A striking difference between
most of the students and some of the adult experts was that the
students rarely asked questions about the processes that generated
the data set. This may simply reflect the social setting of the
interview and the outside-adult/student relationship. But we suspect
that students considered a problem in statistics to be complete
as stated; there was no need to ask further questions, or to know
the underlying process. It is noteworthy that even an excellent
text, such as Moore (1985), tends to present many short problems,
so that no problem is presented with much detail on the domain
of study (in these interviews, milk production). Other texts (such
as Tanur, Mosteller, Kruskal, Link, Pieters, & Rising, 1989),
contain longer case studies, but tend to be used as supplementary
materials. Although students often carry out surveys of their
own in class, they do not generalize their insights about the
importance of the details of data collection to problems from
a textbook.

*Normality and niceness.* Many students seemed to equate
normality, as in normal distribution, with perfection or niceness.
Their goal was to have a nice-looking picture, but often the picture
wasn't nice because it was difficult to do work perfectly: "it's
hard to get everything just right." A related idea was that
statistical work (the survey, the calculations) should be done
correctly "to show you've learned it." This desire conflicted
with the indeterminacy inherent in sampling.

*Randomness.* As one might expect, students had difficulty
with the concept of randomness. In some cases it seemed
to be equated with fairness.

*Explanation and persuasion.* Perhaps the most disturbing
outcome of the interviews was that a number of students seemed
to interpret the question about explaining complex statistical
ideas to the public as "how could you distort the statistical
analysis to mislead the public?" Thus they saw the purpose
of explanation to be persuasion. In this case, they saw the job
of the health official to be reassuring the public, no matter
what the data showed.

*Statistics and the real world.* Statistics is about using
mathematical concepts in relation to real world data and important
questions. But the interviews showed that the surrounding school
context did not support this message. Despite the assertion that
statistics was important in science, social studies, and humanities
areas, students saw only trivial instances of statistical reasoning
in their other courses, e.g., a mention of "means" in
a science class. One student said she signed up for the statistics
course because it was "something different to take...it's
not like it comes up in everyday life...not in my math courses
or anything."

**1.2 Curricula**

The *Sampling Laboratory* builds upon previous curricular
work on sampling, in particular, *Quantitative Literacy* and
*Reasoning Under Uncertainty*. *QL* "is an introduction
to statistics. In addition to learning the most up-to-date statistical
techniques,...students...get practice in division, percents, ratios,
ordering numbers, and many other topics in arithmetic.

Familiar statistical concepts such as reading tables, the mean
(average), and scatter plots are included as well as less familiar
ones such as the median, stem-and-leaf plots, box plots, and smoothing.
All of these techniques are part of a new emphasis in statistics
referred to as data analysis (or sometimes as exploratory data
analysis - EDA). The techniques of data analysis are easy to do
and are often graphical. They can reveal interesting patterns
and features in the data.

The techniques in *QL* encourage students to ask questions
and generate hypotheses about the data. This is an important part
of data analysis. By using these methods students will be able
to interpret data that are interesting and important to them."
(Landwehr, Swift, & Watkins, 1984, introduction page).

The objective of the *Reasoning Under Uncertainty* project
has been to develop and test a computer-supported environment
in which high school students learn how to think in probabilistic
and statistical terms. The central ideas are to use the computer
as a tool for data gathering, manipulation, and display, and to
have students investigate questions that are meaningful to them.
In contrast to the usual emphasis in statistics courses on formulas
and computational procedures, *RUU* emphasizes reasoning
about statistical problems. The students should be able to engage
in statistical reasoning about uncertainties that either they
or society face. Such a course conforms well to the National Science
Board's suggestion that "elementary statistics and probability
should now be considered fundamental for all high school students."

To facilitate involving the students in statistical and probabilistic
thinking, the computer-supported environment provides a series
of data sets that the students explore for meaning in terms of
statistical principles. Data sets that interest them (for example,
on sports, health issues, and social trends) promote functional
learning via activities that can serve the students' own goals.
In this setting, students can discover and construct their knowledge
via participatory, experimental learning.

*RUU* is a semester-long course with four modules, each with
several units:

1 Describing Groups

1.1 What do statistical questions and answers look like?

1.2 Measures of central tendency

1.3 Understanding variability

2 Answering Questions--Sampling from Groups

2.1 Why sample?

2.2 Confidence

3 Making Comparisons

3.1 Asking statistical questions: Collecting data through surveys and experiments

3.2 Answering statistical questions: Visualizing and analyzing data

4 Understanding Relationships

4.1 Answering questions about multivariate data

4.2 Making predictions

4.3 Association versus causation

4.4 Newspaper stories (optional unit)

The primary software used in the curriculum is *ELASTIC*, a
statistical spreadsheet that handles both categorical and numerical
variables. Users can easily display summary statistics, build
new variables from those that were already defined, and create
histograms, bar graphs, scatter plots and box plots. It also allows
the student to look at subsets of data - for example, all females
earning over $30,000, or all boys under five feet tall, or all
maze times under two minutes - and to create any of these graphs
for a selected subset as well as for a whole variable. *ELASTIC*
also includes two exploratory environments:

*Stretchy Histograms* allows students to create and manipulate
distributions interactively. Measures of location and variability (mean,
median, and quartiles) change dynamically as the distribution
is modified.

*Shifty Lines* provides an environment for experimenting
with lines on scatter plots. A potential best-fit line can be
moved around the screen while a scale records how it differs from
the best possible fit. The software also allows students to identify
particular points on their scatter plot and to investigate how
a regression line would change if points were deleted from the
data set.

Module 2 makes use of a precursor of the *Sampling Laboratory*,
called *Sampler*, a program in which students can explore
the behavior of multiple samples drawn from a single population.
Experiments using *Sampler* can illuminate relations among
sample size, number of samples, and confidence limits of inferences
about the underlying population.

Ethnographies of *Reasoning Under Uncertainty (RUU)* classrooms,
which involved systematic observation of many class sessions and
focal interviews with teachers, students and administrators, are
described in Page (1989), Rubin, et al. (1990), and Snyder (1989).
One of these (Page, 1989) focused on implementation questions,
examining supports within the school for implementing *RUU*.
Its key conclusions are as follows:

The use of the computer played an important role in fostering
student-centered learning. Working in pairs was effective and
seemed to lead to greater understanding.

There was an apparent absence of competition among students in
the classroom, and an unexpected presence of competition among
the faculty, in the area of recruiting students for elective courses.
*RUU* was significant in this regard because the use of the computer
and the data collection activities made the course attractive
to students.

The introduction of the innovation (*RUU*) was "relatively
easy...[It] was right out of a textbook: Proper training, proper
planning, an established course, a teacher with the right abilities
and attitude. The results are excellent, from all indications."

**2 Sampling Laboratory**

Based on the research sketched above, we identified a set of goals
for teaching about sampling processes. A major decision, based
on the Snyder and Bruce research reported in Rubin, et al. (1990)
as well as the *Quantitative Literacy* curriculum, was to
focus on estimates of population proportions. Within that area
we have identified a set of basic concepts related to sampling
that we would like our activities to support. Although they are
stated here in terms of estimating population proportions, they
extend easily to other population parameters such as median or
mean:

(a) In general, you cannot calculate a population proportion directly:
the population is too large; it costs too much to take measurements;
or you don't have access to all the individuals in the population.

(b) A sample is not the same as a population, but it can give
you some information about a population.

(c) Randomly-chosen samples vary considerably, especially with
a small sample size. Thus the sample proportion you get is likely
to be different from the population proportion.

(d) This variation occurs even if you are careful to avoid bias
in choosing a sample. It is a consequence of randomness in the
sampling process, not of human error.

(e) Although the sample proportion may vary from the population
proportion, it varies in a predictable way. Samples are more likely
to be similar to the population than very different from it. If
you could look at many samples from the same population, you would
see that the greatest number of samples would have a proportion at or
near that of the population, and the further from the population proportion
you looked, the fewer samples you'd see. Thus, despite sampling
variability, a random sample can be used to make a reasonable
estimate of a population proportion.

(f) The goodness of the estimate is directly dependent upon the
size of the sample. As the sample size increases, it is less and
less likely that the sample proportion will differ greatly from
the population proportion.

(g) This sample size effect is non-linear. (The size of the confidence
interval is inversely proportional to the square root of the sample
size.)
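Goals (c), (e), (f), and (g) can be seen together in a short simulation. This is a minimal sketch, not the *Sampling Laboratory* itself: the function name, the seed, and the choice of 4000 samples per run are illustrative assumptions, and each individual is drawn independently, as if from a very large population.

```python
import random
from statistics import mean, pstdev


def simulate_sample_proportions(p, sample_size, num_samples, rng):
    """Draw num_samples random samples and return each sample's proportion."""
    proportions = []
    for _ in range(num_samples):
        # Count the individuals in this sample that fall in the category of interest.
        hits = sum(1 for _ in range(sample_size) if rng.random() < p)
        proportions.append(hits / sample_size)
    return proportions


rng = random.Random(1990)  # fixed seed so the run is repeatable
for n in (10, 40, 160):
    props = simulate_sample_proportions(0.2, n, 4000, rng)
    print(f"n={n:3d}  mean={mean(props):.3f}  spread (SD)={pstdev(props):.3f}")
```

The sample proportions center on the population proportion at every sample size, while the spread shrinks by roughly half each time the sample size is quadrupled. That is the non-linear effect described in (g).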

The *Sampling Laboratory*, which runs on a Macintosh Plus,
addresses most of these goals. It has most of the functionality
of our original *Sampler*, although it is restricted to proportions.

**2.1 Special Features of the Sampling Laboratory**

The *Sampling Laboratory* supports the following:

*Concrete representation of samples.* The *Sampling Laboratory*
uses icons to represent each individual in a small sample
in order to emphasize the difference between samples and populations,
which are represented as histograms. We are working on other concrete
representation ideas for populations, samples and the sampling
process, including one based on sampling by specifying a region
in a space of colored pixels and one based on sampling from a
pipeline in which individuals are produced temporally (see Figure
6).

*Relationship among populations, samples, and sampling distribution.*
In an earlier program (*Sampler*), students had
trouble understanding how the histogram of the sampling distribution
was derived from the individual samples. The *Sampling Laboratory*
indicates the relationship for each sample by flashing lines
on both the sample and sampling distribution graphs, pointing
out the correspondence (see Figure 6).

*Comparison of distributions of sample proportions.* The
*Sampling Laboratory* allows students to compare and contrast
sets of samples from different populations or sets of different
size samples from the same population (see Figure 7, below). Students
can re-examine a sampling distribution produced earlier or compare
it to a later distribution.

*Separation between setup and run.* Students or teachers
can set up a sampling process in terms of population, sample size,
and number of samples, but postpone the actual production of the
sampling distribution. This allows activities such as drawing
10 samples of sizes 10, 20, 40, and 80 from a population as a
single operation (see Figure 8).

*Box plot summaries of sampling distributions.* The *Sampling
Laboratory* can display a box plot summary for a sampling distribution
and allows students to set the percentage of samples contained
in the box (see Figure 9).

*Confidence intervals and summary window.* A summary window
displays a set of box plots representing many sampling distributions.
From this chart of box plots, the software can then display a
confidence interval for any sample
proportion (see Figure 10). This approach follows that taken in
Landwehr, Swift, & Watkins (1987).
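The idea of reading a confidence interval off a chart of box plots can be sketched in code. This is a hedged illustration of the general technique, not the program's actual algorithm: it replaces simulated sampling distributions with exact binomial probabilities, and the function names, the 90% default, and the grid of candidate proportions are all assumptions made for the example.

```python
from math import comb


def central_box(n, p, coverage=0.90):
    """Counts (low, high) bounding at least the central `coverage` share of samples."""
    tail = (1 - coverage) / 2
    cumulative = 0.0
    low = None
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p) ** (n - k)
        if low is None and cumulative > tail:
            low = k  # first count not cut off by the lower tail
        if cumulative >= 1 - tail:
            return low, k  # first count that reaches the upper tail
    return low, n  # guard against floating-point shortfall


def interval_from_chart(observed_count, n, candidate_proportions, coverage=0.90):
    """Range of candidate population proportions whose box contains the observation."""
    plausible = []
    for p in candidate_proportions:
        low, high = central_box(n, p, coverage)
        if low <= observed_count <= high:
            plausible.append(p)
    return (min(plausible), max(plausible)) if plausible else None


# Hypothetical observation: 6 red in a sample of 20, checked against a 5% grid.
candidates = [i / 20 for i in range(1, 20)]
print(interval_from_chart(6, 20, candidates))
```

Each candidate population contributes one box; the interval is the set of populations for which the observed sample would have been a likely one.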

**2.2 Using the Program**

**2.2.1 Data Sets**

Figure 1 shows the opening screen for the *Sampling Laboratory*.
The user can open an existing data set or create a new one.

**Figure 1**

**2.2.2 Objects**

The *Sampling Laboratory* allows a student to create any
number of *objects*, which can be used to construct *populations*.
Each object type has one or more possible *categories*. The
screen for creation of an object type is shown in Figure 2. In
the example, objects of type *M&M's* are defined as having
the categories, *red, brown, yellow, green, tan* and *orange*.
Objects of type *voters* might have the categories, *Bush,
Dukakis*, and *undecided*.

**Figure 2**

Figure 3 shows two object types and the associated categories.
The user is focusing on the Voters object type. The categories
for each object type are shown on the right in a scrollable list.

**Figure 3**

**2.2.3 Populations**

After selecting a particular object, students can define different
*populations* by assigning a set of weights to the categories,
and optionally assigning a name to the population. This is done
by typing in weights or percentages, or by selecting *uniform*.
In the example below (Figure 4), the student has selected the
object type, M&M's, has labeled the population being created,
"30% red," and is setting weights for each category.

**Figure 4**

**2.2.4 Experiments**

For each population so defined, the student can run *experiments*.
These are not full experiments in the sense of experimental design;
an experiment here is the production of a sampling distribution from
a specified population. To set up an experiment, the student sets a sample
size and a number of samples to be drawn. A comment of any length
can be added to describe the experiment further. The experiment
can be run immediately or at any later time. After setting up
and running several experiments, the student has a computer record
of these experiments. In the example below (Figure 5), the student
has run and commented on two experiments, one for a population
with 20% red M&M's and one for a population with 30%.

**Figure 5**. Experiments on two populations of M&M's.

When the student decides to run an experiment, three windows are
shown. One shows the population, the second shows the sample,
and the third shows the sampling distribution from all the samples
taken in that experiment. This third window thus approximates
the set of likely samples from the given population. The samples
can be drawn, as in the original *Sampler*, in either a step-by-step
mode or a continuous run mode, which the student can PAUSE at
any time.
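The structure of an experiment (a population defined by category weights, a sample size, a number of samples, and a focus category) can be sketched as follows. This is an illustrative reconstruction, not the program's code: `run_experiment` is a hypothetical name, the weights other than 20% red are invented for the example, and sampling is done with replacement, which treats the population as effectively unlimited.

```python
import random
from collections import Counter


def run_experiment(weights, focus, sample_size, num_samples, seed=None):
    """Return a tally of focus-category proportions across the samples drawn."""
    rng = random.Random(seed)
    categories = list(weights)
    category_weights = [weights[c] for c in categories]
    tally = Counter()
    for _ in range(num_samples):
        # One sample: sample_size individuals drawn according to the weights.
        sample = rng.choices(categories, weights=category_weights, k=sample_size)
        proportion = sample.count(focus) / sample_size
        tally[proportion] += 1
    return tally


# Hypothetical population with 20% red; the remaining weights are made up.
mms = {"red": 20, "brown": 30, "yellow": 20, "green": 10, "tan": 10, "orange": 10}
distribution = run_experiment(mms, focus="red", sample_size=10, num_samples=20, seed=1)

# Text histogram of the sampling distribution for the focus category.
for proportion in sorted(distribution):
    print(f"{proportion:4.0%} {'#' * distribution[proportion]}")
```

Changing `focus` to another category corresponds to changing the focus category in the program; only the tally changes, since each sample contains every category.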

In Figure 6, the process has been interrupted after 19 of 210
samples have been taken. The 19th sample is shown in the upper
right. It has 10% red M&M's, even though the population percentage
is 20%. The sampling distribution in the lower left shows a large
spread, which one expects from the small sample size (10). It
does seem to be centered near the population percentage. In addition
to continuing the simulation at this point, the student could
also investigate the pattern of proportions in the other categories
(green, orange, etc.) by changing the focus category.

**Figure 6. Viewing the sampling process**

Sampling distributions can then be compared. In Figure 7, the
student is comparing two distributions of 20 samples each, one
drawn from a population with 20% red M&M's and one drawn from
a population with 30% red M&M's. In the first case, the modal
column for the sampling distribution is at 30%, even though the
population percentage is 20%. In the second case, the 20% and
30% columns have the same height, and the population percentage
is 30%. Thus, the spreads are not clearly distinguishable. This
is not too surprising. With a small sample size (10) and population
proportions that are close together (20% and 30%), one would need
to see a large number of samples to have a sharp distinction between
the two sampling distributions.

**Figure 7. Comparing two sampling distributions**

**2.2.6 Comparing Sampling Distribution Box Plots**

Another way to compare sampling distributions is to compare box
plot summaries for the distributions. Each box plot is a summary
representation for the set of samples produced in a *Sampling
Laboratory* experiment for a given population proportion. This
distribution approximates the theoretical "likely sample
set."

The *Sampling Laboratory* allows the student to set the percentage
of sample proportions to be included within the box, with the remaining
sample proportions represented by the whiskers. A large
number of experiments can then be compared easily. In Figure 8,
a student has set up five experiments, to look at samples from
populations with proportions ranging from 10% to 50%. In this
example, each of the experiments has already been run with 20
samples of size 10 each. The comment column shows that the actual
sample proportions are close to but not identical with the population
proportions.

**Figure 8. Five experiments on populations of M&M's.**

Choosing the "Show Box Plots" button produces a single
display of the box plots for the five populations (Figure 9).
In this case, the student has set the box plot percentage to be
90%, meaning that the box must include at least 90% of the sample
proportions. There are toggle controls for indicating the actual
percentage of samples included within each box and each whisker,
and for showing or hiding the whiskers.

This sort of display is the type used in Landwehr, Swift, &
Watkins (1987). For each population proportion (reading along
the y-axis), it shows the set of samples produced in the experiment.
This approximates the set of theoretically likely samples from
the population. Reading up from a point on the x-axis, one can
see the populations whose likely sample sets include a given sample
proportion. For example, an actual sample proportion of 15% which
might be produced by collecting real data falls within the box
plots of 10%, 20%, and 30% populations, in this example. Thus,
an approximation to the theoretical 90% confidence interval is
the interval [.1, .3].
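The reading of the chart described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's method: the function names are hypothetical, draws are assumed independent, and the box is taken to be the central span left after trimming equal numbers of sample proportions from each tail, so that at least 90% remain inside.

```python
import math
import random

def sample_proportions(p, sample_size, num_samples, rng):
    """Sample proportions from one simulated experiment (independent draws assumed)."""
    return [
        sum(rng.random() < p for _ in range(sample_size)) / sample_size
        for _ in range(num_samples)
    ]

def central_box(proportions, coverage=0.90):
    """Return (low, high) of a central box holding at least `coverage` of the values."""
    ordered = sorted(proportions)
    n = len(ordered)
    needed = math.ceil(round(coverage * n, 9))  # values the box must contain
    trim = (n - needed) // 2                    # values dropped from each tail
    return ordered[trim], ordered[n - 1 - trim]

def compatible_populations(sample_prop, boxes):
    """Populations whose box contains the observed sample proportion."""
    return [p for p, (low, high) in sorted(boxes.items()) if low <= sample_prop <= high]

# Five experiments of 20 samples of size 10, as in Figure 8.
rng = random.Random(0)
boxes = {
    p: central_box(sample_proportions(p, sample_size=10, num_samples=20, rng=rng))
    for p in (0.10, 0.20, 0.30, 0.40, 0.50)
}
matches = compatible_populations(0.15, boxes)
print(matches)
if matches:
    # Treats the matching populations as one contiguous range.
    print("approximate interval:", (min(matches), max(matches)))
```

With only 20 samples per experiment the boxes vary noticeably from run to run, so the populations listed will not always be the 10%, 20%, and 30% of the example above.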

**Figure 9. Box plot comparison of five sampling distributions.**

**2.2.7 Confidence Intervals**

As the sample size and the number of samples increase, and as
we examine finer gradations of population proportions, we can
come arbitrarily close to the theoretical confidence interval.
The *Sampling Laboratory* also supports construction of confidence
intervals. The user simply clicks on the x-axis at the point corresponding
to an actual sample. The program performs a linear extrapolation
to connect the box plots in an envelope. It then highlights the
region of population proportions that corresponds to the confidence
interval about that sample (Figure 10), in this case, [.066, .338].
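One way to picture the envelope construction is the sketch below. It is an assumption-laden illustration rather than the program's algorithm: the box edges are made-up values (not those behind Figure 10), the function names are hypothetical, and the sketch only interpolates between neighbouring box plots, whereas the program evidently also extends the envelope beyond the outermost ones.

```python
def interpolate(x, xs, ys):
    """Piecewise-linear value of y at x; xs must be strictly increasing."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the range covered by the box plots")
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def interval_from_envelope(sample_prop, populations, box_low, box_high):
    """Population proportions whose interpolated box would contain sample_prop."""
    # Lower end: the population whose upper box edge just reaches the sample.
    lower = interpolate(sample_prop, box_high, populations)
    # Upper end: the population whose lower box edge just reaches the sample.
    upper = interpolate(sample_prop, box_low, populations)
    return lower, upper

populations = [0.10, 0.20, 0.30, 0.40, 0.50]
box_low = [0.00, 0.05, 0.10, 0.20, 0.30]   # hypothetical lower box edges
box_high = [0.20, 0.35, 0.50, 0.60, 0.70]  # hypothetical upper box edges

low, high = interval_from_envelope(0.25, populations, box_low, box_high)
print(f"[{low:.3f}, {high:.3f}]")
```

The interval is read along the population axis: its ends are the population proportions at which the upper and lower envelope lines cross the observed sample proportion.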

**Figure 10. A constructed confidence interval for an actual sample.**

**3 Sampling Laboratory Curriculum**

The *Sampling Laboratory* curriculum has the following characteristics:

(a) A connection to sampling issues in the real world of high
school students (see section 3.1 below).

(b) Awareness of the misleading nature of many words in the standard
statistics vocabulary, e.g., "normal," "error,"
"confidence" (see section 3.2 below).

(c) Activities using concrete materials (e.g., bottle caps, see
Appendix A) and real-world data (e.g., gender distribution in
families).

(d) Inquiry-oriented activities, in which students explore statistical
questions such as "do a coin and a tack have the same chance
of landing UP?," defining their own experimental method and
decision criteria.

(e) Significant use of the *Sampling Laboratory*, especially
in conjunction with inquiry-oriented activities.

The curriculum was realized in two classrooms, which provided
a wide range of student abilities and challenges for incorporating
sampling lessons into different contexts. In each class we conducted
before and after interviews with students to assess what they
were learning. The classes were the following:

(a) A statistics class at Belmont High School (BHS) which has
been taught using the *Reasoning Under Uncertainty* curriculum
and the *ELASTIC* software for the past three years. Here
the *Sampling Laboratory* activities were used as a four-week
module in a semester-long course on statistics.

(b) A general math course at Cambridge Rindge and Latin High School
(CRLS). This is a course for students who have not been successful
in standard mathematics courses. The focus of the four-week *Sampling
Laboratory* module was on relating concepts of statistical
reasoning to students' everyday concerns. Statistics was not a
part of the rest of the course.

In addition, students in another class at CRLS also used the *Sampling
Laboratory*:

(c) An advanced placement math course at CRLS. Sampling was introduced
near the end of the semester after students had taken the advanced
placement test.

**3.1 Examples of Sampling**

Everyday experience is one place to begin discussion of sampling,
showing where it enters into the fabric of a typical day. For
example, Peter Mili, a teacher at Cambridge Rindge and Latin High
School (CRLS), suggested three questions that tap the topics
students discuss frequently between classes and outside of school:
How many students want condoms to be available in school? How
many students support a Coke boycott? Are there different probabilities
of violence in different parts of the city? Other topics we used
or envision using are the following:

*Breakfast.* One may sample oatmeal, randomizing it by stirring.
Breakfast foods advertise various proportions of nutrients on
the package, and their sugar content is the subject of
an *RUU* activity sheet (1-10). Samples of this kind have to be
destroyed in testing (another example is sampling flashbulbs).
Marketing as well as quality control
involves accurate sampling. The debate about the healthful effects
of oat bran illustrates reasoning based on samples. As for
food generally, the story of the removal and the return of
red M&M's (*RUU*, p. 2-12) gives a good example of marketing
research. A question might be: If you sampled your friends, would
that give you a good idea of what Americans (or other populations)
have for breakfast?

*Clothes.* Here again is the issue of quality control and
inspection: Are all instances of a product the same? Here, too,
are marketing surveys of what items (e.g., shoes or some appealing
example) are popular with whom, what features will make a new
item saleable, how much people are willing to pay. Pump-up basketball
shoes are a spectacular example. What proportion of students at
the high school wear certain items? Would this be regional or
national?

*Media.* Radio stations do surveys in order to aim music
at specific groups (*RUU*, p. 2-12). The Nielsen ratings
for TV are another example of surveys (*RUU* activity sheet
1-18). How are royalty payments for recordings played on the air
figured? This is a good stumper explained in Moore (p. 5). An
interesting case is trying out a new movie or piece of music,
based on marketing knowledge, but where the expectation can go
wrong. We are sampling variables that change through time. Do
we really decide what to like or is it decided for us? How do
we decide what movies to go to? Sampling determines much of the
entertainment that is offered to us.

*Risk.* CRLS students showed some interest in this topic.
Are teens unusually at risk for accidents, violence, early death?
Highway safety is one aspect (*RUU*, pp. 4-30 ff.). Deaths
in Boston are another aspect - there's a discussable map in *RUU*
(p. 4-39). Medical tests sample both population trends and one's
own individual health, and samples may vary in both cases. Drug
testing is a hotly debated example. Has teen drug use declined,
as the government claims? Our very health system rests on testing
and sampling. One extended case of how that works is a description
of the introduction of the Salk vaccine at the end of Module 3
in *RUU*.

*Politics and Government.* The heavy use of polling in elections
is a prominent example, and there's now an obligatory mention
of "margin of error." Politicians are guided in their
strategies by opinion polls. How would one predict the outcome
of a school election? The census provides crucial data and determines
how government money is spent. There are interesting questions
here of undercounts of ghettos and of the homeless, and whether
the final figures should be adjusted on the basis of estimates
drawn from samples. The government provides non-census data too
- what is the reason for this and is it reliable? Moore gives
a good discussion of the Bureau of Labor Statistics (p. 111 ff.)
and unemployment which might interest kids. Should one be skeptical
of this data? On what basis?

*Schools.* SATs, IQs, and other such national tests are built
into our educational system. Are they fair? Classroom testing
is an interesting if subtle example too, given individual variation
and differences of context. For example, is it fair to test students
after a long vacation? The issue of standardization across samples
is a fascinating one: At Belmont High School (BHS) a good discussion
was initiated by asking kids their lowest and highest grades for
the term, which prompted some accurate estimates of which teachers
the grades came from. Can we compare grades or tests from different
schools or classrooms?

**3.2 Key Terms**

One of the problems identified in research on statistical reasoning
is that many terms are easily confused with ordinary-language terms.
The module addressed several of these:

*Normal.* Common sense contrasts this term with abnormal,
which leads some students to expect to escape abnormality by obtaining
normal curves, and to want the sample itself to be a normal distribution.
Perhaps the stress should be on the verb "norming,"
with the curve as a convenient norm. One can build out from histograms
made from equally distributed variables such as heads and tails,
or Moore's dark and white beads (p. 15). Then a normal curve can
be shown to emerge when many samples are tabulated by their
proportion of heads or dark beads. The peak is at the "break-even
point," necessarily tailing off in both directions.

*Error.* Moore's employment of the statistical sense of this
term runs in one case against the ordinary sense. He distinguishes
between *sampling errors*, which cause results to be different
from the results of a census, and *non-sampling errors*, which
might be present in a census (p. 22). Moore's text commits us
to the term, but perhaps the first and confusing sense, having
to do with a properly planned act of sampling, might be put in
scare quotes - "error" - and shown to be different from
error in the sense of doing something wrong. Certainly the data
from our interviews suggests it should be flagged in some way,
since some students weren't ready to encounter sampling variability.

Moore goes on to distinguish *random* sampling error, which
is just this ordinary variation of samples, and *nonrandom* sampling
error, such as through convenience sampling or an inappropriate
sampling frame. Randomness can be used to explain this, but there
is a further pitfall. This second category of sampling error,
while intuitively O.K., is likely to be confused with non-sampling
error (missing data, response error, processing error, etc.)
because both involve doing something wrong. In this context
Moore's distinction is not that useful, and perhaps could be played
down. The key distinction is between sampling variability and
bungled procedure.

*Random.* This is a subtle idea - trying to give a rule for
the ruleless, to delineate the incarnately slippery. Some Belmont
students got the idea of simple random sampling so firmly in mind
that they thought doing things right brings one to the zero case
of randomness, ruling out variability from the population. Moore
usually contrasts randomness with long term coherence. The idea
of randomness might stick better if it were presented not as a
correlative or contrastive idea but as the name of an autonomous
process that is simply *there*. Dice throwing is a nice example
which does not get conflated with later normal curves of a set
of samples of some proportion. It might help to dwell on random
sampling variability and such ideas as independence and small
causes acting independently, such as talking about the physical
basis for the results of dice-throwing and cheating with loaded
dice. Perhaps also it would help if students were asked to struggle
with defining randomness themselves.

*Confidence.* In our interviews we asked about the continuity
of ordinary and statistical senses of this term, and our experts
and students had it both ways. But there is an intuitive basis
here to build on. The BHS teacher Alice Mandel gave a good practical
example in her interview of taking on everyday issues (elections,
etc.) with scaling and weighted numbers as a way of focussing
our subjective expectations so they're not "a nebulous cloud
that you're trying to pack down" (p. 7). Such ideas as betting
odds were employed by some of the Belmont students to explicate
what confidence level means.

Moore explains clearly what confidence level does *not* mean
(pp. 302-303), but he lays down a fine-spun taboo about phrasing
it for the student who wants to say the true p falls within the
confidence interval with a certain probability. The conceptual
difference here is between following a method that achieves a
rational result in general and the strong tendency to want to
make an assertion about the unique case at hand. The point about
method seems likely to require extensive explanation if it is
raised (neither Alice nor her students raised it), but it might
be interesting to introduce it late in a course to see how students
respond.

*Bias.* Here the intuition is so robust that the problem
is to make a transition from the human tendency to objective method.
The student must grasp that a person being biased is only one
matter that can affect a sample being biased, and that the latter
and more objective case of bias may or may not be brought about
by bias in the person sense. What has to be added is an appreciation
of what bias does to a distribution - how it makes it diverge
in a certain direction from the features of the true population.
The student must *see* bias as an effect in the distribution.
Telephone surveying (Moore, p. 22) and voluntary response to questionnaires
(Moore, p. 7) are good examples for discussion of bias: you can
estimate the direction of bias.

**3.3 The Sampling Module at BHS**

Below is an account by Alice Mandel of the sampling module developed
for use at BHS. Appendix A is a module handout and Appendices
B and C are quizzes for the module. Section 4.1 gives more details
about the actual classroom experiences.

Day 1: Population size experiment using a deck of cards with the
diamonds missing and then 3 decks of cards with the face cards
missing. This activity demonstrates that inferences from samples
are independent of population size, and by implication, the power
of sampling as a procedure for estimating population parameters.

Day 2: Bottle cap experiment. (See Appendix A.)

Day 3: Finish bottle cap experiment. Worksheet on Random Digits
from *QL*.

Day 4: Use *Sampling Laboratory* to simulate bottle cap experiment.

Day 5: Constructing 90% box plots on Macs using *Sampling Laboratory*.
Design survey sampling procedure.

Day 6: Continue interpretation of charts of 90% box plots on Macs
and on paper.

Day 7: Continue 90% box plots on Mac. Discuss application 8 from
*QL*.

Day 8: Begin discussion of 90% confidence intervals (without formulas).

Day 9: Reading 90% box plots and change to 90% confidence intervals.

Day 10: Confidence intervals in the news; margin of error.

Day 11: Quiz on 90% box plots (see Appendix B). The quiz included
a question that required use of the *Sampling Laboratory*.

Day 12: Confidence Interval formula for proportions.

Day 13: Using confidence interval to find desired sample size.
Collect survey project data.

Day 14: M&M Day. Took repeated samples of M&M's and calculated
percent brown and margin of error with graph of 1/n. Also made
a histogram of sample proportions which looked normal almost immediately.
HW: Work on survey project.

Day 15: Work on JC Penney/Phone Book/Dictionary activities.

Day 16: Continuation

Day 17: Return quizzes. Review.

Day 18: Module test. (See Appendix C.)
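The formulas used on Days 12 and 13 are not reproduced in this account. Assuming the class used the standard large-sample interval for a proportion, with sample proportion $\hat{p}$, sample size $n$, and critical value $z^{*}$ (about 1.645 for 90% confidence), the two relations would read:

$$
\hat{p} \pm z^{*}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},
\qquad
n \ge \left(\frac{z^{*}}{E}\right)^{2}\hat{p}(1-\hat{p})
$$

Here $E$ is the desired margin of error. Taking $\hat{p} = 0.5$ in the second relation gives the most cautious (largest) sample size.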

A major part of the course was to do a survey project. Each student
formulated a hypothesis, defined independent and dependent variables,
devised a sampling procedure, designed and conducted a survey,
gathered data, organized and analyzed the data using *ELASTIC*
as well as hand-generated graphs, analyzed results, formulated
conclusions, and did a critique of their own study. Excerpts from
these projects are included in Appendix D.

**3.4 The Sampling Modules at CRLS**

Below is an account by Peter Mili of the sampling module developed
for use at CRLS. Section 4.1 describes the classroom implementation
in more detail.

"We introduced the students to sampling using some newspaper
and magazine reports, along with much discussion. Included was
an activity where we tried to estimate how long it would take
to ask a large number of people a simple question (minutes, hours,
days,...).

"We had the students each create a question that they were
interested in knowing about, and then had them ask 20 students
at CRLS. Our final activity of the unit was to interpret these
results and report them with a margin of error.

"We did the coin flipping in class and worked on collecting
and organizing the information in tables and graphs. Then we went
to the software where we had the students run experiments and
try to understand all the representations (windows). This involved
getting printouts and working with the students individually.
We also had to "take a step back" at this point and
have the students do some paper and pencil constructions of histograms
and box plots. We felt that they needed this for better understanding.

"We found it beneficial to add written interpretation to
the printouts. Specifically, we wrote the actual number of samples
above each bar to correspond with the indicated percentages from
the horizontal axis. With the printout of the box plot, we listed
the actual numbers that corresponded to the range of the box and
whiskers just beneath the graph. We then reinforced and summarized
the data by writing a series of sentences which interpreted the
information provided by the graphs.

"At this point we went back to the software so that the students
could run experiments with different population proportions (10,
20...) in order to create the box plots so that we could talk
about confidence intervals. We took printouts back to class and
discussed how we use them to get a confidence interval for the
sample proportion. We did not use this vocabulary, instead we
used "margin of error" and "between" and "likely/unlikely"
to describe the confidence. We then had the students write a sentence
to report the results of their surveys."

Below are examples of survey questions devised by students at
CRLS. Each student interviewed 20 people for the project.

(a) If you had to choose between "rap" music or "rock"
music, which would you prefer?

Pick one. "Rock" / "Rap"

(b) Do you agree with the graduation requirement of 16 credits
(4 years) in physical education?

Pick one. Agree / Disagree

(c) Should the United States government legalize marijuana?

Pick one. Yes / No

(d) Should the United States government legalize cocaine?

Pick one. Yes / No

**4 Learning About Reasoning From Samples**

As a complement to our work on developing new software and hardware
configurations, we have been conducting studies of student learning.
These studies have included ethnographic studies of classrooms
and open-ended interviews. There are several goals of this work:

(a) To identify and characterize areas of difficulty for students
in learning statistical reasoning, especially concepts related
to confidence judgements on estimations of population parameters
from sample statistics.

(b) To identify and characterize the impact of current teaching
practices (classroom activities, text materials, software) on
these areas.

(c) To develop guidelines for better teaching of statistical reasoning.

(d) To develop guidelines for the design of software and new classroom
activities.

**4.1 Ethnography**

Preliminary results of an ethnographic study of *Sampling Laboratory*
classrooms are based on observations of 12 sessions of the
BHS classroom and 4 of the CRLS classroom. They show the workings of
the *Sampling Laboratory* and the new sampling activities,
which in the current year focussed on proportional sampling and
the multiplicity of samples and populations. They also show the
persisting challenges and difficulties the students experience,
into which the program and its history offer at least experience
and insight. The 1990
focus of the sampling module was a creative challenge for the
teachers in the two experimental sites, and their adaptation of
it to their particular contexts was imaginative.

In a pivotal session of Alice Mandel's course, students were asked
what was wrong with a shuffled deck as the cards were turned up
one at a time. A wrong guess meant being out of the game, the
prize of which was a candy or cake which the winner could eat
in front of the class. Two kids guessed too soon, but Chris detected
the absence of diamonds on the seventh card. Then Alice tried
the same experiment with four decks. Ryan was out at two cards,
though he confessed that his premature guess was "so dumb."
Elise guessed at seven that there were no face cards. After a
follow-up discussion with Alice, the students went to the Macs.
Here they referred to their previous physical experience with
tossing bottle caps and tested the effect of different sample
sizes on the width of 90% box plots (April 30th).

In the class at CRLS, where the students were at a lower level
of math skills, Peter Mili started the unit by talking about the
current U.S. census and reading some clippings of recent surveys.
He zeroed in on a 1986 study of cocaine use which reported that
5.8 million Americans used the substance in the month studied.
Peter then asked the kids to imagine the logistics of surveying
this many subjects by phone, to think through the arithmetic of
the time required for phone calls and the money required for surveyors.
The students collaborated with Peter in answering these questions.
Wayne got confused between thousands and millions, but finally
the group decided it would take 100 people 580 ten-hour workdays
each to complete the task. At this point Cassandra said: "So
how did they do that? So they take a small amount and..."
This session was followed by a coin-tossing class, then a class
with sampling on the Macs (April 10th-12th). As in Alice's class,
there was a balance between students' intuitive ideas and math
potentials, and an underscoring of the multiplicity of samples
and populations, as well as the task of making a reasonable estimate
of the population proportion from the sample.
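The group's figure can be checked with a line of arithmetic. The calling rate is not stated in the account. The rate of about ten calls per surveyor-hour (six minutes per call) is inferred here from the reported numbers, not taken from the classroom record.

$$
\frac{5{,}800{,}000 \text{ calls}}{100 \text{ people} \times 580 \text{ days} \times 10 \text{ hours/day}}
= \frac{5{,}800{,}000}{580{,}000}
= 10 \text{ calls per hour}
$$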

In the BHS course, the chart of samples and populations (Figure
10) - represented in software, printouts, paper handouts, overhead
projector, and blackboard sketches - became the working schema
to which students and Alice referred. This was so true that on
May 16th David accused Jen of getting a confidence interval not
by actual inspection of the box plots but by just guessing - i.e.,
by using the general shape and units of the chart. The working
lingo of Alice's class shifted from last year's means and distributions
to box plots and proportions. Students seemed to connect margin
of error and proportion more easily than the 1989 students had
connected standard deviation and mean, possibly because the order
of percentages is simpler and more familiar.

Alice built up slowly to the idea of statistical confidence, using
it at first in the everyday sense as she talked about the graphic
apparatus. Confidence in the technical sense was introduced on
May 1st with the proportion formula. Alice tied this explanation
of confidence to the box of the box plot as described in a QL
handout (Section III, Application 6), where the proportions inside
the box are said to be "likely sample proportions."
She emphasized the multiplicity theme the same day by having Will
chart the calculations of width of confidence interval beside
sample size and Jen graph the same findings. Questioning the students
about the graph, she got them to talk about the bearing of its
non-linearity on deciding how much of a sample to pay for. Alice
addressed the issue of sample variability and randomness by injecting
into the lore of her class the expression "bleep happens,"
whose memorability was heightened by adolescent humor. Like the
graphic scheme, this notion seemed to sink in, and students enjoyed
embroidering on it. David expressed the increasing accuracy of
larger samples as "the bleep gets bleeped out" (May
7th).
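The non-linearity the students discussed can be reproduced with a short calculation. This sketch assumes the standard large-sample formula for a 90% interval on a proportion (critical value 1.645), which may differ from the formula on the class handout. The function name `interval_width` is hypothetical.

```python
import math

Z_90 = 1.645  # assumed critical value for 90% confidence

def interval_width(p_hat, n, z=Z_90):
    """Full width of the large-sample confidence interval for a proportion."""
    return 2 * z * math.sqrt(p_hat * (1 - p_hat) / n)

# Chart width beside sample size, as Will did on the board.
for n in (25, 100, 400, 1600):
    print(f"n = {n:5d}   width = {interval_width(0.5, n):.3f}")
```

Under this formula each fourfold increase in sample size only halves the width, which is the point behind deciding how much of a sample is worth paying for.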

Some of the 1989 BHS difficulties (see section 1.1) surfaced.
Early on, in the discussion of varying widths of confidence intervals
resulting from change in sample size, Kim asked "What is
it?," reflecting a desire for one number and displaying puzzlement
over the suspension of this central tendency over a plurality
of samples (May 1st). Another previous problem was "getting
it right." This year gave more examples: Jen said, "So
if you have a perfect sample, you get 40%?" (May 2nd).

A further challenge came in understanding the idea of independent
and dependent variables, which the students had to employ in their
final projects. Chris explained his application of the concept
with good understanding in his project paper, but it was harder
for others. On May 21st, David, who was running late in his project
work, had a running dialogue with Alice on this point while she
was meeting the demands of the kids at the Macs. His main confusion
seemed to be the idea that something could alternatively be one
or the other, depending on the question. After this, he talked
with Bart and Ryan. Ryan asked Bart, "What's your hypothesis
[in your project]?" Bart: "I didn't have any."
In his project paper, Ryan said there was no dependent variable
in his study of whether school interferes with student meals.
As in the 1989 BHS class, the relation between mathematical variables,
which can be connected constructively, and the causal or functional
variables of science, about which people often have knowledge
or strong intuitions, was not easy for the students to sort out.
On their behalf it should be said that they tackled slippery social
and psychological items in their project and struggled with issues
that trouble professional researchers.

Alice maintained a running distinction between simulations and
real-world samples (as did Peter), an issue closely linked to
the foregoing. That the students were alerted to this is shown
in Jen's query, two weeks after the card demonstration: "Did
you shuffle the cards? Did you have it all planned out?"
(May 16th).

The four CRLS students were a very diverse group, and their response
to the curriculum was highly individual. Wayne and, to a lesser
degree, Darla, had difficulty with the sheer size of some of the
numbers involved and with keeping straight the difference between
actual counts and percentages. Wayne believed he could influence
the toss of a coin but came around by watching the others to the
opinion that "it depends on how you flip it...you have to
flip it the same every time" (April 11th). Both were excited
by their real world survey question and by working on the Macs,
and both seemed to gain understanding in their handling of percentage
and their survey question. George had an intuitive sense of some
probability issues, such as polls and gambling, but missed the
middle part of the unit. Cassandra showed the clearest benefits.
She started with skimpy math skills but showed good progress in class
discussion in grasping the point of surveys and the variability
of samples. Cassandra was able to give Wayne help in calculating
his confidence interval in the wind-up session (May 4th).

Peter Mili exhibited wonderful skill and ingenuity in bringing
the ideas of sampling to the group. In the opinion of the ethnographer,
his approach merits a repeat trial with a class more similar in
skills. It should also be noted that Peter was assisted during
the unit by the students' regular teacher, Julie Hochstadt, and
by a graduate student aide in education.

**4.2 Interviews**

Using pre- and post-course interviews based on two similar problems
(see Appendix F) we interviewed four students each from the CRLS
and BHS classes. There were thus eight pre- and eight post-interviews,
each organized around two problems.

In the CRLS group, Wayne had trouble keeping numbers in mind to
answer the questions and perhaps also had difficulty with standard
English, being from Jamaica. He said the unit helped him with
percentages. He tended to translate statistical questions - such
as B4, B5 - into causal questions where he had some opinion. George
had a sense of betting odds and referred many of the questions
to his own scheme of smart/middle/dumb or high/middle/low in both
pre- and post-interviews. He did not know what a confidence level
was and had a vague sense of margin of error, interpreting A.1.3
in follow-up questions as meaning "It's wrong...it's wrong
by 9 points...the other guy's ahead...not necessarily."

Darla was also more comfortable with causal language, though she
took this reasoning farther than Wayne or George. In her pre-interview,
she came up with her own solution to the educational problems
of American students saying that the problem was hanging out and
not studying, and recommended work with parents and scholarships
for students. In the post-interview, she spoke of the ethnic and racial
terrain of elections. Darla seemed to lack confidence in her ability
to venture into numerical reasoning. At the end of the pre-interview,
she reproached the interviewer, saying "These questions are
complicated," and gave the example of an arithmetic problem
which she *could *solve. She found it hard to distinguish
sample and population in her post-interview.

The unit changed Cassandra's knowledge of margin of error from
admission of ignorance in the pre- to a clear verbal explanation
in post-. To B7, she said she would bet based on the margin of
error. Cassandra was casual in her use of numbers but otherwise
adept at getting at the ideas, though she did not grasp confidence
level. A sample of her thinking is shown in the following stretch
of interview in respect to question A.2.6 in her pre-interview:

Interviewer: "How did you figure that out?"

Cassandra: "I just thought...Cause I figured if like, there's
so many green and so little red, that, even though you mix them
up, it's like the greens overpowering, dominate...you know."

In the exit interview she said she liked working on the computer:
"It had all of charts right there." The box plot was the
main thing she learned. Of political surveys, Cassandra said:
"I won't look at them the way I used to. There's a margin
of error."

Elise was Cassandra's counterpart among the BHS interviewees. She
began by wondering whether "margin of error" means "outlier,"
then answered candidly that she didn't know. In the post-, she clearly
identified margin of error and how it worked. Elise said "confidence"
meant "they're 90% sure that it is 53%...the other 10% they're
not sure about." Unlike Cassandra, Elise had solid math skills
and explained the shrinking of the confidence interval as sample size
increases in terms of the square root in the formula. When asked about
distributions in the second problem, Elise groped at first but
then tied the discussion to box plots. The contrast in Elise's
answers is seen in one of her pre- remarks about B.1.7:

Interviewer: "Can you ever make a bet on the basis of a poll?"

Elise: "If you're stupid, yeah!"

Cindy had some knowledge - perhaps purely semantic - of margin
of error in her first interview: "That the person winning
might win more, or might not win." In her post-session, she
gestured outwards to mean "that there's an error." She
added, "It's easier for me to look at something and do it
out." Cindy explained confidence as being 90% sure that the
data is accurate. She said that 100% confidence wouldn't really
be a sample, "so 90% is good because you can really get an
excellent idea of it." Like all the other BHS kids except
Elise, Cindy said she simply used the formula.

Will had an initial sense of margin of error and answered B.1.2
as meaning that 86 Japanese students passed "more or less
seven." Like George, Will seemed to have some familiarity
with betting: he spoke of "spread" (B.1.7) and explained
margin of error in the second interview by saying, in response to A.1.2, that
"it could mean something or could just be the luck of the
draw." Will had difficulty explaining confidence level: "Ms.
Mandel would kill me...uh...90% accurate." When asked about
distribution, Will said: "You gotta distribute a certain
amount into different...if you were doing this on a computer,
there'd be different columns, and you'd have to distribute it
in each column." After answering the questions in the second
interview, Will said that he was not good with graphs, that other
kids in the class had more computer courses, and that "I'm
going into business and I don't like computers that much, but
I know that they're there and I'm gonna have to use them."

Kim moved from characterizing margin of error as "it could
be wrong" to a correct numerical use. In fact, she overused
it in the post-, ascribing overlapping margins of error to the American
population and the Japanese sample in problem B.1. Kim said that
confidence level meant that they're "not sure...They're 90%
sure it's valid." When pressed on the ambiguous "it,"
Kim identified it with "the percentage of students in the
population." She referred questions about the formula (POST,
B.1.11) to the graphic scheme: "If you use 2 for z that's
for a 90% box plot."

As might have been expected, the interviews show the BHS kids
having a better grasp of the concepts, since they started with
better math skills and took the sampling unit as part of a larger
statistics course. On the other hand, in such matters as causal
reasoning, intuitions, and misconceptions (for example, the idea
that a larger sample will include more differences and consequently
make for a wider confidence interval), the groups were similar.
As can perhaps be seen from some of the student remarks, the interview
itself and the kind of understanding it explores were secondary
from the students' point of view; what mattered to them
was passing tests, working on the computer, and doing the individual
projects. In this sense the students' mastery of statistics in use is probably
better than a strict interpretation of the interviews indicates,
because what the students aimed at was - in Cindy's words - "to
look at something and do it out."

**5 Future Directions**

Further work on the *Sampling Laboratory* is needed.
Some of the ideas that have emerged from our field testing are
these:

*Visual representation of the sampling process.* The current
*Sampling Laboratory* displays a population as a bar chart.
A sample from that population magically appears as another standard
bar chart or, under user control, as a bar made up of small triangles,
each representing a unit in the sample. A sampling distribution
is another bar chart showing where each sample proportion falls
within the set of all samples generated in an "experiment."

This method of representing the sampling process is too abstract
for some students. We have considered a number of ways to make
the process more concrete. One idea is to build on Judah Schwartz's
*Ample Sample* program. In this approach, the population
is an array of icons with variation in color or shape. A sample
is a selected region of the population. Both the population and
the sample proportion are indicated directly by the density of
the focus category's icon. The sampling distribution could be
constructed by stacking up samples of similar densities.

An alternative is to use the time dimension to show the population.
Elements of the population could spew out of a pipe. If one did
not know the generating function for the population, this representation
would make the need for sampling more apparent. Samples in this
representation would be portions of the population stream.
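
The icon-array idea can be sketched in a few lines of Python. This is only an illustration under assumptions of our own, not a description of *Ample Sample* or of the *Sampling Laboratory*: the function names (`make_icon_grid`, `region_hits`, `stack_samples`), the rectangular shape of a region, and the ten density bins are all hypothetical.

```python
import random
from collections import Counter

def make_icon_grid(rows, cols, p, rng):
    """Population as a grid of icons; True marks the focus category."""
    return [[rng.random() < p for _ in range(cols)] for _ in range(rows)]

def region_hits(grid, top, left, height, width):
    """Number of focus icons inside one rectangular region (one sample)."""
    return sum(grid[r][c]
               for r in range(top, top + height)
               for c in range(left, left + width))

def stack_samples(grid, height, width, num_samples, rng, num_bins=10):
    """Place regions at random and stack them by density bin."""
    rows, cols = len(grid), len(grid[0])
    cells = height * width
    stacks = Counter()
    for _ in range(num_samples):
        top = rng.randrange(rows - height + 1)
        left = rng.randrange(cols - width + 1)
        hits = region_hits(grid, top, left, height, width)
        # Integer arithmetic keeps the binning exact; density 1.0 goes in the top bin.
        stacks[min(hits * num_bins // cells, num_bins - 1)] += 1
    return stacks

if __name__ == "__main__":
    rng = random.Random(0)
    grid = make_icon_grid(40, 40, p=0.3, rng=rng)
    stacks = stack_samples(grid, height=5, width=5, num_samples=100, rng=rng)
    for b in range(10):
        print(f"{b / 10:.1f}-{(b + 1) / 10:.1f} | " + "#" * stacks[b])
```

Each row of `#` marks is one stack of samples with similar densities, so the printout plays the role of the sampling distribution. Regions placed at random may overlap, which a real design would need to consider.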

*Context-dependent help and explanation features.* These
would be connected to each window and data object. Because the
*Sampling Laboratory* maintains extensive information about
each sampling distribution, it can serve as a resource for the
student who wants to explore questions such as "Which samples
contributed to this box plot having the shape it has?"

*Bias.* Bias in sampling can invalidate any information
about a population inferred from a sample. It would be valuable
to have a way to introduce bias in the sampling process in order
to observe its effects.
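
One possible way to introduce bias is sketched below in Python. It assumes a simple model that is ours, not the program's: units in the focus category are `focus_weight` times as likely to be selected as other units, and draws are made with replacement. All names are hypothetical.

```python
import random

def make_population(size, p):
    """Population as a list of booleans; True marks the focus category."""
    n_focus = round(size * p)
    return [True] * n_focus + [False] * (size - n_focus)

def sample_proportion(population, n, rng, focus_weight=1.0):
    """Draw n units with replacement; focus units carry weight focus_weight."""
    weights = [focus_weight if unit else 1.0 for unit in population]
    sample = rng.choices(population, weights=weights, k=n)
    return sum(sample) / n

def mean_sample_proportion(population, n, num_samples, focus_weight=1.0, seed=0):
    """Average sample proportion over many samples drawn the same way."""
    rng = random.Random(seed)
    total = sum(sample_proportion(population, n, rng, focus_weight)
                for _ in range(num_samples))
    return total / num_samples

if __name__ == "__main__":
    population = make_population(1000, p=0.4)
    print("unbiased:", mean_sample_proportion(population, 50, 200))
    print("biased:  ", mean_sample_proportion(population, 50, 200, focus_weight=2.0))
```

With a true proportion of 0.4 and a weight of 2, this model gives an expected sample proportion of 0.8 / 1.4, or about 0.57, however many samples are drawn. Bias therefore shifts the sampling distribution, and a larger sample does not shrink the shift.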

*Stratified sampling.* Stratified sampling can increase the
accuracy of information about populations that have multiple subpopulations
of interest. It would be useful to have methods for defining subpopulations
and for sampling from these subpopulations disproportionately.
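
A minimal Python sketch of disproportionate stratified sampling follows. The strata, their population shares, and the function name `stratified_estimate` are hypothetical. Each subpopulation is sampled separately, and its sample proportion is then weighted by its share of the population.

```python
import random

def stratified_estimate(strata, rng):
    """Estimate the overall proportion from per-stratum samples.

    strata: list of (population_share, true_proportion, sample_size).
    Returns (overall_estimate, list_of_per_stratum_estimates).
    """
    overall = 0.0
    per_stratum = []
    for share, p, n in strata:
        hits = sum(1 for _ in range(n) if rng.random() < p)
        estimate = hits / n
        per_stratum.append(estimate)
        # Weight by population share, not by sample size, so that
        # oversampling a small stratum does not distort the total.
        overall += share * estimate
    return overall, per_stratum

if __name__ == "__main__":
    rng = random.Random(0)
    # A 10% subpopulation is sampled as heavily as the 90% majority.
    strata = [(0.9, 0.5, 100), (0.1, 0.9, 100)]
    overall, parts = stratified_estimate(strata, rng)
    print("overall:", round(overall, 3), "per stratum:", parts)
```

In this example the true overall proportion is 0.9 × 0.5 + 0.1 × 0.9 = 0.54. Sampling the small stratum heavily gives a usable estimate for that subpopulation, which a simple random sample of the same total size might contain only a handful of times.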

*Links to sampling theory.* The current *Sampling Laboratory*
provides the basis for an empirical approach to understanding
concepts ordinarily encountered in abstract forms. Its power as
a tool would be enhanced if its operation could be linked to theoretical
constructs. For example, we could show the theoretical binomial
curve on the sampling distribution window. Then, the student could
select a region and see the probability that a sample would fall
in that region and compare that probability to the actual sampling
distribution. Other features could help students see the relation
between sample size and confidence interval or between confidence
level and confidence interval. We could also allow specification
of population sizes to show that population size does not affect the
reliability of estimates of the population proportion.
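
The comparison between theory and experiment can be sketched in Python as follows. The function names are hypothetical, and the margin-of-error formula is the standard normal approximation z·sqrt(p(1 − p)/n), which may differ in detail from the formula used in the unit.

```python
import math
import random

def binomial_pmf(k, n, p):
    """Probability of exactly k focus-category units in a sample of n."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def region_probability(n, p, k_low, k_high):
    """Theoretical chance that the hit count lands in [k_low, k_high]."""
    return sum(binomial_pmf(k, n, p) for k in range(k_low, k_high + 1))

def empirical_region_share(n, p, k_low, k_high, num_samples, seed=0):
    """Share of simulated samples whose hit count lands in the same region."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        hits = sum(1 for _ in range(n) if rng.random() < p)
        if k_low <= hits <= k_high:
            inside += 1
    return inside / num_samples

def margin_of_error(p, n, z=1.645):
    """Normal-approximation half-width; z = 1.645 is roughly a 90% level."""
    return z * math.sqrt(p * (1 - p) / n)

if __name__ == "__main__":
    # Region: sample proportions from 0.40 to 0.60 with n = 50.
    print("theory:    ", region_probability(50, 0.5, 20, 30))
    print("experiment:", empirical_region_share(50, 0.5, 20, 30, num_samples=2000))
    for n in (100, 400, 1600):
        print(n, round(margin_of_error(0.5, n), 4))
```

Because of the square root, quadrupling the sample size halves the margin of error, and a higher confidence level (a larger `z`) widens it. Population size does not appear as a parameter, because the sketch models each draw as independent.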

*Decision theory.* Questions about sample size, confidence
levels, and confidence intervals ultimately are meaningful only
if there are costs and benefits associated with choices one makes,
e.g., that each sample imposes a cost, or that there is some value
in being correct about which population a given sample came from.
This suggests a successor program in which students must make
choices about sampling, using limited resources in order to achieve
some goal, such as finding a "small" region that "almost
certainly" includes the population proportion. Definitions
of parameters like region size (width of confidence interval)
would arise from some real problem context.
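
The trade-off between cost and precision can be made concrete with a short Python sketch. It assumes the normal approximation for the margin of error, the worst case p = 0.5, and a fixed cost per sampled unit. The function names, the budget, and the cost figure are hypothetical.

```python
import math

def required_sample_size(half_width, z=1.645, p=0.5):
    """Smallest n whose normal-approximation margin of error is <= half_width."""
    return math.ceil((z / half_width) ** 2 * p * (1 - p))

def affordable_half_width(budget, cost_per_unit, z=1.645, p=0.5):
    """Largest affordable sample and the margin of error it buys."""
    n = int(budget // cost_per_unit)
    if n < 1:
        raise ValueError("budget does not cover a single sampled unit")
    return n, z * math.sqrt(p * (1 - p) / n)

if __name__ == "__main__":
    for half_width in (0.10, 0.05, 0.025):
        print(half_width, required_sample_size(half_width))
    print(affordable_half_width(budget=100, cost_per_unit=1))
```

Halving the width of the region roughly quadruples the required sample size. A student with limited resources must therefore decide how "small" a region is worth paying for.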

*Extension to other measures.* *Sampling Laboratory* was restricted
to estimations of population proportions for several reasons.
But many students could go beyond this to explore other measures
of central tendency such as mean or median.
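
The same sampling experiment carries over to other measures with little change, as the Python sketch below suggests. The population of scores from 1 to 100 and the function name `sampling_distribution` are hypothetical.

```python
import random
import statistics

def sampling_distribution(stat, population, n, num_samples, seed=0):
    """Apply stat to each of num_samples random samples of size n."""
    rng = random.Random(seed)
    return [stat(rng.sample(population, n)) for _ in range(num_samples)]

if __name__ == "__main__":
    population = list(range(1, 101))  # hypothetical scores 1..100
    for stat in (statistics.mean, statistics.median):
        for n in (5, 25):
            values = sampling_distribution(stat, population, n, 300)
            print(stat.__name__, n, round(statistics.pstdev(values), 2))
```

The printed spread of each sampling distribution lets students compare how the mean and the median behave as sample size grows.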
