Education Development Center, Inc.
Center for Children and Technology
A Systems Approach to Educational Testing
CTE Technical Report Issue No. 2
John R. Frederiksen & Allan Collins
Bolt Beranek and Newman
Our concern in this paper is with the validity of educational
tests when they are employed as critical measures of educational outcomes
within a dynamic system. The problem of validity arises if an educational
system adapts itself to the characteristics of the outcome measures. We
introduce the concept of systemically valid tests as ones that induce curricular
and instructional changes in education systems (and learning strategy changes
in students) that foster the development of the cognitive traits that the
tests are designed to measure. We analyze some general characteristics that
contribute to or detract from a testing system's systemic validity, such
as the use of direct rather than indirect assessment. We then apply these
characteristics in developing a set of design principles for creating testing
systems that are systemically valid. Finally, we provide an illustration
of the proposed principles by applying them to the design of a student assessment
system. This design example addresses not only specifications for the tests,
but also the means of teaching the process of assessment to users of the system.
There are enormous stakes placed on students' performance on educational
tests. And there are, consequently, enormous pressures on school districts,
school administrators, teachers, and students to improve scores on tests.
These pressures drive the educational system to modify its behavior in ways
that will increase test scores (Darling-Hammond & Wise, 1985; Madaus,
1988). The test scores, rather than playing the role of passive indicator
variables for the state of the system, become the currency of feedback within
an adapting educational system. The system adjusts its curricular and instructional
practices, and students adjust their learning strategies and goals, to maximize
the scores on the tests used to evaluate educational outcomes, and this
is particularly true when the stakes are high (Corbett & Wilson, 1988).
Thus, for example, if a reading test emphasizes certain skills, such as
knowledge of phonics, then these become the skills that will receive emphasis
in the reading curriculum.
Our concern in this paper is with the validity of educational tests within
such a dynamic system. To introduce tests into a system that adapts itself
to the characteristics of tests poses a particular challenge to their validity
and calls into question many of the current practices in educational testing.
That challenge to validity has to do with the effects of the instructional
changes engendered by the use of the test and whether or not they contribute
to the development of the knowledge and/or skills that the test purportedly
measures. This extension of the notion of construct validity of a test to
take into account the effects of instructional changes brought about by
the introduction of the test into an educational system we shall refer to
as the systemic validity of a test.
A systemically valid test
is one that induces in the education system curricular and instructional
changes that foster the development of the cognitive skills that the test
is designed to measure. Evidence for systemic validity would be an improvement
in those skills after the test has been in place within the educational
system for a period of time.
Given this challenge to test validity due to systemic effects, the question
we must take up has to do with whether there are any general characteristics
of a system of testing that can be identified as either contributing to
or detracting from a test's systemic validity. In our analysis, we shall
identify a number of characteristics that contribute to systemic validity.
We shall then apply these characteristics in developing a set of design principles
for an alternative form of testing system that is systemically valid, one
that we believe will drive the educational system toward practices that
will lead to improvements in the underlying knowledge and skills that tests
are seeking to measure. Finally, we shall provide an illustration of the
proposed principles, in the context of a student assessment system. (Elsewhere,
we have applied the design principles to teacher assessment; Collins &
J. R. Frederiksen, 1989).
Educational Systems as Dynamic Systems
The measures that educators choose to use in assessing outcomes provide
one important form of feedback that determines how the system will modify
its future operation. Schoenfeld's (in press) observations of the teaching
of one of the most successful math teachers in New York State precisely
illustrate our point. Students of geometry in the state of New York must
all pass a statewide Regents' Exam that has become, in no uncertain terms,
the goal of instruction: Scores on the test are used to judge students,
teachers, and school districts. In geometry, the exam includes as a major
component a required proof (chosen from a list of a dozen theorems) and
also a construction problem (in which tools such as a straightedge and a
compass are used to "construct" a figure with specified properties).
In the scoring of the proofs, students are expected to reproduce all the
steps of the proof in a two-column form, listing each proof step and a justification
for that step. In the construction problem, they are not required to give
justifications for the steps of the construction, but are graded on whether
the construction has all of the required arcs and lines and how accurately
they are drawn. Schoenfeld found that these characteristics of the
Regents' Exam have completely subverted the way the teacher taught geometry.
Instead of teaching students how to generate proofs, the teacher had students
memorize the steps for each of the 12 proofs that might be on the exam.
In their constructions, the students were taught how to carry them out neatly.
The students were thus able to pass the geometry part of the Regents' Exam
with flying colors, but they did not learn how to reason mathematically.
This example illustrates how the systemic validity of a test is dependent
on the specification of the construct the test is taken to measure, which
is in turn related to the goals of teaching and learning. If the goal of
teaching geometry is to be able to reproduce formal proofs and to develop
flawless constructions, then the Regents' geometry test can be said to be
systemically valid. However, if the goal is to assess how students can develop
proofs and use constructions as tools for mathematical exploration, then
the test cannot be said to be systemically valid, because its use has engendered
instructional adaptations that do not contribute to the development of these
cognitive skills. A test's validity cannot be evaluated apart from the intended
use of the test (Messick, 1988).
In the absence of feedback and adaptation to the test, the Regents' test
and tests like it may provide an adequate indication of students' knowledge,
because most representative geometry items will correlate highly with one
another and the use of one or another particular set of test items will
not result, therefore, in any gross misclassification of test takers. However,
the requirement of systemic validity creates a much more stringent standard
for the construction of tests, for it requires us to consider evolutions
in the form and content of instruction and students' learning engendered
by use of the test. That is, will instruction that focuses on the skills
and problem formats represented in tests promote the ability of students
to engage, in the present case, in authentic mathematical investigations
and problem solving? There are several reasons why we believe that it will not:
1. If a test emphasizes isolated skill components and items of knowledge,
instruction that seeks to increase test scores is likely to emphasize those
skill components rather than higher level processes (N. Frederiksen, 1984;
Resnick & Resnick, in press).
2. Instruction that seeks to develop specialized test-taking strategies
(e.g., in taking a multiple-choice
test, trying to eliminate one or more of the response alternatives and then
guessing) will not improve domain knowledge and skills.
3. Time and effort spent in directly improving test scores in these ways
will displace other learning activities that could more directly address
the skills and learning goals the test was supposed to be measuring in the first place.
4. Students will direct their study strategies toward those skills (such
as memorization) that are represented on the tests and that appear to be
valued by educational institutions rather than toward the use of cognitive
skills and knowledge in solving extended problems.
One solution to the problem of low systemic validity would be, of course,
to disallow the development of any instruction aimed explicitly at improving
scores on the test. Such an approach, however, would deny to the educational
system the ability to capitalize on one of its greatest strengths: to invent,
modify, assimilate, and in other ways improve instruction as a result of
experience. No school should be enjoined from modifying its practices in
response to their perceived success or failure. Nor should students be prevented
from optimizing their study so as to carry out the kinds of problem solving
valued within their course of study. Yet if these strategic modifications
in teaching and learning are to be based on test scores, then their efficacy
will depend crucially on the systemic validity of the tests that are used.
We are left, therefore, with the alternative solution to the problem: to
encourage the inventiveness and adaptability of educational systems by developing
tests that directly reflect and support the development of the aptitudes
and traits they are supposed to measure.
Characteristics of Systemically Valid Tests
There are two dimensions or characteristics of tests that have a bearing
on their usefulness as facilitators of educational improvement. These are
(a) the directness of cognitive assessment, and (b) the degree of subjectivity
or judgment required in assigning a score to represent the cognitive skill.
In indirect tests, an abstract cognitive skill is measured by evaluating
less abstract, more directly observable features of performance that are
known (or theoretically expected) to be highly correlated with the abstract
skill. For example, verbal aptitude, a construct that might be defined as
"the ability to formulate and express arguments in verbal form,"
is measured using tests of vocabulary knowledge or verbal analogies. In
direct tests, the cognitive skill that is of interest is directly
evaluated as it is expressed in the performance of some extended task. An
example would be to rate the coherence of an argument in a legal brief.
The degree of subjectivity of a test refers to the degree to which
judgment is used in assigning a score to a student's test performance. Objective
tests use simple, algorithmic scoring methods such as counting the number
of items correct. Subjective tests, on the other hand, require judgment,
analysis, and reflection on the part of the scorer in the assignment of
a score. Because the scoring algorithms of objective tests are simple, the
item formats of such tests are usually constructed to invoke unitary responses,
such as selecting one from a set of multiple-choice response alternatives
or writing a single word, phrase, or number. Subjective tests do not necessitate
this restriction on the form of response and typically allow more extended
responses to a test item, such as the writing of an essay. Drew Gitomer
(personal communication, May 8, 1989) has pointed out that in objective tests,
there is a low degree of inference required at the item-scoring level, but
a much higher degree of inference required when items are aggregated using
a psychometric model (e.g., item response theory, factor analysis) to produce
a scale representing a particular construct. Subjective tests require, in
contrast, more judgment and expertise in scoring at the item level, but
very little inference at the level of summarizing item level scores. In
educational testing, objective tests are generally preferred because they
reduce the scoring task to a simple, objective scoring algorithm such as
a tallying of correct answers. Benefits of such objective tests are the
reliability of scoring, the lack of potential biases that might affect score
assignments, and the ease and economy of algorithmic scoring.
Problems with using objective tests. We believe that one pays a very
high price in reduced systemic validity for using objective tests. This
is due to the fact that the desire for objective tests leads to tests that
are indirect, and indirect tests often have problems of systemic validity.
For example, in teacher assessment, competency can be assessed using tests
of teachers' knowledge (domain knowledge and pedagogical knowledge) and
basic skills (e.g., reading and mathematics). However, while such knowledge
may be associated with or even necessary for effective practice as a teacher,
it does not provide direct evidence of such
practice, nor will developing such knowledge ensure more effective teaching.
Similar remarks can be made about tests of factual knowledge as a measure
of accomplishment at the end of a course in history or tests of vocabulary
knowledge as a measure of the capacity to do college work. In general, objective
tests emphasize low-level skills, factual knowledge, memorization of procedures,
and isolated skills, and these are aspects of performance that correlate
with but do not constitute the flexible, high-level skills needed for generating
arguments and constructing solutions to problems (N. Frederiksen, 1989;
Resnick & Resnick, in press). Use of objective tests thus leads to teaching
strategies that emphasize the conveying of information and to student learning
strategies that emphasize memorization of facts and procedures, rather than
learning to generate solutions to problems, including novel problems that
occur in "real life" contexts. N. Frederiksen (1984) has termed
this effect of tests on the content of instruction "the real test bias."
In some cases, it may be possible to construct objective tests that are
direct measures of important cognitive constructs, such as identifying mental
models in physics (Clement, 1982; McCloskey, Caramazza, & Green, 1980;
McDermott, 1984; White, 1983) or assessing creativity in scientific problem
solving (N. Frederiksen, 1978). It may also be possible to use techniques
of artificial intelligence to build relatively detailed models of students'
knowledge on the basis of extended examples of their problem solving (Anderson,
Boyle, & Reiser, 1985; Clancey, 1983; J. R. Frederiksen & White,
1989; Johnson & Soloway, 1985; Sleeman & Brown, 1982). Although
it is worthwhile to continue efforts to develop objective tests of important
cognitive outcomes of learning, in general the state of the art does not
permit objective tests for directly measuring higher order thinking skills,
problem-solving strategies, and metacognitive abilities involved in tasks
such as teaching, writing, constructing a historical argument, and "doing"
mathematics. Thus we believe that it is important to consider some of the
advantages of subjective, direct assessment of such high-order cognitive skills.
Advantages of direct tests. Direct tests attempt to evaluate
a cognitive skill as it is expressed in the performance of extended tasks.
Such measures are systemically valid, because instruction that improves
the test score will also have improved performance on the extended task
and the expression of the cognitive skill within the task context. In figure
gymnastics, for example, measures of traits such as technical merit and
artistic impression are assigned by judges based on an extended program
that is developed and performed by the athlete.
In educational testing, a particularly good example of this approach (and
one that has been seminal in influencing our thinking) is the primary trait
system for scoring writing tasks that was developed by the National Assessment
of Educational Progress (NAEP) (Mullis, 1980). The purpose of the NAEP assessment
was to measure whether a piece of writing is successful or unsuccessful
in achieving a particular purpose. The student is given a writing assignment
with a particular goal, such as writing a letter to the chairman of the
school board on the advisability of instituting a 12-month school year.
To evaluate such writing, a set of primary traits was developed that are
important for successfully achieving the goal of the writing assignment.
For example, one primary trait, persuasiveness, involves the presentation
of a set of logical and compelling arguments. The completed writing exercise
is rated on a set of such primary traits, using a simple 4-point scale for
each. For example, persuasiveness is rated as follows: "1" for
a paper containing no reasonable argument, "2" for a paper having
one or two poorly thought out arguments, "3" for a paper containing
several logically thought out reasons, and "4" for a paper containing
in addition a number of compelling details (Mullis, 1980).
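The 4-point persuasiveness scale just described can be written down as a small data structure. The following Python sketch is only an illustration: the class name `PrimaryTrait`, the function `explain_ratings`, and the paraphrased level descriptions are our assumptions, not part of the NAEP materials.

```python
from dataclasses import dataclass


@dataclass
class PrimaryTrait:
    """One primary trait with a description for each point on its scale."""
    name: str
    levels: dict  # maps an integer score to the description of that level

    def describe(self, score):
        """Return the rubric description for a score, rejecting off-scale values."""
        if score not in self.levels:
            raise ValueError(f"{score!r} is not on the {self.name} scale")
        return self.levels[score]


# Level descriptions paraphrase the persuasiveness scale quoted in the text.
PERSUASIVENESS = PrimaryTrait(
    name="persuasiveness",
    levels={
        1: "no reasonable argument",
        2: "one or two poorly thought out arguments",
        3: "several logically thought out reasons",
        4: "several logically thought out reasons plus compelling details",
    },
)


def explain_ratings(ratings, traits):
    """Pair each assigned score with the rubric text that justifies it."""
    return {name: (score, traits[name].describe(score))
            for name, score in ratings.items()}


if __name__ == "__main__":
    traits = {PERSUASIVENESS.name: PERSUASIVENESS}
    print(explain_ratings({"persuasiveness": 3}, traits))
```

Keeping the level description next to the numeric score is one way to make each rating carry its own justification, in the spirit of the scoring rationale discussed for the exemplars.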
Basing educational assessment on such subjective scoring requires that scorers
understand the scoring categories and be taught how to use them reliably.
This in turn necessitates building a library of exemplars of student work
representing different levels of the desired primary traits. This library
is then used to train scorers to assess the traits. In the case of the NAEP
writing assessment, for each writing exercise, exemplars of texts scored
in each category are provided. In addition, a detailed rationale is included
for each exemplar explaining why the particular score has been assigned.
Assessors study these exemplars and practice scoring until they have internalized
the criteria and can rate primary trait performance reliably in a variety
of task contexts. In the NAEP primary trait assessment of writing, a typical
interscorer agreement of 91%-95% was achieved. Moreover, studies have shown
that individual, remote scorers, following calibration (Braun, 1986), can
provide scores that approach quite closely the values derived using standardized
scoring methods (Breland & Jones, 1988).
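To make the notion of interscorer agreement concrete, the sketch below computes a simple exact-match percentage between two scorers. This is an assumption on our part: the text does not state how the NAEP agreement figures were calculated, and the function name `percent_agreement` and the sample ratings are hypothetical.

```python
def percent_agreement(scores_a, scores_b):
    """Percentage of papers on which two scorers assigned the identical score."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both scorers must rate the same set of papers")
    if not scores_a:
        raise ValueError("at least one rated paper is required")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return 100.0 * matches / len(scores_a)


if __name__ == "__main__":
    # Hypothetical ratings of ten papers on a 4-point primary trait scale.
    scorer_1 = [1, 2, 2, 3, 3, 3, 4, 4, 2, 1]
    scorer_2 = [1, 2, 2, 3, 3, 4, 4, 4, 2, 1]
    print(percent_agreement(scorer_1, scorer_2))
```

Exact-match agreement is the simplest such index; it does not correct for agreement expected by chance, so a chance-corrected statistic may be preferable when score distributions are very uneven.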
It would be difficult to justify the cost of developing
these training materials if they were to be used only to train professional
assessors. However, there is another use to which they can be put: The
training materials can become the medium for communicating to teachers and
students the critical traits to look for in good writing, good historical
analysis, and good problem solving. The library of exemplars can be
viewed as a set of "case studies" that can be used by teachers
to make their students aware of the nature of expert performance, or as
Wolf puts it, to help them "develop a keen sense of standards and critical
judgment" (1987, p. 26). Using them, students can learn to assess their
own work in the same way that their teachers will judge it. They can, for
example, learn to recognize critical traits in their writing and to carry
this awareness along with them as they carry out their assignments. The
assessment system provides a basis for developing a metacognitive awareness
of what are important characteristics of good problem solving, good writing,
good experimentation, good historical analysis, and so on. Moreover, such
an assessment can address not only the product one is trying to achieve,
but also the process of achieving it, that is, the habits of mind that contribute
to successful writing, painting, and problem solving (Wiggins, 1989). We
believe that building such awareness will lead to genuine improvements in
the cognitive traits on which the assessment system is based. We argue,
therefore, that adopting subjective, direct assessment is a good way to
increase the systemic validity of a testing system.
Principles for the Design of Systemically Valid Testing
Our plan for the design of a systemically valid testing system has three
major aspects: (a) the components of the testing system; (b) the standards
to be sought in the design of the system; and (c) the methods by which the
system encourages learning. A general outline of the design specification
will be presented in this section. In the subsequent section, we will illustrate
the applications of this design for a student assessment system.
Components of the Testing System
The testing system we envision has four major components: a set of tasks,
a specification of primary traits to be assessed, a library of exemplars
of performances on each task, and a training system for teaching how to
score the primary traits.
Set of tasks. The tests should consist of a representative
set of tasks that cover the spectrum of knowledge, skills, and strategies
needed for the activity or domain being tested. For example, in student
assessment, if there is a set of basic problem-solving skills we think students
should acquire, these skills must be called for in the tasks given. The
tasks might be constructed as in the assessment of figure skating: a set
of compulsory tasks plus a set of elective tasks, so that testees can demonstrate
both their basic abilities in compulsory tasks and their planning and creativity
in elective tasks. The tasks should be authentic, ecologically valid tasks
in that they are representative of the ways in which knowledge and skills
are used in "real world" contexts (Brown, Collins, & Duguid,
1989; Wiggins, 1989).
Primary traits for each task and subprocess. The knowledge
and skills used in performing any task may consist of distinct subprocesses.
For example, teaching might be broken down into planning, classroom practice,
and evaluating students' work, each of which requires somewhat different
talents. These subprocesses need to be assessed independently so that test
takers will direct their efforts to doing well in all phases of the task
domain being tested. Each subprocess must be characterized by a small number
of primary traits or characteristics that cover the knowledge and
skills necessary to do well in that aspect of the activity. The traits should
cover both process and products and should include planning and reflection.
For example, in writing, processes might include note taking, outlining,
drafting, and revising. The primary traits for expository writing might
be clarity, persuasiveness, memorability, and enticingness (Collins &
Gentner, 1980). (The specific traits may differ for different processes
and products.) The primary traits chosen should be ones that the test takers
should strive to achieve, and thus should be traits that are learnable.
The small number is necessary to focus the test taker's learning. The particular
traits chosen for any task domain are not too critical, as long as they
cover the skills that are judged to be important and they are learnable.
In other words, we believe that the testing approach is robust over different
sets of primary traits.
A library of exemplars. In order to ensure reliability of
scoring and learnability, it is important that for each task there be a
library of exemplars of all levels of performance for each primary trait
assessed in the test. The library should include exemplars representing
the different ways to do well (or poorly) with respect to
each trait. It should also include critiques of each sample performance,
so that it is clear how the performance was judged. The library should be
accessible to all, and particularly to the testees, so that they can learn
to assess their own performance reliably and thus develop clear goals to
strive for in their learning.
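The requirement that the library hold exemplars at every level of every primary trait can be checked mechanically. The following Python sketch assumes a 4-point scale and uses hypothetical names (`Exemplar`, `missing_levels`) and sample entries; it is not a description of any existing system.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """A sample performance together with the critique explaining its score."""
    task: str
    trait: str
    score: int
    work: str
    critique: str


def missing_levels(library, task, traits, levels=(1, 2, 3, 4)):
    """List the (trait, level) pairs for a task that have no exemplar yet."""
    covered = {(e.trait, e.score) for e in library if e.task == task}
    return [(t, lvl) for t in traits for lvl in levels if (t, lvl) not in covered]


if __name__ == "__main__":
    # Hypothetical library entries for a single writing task.
    library = [
        Exemplar("school-year letter", "persuasiveness", 1,
                 "letter A", "offers no reasonable argument"),
        Exemplar("school-year letter", "persuasiveness", 3,
                 "letter B", "gives several logically thought out reasons"),
    ]
    print(missing_levels(library, "school-year letter", ["persuasiveness"]))
```

A check of this kind tells the master assessors where the library still has gaps. It does not verify that the exemplars at a given level illustrate the different ways of doing well or poorly, which still requires judgment.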
A training system for scoring tests. There are three groups
that must learn to score test performance reliably: (a) the administrators
of the testing system, who develop and maintain the assessment standards
(i.e., master assessors); (b) the coaches in the testing system whose role
is to help test takers to perform better; and (c) the test takers themselves,
who must internalize the criteria by which their work is being judged. The
master assessors are charged with defining the criteria, ensuring that test
performance can be scored reliably, and training coaches to score performances.
The coaches work with the test takers to teach them self-assessment.
Standards must be developed for the testing system that include the following:
Directness. From a systems point of view, we have seen that
it is essential that whatever knowledge and skills we want test takers to
develop be measured directly. Sometimes this may require measuring a process,
sometimes a product, and sometimes both. In either case, any indirectness
in the measure will lead to a misdirection of learning effort by test takers
to the degree that it matters to them to do well on the test.
Scope. The test should cover, as far as possible, all the
knowledge, skills, and strategies required to do well in the activity. To
the degree that any knowledge or skills are left out, test takers will direct
their learning efforts to only part of what is required of them.
Reliability. We think that the most effective way to obtain
reliable scoring that fosters learning is to use primary trait scoring borrowed
from the evaluation of writing. Developing a primary trait system for any
test involves the same steps that were used by NAEP in applying it to writing.
Transparency. The terms in which the test takers are judged
must be clear to them if a test is to be successful in motivating and directing
learning (Wiggins, 1989). In fact, we argue that the test must be transparent
enough so that they can assess themselves and others with almost the same
reliability as the actual test evaluators achieve.
Methods for Fostering Improvement on the Test
The testing system should not only employ forms of assessment that enhance
learning, but it should also include specific methods designed to foster
such learning. These include the following:
Practice in self-assessment. The test takers should have ample
opportunity to practice taking the test and should have coaching to help
them assess how well they have done and why. This kind of reflection on
performance (Collins & Brown, 1988) is made possible by recording technologies
such as videotape and computers. The assistance of a coach, who has internalized
the testing standards, is critical to helping the test takers see their
performance through others' eyes.
Repeated testing. Although it may be necessary to have the
test administered at only a few times during a year, it is still important
to encourage students to take the test multiple times to encourage striving
for improvement. If what is measured by the test is important to learn,
then the test should not be taken once and forgotten. It should serve as
a beacon to guide future learning.
Feedback on test performance. Whenever a person takes the
test, there should be a "rehash" with a master assessor or teacher.
This rehash should emphasize what the testee did well and poorly on, and
how performance might be improved. It should preferably involve a master
assessor so that the institutionalized standards will be clear to the test taker.
Multiple levels of success. There should be various landmarks
of success in performance on the test, so that students can strive for higher
levels of performance in repeated testing. The landmarks or levels might
include such labels as "beginner," "intermediate," and
"expert" to motivate attempts to do better.
The system we envision involves developing a number of extended tasks or
projects that students would carry out to demonstrate their mastery of courses
they are taking, such as history or physics. We can illustrate the approach
with two structured tasks that might be given to students in American history
and physics. For history, a task might be as follows: "At the beginning
of World War II, the United States was divided as to whether to enter the
war or to stay neutral. Pick three presidents in history, other than Franklin
Roosevelt, who you think would have taken different positions
on the issue, and write a 2-minute speech of each to the American public
on what should be done in that situation." These speeches might then
be delivered and recorded on videotape, with questions following from other
students as in a press conference. For physics, the task might be to design
a set of activities using a Dynaturtle (diSessa, 1982; White, 1984) that
would help younger students learn to understand Newton's Laws of Motion.
(A Dynaturtle is an object in a computer simulation that operates in a frictionless,
gravity-free environment, and is controlled like a spaceship.) These are
examples of the kind of extended tasks that students could be given to demonstrate
their understanding of history or science. A variety of such tasks could
be provided to teachers for use in assessment, or teachers could construct
their own tasks following a set of task specifications that are provided
to them. In general, the tasks to be included within an assessment system
would vary from structured tasks that measure students' understanding of
critical concepts or skills to open-ended tasks that allow students to demonstrate
special knowledge and creativity. Ideally, these tasks would be fully integrated
within a course, rather than serving as accessories to the course.
Scoring Student Performance
Students would be evaluated on the tasks in terms of a set of primary traits.
Examples of primary traits that could be used are (a) clarity of expression,
(b) creativity, (c) depth of understanding or thoroughness, (d) consideration
of multiple perspectives, and (e) focus or coherence. The particular traits
chosen are, again, not critical so long as they cover the desired qualities
and direct students' efforts appropriately. The primary traits would cover
both process and products, and also might be applied to different phases
of an assessment task, such as planning, presentation, and revision.
To implement the assessment system, it is important to build a library of
exemplars of students working on a variety of tasks, covering all the major
subject areas. This library would be embodied in paper, videotapes, and
computer traces. For example, paper records might include notes, outlines,
and multiple drafts of articles written. Videotapes might record students
discussing their initial plans, making presentations, answering questions,
or performing dramatic scenes. Computers might record document preparation
and revision or students' solutions to problems such as the
physics activity described above. Each of these exemplars should also contain
a critique of the performance by master assessors in terms of the set of
primary traits chosen for evaluating students.
The administration for such a system could be centered at the school, district,
state, or even national level. There would have to be a group of master
assessors who are responsible for developing the set of traits, the criteria
for scoring, and the library of exemplars. They would also be responsible
for showing teachers how to evaluate student performance, and in fact testing
teachers to make sure that they have internalized the evaluation criteria.
Teachers would function as coaches to the students as they practiced different
tasks, to help them internalize the criteria by which they are judged. Ideally,
students would learn how to critique their own and each other's performances
in terms of the primary traits adopted.
Addressing Different Audiences
A major problem in student assessment is that the test scores generated
have to address the needs and desires of many different audiences. Colleges
need to know whether the student meets their admission standards. Teachers
want to know what students have learned and failed to learn. Parents and
students want to know how the student is doing relative to some standard.
Administrators want to know how well different teachers and schools are
succeeding. All of these different needs have to be balanced in setting
up an assessment system.
Because colleges are a major constituency for student assessment, the criteria
for evaluating students in each subject should be developed in conjunction
with college admissions officers, who have ideas about what knowledge
and skills are essential for admission. (For students in vocational courses,
criteria should be developed in consultation with businesses and other potential
employers and with licensing boards.) These same criteria should suffice
for parents, students, and teachers, since they are the outcome measures
that are valued by colleges or future employers, and are therefore ecologically
valid measures of performance that are judged to be important in "real
A Changing Role for Testing Organizations
Lest our proposal for a systemically valid testing system seem overly
visionary, we shall briefly examine the practical side of implementing
such a system. We believe that
the efficiency of current testing practices is greatly outweighed by the
cost of using a system that has low systemic validity, one that has a negative
impact on learning and teaching. The goal of assessment has to be, above
all, to support the improvement of learning and teaching. To accomplish
this, major changes must occur in the role and function of testing organizations.
In the future, they will retain their important role as developers of assessment
tools, and they will, as now, be responsible for setting scoring standards
and practices. However, they will have to assume some new responsibilities:
(a) they must develop materials for use in teaching the assessment techniques,
not only to master assessors within schools and school districts, but also
to teachers and students; and (b) they must take responsibility for ensuring
that the assessment standards are assimilated and maintained by these new
groups of assessors. The big difference is that the practice of assessment
will no longer be confined to the testing organizations; it will become
more decentralized, as teachers and students are taught to internalize the
standards of performance for which they are to strive.
We end with some caveats. Clearly, much research needs to be done to test
the assumptions on which our proposal is based: Can primary traits be assessed
reliably on a common scale when the particular tasks that test takers carry
out may vary? Does an awareness of primary traits help students to improve
performance on projects and teachers to become more effective in the classroom?
Can a consensus be reached on what are appropriate primary traits for different
domains and activities? Can scoring standards be met when assessment is
decentralized? These and other questions should become the basis of a concerted
research effort in support of a new, systemically valid system of educational
This work was supported by the Center for Technology in Education under
Grant No. 1-135562167-Al from the Office of Educational Research and Improvement,
U.S. Department of Education, to Bank Street College of Education. We would
like to thank Norman Frederiksen, Drew Gitomer, Robert Glaser, and Ray Nickerson
for their thoughtful comments on an earlier draft of the paper.
1. A critical assumption is that scorers can learn to recognize and reliably
assess primary traits, not only in the particular tasks used in the library
of exemplars, but in other tasks for which the trait is relevant. Although
there is evidence bearing on this assumption in the assessment of writing
(Breland & Jones, 1988), further work will be required to check its
validity for the specific primary traits that are to be the goal of assessment.
Anderson, J. A., Boyle, C. F., & Reiser, B. J. (1985). Intelligent tutoring
systems. Science, 228, 456-68.
Braun, H. (1986). Calibration of essay readers (Report No. RR-86-9).
Princeton, NJ: Educational Testing Service.
Breland, H. M., & Jones, R. J. (1988). Remote scoring of essays
(Report No. 88-4). Princeton, NJ: Educational Testing Service.
Brown, J. S., Collins, A., & Duguid, P. (1989). Situated cognition and
the culture of learning. Educational Researcher, 18(1), 32-42.
Clancey, W. (1983). Guidon. Journal of Computer-Based Instruction,
10(1 & 2), 8-15.
Clement, J. (1982). Students' preconceptions in elementary mechanics. American
Journal of Physics, 50, 66-71.
Collins, A., & Brown, J. S. (1988). The computer as a tool for learning
through reflection. In H. Mandl & A. Lesgold (Eds.), Learning issues
for intelligent tutoring systems (pp. 1-18). New York: Springer.
Collins, A., & Gentner, D. G. (1980). A framework for a cognitive theory
of writing. In L. W. Gregg & E. R. Steinberg (Eds.), Cognitive processes
in writing (pp. 51-72). Hillsdale, NJ: Erlbaum.
Collins, A., & Frederiksen, J. R. (1989). Five traits of good teaching:
Learning, thinking, listening, involving, helping. Unpublished report,
BBN Laboratories, Cambridge, MA.
Corbett, H. D., & Wilson, B. (1988). Raising the stakes in statewide
mandatory minimum competency testing. Politics of Education Association
Darling-Hammond, L., & Wise, A. (1985). Beyond standardization: State standards and school
improvement. Elementary School Journal, 85, 315-336.
diSessa, A. (1982). Unlearning Aristotelian physics: A study of knowledge-based
learning. Cognitive Science, 6, 37-76.
Frederiksen, J. R., & White, B. Y. (1989). Intelligent tutors as intelligent
testers. In N. Frederiksen, R. Glaser, A. Lesgold, & M. Shafto (Eds.),
Diagnostic monitoring of skill and knowledge acquisition (pp. 1-25).
Hillsdale, NJ: Erlbaum.
Frederiksen, N. (1978). Assessment of creativity in scientific problem
solving (Research Memorandum RM-78-9). Princeton, NJ: Educational Testing
Frederiksen, N. (1984). The real test bias. American Psychologist,
Frederiksen, N. (1989). Introduction. In N. Frederiksen, R. Glaser, A. Lesgold,
& M. Shafto (Eds.), Diagnostic monitoring of skill and knowledge
acquisition (pp. vii-xv). Hillsdale, NJ: Erlbaum.
Johnson, W. L., & Soloway, E. (1985). PROUST: An automatic debugger
for Pascal programs. Byte, 10(4), 179-190.
Madaus, G. (1988). The influence of testing on the curriculum. In L. Tanner
(Ed.), Critical issues in curriculum: 87th Yearbook of the NSSE, Part
1. Chicago: University of Chicago Press.
McCloskey, M., Caramazza, A., & Green, B. (1980). Curvilinear motion
in the absence of external forces: Naive beliefs about the motion of objects.
Science, 210, 1139-1141.
McDermott, L. C. (1984). Research on conceptual understanding in mechanics.
Physics Today, 37, 24-32.
Messick, S. (1988). Validity. In R. L. Linn (Ed.), Educational measurement
(3rd ed., pp. 13-103). New York: Macmillan.
Mullis, I. V. S. (1980). Using the primary trait system for evaluating
writing. National Assessment of Educational Progress Report. Denver,
CO: Education Commission of the States.
Resnick, L. B., & Resnick, D. P. (in press). Assessing the thinking
curriculum: New tools for educational reform. In B. R. Gifford & M.
C. O'Connor (Eds.), Future assessments: Changing views of aptitude, achievement,
and instruction. Boston: Kluwer.
Schoenfeld, A. H. (in press). On mathematics as sense-making: An informal
attack on the unfortunate divorce of formal and informal mathematics. In
D. N. Perkins, J. Segal, & J. Voss (Eds.), Informal reasoning and
education. Hillsdale, NJ: Erlbaum.
Sleeman, D., & Brown, J. S. (Eds.). (1982). Intelligent tutoring
systems. New York: Academic Press.
White, B. Y. (1983). Sources of difficulty in understanding Newtonian dynamics.
Cognitive Science, 7(1), 41-65.
White, B. Y. (1984). Designing computer activities to help physics students
understand Newton's laws of motion. Cognition and Instruction, 1,
Wiggins, G. (1989, May). A true test: Toward more authentic and equitable
assessment. Phi Delta Kappan, 703-713.
Wolfe, D. P. (1987, December). Opening up assessment. Educational Leadership,
Published in Educational Researcher, Vol. 18, No. 9, pp. 27-32 (1989,