Standardized tests’ Holy Grail


The much-maligned multiple-choice test, the crux of California’s and other states’ accountability exams, will be replaced partly, if not entirely, by more complex, lengthier and probably more costly state tests. As part of the Race to the Top program, Education Secretary Arne Duncan has set aside $350 million to pay for the development of new standardized tests, plus high school measures of career and college readiness, over the next four to five years.

Duncan and President Obama, who has derided “fill-in-a-bubble” standardized tests, expect that the new “performance assessments,” along with the common-core standards to which they’ll be aligned, will guide teachers’ instruction and improve student results. Skeptics – primarily defenders of current versions such as California’s STAR tests – doubt whether the next-generation tests can deliver what Duncan is demanding.

Stanford Education Professor Linda Darling-Hammond, author of the new book “The Flat World and Education,” has no doubt that they can. She is leading a consortium of groups, backed by a majority of states, that is in the running for one of two $160 million grants that Duncan will award this year. At a public briefing in Washington on Tuesday, Darling-Hammond and others affiliated with SCOPE – Stanford Center for Opportunity Policy in Education – will present seven research papers that will support their effort.

In an overview paper she co-authored, “Beyond Basic Skills: The Role of Performance Assessment in Achieving 21st Century Standards of Learning,” Darling-Hammond wrote that standardized tests developed to comply with No Child Left Behind measure “mostly lower-level skills, such as the recall or recognition of information.” Particularly in lower-achieving schools, these tests have led to a narrowing of the curriculum, while, at higher-achieving schools, they have “placed a glass ceiling over more advanced students, who are unable to demonstrate the depth and breadth of their abilities on such exams.” And, because tests tend to influence what teachers teach, multiple-choice tests have discouraged teachers “from having students conduct experiments, make oral presentations, write extensively, and do other sorts of intellectually challenging activities.”

That’s quite an indictment. Defenders of California’s STAR program and other states’ tests argue that multiple-choice questions have gotten a bad rap. If written well, they can measure a deep understanding of a subject. But there’s also no denying that this time of year, in the weeks preceding STAR tests, many schools are slaves to weeks of rote test preparation. Particularly for low-achieving students, April is the cruelest month.

The alternative is performance assessments, which require students to construct their own responses to questions. These can take the form of supplying short phrases or sentences to questions, writing essays or conducting complex and time-consuming activities, such as a lab experiment. “By tapping into students’ advanced thinking skills and abilities to explain their thinking, performance assessments yield a more complete picture of students’ strengths and weaknesses,” Darling-Hammond wrote.

Duncan must agree. The regulations for competing for the assessment grants lean toward performance measures, both for formative assessments that will measure students’ progress during the year, and end-of-year statewide accountability tests.

But performance assessments face big challenges if they’re going to be used for high-stakes tests whose results will determine which schools are failing and, in many states, how teachers will be evaluated and paid. Not only are Obama and Duncan not backing away from accountability under No Child Left Behind, but they want a longitudinal growth model that can measure individual students’ improvement over the course of a year and from year to year. They also want tests that offer valid state comparisons.

But performance assessments face obstacles of cost, reliability and testing time. Short constructed-response questions, requiring students to fill in a phrase or, with a math problem, to show their work, can probably be done without too much extra time or money through the use of computer-assisted technology. But more in-depth items, from essays to experiments, will take a lot longer to take and score, and will present daunting challenges to score uniformly across districts, not to mention states.

Darling-Hammond’s paper gives numerous examples of performance-type questions and assessments used in states and abroad. But even in her prototype high-quality assessment of the future, multiple-choice comprises half of the questions in math and English-language arts – a nod to the time and cost challenges. Multiple choice is not likely going away entirely, just sharing the stage.

One skeptic of performance assessments for high-stakes accountability purposes is Doug McRae, a retired publisher for the testing division of McGraw-Hill. (The Educated Guess readers may recall that McRae first drew attention to methodology problems with the state’s selection of persistently low-performing schools.) You can read his three-page critique of Darling-Hammond’s overview paper here. McRae concludes that it’s an open question whether performance assessment methodology realistically can be used in high-stakes testing. Relying on it may be a “Trojan horse” that undermines Obama’s accountability goals.

Whichever consortium wins the grant will write the assessments to common-core standards. If California or any state rejects the adoption of common core, then it would also, in effect, be rejecting a switch to new federal assessments.


  1. “If California or any state rejects the adoption of common core, then it would be also in effect rejecting a switch to new federal assessments.”

    This doesn’t seem to be a necessary conclusion. If California standards are broader/deeper than common core, then they should be good enough preparation for the common core tests. And if the common core tests are really what CA wants and they are cheap enough to implement, what’s wrong with doing so. Heck, not linking the curriculum and the testing so tightly might actually reduce the “teach to the test” effects.


  2. Ms. Darling-Hammond greatly overstates the superiority of so-called “authentic assessment” (a term apparently intended to imply that standardized tests are not authentic). Standardized tests are superior to “authentic” tests in all of the following ways: (1) They are valid [they measure what they are intended to measure] (2) they are consistent [if administered many times, they give the same result], (3) they are reliable [the results don't vary based on who or what is doing the scoring], (4) they cover a broad spectrum of knowledge within a given amount of time, (5) they are very inexpensive to score. Finally, if the test is properly designed, it should simply be a random sample of students’ knowledge of information that should have been taught anyway (i.e., they are aligned with state curriculum frameworks, just like our instructional materials), and (6) the law specifically prohibits “teaching to the test” Education Code Sec. 60611 (a). Finally, there is absolutely no prohibition on teachers using experiments, portfolios, compositions, etc. as part of the curriculum.


  3. Correction: that should have read “(6) if the test is properly designed…” and “(7) the law specifically prohibits…” Mea culpa


  4. Paul: The “idea of not linking curriculum and testing so tightly” is not as appealing as you may think at first blush. You must always describe what students are going to be tested on, otherwise teachers will correctly complain that (a) they don’t know what to teach; and (b) they cannot fairly be held accountable for student results. And the courts are bound to intervene in such situations too. The idea behind alignment between standards and assessment was that teachers *should* teach to the standards, and hence also to the test, as they are essentially one and the same. The law says teachers should not do test-prep (as edpolicywonk writes below) in the sense of wasting time trying to teach students test-taking tricks or focus only on old sample items instead of teaching the standards. Do some teachers subvert this? It does seem so. There is also a law against stealing. Still, some people steal. So?


  5. Ze’ev: Agreed. That’s why I prefaced that statement with the assumption that California standards would be adequate preparation. Not sure anyone has had the time to verify that yet.


  6. Hmmmm…teachers complaining that they don’t know what to teach. I don’t recall that as being a significant (or identifiable) problem. And with 35 years in the classroom I guess I would have heard if teachers were plagued with that issue. Ever notice that as the demand for “highly qualified” teachers gets ratcheted up there is an equal but opposite demand they comply with ever more standardized and rigid curriculums? Teachers know very well what to teach. Just get the testing/standards jihad out of their way and let them do it. The elephant in the room on testing is that it is run by large corporations who couldn’t give a hoot about kids, teaching, schools, and learning. They care about dollars and the bottom line. Bubble-in tests demean learning and “authentic” (essay et al) tests will be run through testing boiler rooms with anyone willing to work at minimum wages cranking out dozens of test scores per hour. The other alternative is to hire teachers at per diem to do the scoring and that cuts into the bottom line. It is ironic that Diane Ravitch may be the first to begin driving a stake into the heart of this educationally destructive “accountability” beast, but so be it.


  7. Gary: I can’t speak about you, but I do recall teachers complaining that they can’t be held accountable for teaching the standards until they get aligned textbooks and until they get staff development on those textbooks and on the standards. Vehemently and repeatedly complaining. As recently as 1998 through 2003-4, until full alignment and PD for the standards was completed. In my book that is equivalent to “they don’t know what to teach.” Are you telling me now that those teachers had been lying to everyone for all those years, and what they really wanted is simply to avoid being held responsible for teaching the standards? I can’t believe it! Shame on you for implying that!


  8. Gary: On a separate note, I agree with you that ‘”authentic” (essay et al) tests will be run through testing boiler rooms with anyone willing to work at minimum wages cranking out dozens of test scores per hour.’ However, I disagree with you when you say that “[b]ubble in tests demean learning.” Do you feel demeaned when a doctor takes your pulse, temperature, and blood pressure in his office rather than spending an hour watching how you “authentically” exercise in your neighborhood gym?


  9. In John Fensterwald’s article “Standardized tests’ Holy Grail,” Doug McRae’s comments are used as a counter to arguments made in a paper by Linda Darling Hammond and Frank Adamson. Although counterpoints are welcome to help shape public policy, Doug McRae makes several questionable statements.

    First, to set the record straight, Lee Cronbach did not confirm that individual CLAS scores lacked reliability. Reliability is an indicator of the consistency or accuracy of tests (or other measures). Cronbach concluded for individual reading, writing and mathematics scores that there was a satisfactory level of accuracy for reporting student scores (Sampling and statistical procedures used in the California Learning Assessment System: Report of the select committee. July 25, 1994, page 49). The Cronbach report was mostly critical of how logistics and operations (e.g. sampling papers to score) undermined the accuracy of test scores, not the content (or item type) of the test itself.

    Second, McRae’s statement that good reliability requires 60-75 data points (i.e., multiple choice items) for accountability tests is simply not true. It is true, generally speaking, that as the number of test items increases, test score reliability also increases. However, there is a point of diminishing returns. If for the moment we confine ourselves to standardized large scale multiple choice tests, increasing a test from two items to ten items increases the reliability coefficient from about .30 to .65. The reliability coefficient is one measure of test score reliability and ranges from 0.0 (i.e., no reliability) to 1.0 (perfect reliability). So, an increase of eight items increases the reliability coefficient about .35. Increasing a test from 50 to 75 items increases the reliability from about .90 to .93. In this case increasing the test by twenty-five items increases the reliability coefficient only about .03. If the test was increased to 100 items the reliability coefficient would be about .96. So, doubling a test from 50 to 100 items only increases the reliability coefficient about .06. There is no accepted consensus as to what is meant by good reliability. However, one rule of thumb for making inferences about individual test scores is to aim for a reliability coefficient of about .90 (which is accomplished by using about 50 multiple choice items). So, Doug McRae’s statement that 60-75 data points is a requirement for accountability tests is not true. It’s just his opinion.

    Third, the statement that the reliability for 30-35 data points (i.e., multiple choice items) would be too low for an accountability test is not true in that it ignores the fact the test would also include two or more constructed response items. Each constructed response item is equivalent to several multiple choice items. A standardized large scale test with 30 multiple choice items and three constructed response items (with a scoring rubric of 0-5) would have a reliability coefficient comparable to a 60 item multiple choice test.
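The diminishing-returns pattern in the second point above can be illustrated with the Spearman-Brown prophecy formula, a standard way to project reliability when a test is lengthened with comparable items. The comment does not say which formula its figures come from, so this is only a sketch under that assumption; the two-item starting value of .30 is taken from the comment, and the function names are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Project reliability when a test is made length_factor times as long."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Assumption: anchor on the comment's two-item test with reliability of about .30.
BASE_ITEMS = 2
BASE_RELIABILITY = 0.30


def projected(items: int) -> float:
    """Projected reliability of a test with the given number of items."""
    return spearman_brown(BASE_RELIABILITY, items / BASE_ITEMS)


if __name__ == "__main__":
    for n in (2, 10, 50, 75, 100):
        print(f"{n:3d} items -> reliability of about {projected(n):.2f}")
```

Under these assumptions the projections land within a few hundredths of the "about" figures quoted in the comment, and the gain from 50 to 100 items is far smaller than the gain from 2 to 10.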


  10. If Linda Darling Hammond is behind this effort, then I feel confident that it has merit and will succeed.


  11. It is not helpful to categorize tests as either “authentic” or “standardized.” These terms are not clear and are not mutually exclusive (i.e., an authentic test (whatever that means) could be standardized). It is somewhat more helpful to classify tests as either “multiple choice” or “performance assessments.” Then you still have to define/describe what is meant by a “performance assessment.” Typically for large scale testing programs performance assessments are made up of a combination of constructed response (e.g., writing an essay or conducting an experiment) and multiple choice items. And, it must be emphasized that multiple choice tests are not more valid or reliable than performance assessments. The only advantage that multiple choice tests have over performance assessments is that they are cheaper to develop, administer, and score. However, if there is no need to generate individual scores (i.e., there would be school, district, county, and state scores), costs for performance assessments can be made comparable to multiple choice tests.


  12. Well, OK hsingi, you have me digging in my dusty file closet for the 1994 Cronbach Report to respond to your challenge of three “questionable statements” in my critique of the Linda Darling Hammond paper that John linked in his post. First, re Cronbach not confirming that individual CLAS scores lacked reliability, your summary of the Cronbach findings on page 49 of his report isn’t upheld by the report language itself; that section of the report is based on speculation on the mix of types of items in CLAS and assumptions for a favorable mix; the quote on the top of page 49 refers only to a pilot study for writing and is not generalizable to all content areas; finally, the legislative language authorizing CLAS called for individual student reliable scores, and with the less than stellar report on this topic from the Cronbach Report, that became the rationale for Gov Wilson’s vetoing the CLAS re-authorization later in 1994. Second, on my statement that good reliability for high stakes accountability tests requires 60-75 data points, that indeed is an opinion. However, in my forty years in the business a generally accepted consensus has been that reliabilities of .90 or better [on the scale of .00 to 1.00 that you mention] are good targets for a system of K-12 tests, and that given variations from content area to content area and grade level to grade level having 60-75 data points will get one to those targets. Of course, there will be exceptions, and I’ll acknowledge it is possible to get acceptable reliability with fewer data points. Third, my reference to 30-35 data points for the “high quality” design described in the Hammond paper (p 39) includes the constructed response portion of the design.
I would challenge your statement that a test with 30 MC items and 3 CR items (with a 6-point scoring rubric) would have a reliability coefficient equal to a 60 item MC test; the CR items would contribute to somewhat higher reliabilities than the 30 MC test by itself, but by nowhere near as much as you claim. The bottom line is that heavy use of performance assessment methodology will increase testing time, increase testing cost, and be very challenged to attain the reliability needed for high stakes accountability usage, despite some acknowledged progress in the field over the past 20 years. Doug McRae, Retired Test Publisher, Monterey, CA


  13. hsingi: You write that “multiple choice tests are not more valid or reliable than performance assessments.” This is actually incorrect. Multiple choice items in general are more valid in assessing underlying constructs, if for no other reason than they are less prone to confounding due to heavy linguistic load for math or science items. There are more reasons, but that’s the most obvious. Darling-Hammond’s argument that this can possibly be eliminated is no more than a marketing pitch. Further, you are correct in your argument that reliable scores for PA can be inexpensively given at school level and above. However, the current MC-based tests are already reliable at school level and above. The push of the Bush and Obama administrations (and of parents) is to assure individual accountability for every child. Aggregate scores will let schools yet again play games with hiking school averages while leaving a large chunk of disadvantaged students behind.


  14. Ze’ev:

    While I think it’s great that lay-people are involved in education there are times, and this is one, when they (and education/kids/schools/etc.) need to tend to their own knitting, silicon-chips…whatever. This will be for the good of everyone. You mistake teachers’ complaints. Once the phony and demeaning tests were in place teachers (with no choice or voice in the matter) had a legitimate complaint that they needed texts and professional development. All were aware though that it was an artificial and ideological construct. I don’t know any experienced teachers who haven’t thought the standards, particularly math, stink to high heaven. As to the bubble tests narrowing the curriculum and, as a consequence, demeaning education I don’t think I can say it any better than Ravitch (though I have been saying it much longer). The whole standards, testing, bar-raising, top-down, phony accountability, etc., etc., debacle was never founded on anything but the fantasies of neo-cons (and some neo-liberals) and their refusal to accept the reality that the US’s brutal economic system resulted in wealth gaps and social capital gaps that led to achievement gaps. CA has embarked on a quest for the Grail for almost 20 years now and it has been like a quest for the gold at the end of the rainbow. All make-believe and a cruel joke on the teachers and students of the state. There was no reason to believe this was all going to result in closing gaps or increased achievement and it hasn’t. Kids in relatively wealthy communities are still doing well and kids in not so wealthy communities are still doing not so well.


  15. To McRae: First: The section of the Cronbach report to which I am referring is the section on the reliability (i.e., accuracy) of individual student scores. The report concludes that accuracy estimates for writing, reading, and mathematics would all be acceptable. However, estimating the reliability of individual student scores was complicated by a couple of issues (if not more). First, CLAS was basically a matrix design intended to produce group level scores, so there were few items per student. The initial administrations weren’t geared for producing individual scores. Second, a mapping process was used to combine the information from the multiple choice and constructed response items to generate individual performance levels (i.e., scores). (Maybe mapping wasn’t the best strategy for combining information across item types.) David Wiley had to be creative in estimating the accuracy of these scores. But there was too little money and too little time to make the changes needed to generate the individual scores that Governor Wilson wanted. So he axed the program. The point I’m trying to make is that Cronbach’s criticisms about reliability had to do with logistics and operations (used to produce the group level scores) not with the items used to generate scores. Yet the press and others continue to refer to CLAS as unreliable (as if there was something inherently wrong with the test). The Cronbach report does not support that popular contention. Second: Why push for a 75 item test when acceptable reliability can be had with a 50 item test? Isn’t there some desire for reduced testing time? Third: I need to revise my original statement. I overestimated the reliability coefficient of a 30 multiple choice and 3 constructed response item test. A colleague and I made some rough estimates and conclude that the reliability coefficient would be about .90 (or about the same as a 50 item test).
As further explanation, a 30 item multiple choice test (functioning the way it should) has a reliability coefficient of about .85. Three constructed response items (on a 5 or 6 point scale functioning as intended) have a reliability coefficient of about .70 (or even .80). (As a point of comparison, 3 multiple choice items have a reliability coefficient of about .35.) So, if you are starting with a reliability coefficient of .85 and adding information it has to go up from there. The additional information from the constructed response items raises the coefficient to about .90.
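One rough way to reproduce the revised estimate above is to convert the constructed response section into an "equivalent number" of multiple choice items and then apply the Spearman-Brown formula to the combined length. This is a simplification and not necessarily the method the commenter and colleague used: it assumes both sections measure the same construct and behave like parallel items. The input figures (.85 for 30 multiple choice items, .70 to .80 for three constructed response items) come from the comment; the function names are hypothetical.

```python
def per_item_reliability(reliability: float, items: float) -> float:
    """Invert Spearman-Brown: reliability of one item, given the whole test."""
    return reliability / (items - (items - 1.0) * reliability)


def spearman_brown(item_reliability: float, items: float) -> float:
    """Reliability of a test built from `items` parallel items."""
    return items * item_reliability / (1.0 + (items - 1.0) * item_reliability)


def equivalent_items(reliability: float, item_reliability: float) -> float:
    """How many parallel items would be needed to reach `reliability`."""
    return reliability * (1.0 - item_reliability) / (
        item_reliability * (1.0 - reliability)
    )


# Figures quoted in the comment: 30 multiple choice items at about .85.
MC_ITEMS = 30
MC_RELIABILITY = 0.85
R_ITEM = per_item_reliability(MC_RELIABILITY, MC_ITEMS)


def combined_reliability(cr_reliability: float) -> float:
    """Assumption: treat the CR section as extra parallel MC items."""
    total_items = MC_ITEMS + equivalent_items(cr_reliability, R_ITEM)
    return spearman_brown(R_ITEM, total_items)


if __name__ == "__main__":
    for cr in (0.70, 0.80):
        print(f"CR section at {cr:.2f} -> combined {combined_reliability(cr):.2f}")
```

Under these assumptions the combined coefficient comes out close to .90 for either constructed response value, and the same per-item figure puts three multiple choice items near the .35 mentioned in the comment. A fuller treatment would need the score variances of the two sections and the correlation between them.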


  16. To Wurman: Let’s confine the discussion to validity. Validity (in testing) is the accuracy of inferences made from test scores. Validity exists along a continuum (i.e., test scores are more or less valid (for particular purposes)). So, you infer a student knows algebra 1 because of a high score on an algebra 1 test. What evidence supports your inference? Let’s compare a one item algebra 1 multiple choice test to a one item constructed response test. From the one item answer you need to infer how well the test taker knows (or has learned) algebra 1. A single multiple choice item will provide very little information to make your inference. A student could have the correct answer on this one item but know little about other aspects of algebra 1. The item could be correct because the student guessed. The item could be incorrect because the student mistakenly marked the wrong answer. And, so on … Now if the same prompt is used (so the language load is exactly the same) and the student is asked to solve a problem and show his/her work there is much more information available to make the inference about algebra 1 knowledge. Does the one item multiple choice test seem more or less valid than the one item constructed response test? Multiple choice items gain validity by asking lots of items in a reasonable length of time. At this point I should mention that I am not advocating performance assessments for large scale testing programs. As already stated, performance items are expensive to develop, administer (e.g., in terms of time), and score. There are also more ways for problems to occur in administration and scoring.


  17. To hsingi: Methinks you read far too much into my reference of the Cronbach report in my critique. The big picture is that the statutory language for the CLAS program called for individual student reliable scores; the Cronbach report documented that CLAS did not produce such scores; so, CLAS crashed and burned at considerable expense to CA taxpayers. For the CLAS experiment, as well as Vermont’s and Kentucky’s experiments with performance assessment methodology, and for the lessons they provide for RTTT Assessment, it is fair to say “Been There. Done That. It Failed.” That’s a good big picture basis for being skeptical about promoting performance assessment methods for the “next generation” accountability tests for the very high stakes uses envisioned by Obama/Duncan initiatives. Methinks this view is appropriate regardless whether one can split hairs whether the Cronbach report fingered the method itself or the implementation of the method, or maybe someplace in between.
    Second: Re my statement about 60-75 items, I would agree that if a fewer number of items can generate good reliabilities with reduced testing time, that argues for fewer items on the tests, tho with a large testing system featuring many grades and content areas one doesn’t want to allow a minimally acceptable reliability target to dictate test length by itself. And I absolutely agree there is great pressure from the trenches for reduced testing time: minimizing testing time consistent with good validity, reliability, fairness, and comparability is the name of the test design game. In that context, reliance on performance assessment items (especially extended CR etc) greatly expands testing time and that fact argues against performance assessment methodology. Third: Re your discussion about how much CR items can increase reliabilities, methinks we are back in the weeds again. The LDH report suggests a “high quality” design might involve 25 MC items each for Math and Reading, and 10 MC items for Writing. Seems to me increasing testing time by 3-fold [that was my napkin estimate for the testing time difference between LDH’s page 39 “current typical” vs “high quality” designs], and increasing costs also, via use of performance assessment methods sufficient to generate the kind of stable consistent reliable accountability data over time to support the kinds of high stakes usage envisioned by the Obama/Duncan initiatives is a very questionable proposition. I’m skeptical, and that is what my critique says. I’m not against performance assessment methods per se, and in fact have argued a number of times we should expand (in particular) use of objectively scored CR items, but wholesale reliance on performance assessment methods for summative accountability tests is and should be a questionable proposition. Doug McRae, Retired Test Publisher, Monterey, CA


  18. “If Linda Darling Hammond is behind this effort, then I feel confident that it has merit and will succeed”

    That’s funny. The minute I read that LDH was pushing testing, my thought was “She must think that this testing method will hide the achievement gap”. I heard LDH go on at length about “authentic” performances at a talk she gave eighteen months ago. It was, to put it mildly, unconvincing.

    I thought Doug McRae’s analysis was excellent, for what it’s worth.


"Darn, I wish I had read that over again before I hit send.” Don’t let this be your lament. To promote a civil dialogue, please be considerate, respectful and mindful of your tone. We encourage you to use your real name, but if you must use a nom de plume, stick with it. Anonymous postings will be removed.

© Thoughts on Public Education 2014