Value-added evaluations can be designed but can limitations be understood?
The notion of using student test scores for teacher evaluations has been a professional dilemma for me for many years.
On the one hand, using student gains to evaluate teacher effectiveness, as in various “value added” schemes, seems to be a no-brainer, logical thing to do. On the other hand, implementing such schemes seems to violate professional standards for valid, reliable, and fair use of student test scores.
My first job after finishing a Ph.D. in psychometrics was as a Director of Testing at a medium-sized school district in Michigan in the early 1970s. The first week on the job, I found myself at a coffee house on South Main Street being interviewed by a morning talk show on a local radio station. The first question: “Dr. McRae, will the testing program you design for our district be used to evaluate teachers?” Gulp! Fresh out of graduate school, welcome to the real world of K-12 educational testing! I stumbled through an answer, something to the effect that student tests were not as precise a measurement tool as an outdoor temperature gauge.
Later that week, I met the leadership of the local chapter of the Michigan Education Association, the district’s teacher union, at a Sizzler on North Main. I talked to the two local officers, who were decked out in team jackets with UAW and Teamster patches, among other readily apparent union loyalties. I was first informed that my central office administrator salary was indexed to the teacher salary scale … and then grilled on whether I thought student test scores should be used for teacher evaluation purposes. Double welcome to the world of K-12 testing!!
After more than 35 years, the professional dilemma on use of student test scores – particularly following individual student test scores over multiple years – for teacher evaluation still remains, although the parameters of the dilemma are better known and more complex. While not solving the dilemma entirely, I would offer three actions that California policymakers might take to improve the odds of implementing valid, reliable, and fair use of student test scores for teacher evaluation.
- Efficiently implement a robust student data tracking system a la CALPADS, and a similarly robust teacher data tracking system a la CALTIDES. These systems are complex, but they are not groundbreaking brain surgery. Anything more than a year or so for design, plus another couple of years for 99-plus percent accurate implementation, is suspect. Both student and teacher data systems are required before value-added systems can validly be implemented.
- Add vertical scaling to California’s STAR assessment system to justify comparison of test scores across grades – for example, to compare 5th grade test scores to 6th grade test scores following an individual student or a cohort of students. If the architecture of a K-12 assessment system may be described in terms of bricks and mortar, then grade-specific test questions and test forms are the bricks and things like vertical scaling are the mortar. STAR currently has no mortar that keeps orientation of the bricks solidly in place. At times it is claimed that value-added teacher evaluation systems can be done without elements such as vertical scaling. However, alignment is essential. To use another analogy, to predict the number of apples harvested next year, the number of apples harvested this year would be a better predictor than the number of oranges. No matter what value-added engineering is chosen, it will work far better with vertically scaled “mortar” to maintain stable relationships among scores from adjacent grades. Vertical scaling properties can be added to our current STAR system in 18-24 months at relatively minimal expense.
- Develop guidelines to address the “problem of attribution.” This may be the most difficult problem to resolve before value-added systems can be used for teacher evaluations. The “attribution” problem refers to the fact that just associating a student test score with a teacher’s name isn’t good enough for high-stakes teacher evaluation. Some sort of minimal instructional “connect time” between the student and the teacher also has to be established before it can fairly be claimed that student outcomes are related to teacher performance. For example, student-teacher “connect time” may be compromised due to student or teacher absenteeism, student mobility, changes in teacher assignments, teacher leaves-of-absence, and/or teacher team teaching arrangements.
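To make the “connect time” idea concrete, here is a minimal sketch in Python of how a data system might screen student-teacher links before any score is attributed to a teacher. Everything in it is an assumption for illustration: the record fields, the 180-day school year, and the 75 percent threshold are hypothetical and are not drawn from CALPADS, CALTIDES, or any adopted guideline. It also ignores team teaching and mid-year reassignments, which real guidelines would have to address.

```python
from dataclasses import dataclass


@dataclass
class Enrollment:
    """One student-teacher link for one school year (hypothetical record layout)."""
    student_id: str
    teacher_id: str
    days_enrolled: int     # instructional days the student was on this teacher's roster
    student_absences: int  # days the student was absent while enrolled
    teacher_absences: int  # days the teacher was absent while the student was enrolled


def connect_fraction(e: Enrollment, days_in_year: int = 180) -> float:
    """Conservative share of the school year the student and teacher were both present.

    Subtracting both absence counts treats them as never overlapping,
    so the result is a lower bound on the true days together.
    """
    days_together = max(e.days_enrolled - e.student_absences - e.teacher_absences, 0)
    return days_together / days_in_year


def attributable(enrollments, min_fraction: float = 0.75, days_in_year: int = 180):
    """Keep only the links that meet the (assumed) minimum connect-time threshold."""
    return [e for e in enrollments if connect_fraction(e, days_in_year) >= min_fraction]


if __name__ == "__main__":
    roster = [
        Enrollment("s1", "t1", days_enrolled=180, student_absences=5, teacher_absences=3),
        Enrollment("s2", "t1", days_enrolled=90, student_absences=2, teacher_absences=1),
    ]
    for e in attributable(roster):
        print(e.student_id, round(connect_fraction(e), 3))
```

In this sketch a student who transferred in at mid-year (the second record) is excluded from the teacher's results. Where to set the threshold is a policy judgment, not something the code can settle.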
While I’d like to think student test scores can be used for teacher evaluations in a valid, reliable, and fair manner, I admit I wonder whether such a goal will be accomplished and accepted by all within the next 35 years. Putting operational things like student and teacher data tracking systems in place, as well as mortar like vertically scaled test scores, may only take several years. But it is likely to take much longer to address issues like the “problem of attribution” as well as to change the collective mindset among students, teachers, administrators, policymakers and the public to accept both the potentials and the limitations inherent in using student test scores for high-stakes teacher evaluations.
Doug McRae is a retired educational measurement specialist living in Monterey. In his 40 years in the K-12 testing business, he has served as an educational testing company executive in charge of design and development of K-12 tests widely used across the US, as well as an adviser on the initial design and development of California’s STAR assessment system. He has a Ph.D. in Quantitative Psychology from the University of North Carolina, Chapel Hill.






As usual, Doug gives a thoughtful analysis of some of the issues involved in using VAM for teacher evaluations. A problem arises, though, because in the current political climate thoughtful approaches get heaved out with the bathwater. There are those desperate to keep a focus on teachers and their “accountability” so that there is no focus on the social and political lack of accountability that accounts for the fact that kids show up at school with an “achievement gap” already present. The achievement gap grows during school breaks because of a lack of supports for children’s needs in homes and communities. (The homes and communities don’t have the capacity for support. This should not be construed as more finger-pointing at the powerless.) The real remedies to this situation are investments in human capital on the scale of the Marshall Plan (in current dollars) directed mostly to communities of color. We should all hold our collective breath until that happens.
It would appear, too, that Doug would still have to arm-wrestle with the objections to using test scores for high-stakes decisions raised by the American Psychological Association and the National Research Council. The NRC views VAM with outright alarm.
CA’s own test vendor, ETS, in two studies of its own test results (The Family: America’s Smallest School and Parsing the Achievement Gap), indicates that school-related factors represent 1/3 of test score variability and teacher effects are only a part of that 1/3. Current “accountability” efforts propose that the 1/3 tail wag the 2/3 dog. Not likely. That won’t stop the conservatives and neo-liberals from insisting that it happen, though. Consideration of these issues does put the current debate on tax cuts for the wealthy into perspective, and it’s avoiding that consideration and perspective that drives the frenzied scapegoating of teachers.
Some practical issues involve the collective bargaining aspects. Teachers working at the secondary level frequently have no state-adopted tests to use. This presents an issue for bargaining in that you set up a two-tiered evaluation system where one group in the bargaining unit has test scores to be used for evaluation and one does not. At the eighth grade, the social studies CST asks questions based on the 6th-, 7th-, and 8th-grade curriculum. Whose evaluation should be based on that data? Then there are the music, art, industrial arts, drama, speech, and PE teachers to consider. At the elementary level, test scores in math and reading are available. As Diane Ravitch pointed out, the more “accountability” is tied to those scores, the more the curriculum is narrowed. Bad outcome.
Some will suggest we don’t need to consider collective bargaining issues: This is about the children! Recall that teacher working conditions are negotiated in collective bargaining and those “working conditions” are kids’ learning conditions. School is still important, even if it is not the dominant variable in test scores.
I also found this analysis to be careful and thoughtful, but typical of researchers. They always seem to think there’s a way to quantify and analyze everything. Doug admits we’re not ready yet but holds out hope that we might be someday. Gary’s citation provides a good counterpoint. I would add that in secondary schools attribution should be considered impossible. In my own blog posts, I have cited study after study after study showing the various factors that impact test scores – school schedules, peers, administrators, support programs, etc. Reading is particularly hard to test, because in many cases students don’t need to read the selections to answer the questions, and in any case, students practice reading all day long in every class. The most effective schools are probably using reading strategies instruction in every class as they cover different content. If almost everything might have some impact on test scores, how can anyone logically claim that we can hold certain teachers alone accountable for the results? The only way to do that would be to assume that all the unidentified and unquantified variables in a school can be held constant long enough to generate VAM results, and that these factors will affect all teachers equally. Any honest participant in the debate must acknowledge this imperfection, and it is such a significant imperfection that it ought to be considered a fatal flaw in the use of VAM for teacher evaluation. “Well, it would only be used for 30% of the evaluation,” VAM supporters respond. Sorry – the arbitrary selection of a percentage for the evaluation does NOTHING to mitigate the flaw. Imagine a doctor relying on unreliable information for a diagnosis, but only 30% of the diagnosis. “Today, the patient’s blood pressure is 120/80 (± 30), and yesterday it was 130/90 (± 30). Therefore, the patient’s blood pressure is probably going down, and the reason must be…” Please! And why 30%, instead of 34% or 22%?
They just make up these numbers so that they sound palatable, but there’s no meaning and no research behind the numbers. James Popham, one of the giants in the field of assessment, concluded that the most reasonable approach to teacher evaluation (given that no method would be foolproof or perfect) would be the professional judgment of fellow educators.