Report: No judging teachers on tests

Problems with value-added analysis
By John Fensterwald - Educated Guess

The same day that the Los Angeles Times went public with an online database rating teachers’ effectiveness based on test scores, a Washington-based nonpartisan think tank released a paper strongly cautioning against such a use.

“Recognizing the technical and practical limitations of what test scores can accurately reflect, we conclude that changes in test scores should be used only as a modest part of a broader set of evidence about teacher practice,” wrote a panel of 10 scholars in a brief published by the Economic Policy Institute.

On Sunday, the Times released the evaluations of 6,000 third-, fourth- and fifth-grade teachers, rating them on a scale from least to most effective in teaching math and English. The Times used value-added analysis, a method that calculates the difference between how individual students were expected to score on standardized tests, based on past years’ results, and how they actually did. That difference is attributed to the impact of the teacher.
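To make the idea concrete, here is a minimal Python sketch of the basic value-added calculation, assuming the simplest possible model: a straight line fitted to all students’ prior-year and current-year scores supplies each student’s “expected” score, and a teacher’s value-added is the average gap between actual and expected scores for that teacher’s students. This is not the Times’ actual model; the function names and sample scores are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples.

    Returns {teacher: mean of (actual - expected) over that teacher's students},
    where "expected" comes from one line fitted across all students.
    """
    a, b = fit_line([r[1] for r in records], [r[2] for r in records])
    gaps = {}
    for teacher, prior, current in records:
        expected = a + b * prior
        gaps.setdefault(teacher, []).append(current - expected)
    return {t: mean(g) for t, g in gaps.items()}

# Hypothetical data: two teachers, two students each.
records = [("A", 300, 330), ("A", 350, 370), ("B", 300, 300), ("B", 350, 350)]
print({t: round(v, 2) for t, v in value_added(records).items()})
# → {'A': 12.5, 'B': -12.5}
```

Because the fitted line includes an intercept, the gaps across all students sum to zero, so each teacher’s number is relative to the average teacher in the data rather than an absolute measure. Nothing in this sketch adjusts for factors such as summer learning loss or non-random class assignments.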

The writers of Problems With the Use of Test Scores to Evaluate Teachers (four of them professors from California, including Stanford’s Linda Darling-Hammond) didn’t say that there’s no value in value-added. On the contrary, they acknowledge that it’s far preferable to quick-and-dirty, unadjusted comparisons of teachers’ scores from year to year or comparisons of one teacher’s averages with another’s.

But the problems with value-added are still significant, which is why, they said, it shouldn’t be used as the primary factor in determining tenure and pay.

The Times obviously doesn’t control Los Angeles Unified’s personnel decisions, but publishing value-added scores is certainly high-stakes, at least for teachers who will be judged by the public and parents on only one measure of what they do.

LAUSD Superintendent Ramon Cortines has called for using value-added test scores as 30 percent of a teacher’s evaluation. He and Deputy Superintendent John Deasy indicated they plan to release the value-added scores confidentially to teachers and to release value-added results for schools publicly. District administrators and United Teachers Los Angeles are to negotiate the elements of performance reviews. All acknowledge that the current system, based on cursory observations and rarely producing a negative review, is a sham.

The Economic Policy Institute concluded that value-added methods are unreliable, and that heavy reliance on results for individual teachers would not improve student achievement. To the contrary, it would distort educational priorities and motivate teachers to put test results ahead of all else.

Value-added analysis is intended to account for students’ demographic backgrounds and other factors. But, the panel found, it cannot adjust for important variables, including:

  • Summer learning loss. Children, especially those from low-income families, tend to regress over the summer, which skews comparisons of test results from one spring to the next;
  • Impact of other teachers, specialists, tutoring and help outside of the classroom, which cannot be attributed to individual teachers;
  • Lack of randomness in assigning students. Some teachers may get a disproportionate share of disruptive students or English learners and other high-needs students;
  • Class size;
  • Attitude toward the tests: Particularly in higher grades, students know the tests won’t affect them personally and don’t take them seriously – randomly answering multiple choice questions in some instances;
  • Rates of students’ learning, which are rarely uniform and will vary for children who may be very far behind or way ahead of their grade level.

“Despite the hopes of many, even the most highly developed value-added models fall short of their goal of adequately adjusting for the backgrounds of students and the context of teachers’ classrooms,” the paper said.

Yearly fluctuations

The Times sought to smooth out the effect of variables by using seven years of data, which excluded less experienced teachers from the database. That is a far better time frame than many districts are using and legislatures are requiring. But studies have found that value-added results are unpredictable and unstable. In one study of urban teachers cited by the report, fewer than one third of the teachers found to be among the most effective 20 percent were classified that way the next year. One third fell to the bottom 40 percent. Such fluctuations will undermine the credibility of the system.
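The instability finding rests on a simple calculation: rank teachers by value-added in two consecutive years and ask what share of the first year’s top group is still in the top group the next year. Below is a minimal Python sketch of that calculation under stated assumptions (a top group of 20 percent, ties broken by input order, only teachers rated in both years counted). It is not the method of the study the report cites, and the teacher names and scores are made up.

```python
def top_group(scores, fraction=0.2):
    """Return the set of teachers in the top `fraction` by score (at least one)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return set(ranked[:k])

def persistence(year1, year2, fraction=0.2):
    """Share of year-1 top-group teachers who are in the top group again in year 2.

    Only teachers with a score in both years are compared.
    """
    common = year1.keys() & year2.keys()
    top1 = top_group({t: year1[t] for t in common}, fraction)
    top2 = top_group({t: year2[t] for t in common}, fraction)
    return len(top1 & top2) / len(top1)

# Hypothetical scores for 10 teachers; t0 and t1 lead in year 1,
# but t1 drops to the bottom in year 2.
year1 = {f"t{i}": 10 - i for i in range(10)}
year2 = dict(year1, t1=0)
print(persistence(year1, year2))  # → 0.5
```

A persistence rate near 1 would mean stable ratings; the study cited above found a rate below one third for the top 20 percent.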

The Times database requires the user to supply the name of a teacher to get results, with one exception. The newspaper published the names, with links to their scores, of the top 100 LAUSD teachers: those whose students gained an average of 17 percentage points in math and 12 points in English on the California Standards Test in a single year.

And that, notwithstanding the valid criticisms of the Times, is one of the benefits of what the newspaper did: drawing attention to teachers who succeed year in and year out, with little notice and no extra reward. In interviewing and observing some of these teachers, Times reporter Jason Felch found different personal styles and methods — but no drill and kill.

LAUSD has also had similar value-added results for teachers but has done nothing with them, either to recognize those who are excelling or to confront those who are failing their students. Value-added analysis offers that opportunity, as long as it’s kept in perspective: one factor in assessing performance, used with its limitations in mind.

6 Comments

  1. We all look for simple answers to complex questions, and the process of educating a child is very complex. Standardized tests are used to gauge where students are in relation to the others, and that’s fuzzy enough. To then extend those results to determine how well the teachers are doing adds another order of magnitude of fuzziness. Just as colleges consider SAT or ACT scores as one element of an applicant’s performance, anyone attempting to assess teachers should take care to look at the bigger picture.

  2. Of course the fine print in the L.A. Times does note that test scores are only one way to judge teachers, but then the entire rest of the feature is packaged in such a way as to clearly convey to the reader that the teachers singled out are failures. As previously noted, it’s far beyond the scope of newspaper reporters to pronounce judgment on who is a good teacher and who is a failure; it’s arrogant, destructive and a total collapse of journalistic standards and ethics. Inasmuch as it implicitly promotes the “it’s a miracle!” simplistic education fads being pushed by the billionaires and other powerful forces, it’s comforting the already comfortable and afflicting the already afflicted — the opposite of what most of us went into journalism to do.

  3. A major problem with using CST scores in California, and one that is rarely addressed, is the scaling of the raw scores. I have seen students answer the same number of questions correctly, out of the same number possible, in two consecutive years and see their scaled score drop by as many as 60 points from the first year to the second. Their proficiency level also dropped by one level.

    I have also seen students answer fewer questions correctly in their second year and yet see their scaled score go up and their proficiency level rise. In that case, a student answers fewer questions correctly, yet the teacher would be seen as having had a positive impact on student learning.

    This tends to have a greater impact on students at the higher end of the scale, where one question right or wrong can be worth 40 to 60 points, whereas a student at the bottom end can answer anywhere from 0 to 10 questions correctly and receive the same scaled score.

    The reporting and scaling of scores in California need to change for this concept to work. In theory this is a good idea, but in California it simply cannot work, given how scores are scaled and reported. Try explaining all of this to students and parents in a way that makes sense.

    Take a look at:
    http://www.cde.ca.gov/ta/tg/sr/documents/csttechrpt09.pdf

    You can see all of the scaling for each test administered in 2009.

  4. Thanks very much for posting this, John. I appreciate your being “fair and balanced.” Being a teacher leader, I was extremely dismayed by the LA Times articles and heartened by the EPI report. Unfortunately, the EPI report doesn’t seem to have gotten nearly as much publicity as the Times pieces. Thanks again!

  5. Well, OK, Amanda raises technical issues questioning the viability of value-added methods for teacher evaluations, particularly the scaling of the STAR tests. Amanda is correct: the lack of “vertical” scaling, which would let people validly compare (say) 5th-grade scores to 6th-grade scores, is a problem for any value-added system. [CA has lots of company in the "non-vertically-scaled statewide tests" category: roughly half the state assessment systems across the country do not currently have vertically scaled tests.]

    But simply constructing vertical scales for our tests will not be sufficient to justify their use for value-added methods. In the early 1990s, the father of value-added methods for education (Bill Sanders, then a statistician in Tennessee) studied the national tests in use across the US, all of which had strong vertical scaling properties, and concluded they were not strong enough to support his proposed value-added analyses for teacher evaluation purposes. He solved this problem by requiring 3-year rolling averages, which stabilized the longitudinal test data enough to support his value-added methods. I mention this as an indicator of the degree of work needed to use value-added methods validly.

    I should also mention that there are techniques other than vertical scaling that can support value-added analyses. In fact, the LA Times analyses use a “regression” approach rather than vertical scaling. Again in the early 1990s, Bob Linn [one of the authors of the Economic Policy Institute study that is the focus of John's blog post] wrote a widely cited article on methods for “linking” tests that covered both vertical scaling and regression methods. Generally, Linn showed that vertical scaling is the stronger method if it can be done, while regression methods have problems with repeatability over time [in technical language, they don't cross-validate well].

    The bottom line is that while value-added methods are interesting and certainly have potential, a host of technical issues must be solved for good, solid value-added applications. It will take some time for these issues to have mature solutions acceptable to all before value-added methods can be widely used for high-stakes applications such as teacher evaluations. At least, that’s my view on the value-added dust-up generated by the LA Times article. Doug McRae, Retired Test Publisher, Monterey, CA

  6. I took a quick look at the report mentioned and it seems like much more of an advocacy piece than a work of disinterested scholarship. Sadly that is too often the case with “scholarly” work on educational policy issues. If one knows the sponsor and the authors, it is often easy to predict the results before reading the piece.

© Thoughts on Public Education 2012