<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Standardized tests&#8217; Holy Grail</title>
	<atom:link href="http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/feed/" rel="self" type="application/rss+xml" />
	<link>http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/</link>
	<description>Analysis, opinion and ruminations on California education policy</description>
	<lastBuildDate>Tue, 07 Feb 2012 18:57:59 +0100</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Cal</title>
		<link>http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/comment-page-1/#comment-10106</link>
		<dc:creator>Cal</dc:creator>
		<pubDate>Sun, 18 Apr 2010 16:01:18 +0000</pubDate>
		<guid isPermaLink="false">http://educatedguess.org/blog/?p=1766#comment-10106</guid>
		<description>&quot;If Linda Darling Hammond is behind this effort, then I feel confident that it has merit and will succeed&quot;

That&#039;s funny. The minute I read that LDH was pushing testing, my thought was &quot;She must think that this testing method will hide the achievement gap&quot;. I heard LDH go on at length about &quot;authentic&quot; performances at a talk she gave eighteen months ago. It was, to put it mildly, unconvincing. 

I thought Doug McRae&#039;s analysis was excellent, for what it&#039;s worth.</description>
		<content:encoded><![CDATA[<p>&#8220;If Linda Darling Hammond is behind this effort, then I feel confident that it has merit and will succeed&#8221;</p>
<p>That&#8217;s funny. The minute I read that LDH was pushing testing, my thought was &#8220;She must think that this testing method will hide the achievement gap&#8221;. I heard LDH go on at length about &#8220;authentic&#8221; performances at a talk she gave eighteen months ago. It was, to put it mildly, unconvincing. </p>
<p>I thought Doug McRae&#8217;s analysis was excellent, for what it&#8217;s worth.
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Doug McRae</title>
		<link>http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/comment-page-1/#comment-9718</link>
		<dc:creator>Doug McRae</dc:creator>
		<pubDate>Thu, 15 Apr 2010 03:52:18 +0000</pubDate>
		<guid isPermaLink="false">http://educatedguess.org/blog/?p=1766#comment-9718</guid>
		<description>To hsingi:  Methinks you read far too much into my reference to the Cronbach report in my critique.  The big picture is that the statutory language for the CLAS program called for individual student reliable scores; the Cronbach report documented that CLAS did not produce such scores; so, CLAS crashed and burned at considerable expense to CA taxpayers. For the CLAS experiment, as well as Vermont&#039;s and Kentucky&#039;s experiments with performance assessment methodology, and for the lessons they provide for RTTT Assessment, it is fair to say &quot;Been There. Done That. It Failed.&quot;  That&#039;s a good big picture basis for being skeptical about promoting performance assessment methods for the &quot;next generation&quot; accountability tests for the very high stakes uses envisioned by Obama/Duncan initiatives.  Methinks this view is appropriate regardless of whether one can split hairs over whether the Cronbach report fingered the method itself or the implementation of the method, or maybe someplace in between.
Second: Re my statement about 60-75 items, I would agree that if a smaller number of items can generate good reliabilities with reduced testing time, that argues for fewer items on the tests, tho with a large testing system featuring many grades and content areas one doesn&#039;t want to allow a minimally acceptable reliability target to dictate test length by itself.  And I absolutely agree there is great pressure from the trenches for reduced testing time -- minimizing testing time consistent with good validity, reliability, fairness, and comparability is the name of the test design game.  In that context, reliance on performance assessment items (especially extended CR etc) greatly expands testing time and that fact argues against performance assessment methodology.  Third: Re your discussion about how much CR items can increase reliabilities, methinks we are back in the weeds again.  The LDH report suggests a &quot;high quality&quot; design might involve 25 MC items each for Math and Reading, and 10 MC items for Writing.  Seems to me increasing testing time by 3-fold (that was my napkin estimate for the testing time difference between LDH&#039;s page 39 &quot;current typical&quot; vs &quot;high quality&quot; designs), and increasing costs also, via use of performance assessment methods sufficient to generate the kind of stable consistent reliable accountability data over time to support the kinds of high stakes usage envisioned by the Obama/Duncan initiatives is a very questionable proposition.  I&#039;m skeptical, and that is what my critique says. I&#039;m not against performance assessment methods per se, and in fact have argued a number of times we should expand (in particular) use of objectively scored CR items, but wholesale reliance on performance assessment methods for summative accountability tests is and should be a questionable proposition.  Doug McRae, Retired Test Publisher, Monterey, CA</description>
		<content:encoded><![CDATA[<p>To hsingi:  Methinks you read far too much into my reference to the Cronbach report in my critique.  The big picture is that the statutory language for the CLAS program called for individual student reliable scores; the Cronbach report documented that CLAS did not produce such scores; so, CLAS crashed and burned at considerable expense to CA taxpayers. For the CLAS experiment, as well as Vermont&#8217;s and Kentucky&#8217;s experiments with performance assessment methodology, and for the lessons they provide for RTTT Assessment, it is fair to say &#8220;Been There. Done That. It Failed.&#8221;  That&#8217;s a good big picture basis for being skeptical about promoting performance assessment methods for the &#8220;next generation&#8221; accountability tests for the very high stakes uses envisioned by Obama/Duncan initiatives.  Methinks this view is appropriate regardless of whether one can split hairs over whether the Cronbach report fingered the method itself or the implementation of the method, or maybe someplace in between.<br />
Second: Re my statement about 60-75 items, I would agree that if a smaller number of items can generate good reliabilities with reduced testing time, that argues for fewer items on the tests, tho with a large testing system featuring many grades and content areas one doesn&#8217;t want to allow a minimally acceptable reliability target to dictate test length by itself.  And I absolutely agree there is great pressure from the trenches for reduced testing time &#8212; minimizing testing time consistent with good validity, reliability, fairness, and comparability is the name of the test design game.  In that context, reliance on performance assessment items (especially extended CR etc) greatly expands testing time and that fact argues against performance assessment methodology.  Third: Re your discussion about how much CR items can increase reliabilities, methinks we are back in the weeds again.  The LDH report suggests a &#8220;high quality&#8221; design might involve 25 MC items each for Math and Reading, and 10 MC items for Writing.  Seems to me increasing testing time by 3-fold (that was my napkin estimate for the testing time difference between LDH&#8217;s page 39 &#8220;current typical&#8221; vs &#8220;high quality&#8221; designs), and increasing costs also, via use of performance assessment methods sufficient to generate the kind of stable consistent reliable accountability data over time to support the kinds of high stakes usage envisioned by the Obama/Duncan initiatives is a very questionable proposition.  I&#8217;m skeptical, and that is what my critique says. I&#8217;m not against performance assessment methods per se, and in fact have argued a number of times we should expand (in particular) use of objectively scored CR items, but wholesale reliance on performance assessment methods for summative accountability tests is and should be a questionable proposition.  Doug McRae, Retired Test Publisher, Monterey, CA
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: hsingi</title>
		<link>http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/comment-page-1/#comment-9701</link>
		<dc:creator>hsingi</dc:creator>
		<pubDate>Wed, 14 Apr 2010 23:04:35 +0000</pubDate>
		<guid isPermaLink="false">http://educatedguess.org/blog/?p=1766#comment-9701</guid>
		<description>To Wurman: Let’s confine the discussion to validity. Validity (in testing) is the accuracy of inferences made from test scores. Validity exists along a continuum (i.e., test scores are more or less valid (for particular purposes)). So, you infer a student knows algebra 1 because of a high score on an algebra 1 test. What evidence supports your inference? Let’s compare a one item algebra 1 multiple choice test to a one item constructed response test. From the one item answer you need to infer how well the test taker knows (or has learned) algebra 1. A single multiple choice item will provide very little information to make your inference. A student could have the correct answer on this one item but know little about other aspects of algebra 1. The item could be correct because the student guessed. The item could be incorrect because the student mistakenly marked the wrong answer. And so on … Now if the same prompt is used (so the language load is exactly the same) and the student is asked to solve a problem and show his/her work there is much more information available to make the inference about algebra 1 knowledge. Does the one item multiple choice test seem more or less valid than the one item constructed response test? Multiple choice items gain validity by asking lots of items in a reasonable length of time. At this point I should mention that I am not advocating performance assessments for large scale testing programs. As already stated, performance items are expensive to develop, administer (e.g., in terms of time), and score. There are also more ways for problems to occur in administration and scoring.</description>
		<content:encoded><![CDATA[<p>To Wurman: Let’s confine the discussion to validity. Validity (in testing) is the accuracy of inferences made from test scores. Validity exists along a continuum (i.e., test scores are more or less valid (for particular purposes)). So, you infer a student knows algebra 1 because of a high score on an algebra 1 test. What evidence supports your inference? Let’s compare a one item algebra 1 multiple choice test to a one item constructed response test. From the one item answer you need to infer how well the test taker knows (or has learned) algebra 1. A single multiple choice item will provide very little information to make your inference. A student could have the correct answer on this one item but know little about other aspects of algebra 1. The item could be correct because the student guessed. The item could be incorrect because the student mistakenly marked the wrong answer. And so on … Now if the same prompt is used (so the language load is exactly the same) and the student is asked to solve a problem and show his/her work there is much more information available to make the inference about algebra 1 knowledge. Does the one item multiple choice test seem more or less valid than the one item constructed response test? Multiple choice items gain validity by asking lots of items in a reasonable length of time. At this point I should mention that I am not advocating performance assessments for large scale testing programs. As already stated, performance items are expensive to develop, administer (e.g., in terms of time), and score. There are also more ways for problems to occur in administration and scoring.
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: hsingi</title>
		<link>http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/comment-page-1/#comment-9700</link>
		<dc:creator>hsingi</dc:creator>
		<pubDate>Wed, 14 Apr 2010 23:04:04 +0000</pubDate>
		<guid isPermaLink="false">http://educatedguess.org/blog/?p=1766#comment-9700</guid>
		<description>To McRae: First: The section of the Cronbach report to which I am referring is the section on the reliability (i.e., accuracy) of individual student scores. The report concludes that accuracy estimates for writing, reading, and mathematics would all be acceptable. However, estimating the reliability of individual student scores was complicated by a couple of (if not more) issues. First, CLAS was basically a matrix design intended to produce group level scores, so there were few items per student. The initial administrations weren’t geared for producing individual scores. Second, a mapping process was used to combine the information from the multiple choice and constructed response items to generate individual performance levels (i.e., scores). (Maybe mapping wasn’t the best strategy for combining information across item types.) David Wiley had to be creative in estimating the accuracy of these scores. But there was too little money and too little time to make the changes needed to generate the individual scores that Governor Wilson wanted. So he axed the program. The point I’m trying to make is that Cronbach’s criticisms about reliability had to do with logistics and operations (used to produce the group level scores) not with the items used to generate scores. Yet the press and others continue to refer to CLAS as unreliable (as if there was something inherently wrong with the test). The Cronbach report does not support that popular contention. Second: Why push for a 75 item test when acceptable reliability can be had with a 50 item test? Isn’t there some desire for reduced testing time? Third: I need to revise my original statement. I overestimated the reliability coefficient of a 30 multiple choice and 3 constructed response item test. A colleague and I made some rough estimates and concluded that the reliability coefficient would be about .90 (or about the same as a 50 item test).
As further explanation, a 30 item multiple choice test (functioning the way it should) has a reliability coefficient of about .85. Three constructed response items (on a 5 or 6 point scale functioning as intended) have a reliability coefficient of about .70 (or even .80). (As a point of comparison, 3 multiple choice items have a reliability coefficient of about .35.) So, if you are starting with a reliability coefficient of .85 and adding information it has to go up from there. The additional information from the constructed response items raises the coefficient to about .90.</description>
		<content:encoded><![CDATA[<p>To McRae: First: The section of the Cronbach report to which I am referring is the section on the reliability (i.e., accuracy) of individual student scores. The report concludes that accuracy estimates for writing, reading, and mathematics would all be acceptable. However, estimating the reliability of individual student scores was complicated by a couple of (if not more) issues. First, CLAS was basically a matrix design intended to produce group level scores, so there were few items per student. The initial administrations weren’t geared for producing individual scores. Second, a mapping process was used to combine the information from the multiple choice and constructed response items to generate individual performance levels (i.e., scores). (Maybe mapping wasn’t the best strategy for combining information across item types.) David Wiley had to be creative in estimating the accuracy of these scores. But there was too little money and too little time to make the changes needed to generate the individual scores that Governor Wilson wanted. So he axed the program. The point I’m trying to make is that Cronbach’s criticisms about reliability had to do with logistics and operations (used to produce the group level scores) not with the items used to generate scores. Yet the press and others continue to refer to CLAS as unreliable (as if there was something inherently wrong with the test). The Cronbach report does not support that popular contention. Second: Why push for a 75 item test when acceptable reliability can be had with a 50 item test? Isn’t there some desire for reduced testing time? Third: I need to revise my original statement. I overestimated the reliability coefficient of a 30 multiple choice and 3 constructed response item test. A colleague and I made some rough estimates and concluded that the reliability coefficient would be about .90 (or about the same as a 50 item test).
As further explanation, a 30 item multiple choice test (functioning the way it should) has a reliability coefficient of about .85. Three constructed response items (on a 5 or 6 point scale functioning as intended) have a reliability coefficient of about .70 (or even .80). (As a point of comparison, 3 multiple choice items have a reliability coefficient of about .35.) So, if you are starting with a reliability coefficient of .85 and adding information it has to go up from there. The additional information from the constructed response items raises the coefficient to about .90.
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gary Ravani</title>
		<link>http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/comment-page-1/#comment-9697</link>
		<dc:creator>Gary Ravani</dc:creator>
		<pubDate>Wed, 14 Apr 2010 22:56:00 +0000</pubDate>
		<guid isPermaLink="false">http://educatedguess.org/blog/?p=1766#comment-9697</guid>
		<description>Ze&#039;ev:

While I think it&#039;s great that lay-people are involved in education there are times, and this is one, when they (and education/kids/schools/etc.) need to tend to their own knitting, silicon-chips...whatever. This will be for the good of everyone. You mistake teachers&#039; complaints. Once the phony and demeaning tests were in place teachers (with no choice or voice in the matter) had a legitimate complaint that they needed texts and professional development. All were aware though that it was an artificial and ideological construct. I don&#039;t know any experienced teachers who haven&#039;t thought the standards, particularly math, stink to high heaven. As to the bubble tests narrowing the curriculum and, as a consequence, demeaning education I don&#039;t think I can say it any better than Ravitch (though I have been saying it much longer). The whole standards, testing, bar-raising, top-down, phony accountability, etc., etc., debacle was never founded on anything but the fantasies of neo-cons (and some neo-liberals) refusal to accept the reality that the US&#039;s brutal economic system resulted in wealth gaps and social capital gaps that led to achievement gaps. CA has embarked on a quest for the Grail for almost 20 years now and it has been like a quest for the gold at the end of the rainbow. All make-believe and a cruel joke on the teachers and students of the state. There was no reason to believe this was all going to result in closing gaps or increased achievement and it hasn&#039;t. Kids in relatively wealthy communities are still doing well and kids in not so wealthy communities are still doing not so well.</description>
		<content:encoded><![CDATA[<p>Ze&#8217;ev:</p>
<p>While I think it&#8217;s great that lay-people are involved in education there are times, and this is one, when they (and education/kids/schools/etc.) need to tend to their own knitting, silicon-chips&#8230;whatever. This will be for the good of everyone. You mistake teachers&#8217; complaints. Once the phony and demeaning tests were in place teachers (with no choice or voice in the matter) had a legitimate complaint that they needed texts and professional development. All were aware though that it was an artificial and ideological construct. I don&#8217;t know any experienced teachers who haven&#8217;t thought the standards, particularly math, stink to high heaven. As to the bubble tests narrowing the curriculum and, as a consequence, demeaning education I don&#8217;t think I can say it any better than Ravitch (though I have been saying it much longer). The whole standards, testing, bar-raising, top-down, phony accountability, etc., etc., debacle was never founded on anything but the fantasies of neo-cons (and some neo-liberals) refusal to accept the reality that the US&#8217;s brutal economic system resulted in wealth gaps and social capital gaps that led to achievement gaps. CA has embarked on a quest for the Grail for almost 20 years now and it has been like a quest for the gold at the end of the rainbow. All make-believe and a cruel joke on the teachers and students of the state. There was no reason to believe this was all going to result in closing gaps or increased achievement and it hasn&#8217;t. Kids in relatively wealthy communities are still doing well and kids in not so wealthy communities are still doing not so well.
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ze'ev Wurman</title>
		<link>http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/comment-page-1/#comment-9630</link>
		<dc:creator>Ze'ev Wurman</dc:creator>
		<pubDate>Wed, 14 Apr 2010 02:16:03 +0000</pubDate>
		<guid isPermaLink="false">http://educatedguess.org/blog/?p=1766#comment-9630</guid>
		<description>hsingi: You write that &quot;multiple choice tests are not more valid or reliable than performance assessments.&quot; This is actually incorrect. Multiple choice items in general are more valid in assessing underlying constructs, if for no other reason than they are less prone to confounding due to heavy linguistic load for math or science items. There are more, but that&#039;s the most obvious. Darling-Hammond&#039;s argument that this can possibly be eliminated is no more than a marketing pitch.     Further, you are correct in your argument that reliable scores for PA can be inexpensively given at school level and above. However, the current MC-based tests are already reliable at school level and above. The push of Bush and Obama administrations (and of parents) is to assure individual accountability for every child. Aggregate scores will let schools yet again play games with hiking school averages while leaving a large chunk of disadvantaged students behind.</description>
		<content:encoded><![CDATA[<p>hsingi: You write that &#8220;multiple choice tests are not more valid or reliable than performance assessments.&#8221; This is actually incorrect. Multiple choice items in general are more valid in assessing underlying constructs, if for no other reason than they are less prone to confounding due to heavy linguistic load for math or science items. There are more, but that&#8217;s the most obvious. Darling-Hammond&#8217;s argument that this can possibly be eliminated is no more than a marketing pitch.     Further, you are correct in your argument that reliable scores for PA can be inexpensively given at school level and above. However, the current MC-based tests are already reliable at school level and above. The push of Bush and Obama administrations (and of parents) is to assure individual accountability for every child. Aggregate scores will let schools yet again play games with hiking school averages while leaving a large chunk of disadvantaged students behind.
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Doug McRae</title>
		<link>http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/comment-page-1/#comment-9618</link>
		<dc:creator>Doug McRae</dc:creator>
		<pubDate>Tue, 13 Apr 2010 23:40:18 +0000</pubDate>
		<guid isPermaLink="false">http://educatedguess.org/blog/?p=1766#comment-9618</guid>
		<description>Well, OK hsingi, you have me digging in my dusty file closet for the 1994 Cronbach Report to respond to your challenge of three &quot;questionable statements&quot; in my critique of the Linda Darling Hammond paper that John linked in his post.  First, re Cronbach not confirming that individual CLAS scores lacked reliability, your summary of the Cronbach findings on page 49 of his report isn&#039;t upheld by the report language itself; that section of the report is based on speculation on the mix of types of items in CLAS and assumptions for a favorable mix; the quote on the top of page 49 refers only to a pilot study for writing and is not generalizable to all content areas; finally, the legislative language authorizing CLAS called for individual student reliable scores, and with the less than stellar report on this topic from the Cronbach Report, that became the rationale for Gov Wilson&#039;s vetoing the CLAS re-authorization later in 1994.  Second, on my statement that good reliability for high stakes accountability tests requires 60-75 data points, that indeed is an opinion.  However, in my forty years in the business a generally accepted consensus has been that reliabilities of .90 or better [on the scale of .00 to 1.00 that you mention] are good targets for a system of K-12 tests, and that given variations from content area to content area and grade level to grade level having 60-75 data points will get one to those targets.  Of course, there will be exceptions, and I&#039;ll acknowledge it is possible to get acceptable reliability with fewer data points.  Third, my reference to 30-35 data points for the &quot;high quality&quot; design described in the Hammond paper (p 39) includes the constructed response portion of the design.
I would challenge your statement that a test with 30 MC items and 3 CR items (with a 6-point scoring rubric) would have a reliability coefficient equal to a 60 item MC test; the CR items would contribute to somewhat higher reliabilities than the 30 MC test by itself, but by nowhere near as much as you claim.  The bottom line is that heavy reliance on performance assessment methodology will increase testing time, increase testing cost, and be very challenged to attain the reliability needed for high stakes accountability usage, despite some acknowledged progress in the field over the past 20 years.  Doug McRae, Retired Test Publisher, Monterey, CA</description>
		<content:encoded><![CDATA[<p>Well, OK hsingi, you have me digging in my dusty file closet for the 1994 Cronbach Report to respond to your challenge of three &#8220;questionable statements&#8221; in my critique of the Linda Darling Hammond paper that John linked in his post.  First, re Cronbach not confirming that individual CLAS scores lacked reliability, your summary of the Cronbach findings on page 49 of his report isn&#8217;t upheld by the report language itself; that section of the report is based on speculation on the mix of types of items in CLAS and assumptions for a favorable mix; the quote on the top of page 49 refers only to a pilot study for writing and is not generalizable to all content areas; finally, the legislative language authorizing CLAS called for individual student reliable scores, and with the less than stellar report on this topic from the Cronbach Report, that became the rationale for Gov Wilson&#8217;s vetoing the CLAS re-authorization later in 1994.  Second, on my statement that good reliability for high stakes accountability tests requires 60-75 data points, that indeed is an opinion.  However, in my forty years in the business a generally accepted consensus has been that reliabilities of .90 or better [on the scale of .00 to 1.00 that you mention] are good targets for a system of K-12 tests, and that given variations from content area to content area and grade level to grade level having 60-75 data points will get one to those targets.  Of course, there will be exceptions, and I&#8217;ll acknowledge it is possible to get acceptable reliability with fewer data points.  Third, my reference to 30-35 data points for the &#8220;high quality&#8221; design described in the Hammond paper (p 39) includes the constructed response portion of the design.
I would challenge your statement that a test with 30 MC items and 3 CR items (with a 6-point scoring rubric) would have a reliability coefficient equal to a 60 item MC test; the CR items would contribute to somewhat higher reliabilities than the 30 MC test by itself, but by nowhere near as much as you claim.  The bottom line is that heavy reliance on performance assessment methodology will increase testing time, increase testing cost, and be very challenged to attain the reliability needed for high stakes accountability usage, despite some acknowledged progress in the field over the past 20 years.  Doug McRae, Retired Test Publisher, Monterey, CA
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: hsingi</title>
		<link>http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/comment-page-1/#comment-9617</link>
		<dc:creator>hsingi</dc:creator>
		<pubDate>Tue, 13 Apr 2010 23:15:52 +0000</pubDate>
		<guid isPermaLink="false">http://educatedguess.org/blog/?p=1766#comment-9617</guid>
		<description>It is not helpful to categorize tests as either “authentic” or “standardized.” These terms are not clear and are not mutually exclusive (i.e., an authentic test (whatever that means) could be standardized). It is somewhat more helpful to classify tests as either “multiple choice” or “performance assessments.”  Then you still have to define/describe what is meant by a “performance assessment.” Typically for large scale testing programs performance assessments are made up of a combination of constructed response (e.g., writing an essay or conducting an experiment) and multiple choice items. And, it must be emphasized that multiple choice tests are not more valid or reliable than performance assessments. The only advantage that multiple choice tests have over performance assessments is that they are cheaper to develop, administer, and score. However, if there is no need to generate individual scores (i.e., there would be school, district, county, and state scores), costs for performance assessments can be made comparable to multiple choice tests.</description>
		<content:encoded><![CDATA[<p>It is not helpful to categorize tests as either “authentic” or “standardized.” These terms are not clear and are not mutually exclusive (i.e., an authentic test, whatever that means, could be standardized). It is somewhat more helpful to classify tests as either “multiple choice” or “performance assessments.” Then you still have to define/describe what is meant by a “performance assessment.” Typically, for large-scale testing programs, performance assessments are made up of a combination of constructed response (e.g., writing an essay or conducting an experiment) and multiple choice items. And, it must be emphasized that multiple choice tests are not more valid or reliable than performance assessments. The only advantage that multiple choice tests have over performance assessments is that they are cheaper to develop, administer, and score. However, if there is no need to generate individual scores (i.e., there would be only school, district, county, and state scores), costs for performance assessments can be made comparable to those of multiple choice tests.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brian Kaplan</title>
		<link>http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/comment-page-1/#comment-9614</link>
		<dc:creator>Brian Kaplan</dc:creator>
		<pubDate>Tue, 13 Apr 2010 22:55:17 +0000</pubDate>
		<guid isPermaLink="false">http://educatedguess.org/blog/?p=1766#comment-9614</guid>
		<description>If Linda Darling Hammond is behind this effort, then I feel confident that it has merit and will succeed.</description>
		<content:encoded><![CDATA[<p>If Linda Darling Hammond is behind this effort, then I feel confident that it has merit and will succeed.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: hsingi</title>
		<link>http://toped.svefoundation.org/2010/04/12/standardized-tests-holy-grail/comment-page-1/#comment-9608</link>
		<dc:creator>hsingi</dc:creator>
		<pubDate>Tue, 13 Apr 2010 21:46:30 +0000</pubDate>
		<guid isPermaLink="false">http://educatedguess.org/blog/?p=1766#comment-9608</guid>
		<description>In John Fensterwald’s article “Standardized tests’ Holy Grail,” Doug McRae’s comments are used as a counter to arguments made in a paper by Linda Darling Hammond and Frank Adamson. Although counterpoints are welcome to help shape public policy, Doug McRae makes several questionable statements.

First, to set the record straight, Lee Cronbach did not confirm that individual CLAS scores lacked reliability. Reliability is an indicator of the consistency or accuracy of tests (or other measures). Cronbach concluded that, for individual reading, writing and mathematics scores, there was a satisfactory level of accuracy for reporting student scores (Sampling and statistical procedures used in the California Learning Assessment System: Report of the select committee. July 25, 1994, page 49). The Cronbach report was mostly critical of how logistics and operations (e.g., sampling papers to score) undermined the accuracy of test scores, not the content (or item type) of the test itself.

Second, McRae’s statement that good reliability requires 60-75 data points (i.e., multiple choice items) for accountability tests is simply not true. It is true, generally speaking, that as the number of test items increases, test score reliability also increases. However, there is a point of diminishing returns. If for the moment we confine ourselves to standardized large-scale multiple choice tests, increasing a test from two items to ten items increases the reliability coefficient from about .30 to .65. The reliability coefficient is one measure of test score reliability and ranges from 0.0 (i.e., no reliability) to 1.0 (perfect reliability). So, an increase of eight items increases the reliability coefficient by about .35. Increasing a test from 50 to 75 items increases the reliability from about .90 to .93. In this case, increasing the test by twenty-five items increases the reliability coefficient by only about .03. If the test were increased to 100 items, the reliability coefficient would be about .96. So, doubling a test from 50 to 100 items increases the reliability coefficient by only about .06. There is no accepted consensus as to what is meant by good reliability. However, one rule of thumb for making inferences about individual test scores is to aim for a reliability coefficient of about .90 (which is accomplished by using about 50 multiple choice items). So, Doug McRae’s statement that 60-75 data points are a requirement for accountability tests is not true. It’s just his opinion.

Third, the statement that the reliability for 30-35 data points (i.e., multiple choice items) would be too low for an accountability test is not true, in that it ignores the fact that the test would also include two or more constructed response items. Each constructed response item is equivalent to several multiple choice items. A standardized large-scale test with 30 multiple choice items and three constructed response items (with a scoring rubric of 0-5) would have a reliability coefficient comparable to that of a 60-item multiple choice test.</description>
		<content:encoded><![CDATA[<p>In John Fensterwald’s article “Standardized tests’ Holy Grail,” Doug McRae’s comments are used as a counter to arguments made in a paper by Linda Darling Hammond and Frank Adamson. Although counterpoints are welcome to help shape public policy, Doug McRae makes several questionable statements.</p>
<p>First, to set the record straight, Lee Cronbach did not confirm that individual CLAS scores lacked reliability. Reliability is an indicator of the consistency or accuracy of tests (or other measures). Cronbach concluded that, for individual reading, writing and mathematics scores, there was a satisfactory level of accuracy for reporting student scores (Sampling and statistical procedures used in the California Learning Assessment System: Report of the select committee. July 25, 1994, page 49). The Cronbach report was mostly critical of how logistics and operations (e.g., sampling papers to score) undermined the accuracy of test scores, not the content (or item type) of the test itself.</p>
<p>Second, McRae’s statement that good reliability requires 60-75 data points (i.e., multiple choice items) for accountability tests is simply not true. It is true, generally speaking, that as the number of test items increases, test score reliability also increases. However, there is a point of diminishing returns. If for the moment we confine ourselves to standardized large-scale multiple choice tests, increasing a test from two items to ten items increases the reliability coefficient from about .30 to .65. The reliability coefficient is one measure of test score reliability and ranges from 0.0 (i.e., no reliability) to 1.0 (perfect reliability). So, an increase of eight items increases the reliability coefficient by about .35. Increasing a test from 50 to 75 items increases the reliability from about .90 to .93. In this case, increasing the test by twenty-five items increases the reliability coefficient by only about .03. If the test were increased to 100 items, the reliability coefficient would be about .96. So, doubling a test from 50 to 100 items increases the reliability coefficient by only about .06. There is no accepted consensus as to what is meant by good reliability. However, one rule of thumb for making inferences about individual test scores is to aim for a reliability coefficient of about .90 (which is accomplished by using about 50 multiple choice items). So, Doug McRae’s statement that 60-75 data points are a requirement for accountability tests is not true. It’s just his opinion.</p>
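<p>The diminishing returns described above can be sketched with a short calculation. This is only an illustration under an assumption: the comment does not say which formula produced its figures, so the sketch uses the Spearman-Brown prophecy formula, a standard way to project reliability when a test is lengthened with comparable items. The function names are hypothetical, and the starting point (a two-item test with reliability .30) is taken from the paragraph above.</p>

```python
def spearman_brown(reliability, length_factor):
    """Project reliability when a test is made length_factor times as long.

    Assumes the added items are comparable to the existing ones
    (the usual Spearman-Brown assumption, not something the comment states).
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def predicted_reliability(base_items, base_reliability, new_items):
    """Project reliability for new_items, given a known base test."""
    return spearman_brown(base_reliability, new_items / base_items)


if __name__ == "__main__":
    # Anchor on the comment's example: two items, reliability about .30.
    for n_items in (2, 10, 50, 75, 100):
        projected = predicted_reliability(2, 0.30, n_items)
        print(f"{n_items:>3} items: projected reliability {projected:.2f}")
```

<p>Under this assumption the projections land close to, though not exactly on, the figures quoted above: the curve rises steeply over the first few items and flattens well before 100 items, which is the point being made about diminishing returns.</p>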
<p>Third, the statement that the reliability for 30-35 data points (i.e., multiple choice items) would be too low for an accountability test is not true, in that it ignores the fact that the test would also include two or more constructed response items. Each constructed response item is equivalent to several multiple choice items. A standardized large-scale test with 30 multiple choice items and three constructed response items (with a scoring rubric of 0-5) would have a reliability coefficient comparable to that of a 60-item multiple choice test.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

