Measured Progress's Recent Logistics Problems
by Timothy Horrigan
Copyright © 2005-2006 Timothy Horrigan

This article was originally part of my "Notes on my Test Scoring Experience" web page.

There are a number of links in this text to various newspaper articles. Those links are likely to go dead at some point in the future, if they have not done so already. It seems best to just leave the links in, as a guide to finding the material offline.


 RECENT LOGISTICS PROBLEMS 

Measured Progress's logistics group ran into a couple of glitches in May 2005.

First, on Friday May 13, local media in Nevada reported that a significant number of answer sheets went missing: somewhere between 100 and 400 papers from high schools in the Reno and Las Vegas areas. This problem would not have come to the media's attention were it not for the fact that these test papers all belonged to seniors who needed to pass the test before they could graduate.

The missing sheets represented 25% to 50% of the seniors at the affected schools. (At Las Vegas' Western High School, for example, 150 test papers were supposedly misplaced from a school with 328 seniors.) Measured Progress didn't notice the problem until the schools asked why the score reports hadn't come back yet. Happily, Measured Progress found the test sheets after a frantic weekend search through what Measured Progress president Stuart Kahl described as "tens of thousands of cartons." School administrators were blamed for putting the answer sheets in the same boxes with the test booklets rather than in a separate envelope. (Typically, all testing materials are shipped back to New Hampshire, but the sheets are supposed to be separated from the rest of the stuff before everything goes back.)

Even before the answer sheets got lost, there was a problem with the Nevada exams: a large number of high school math tests went out with messed-up formula sheets. The conversion formulas for several common units had blatant typos: "quart," "liter," and "kilogram" were all spelled wrong on the printed sheets. Oddly, when Measured Progress CEO Stuart Kahl checked the computer files ostensibly used to generate the sheets, those words were spelled correctly. (My uninformed guess is that there was a problem with the fonts: probably for some reason the printer and operating system chose not to use the fonts included with the original PDF file.) Luckily, the messed-up formulas weren't really necessary to answer any of the test questions anyway. (Unless the test question is explicitly testing this skill and no other skill, students are never asked to convert from one measuring system to the other. Questions are always written using all-metric or all-non-metric units.)

Ironically, one of the reasons Measured Progress won the Nevada contract over the previous contractor (Harcourt) was because they agreed to handle virtually all of the basic grunt work of sending out and retrieving test materials. Another reason was because Harcourt had made what the Las Vegas Sun's Emily Richmond described in a November 18, 2004 article as "a series of high-profile — and expensive — errors over the past two years." (Presumably, those two years were 2002 and 2003.)

On Thursday, May 19, it turned out that the incidents in Reno and Las Vegas were not entirely isolated: other test materials (from other grades and other schools) had been misplaced as well.

Also, on Thursday, May 19, the Boston Globe reported that Massachusetts' MCAS test (which is developed and tested by Measured Progress) had been disrupted by incorrect test booklets. Some Grade 10 booklets went out with an English/Language Arts test item where the questions had nothing to do with the brief essay which they ostensibly referred to. Luckily, the test item was a "field test" question where the answers do not count towards the students' scores. (It is possible that Measured Progress's test developers originally thought that the questions in fact did have something to do with the test prompt, but when the item went out into the field the educators who administered the test failed to see how the questions were relevant. However, it is much more likely that the books were simply printed wrong.) Also, some Grade 4 booklets went out with a page missing from the English/Language Arts section. A state Department of Education spokesperson admitted that no one from their department proofs the booklets after they come back from the printer, but in any case Measured Progress does share responsibility for proofing the test booklets both before and after they go to the printers.

A subsequent story on Friday, May 20 in a suburban Boston daily, the Waltham Daily News-Tribune, shed a little more light on this incident: the 10th grade booklets in question contained an essay prompt followed by a set of seven questions consisting of two unnumbered (and hence unanswerable) questions and five numbered questions which had no obvious connection to the prompt. The 4th grade booklets were replaced with correctly printed booklets at the schools where the problem was noted. (Schools are given roughly 1.1 test booklets per student, to allow a margin of error when something like this goes wrong.)

The next week (just before Memorial Day) a weekly paper in Cambridge, the Cambridge Chronicle, published a scathing interview with the city school superintendent, Tom Fowler-Finn. Seven of the 462 10th graders at Cambridge Rindge & Latin received the booklets with the unanswerable reading field-test item. Fowler-Finn was quoted as saying: "If you're a student sitting in that exam thinking you have to pass it to get through high school, and then you read the passage and the questions following it don't match, as far as I'm concerned, that does a great deal of harm. It shakes [a student's] confidence."

The Boston Globe also reported on May 26 that 24 test booklets were misplaced in Randolph, MA (a Boston suburb.) They got left behind for a month in a special secure basement classroom used only for standardized testing. (Randolph is a wealthy town.) On April 6, a group of students took the MCAS, but the teachers forgot to take the completed test materials out before locking up the secure facility. The booklets were discovered a month later when the room was reopened for the SAT, but the students still had to take the MCAS a second time. This foul-up is the school's fault rather than Measured Progress's. Nevertheless, this incident does make one wonder what sort of systems are in place to track outgoing and incoming materials. This incident also shows how harsh the process was. Even though the lost papers had already been found, the kids were told on May 11 that they had to take the test the very next day. The rationale for this was the possibility that the booklets had been tampered with while they were missing. This is a valid concern, but this concern is inconsistent with the fact that Measured Progress's logistics center never noticed that those 24 test papers (out of a sophomore class with 280 students, according to the Public School Review website) were missing.

The Randolph incident involved a screw-up which was entirely beyond Measured Progress's control. There's not much you can do when your client locks completed test booklets in a little-used room and leaves them there for a month. (Although, as I said already, you could try to flag schools which return an abnormally low percentage of their test materials. In theory, all test materials go back to Measured Progress. In practice a few booklets might go missing— but 24 out of 280 is 8.5%!)

The Nevada incident was within Measured Progress's control. Answer sheets being shipped back in the wrong box is a foreseeable problem (and, reading between the lines of the local newspaper articles, it appears that this is a problem which has happened before.) Just to belabor a point I already made, they managed to lose half the papers from some schools. Yeah, 50% of the papers from a single school adds up to a minuscule fraction of Measured Progress's nationwide throughput, but it's a huge fraction of that school's population.

It seems to me that if you know (as you damn well ought to!) how many students are in a given class at a given school, how many students are scheduled to take the test, and how many sets of test materials the school ordered— then you should have a pretty good idea how many test booklets are going to come back (especially if you also compare this year's numbers at a given school to previous years' numbers and/or to those at other schools.) Between the completed and uncompleted booklets, you should have every booklet accounted for. Evidently accounting for the uncompleted booklets was not considered a high priority, even though you have a significant security problem if someone swipes test materials before they get sent back. You're almost never going to get enough completed booklets back to account for 100% of the enrolled students. (Some kids are always going to miss the test, for various legitimate or illegitimate reasons.) However, the percentage of students who do indeed show up for the test should be reasonably predictable, and it should be a lot higher than the 50% to 75% seen at the affected Nevada schools.
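To make the idea concrete, here is a minimal sketch (in Python, with made-up field names and thresholds; nothing here reflects Measured Progress's actual systems) of the kind of reconciliation check I am describing: compare what came back against what was shipped and against a plausible participation rate, and flag anything that looks off.

    # A hypothetical reconciliation check, not Measured Progress's actual system.
    from dataclasses import dataclass

    @dataclass
    class SchoolShipment:
        school: str
        enrolled: int            # students enrolled in the tested grade
        booklets_shipped: int    # sets of test materials sent to the school
        completed_returned: int  # answer documents that came back completed
        blank_returned: int      # unused booklets that came back

    def flag_anomalies(shipments, expected_participation=0.90, tolerance=0.05):
        """Flag schools whose returns don't add up, or whose apparent
        participation rate is implausibly low for a graduation-required test."""
        flagged = []
        for s in shipments:
            missing = s.booklets_shipped - (s.completed_returned + s.blank_returned)
            participation = s.completed_returned / s.enrolled
            if missing > 0 or participation < expected_participation - tolerance:
                flagged.append((s.school, missing, round(participation, 2)))
        return flagged

    # The 328 seniors and roughly 150 missing completed papers come from the
    # news reports; the shipped and blank counts are invented for the example.
    print(flag_anomalies([SchoolShipment("Western HS", enrolled=328,
                                         booklets_shipped=360,
                                         completed_returned=160,
                                         blank_returned=50)]))
    # -> [('Western HS', 150, 0.49)]

Even a crude check along these lines, run when the trucks come in rather than when a school calls to ask about its score reports, would have flagged the Nevada schools months earlier.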

It seems pretty evident that Measured Progress (and its competitors) are not keeping very good track of their incoming (or outgoing, for that matter) testing materials.

A year later, Stuart Kahl gave a presentation before the Aspen Institute which touched on some of these logistical issues.


In Massachusetts, there was another controversy in September 2005, when the MCAS scores from the previous spring were reported. Some schools reported that test booklets had been scanned incorrectly. Statewide, scores also went down significantly, and understandably many school administrators theorized that the change of contractors from Harcourt to Measured Progress may have been a factor. This controversy failed to reach the pages of Massachusetts' only statewide paper, the Boston Globe. However, several smaller papers did run stories (which may or may not still be available by the time you read this article).

Then, on October 7, a suburban paper called the Holbrook Sun reported that 60 out of 700 MCAS test booklets given to Holbrook students had gone missing. "We've been on this since August," Superintendent Susan Martin said. "We've been trying to track this down and find a resolution to this issue." Holbrook is not far from Weymouth.

On October 13, the Springfield Republican reported a controversy over 54 missing scores from the West Springfield Middle School, which has roughly 1000 students. 51 of these 54 scores were mistakenly attributed to West Bridgewater (a community geographically on the other side of the state from West Springfield— but alphabetically adjacent.) The state says the reported scores are correct. The West Springfield school district feels otherwise. Angelo Rota, a West Springfield school administrator, commented "It's my understanding that [the tests] all go to a warehouse in Texas. It's like looking for a needle in a haystack." (Actually it is my understanding that the warehouse is in Dover, NH. The previous contractor's warehouse was in Texas.)


  Measured Progress is not the only test provider to run into problems shipping test forms back and forth across the country: Pearson experienced a rather embarrassing snafu with the SAT in the fall of 2005 and the winter of 2006. Test scoring companies in general need to consider to what extent they want to be in the business of shipping and handling test booklets. This doesn't necessarily mean they need to outsource all those functions. Measured Progress actually created some problems for itself by not outsourcing some of its printing.

2005-2006 SAT Scoring Errors

Aside from the open-response writing prompt, the rest of the new SAT works just like the old SAT: it consists of several dozen multiple-choice questions. In March 2006, there was a bit of a furore after the College Board admitted that a few of the scores from the fall 2005 tests might be just a wee bit off. Initially, these errors supposedly affected just a few dozen tests, with scores off by no more than 100 points and typically by just 10 to 40 points. Then, we were told that exactly 0.8% of the test papers (1 in 125) were affected. There have been reports of scores being as much as 200 points off or even more. The College Board is adjusting overly low scores upwards, but the smaller number of overly high scores will be left the way they are.


I don't know anything more about this than what I have read in the papers. The various news reports agree that, sometime in the fall of 2005, somehow something mysteriously went wrong while the multiple-choice questions were being scored at a Pearson Education facility somewhere in Texas. Supposedly, it is of a highly technical nature.

The first wave of news stories went into no technical detail whatsoever about what happened or how. After a few days of controversy, the College Board offered an explanation. It was a rather lame explanation, using everyone's favorite excuse— unusual weather. The October 8 test session coincided with a week of record rainfalls in the northeast US, especially New Jersey. Papers from those areas absorbed abnormally large amounts of moisture, which caused the papers to be marked in an "unacceptable manner" and/or caused the marks to be lined up incorrectly.

It is worth mentioning that the papers were actually scanned in Austin, Texas, not in the Northeast, and they were scanned quite some time after the test was administered. This incident raises some disturbing questions about quality control in the testing industry. Aside from the fact that it took them four or five months to address the erroneous scores, it is shocking that Pearson and the College Board didn't design the scanning system so it wouldn't choke on damp test papers. Yes, this is a technical issue, but it is also a technical issue related to a technology (optical mark sensing) which has been around for decades.

My personal pet theory about this snafu is as follows: I think there was probably a problem with exactly one of the many answer keys corresponding to the many versions of the test. If there are five different versions of each of the three sections (Reading, Writing and Math) then you would get 5^3=125 different test forms (and 125 corresponding answer keys.) 1 wrong answer key out of 125 is exactly 0.8%! It makes sense to me (though perhaps not to anyone else.)

The news stories indicate that most of the mistakes led to lower scores— but not all. This is consistent with my answer-key theory. SAT questions are constructed with 5 choices: 1 correct answer, 1 "distractor" (which is plausible but wrong), and 3 blatantly wrong answers. The right answer is usually the most popular choice, and the right answer and the distractor are virtually always the two most popular answers. So if you apply a random answer key to a question, you are much more likely to score the real right answer as being "wrong" than to score a wrong one as "correct."
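Just to see whether my pet theory holds water, here is a quick back-of-the-envelope simulation (in Python; the choice probabilities are my own guesses, and the College Board has never confirmed anything like this). It scores a batch of simulated answer sheets against the correct key and then against an unrelated key, the way a mixed-up form assignment would.

    # A rough simulation of my answer-key conjecture. The probabilities are
    # invented; they just encode the idea that the correct answer is the most
    # popular choice and the distractor is the second most popular.
    import random

    random.seed(0)
    N_QUESTIONS = 50
    N_STUDENTS = 1000
    # index 0 = correct answer, 1 = distractor, 2-4 = blatantly wrong answers
    CHOICE_PROBS = [0.55, 0.25, 0.07, 0.07, 0.06]

    true_key = [0] * N_QUESTIONS
    wrong_key = [random.randrange(5) for _ in range(N_QUESTIONS)]  # mismatched key

    lower = higher = same = 0
    for _ in range(N_STUDENTS):
        answers = [random.choices(range(5), weights=CHOICE_PROBS)[0]
                   for _ in range(N_QUESTIONS)]
        true_score = sum(a == k for a, k in zip(answers, true_key))
        bad_score = sum(a == k for a, k in zip(answers, wrong_key))
        if bad_score < true_score:
            lower += 1
        elif bad_score > true_score:
            higher += 1
        else:
            same += 1

    print(lower, higher, same)
    # With these assumptions, scoring against the wrong key drags almost every
    # simulated student's raw score down, which is at least consistent with the
    # reports that most of the erroneous SAT scores were too low.

A key that is wrong on only a handful of questions would presumably produce smaller shifts, mostly but not always downward, which is closer to the 10-to-40-point errors that were actually reported.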

The official "humidity" explanation simply doesn't make much sense— not to me, at least (though what do I know?)

I spent a couple of months with the logistics group in the fall of 2002 myself. [This was when the logistics group was based in Newington, NH; since then they have moved twice, first in 2005 to a new corporate headquarters campus in Dover, NH and then in 2007 to an off-campus site in nearby Rochester.]

Their approach to the job was an odd mixture of diligence and haphazardness, aggravated by the fact that the managers (with only one or two exceptions) operated in panic mode 125% of the time. The boss operated in panic mode 150% of the time, even though this would be (unless she quit or got fired in the meantime) the same boss who failed to notice the large numbers of missing booklets from Nevada and Michigan. (I recall how she starred in an amusing staff meeting near the end of one particularly miserable week, which was like something out of a John Cleese video. The gist of our boss's message was that if we didn't stop lollygagging about and if we didn't shape up right now and realize the seriousness of her situation, the company would go out of business. In fact, if we made even one more mistake, if we let her down even one more time, the whole company would go out of business, and it would be all our fault and she would be even more pissed off at us than she was already. The motivational impact of her message was lessened by the fact that she sounded so insane that she made us want the company to go out of business, so we wouldn't have to be yelled at by a crazy lady anymore.)

I was assigned to a project where we were collating and sending out hundreds of thousands of pages of reports related to the Georgia testing program. (This was a very troubled program, for a variety of reasons: Measured Progress eventually lost the contract.)

At some point, someone at the Georgia state education department demanded that the reports be error-free. The boss decided that this meant that she needed to hire 100 temps, including myself, to come in for a month to count and inspect each and every page of the printed reports. (The only problems we ever found were the usual faded pages produced when the toner started running low: we were told that slightly faded printing was acceptable, even when it was streaky.)

Most of us were test scorers. The procedures for printing up the reports were not exactly optimally efficient: we did a lot more collating than we should have been doing. (For various reasons, the order in which things were printed was not the order we wanted to send them out in.) While we collated, we counted each and every sheet by hand. Just to make the job more challenging, the temps were stuck for as many as 10 hours a day on an un-air-conditioned loading dock in late summer with huge fans blowing the papers around and noisy machinery being operated nearby. We test scorers, being intelligent people, wanted to know what the heck we were doing and why, and the logistics supervisors responded by belittling and harassing us and by telling us that we were "too smart" and we should "stop thinking." We test scorers, being proud people, didn't respond well to belittlement and harassment, so the supervisors stepped up their attacks on us. It was very unpleasant. (I was lucky: I soon got fired from counting tests and was reassigned to a slightly less boring job taping boxes together.)

Once we counted a set of reports, we just left them sitting there for several weeks (plenty of time for them to get misplaced.) When the reports were shipped out, neither the labels on the boxes nor the (generic) packing slips inside the boxes specified which school a given box pertained to.

When I was taping boxes together, I worked with a group of Indonesian (as well as Malaysian) refugees. (Southern New Hampshire has a large Southeast Asian population.) I think I was supposed to feel disgraced that I was relegated to working with the Indonesians, but they were actually much more interesting than the people I was originally working with. The theory behind having Southeast Asian refugees working in the logistics center was that their limited language skills would render them incapable of reading and/or understanding the materials they were handling— and hence they would pose less of a security risk than us Americans. One of the many problems with this theory was that their language skills were not limited at all. Most of them were college educated and spoke pretty good English, and they also spoke what sounded to me like outstanding Malayalam. If anything, they were a bigger security risk than American citizens would have been. (Speaking of security risks, the old "warehouse" was laughably insecure when I worked there, although the new facility is much better.)




Stuart Kahl's Aspen Institute Testimony:

"Assessment and Data Quality Issues"

West Hartford, CT: May 9, 2006


A year after the problems in Nevada, Stuart Kahl gave some interesting testimony to the Aspen Institute, at a May 9, 2006 conference in West Hartford, CT. I know I kinda made fun of him earlier, but his prepared remarks make some great points.

He touches in passing on his company's logistical problems. He says that the testing industry is not running out of capacity, and he somewhat amusingly stated that "At a recent (April 25, 2006) meeting between Secretary of Education Margaret Spellings and executives from many testing companies, it was suggested that the error rate in the testing industry may be considerably smaller than that in other industries." (He admits that this is still not good enough, since the current level of mistakes "cannot be accepted given the consequences of the errors for individual students and schools.")

He praised his company's "efficient, high-tech systems" for processing tests— including image scoring— which ostensibly improve "quality control with respect to the work of human readers." He also admitted that incoming materials are not always returned in the manner prescribed in the instructions (e.g., answer sheets left in test booklets.) Another big problem which he talked about was the difficulty of cleaning up data files, both before and after the students take the test.

I was especially interested to see that the traditional 0 through 4 score scale is not merely justified by the obvious fact that it is easier to pick a score point from just 5 choices than from, say, the 101 choices of a 0 through 100 scale. A more technical justification is that data quality is higher (in the sense of being more replicable from test to test) when you have just a few choices. As Kahl puts it, "common statistical analysis packages generate the same reliability coefficients for tests whether they include dichotomously scored multiple-choice items or constructed-response items scored on a zero-to-four continuum."
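To illustrate Kahl's point in code (a generic illustration, not anything from his testimony or from Measured Progress's software): one common reliability coefficient, Cronbach's alpha, is computed from item and total-score variances in exactly the same way whether the items are dichotomous multiple-choice marks or zero-to-four constructed-response scores. The scores below are made up.

    # Cronbach's alpha: the same formula applies to 0/1 items and 0-4 items.
    from statistics import pvariance

    def cronbach_alpha(item_scores):
        """item_scores: one list of scores per item, aligned by student."""
        k = len(item_scores)
        item_variance_sum = sum(pvariance(item) for item in item_scores)
        totals = [sum(student) for student in zip(*item_scores)]
        return (k / (k - 1)) * (1 - item_variance_sum / pvariance(totals))

    # Four dichotomously scored multiple-choice items, five students:
    mc_items = [[1, 0, 1, 1, 0],
                [1, 1, 1, 0, 0],
                [1, 0, 1, 1, 1],
                [0, 0, 1, 1, 0]]
    # Four constructed-response items on the 0-4 scale, same five students:
    cr_items = [[4, 1, 3, 3, 1],
                [3, 2, 4, 2, 1],
                [4, 1, 3, 4, 2],
                [2, 1, 4, 3, 1]]

    print(round(cronbach_alpha(mc_items), 2))  # 0.52
    print(round(cronbach_alpha(cr_items), 2))  # 0.89
    # The point is not that the two values match, but that the scale of the
    # items never enters the formula; only the variances do.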



One of the other experts on the panel, Joel Klein of the New York City Department of Education, took a slightly different view of how responses should be scored:

We measure all gains and all losses, even those that don't jump students from one proficiency level into another. For example, under the federal law, it does not count as progress if a student moves from a low-level two to a high-level 2. But a jump from a high level 2, what we call approaching standards, to a low level 3, what we call meeting standards, counts a lot.

Klein's approach poses a couple of problems. Firstly, it is much more expensive to distinguish low 2s from high 2s than to just lump all 2s together. Secondly, Klein's approach compromises the statistical reliability of the tests. Under the current system, with only the five integer grade points between 0 and 4, if a scorer mistakes a high 2 for a low 2, the score is not affected, and the bell curve is not messed up. Moreover, when we make the scoring system finer, it becomes even harder for test developers to design prompts and rubrics which keep the mean scores within the optimal range between 2.3 and 2.7 which yields the most symmetrical bell curves.


