Measured Progress Test Scoring





NOTES ABOUT MY TEST-SCORING EXPERIENCE
by Timothy Horrigan
Copyright © 2005-2011 Timothy Horrigan


Click here to learn how to become a test reader/scorer yourself!




 December 30, 2008

Note: This article is a few years out of date, although Measured Progress's basic corporate culture is still the same and test scorers are still regarded as the lowest caste in the company's social hierarchy. One change I should mention (which may not be new) is that training is no longer done in groups. The trainers never meet the trainees in person, even though they are physically located in the United States (usually in Dover, NH) rather than in India: it is all done by video and teleconferencing.

I should also comment that the old scoring center where I worked had been around for a long time and hadn't had its infrastructure upgraded in many years.  The new facilities in Dover, NH and elsewhere are much more up-to-date and also much more secure.

 June 6, 2007

Note: I wrote the original version of this article in the fall of 2004, when Measured Progress's Dover, NH scoring center was still at its old location in an old mill building in downtown Dover. Since then, the Dover center has moved to a new location on the outskirts of town in a suburban industrial park. (And the company has opened up a new center in the Denver area which is almost twice the size of the Dover center, as well as another large center in Louisville, Kentucky. In fact, most of the company's test-scoring seats are now located outside New Hampshire.)

My second-hand information generally indicates that the new scoring centers are still operated in much the same manner as the old Dover center. However, I have heard that they are, at least, fixing one of my minor complaints: test responses are now scanned in as imagery with more than just the two colors of black and white. (Measured Progress had a funny rationale for switching to full-color JPEG from 2-color TIFF. According to an April 2007 Kodak press release, the deciding factor behind the switch was supposedly that TIFF required "a licensed viewer to preview." The only thing is, you don't need a license to use TIFF files. The switch away from low-res 2-color TIFF was long overdue, and switching to full-color high-res TIFF would probably have been unwise because of TIFF's generally huge file sizes, but the licensing-problem rationale is laughable.)

Generally speaking, Measured Progress has a tendency to define jobs too narrowly and to model work too linearly. But there is an exception to this rule at the Dover scoring center: the lower-level supervisors are given too wide a range of responsibilities (especially considering that they are minimally equipped and not-too-extensively trained temps). You can see the problem just from their title: they are called "Quality Assurance Coordinators" even though they do very little actual QA. The company expects each "Quack" to serve not just as a QA person but also as a trainer, a technical writer, a benchmarker, a file clerk, a traffic cop, and a table leader (amongst other things). Even if the company were willing to actually give the QACs the tools they needed to simultaneously fulfill all those different roles, and even if it were able to give the QACs something to aspire to beyond mere Quackery, this would still be a frustrating job.


Intro


If you have kids in school, if you are a kid, or if you used to be a kid, or if you have friends and family members who fall in one of the three previous categories, you are probably concerned about educational assessment testing. The tests have a huge impact on how the American educational system is structured, and they also affect the funding available to your local school system. In some cases, they are even used to decide whether or not individual students get admitted to special programs such as gifted education, magnet schools, etc.

I used to be one of the people who helped score these tests. In 2002 and 2003, I worked for Measured Progress, Inc.'s scoring center in Dover, NH. Measured Progress is a relatively small firm, but is growing fast, and it is one of the top five companies in the business. Some of the states where Measured Progress is the primary educational assessment testing provider include Massachusetts, Louisiana, Maine, Nevada, and New Hampshire.

Just as grade school and high school curricula are arbitrarily but rigidly divided into a few subject areas (reading, writing, mathematics, social studies, etc.), so is the testing process divided into a few arbitrarily but rigidly defined functional areas. These are, basically, project management (dealing with state education officials), test development (writing the questions and creating the testing instruments), logistics (sending out the test materials and processing them when they come back from the field), and test scoring.

Test scoring is arguably the most important part of the process, since a mediocre test scored accurately is more valuable than a good test scored badly. Even the best possible test would be worthless if it was scored badly enough. However, at Measured Progress (and apparently at its competitors, judging from what I have heard), the scoring gets less attention and fewer resources than the other steps.

(Note: A couple of embarrassing May 2005 news stories suggest that test scoring is not the only part of the testing process which is being done haphazardly. Click here to learn about the logistics department's difficulties.)

Political & Financial Issues

There are a variety of political and financial reasons for this.

The main political problem is that the test scorers, and all of the test scoring supervisors aside from a very limited number of year-rounders, are temps. Large numbers of test scorers (as many as 175 at a time) are brought in on a seasonal basis as needed, typically shortly before a deadline, to score a state's tests during a period of a few days or weeks. Most of the temps are not even Measured Progress employees, but are instead employed by temporary agencies. The rest of the year, the temps are not around, so no one gives much thought to how to maximize their productivity or job satisfaction. The test scorers never have any contact with their colleagues in other departments. In two years at One Washington Street (our scoring center in an old mill in downtown Dover, NH) I think I only saw a test developer once, even though the corporate headquarters was only a couple of miles away. And that test developer only came over because the Discovery Channel was filming a documentary about the No Child Left Behind act and she was being interviewed. (The company made us vacate our work area so she could be filmed in front of a background of a row of a dozen or so computers, with no test scorers cluttering up the shot.)

The main financial problem is that the test scoring department gets a fairly small share of the revenue from the test. Most state test scoring contracts are structured so that the state department of education pays for the test development up front and the local school districts pay for the processing and scoring when the test is actually administered. The school districts are charged a very low price for administering a regular test and a much higher price for a special ed test. The price for the regular test typically just barely covers the marginal cost of administering the test. This pricing structure is designed to encourage local schools to fulfill their obligation to administer the tests to all students in the classroom. (The states want to discourage school districts from gaming the system by diverting low-performing students to the "Sped" test or by having them simply not take the test at all.) This pricing structure also creates a situation where the test scoring department is just scraping by while the other departments are massively profitable and highly productive.

What's On the Test?

What you really want to know, of course, is: what's on the test, and how is it scored? The questions on the tests tend to be not too tough, but not too easy: most students can answer the questions more or less correctly, but only the few students at the top of the bell curve can "ace" them. The subject matter on the tests varies hardly at all from state to state, aside from trivial local color such as using the names of towns in a particular state on math questions. The questions are frequently recycled from one state to another. Some states have standards-based tests (i.e., students and their schools are tested against a set of expectations) and some have norm-referenced tests (i.e., students and schools are graded each year "on the curve"), and many states try to combine those two approaches. The testing instruments themselves tend to be almost exactly the same regardless of which approach is being taken. Measured Progress does market a turnkey standards-based testing product called Progress Towards Standards, but most states prefer to gather together task forces of educators, state department of education officials, outside consultants, etc. to produce custom tests geared to the unique needs of students in that particular state. These customized tests all end up being pretty similar to one another. In some cases, states will offer their own custom tests to some grades, and pre-packaged tests to other grades.

Some states have a mixture of easy and hard questions on the test. Others like to set up the test so every question is answered correctly 50% of the time, which makes the whole test pretty difficult for most students. (You get a nicer, more symmetrical bell curve that way.)

Most of the questions on these tests are the old-fashioned fill-in-the-bubble, A-B-C-D-E-None-of-the-Above multiple choice questions. (Click here to see some sample multiple choice questions, from New Hampshire's NHEIAP test.) Except in a few states like Maine which do computerized testing, kids still have to use Number 2 pencils to fill in the bubbles. However, a few questions are open-ended questions, where the student has to solve a math problem, write a brief essay, etc. These are scanned (into rather low-resolution 2-color TIFF files) at the same time the fill-in-the-bubble questions are scored. The TIFF files are then loaded onto a database at the scoring center.

And Out Come the Freaks

Once the database is loaded on the scoring center servers, an army of test scorers comes in to score the questions. A test scorer can be just about anyone who has at least two years of college education and who is willing to take a dead-end temp job for ten dollars an hour and no benefits and not much scope for advancement. You don't really need to be an expert in the field you are scoring: in fact, it's easier to accept the set scoring standards in a given subject area if you don't have advanced training in that area. The scorers fall into three main groups: college students who are not in school this semester, adults who have other careers but are unemployed (or have flexible schedules), and retirees who still want to work part of the year but not year-round. I fall into the second group, who are gradually being pushed out of the scoring work force in favor of the first and third groups (who make fewer demands.)

The work at the downtown Dover, NH scoring center was somewhat boring (but not as boring as most temp jobs) and the working environment was generally pleasant enough. The chairs were comfortable, the noise level was usually moderate (except occasionally when construction went on overhead), and our building had a scenic location on a bend on a river, just where the tidewater and freshwater met. The management at the scoring center was ineffectual and neglectful but not too unpleasant on the rare occasions when they deigned to acknowledge our existence.

One very cool thing was that the management at the scoring center (in contrast to many places where I have worked) did actually understand that human beings have limited attention spans, need to get up and stretch every so often, need to eat and drink, need to use the restroom every so often, function best when the temperature is between 60 and 80 degrees Fahrenheit, etc. Not only did they understand these needs, they even took action on them occasionally: for example they would actually make us get up and stretch once an hour. (I appreciated the scoring center more after a couple of months at the "Logistics Center" aka "The Warehouse" where the management was utterly incapable of dealing with any of those human needs: in fact conditions were so bad there that they crossed the line between mere discomfort and outright physical and emotional abuse.)

One not so cool thing was that no one paid much attention to us: for example, during two years of hard work, I personally never once got even a perfunctory performance evaluation of any type. I never even got a simple individual "Thank You." Our immediate supervisors were temps, as were our immediate supervisors' supervisors, and the permanent managers three-plus layers above us were too busy with more important duties to bother themselves with mentoring a mob of temporary test scorers. Our immediate supervisors tended to get very frustrated. They were expected to manage without any means of rewarding good scorers, of retraining mediocre ones, or even of punishing bad ones. They were also expected to manage without telephones or email, or even their own private desks. Moreover, they had virtually no hard data about which scorers were in fact good, bad, or mediocre.

The scoring season begins promptly at 8am the day after Martin Luther King Day, with the handing-out of parking permits. This takes a long time to sort out when there are lots of new scorers. There is a large, apparently empty parking lot across the street from the scoring center with prominent Measured Progress signs all around the perimeter. This is a satellite lot for the corporate headquarters, which is on the other side of town (and whose parking lot is too small.) Supposedly, permanent headquarters employees can park their cars here and take a shuttle bus across town. In two years at the scoring center, I never saw the shuttle bus, or even any permanent employees waiting for it, though I did see a few cars parked in the lot.

The test scorers' spaces are hidden away on the far side of the Cocheco River: there are a few dozen discreetly marked leased spots tucked away between the town skateboard park and the town transfer station (also known as the "town dump") and a couple hundred more spots in a free public lot at the transfer station. There is a very pretty covered pedestrian bridge connecting these parking spaces to the scoring center, but it is at least a half mile drive by car. An HR person stands at the entrance to the scoring center and tries to explain how to get from where our cars are parked to where they should be parked. This drive is pretty hard to explain, since it involves multiple detours through a maze of one-way streets. Luckily, there is a sketch map available at the front desk, but only one copy of it, and it has the company's old name on it ("Advanced Systems for Measurement and Evaluation," which was changed to "Measured Progress" in the mid-1990s.)

By 9 am, most of us have found a parking spot and made our way back to work: we are now ready to begin an extensive two-hour training program. There is a large movie screen rolled up in the rafters of the main scoring room. At the beginning of the HR department's presentation, this screen is unfurled with great pomp and circumstance, and the assistant HR director rolls in a cart with a mid-1990s-vintage laptop attached to a video projector to show us his PowerPoint presentation. This is the only time the movie screen is ever used. Our line supervisors will make all their presentations during the remainder of the scoring project without the aid of projected PowerPoint slides. In fact, they won't even use old-fashioned transparencies (although an overhead projector is available and is in fact usually sitting out blocking traffic in the hallway where we hang our coats.)

You will get a scoring manual on your first day of work: it is about 60 pages long, photocopied on 8 ½ by 11 paper with a card stock cover. You will be expected to return it by the end of the week and/or at the end of the project. You are not allowed to write any notes in the booklet. Are these rules in place for security reasons? Nope. Those rules are in place because it's "too expensive" to print up a copy for every new scorer. It costs maybe half a buck apiece at most to print these booklets. Printing new copies for every new scorer would run as high as— yikes!— OVER 100 BUCKS every year!!! (Interestingly, even though the manual is confidential, they do let you take it home.)

By 11am we are doing actual scoring. After the first day, the routine was that we would start at 8:00 am, have a coffee break at 10:00, score till noon, break for lunch from noon to 12:30 (except for the few permanent employees, who waited till 12:30), take another coffee break at 2:15, and finish scoring at 4pm.

i-Score

The scoring is done using a proprietary database program called "i-Score." This was developed in-house, apparently by someone who didn't know the English language very well, because many of the prompts are unidiomatic. For example, when there aren't any more responses left for you to score, you are informed:

No more response.Please see QAC for reason.

It's always alarming getting an error message which doesn't make sense, not just because the message doesn't make sense but also because nonsensical error messages usually correspond to errors which rarely happen. (This message also had a minor punctuation problem: there was no space after the period in the middle of the message.) Usually, when something happens frequently, a programmer will go to the trouble of coming up with a message which actually says what the problem is. The first time I saw this prompt, I brought this to the attention of my supervisor, who told me, "Oh, that just means we ran out of papers for you to score." Running out of papers to score was an event which happened dozens of times a week in the normal course of events.

The physical security in the facility, in an old red-brick mill building on the banks of the Cocheco River in Dover, NH, was adequate, though not as secure as implied by the contract documents given to our clients, which spun out elaborate fantasies about "proximity badges" and other high-tech marvels. There was one door in the front with a receptionist, and an unguarded back door which led to an alley where most of the smokers spent their breaks.

There was a fairly strong firewall on the server, and all the activity on the network was competently and carefully monitored. However, the password system was quite weak, with short passwords which couldn't be changed and which duplicated an easily obtainable piece of data. Worse yet, in many cases, usernames and passwords were recycled. If a supervisor needed to do some live scoring, he or she would assume the identity and password of a former scorer who was no longer with the company, rather than going to the trouble of creating a new user account. This is the wrong thing to do, on just about every level imaginable. Except under fairly rare circumstances, there is no justification for letting one employee use another employee's net-identity. And even if you really really really do need to give a supervisor access to an ex-employee's account, you ought to have the system operator change the password first. (In fact, you normally should change the ex-employee's password and revoke all her permissions the second she leaves her job, even if your security is so incredibly strong that there is absolutely no way she could ever gain access to your computer assets.) The rationale for the lousy password system was that the sysop was quote-unquote "too busy" to be bothered to do his job competently.
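(To show how small a fix this would have been, here is a minimal Python sketch of the right way to handle departing scorers' accounts. I never saw the actual i-Score account system, so the data layout, field names, and function names below are entirely my own inventions.)

    import secrets

    def retire_account(users, username):
        # Disable the ex-employee's account: block logins, revoke all permissions,
        # and overwrite the old password with a random value nobody knows.
        account = users[username]
        account["active"] = False
        account["permissions"] = set()
        account["password"] = secrets.token_urlsafe(16)

    def create_scoring_account(users, username):
        # Give a supervisor who needs to do live scoring an identity of their own,
        # instead of recycling a former scorer's login.
        password = secrets.token_urlsafe(12)
        users[username] = {"active": True, "permissions": {"score"}, "password": password}
        return password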

It seemed like the company had great difficulty figuring out who was working on what project at any given time. When we arrived in the morning, a receptionist (a temp like us, but not a scorer herself) wrote down our names. (If there was no receptionist, the assistant scoring manager did the honors, apparently because he had the least seniority and/or there was no one else available who was willing to actually interact with the scorers.) You would think that they could tell who was working on what and for how long just by looking at the server logs, but that was not how they did it. (I suspect one reason for this was the usual problem of the sysop being too busy. Also, if he had to look at the logs himself, I suppose he would on some level be interacting with the scorers, which would be an affront to his dignity.) They tracked our hours worked by printing up timesheets with everyone's names on them. We wrote in our total hours worked, to the tenth of an hour, in #2 pencil. (If we worked on multiple projects, we used multiple timesheets.) Printing up the timesheets was an arduous process (presumably because the sysop was "too busy"): the assistant scoring manager always looked frazzled when he brought the sheets in, and the sheets appeared at totally unpredictable times, randomly made available almost anytime between 9am and 4:30pm. Even though the roster of scorers rarely changed, occasionally someone came or (more often) went, and the comings and goings evidently always caused major delays.

There was no desktop-maintenance system in place: if the desktop had to be updated or restored, the system manager actually still had to go around to each computer with a floppy disk. Changing your desktop settings was taboo, by the way: it was a crisis if anyone so much as dared to change the color of the desktop from the default aqua color. (There was an amusing incident once where a scorer quit or got fired in the middle of the project. His or her parting shot was to create a desktop pattern where one out of every 64 pixels was a minimally different shade of aqua from the default value. It took our supervisor a day or two to notice, but when he finally did notice he read us the riot act, even though the instigator was long gone.)

The states' contract documents spun out fantasies about state-of-the-art workstations, but the i-Score client computers were in fact rather ordinary mid-1990s Compaqs with small screens and only 32MB of RAM. One of the few perks of the job was our official Measured Progress coffee mugs, which were issued so we would not destroy the keyboards by spilling beverages on them. Ironically, the replacement value of a keyboard was less than the cost of a mug. An even worse disaster than spilling your beverage on your keyboard would have been to spill it on your CPU. The replacement cost of your CPU was evidently only a little higher than that of your mug or your keyboard: one weekend at a local computer flea-market, I saw some surplus Compaqs configured just like ours, only with CD-ROM drives (which ours lacked), being sold for $35 apiece (or best offer).

The technical infrastructure, even though it was far from state of the art, was adequate to run the i-Score software (although it would have been nice to have more than the minimal 32 Megs of RAM.) The main purpose of i-Score is to randomly flash answers to test questions on the screen, so that the scorers could score them. (One subject area, Writing, does not use i-Score. The reason for this is that i-Score times-out after a few minutes, in case the scorer leaves a response on the screen without scoring it, and this time-out value is hard-wired into the software. The Writing responses take more than this allotted time to score. Instead of modifying a few lines of code to make the time-out value adjustable, the company chooses to physically truck thousands upon thousands of writing papers over from the Logistics facility, which is located in a town several miles away from Dover. The writing scorers score all their responses by hand.)
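(To illustrate how trivial the fix would have been, here is a rough Python sketch of a time-out value read from a configuration file instead of being hard-wired into the program. I never saw the i-Score source, so the file name, section name, and default value below are all made up.)

    import configparser

    DEFAULT_TIMEOUT_SECONDS = 300  # made-up default; the real i-Score value is unknown to me

    def load_timeout(path="iscore.ini", subject="Writing"):
        # Read a per-subject time-out, falling back to the default when the
        # config file or the subject entry is missing.
        config = configparser.ConfigParser()
        config.read(path)
        return config.getint("timeouts", subject, fallback=DEFAULT_TIMEOUT_SECONDS)

With something like this in place, Writing could simply have been given a longer limit than the other subjects, instead of being scored entirely on paper.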

The Actual Scoring Process

The open-ended questions fall into four groups. "Common" questions are given to the whole population of students for a given grade in a given state. "Matrix" questions are given to a subset of students. (The "commons" are old matrix questions which are used one more time before being retired.) The last two types of questions don't count towards students' individual scores. "Equating" questions are relatively tricky questions used to make sure scorers are being trained the same way from one year to the next. "Field-Test" questions are just what their name implies: these are new questions which are being field tested by real students under live scoring conditions.

We score the questions according to a "rubric" which sets out the criteria for each score point and (if applicable) shows the correct answer for the question. The score points are represented by "exemplars."

The scoring packets are put together using a low-tech process with no desktop publishing software. The student responses which are used for the exemplars are printed up at the logistics center (from the scanned images rather than being photocopied from the original test booklets.) Photocopies are made of the printouts, and these photocopies are sent to the scoring center, where they are pasted up (literally pasted up) using glue and tape. A master photocopy is made if and when the scoring guide is completed, which is in turn used to make however many copies are needed for scorers.

The copies tend to be quite noisy (i.e., with extraneous specks on them) since no one routinely cleans the glass on the copier (which in any case collects a lot of specks because of the White-Out on the originals.) There is only a limited amount of glass cleaner on hand, which is kept under lock and key. The cleaner is locked up to prevent scoring personnel from wasting it cleaning the scoring center's 200 or so computer screens when they're not really all that dirty yet. The person who has the key to the supply closet is a permanent employee and is also extremely busy— much too busy to bother getting the glass cleaner every time a temp notices a few minor specks on the copier glass. The specks on the copies cause a lot of confusion with the math scoring packets, because sometimes scorers can't tell a decimal point apart from copier noise. The copies also tend to get blurry and slanted over time. Also, it is quite common for a quarter- or a half-inch or so on one side to get cut off. This wouldn't be a problem if it was the left side getting cut off, since there's usually a wide margin on that side to accommodate three-hole punches; however, it's always the right side which gets cut off.

The test developers do not directly develop the scoring guides. The only points of contact between the test developers and the test scorers are the "Chief Readers." The Dover scoring center had five of these, one each for the subject areas of Math, Reading, Writing, Science, and Social Studies. They are year-round employees who get health insurance and have master's degrees and some teaching experience.

The Chief Readers in most cases do not even get written memos from the Test Developers; the usual procedure is that the Chief Reader meets in person or via conference call with the Test Developers, and he or she jots down notes in Number 2 pencil. The notes about the rubric are typed up in a standard format at the scoring center. The notes about the exemplars are never typed up anywhere but are simply transcribed (once again in Number 2 pencil) into the margins of a copy kept in a looseleaf binder in the Chief Reader's office. Occasionally, when it comes time to make up scoring packets for the test scorers, the wrong looseleaf binder gets grabbed and copied, and the scorers get packets with the exemplars notated. However, the standard procedure is that a lower-ranking supervisor called a QAC or "Quack" (for Quality Assurance Coordinator) hands out a packet with unnotated exemplars and then (at the proper point in the training process) he or she reads the notations out loud from his or her notated copies. (The procedure was slightly different in the area of Writing: for some unknown reason, the Chief Writing Reader chose to actually type up his notes from the meetings with the Test Developers, and he routinely shared the Developers' comments verbatim with the test scorers during the training process. He did all sorts of other bizarre things as well: he did his own training sessions rather than leaving this chore to the QACs; he had transparencies made up to illustrate his training sessions; and he was even known to sit down alongside his scorers and score papers himself. I have no idea how he got away with such weird behavior!)

The training process tends to be rather cursory. There is no professional training of any type for test scorers: we are only trained on specific questions, with no attempt to globally enhance our knowledge of our subject areas. The trainers are under pressure to spend as little time as possible preparing for the training, partially because of scheduling pressures and partially because the company's accounting system breaks out the time trainers spend training but not the time scorers spend being trained (even though between ten and one hundred scorers are being trained by a single trainer during a typical session.)

The training for each specific question is reasonably effective, even though the trainers have to get by without any audiovisual aids (such as PowerPoint presentations, etc.) and without dedicated training rooms. Once a space is found to meet in, scorers fill up that space and the trainers hand out copies of a packet with the rubric and a large number of sample responses. There is no quality control for the packets (e.g., no one checks to see if all the pages are present and readable), but the examples are usually well-chosen. The scoring is done using a rather coarse scale from 0 to 4, so we don't have to make very fine distinctions. A 4 is a perfect answer. A 3 is a pretty good one with one or two errors. Most students get 2s: a 2 is a mediocre answer with multiple errors. If there is anything even remotely relevant to the question, the kid gets a 1. A 0 is given when there are marks in the space which totally fail to answer the question, and a Blank is given when the space is blank. There is an entire training session devoted to distinguishing between Blanks and Zeroes, since sometimes it is hard to tell if a marking was deliberate or accidental. (Some "short-answer" questions use a 0 through 2 scale.)

Once we understand the scoring criteria, we scorers march back to our workstations to be tested. (Actually we don't have to march back. In fact, the time between the training and the scoring test is a good time to take a break. A QAC has to specify which scorers will be taking the test. The process of "getting [scorers] on" a question is a long and tedious one which frequently goes wrong. So there is always a delay before we can do the test.)

The first ten (or in some cases twenty) questions are "CRRs." The C stands for "Criterion," one R stands for "Reference," and I forget what the other R stands for. If you get enough questions right, you move on to the live questions. You know you're doing live questions when you start to see Blanks. You can also tell you're doing live questions when you see many 1s or 0s in a row. Typically the CRRs have 2 or 3 samples from each score point, with no Blanks and only 1 or 2 Zeroes. The 1s and 2s are usually statistically underrepresented in the sample used for the CRR test.

The responses are "double-blinded," i.e., some responses are scored more than once. If the two scores disagree by more than two score points (e.g., if one person gives a 3 and another gives a 1) the two scorers are theoretically supposed to come see the QAC and reach a consensus on the proper score point. Once one or both scorers agree on a new score, the QAC changes the appropriate scoring entries. This is considered a very bad thing, so scorers tend to score towards the middle (giving 2s instead of 1s, etc.) In most cases, no matter how small a sampling gets doubleblinded, the QACs don't have time to meet with all the pairs of scorers, so the QAC simply changes one or both scorers' scores.

Scoring the Scorers

Even though our company was in the educational-assessment testing business, scorers' performance was not assessed in much detail. The main vehicle for monitoring scorers' performance was a daily set of not particularly comprehensible printed reports which the QACs rarely got a chance to look at and which never got printed up more than once a day (and not at a predictable time of day.) Even when the QACs did look at these reports, they would be looking at data which was at least one day old— and often much older than one day.

The scorers' version of the iScore client had no self-monitoring features at all. It didn't tell you how many test papers you had scored. It didn't tell you what scores your previous papers had been given. It didn't even give you a clock display to tell you how long you had been logged in. (The standard Windoze clock in the taskbar was still visible, but no attempt was made to synchronize the various machines' system clocks.)

After many months of scoring in the dark, I finally hit upon a crude means of monitoring my performance: I made six columns on a piece of scrap paper (0 through 4 plus "blank") and checked off each response in the appropriate column. My QAC didn't understand what purpose this could possibly serve, and I was discouraged from continuing this practice.
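(My scrap-paper system amounted to nothing more than the following, which I show here as a Python sketch just to underline how little self-monitoring it would have taken to build into the client.)

    from collections import Counter

    tally = Counter()

    def record(score):
        # score is "0" through "4", or "blank"
        tally[score] += 1

    for s in ["2", "3", "blank", "2", "1"]:
        record(s)
    print(sum(tally.values()), dict(tally))  # total papers scored, and the distribution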

The QACs' version of the iScore client had a little more information about scorers' performance. However, the only thing they ever used was the "arbitration queue," which showed how many times your double-blinded responses differed from your colleagues' responses. Scorers who turned up a lot in this queue were viewed as poor scorers, which penalized scorers who scored a lot of papers (unless they had incredibly good accuracy). Conversely, scorers who rarely turned up in the queue were viewed as good scorers, even if they were merely very, very slow. The raw number of arbitrations was not a very good metric for rating scorers: it induced the scorers to score just enough papers to meet the minimum speed standard.
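(A better metric would have been trivial to compute: divide each scorer's arbitrations by the number of papers he or she scored. The Python sketch below is my own illustration, not anything the company actually ran.)

    def arbitration_rate(arbitrations, papers_scored):
        # Rate, not raw count: a productive scorer isn't penalized just for volume.
        return arbitrations / papers_scored if papers_scored else 0.0

    # A fast scorer with 12 arbitrations out of 600 papers (2%) looks worse by raw
    # count than a slow scorer with 6 out of 100 (6%), even though the fast scorer
    # is actually more accurate.
    assert arbitration_rate(12, 600) < arbitration_rate(6, 100)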

To make things worse, scorers who worked slowly were praised for "taking the time to score the papers accurately." Aside from the fact that speed is more important than accuracy, accuracy is by no means inversely correlated with speed. A speedy scorer who gets to see a wide variety of test papers is not as bored as one who agonizes over just a few papers. Also, the more papers a scorer sees, the more sense he or she has of how each individual response compares to the whole universe of responses. Even if it were possible (which it isn't) to score each paper strictly by the rubric rather than by comparing it to the universe of all the other papers, you would still need to know all the different ways individual students' answers can fall short of and/or go beyond the curriculum standards embodied in the rubric.

And Finally...

Once all the questions are scored, the results go back to the Logistics group, where a Data Analyst translates the raw scores into reported scores which no one actually understands, not unlike those SAT scores you got your senior year in High School. In most cases, you're supposed to assume that these scores fall into a "Bell Curve."

The algorithms used to create these scores often tend to produce an inordinately large number of scores just barely below the passing level, which recently created political turmoil in at least two different states tested by Measured Progress (Georgia and Massachusetts) and one which runs its own tests (New York.)

So, what should you do if your kid has to take these tests? Ideally, you should home-school your kids. In most cases, this is impractical, however! Failing that, you should remind your children not to get uptight about these tests, especially since (with a few exceptions) the school is being tested rather than the individual students. These tests are just tests, which only measure how well you do on the test, which has nothing to do with real life. These tests don't tell you anything more than that. And of course, you should give your kids the chance to explore learning opportunities outside of school, at home in your community, where your kids and you can learn about the knowledge which isn't on the test. In most cases, the truly important knowledge is not gonna be on the test!

I've painted a not-too-positive picture of Measured Progress, which in some ways is unfair. They're not the greatest company in the world, but they're not the worst. They don't spend all that much money on their employees, but most companies spend even less. They do cut corners in some funny places. For example, the first thing I saw when I walked into the scoring center was a whiteboard with various approximations of the constant Pi on it ("π=3.14 or 3.14159 or 22/7.") This was necessary because the pocket calculators provided to the scorers lacked trig functions, even though virtually every Grade 8 or above test has trigonometry on it. The rationale for not buying scientific calculators was that they're more expensive than basic 4-function models, but the amount of money being saved was minuscule.

But the company was generous in other areas: for example, as mentioned above, the chairs actually are comfortable enough to sit on all day. And there is a cultural explanation which partially explains why they have been unable to come up with an effective means of scoring the scorers. Educational assessment tests are designed to produce One Big Number which uses a pseudoscientific scale not at all visibly related to the thing being measured. (For example, the old version of the SAT gave each student a total score between 400 and 1600, instead of telling the student how many questions she answered correctly. Instead of reporting that she answered, say, 23 questions correctly out of 48 on the Math section and 34 questions out of 36 on the English section, you would tell her that her total SAT score was 1400.) It would be difficult to boil the various elements of a test scorer's work down to One Big Number, and operationally such a number wouldn't be very valuable information anyway. But the One Big Number ideal gets in the way of deciding which numbers to use to evaluate the scorers, and hence the decision-making process never culminates in a decision. And so, the company continues not scoring the scorers at all.


A Few Random Thoughts

This article is pretty long already, but I will add a few observations here.

The open-ended questions are generally designed to get a mean score somewhere around 2.5. Preferably, it should be a little less, since the 0 score point takes in some students who could answer the question but chose not to (as well as the students who totally didn't know the material). This practice gives the nicest bell curve, but it also means that the open-ended questions concentrate on topics which the average student will get a "2" score on. This is not necessarily the material which is the most essential for students to know: in fact, it's likely not to be. This means that, when teaching to the test, schools will ignore the advanced topics for the grade level, which are left off the test because the scores on questions about these topics would be too low to yield a nice bell curve. And, to a lesser extent, the very basic topics will be ignored, since they would tend to get scores which would be too high, unless of course the scores are artificially lowered by adjusting the rubric and/or by designing tricky questions. (The big problem with tricky questions is, unfortunately, that you end up merely testing the students on their ability to see through the test developers' tricks. Slightly smaller problems are that it is hard to predict which tricks kids will or will not fall for, and that we end up making a lot of kids feel discouraged when confronted by test questions which they can almost but not quite answer correctly.)

As I said already, the open-ended questions are graded according to a rather coarse scale. This means you lose a lot of fine distinctions, particularly amongst the 2s and 3s. There are a lot of ways to get a 2 or a 3, and there is a particularly big difference between a "High 2" and a "Low 2." But no effort is made to keep track of the differences within the score points. The coarse grading scale has obvious operational advantages: the tests are being scored quickly by low-paid temps with (in most cases) no specialized training. It's a lot easier to train the temps to distinguish a 2 from a 1 or a 3 than to distinguish a 2.5 from a 2.9 or a 2.1. But the validity of the scoring system tends to be overestimated. Much emphasis is placed on the statistical consistency of the results, i.e., on the fact that equivalent questions get scored the same way every year (as demonstrated by the "equating" questions). Unfortunately, the statistical measures don't tell us whether or not what's being measured has any larger validity. We take it on faith that the ability to answer test questions is correlated with academic ability in general, but this correlation is by no means perfect.

Using the rubrics on the tests to measure academic ability is a little like (to cite a fanciful hypothetical example) using a series of body-shape outlines to estimate weight (perhaps in the interests of monitoring childhood obesity.) Let's say that instead of putting kids on a scale to measure their weight, schools took pictures of them and then drew outlines around their bodies. At the same time, let's say that you developed 5 general weight classes, with 0 being less than 100 pounds, 1 being 100 to 124 pounds, 2 being 125 to 149, 3 being 150 to 174, and 4 being 175 and over. As an alternative to putting the kids on a scale, you could hire test scorers to come in and estimate the weight using the outlines. Even if you used a perfectly accurate scale, the results in pounds on the scale for any given kid would vary from day to day, since your weight actually does vary. Statistically, the results would be more replicable if you used the outlines and the weight classes— but they would be much less precise.
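(If you want the analogy spelled out, the weight-class scheme boils down to a lookup like the Python sketch below, using the hypothetical cutoffs from the previous paragraph.)

    def weight_class(pounds):
        # Coarse 0-4 buckets: highly replicable, but the precision of the
        # underlying measurement is thrown away.
        if pounds < 100:
            return 0
        if pounds < 125:
            return 1
        if pounds < 150:
            return 2
        if pounds < 175:
            return 3
        return 4

    # A kid whose true weight drifts between 148 and 152 pounds from day to day
    # straddles the boundary between classes 2 and 3, while a scale would show
    # only a four-pound swing.
    print(weight_class(148), weight_class(152))  # 2 3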

Or to use a slightly less fanciful example, suppose the NFL stopped scouting college football players and relied strictly on the few standardized tools used at the scouting combines, which include a 40-yard dash, a standing high jump, and a bench press (as well as paper-and-pencil IQ and personality tests). The process would certainly be a lot less subjective if teams didn't bother to look at the intangibles. And then, let's suppose college football programs placed less emphasis on playing football and more on preparing players for the combines. The players would be somewhat better at running 40 yards in a straight line, and their performance could be quantified with perfect objectivity. But they would be a lot worse at actually playing football. And you would miss out on a lot of guys who tested poorly and who didn't have the ideal credentials, but who are great football players. For example, a totally objective testing system would have filtered out the best player in the game right now, Patriots quarterback Tom Brady, who was a sixth-round draft choice.


 July 8, 2010

Actually, my hypothetical about the NFL may not be so hypothetical. The league does seem to place more and more emphasis on standardized testing. They even make players take an IQ-like test called the "Wonderlic test."

Something happened during the 2010 draft which makes me wonder if the league has gone test-crazy. 4 of the top 5 players in the 2010 draft (along with the #21 pick in the 1st round and three later-round choices) were members of the 2009 edition of the Oklahoma Sooners. Yes, OU has one of the best programs in the NCAA, and I am actually a big fan of theirs. But the 2009 team had a crappy season. They went 7-5 in the regular season, which is only 1 game above .500 (although they did beat Stanford in the Sun Bowl on New Year's Eve). The Sooners were not good on the field, but they aced the standardized tests at the post-season combines. The #1 pick was quarterback Sam Bradford, who did win the Heisman Trophy in 2008, but who also only played a game and a half in 2009. He is rumored to have deliberately chosen to skip the last half of the season, after the second of his two shoulder injuries. Supposedly, he could have come back and played some more, but he opted to concentrate on his first priority, which was preparing for the pre-draft tests. Be that as it may, Bradford did great on those tests, except for those which involved actually throwing the football: he skipped those because his shoulder was still healing from surgery. (He got a 36 on the Wonderlic test, which is a very good score.)

But the Sooners' three other top-five picks were all relatively obscure linemen who hadn't done anything all that notable in college. They were assessed by the scouts as having questionable work ethics and/or other glaring flaws. But they were big and strong and fast, and they did great on the standardized tests, so they got drafted ahead of guys who had been actual stars in college. (I should add that I do hope everything works out well for each one of the Sooners who got drafted.)


 2005 LOGISTICS PROBLEMS 

 The info which used to be here has been moved to its own page:

http://www.timothyhorrigan.com/documents/measuredprogress.logistics.2005.html



