The key to Moneyball as practiced by the A's is that it valued complicated, sometimes obscure statistics, collectively called sabermetrics, over the conventional wisdom of baseball veterans. Instead of looking at a young player's physical attributes and batting average, scouts would compile tables of data: WHIP and DIPS for pitchers, VORP and WAR for hitters. These statistics are complicated, and some front offices follow them religiously. It isn't outside the realm of possibility that some GMs make decisions based on numbers they don't totally understand. This is what's happening in education policy today.
The Los Angeles Times recently released a value-added analysis of all the third- through fifth-grade teachers in the LA Unified School District. Parents, policy-makers and everyone else can now go to the linked website, search for any of these teachers, and see how well the LA Times has concluded they are doing. Value-added analysis makes sense on a large scale, and its logical basis is solid: nearly everyone agrees that judging teachers on raw test scores alone isn't fair, but we need some kind of data to evaluate teachers and determine effectiveness. So we take test scores from 2006, for example, and attach the scores to individual kids. If their scores go up after a year with me, Mr. Ehrenfeld, my value-added score improves. Looking at each of the kids in my class, you can see how effective a teacher I am.
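To make the mechanics concrete, here is a minimal sketch in Python of how a naive value-added score might be computed. The scores and the `value_added` helper are hypothetical; real models, including the one the Times commissioned, control for many more variables.

```python
# A toy illustration of a naive value-added score.
# All names and numbers here are hypothetical; real value-added
# models regress on prior scores, demographics, and more.

def value_added(prior_scores, current_scores):
    """Average change in each student's score over one year."""
    gains = [curr - prior for prior, curr in zip(prior_scores, current_scores)]
    return sum(gains) / len(gains)

# One hypothetical class: scores before and after a year with the teacher.
scores_2006 = [62, 71, 55, 80, 67]   # each student's prior-year score
scores_2007 = [68, 70, 61, 84, 73]   # same students, one year later

print(f"Value-added score: {value_added(scores_2006, scores_2007):+.1f}")
# A positive number is read as evidence the teacher "added value."
```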
This kind of analysis is useful for entire schools with lots of kids. For individual classes and teachers, it can be disastrous. Many factors affect student test performance, and most of them are beyond a teacher's control: family situation, student health, personal circumstances on the day of the test, and a huge range of others. Classes are also never assigned randomly, so there is selection bias built in from the start. But don't simply take my word for it: there is significant debate, and many smart people have concluded that value-added analyses of individual teachers are deeply flawed.
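The selection-bias problem is easy to see in a toy simulation. In the sketch below, both of my assumptions are invented for illustration: two teachers add exactly nothing, but one is assigned students whose scores would grow faster anyway. The "better" value-added score is produced entirely by class assignment.

```python
import random

# Toy illustration of selection bias: two identical teachers,
# but Teacher B is assigned students who would grow faster anyway
# (a hypothetical stand-in for non-random class assignment).
random.seed(0)

def simulate_class(baseline_growth, n=30):
    # Each student's score gain = background growth + noise.
    # Neither teacher adds anything: the teaching effect is zero for both.
    return [baseline_growth + random.gauss(0, 5) for _ in range(n)]

gains_a = simulate_class(baseline_growth=2)   # less advantaged class
gains_b = simulate_class(baseline_growth=6)   # more advantaged class

print(f"Teacher A value-added: {sum(gains_a) / len(gains_a):+.1f}")
print(f"Teacher B value-added: {sum(gains_b) / len(gains_b):+.1f}")
# Identical teaching, different scores: the gap is entirely assignment.
```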
Professor Linda Darling-Hammond of Stanford advised President Obama on education issues during the campaign and his transition, only to be passed over for the top job at the Department of Education in favor of neophyte Arne Duncan. She wrote, "Unfortunately...these measures are highly unstable for individual teachers." Her insistence that test scores be contextualized if they are to be used in evaluation is the most compelling argument on this issue: test scores alone, just like OPS alone, cannot tell you how well an individual does his or her job, be that a teacher or a third baseman.
Diane Ravitch, a respected and controversial education historian, writes that value-added models are "problematic" and subject to extremely high measurement error, meaning the results could simply be wrong because of small sample sizes and noise in the testing itself. With three years of student scores, Ravitch points out, the error rate is 25 percent: many good teachers will be identified as "needing improvement," and some bad teachers will even be rated "highly effective."
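Ravitch's point about misclassification is also easy to simulate. The sketch below is my own toy model, not hers or the Times': every teacher gets a true effectiveness plus random measurement noise, and we then count how many teachers flagged in the bottom quartile don't actually belong there. The noise level is an assumption chosen only to show the mechanism.

```python
import random

# Toy simulation of measurement error in value-added ratings.
# Assumptions are mine for illustration: true teacher effects and
# test noise are both normally distributed, with noise comparable
# in size to the true differences between teachers.
random.seed(42)

N_TEACHERS = 1000
true_effects = [random.gauss(0, 1) for _ in range(N_TEACHERS)]
observed = [t + random.gauss(0, 1) for t in true_effects]  # effect + noise

def bottom_quartile(values):
    """Indices of the teachers in the lowest 25% of the given scores."""
    cutoff = sorted(values)[len(values) // 4]
    return {i for i, v in enumerate(values) if v <= cutoff}

truly_weak = bottom_quartile(true_effects)
rated_weak = bottom_quartile(observed)

# How many teachers rated "needing improvement" actually aren't?
false_flags = len(rated_weak - truly_weak) / len(rated_weak)
print(f"Share of flagged teachers who are misclassified: {false_flags:.0%}")
```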
Statistics can be accurate and correctly measured and still be misleading. From year to year, baseball players, like teachers, put up divergent numbers. In all likelihood there are a few Alex Rodriguez teachers out there who are consistently good, but the more common teacher will be comparable to Eric Chavez, the former A's third baseman whom Billy Beane frequently called a better value than, and competitive with, A-Rod. Beane was wrong, and so are the policy people who argue that value-added models can tell us how effective a teacher is and will be.
The best, fairest teacher evaluations are comprehensive. Student test scores, even run through a value-added model, ought to play a small role in the process, but the biggest piece should be administrator and peer review. It's the only way to get an accurate picture of teacher effectiveness.
Moneyball and sabermetrics revolutionized the way Major League front offices evaluate players. But consider Jeremy Bonderman, whom Beane's A's drafted in 2001. Beane was so incensed that his scouts had taken Bonderman that he reportedly threw a chair at the wall so hard it exploded on impact. Yet Bonderman's career, even judged by the advanced sabermetrics, has been excellent, vindicating the scouts who drafted him.
But this is only baseball. Teachers and students and public education as a whole are too important to allow mistakes like this to happen. A great teacher who is fired for being ineffective loses his or her livelihood and deprives future students of all he or she has to offer. Denying students the opportunity to be taught by truly excellent teachers--those who really inspire greatness and help students develop--is a travesty beyond the mistake Beane made by trading Bonderman the year after he was drafted.
The obsession with statistics, now commonplace in baseball and growing in importance in education, is extremely problematic. Just look at the recent scandal in New York, where the bar for passing state tests turned out to be subjective, and it moves. Or look at David Leonhardt's review of teacher evaluation and test-score obsession in last week's New York Times Magazine. Campbell's Law holds true in education as elsewhere: the more a quantitative indicator is used for social decision-making, the more subject it becomes to corruption pressures, and the more apt it is to distort and corrupt the very processes it was originally intended to measure.
Overvaluing test scores in teacher evaluation is wrong, and basing teacher evaluations wholly on value-added models is criminal. Thanks for nothing, Billy Beane.