The Data Adversary

November 27th, 2012

“Each film is only as good as its villain. Since the heroes and the gimmicks tend to repeat from film to film, only a great villain can transform a good try into a triumph.”

–  Roger Ebert

Data scientists are really on a roll. In the vernacular their name has changed from “analyst,” and at the same time their image has gone from sniffling egghead to white-hatted hero played by Gary Cooper. Or maybe 007, extolled as the Sexiest of the Century. Or an even better archetype might be The Girl with the Dragon Tattoo, with a dark side and revenge on her mind.

The movement, of course, is truly significant at the grassroots level, where thousands of data scientists work without fame or fortune. Every big trend has its famous faces, though, and a few data scientists and their adherents have emerged from obscurity (baseball’s Billy Beane was one of the first to become a household name, especially when he went from book form to Brad Pitt incarnation in Moneyball). By far the hottest name in data lately, and maybe ever, has been Nate Silver, who took his predictive analytic talents from Beane’s world of baseball to politics and started predicting election outcomes with exceptional success.

For all the recent success of the data scientist née analyst, the plot only gets interesting with the entrance of a capable foil. Enter the “data adversary.” The favorite quote of the data adversary is Mark Twain’s “There are three kinds of lies: lies, damned lies, and statistics.” He knows what he sees with his own two eyes; he knows what his years of experience tell him, and your numbers don’t sway him. The same trends that have brought fame to some of our protagonists have also drawn high-profile data adversaries to the stage. Many baseball fans know of a recent drama that pitted the data nerds against the Luddite naysayers: the American League MVP vote.

For you growing legions of non-baseball fans, allow me to provide a synopsis. Baseball fans familiar with the story can skip down to the paragraph starting with “Outside of Angels and Tigers…”

Rookie phenom Mike Trout, center fielder of the LA Angels, and veteran third baseman Miguel Cabrera of the Detroit Tigers were clearly the two best candidates for the Most Valuable Player award this year. That these two stood out above the rest was beyond debate. Which of them was more deserving, on the other hand, was a matter that generated hot controversy.

Cabrera was the first player in over forty years to win the American League Triple Crown, meaning he led the league in the three traditional batting statistics of batting average, runs batted in (RBIs), and home runs. Such a rare and high-profile feat would normally be a virtual lock for MVP (Boston old-timers would bring up Ted Williams and Joe DiMaggio in 1947, but that’s another story entirely). Baseball’s dataphiles, however, had a different take.

The statistical nature of baseball has made it a magnet for data junkies for a long time. So much so that baseball “data science” has its own name, “sabermetrics,” and its practitioners are “sabermetricians.” They even have a founding father, the venerable Bill James, a man whom any data scientist—baseball fan or not—should get to know.

One of the many data-driven insights brought to the fore by sabermetricians is that the Triple Crown components, especially RBIs and batting average, are overrated and a poor representation of a player’s value to the team. Instead, sabermetricians have identified a slew of additional statistics that correlate much better with a team’s wins and losses. That’s where Trout comes in. While he trailed Cabrera in the Triple Crown stats, Trout dominated Cabrera in most of the other statistics (the kind that instantly make the data adversary’s eyes roll) that truly predict a player’s contributions to team wins and losses.

Outside of Angels and Tigers fans, the support for Cabrera and Trout broke down fairly cleanly between traditionalists (data adversaries) and “stats geeks” (data scientists), respectively. Cabrera won easily. A good representative of the data adversaries in this case came from the plume and inkwell of Mitch Albom of the Detroit Free Press, in his column “Miguel Cabrera’s award a win for fans, defeat for stats geeks.” A sampling of his hands-over-ears screaming:

Which, by the way, speaks to a larger issue about baseball. It is simply being saturated with situational statistics. What other sport keeps coming up with new categories to watch the same game? A box score now reads like an annual report. And this WAR statistic — which measures the number of wins a player gives his team versus a replacement player of minor league/bench talent (honestly, who comes up with this stuff?) — is another way of declaring, “Nerds win!”

We need to slow down the shoveling of raw data into the “what can we come up with next?” machine. It is actually creating a divide between those who like to watch the game of baseball and those who want to reduce it to binary code.

Apparently Mitch never bothered to ask any of these “nerds” if they like to watch the game of baseball in addition to paying attention to the statistics. If he had, he would have heard them all say that they love both.

This little battle about awards to men playing a boys’ game is truly instructive for all of us analysts/data scientists. Data adversaries like Mitch Albom are everywhere, of course. Every organization has them, and they aren’t dumb. They can make good points, and they are in positions of power. They’re persuasive, and they know the power of a story. Being the world’s best number cruncher who can predict outcomes with the highest percentage accuracy does not alone make one an effective data scientist, particularly in the real world, where situations are organic and outcomes are gray.

The best data scientist remembers that predictive analysis is only as good as the decisions made and actions taken based on its findings. Decisions are made by humans. Humans, unlike numbers and algorithms, are political and irrational, and they are persuaded by stories much more than by numbers. There is usually common ground between the data scientist and the data adversary, and it is in the interest of both, and of whoever signs both of their paychecks, to find it.

Incidentally, Miguel Cabrera was a gracious winner with much better perspective than many who voted for him, Albom included. To his credit, Albom quoted Cabrera at the end of his article:

“I think they can use both,” Cabrera said when asked about computer stats versus old-time performance. “In the end, it’s gonna be the same. You gotta play baseball.”