In recent years, consumers have been spending more of their lives online and giving up troves of data to businesses and governments that are eager to put them to use. But how can these organizations use the data to make good decisions? Enter data science – an emerging field that attempts to methodically interpret and analyze large amounts of data.
Pomona College Economics Professor Gary Smith, who has spent decades studying how people misuse data, argues in the new book “The 9 Pitfalls of Data Science” that businesses and governments are making many of the same mistakes in the era of big data that were made before the data deluge began.
“When people data mine or torture data, they often end up with preposterous conclusions,” says Smith. “Along comes big data and powerful computers analyzing the data so it’s a lot easier and faster to come up with preposterous conclusions.
“Some think that if we have more data then we’ll make better decisions, but it’s not that simple.”
Jay Cordes ’93, coauthor of “The 9 Pitfalls of Data Science,” adds that historical data may be helpful for finding new ideas that can be tested, but to really put the ‘science’ into data science, you need to know how to run valid experiments to reach reliable conclusions.
In the book, Smith and Cordes emphasize how scientific rigor and critical thinking skills are indispensable in this age of big data, as computer algorithms often find meaningless patterns that can lead to dangerous false conclusions. Instead of focusing only on success stories of companies and governments using data science well, the book is mostly about failure stories, or “pitfalls” to avoid when doing data science.
Even though the book is mostly geared towards budding data scientists or managers of data scientists, the authors hope it will be interesting to anyone who wants to develop an ability to evaluate evidence.
One of the pitfalls, “Being Surprised by Regression toward the Mean,” explains the reality behind the Sports Illustrated Jinx. People started noticing that athletes who appear on the cover of Sports Illustrated don’t perform as well afterwards. The jinx appeared to be real. However, when a player or team does something exceptional enough to earn a place on the cover of Sports Illustrated, there is essentially nowhere to go but down. To the extent luck plays a role in athletic success, the players at the top almost certainly benefited from the good variety, and that good luck is unlikely to repeat. There’s a Swedish proverb which states “luck doesn’t give, it only lends.”
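The logic behind the jinx can be illustrated with a small simulation. The sketch below is not from the book; it assumes each athlete's season is simply fixed skill plus random luck, and the function name, parameters and numbers are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_cover_jinx(n_athletes=10_000, top_n=100, luck_sd=1.0, seed=0):
    """Compare the top performers' cover season with their next season.

    Assumption (illustrative only): performance = skill + luck, where
    skill is fixed per athlete and luck is redrawn every season.
    """
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n_athletes)]
    season1 = [s + rng.gauss(0, luck_sd) for s in skill]
    season2 = [s + rng.gauss(0, luck_sd) for s in skill]

    # "Cover athletes": the best performers in the first season.
    cover = sorted(range(n_athletes), key=lambda i: season1[i], reverse=True)[:top_n]

    before = statistics.mean(season1[i] for i in cover)
    after = statistics.mean(season2[i] for i in cover)
    league_average = statistics.mean(season2)
    return before, after, league_average


before, after, league_average = simulate_cover_jinx()
print(f"cover season: {before:.2f}")
print(f"next season:  {after:.2f}")
print(f"league avg:   {league_average:.2f}")
```

Under these assumptions, the cover athletes slip in the following season yet still finish well above the league average: they are genuinely skilled, but they were also lucky, and the luck does not carry over. Setting `luck_sd=0` removes luck from the model, and the drop disappears.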
Smith and Cordes were both at Pomona in the 1990s, but they never crossed paths. A math major at Pomona, Cordes pursued software development and data analysis after college. Even though the term hadn’t really caught on yet, he spent his free time doing things that would now be considered data science, like developing a profitable strategy for online poker or helping develop the first objective ranking system for mixed martial arts (MMA) fighters.
While working in Los Angeles, Cordes heard about Smith’s academic background through a professional contact. Cordes, who also lives in Claremont, then reached out to Smith and asked if he would come to L.A. to give a talk about statistics at the company where Cordes worked.
“When I saw Gary was from Pomona College, I realized it could be a great opportunity to pick his brain since we could take the train together from Claremont into downtown L.A.,” says Cordes. “Once he was trapped on the train with me, Gary was willing to let me pepper him with questions for the whole ride.”
They kept in touch and years later decided to combine their experiences and expertise to write “The 9 Pitfalls of Data Science.” Smith provides the theoretical foundation for the book, while Cordes draws on stories from the corporate world and from projects at Berkeley, where he received his master’s degree in data science, to drive home the lessons. “Basically, I help make the case that the pitfalls are real, pervasive and that even experts fall into them,” says Cordes.