Mets Exec: More Data Doesn’t Mean Better Data
Thursday, 28 Mar 2013 | By: Paul DePodesta
We all want to know the future. Where will the market be next week? What will my house be worth next year? Who will win the game tonight? However, one of the fundamental and maddening aspects of life is uncertainty. The future is unknowable, and yet many executives get paid to predict the future—how will a particular item sell, what trends will be hot next spring, or how many home runs will a particular player hit?
In my line of work, we go to Venezuela to scout 15-year-old baseball players—literally 15 years old—and we decide whether to sign them and how much to pay them. When we engage in this activity, we are making what amounts to an explicit prediction of how that player is going to perform 10 or 12 years later, when the player will be a fundamentally different human being, both physically and emotionally, and not in Maracaibo, Venezuela, but in New York City, in front of 40,000 fans and against the greatest competition in the world.
It’s an inexact science.
Many years ago, we admitted to ourselves that we were not very good at this prediction game, or at the very least, we wanted to get better at it. So, we turned to data. If nothing else, at least data provided us with a framework with which to deal with this omnipresent uncertainty.
People have often told me that we were lucky, as baseball had a wealth of statistics that had been kept meticulously for more than 100 years. However, in reality this presented a significant challenge—how were we to wade through all of this data and establish what was truly important? That was 1999. In the intervening years, there has been an absolute explosion of available information, data so granular that we could scarcely have imagined our ability to obtain it. While the task of sifting through the data to find meaningful relationships was difficult in 1999, it is quickly becoming Herculean in the 21st century.
More data is not always better data. What we are seeking is relevant data, and as terabytes become petabytes or even exabytes, it becomes easier to create relationships or conclusions out of thin air. Our own innate psychological biases can greatly affect how we view data or even what we seek in the data, so often we will see what we want to see. With this amount of data, virtually any point of view can find supporting evidence. As the volume of data grows, this process is likely to become increasingly hazardous.
Furthermore, and maybe most importantly, the data we gather is from the past, while the future will be a unique and complex set of circumstances. As has been said, “History doesn’t repeat itself, but it does rhyme.” That observation is instructive when using big data to predict an uncertain future, because it reminds us of data’s limitations. After all, the future, while similar in some ways to the past, will be full of novel events, and yet we are constrained to use past data as our blunt predictive tool. In these cases, data from the past will typically do a poor job of equipping decision-makers to prepare for the novelty of the future, and if these events happen to carry severe consequences…well, we have all lived through the fallout.
While big data carries with it an unprecedented opportunity to understand the world around us, it also carries significant challenges that must be dealt with both seriously and diligently. Everyone who creates, administers or uses big data needs to understand where the models and algorithms break down and what the consequences can be when that happens. Being right about the future much more often than we ever were previously may also create a feeling of prescience, a feeling that is illusory and, consequently, dangerous. A healthy skepticism and appreciation for the power of the data will be necessary to use it effectively, as following the data of the past blindly into the future may prove perilous.
Big data is already heavily influencing our world, from baseball to the boardroom, and that influence will continue to grow along with available storage. Let’s just keep Icarus in mind, as enhanced predictability should not be confused with enhanced stability.