Want Better Models? Streamline Your Data.

MIT Sloan Sports Analytics Conference, Boston

The MIT Sloan Sports Analytics Conference is relentlessly focused on increasing the quality and quantity of data to provide competitive advantage. I worry about the quantity side of this equation, however, since we’re already drowning in data, be it in sports, business, or higher education.

So too, there is at least a rhetorical fascination here with the roles of AI and machine learning in data analytics, technologies that will generate yet more data.

The challenge, of course, is that more data does not necessarily mean better predictive value. More data does not necessarily translate into competitive advantage. In fact, it can diminish advantage by obscuring the very patterns the data is meant to reveal.

It’s in this context that I’m delighted Walt suggested I read the chapter on “Overfitting” in Brian Christian and Tom Griffiths’s 2016 book “Algorithms to Live By: The Computer Science of Human Decisions.” This notion of “overfitting” is one form of what the authors label “data idolatry.” They contend that introducing too many factors into a model - overfitting - can produce unwanted and even distorting complexity.

Christian and Griffiths argue that it’s imperative to reduce needless complexity in predictive models. In one study they cite, a nine-factor model proved a weaker and less stable predictor than a cleaner two-factor alternative.

Sure, they understand that more factors in a model will generally make for a better fit with the data you already have. That said, a better fit for the available data does not necessarily make for better prediction, especially for time periods outside the range of the data.
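To see the point in code, here is a minimal sketch in Python - my own illustration, not an example from the book, using entirely synthetic data. Polynomial degree stands in for the number of factors in a model, and each fit is scored both on the observed points and on later, held-out ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a simple linear trend plus noise, observed at times 0..11.
t_train = np.arange(12)
y_train = 2.0 * t_train + 5.0 + rng.normal(scale=3.0, size=t_train.size)

# Future observations, outside the range of the data we fit on.
t_test = np.arange(12, 18)
y_test = 2.0 * t_test + 5.0 + rng.normal(scale=3.0, size=t_test.size)

# Compare a simple two-parameter line against a many-term polynomial model.
for degree in (1, 9):
    coeffs = np.polyfit(t_train, y_train, degree)
    fit_mse = np.mean((np.polyval(coeffs, t_train) - y_train) ** 2)
    pred_mse = np.mean((np.polyval(coeffs, t_test) - y_test) ** 2)
    print(f"degree {degree}: in-sample MSE {fit_mse:10.2f}, "
          f"out-of-sample MSE {pred_mse:14.2f}")
```

On runs like this, the high-degree model hugs the observed points almost perfectly, yet its error typically explodes once it extrapolates past the range of the data, while the plain two-parameter line stays sensible. That is the instability Christian and Griffiths are warning about.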

What does this all mean? For one thing, it underscores the essential role wisdom must play in the collection and interpretation of data. You wouldn’t always know it here, but more data - especially more bad data - is not necessarily a good thing. 

Image courtesy of NG Data.