Source: Yuriy Guts selection from Shutterstock
The number of practitioners of AI, machine learning, predictive modeling, and data science has grown enormously over the last few years. What was once a niche field defined by its blend of knowledge is becoming a rapidly growing profession. As the excitement around AI continues to grow, the new wave of ML augmentation, automation, and GUI tools will lead to even more growth in the number of people trying to build predictive models.
But here’s the rub: While it becomes easier to use the tools of predictive modeling, predictive modeling knowledge is not yet a widespread commodity. Errors can be counterintuitive and subtle, and they can easily lead you to the wrong conclusions if you’re not careful.
I’m a data scientist who works with dozens of expert data science teams for a living. In my day job, I see these teams striving to build high-quality models. The best teams work together to review their models to detect problems. There are many hard-to-detect ways to end up with a problematic model (say, by allowing target leakage into the training data).
Identifying issues is not fun. This requires admitting that exciting results are “too good to be true” or that their methods were not the right approach. In other words, it’s less about the sexy data science hype that gets headlines and more about a rigorous scientific discipline.
Almost a year ago, I read an article in Nature that claimed unprecedented accuracy in predicting earthquake aftershocks by using deep learning. Reading the article, I became deeply suspicious of the results. The methods simply didn’t carry many of the hallmarks of careful predictive modeling.
I started to dig deeper. In the meantime, this article blew up and became widely recognized! It was even included in the release notes for TensorFlow as an example of what deep learning could do. However, in my digging, I found major flaws in the paper: namely, data leakage, which leads to unrealistic accuracy scores, and a lack of attention to model selection (you don’t build a six-layer neural network when a simpler model provides the same level of accuracy).
The testing dataset had a much higher AUC than the training set . . . this is not normal
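As a rough illustration of the kind of check involved, here is a minimal sketch in plain Python. It is not the code from the paper or from my analysis. It assumes one common form of leakage, where rows derived from the same underlying event end up in both the training and the testing set, and it guards against that by splitting on a group key. The field name `event_id`, the helper names, and the 0.02 tolerance are all hypothetical choices made for this example.

```python
import random


def grouped_train_test_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so that each group lands entirely in train or entirely in test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of rows, so this is
    only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_gap_is_suspicious(train_auc, test_auc, tolerance=0.02):
    """Flag a test AUC that beats the training AUC by more than `tolerance`.

    The tolerance is an arbitrary assumption; pick one that fits your data.
    """
    return (test_auc - train_auc) > tolerance


# Hypothetical data: 10 events with 4 rows each.
rows = [{"event_id": i // 4, "feature": i} for i in range(40)]
train, test = grouped_train_test_split(rows, "event_id")
shared = {r["event_id"] for r in train} & {r["event_id"] for r in test}
print("events shared between train and test:", len(shared))
```

A flagged gap is a prompt to investigate how the split was made, not proof of leakage on its own: small test sets can score higher than the training set by chance.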
To my earlier point: these are subtle but incredibly basic predictive modeling errors that can invalidate the results of an entire experiment. Data scientists are trained to recognize and avoid these issues in their work. I assumed that this was simply overlooked by the author, so I contacted her and let her know so that she could improve her analysis. Although we had previously communicated, she did not respond to my email about my concerns with the paper.
So, what was I to do? My coworkers told me to just tweet it and let it go, but I wanted to stand up for good modeling practices. I thought reason and best practices would prevail, so I started a 6-month process of writing up my results and shared them with Nature.
Upon sharing my results, I received a note from Nature in January 2019 that despite serious concerns about data leakage and model selection that invalidate their experiment, they saw no need to correct the errors, because “Devries et al. are concerned primarily with using machine learning as [a] tool to extract insight into the natural world, and not with details of the algorithm design”. The authors provided a much harsher response.
You can read the entire exchange on my GitHub.
It’s not enough to say that I was disappointed. This was a major journal (it’s Nature!) that bought into AI hype and published a paper despite its flawed methods.
Then, just this week, I ran across articles by Arnaud Mignan and Marco Broccardo on shortcomings that they found in the aftershocks article. Here are two more data scientists with expertise in earthquake analysis who also noticed flaws in the paper. I have also placed my analysis and reproducible code on GitHub.
Go run the analysis yourself and see the issue
I want to make it clear: my goal is not to villainize the authors of the aftershocks paper. I don’t believe that they were malicious, and I think that they would argue their goal was to just show how machine learning could be applied to aftershocks. Devries is an accomplished earthquake scientist who wanted to use the latest methods for her field of study and found exciting results from it.
But here’s the problem: their insights and results were based on fundamentally flawed methods. It’s not enough to say, “This isn’t a machine learning paper, it’s an earthquake paper.” If you use predictive modeling, then the quality of your results is determined by the quality of your modeling. Your work becomes data science work, and you are on the hook for your scientific rigor.
There is a huge appetite for papers that use the latest technologies and approaches. It becomes very difficult to push back on these papers.
But if we allow papers or projects with fundamental issues to advance, it hurts all of us. It undermines the field of predictive modeling.
Please push back on bad data science. Report flawed findings to the journals that published them. And if they don’t take action, go to Twitter, post about it, share your results, and make noise. This type of collective action worked to raise awareness of p-values and combat the epidemic of p-hacking. We need good machine learning practices if we want our field to continue to grow and maintain credibility.
Acknowledgments: I want to thank all the great data scientists at DataRobot who collaborated with and supported me this past year, including Lukas Innig, Amanda Schierz, Jett Oristaglio, Thomas Stearns, and Taylor Larkin.
This article was originally posted on Medium and featured on Reddit.