We’re excited to congratulate FAIR’s Léon Bottou and Google AI’s Olivier Bousquet on receiving the NeurIPS 2018 Test of Time Award for their paper “The Tradeoffs of Large Scale Learning,” which they presented at NIPS 2007 while Léon was a researcher at NEC Labs. To view the award presentation by Olivier Bousquet, visit the NeurIPS Facebook page for the livestreamed video (start time: 58:29).
Léon joined the Facebook AI Research team in 2015 and is best known for his work on “deep” neural networks in the 1990s, large-scale learning in the 2000s, and, more recently, causal inference in learning systems. He is also known for the DjVu document compression technology.
We recently caught up with Léon to learn more about the research paper that won him and Olivier the NeurIPS 2018 Test of Time Award.
Q: What was the research about?
A: The paper explains why a deceptively simple optimization algorithm, Stochastic Gradient Descent (SGD), proposed by Robbins and Monro in 1951, gives superior performance for large-scale machine learning problems, decisively beating apparently more sophisticated optimization algorithms.
You can read more about the research itself in “The NeurIPS 2018 Test of Time Award: The Trade-Offs of Large Scale Learning.”
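For readers who have not seen the method spelled out, here is a minimal sketch of plain SGD applied to a linear least-squares model; the model, data, and step size are illustrative choices, not anything taken from the paper itself.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=10, seed=0):
    """Plain SGD on the squared loss for a linear model y ≈ X @ w.

    Each update uses the gradient computed on a single example, so the
    cost of one step does not depend on the size of the dataset.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):       # visit examples in random order
            err = X[i] @ w - y[i]          # residual on one example
            w -= lr * err * X[i]           # single-example gradient step
    return w

# Illustrative usage on synthetic data
X = np.random.randn(1000, 5)
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * np.random.randn(1000)
print(sgd_linear_regression(X, y))
```

The key property is in the inner loop: each update touches only one example, which is what makes the per-step cost independent of how much data is available.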
Q: What led you to do the initial research?
A: Although SGD had been routinely used to train neural networks in the 1990s, very few people in the 2000s knew that its learning performance could be excellent. Because of their mathematical clarity, convex kernel methods such as SVMs were favored by researchers. But the world was changing at the same time. The volume of available data was increasing much faster than the available computational power. Olivier and I teamed up to establish theoretically why this new state of affairs makes SGD a winner for all kinds of machine learning models, whether SVMs or neural networks.
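For context, the core of the argument in the paper is a decomposition of the excess error of a learning system into three parts; in simplified notation (the statement in the paper carries more detail):

```latex
\mathcal{E} \;=\;
\underbrace{\mathcal{E}_{\mathrm{app}}}_{\text{approximation (model family)}}
\;+\;
\underbrace{\mathcal{E}_{\mathrm{est}}}_{\text{estimation (finite data)}}
\;+\;
\underbrace{\mathcal{E}_{\mathrm{opt}}}_{\text{optimization (inexact minimization)}}
```

Under a fixed computation-time budget, a crude optimizer such as SGD can afford to process many more examples than a precise one; the resulting drop in estimation error outweighs its larger optimization error, which is, in a nutshell, why SGD wins at scale.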
Q: What happened when the paper was originally published, and how was it received in the community then?
A: By coincidence, I was also asked to give a NIPS tutorial in 2007 about large-scale learning. Since I was convinced that this paper was the proper way to approach the problem, a good part of my tutorial consisted of explaining this work and showing practical examples where a very simple SGD implementation decisively outperformed the sophisticated competing methods. Because I had made the source code available, some people attending the tutorial managed to download the code and replicate the results in real time. This fact made an impression…
Q: How has the work been built upon? Is there any impact of this work in products we see today?
A: This work contributed to the spectacular return of stochastic gradient optimization methods in machine learning. In the following ten years, SGD and its variants went from being discarded a priori to becoming the first methods people think about when facing a machine learning problem. Although this is not always the best choice, there are many cases for which it is the only credible method. For instance, all deep neural networks today are trained with either SGD or one of its variants. This algorithm became the engine that powers AI today.
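As a small illustration of how routine this has become, a stochastic gradient step in a modern framework such as PyTorch typically looks like the sketch below; the tiny model and random batch are placeholders standing in for any deep network and its data.

```python
import torch
import torch.nn as nn

# A tiny placeholder model and batch; any deep network plugs in the same way.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()

inputs = torch.randn(32, 20)
targets = torch.randn(32, 1)

# One stochastic gradient step: gradient on a mini-batch, then a parameter update.
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()
```

Variants such as momentum SGD or Adam change only the optimizer line; the stochastic gradient loop itself stays the same.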
Q: Were there any surprises along the way?
A: This is not really a surprise, but SGD works even better than our framework predicted. We knew that there were other effects with a positive impact on the performance of SGD, but the effects we took into account were already sufficient to make our point in a clear and simple way. People later observed that these other effects make a real difference in practice. In general, the optimization of deep neural networks is still full of unsolved mysteries. Things could change dramatically in the future.
Q: What is your current focus?
A: My personal focus is on understanding the gap that separates machine learning from machine intelligence. This is a conceptual problem because we do not even know how to speak precisely about the phenomena that we observe when the data distributions change or when machine learning systems exploit superficial correlations that result from the vagaries of the data collection process. I believe that thinking about causation provides many ways to understand these phenomena better. My colleagues and I are organizing a NeurIPS workshop about this topic on Friday, December 7.