Course Description
The past few years have seen a dramatic increase in the performance of recognition systems thanks to the introduction of deep networks for representation learning. However, the mathematical reasons for this success remain elusive. One key issue is that the training problem is nonconvex, so optimization algorithms are not guaranteed to return a global minimum. Another is that, even though the number of parameters in deep networks is very large relative to the number of training examples, deep networks appear to generalize very well to unseen examples and new tasks. This course will survey recent work on the theory of deep learning that aims to understand the interplay between architecture design, regularization, generalization, and optimality properties of deep networks.
Syllabus
- Introduction
- 10/26: Brief History of Neural Networks
- 10/26: Impact of Deep Learning in Computer Vision, Speech and Games
- 10/26: Key Theoretical Questions: Optimization, Approximation and Generalization
- 10/26: Overview of Recent Work in Optimization, Approximation and Generalization
- Reading
- Optimization Theory
- Analysis of the Geometry of the Error Surface
- 10/26: Positively Homogeneous Network Architectures and Regularizers
- 11/02: Global Optimality for Matrix Factorization
- 11/02: Global Optimality for Positively Homogeneous Networks
- Analysis of Optimization Algorithms
- 11/09: Analysis of Stochastic Gradient Descent (SGD)
- 11/09: Analysis of Entropy SGD
- 11/09: Analysis of Dropout (an illustrative code sketch follows the reading list below)
- Reading
- 10/26: P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53-58, 1989.
- 10/26: B. Haeffele and R. Vidal. Structured Low-Rank Matrix Factorization: Global Optimality, Algorithms, and Applications. arXiv:1708.07850, 2017.
- 10/26: B. Haeffele and R. Vidal. Global Optimality in Neural Network Training. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- 11/02: P. Chaudhari et al. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. arXiv:1611.01838, 2016.
- 11/02: N. Srivastava et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 2014.
- 11/02: J. Cavazza, B. D. Haeffele, C. Lane, P. Morerio, V. Murino, R. Vidal. Dropout as a Low-Rank Regularizer for Matrix Factorization. International Conference on Artificial Intelligence and Statistics, 2018.
- 11/02: P. Mianjy, R. Arora and R. Vidal. On the Implicit Bias of Dropout. International Conference on Machine Learning, 2018.
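The dropout papers above study dropout applied to matrix factorization and shallow networks. As a concrete point of reference, here is a minimal NumPy sketch (my own illustration, not code from any of the papers) of SGD on a matrix-factorization objective with dropout applied to the rank-1 factors; the matrix sizes, rank, keep probability, step size, and iteration count are all arbitrary choices.

```python
import numpy as np

# Minimal sketch (illustrative): SGD on a matrix-factorization objective
# X ~ U @ V.T with dropout applied to the rank-1 factors. Each factor is
# kept with probability p and rescaled by 1/p, so the masked reconstruction
# is unbiased in expectation. All sizes and hyperparameters are arbitrary.
rng = np.random.default_rng(0)
n, m, r = 50, 40, 10            # data dimensions and factorization rank
p = 0.5                         # probability of keeping each factor
lr = 0.01                       # step size
X = rng.standard_normal((n, 5)) @ rng.standard_normal((5, m))  # low-rank data

U = 0.1 * rng.standard_normal((n, r))
V = 0.1 * rng.standard_normal((m, r))

for _ in range(5000):
    z = (rng.random(r) < p) / p            # dropout mask over the r factors
    R = (U * z) @ V.T - X                  # residual of the masked reconstruction
    # Gradient of 0.5 * ||(U diag(z)) V^T - X||_F^2 with respect to U and V.
    gU = (R @ V) * z
    gV = R.T @ (U * z)
    U -= lr * gU
    V -= lr * gV

print("relative reconstruction error:", np.linalg.norm(U @ V.T - X) / np.linalg.norm(X))
```

The dropout noise keeps the iterates from fitting X exactly; the Cavazza et al. and Mianjy et al. papers above characterize the regularizer that this masking induces in expectation.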
- Approximation Theory
- 11/16: Neural Networks as Universal Approximators (a numerical sketch follows the reading list below)
- 11/16: Deep versus Shallow Networks
- 11/16: Analysis of Approximation Error Based on Wavelets and Sparsity
- Reading
- 11/09: G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303-314, 1989.
- 11/09: K. Hornik, M. Stinchcombe and H. White. Multilayer feedforward networks are universal approximators, Neural Networks, 2(3), 359-366, 1989.
- 11/09: K. Hornik. Approximation Capabilities of Multilayer Feedforward Networks, Neural Networks, 4(2), 251-257, 1991.
- 11/16: G. Montúfar, R. Pascanu, K. Cho, Y. Bengio. On the number of linear regions of deep neural networks. NIPS, 2014.
- 11/16: H. Mhaskar and T. Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 2016.
- 11/16: M. Telgarsky. Benefits of depth in neural networks. COLT 2016.
- 11/16: H. Bölcskei, P. Grohs, G. Kutyniok, P. Petersen. Memory-optimal neural network approximation. Wavelets and Sparsity, 2017.
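As a numerical companion to the universal-approximation results of Cybenko and Hornik et al., here is a minimal NumPy sketch (an illustration under my own choices, not code from the papers) that approximates a continuous function on [0, 1] by a superposition of sigmoids. To keep it short, the hidden weights and biases are drawn at random and fixed, and only the output coefficients are fit, by least squares; the target function, width, and weight ranges are arbitrary.

```python
import numpy as np

# Minimal sketch (illustrative): approximate a continuous function on [0, 1]
# by a superposition of sigmoids, sum_i c_i * sigmoid(w_i * x + b_i), in the
# spirit of Cybenko (1989). The hidden weights/biases are random and fixed;
# only the output coefficients c are fit, by least squares.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def target(x):                       # arbitrary continuous target function
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)

n_hidden = 200                       # number of sigmoid units (arbitrary)
w = rng.uniform(-30, 30, size=n_hidden)
b = rng.uniform(-30, 30, size=n_hidden)

x = np.linspace(0.0, 1.0, 500)
H = sigmoid(np.outer(x, w) + b)      # hidden activations, shape (500, n_hidden)
c, *_ = np.linalg.lstsq(H, target(x), rcond=None)  # fit the output layer

print("max approximation error:", np.max(np.abs(H @ c - target(x))))
```

With a few hundred random sigmoid units the fit is typically accurate to a small fraction of the target's amplitude; the theorems above say that, when the hidden weights can also be chosen freely, the error can be driven below any prescribed tolerance.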
- Generalization Theory
- 11/30: VC Dimension of Neural Networks
- 11/30: Path-SGD (a sketch of the path norm follows the reading list below)
- 12/08: Information Bottleneck
- 12/08: Information Dropout
- Reading
- 11/30: E. Sontag. VC Dimension of Neural Networks. Neural Networks and Machine Learning, 1998.
- 11/30: P. Bartlett, W. Maass. Vapnik-Chervonenkis dimension of neural nets. The handbook of brain theory and neural networks, 2003.
- 11/30: B. Neyshabur, R. Salakhutdinov, N. Srebro. Path-SGD: Path-Normalized Optimization in Deep Neural Networks. NIPS, 2015.
- 12/03: R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv:1703.00810, 2017.
- 12/03: A. Achille and S. Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- 12/03: T. Liang, T. Poggio, A. Rakhlin, J. Stokes. Fisher-Rao Metric, Geometry and Complexity of Neural Networks. arXiv:1711.01530, 2017.
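For concreteness, here is a minimal NumPy sketch (illustrative, not the authors' code) of the squared path norm on which the path regularizer in Path-SGD is based: for a bias-free, fully connected ReLU network it is the sum, over all input-to-output paths, of the product of squared weights along the path, which can be computed as a product of elementwise-squared weight matrices.

```python
import numpy as np

# Minimal sketch (illustrative): the squared path norm of a bias-free,
# fully connected ReLU network. For weight matrices W1, ..., WL it equals
# the sum over all input-to-output paths of the product of squared weights
# along the path, i.e. the sum of the entries of (WL**2) @ ... @ (W1**2).
rng = np.random.default_rng(0)

def squared_path_norm(weights):
    """weights: list of weight matrices, first layer first."""
    acc = weights[0] ** 2
    for W in weights[1:]:
        acc = (W ** 2) @ acc     # accumulate path products layer by layer
    return acc.sum()             # sum over all input-output pairs, i.e. all paths

# Arbitrary layer sizes: 10 inputs, hidden layers of width 20 and 15, 3 outputs.
W1 = rng.standard_normal((20, 10))
W2 = rng.standard_normal((15, 20))
W3 = rng.standard_normal((3, 15))

print("squared path norm:", squared_path_norm([W1, W2, W3]))
```

Path-SGD itself goes further and rescales each weight's gradient step using this path-wise quantity, making the update insensitive to the node-wise rescalings under which a ReLU network's function is unchanged; the sketch only computes the norm.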
Slides
10/26/18: Introduction + Optimization Theory
11/02/18: Optimization Theory: Geometry
11/09/18: Optimization Theory: Algorithms
11/16/18: Approximation Theory
11/30/18: Generalization Theory
12/07/18: Generalization Theory
Grading
- Reading (30%): Read the assigned papers as indicated above and submit a one-page critique (strengths and weaknesses) one week after the date the reading is assigned.
- Project (70%): There will be a final project, to be done either individually or in teams of up to three students. Presentations will take place on the scheduled exam day, Saturday, December 15th, 10:00 AM - 1:00 PM. Please submit answers to the questions described here.
Honor Policy
The strength of the university depends on academic and personal integrity. In this course, you must be honest and truthful. Ethical violations include cheating on exams, plagiarism, reuse of assignments, improper use of the Internet and electronic devices, unauthorized collaboration, alteration of graded assignments, forgery and falsification, lying, facilitating academic dishonesty, and unfair competition.