Course Description
The past few years have seen a dramatic increase in the performance of recognition systems thanks to the introduction of deep networks for representation learning. However, the mathematical reasons for this success remain elusive. One key issue is that the training problem is nonconvex, so optimization algorithms are not guaranteed to return a global minimum. Another is that, even though the number of parameters in deep networks is very large relative to the number of training examples, deep networks appear to generalize very well to unseen examples and new tasks. This course will survey recent work on the theory of deep learning that aims to understand the interplay between architecture design, regularization, generalization, and optimality properties of deep networks.
Syllabus
- Introduction
- 10/26: Brief History of Neural Networks
- 10/26: Impact of Deep Learning in Computer Vision, Speech and Games
- 10/26: Key Theoretical Questions: Optimization, Approximation and Generalization
- 10/26: Overview of Recent Work in Optimization, Approximation and Generalization
- Reading
- Optimization Theory
- Analysis of the Geometry of the Error Surface
- 10/26: Positively Homogeneous Network Architectures and Regularizers
- 11/02: Global Optimality for Matrix Factorization
- 11/02: Global Optimality for Positively Homogeneous Networks
- Analysis of Optimization Algorithms
- 11/09: Analysis of Stochastic Gradient Descent (SGD)
- 11/09: Analysis of Entropy SGD
- 11/09: Analysis of Dropout (an illustrative code sketch follows the reading list below)
- Reading
- 10/26: P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53-58, 1989.
- 10/26: B. Haeffele and R. Vidal. Structured Low-Rank Matrix Factorization: Global Optimality, Algorithms, and Applications. arXiv:1708.07850, 2017.
- 10/26: B. Haeffele and R. Vidal. Global Optimality in Neural Network Training. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- 11/02: P. Chaudhari et al. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. arXiv:1611.01838, 2016.
- 11/02: N. Srivastava et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 2014.
- 11/02: J. Cavazza, B. D. Haeffele, C. Lane, P. Morerio, V. Murino, R. Vidal. Dropout as a Low-Rank Regularizer for Matrix Factorization. International Conference on Artificial Intelligence and Statistics, 2018.
- 11/02: P. Mianjy, R. Arora and R. Vidal. On the Implicit Bias of Dropout. International Conference on Machine Learning, 2018.
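The dropout papers above study dropout applied to matrix factorization and shallow networks. As a concrete point of reference, here is a minimal NumPy sketch (my own illustration, not code from any of the papers) of SGD on a matrix-factorization objective with dropout applied to the rank-1 factors; the matrix sizes, rank, keep probability, step size, and iteration count are all arbitrary choices.

```python
import numpy as np

# Minimal sketch (illustrative): SGD on a matrix-factorization objective
# X ~ U @ V.T with dropout applied to the rank-1 factors. Each factor is
# kept with probability p and rescaled by 1/p, so the masked reconstruction
# is unbiased in expectation. All sizes and hyperparameters are arbitrary.
rng = np.random.default_rng(0)
n, m, r = 50, 40, 10            # data dimensions and factorization rank
p = 0.5                         # probability of keeping each factor
lr = 0.01                       # step size
X = rng.standard_normal((n, 5)) @ rng.standard_normal((5, m))  # low-rank data

U = 0.1 * rng.standard_normal((n, r))
V = 0.1 * rng.standard_normal((m, r))

for _ in range(5000):
    z = (rng.random(r) < p) / p            # dropout mask over the r factors
    R = (U * z) @ V.T - X                  # residual of the masked reconstruction
    # Gradient of 0.5 * ||(U diag(z)) V^T - X||_F^2 with respect to U and V.
    gU = (R @ V) * z
    gV = R.T @ (U * z)
    U -= lr * gU
    V -= lr * gV

print("relative reconstruction error:", np.linalg.norm(U @ V.T - X) / np.linalg.norm(X))
```

The dropout noise keeps the iterates from fitting X exactly; the Cavazza et al. and Mianjy et al. papers above characterize the regularizer that this masking induces in expectation.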
- Approximation Theory
- 11/16: Neural Networks as Universal Approximators (a numerical sketch follows the reading list below)
- 11/16: Deep versus Shallow Networks
- 11/16: Analysis of Approximation Error Based on Wavelets and Sparsity
- Reading
- 11/09: G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303-314, 1989.
- 11/09: K. Hornik, M. Stinchcombe and H. White. Multilayer feedforward networks are universal approximators, Neural Networks, 2(3), 359-366, 1989.
- 11/09: K. Hornik. Approximation Capabilities of Multilayer Feedforward Networks, Neural Networks, 4(2), 251-257, 1991.
- 11/16: G. Montúfar, R. Pascanu, K. Cho, Y. Bengio. On the number of linear regions of deep neural networks. NIPS, 2014.
- 11/16: H. Mhaskar and T. Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 2016.
- 11/16: M. Telgarsky. Benefits of depth in neural networks. COLT 2016.
- 11/16: H. Bölcskei, P. Grohs, G. Kutyniok, P. Petersen. Memory-optimal neural network approximation. Wavelets and Sparsity, 2017.
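As a numerical companion to the universal-approximation results of Cybenko and Hornik et al., here is a minimal NumPy sketch (an illustration under my own choices, not code from the papers) that approximates a continuous function on [0, 1] by a superposition of sigmoids. To keep it short, the hidden weights and biases are drawn at random and fixed, and only the output coefficients are fit, by least squares; the target function, width, and weight ranges are arbitrary.

```python
import numpy as np

# Minimal sketch (illustrative): approximate a continuous function on [0, 1]
# by a superposition of sigmoids, sum_i c_i * sigmoid(w_i * x + b_i), in the
# spirit of Cybenko (1989). The hidden weights/biases are random and fixed;
# only the output coefficients c are fit, by least squares.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def target(x):                       # arbitrary continuous target function
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)

n_hidden = 200                       # number of sigmoid units (arbitrary)
w = rng.uniform(-30, 30, size=n_hidden)
b = rng.uniform(-30, 30, size=n_hidden)

x = np.linspace(0.0, 1.0, 500)
H = sigmoid(np.outer(x, w) + b)      # hidden activations, shape (500, n_hidden)
c, *_ = np.linalg.lstsq(H, target(x), rcond=None)  # fit the output layer

print("max approximation error:", np.max(np.abs(H @ c - target(x))))
```

With a few hundred random sigmoid units the fit is typically accurate to a small fraction of the target's amplitude; the theorems above say that, when the hidden weights can also be chosen freely, the error can be driven below any prescribed tolerance.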
- Generalization Theory
- 11/30: VC Dimension of Neural Networks
- 11/30: Path-SGD (a sketch of the path norm follows the reading list below)
- 12/08: Information Bottleneck
- 12/08: Information Dropout
- Reading
- 11/30: E. Sontag. VC Dimension of Neural Networks. Neural Networks and Machine Learning, 1998.
- 11/30: P. Bartlett, W. Maass. Vapnik-Chervonenkis dimension of neural nets. The handbook of brain theory and neural networks, 2003.
- 11/30: B. Neyshabur, R. Salakhutdinov, N. Srebro. Path-SGD: Path-Normalized Optimization in Deep Neural Networks. NIPS, 2015.
- 12/03: R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv:1703.00810, 2017.
- 12/03: A. Achille and S. Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- 12/03: T. Liang, T. Poggio, A. Rakhlin, J. Stokes. Fisher-Rao Metric, Geometry and Complexity of Neural Networks. arXiv:1711.01530, 2017.
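For concreteness, here is a minimal NumPy sketch (illustrative, not the authors' code) of the squared path norm on which the path regularizer in Path-SGD is based: for a bias-free, fully connected ReLU network it is the sum, over all input-to-output paths, of the product of squared weights along the path, which can be computed as a product of elementwise-squared weight matrices.

```python
import numpy as np

# Minimal sketch (illustrative): the squared path norm of a bias-free,
# fully connected ReLU network. For weight matrices W1, ..., WL it equals
# the sum over all input-to-output paths of the product of squared weights
# along the path, i.e. the sum of the entries of (WL**2) @ ... @ (W1**2).
rng = np.random.default_rng(0)

def squared_path_norm(weights):
    """weights: list of weight matrices, first layer first."""
    acc = weights[0] ** 2
    for W in weights[1:]:
        acc = (W ** 2) @ acc     # accumulate path products layer by layer
    return acc.sum()             # sum over all input-output pairs, i.e. all paths

# Arbitrary layer sizes: 10 inputs, hidden layers of width 20 and 15, 3 outputs.
W1 = rng.standard_normal((20, 10))
W2 = rng.standard_normal((15, 20))
W3 = rng.standard_normal((3, 15))

print("squared path norm:", squared_path_norm([W1, W2, W3]))
```

Path-SGD itself goes further and rescales each weight's gradient step using this path-wise quantity, making the update insensitive to the node-wise rescalings under which a ReLU network's function is unchanged; the sketch only computes the norm.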
Slides
10/26/18: Introduction + Optimization Theory
11/02/18: Optimization Theory: Geometry
11/09/18: Optimization Theory: Algorithms
11/16/18: Approximation Theory
11/30/18: Generalization Theory
12/07/18: Generalization Theory
Grading
- Reading (30%): Read the assigned papers as indicated above and submit a one-page critique (strengths and weaknesses) one week after the date the reading is assigned.
- Project (70%): There will be a final project, to be done either individually or in teams of up to three students. Presentations will take place on the scheduled exam day, Saturday, December 15th, 10:00 AM - 1:00 PM. Please submit answers to the questions described here.
Honor Policy
The strength of the university depends on academic and personal integrity. In this course, you must be honest and truthful. Ethical violations include cheating on exams, plagiarism, reuse of assignments, improper use of the Internet and electronic devices, unauthorized collaboration, alteration of graded assignments, forgery and falsification, lying, facilitating academic dishonesty, and unfair competition.