Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limit of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions.
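As a concrete illustration of the setting (our own toy sketch, not the authors' code), the snippet below trains a wide two-layer ReLU network, which is 2-homogeneous in its parameters, by gradient descent on the logistic loss over a hypothetical linearly separable dataset. The printed quantity is a normalized margin; under the max-margin characterization it should grow during training and then stabilize. The width, step size, and data are arbitrary choices.

```python
# Toy sketch: wide two-layer ReLU network (mean-field 1/m scaling) trained by
# gradient descent on the logistic loss. All choices below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 40, 2, 1000                  # samples, input dim, (large) width
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] + 0.3 * X[:, 1])   # linearly separable toy labels
A = rng.standard_normal((m, d))        # hidden-layer weights
b = rng.standard_normal(m)             # output weights
lr = 0.1 * m                           # step size scaled with width

for step in range(10000):
    H = np.maximum(X @ A.T, 0.0)       # hidden activations, shape (n, m)
    margins = y * (H @ b / m)          # y_i * f(x_i)
    s = -y / (1.0 + np.exp(margins))   # d(logistic loss)/d f
    grad_b = H.T @ s / (n * m)
    grad_A = ((s[:, None] * (H > 0)) * b).T @ X / (n * m)
    b -= lr * grad_b
    A -= lr * grad_A
    if step % 2000 == 0:
        # margin normalized by the squared parameter norm (f is 2-homogeneous)
        print(step, margins.min() / (np.linalg.norm(A) ** 2 + np.linalg.norm(b) ** 2))
```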
65K10 ; 65K05 ; 68W99 ; 68T99
Machine learning pipelines often rely on optimization procedures to make discrete decisions (e.g. sorting, picking closest neighbors, finding shortest paths or optimal matchings). Although these discrete decisions are easily computed in a forward manner, they cannot be used to modify model parameters using first-order optimization techniques because they break the back-propagation of computational graphs. In order to expand the scope of learning problems that can be solved in an end-to-end fashion, we propose a systematic method to transform a block that outputs an optimal discrete decision into a differentiable operation. Our approach relies on stochastic perturbations of these parameters, and can be used readily within existing solvers without the need for ad hoc regularization or smoothing. These perturbed optimizers yield solutions that are differentiable and never locally constant. The amount of smoothness can be tuned via the chosen noise amplitude, whose impact we analyze. The derivatives of these perturbed solvers can be evaluated efficiently. We also show how this framework can be connected to a family of losses developed in structured prediction, and describe how these can be used in unsupervised and supervised learning, with theoretical guarantees.
We demonstrate the performance of our approach on several machine learning tasks in experiments on synthetic and real data.
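To make the idea concrete, here is a minimal sketch (our own construction, not the paper's released code) for the simplest discrete solver, an argmax returning a one-hot vector: the perturbed optimizer is the expectation of the argmax under Gaussian perturbations of the input, estimated by Monte Carlo, and its Jacobian is estimated with the identity J = E[y*(theta + eps Z) Z^T] / eps, which holds for Gaussian noise. Function names and parameters are ours.

```python
# Minimal sketch of a perturbed optimizer: a differentiable (smoothed) argmax.
import numpy as np

rng = np.random.default_rng(0)

def one_hot_argmax(theta):
    e = np.zeros_like(theta)
    e[np.argmax(theta)] = 1.0
    return e

def perturbed_argmax(theta, eps=0.5, n_samples=1000):
    """Monte Carlo estimate of E[argmax(theta + eps * Z)], Z ~ N(0, I).

    Also returns an estimate of the Jacobian via the Gaussian-noise
    identity J = E[argmax(theta + eps Z) Z^T] / eps. Unlike the hard
    argmax, the result is differentiable and never locally constant."""
    d = theta.shape[0]
    y_bar = np.zeros(d)
    jac = np.zeros((d, d))
    for _ in range(n_samples):
        z = rng.standard_normal(d)
        y = one_hot_argmax(theta + eps * z)
        y_bar += y
        jac += np.outer(y, z)
    return y_bar / n_samples, jac / (eps * n_samples)

theta = np.array([1.0, 1.2, -0.5])
y_soft, J = perturbed_argmax(theta)
print(y_soft)   # smoothed solution; the hard argmax would be [0, 1, 0]
print(J)        # nonzero Jacobian usable for back-propagation
```

Increasing eps makes the output smoother (closer to uniform weights) at the price of a larger bias; decreasing it recovers the hard argmax but inflates the variance of the Jacobian estimate.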
90C06 ; 68W20 ; 62F99
We are interested in nonsmooth analysis of backpropagation as implemented in modern machine learning libraries, such as TensorFlow or PyTorch. First I will illustrate how a blind application of differential calculus to nonsmooth objects can be problematic, requiring a proper mathematical model. Then I will introduce a weak notion of generalized derivative, named conservativity, and illustrate how it complies with calculus and optimization for well-structured objects. We provide stability results for empirical risk minimization, similar to those in the smooth setting, for the combination of nonsmooth automatic differentiation, minibatch stochastic approximation and first-order optimization. This is joint work with Jérôme Bolte.
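A standard two-line example (ours, not taken from the talk) shows the problem: the program below computes the identity function, whose derivative is 1 everywhere, yet reverse-mode autodiff in PyTorch combines the convention relu'(0) = 0 through the chain rule and returns 0 at the origin.

```python
# The function below equals x for every real x, so its derivative is 1
# everywhere; autodiff instead applies the chain rule with relu'(0) = 0.
import torch

x = torch.tensor(0.0, requires_grad=True)
y = torch.relu(x) - torch.relu(-x)    # identically equal to x
y.backward()
print(y.item(), x.grad.item())        # prints 0.0 0.0, not the derivative 1.0
```

Conservativity accounts for such outputs: the object computed by backpropagation is not a gradient but a conservative field, which obeys a chain rule along curves and coincides with the true gradient almost everywhere, enough to recover smooth-like guarantees for first-order optimization.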
65K05 ; 65K10 ; 68T99
The Burer-Monteiro factorization is a classical heuristic used to speed up the solution of large-scale semidefinite programs when the solution is expected to be low rank: one writes the solution as the product of thinner matrices, and optimizes over the (low-dimensional) factors instead of over the full matrix. Even though the factorized problem is non-convex, one observes that standard first-order algorithms can often solve it to global optimality. This has been rigorously proved by Boumal, Voroninski and Bandeira, but only under the assumption that the factorization rank is large enough, larger than what numerical experiments suggest is necessary. We will describe this result, and investigate its optimality. More specifically, we will show that, up to a minor improvement, it is optimal: without additional hypotheses on the semidefinite problem at hand, first-order algorithms can fail if the factorization rank is smaller than predicted by current theory.
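For concreteness, here is a small numpy sketch (our construction, not from the talk) of the heuristic on the Max-Cut-type SDP max <C, X> subject to X PSD and diag(X) = 1. The constraint diag(Y Y^T) = 1 means each row of Y lies on a unit sphere, so a natural first-order method is Riemannian gradient ascent on the rows. The rank p of order sqrt(2n) used below matches the regime where the known guarantees apply; the problem size and step size are arbitrary.

```python
# Burer-Monteiro sketch: max <C, Y Y^T> over Y in R^{n x p} with unit-norm
# rows, via Riemannian gradient ascent (small fixed step; a line search
# would be more robust).
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 8                          # p chosen around sqrt(2n)
C = rng.standard_normal((n, n))
C = (C + C.T) / 2                     # symmetric cost matrix

Y = rng.standard_normal((n, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

lr = 0.005
for _ in range(3000):
    G = 2 * C @ Y                     # Euclidean gradient of <C, Y Y^T>
    # project each row's gradient onto the tangent space of its sphere
    G -= np.sum(G * Y, axis=1, keepdims=True) * Y
    Y += lr * G
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # retraction

X = Y @ Y.T
print("objective:", np.trace(C @ X), "diag ok:", np.allclose(np.diag(X), 1))
```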
90C26 ; 90C22 ; 42C40
In high dimensional sparse regression, pivotal estimators are estimators for which the optimal regularization parameter is independent of the noise level. The canonical pivotal estimator is the square-root Lasso, formulated along with its derivatives as a "non-smooth + non-smooth" optimization problem.
Modern techniques to solve these include smoothing the data-fitting term, to benefit from fast and efficient proximal algorithms.
In this work we focus on minimax sup-norm convergence rates for non-smoothed and smoothed, single-task and multitask square-root Lasso-type estimators. We also provide some guidelines on how to set the smoothing hyperparameter, and illustrate the interest of such guidelines on synthetic data.
This is joint work with Quentin Bertrand (INRIA), Mathurin Massias, Olivier Fercoq and Alexandre Gramfort.
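As a simplified illustration of the smoothing strategy (our sketch, not the authors' solver), one can replace the data-fitting norm of the square-root Lasso by its Moreau envelope, a Huber-like function with smoothing parameter sigma, and minimize the result by proximal gradient descent, handling the l1 term with soft-thresholding. The regularization level below uses the pivotal scaling sqrt(2 log d / n), which does not involve the noise level; sigma and the iteration count are arbitrary choices.

```python
# Smoothed square-root Lasso:  min_w ||Xw - y||_2 / sqrt(n) + lam * ||w||_1,
# with the data-fitting norm replaced by its Moreau envelope (gradient
# r / max(sigma, ||r||)), solved by ISTA with soft-thresholding.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 200
X = rng.standard_normal((n, d))
w_true = np.zeros(d); w_true[:5] = 1.0
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = np.sqrt(2 * np.log(d) / n)      # pivotal choice: noise-level free
sigma = 0.1                           # smoothing hyperparameter
L = np.linalg.norm(X, 2) ** 2 / (np.sqrt(n) * sigma)   # gradient Lipschitz bound
step = 1.0 / L

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.zeros(d)
for _ in range(10000):
    r = X @ w - y
    grad = X.T @ r / (np.sqrt(n) * max(sigma, np.linalg.norm(r)))
    w = soft_threshold(w - step * grad, step * lam)

# nonzero coefficients should concentrate on the true support {0, ..., 4}
print("support found:", np.flatnonzero(np.abs(w) > 1e-3))
```

Note the trade-off that motivates guidelines for sigma: a smaller sigma brings the smoothed problem closer to the pivotal estimator, but inflates the Lipschitz constant and hence shrinks the admissible step size.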
62J05 ; 62J12 ; 62P10