Mikhail (Misha) Belkin

Professor
Halicioglu Data Science Institute

Computer Science and Engineering (affiliated)
University of California San Diego

email: mbelkin at ucsd.edu

Amazon Scholar

 

[Papers in Google Scholar] [Selected/recent papers] [Selected Talks]  [Blog]  [Bio]  [CV]  [Code]

Thoughts on some implications of deep learning and LLMs. We live in an interesting time.

Recent activities and collaborations.


Research interests and directions.

I am interested in questions concerning computation, statistics and optimization in Machine Learning, particularly for high-dimensional data. Recently much of my research has focused on the fundamental understanding of modern ML and deep learning, particularly interpolation, overparameterization and feature learning.

In the past I have worked on a range of topics including manifold and semi-supervised learning, introducing Laplacian Eigenmaps, a method for dimensionality reduction, data representation and visualization based on the geometry of the heat equation, as well as Graph Regularization and Manifold Regularization for semi-supervised learning. Other work includes spectral clustering, learning Gaussian mixture models, Stochastic Block Models and generalizations of Independent Component Analysis, among others.
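
As a small illustration of the Laplacian Eigenmaps construction mentioned above, here is a minimal sketch (Python with NumPy/SciPy; the function name, the Gaussian-weighted k-nearest-neighbor graph, and all parameter values are illustrative choices, not taken from any released code): build a neighborhood graph with heat-kernel weights, form the graph Laplacian, and embed the data using the bottom non-constant generalized eigenvectors.

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.linalg import eigh

    def laplacian_eigenmaps(X, n_components=2, n_neighbors=10, sigma=1.0):
        """Embed points X (n x d) into n_components dimensions (illustrative sketch)."""
        n = X.shape[0]
        D2 = cdist(X, X, "sqeuclidean")
        W = np.exp(-D2 / (2 * sigma ** 2))              # heat-kernel (Gaussian) weights
        # Keep only a symmetrized k-nearest-neighbor graph.
        idx = np.argsort(D2, axis=1)[:, 1:n_neighbors + 1]
        mask = np.zeros_like(W, dtype=bool)
        mask[np.repeat(np.arange(n), n_neighbors), idx.ravel()] = True
        mask |= mask.T
        W = np.where(mask, W, 0.0)
        D = np.diag(W.sum(axis=1))                      # degree matrix
        L = D - W                                       # graph Laplacian
        # Generalized eigenproblem L v = lambda D v; skip the constant eigenvector.
        vals, vecs = eigh(L, D)
        return vecs[:, 1:n_components + 1]

    # Example: a noisy circle embedded into two dimensions.
    t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
    X = np.c_[np.cos(t), np.sin(t)] + 0.05 * np.random.randn(200, 2)
    print(laplacian_eigenmaps(X).shape)                 # (200, 2)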

In recent years the practice of deep learning has presented statistical learning theory with a number of foundational challenges, necessitating a rethinking of some of the basics of the subject. I am particularly interested in the following key challenges of modern machine learning:

  • The first is the question of generalization. Why do classifiers with zero training error (we call this interpolation) still perform well on the test set, even for very noisy data, in contradiction to the accepted statistical wisdom on overfitting?
  • The second challenge is understanding optimization. Why do methods such as Stochastic Gradient Descent (SGD) perform so well for modern non-convex deep architectures?
  • The third question is how deep learning systems learn features in data differently from classical methods, such as kernel machines. This question is closely related to generalization and the ability of modern Neural Networks to overcome the curse of dimensionality.
  • Finally, how does the combination of the first three result in human-level AI in modern LLMs?
We now have satisfactory answers to the first two questions and some partial answers to the third. The last one is still mysterious.

Generalization. While much theoretical and empirical work remains to be done, the outline of the new generalization theory is becoming clear. The classical analyses (and the resulting practical recommendations for model selection) rely on bounding the difference between the training loss and the test loss using tools from empirical process theory, such as VC or margin bounds. However, in the modern interpolating setting the empirical loss is identically zero and yields no information about the expected loss, making the classical bounds uninformative (except in the special case of zero test loss). Yet, as we show, using a different type of analysis, close to that used for classical nearest neighbor methods, interpolating methods can indeed be statistically optimal or near-optimal.
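
Schematically, in generic notation used here only for illustration (not a verbatim statement from any particular paper), a classical uniform bound has the form

    % test loss bounded by training loss plus a capacity term:
    \[
      L(\hat f) \;\le\; \hat L_n(\hat f) \;+\; C\,\sqrt{\frac{\operatorname{cap}(\mathcal{F})}{n}} .
    \]
    % For an interpolating predictor the empirical term vanishes, \hat L_n(\hat f) = 0,
    % so the bound reduces to the capacity term alone; for model classes rich enough
    % to interpolate noisy data that term typically exceeds the noise level, which is
    % why such bounds become uninformative in the interpolating regime.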

But where do the classical notions of complexity fit into this picture? Our recent work shows how interpolation and classical complexity control can be reconciled. The classical U-shaped bias-variance trade-off and the interpolating regime form two distinct parts of a single double descent risk curve. The "classical" part of the curve is controlled by the usual complexity-based bounds. To understand the "modern" part we need to analyze algorithms that maximize or ensure functional smoothness subject to the interpolation constraints, a form of Occam's razor. See our recent paper for a detailed mathematical analysis of double descent for two simple linear regression models.
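
The double descent shape is easy to reproduce numerically. Below is a self-contained sketch (a toy setup chosen purely for illustration, not the exact models analyzed in the paper above): minimum-norm least squares on random ReLU features, with the test error tracked as the number of features grows past the number of training points.

    import numpy as np

    rng = np.random.default_rng(0)

    def risk_min_norm(n_features, n_train=40, n_test=2000, d=20, noise=0.5, trials=50):
        """Average test MSE of the minimum-norm least-squares fit on random ReLU features."""
        errs = []
        for _ in range(trials):
            w_true = rng.normal(size=d) / np.sqrt(d)
            V = rng.normal(size=(d, n_features)) / np.sqrt(d)     # random feature directions
            Xtr, Xte = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
            ytr = Xtr @ w_true + noise * rng.normal(size=n_train)
            yte = Xte @ w_true
            Ftr, Fte = np.maximum(Xtr @ V, 0.0), np.maximum(Xte @ V, 0.0)
            beta = np.linalg.pinv(Ftr) @ ytr                      # minimum-norm solution
            errs.append(np.mean((Fte @ beta - yte) ** 2))
        return np.mean(errs)

    # The test risk typically peaks near n_features == n_train (the interpolation
    # threshold) and decreases again deep in the over-parameterized regime.
    for p in [5, 10, 20, 40, 80, 160, 320]:
        print(f"features={p:4d}  test MSE={risk_min_norm(p):.3f}")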

Optimization. Once we accept the counter-intuitive premise that interpolation plays nicely with generalization, the question of optimization becomes easier to address. Indeed, the properties of interpolating systems appear radically different from those in "classical" regimes. Our recent paper develops a general theory for optimization of large systems (continuing a line of recent advances in the literature). In particular, we connect the existence of global minima for loss functions of large systems and exponential convergence of GD/SGD to a certain non-linear condition number of the corresponding tangent kernel, which is tied to a version of the Polyak-Lojasiewicz condition that we call PL*. In general, large systems, including various neural networks, are well-conditioned and thus satisfy the PL* condition; hence existence of global minima and convergence of gradient descent can be established. Our analysis also provides a new perspective on the remarkable recent finding [Jacot, Gabriel, Hongler, 18] that neural tangent kernels (NTK) of certain large neural networks are approximately constant. We show that the underlying reason for the constant NTK is a transition to linearity, where certain large non-linear systems become approximately linear in a fixed-size neighborhood of initialization. This happens due to certain structural properties of the Hessian. When these properties are not satisfied (e.g., for a network with a non-linear output layer), the NTK is not constant, even for infinitely large systems, yet convergence to a global minimum can still be shown.
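
For reference, the PL* condition and its consequence for gradient descent can be summarized as follows (generic notation; an illustrative paraphrase rather than a verbatim statement of the results above):

    % PL* condition on a set S, for a non-negative loss L(w):
    \[
      \tfrac{1}{2}\,\big\|\nabla L(\mathbf{w})\big\|^{2} \;\ge\; \mu\, L(\mathbf{w})
      \qquad \text{for all } \mathbf{w} \in S, \quad \mu > 0 .
    \]
    % For the square loss L(w) = (1/2) ||F(w) - y||^2 of a system F, this holds with
    % mu equal to the smallest eigenvalue of the tangent kernel K(w) = DF(w) DF(w)^T,
    % which is why well-conditioned (e.g., sufficiently wide) systems satisfy PL*.
    % Under PL* and smoothness, gradient descent with a small enough step size eta
    % converges to a global minimum at an exponential rate:
    \[
      L(\mathbf{w}_t) \;\le\; (1 - \eta\,\mu)^{t}\, L(\mathbf{w}_0) .
    \]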

Feature learning. Our recent work shows that feature learning in fully connected networks can be associated with the Average Gradient Outer Product (AGOP), a mathematical object representing the task-dependent importance of directions in the input space. We postulate the Deep Neural Feature Ansatz, which connects the weight matrices of neural networks to the AGOP. This idea can be incorporated directly into classical kernel machines, resulting in a new algorithm which we call the Recursive Feature Machine. Similar ideas apply to convolutional neural networks.
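
A minimal sketch of this idea follows (illustrative only: the function names, the Gaussian kernel with a learned metric, and the trace normalization are simplifications, not the released RFM implementation). The loop alternates between kernel ridge regression and re-weighting the input metric by the AGOP of the current predictor.

    import numpy as np

    def gauss_kernel(X, Z, M, L=2.0):
        """Gaussian kernel with metric M: k(x, z) = exp(-(x - z)^T M (x - z) / (2 L^2))."""
        XMX = np.einsum("id,de,ie->i", X, M, X)
        ZMZ = np.einsum("id,de,ie->i", Z, M, Z)
        d2 = XMX[:, None] + ZMZ[None, :] - 2 * X @ M @ Z.T
        return np.exp(-np.maximum(d2, 0.0) / (2 * L ** 2))

    def rfm_fit(X, y, iters=5, reg=1e-3, L=2.0):
        """Illustrative RFM-style loop: kernel ridge regression alternated with AGOP reweighting."""
        n, d = X.shape
        M = np.eye(d)
        for _ in range(iters):
            K = gauss_kernel(X, X, M, L)
            alpha = np.linalg.solve(K + reg * np.eye(n), y)       # kernel ridge regression
            # Gradients of f(x) = sum_i alpha_i k(x, x_i), evaluated at the training points.
            diffs = X[:, None, :] - X[None, :, :]                 # (n, n, d): x_j - x_i
            grads = -np.einsum("ji,jid->jd", K * alpha[None, :], diffs @ M) / L ** 2
            M = grads.T @ grads / n                               # AGOP becomes the new metric
            M = M * d / np.trace(M)                               # keep the overall scale fixed
        return M, alpha

    # Toy example: the target depends only on the first two of ten coordinates;
    # the learned metric M typically concentrates on that relevant subspace.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 10))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2
    M, _ = rfm_fit(X, y)
    print(np.round(np.diag(M), 2))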

Selected and recent papers. A complete publication list is in Google Scholar.
  • Mechanism for feature learning in neural networks and backpropagation-free machine learning models [Science]
    Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, Mikhail Belkin, Science, 2024.
    + abstract
    Understanding how neural networks learn features, or relevant patterns in data, for prediction is necessary for their reliable use in technological and scientific applications. In this work, we presented a unifying mathematical mechanism, known as Average Gradient Outer Product (AGOP), that characterized feature learning in neural networks. We provided empirical evidence that AGOP captured features learned by various neural network architectures, including transformer-based language models, convolutional networks, multi-layer perceptrons, and recurrent neural networks. Moreover, we demonstrated that AGOP, which is backpropagation-free, enabled feature learning in machine learning models, such as kernel machines, that apriori could not identify task-specific features. Overall, we established a fundamental mechanism that captured feature learning in neural networks and enabled feature learning in general machine learning models.
  • Mechanism of feature learning in convolutional neural networks [arxiv]
    Daniel Beaglehole, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin
    + abstract
    Understanding the mechanism of how convolutional neural networks learn features from image data is a fundamental problem in machine learning and computer vision. In this work, we identify such a mechanism. We posit the Convolutional Neural Feature Ansatz, which states that covariances of filters in any convolutional layer are proportional to the average gradient outer product (AGOP) taken with respect to patches of the input to that layer. We present extensive empirical evidence for our ansatz, including identifying high correlation between covariances of filters and patch-based AGOPs for convolutional layers in standard neural architectures, such as AlexNet, VGG, and ResNets pre-trained on ImageNet. We also provide supporting theoretical evidence. We then demonstrate the generality of our result by using the patch-based AGOP to enable deep feature learning in convolutional kernel machines. We refer to the resulting algorithm as (Deep) ConvRFM and show that our algorithm recovers similar features to deep convolutional networks including the notable emergence of edge detectors. Moreover, we find that Deep ConvRFM overcomes previously identified limitations of convolutional kernels, such as their inability to adapt to local signals in images and, as a result, leads to sizable performance improvement over fixed convolutional kernels.
  • Linear Recursive Feature Machines provably recover low-rank matrices [arxiv]
    Adityanarayanan Radhakrishnan, Mikhail Belkin, Dmitriy Drusvyatskiy
    + abstract
    A fundamental problem in machine learning is to understand how neural networks make accurate predictions, while seemingly bypassing the curse of dimensionality. A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction - a process called feature learning. Recent work posited that the effects of feature learning can be elicited from a classical statistical estimator called the average gradient outer product (AGOP). The authors proposed Recursive Feature Machines (RFMs) as an algorithm that explicitly performs feature learning by alternating between (1) reweighting the feature vectors by the AGOP and (2) learning the prediction function in the transformed space. In this work, we develop the first theoretical guarantees for how RFM performs dimensionality reduction by focusing on the class of overparametrized problems arising in sparse linear regression and low-rank matrix recovery. Specifically, we show that RFM restricted to linear models (lin-RFM) generalizes the well-studied Iteratively Reweighted Least Squares (IRLS) algorithm. Our results shed light on the connection between feature learning in neural networks and classical sparse recovery algorithms. In addition, we provide an implementation of lin-RFM that scales to matrices with millions of missing entries. Our implementation is faster than the standard IRLS algorithm as it is SVD-free. It also outperforms deep linear networks for sparse linear regression and low-rank matrix completion.
  • Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation [Acta Numerica, arxiv]
    Mikhail Belkin, Acta Numerica, Volume 30, May 2021, pp. 203-248.
    + abstract
    In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep learning. The two key themes will be interpolation, and its sibling, over-parameterization. Interpolation corresponds to fitting data, even noisy data, exactly. Over-parameterization enables interpolation and provides flexibility to select a right interpolating model. As we will see, just as a physical prism separates colors mixed within a ray of light, the figurative prism of interpolation helps to disentangle generalization and optimization properties within the complex picture of modern Machine Learning. This article is written with belief and hope that clearer understanding of these issues brings us a step closer toward a general theory of deep learning and machine learning.
  • Reconciling modern machine learning practice and the bias-variance trade-off [PNAS, arxiv]
    Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal, PNAS, 2019, 116 (32).
    + abstract
    The question of generalization in machine learning---how algorithms are able to learn predictors from a training sample to make accurate predictions out-of-sample---is revisited in light of the recent breakthroughs in modern machine learning technology. The classical approach to understanding generalization is based on bias-variance trade-offs, where model complexity is carefully calibrated so that the fit on the training sample reflects performance out-of-sample. However, it is now common practice to fit highly complex models like deep neural networks to data with (nearly) zero training error, and yet these interpolating predictors are observed to have good out-of-sample accuracy even for noisy data. How can the classical understanding of generalization be reconciled with these observations from modern machine learning practice? In this paper, we bridge the two regimes by exhibiting a new "double descent" risk curve that extends the traditional U-shaped bias-variance curve beyond the point of interpolation. Specifically, the curve shows that as soon as the model complexity is high enough to achieve interpolation on the training sample---a point that we call the "interpolation threshold"---the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models. The double descent risk curve is demonstrated for a broad range of models, including neural networks and random forests, and a mechanism for producing this behavior is posited.
  • Loss landscapes and optimization in over-parameterized non-linear systems and neural networks [ACHA, arxiv]
    Chaoyue Liu, Libin Zhu, Mikhail Belkin, Applied and Computational Harmonic Analysis (ACHA), 2022, 59.
    + abstract
    The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that includes over-parameterized deep neural networks. Our starting observation is that optimization problems corresponding to such systems are generally not convex, even locally. We argue that instead they satisfy PL*, a variant of the Polyak-Lojasiewicz condition on most (but not all) of the parameter space, which guarantees both the existence of solutions and efficient optimization by (stochastic) gradient descent (SGD/GD). The PL* condition of these systems is closely related to the condition number of the tangent kernel associated to a non-linear system showing how a PL*-based non-linear theory parallels classical analyses of over-parameterized linear equations. We show that wide neural networks satisfy the PL* condition, which explains the (S)GD convergence to a global minimum. Finally we propose a relaxation of the PL* condition applicable to "almost" over-parameterized systems.
  • Wide and deep neural networks achieve consistency for classification [PNAS, arxiv]
    Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler, PNAS, 2023, Vol. 120, No. 14.
    + abstract
    While neural networks are used for classification tasks across domains, a long-standing open problem in machine learning is determining whether neural networks trained using standard procedures are consistent for classification, i.e., whether such models minimize the probability of misclassification for arbitrary data distributions. In this work, we identify and construct an explicit set of neural network classifiers that are consistent. Since effective neural networks in practice are typically both wide and deep, we analyze infinitely wide networks that are also infinitely deep. In particular, using the recent connection between infinitely wide neural networks and neural tangent kernels, we provide explicit activation functions that can be used to construct networks that achieve consistency. Interestingly, these activation functions are simple and easy to implement, yet differ from commonly used activations such as ReLU or sigmoid. More generally, we create a taxonomy of infinitely wide and deep networks and show that these models implement one of three well-known classifiers depending on the activation function used: 1) 1-nearest neighbor (model predictions are given by the label of the nearest training example); 2) majority vote (model predictions are given by the label of the class with the greatest representation in the training set); or 3) singular kernel classifiers (a set of classifiers containing those that achieve consistency). Our results highlight the benefit of using deep networks for classification tasks, in contrast to regression tasks, where excessive depth is harmful.
  • Simple, fast, and flexible framework for matrix completion with infinite width neural networks [PNAS]
    Adityanarayanan Radhakrishnan, George Stefanakis, Mikhail Belkin, Caroline Uhler, PNAS, 2022, 119 (16).
    + abstract
    Matrix completion is a fundamental problem in machine learning that arises in various applications. We envision that our infinite width neural network framework for matrix completion will be easily deployable and produce strong baselines for a wide range of applications at limited computational costs. We demonstrate the flexibility of our framework through competitive results on virtual drug screening and image inpainting/reconstruction. Simplicity and speed are showcased by the fact that most results in this work require only a central processing unit and commodity hardware. Through its connection to semisupervised learning, our framework provides a principled approach for matrix completion that can be easily applied to problems well beyond those of image completion and virtual drug screening considered in this paper. Matrix completion problems arise in many applications including recommendation systems, computer vision, and genomics. Increasingly larger neural networks have been successful in many of these applications but at considerable computational costs. Remarkably, taking the width of a neural network to infinity allows for improved computational performance. In this work, we develop an infinite width neural network framework for matrix completion that is simple, fast, and flexible. Simplicity and speed come from the connection between the infinite width limit of neural networks and kernels known as neural tangent kernels (NTK). In particular, we derive the NTK for fully connected and convolutional neural networks for matrix completion. The flexibility stems from a feature prior, which allows encoding relationships between coordinates of the target matrix, akin to semisupervised learning. The effectiveness of our framework is demonstrated through competitive results for virtual drug screening and image inpainting/reconstruction. We also provide an implementation in Python to make our framework accessible on standard hardware to a broad audience.
  • On the linearity of large non-linear models: when and why the tangent kernel is constant [arxiv]
    Chaoyue Liu, Libin Zhu, Mikhail Belkin, NeurIPS 2020.
    + abstract
    The goal of this work is to shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We present a general framework for understanding the constancy of the tangent kernel via Hessian scaling applicable to the standard classes of neural networks. Our analysis provides a new perspective on the phenomenon of constant tangent kernel, which is different from the widely accepted "lazy training". Furthermore, we show that the transition to linearity is not a general property of wide neural networks and does not hold when the last layer of the network is non-linear. It is also not necessary for successful optimization by gradient descent.
  • A Universal Trade-off Between the Model Size, Test Loss, and Training Loss of Linear Predictors [arxiv]
    Nikhil Ghosh, Mikhail Belkin, SIAM Journal on Mathematics of Data Science, 5(4), 2023.
    + abstract
    In this work we establish an algorithm and distribution independent non-asymptotic trade-off between the model size, excess test loss, and training loss of linear predictors. Specifically, we show that models that perform well on the test data (have low excess loss) are either "classical" -- have training loss close to the noise level, or are "modern" -- have a much larger number of parameters compared to the minimum needed to fit the training data exactly. We also provide a more precise asymptotic analysis when the limiting spectral distribution of the whitened features is Marchenko-Pastur. Remarkably, while the Marchenko-Pastur analysis is far more precise near the interpolation peak, where the number of parameters is just enough to fit the training data, in settings of most practical interest it differs from the distribution independent bound by only a modest multiplicative constant.
  • On the Inconsistency of Kernel Ridgeless Regression in Fixed Dimensions [arxiv]
    Daniel Beaglehole, Mikhail Belkin, Parthe Pandit, SIAM Journal on Mathematics of Data Science, 5(4), 2023.
    + abstract
    “Benign overfitting,” the ability of certain algorithms to interpolate noisy training data and yet perform well out-of-sample, has been a topic of considerable recent interest. We show, using a fixed design setup, that an important class of predictors, kernel machines with translation-invariant kernels, does not exhibit benign overfitting in fixed dimensions. In particular, the estimated predictor does not converge to the ground truth with increasing sample size, for any nonzero regression function and any (even adaptive) bandwidth selection. To prove these results, we give exact expressions for the generalization error and its decomposition in terms of an approximation error and an estimation error that elicits a trade-off based on the selection of the kernel bandwidth. Our results apply to commonly used translation-invariant kernels such as Gaussian, Laplace, and Cauchy.
  • Overparameterized Neural Networks Implement Associative Memory [PNAS]
    Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler, PNAS, 2020, 117 (44).
    + abstract
    Memorization of data in deep neural networks has become a subject of significant research interest. We prove that over-parameterized single layer fully connected autoencoders memorize training data: they produce outputs in (a non-linear version of) the span of the training examples. In contrast to fully connected autoencoders, we prove that depth is necessary for memorization in convolutional autoencoders. Moreover, we observe that adding nonlinearity to deep convolutional autoencoders results in a stronger form of memorization: instead of outputting points in the span of the training images, deep convolutional autoencoders tend to output individual training images. Since convolutional autoencoder components are building blocks of deep convolutional networks, we envision that our findings will shed light on the important phenomenon of memorization in over-parameterized deep networks.
  • Classification vs regression in overparameterized regimes: Does the loss function matter? [JMLR]
    Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, Anant Sahai, Journal of Machine Learning Research (JMLR), 22(222):1-69, 2021.
    + abstract
    We compare classification and regression tasks in the overparameterized linear model with Gaussian features. On the one hand, we show that with sufficient overparameterization all training points are support vectors: solutions obtained by least-squares minimum-norm interpolation, typically used for regression, are identical to those produced by the hard-margin support vector machine (SVM) that minimizes the hinge loss, typically used for training classifiers. On the other hand, we show that there exist regimes where these solutions are near-optimal when evaluated by the 0-1 test loss function, but do not generalize if evaluated by the square loss function, i.e. they achieve the null risk. Our results demonstrate the very different roles and properties of loss functions used at the training phase (optimization) and the testing phase (generalization).
  • Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks [arxiv]
    Like Hui, Mikhail Belkin, ICLR 2021.
    + abstract
    Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the dominant majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. We argue that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our experiments, performance on nearly all non-vision tasks can be improved, sometimes significantly, by switching to the square loss. We posit that training using the square loss for classification needs to be a part of best practices of modern deep learning on equal footing with cross-entropy.
  • Two models of double descent for weak features [arxiv]
    Mikhail Belkin, Daniel Hsu, Ji Xu, SIAM Journal on Mathematics of Data Science, 2(4), 1167–1180.
    + abstract
    The "double descent" risk curve was recently proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models. This article provides a precise mathematical analysis for the shape of this curve in two simple data models with the least squares/least norm predictor. Specifically, it is shown that the risk peaks when the number of features p is close to the sample size n, but also that the risk decreases towards its minimum as p increases beyond n. This behavior is contrasted with that of "prescient" models that select features in an a priori optimal order.
  • Risk bounds for over-parameterized maximum margin classification on sub-Gaussian mixtures [arxiv]
    Y. Cao, Q. Gu, M. Belkin, Neural Inf. Proc. Systems (NeurIPS) 2021.
    + abstract
    Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" (Bartlett et al. (2020)) phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.
  • Multiple Descent: Design Your Own Generalization Curve [arxiv]
    Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi, Neural Inf. Proc. Systems (NeurIPS) 2021.
    + abstract
    This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized. We show that the generalization curve can have an arbitrary number of peaks, and moreover, the locations of those peaks can be explicitly controlled. Our results highlight the fact that both the classical U-shaped generalization curve and the recently observed double descent curve are not intrinsic properties of the model family. Instead, their emergence is due to the interaction between the properties of the data and the inductive biases of learning algorithms.
  • Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate [arxiv]
    Mikhail Belkin, Daniel Hsu, Partha Mitra, Neural Inf. Proc. Systems (NeurIPS) 2018.
    + abstract
    Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for "overfitted" / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allows for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including geometric simplicial interpolation algorithm and weighted k-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems. These schemes have an inductive bias that benefits from higher dimension, a kind of "blessing of dimensionality". Finally, connections to kernel machines, random forests, and adversarial examples in the interpolated regime are discussed.
  • To understand deep learning we need to understand kernel learning [arxiv]
    Mikhail Belkin, Siyuan Ma, Soumik Mandal, ICML 2018.
    + abstract
    Generalization performance of classifiers in deep learning has recently become a subject of intense study. Deep models, which are typically heavily over-parametrized, tend to fit the training data exactly. Despite this overfitting, they perform well on test data, a phenomenon not yet fully understood. The first point of our paper is that strong performance of overfitted classifiers is not a unique feature of deep learning. Using six real-world and two synthetic datasets, we establish experimentally that kernel classifiers trained to have zero classification error (overfitting) or zero regression error (interpolation) perform very well on test data.
    We proceed to prove lower bounds on the norm of overfitted solutions for smooth kernels, showing that they increase nearly exponentially with data size. Since most generalization bounds depend polynomially on the norm of the solution, this result implies that they diverge as data increases. Furthermore, the existing bounds do not apply to interpolated classifiers.
    We also show experimentally that (non-smooth) Laplacian kernels easily fit random labels using a version of SGD, a finding that parallels results recently reported for ReLU neural networks. In contrast, as expected from theory, fitting noisy data requires many more epochs for smooth Gaussian kernels. The observation that the ultimate performance of overfitted Laplacian and Gaussian classifiers on the test set is quite similar suggests that generalization is tied to the properties of the kernel function rather than the optimization process.
    We see that some key phenomena of deep learning are manifested similarly in kernel methods in the "modern" overfitted regime. We argue that progress on understanding deep learning will be difficult until more analytically tractable "shallow" kernel methods are better understood. The combination of the experimental and theoretical results presented in this paper indicates a need for new theoretical ideas for understanding properties of classical kernel methods.
  • The power of interpolation: understanding the effectiveness of SGD in modern over-parametrized learning [arxiv]
    Siyuan Ma, Raef Bassily, Mikhail Belkin, ICML 2018.
    + abstract
    Stochastic Gradient Descent (SGD) with small mini-batch is a key component in modern large-scale machine learning. However, its efficiency has not been easy to analyze as most theoretical results require adaptive rates and show convergence rates far slower than that for gradient descent, making computational comparisons difficult.
    In this paper we aim to clarify the issue of fast SGD convergence. The key observation is that most modern architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on test data, these regimes allow for very fast convergence of SGD, comparable in the number of iterations to gradient descent.
    Specifically, consider the setting with a quadratic objective function, or near a minimum, where the quadratic term is dominant. We show that: (1) Mini-batch size 1 with constant step size is optimal in terms of computations to achieve a given error. (2) There is a critical mini-batch size such that: (a: linear scaling) SGD iteration with mini-batch size m smaller than the critical size is nearly equivalent to m iterations of mini-batch size 1; (b: saturation) SGD iteration with mini-batch larger than the critical size is nearly equivalent to a gradient descent step.
    The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization. It is also nearly independent of the data size, implying O(n) acceleration over GD per unit of computation.
    We give experimental evidence on real data, with the results closely following our theoretical analyses.
    Finally, we show how the interpolation perspective and our results fit with recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.
  • Linear Convergence and Implicit Regularization of Generalized Mirror Descent with Time-Dependent Mirrors [arxiv]
    Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler.
    + abstract
    The following questions are fundamental to understanding the properties of over-parameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum? (2) What form of implicit regularization occurs through training? While significant progress has been made in answering both of these questions for gradient descent, they have yet to be answered more completely for general optimization methods. In this work, we establish sufficient conditions for linear convergence and obtain approximate implicit regularization results for generalized mirror descent (GMD), a generalization of mirror descent with a possibly time-dependent mirror. GMD subsumes popular first order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad. By using the Polyak-Lojasiewicz inequality, we first present a simple analysis under which non-stochastic GMD converges linearly to a global minimum. We then present a novel, Taylor-series based analysis to establish sufficient conditions for linear convergence of stochastic GMD. As a corollary, our result establishes sufficient conditions and provides learning rates for linear convergence of stochastic mirror descent and Adagrad. Lastly, we obtain approximate implicit regularization results for GMD by proving that GMD converges to an interpolating solution that is approximately the closest interpolating solution to the initialization in l2-norm in the dual space, thereby generalizing the result of Azizan, Lale, and Hassibi (2019) in the full batch setting.
  • Kernel machines that adapt to GPUs for effective large batch training [arxiv, EigenPro2.0 code]
    Siyuan Ma, Mikhail Belkin, SysML 2019.
    + abstract
    Modern machine learning models are typically trained using Stochastic Gradient Descent (SGD) on massively parallel computing resources such as GPUs. Increasing mini-batch size is a simple and direct way to utilize the parallel computing capacity. For small batch sizes, an increase in batch size results in a proportional reduction in the training time, a phenomenon known as linear scaling. However, increasing batch size beyond a certain value leads to no further improvement in training time. In this paper we develop the first analytical framework that extends linear scaling to match the parallel computing capacity of a resource. The framework is designed for a class of classical kernel machines. It automatically modifies a standard kernel machine to output a mathematically equivalent prediction function, yet allowing for extended linear scaling, i.e., higher effective parallelization and faster training time on given hardware. The resulting algorithms are accurate, principled and very fast. For example, using a single Titan Xp GPU, training on ImageNet with 1.3×10^6 data points and 1000 labels takes under an hour, while smaller datasets, such as MNIST, take seconds. As the parameters are chosen analytically, based on the theoretical bounds, little tuning beyond selecting the kernel and the kernel parameter is needed, further facilitating the practical use of these methods.
  • Accelerating Stochastic Training for Over-parametrized Learning [arxiv]
    Chaoyue Liu, Mikhail Belkin, ICLR 2020.
    + abstract
    In this paper we introduce MaSS (Momentum-added Stochastic Solver), an accelerated SGD method for optimizing over-parameterized networks. Our method is simple and efficient to implement and does not require changing parameters or computing full gradients in the course of optimization. We provide a detailed theoretical analysis for convergence and parameter selection including their dependence on the mini-batch size in the quadratic case. We also provide theoretical convergence results for a more general convex setting. We provide an experimental evaluation showing strong performance of our method in comparison to Adam and SGD for several standard architectures of deep networks including ResNet, convolutional and fully connected networks. We also show its performance for convex kernel machines.
  • On exponential convergence of SGD in non-convex over-parametrized learning [arxiv]
    Raef Bassily, Mikhail Belkin, Siyuan Ma.
    + abstract
    Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning. Although SGD methods are very effective in practice, most theoretical analyses of SGD suggest slower convergence than what is empirically observed. In our recent work [8] we analyzed how interpolation, common in modern over-parametrized learning, results in exponential convergence of SGD with constant step size for convex loss functions. In this note, we extend those results to a much broader non-convex function class satisfying the Polyak-Lojasiewicz (PL) condition. A number of important non-convex problems in machine learning, including some classes of neural networks, have been recently shown to satisfy the PL condition. We argue that the PL condition provides a relevant and attractive setting for many machine learning problems, particularly in the over-parametrized regime.
  • Kernel Machines Beat Deep Neural Networks on Mask-based Single-channel Speech Enhancement [arxiv]
    Like Hui, Siyuan Ma, Mikhail Belkin, INTERSPEECH 2019.
    + abstract
    We apply a fast kernel method for mask-based single-channel speech enhancement. Specifically, our method solves a kernel regression problem associated to a non-smooth kernel function (exponential power kernel) with a highly efficient iterative method (EigenPro). Due to the simplicity of this method, its hyper-parameters such as kernel bandwidth can be automatically and efficiently selected using line search with subsamples of training data. We observe an empirical correlation between the regression loss (mean square error) and regular metrics for speech enhancement. This observation justifies our training target and motivates us to achieve lower regression loss by training separate kernel model per frequency subband. We compare our method with the state-of-the-art deep neural networks on mask-based HINT and TIMIT. Experimental results show that our kernel method consistently outperforms deep neural networks while requiring less training time.
  • Does data interpolation contradict statistical optimality? [arxiv]
    Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov, AI&Stats 2019.
    + abstract
    We show that learning methods interpolating the training data can achieve optimal rates for the problems of nonparametric regression and prediction with square loss.
  • Approximation beats concentration? An approximation view on inference with smooth kernels [arxiv]
    Mikhail Belkin, COLT 2018
    + abstract
    Positive definite kernels and their associated Reproducing Kernel Hilbert Spaces provide a mathematically compelling and practically competitive framework for learning from data. In this paper we take the approximation theory point of view to explore various aspects of smooth kernels related to their inferential properties. We analyze eigenvalue decay of kernel operators and matrices, properties of eigenfunctions/eigenvectors and "Fourier" coefficients of functions in the kernel space restricted to a discrete set of data points. We also investigate the fitting capacity of kernels, giving explicit bounds on the fat shattering dimension of the balls in Reproducing Kernel Hilbert spaces. Interestingly, the same properties that make kernels very effective approximators for functions in their "native" kernel space also limit their capacity to represent arbitrary functions. We discuss various implications, including those for gradient descent type methods.
    It is important to note that most of our bounds are measure independent. Moreover, at least in moderate dimension, the bounds for eigenvalues are much tighter than the bounds which can be obtained from the usual matrix concentration results. For example, we see that the eigenvalues of kernel matrices show nearly exponential decay with constants depending only on the kernel and the domain. We call this "approximation beats concentration" phenomenon as even when the data are sampled from a probability distribution, some of their aspects are better understood in terms of approximation theory.
  • Unperturbed: spectral analysis beyond Davis-Kahan [arxiv]
    Justin Eldridge, Mikhail Belkin, Yusu Wang, ALT 2018
    + abstract
    Classical matrix perturbation results, such as Weyl's theorem for eigenvalues and the Davis-Kahan theorem for eigenvectors, are general purpose. These classical bounds are tight in the worst case, but in many settings sub-optimal in the typical case. In this paper, we present perturbation bounds which consider the nature of the perturbation and its interaction with the unperturbed structure in order to obtain significant improvements over the classical theory in many scenarios, such as when the perturbation is random. We demonstrate the utility of these new results by analyzing perturbations in the stochastic blockmodel where we derive much tighter bounds than provided by the classical theory. We use our new perturbation theory to show that a very simple and natural clustering algorithm -- whose analysis was difficult using the classical tools -- nevertheless recovers the communities of the blockmodel exactly even in very sparse graphs.
  • Eigenvectors of Orthogonally Decomposable Functions [arxiv]
    Mikhail Belkin, Luis Rademacher, James Voss, SIAM Journal on Computing (SICOMP), 2018,
    short version COLT 2016 (Learning a Hidden Basis Through Imperfect Measurements: An Algorithmic Primitive)
    + abstract
    In this paper, we generalize the eigendecomposition of quadratic forms (symmetric matrices) to a broad class of "orthogonally decomposable" functions. We focus on extending two characterizations of eigenvectors: First, that the eigenvectors of a quadratic form arise from the optima structure of the quadratic form on the sphere, and second that the eigenvectors are the fixed points of the matrix power iteration. We identify a key role of convexity in extending these characterizations to our setting. The generalized power iteration is a simple first order method which we call gradient iteration. Further, our framework captures as special cases recent methods for inferential problems in machine learning in areas including orthogonal tensor decompositions, Independent Component Analysis (ICA), topic modeling, spectral clustering, and Gaussian mixture learning. We provide a complete theoretical analysis of gradient iteration using the structure theory of discrete dynamical systems to show almost sure convergence and fast (super-linear) convergence rates. The analysis extends to the case when the observed function is only approximately orthogonally decomposable, with bounds that are polynomial in dimension and other relevant parameters, such as perturbation size. Our perturbation results can be considered as a non-linear version of the classical Davis-Kahan theorem for perturbations of eigenvectors of symmetric matrices.
  • Diving into the shallows: a computational perspective on large-scale shallow learning [arxiv, EigenPro code (Keras/Matlab)]
    Siyuan Ma, Mikhail Belkin, NIPS 2017 (spotlight, 5% of submissions).
    + abstract
    In this paper we first identify a basic limitation in gradient descent-based optimization methods when used in conjunction with smooth kernels. An analysis based on the spectral properties of the kernel demonstrates that only a vanishingly small portion of the function space is reachable after a polynomial number of gradient descent iterations. This lack of approximating power drastically limits gradient descent for a fixed computational budget leading to serious over-regularization/underfitting. The issue is purely algorithmic, persisting even in the limit of infinite data. To address this shortcoming in practice, we introduce EigenPro iteration, based on a preconditioning scheme using a small number of approximately computed eigenvectors. It can also be viewed as learning a new kernel optimized for gradient descent. It turns out that injecting this small (computationally inexpensive and SGD-compatible) amount of approximate second-order information leads to major improvements in convergence. For large data, this translates into significant performance boost over the standard kernel methods. In particular, we are able to consistently match or improve the state-of-the-art results recently reported in the literature with a small fraction of their computational budget. Finally, we feel that these results show a need for a broader computational perspective on modern large-scale learning to complement more traditional statistical and convergence analyses. In particular, many phenomena of large-scale high-dimensional inference are best understood in terms of optimization on infinite dimensional Hilbert spaces, where standard algorithms can sometimes have properties at odds with finite-dimensional intuition. A systematic analysis concentrating on the approximation power of such algorithms within a budget of computation may lead to progress both in theory and practice.
  • Graphons, mergeons, and so on! [arxiv, 3-min pre-NIPS video, NIPS video]
    Justin Eldridge, Mikhail Belkin, Yusu Wang, NIPS 2016 (oral presentation, 2% of submissions)
    + abstract
    In this work we develop a theory of hierarchical clustering for graphs. Our modeling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the "correct" clustering, give sufficient conditions in which a method is statistically consistent, and provide an explicit algorithm satisfying these properties.
  • Clustering with Bregman Divergences: an Asymptotic Analysis [link]
    Chaoyue Liu, Mikhail Belkin, NIPS 2016
    + abstract
    Clustering, in particular k-means clustering, is a central topic in data analysis. Clustering with Bregman divergences is a recently proposed generalization of k-means clustering which has already been widely used in applications. In this paper we analyze theoretical properties of Bregman clustering when the number of the clusters k is large. We establish quantization rates and describe the limiting distribution of the centers as k tends to infinity, extending well-known results for k-means clustering.
  • Back to the future: Radial Basis Function networks revisited [link]
    Qichao Que, Mikhail Belkin, AI & Statistics 2016.
    + abstract
    Radial Basis Function (RBF) networks are a classical family of algorithms for supervised learning. The most popular approach for training RBF networks has relied on kernel methods using regularization based on a norm in a Reproducing Kernel Hilbert Space (RKHS), which is a principled and empirically successful framework. In this paper we aim to revisit some of the older approaches to training the RBF networks from a more modern perspective. Specifically, we analyze two common regularization procedures, one based on the square norm of the coefficients in the network and another on using centers obtained by k-means clustering. We show that both of these RBF methods can be recast as certain data-dependent kernels. We provide a theoretical analysis of these methods as well as a number of experimental results, pointing out very competitive experimental performance as well as certain advantages over the standard kernel methods in terms of both flexibility (incorporation of unlabeled data) and computational complexity. Finally, our results shed light on some impressive recent successes of using soft k-means features for image recognition and other tasks.
  • The Hidden Convexity of Spectral Clustering [arxiv]
    James Voss, Mikhail Belkin, Luis Rademacher, AAAI 2016 (oral presentation)
    + abstract
    In recent years, spectral clustering has become a standard method for data analysis used in a broad range of applications. In this paper we propose a new class of algorithms for multiway spectral clustering based on optimization of a certain "contrast function" over a sphere. These algorithms are simple to implement, efficient and, unlike most of the existing algorithms for multiclass spectral clustering, are not initialization-dependent. Moreover, they are applicable without modification for normalized and un-normalized clustering, which are two common variants of spectral clustering. Geometrically, the proposed algorithms can be interpreted as recovering a discrete weighted simplex by means of function optimization. We give complete necessary and sufficient conditions on contrast functions for the optimization to guarantee recovery of clusters. We show how these conditions can be interpreted in terms of certain "hidden convexity" of optimization over a sphere.
  • Learning Privately from Multiparty Data [arxiv]
    Jihun Hamm, Paul Cao, Mikhail Belkin, ICML 2016
    + abstract
    Learning a classifier from private data collected by multiple parties is an important problem that has many potential applications. How can we build an accurate and differentially private global classifier by combining locally-trained classifiers from different parties, without access to any party's private data? We propose to transfer the `knowledge' of the local classifier ensemble by first creating labeled data from auxiliary unlabeled data, and then train a global ε-differentially private classifier. We show that majority voting is too sensitive and therefore propose a new risk weighted by class probabilities estimated from the ensemble. Relative to a non-private solution, our private solution has a generalization error bounded by O(ε^{-2}M^{-2}) where M is the number of parties. This allows strong privacy without performance loss when M is large, such as in crowdsensing applications. We demonstrate the performance of our method with realistic tasks of activity recognition, network intrusion detection, and malicious URL detection.
  • A Pseudo-Euclidean Iteration for Optimal Recovery in Noisy ICA [link]
    James Voss, Mikhail Belkin, Luis Rademacher, NIPS 2015
    +abstract
    Independent Component Analysis (ICA) is a popular model for blind signal separation. The ICA model assumes that a number of independent source signals are linearly mixed to form the observed signals. We propose a new algorithm, PEGI (for pseudo-Euclidean Gradient Iteration), for provable model recovery for ICA with Gaussian noise. The main technical innovation of the algorithm is to use a fixed point iteration in a pseudo-Euclidean (indefinite "inner product") space. The use of this indefinite "inner product" resolves technical issues common to several existing algorithms for noisy ICA. This leads to an algorithm which is conceptually simple, efficient and accurate in testing. Our second contribution is combining PEGI with the analysis of objectives for optimal recovery in the noisy ICA model. It has been observed that the direct approach of demixing with the inverse of the mixing matrix is suboptimal for signal recovery in terms of the natural Signal to Interference plus Noise Ratio (SINR) criterion. There have been several partial solutions proposed in the ICA literature. It turns out that any solution to the mixing matrix reconstruction problem can be used to construct an SINR-optimal ICA demixing, despite the fact that SINR itself cannot be computed from data. That allows us to obtain a practical and provably SINR-optimal recovery method for ICA with arbitrary Gaussian noise.
  • Polynomial Learning of Distribution Families [link]
    M. Belkin, K. Sinha, SIAM Journal on Computing (SICOMP), 44(4), 889-911, 2015.
    (Short version FOCS 2010).
    + abstract
    The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. However, despite major progress, the general question of polynomial learnability of Gaussian mixture distributions still remained open. The current work resolves the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call "polynomial families" in low dimension. These families are characterized by their moments being polynomial in parameters and include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that parameters of any distribution belonging to such a family can be learned in polynomial time and using a polynomial number of sample points. The result on learning polynomial families is quite general and is of independent interest. To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimations in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions.
  • Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering [link]
    Justin Eldridge, Mikhail Belkin, Yusu Wang, COLT 2015, Mark Fulk award (best student paper)!
    + abstract
    Hierarchical clustering is a popular method for analyzing data which associates a tree to a dataset. Hartigan consistency has been used extensively as a framework to analyze such clustering algorithms from a statistical point of view. Still, as we show in the paper, a tree which is Hartigan consistent with a given density can look very different than the correct limit tree. Specifically, Hartigan consistency permits two types of undesirable configurations which we term over-segmentation and improper nesting. Moreover, Hartigan consistency is a limit property and does not directly quantify difference between trees. In this paper we identify two limit properties, separation and minimality, which address both over-segmentation and improper nesting and together imply (but are not implied by) Hartigan consistency. We proceed to introduce a merge distortion metric between hierarchical clusterings and show that convergence in our distance implies both separation and minimality. We also prove that uniform separation and minimality imply convergence in the merge distortion metric. Furthermore, we show that our merge distortion metric is stable under perturbations of the density. Finally, we demonstrate applicability of these concepts by proving convergence results for two clustering algorithms. First, we show convergence (and hence separation and minimality) of the recent robust single linkage algorithm of Chaudhuri and Dasgupta (2010). Second, we provide convergence results on manifolds for topological split tree clustering.
  • Crowd-ML: A Privacy-Preserving Learning Framework for a Crowd of Smart Devices
    J. Hamm, A. Champion, G. Chen, M. Belkin, and D. Xuan, ICDCS 2015
    + abstract
    Smart devices with built-in sensors, computational capabilities, and network connectivity have become increasingly pervasive. Crowds of smart devices offer opportunities to collectively sense and perform computing tasks at an unprecedented scale. This paper presents Crowd-ML, a privacy-preserving machine learning framework for a crowd of smart devices, which can solve a wide range of learning problems for crowdsensing data with differential privacy guarantees. Crowd-ML endows a crowdsensing system with the ability to learn classifiers or predictors online from crowdsensing data privately with minimal computational overhead on devices and servers, suitable for practical large-scale use of the framework. We analyze the performance and scalability of Crowd-ML and implement the system with off-the-shelf smartphones as a proof of concept. We demonstrate the advantages of Crowd-ML with real and simulated experiments under various conditions.
  • Learning with Fredholm Kernels [link]
    Qichao Que, Mikhail Belkin, Yusu Wang, NIPS 2014
    + abstract
    In this paper we propose a framework for supervised and semi-supervised learning based on reformulating the learning problem as a regularized Fredholm integral equation. Our approach fits naturally into the kernel framework and can be interpreted as constructing new data-dependent kernels, which we call Fredholm kernels. We proceed to discuss the “noise assumption” for semi-supervised learning and provide both theoretical and experimental evidence that Fredholm kernels can effectively utilize unlabeled data under the noise assumption. We demonstrate that methods based on Fredholm learning show very competitive performance in the standard semi-supervised learning setting.
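    A rough Python/NumPy sketch of the data-dependent kernel construction described above, under the illustrative assumption that both the "outer" kernel k and the "inner" kernel k_H are Gaussian; function and parameter names here are not from the paper, and the linked code should be consulted for the actual method.
      import numpy as np

      def gaussian(A, B, sigma):
          d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
          return np.exp(-d2 / (2 * sigma ** 2))

      def fredholm_kernel(X, Z, X_unlabeled, sigma_outer=0.5, sigma_inner=1.0):
          """Data-dependent kernel K_F(x, z) built with the help of unlabeled points."""
          u = len(X_unlabeled)
          K_xu = gaussian(X, X_unlabeled, sigma_outer)            # outer kernel, left factor
          K_uu = gaussian(X_unlabeled, X_unlabeled, sigma_inner)  # inner kernel on the unlabeled sample
          K_uz = gaussian(X_unlabeled, Z, sigma_outer)            # outer kernel, right factor
          return (K_xu @ K_uu @ K_uz) / u ** 2
    The resulting Gram matrix can then be plugged into any standard kernel method (for example, regularized least squares on the labeled points), which is how the unlabeled sample enters the learned function.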
  • The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures. [arxiv]
    Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, James Voss, COLT 2014
    + abstract
    In this paper we show that very large mixtures of Gaussians with known and identical covariance matrix are efficiently learnable in high dimension. More precisely, we prove that a mixture whose number of components is a polynomial of any fixed degree in the dimension n is polynomially learnable as long as a certain non-degeneracy condition on the means is satisfied. It turns out that this condition is generic in the sense of smoothed complexity, as soon as the dimensionality of the space is high enough. Moreover, we prove that no such condition can exist in low dimension. Our main result on mixture recovery relies on a new "Poissonization"-based technique, which transforms a mixture of Gaussians into a projection of a product distribution. The problem of learning the projection can be efficiently solved using some recent results on tensor decompositions, and this gives an efficient algorithm for learning the mixture. While our results require a fixed known covariance matrix, we believe that this work is among the first steps toward better understanding the rare phenomenon of the "blessing of dimensionality" in the computational aspects of statistical inference.
  • Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis. [NIPS archive, GI-ICA code]
    James Voss, Luis Rademacher, Mikhail Belkin, NIPS 2013
    + abstract
    The performance of standard algorithms for Independent Component Analysis quickly deteriorates under the addition of Gaussian noise. This is partially due to a common first step that typically consists of whitening, i.e., applying Principal Component Analysis (PCA) and rescaling the components to have identity covariance, which is not invariant under Gaussian noise. In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise. The two main contributions of this work are as follows: 1. We develop and implement a more efficient version of a Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions. 2. We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case. The algorithm is based on a special form of gradient iteration (different from gradient descent). We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case.
  • Inverse Density as an Inverse Problem: The Fredholm Equation Approach [arxiv]
    Qichao Que, Mikhail Belkin, NIPS 2013
    + abstract
    In this paper we address the problem of estimating the ratio $\frac{q}{p}$ where $p$ is a density function and $q$ is another density, or, more generally, an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration, in particular, when one needs to average a function with respect to one probability distribution, given a sample from another. It is often referred to as importance sampling in statistical inference and is also closely related to the problem of covariate shift in transfer learning as well as to various MCMC methods. It may also be useful for separating the underlying geometry of a space, say a manifold, from the density function defined on it. Our approach is based on reformulating the problem of estimating $\frac{q}{p}$ as an inverse problem in terms of an integral operator corresponding to a kernel, and thus reducing it to an integral equation, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization and kernel methods, leads to a principled kernel-based framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel in the case of densities defined on $\R^d$, compact domains in $\R^d$ and smooth $d$-dimensional sub-manifolds of the Euclidean space. We also show experimental results including applications to classification and semi-supervised learning within the covariate shift framework and demonstrate some encouraging experimental comparisons. We also show how the parameters of our algorithms can be chosen in a completely unsupervised manner.
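    A minimal Tikhonov-style sketch (Python/NumPy) of the idea: discretize the Fredholm equation K_p f = K_q 1 on the sample from p and solve a regularized least-squares problem for f, which approximates q/p. This follows the spirit of the FIRE family rather than reproducing the paper's exact estimators; the single Gaussian kernel and the names sigma and lam are illustrative choices.
      import numpy as np

      def gaussian_kernel(A, B, sigma):
          d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
          return np.exp(-d2 / (2 * sigma ** 2))

      def density_ratio_sketch(Xp, Xq, sigma=0.5, lam=1e-3):
          """Estimate q/p at the sample points Xp, given samples Xp ~ p and Xq ~ q."""
          n, m = len(Xp), len(Xq)
          Kpp = gaussian_kernel(Xp, Xp, sigma)
          Kpq = gaussian_kernel(Xp, Xq, sigma)
          rhs = Kpq.sum(axis=1) / (n * m)                         # empirical (K_q 1) on the p-sample
          alpha = np.linalg.solve(Kpp @ Kpp / n ** 2 + lam * np.eye(n), rhs)
          return Kpp @ alpha                                      # estimated values of q/p at Xp

      # Toy usage: p = N(0, 1), q = N(1, 1); the estimates should be larger
      # at points to the right, where q has more mass than p.
      rng = np.random.default_rng(0)
      Xp = rng.normal(0.0, 1.0, (500, 1))
      Xq = rng.normal(1.0, 1.0, (500, 1))
      ratio_at_Xp = density_ratio_sketch(Xp, Xq)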
  • Blind Signal Separation in the Presence of Gaussian Noise [arxiv]
    Mikhail Belkin, Luis Rademacher, James Voss, COLT 2013
    + abstract
    A prototypical blind signal separation problem is the so-called cocktail party problem, with n people talking simultaneously and n different microphones within a room. The goal is to recover each speech signal from the microphone inputs. Mathematically this can be modeled by assuming that we are given samples from an n-dimensional random variable X=AS, where S is a vector whose coordinates are independent random variables corresponding to each speaker. The objective is to recover the matrix A^{-1} given random samples from X. A range of techniques collectively known as Independent Component Analysis (ICA) have been proposed to address this problem in the signal processing and machine learning literature. Many of these techniques are based on using the kurtosis or other cumulants to recover the components. In this paper we propose a new algorithm for solving the blind signal separation problem in the presence of additive Gaussian noise, when we are given samples from X=AS + \eta, where {\eta} is drawn from an unknown n-dimensional Gaussian distribution. Our approach is based on a method for decorrelating a sample with additive Gaussian noise under the assumption that the underlying distribution is a linear transformation of a distribution with independent components. Our decorrelation routine is based on the properties of cumulant tensors and can be combined with any standard cumulant-based method for ICA to get an algorithm that is provably robust in the presence of Gaussian noise. We derive polynomial bounds for sample complexity and error propagation of our method. Our results generalize the recent work of Arora et al., which deals with a special case of ICA when S is the uniform probability distribution over the binary cube.
  • Graph Laplacians on Singular Manifolds: Toward understanding complex spaces: graph Laplacians on manifolds with singularities and boundaries [arxiv]
    Mikhail Belkin, Qichao Que, Yusu Wang, Xueyuan Zhou, COLT 2012.
    + abstract
    Recently, much of the existing work in manifold learning has been done under the assumption that the data is sampled from a manifold without boundaries and singularities or that the functions of interest are evaluated away from such points. At the same time, it can be argued that singularities and boundaries are an important aspect of the geometry of realistic data. In this paper we consider the behavior of graph Laplacians at points at or near boundaries and two main types of other singularities: intersections, where different manifolds come together, and sharp "edges", where a manifold sharply changes direction. We show that the behavior of the graph Laplacian near these singularities is quite different from that in the interior of the manifolds. In fact, a phenomenon somewhat reminiscent of the Gibbs effect in the analysis of Fourier series can be observed in the behavior of the graph Laplacian near such points. Unlike in the interior of the domain, where the graph Laplacian converges to the Laplace-Beltrami operator, near singularities the graph Laplacian tends to a first-order differential operator, which exhibits different scaling behavior as a function of the kernel width. One important implication is that while points near the singularities occupy only a small part of the total volume, the difference in scaling results in a disproportionately large contribution to the total behavior. Another significant finding is that while the scaling behavior of the operator is the same near different types of singularities, they are very distinct at a more refined level of analysis. We believe that a comprehensive understanding of these structures in addition to the standard case of a smooth manifold can take us a long way toward better methods for analysis of complex non-linear data and can lead to significant progress in algorithm design.
  • Data Skeletonization via Reeb Graphs [pdf]
    X. Ge, I. Safa, M. Belkin, Y. Wang, NIPS 2011.
    + abstract
    Recovering hidden structure from complex and noisy non-linear data is one of the most fundamental problems in machine learning and statistical inference. While such data is often high-dimensional, it is of interest to approximate it with a low-dimensional or even one-dimensional space, since many important aspects of data are often intrinsically low-dimensional. Furthermore, there are many scenarios where the underlying structure is graph-like, e.g., river/road networks or various trajectories. In this paper, we develop a framework to extract, as well as to simplify, a one-dimensional "skeleton" from unorganized data using the Reeb graph. Our algorithm is very simple, does not require complex optimizations and can be easily applied to unorganized high-dimensional data such as point clouds or proximity graphs. It can also represent arbitrary graph structures in the data. We also give theoretical results to justify our method. We provide a number of experiments to demonstrate the effectiveness and generality of our algorithm, including comparisons to existing methods, such as principal curves. We believe that the simplicity and practicality of our algorithm will help to promote skeleton graphs as a data analysis tool for a broad range of applications.
  • Convergence of Laplacian Eigenmaps [pdf, bib]
    M. Belkin, P. Niyogi
    preprint, short version NIPS 2008.
    + abstract
    Geometrically based methods for various tasks of data analysis have attracted considerable attention over the last few years. In many of these algorithms, a central role is played by the eigenvectors of the graph Laplacian of a data-derived graph. In this paper, we show that if points are sampled uniformly at random from an unknown submanifold ${\cal M}$ of $\R^N$, then the eigenvectors of a suitably constructed graph Laplacian converge to the eigenfunctions of the Laplace Beltrami operator on ${\cal M}$. This basic result directly establishes the convergence of spectral manifold learning algorithms such as Laplacian Eigenmaps and Diffusion Maps. It also has implications for the understanding of geometric algorithms in data analysis, computational harmonic analysis, geometric random graphs, and graphics.
  • On Learning with Integral Operators [pdf, bib]
    L. Rosasco, M.Belkin, E. De Vito,
    Journal of Machine Learning Research, vol.11, pp.905-934, 2010.
    + abstract
    A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold. First, we use a technique based on a concentration inequality for Hilbert spaces to provide new simplified proofs for a number of results in spectral approximation. Second, using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from [28].
  • Data spectroscopy: eigenspaces of convolution operators and clustering [pdf, bib]
    Tao Shi, Mikhail Belkin, Bin Yu
    The Annals of Statistics, vol. 37, Number 6B (2009), 3960-3984.
    + abstract
    This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the Data Spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.
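    A rough sketch of the eigenvector-selection idea, as suggested by the abstract: leading eigenvectors of the kernel matrix with (approximately) no sign change are treated as cluster indicators, and each point is assigned to the indicator that is largest in magnitude there. The thresholds eps and n_top and the kernel width are illustrative, not the paper's prescriptions.
      import numpy as np
      from scipy.spatial.distance import pdist, squareform

      def daspec_sketch(X, sigma=1.0, eps=1e-2, n_top=20):
          K = np.exp(-squareform(pdist(X, "sqeuclidean")) / (2 * sigma ** 2))
          _, vecs = np.linalg.eigh(K)
          top = vecs[:, ::-1][:, :n_top]               # leading eigenvectors of the kernel matrix
          indicators = []
          for v in top.T:
              significant = v[np.abs(v) > eps * np.abs(v).max()]
              if np.all(significant > 0) or np.all(significant < 0):
                  indicators.append(np.abs(v))         # no sign change: treat as a cluster indicator
          return np.array(indicators).argmax(axis=0)   # assign each point to its strongest indicator

      # Toy usage: three well-separated blobs; the number of indicators found
      # plays the role of the automatically determined number of clusters.
      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(c, 0.2, (60, 2)) for c in (0.0, 4.0, 8.0)])
      labels = daspec_sketch(X, sigma=0.5)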
  • Towards a Theoretical Foundation for Laplacian-Based Manifold Methods [pdf, bib]
    M. Belkin, P. Niyogi
    Journal of Computer and System Sciences, 2008.
    Volume 74, Issue 8, pp. 1289-1308. Special Issue on Learning Theory, invited (short version in COLT 2005).
    + abstract
    In recent years manifold methods have attracted a considerable amount of attention in machine learning. However, most algorithms in that class may be termed "manifold-motivated" as they lack any explicit theoretical guarantees. In this paper we take a step towards closing the gap between theory and practice for a class of Laplacian-based manifold methods. These methods utilize the graph Laplacian associated to a data set for a variety of applications in semi-supervised learning, clustering, and data representation. We show that under certain conditions the graph Laplacian of a point cloud of data samples converges to the Laplace-Beltrami operator on the underlying manifold. Theorem 3.1 contains the first result showing convergence of a random graph Laplacian to the manifold Laplacian in the context of machine learning.
  • Discrete Laplace Operator for Meshed Surfaces [pdf, code, bib]
    M. Belkin, J. Sun, Y. Wang, 24th Annual Symposium on Computational Geometry (SOCG) 2008.
    + abstract
    In recent years a considerable amount of work in graphics and geometric optimization used tools based on the Laplace-Beltrami operator on a surface. The applications of the Laplacian include mesh editing, surface smoothing, and shape interpolations among others. However, it has been shown [12, 23, 25] that the popular cotangent approximation schemes do not provide convergent point-wise (or even L^2) estimates, while many applications rely on point-wise estimation. Existence of such schemes has been an open question [12]. In this paper we propose the first algorithm for approximating the Laplace operator of a surface from a mesh with point-wise convergence guarantees applicable to arbitrary meshed surfaces. We show that for a sufficiently fine mesh over an arbitrary surface, our mesh Laplacian is close to the Laplace-Beltrami operator on the surface at every point of the surface. Moreover, the proposed algorithm is simple and easily implementable. Experimental evidence shows that our algorithm exhibits convergence empirically and outperforms cotangent-based methods in providing accurate approximation of the Laplace operator for various meshes.
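    A sketch of a heat-kernel mesh Laplacian in the spirit of the construction above: Gaussian weights between mesh vertices, with each vertex carrying one third of the area of its incident triangles. The normalization 1/(4*pi*h^2) and the sign convention reflect one reading of the scheme; the code linked above is the authoritative implementation.
      import numpy as np

      def mesh_laplacian_apply(vertices, faces, f, h):
          """Apply an approximate Laplace-Beltrami operator to per-vertex values f.

          vertices: (n, 3) float array; faces: (m, 3) integer array of vertex indices.
          """
          # Triangle areas from the cross product of two edge vectors.
          v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
          areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
          # Give each vertex one third of the area of every triangle containing it.
          vertex_area = np.zeros(len(vertices))
          np.add.at(vertex_area, faces.ravel(), np.repeat(areas / 3.0, 3))
          # Heat-kernel weights between all pairs of vertices.
          d2 = ((vertices[:, None, :] - vertices[None, :, :]) ** 2).sum(-1)
          GA = np.exp(-d2 / (4.0 * h)) * vertex_area[None, :]
          # (Lf)(w) ~ 1/(4*pi*h^2) * sum_p area_p * G(w, p) * (f(p) - f(w))
          return (GA @ f - GA.sum(axis=1) * f) / (4.0 * np.pi * h ** 2)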
  • Consistency of Spectral Clustering [pdf, bib]
    U. von Luxburg, M. Belkin, O. Bousquet,
    The Annals of Statistics 2008, Vol. 36, No. 2, 555-586.
    + abstract
    Consistency is a key property of statistical algorithms when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that for increasing sample size, those eigenvectors converge to the eigenvectors of certain limit operators. As a result we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering.
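    A short sketch of the normalized variant whose consistency is established in the paper: eigenvectors of the symmetrically normalized graph Laplacian, followed by k-means on the row-normalized spectral embedding. The Gaussian similarity and its width sigma are illustrative choices.
      import numpy as np
      from scipy.cluster.vq import kmeans2
      from scipy.spatial.distance import pdist, squareform

      def normalized_spectral_clustering(X, k, sigma=1.0):
          W = np.exp(-squareform(pdist(X, "sqeuclidean")) / (2 * sigma ** 2))
          np.fill_diagonal(W, 0.0)
          d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
          L_sym = np.eye(len(X)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
          _, vecs = np.linalg.eigh(L_sym)
          U = vecs[:, :k]                                   # k smallest eigenvectors
          U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize the spectral embedding
          _, labels = kmeans2(U, k, minit="++")
          return labels

      # Toy usage: two concentric rings, which a small kernel width should separate.
      rng = np.random.default_rng(0)
      angles = rng.uniform(0, 2 * np.pi, 200)
      radii = np.repeat([1.0, 3.0], 100)
      X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
      labels = normalized_spectral_clustering(X, k=2, sigma=0.3)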
  • Manifold Regularization: a Geometric Framework for Learning from Labeled and Unlabeled Examples [pdf, bib]
    M. Belkin, P. Niyogi, V. Sindhwani
    Journal of Machine Learning Research, 7(Nov):2399-2434, 2006.
    + abstract
    We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide a theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally, we briefly discuss unsupervised and fully supervised learning within our general framework.
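    One concrete instance of the framework is Laplacian Regularized Least Squares. The sketch below follows the general recipe (an ambient RKHS penalty plus an intrinsic graph-Laplacian penalty computed from labeled and unlabeled points together), but the kernel choices, the constants, and the hyperparameter names gamma_A, gamma_I, sigma and t are illustrative rather than the paper's exact formulation.
      import numpy as np
      from scipy.spatial.distance import pdist, squareform

      def rbf_kernel(A, B, sigma):
          d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
          return np.exp(-d2 / (2 * sigma ** 2))

      def laprls_fit(X, y, n_labeled, gamma_A=1e-2, gamma_I=1e-2, sigma=1.0, t=1.0):
          """Fit Laplacian RLS; the first n_labeled rows of X are the labeled points."""
          n, l = X.shape[0], n_labeled
          K = rbf_kernel(X, X, sigma)                           # Gram matrix on all points
          W = np.exp(-squareform(pdist(X, "sqeuclidean")) / (4 * t))
          np.fill_diagonal(W, 0.0)
          L = np.diag(W.sum(axis=1)) - W                        # graph Laplacian over labeled + unlabeled
          J = np.zeros((n, n)); J[:l, :l] = np.eye(l)           # selects the labeled points
          Y = np.zeros(n); Y[:l] = y
          A = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n ** 2) * (L @ K)
          alpha = np.linalg.solve(A, Y)                         # kernel expansion coefficients
          return lambda X_test: rbf_kernel(X_test, X, sigma) @ alpha

      # Toy usage: two blobs with one labeled point each (placed first in X).
      rng = np.random.default_rng(0)
      blob1 = rng.normal(0.0, 0.3, (50, 2))
      blob2 = rng.normal(3.0, 0.3, (50, 2))
      X = np.vstack([blob1[:1], blob2[:1], blob1[1:], blob2[1:]])
      predict = laprls_fit(X, np.array([-1.0, 1.0]), n_labeled=2)
      labels = np.sign(predict(X))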
  • Heat Flow and a Faster Algorithm to Compute the Surface Area of a Convex Body [pdf, bib]
    M. Belkin, H. Narayanan, P. Niyogi, FOCS 2006.
    + abstract
    We draw on the observation that the amount of heat diffusing outside of a heated body in a short period of time is proportional to its surface area, to design a simple algorithm for approximating the surface area of a convex body given by a membership oracle. Our method has a complexity of O*(n^4), where n is the dimension, compared to O*(n^8.5) for the previous best algorithm. We show that our complexity cannot be improved given the current state-of-the-art in volume estimation.
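    A Monte Carlo toy version of the observation the algorithm is built on, with the unit ball (surface area 4*pi, about 12.57) as a sanity check: a single Gaussian step of variance t stands in for a short heat diffusion, and the constant sqrt(2*pi/t) comes from a half-space approximation. This only illustrates the heat-flow heuristic and is not the membership-oracle algorithm of the paper.
      import math
      import numpy as np

      def ball_surface_area_estimate(dim=3, t=1e-4, n_samples=200_000, seed=0):
          rng = np.random.default_rng(seed)
          # Uniform points in the unit ball: random direction times radius U^(1/dim).
          g = rng.standard_normal((n_samples, dim))
          g /= np.linalg.norm(g, axis=1, keepdims=True)
          x = g * rng.uniform(0.0, 1.0, (n_samples, 1)) ** (1.0 / dim)
          # One short diffusion step; count the points that end up outside the ball.
          y = x + math.sqrt(t) * rng.standard_normal((n_samples, dim))
          escaped = np.linalg.norm(y, axis=1) > 1.0
          ball_volume = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
          # Escape probability ~ (area / volume) * sqrt(t / (2*pi)) for small t.
          return escaped.mean() * ball_volume * math.sqrt(2.0 * math.pi / t)

      print(ball_surface_area_estimate())   # roughly 4*pi, i.e. about 12.6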
  • Semi-supervised Learning on Riemannian Manifolds [pdf, bib]
    M. Belkin, P. Niyogi
    Machine Learning, 56 (invited, special Issue on clustering), 209-239, 2004 (short version in NIPS 2002).
    + abstract
    We consider the general problem of utilizing both labeled and unlabeled data to improve classification accuracy. Under the assumption that the data lie on a submanifold in a high dimensional space, we develop an algorithmic framework to classify a partially labeled data set in a principled manner. The central idea of our approach is that classification functions are naturally defined only on the submanifold in question rather than the total ambient space. Using the Laplace-Beltrami operator one produces a basis (the Laplacian Eigenmaps) for a Hilbert space of square integrable functions on the submanifold. To recover such a basis, only unlabeled examples are required. Once such a basis is obtained, training can be performed using the labeled data set. Our algorithm models the manifold using the adjacency graph for the data and approximates the Laplace-Beltrami operator by the graph Laplacian. We provide details of the algorithm, its theoretical justification, and several practical applications for image, speech, and text classification.
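    A minimal sketch of the procedure described above: build a graph Laplacian from labeled and unlabeled points together, take its smoothest eigenvectors as a basis, and fit the labels by least squares using only the labeled rows. The heat-kernel graph and the parameter names p and t are illustrative choices.
      import numpy as np
      from scipy.spatial.distance import pdist, squareform

      def laplacian_basis(X, p, t=1.0):
          """The p smoothest eigenvectors of a heat-kernel graph Laplacian on X."""
          W = np.exp(-squareform(pdist(X, "sqeuclidean")) / (4.0 * t))
          np.fill_diagonal(W, 0.0)
          L = np.diag(W.sum(axis=1)) - W
          _, vecs = np.linalg.eigh(L)
          return vecs[:, :p]

      def semi_supervised_fit(X, y_labeled, labeled_idx, p=5, t=1.0):
          """Least-squares fit of the labels in the span of the Laplacian eigenbasis."""
          E = laplacian_basis(X, p, t)
          coef, *_ = np.linalg.lstsq(E[labeled_idx], y_labeled, rcond=None)
          return E @ coef                       # predicted values on all points

      # Toy usage: two blobs, only four labeled points, labels propagated to the rest.
      rng = np.random.default_rng(1)
      X = np.vstack([rng.normal(0.0, 0.3, (100, 2)), rng.normal(3.0, 0.3, (100, 2))])
      labeled_idx = np.array([0, 1, 100, 101])
      y_labeled = np.array([-1.0, -1.0, 1.0, 1.0])
      labels = np.sign(semi_supervised_fit(X, y_labeled, labeled_idx))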
  • Laplacian Eigenmaps for Dimensionality Reduction and Data Representation [pdf, bib]
    M. Belkin, P. Niyogi
    Neural Computation, June 2003; 15 (6):1373-1396 (short version in NIPS 2001).
    + abstract
    One of the central problems in machine learning and pattern recognition is to develop appropriate representations for complex data. We consider the problem of constructing a representation for data lying on a low-dimensional manifold embedded in a high-dimensional space. Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on the manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for representing the high-dimensional data. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Some potential applications and illustrative examples are discussed.
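    A compact sketch of the construction: heat-kernel weights, the graph Laplacian L = D - W, and the bottom nontrivial solutions of the generalized eigenproblem L y = lambda D y give the embedding. A full affinity matrix is used here for brevity (a k-nearest-neighbor graph is the more common choice in practice); the parameter names t and n_components are illustrative.
      import numpy as np
      from scipy.linalg import eigh
      from scipy.spatial.distance import pdist, squareform

      def laplacian_eigenmaps(X, n_components=2, t=1.0):
          """Embed the rows of X using the bottom nontrivial generalized eigenvectors."""
          W = np.exp(-squareform(pdist(X, "sqeuclidean")) / (4.0 * t))   # heat-kernel weights
          np.fill_diagonal(W, 0.0)
          D = np.diag(W.sum(axis=1))                                     # degree matrix
          L = D - W                                                      # graph Laplacian
          _, eigvecs = eigh(L, D)                    # generalized problem L y = lambda D y
          return eigvecs[:, 1:n_components + 1]      # drop the constant eigenvector

      # Toy usage: a noisy circle embedded in R^10 maps back to (roughly) a circle.
      rng = np.random.default_rng(0)
      theta = rng.uniform(0, 2 * np.pi, 300)
      X = np.zeros((300, 10))
      X[:, 0], X[:, 1] = np.cos(theta), np.sin(theta)
      X += 0.01 * rng.standard_normal(X.shape)
      Y = laplacian_eigenmaps(X, n_components=2, t=0.1)
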
Algorithms.

Links to some implementations.

Talks.

Slides and videos for some of my talks.

My wonderful wife and colleague Yusu Wang.
Always thoughtful advisor and dear friend Partha Niyogi, no longer with us...