Toward universal steering and monitoring of AI models [Science, arxiv]
Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adsera, Mikhail Belkin
Science, 2026.
+ abstract
Modern AI models contain much of human knowledge, yet understanding of their internal representation of this knowledge remains elusive. Characterizing the structure and properties of this representation will lead to improvements in model capabilities and development of effective safeguards. Building on recent advances in feature learning, we develop an effective, scalable approach for extracting linear representations of general concepts in large-scale AI models (language models, vision-language models, and reasoning models). We show how these representations enable model steering, through which we expose vulnerabilities, mitigate misaligned behaviors, and improve model capabilities. Additionally, we demonstrate that concept representations are remarkably transferable across human languages and combinable to enable multi-concept steering. Through quantitative analysis across hundreds of concepts, we find that newer, larger models are more steerable and steering can improve model capabilities beyond standard prompting. We show how concept representations are effective for monitoring misaligned content (hallucinations, toxic content). We demonstrate that predictive models built using concept representations are more accurate at monitoring misaligned content than models that judge outputs directly. Together, our results illustrate the power of using internal representations to map the knowledge in AI models, advance AI safety, and improve model capabilities.
Assembly and iteration: Transition to linearity of wide neural networks [ACHA]
Chaoyue Liu, Libin Zhu, Mikhail Belkin
Applied and Computational Harmonic Analysis, 82, 101834, 2026.
+ abstract
The recently discovered remarkable property that very wide neural networks in certain regimes are linear functions of their weights has become one of the key insights into understanding the mathematical foundations of deep learning. In this work, we show that this transition to linearity of wide neural networks can be viewed as an outcome of an iterated assembly procedure employed in the construction of neural networks. From the perspective of assembly, the output of a wide network can be viewed as an assembly of a large number of similar sub-models, which will transition to linearity as their number increases. This process can be iterated multiple times to show the transition to linearity of deep networks, including general feedforward neural networks with Directed Acyclic Graph (DAG) architecture.
A fundamental problem in machine learning is to understand how neural networks make accurate predictions, while seemingly bypassing the curse of dimensionality. A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction - a process called feature learning. Recent work posited that the effects of feature learning can be elicited from a classical statistical estimator called the average gradient outer product (AGOP). The authors proposed Recursive Feature Machines (RFMs) as an algorithm that explicitly performs feature learning by alternating between (1) reweighting the feature vectors by the AGOP and (2) learning the prediction function in the transformed space. In this work, we develop the first theoretical guarantees for how RFM performs dimensionality reduction by focusing on the class of overparametrized problems arising in sparse linear regression and low-rank matrix recovery. Specifically, we show that RFM restricted to linear models (lin-RFM) generalizes the well-studied Iteratively Reweighted Least Squares (IRLS) algorithm. Our results shed light on the connection between feature learning in neural networks and classical sparse recovery algorithms. In addition, we provide an implementation of lin-RFM that scales to matrices with millions of missing entries. Our implementation is faster than the standard IRLS algorithm as it is SVD-free. It also outperforms deep linear networks for sparse linear regression and low-rank matrix completion.
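To make the alternation concrete, here is a minimal sketch of a diagonal lin-RFM-style iteration for sparse linear regression. The function name `diagonal_lin_rfm` and the exact update (a square-root reweighting, as in IRLS-type schemes) are illustrative assumptions; the paper's precise formulation and constants may differ.

```python
import numpy as np

def diagonal_lin_rfm(X, y, n_iter=20, reg=1e-6):
    """Hypothetical sketch of a diagonal lin-RFM iteration for sparse recovery.

    Alternates (1) a ridge/min-norm linear fit in the reweighted feature space
    X @ diag(m) and (2) an AGOP-style update of the diagonal reweighting m.
    """
    n, d = X.shape
    m = np.ones(d)                      # diagonal of the feature-reweighting matrix M
    for _ in range(n_iter):
        Z = X * m                       # reweight the features by M
        w = Z.T @ np.linalg.solve(Z @ Z.T + reg * np.eye(n), y)
        beta = m * w                    # effective coefficients: f(x) = beta . x
        m = np.sqrt(np.abs(beta))       # sqrt reweighting, as in IRLS-type schemes
    return beta
```

On a noiseless sparse problem this iteration drives the coefficients on the irrelevant coordinates toward zero while fitting the data exactly, mirroring the IRLS connection described in the abstract.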
Mechanism for feature learning in neural networks and backpropagation-free machine learning models [Science, arxiv]
Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, Mikhail Belkin
Science, 2024.
+ abstract
Understanding how neural networks learn features, or relevant patterns in data, for prediction is necessary for their reliable use in technological and scientific applications. In this work, we presented a unifying mathematical mechanism, known as Average Gradient Outer Product (AGOP), that characterized feature learning in neural networks. We provided empirical evidence that AGOP captured features learned by various neural network architectures, including transformer-based language models, convolutional networks, multi-layer perceptrons, and recurrent neural networks. Moreover, we demonstrated that AGOP, which is backpropagation-free, enabled feature learning in machine learning models, such as kernel machines, that apriori could not identify task-specific features. Overall, we established a fundamental mechanism that captured feature learning in neural networks and enabled feature learning in general machine learning models.
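The AGOP itself is simple to state: average the outer product of the model's input gradients over the data. A minimal black-box sketch (the `agop` helper is illustrative, using finite differences so any scalar predictor can be plugged in):

```python
import numpy as np

def agop(f, X, eps=1e-5):
    """Average Gradient Outer Product (AGOP) of a scalar model f over inputs X.

    Gradients are taken by central finite differences, so f can be any
    black-box predictor mapping a length-d vector to a scalar.
    """
    n, d = X.shape
    G = np.zeros((d, d))
    for x in X:
        grad = np.zeros(d)
        for j in range(d):
            e = np.zeros(d)
            e[j] = eps
            grad[j] = (f(x + e) - f(x - e)) / (2 * eps)
        G += np.outer(grad, grad)      # accumulate grad f(x) grad f(x)^T
    return G / n
```

The top eigenvectors of the resulting matrix indicate the input directions the model is most sensitive to, which is what "features learned by the model" means in this context.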
Wide and deep neural networks achieve consistency for classification [PNAS, arxiv]
Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler
PNAS, 120(14), 2023.
+ abstract
While neural networks are used for classification tasks across domains, a long-standing open problem in machine learning is determining whether neural networks trained using standard procedures are consistent for classification, i.e., whether such models minimize the probability of misclassification for arbitrary data distributions. In this work, we identify and construct an explicit set of neural network classifiers that are consistent. Since effective neural networks in practice are typically both wide and deep, we analyze infinitely wide networks that are also infinitely deep. In particular, using the recent connection between infinitely wide neural networks and neural tangent kernels, we provide explicit activation functions that can be used to construct networks that achieve consistency. Interestingly, these activation functions are simple and easy to implement, yet differ from commonly used activations such as ReLU or sigmoid. More generally, we create a taxonomy of infinitely wide and deep networks and show that these models implement one of three well-known classifiers depending on the activation function used: 1) 1-nearest neighbor (model predictions are given by the label of the nearest training example); 2) majority vote (model predictions are given by the label of the class with the greatest representation in the training set); or 3) singular kernel classifiers (a set of classifiers containing those that achieve consistency). Our results highlight the benefit of using deep networks for classification tasks, in contrast to regression tasks, where excessive depth is harmful.
A Universal Trade-off Between the Model Size, Test Loss, and Training Loss of Linear Predictors [SIMODS, arxiv]
Nikhil Ghosh, Mikhail Belkin
SIAM Journal on Mathematics of Data Science, 5(4), 2023.
+ abstract
In this work we establish an algorithm and distribution independent non-asymptotic trade-off between the model size, excess test loss, and training loss of linear predictors. Specifically, we show that models that perform well on the test data (have low excess loss) are either "classical" -- have training loss close to the noise level, or are "modern" -- have a much larger number of parameters compared to the minimum needed to fit the training data exactly. We also provide a more precise asymptotic analysis when the limiting spectral distribution of the whitened features is Marchenko-Pastur. Remarkably, while the Marchenko-Pastur analysis is far more precise near the interpolation peak, where the number of parameters is just enough to fit the training data, in settings of most practical interest it differs from the distribution independent bound by only a modest multiplicative constant.
On the Inconsistency of Kernel Ridgeless Regression in Fixed Dimensions [SIMODS, arxiv]
Daniel Beaglehole, Mikhail Belkin, Parthe Pandit
SIAM Journal on Mathematics of Data Science, 5(4), 2023.
+ abstract
“Benign overfitting,” the ability of certain algorithms to interpolate noisy training data and yet perform well out-of-sample, has been a topic of considerable recent interest. We show, using a fixed design setup, that an important class of predictors, kernel machines with translation-invariant kernels, does not exhibit benign overfitting in fixed dimensions. In particular, the estimated predictor does not converge to the ground truth with increasing sample size, for any nonzero regression function and any (even adaptive) bandwidth selection. To prove these results, we give exact expressions for the generalization error and its decomposition in terms of an approximation error and an estimation error that elicits a trade-off based on the selection of the kernel bandwidth. Our results apply to commonly used translation-invariant kernels such as Gaussian, Laplace, and Cauchy.
Loss landscapes and optimization in over-parameterized non-linear systems and neural networks [ACHA, arxiv]
Chaoyue Liu, Libin Zhu, Mikhail Belkin
Applied and Computational Harmonic Analysis, 59, 2022.
+ abstract
The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that includes over-parameterized deep neural networks. Our starting observation is that optimization problems corresponding to such systems are generally not convex, even locally. We argue that instead they satisfy PL*, a variant of the Polyak-Lojasiewicz condition, on most (but not all) of the parameter space, which guarantees both the existence of solutions and efficient optimization by (stochastic) gradient descent (SGD/GD). The PL* condition of these systems is closely related to the condition number of the tangent kernel associated to a non-linear system, showing how a PL*-based non-linear theory parallels classical analyses of over-parameterized linear equations. We show that wide neural networks satisfy the PL* condition, which explains the (S)GD convergence to a global minimum. Finally, we propose a relaxation of the PL* condition applicable to "almost" over-parameterized systems.
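For reference, the PL* condition and its connection to the tangent kernel can be sketched as follows, writing $\mathcal{L}$ for the loss, $F$ for the model map, and $K(w) = DF(w)\,DF(w)^\top$ for the tangent kernel (sign and constant conventions may differ slightly from the paper):

```latex
% mu-PL* condition on a set S of parameters:
\|\nabla \mathcal{L}(w)\|^2 \;\ge\; \mu\, \mathcal{L}(w), \qquad w \in S.

% For the square loss L(w) = (1/2)\|F(w) - y\|^2 one has
% \nabla L(w) = DF(w)^\top (F(w) - y), hence
\|\nabla \mathcal{L}(w)\|^2
  = (F(w)-y)^\top K(w)\,(F(w)-y)
  \;\ge\; \lambda_{\min}\!\big(K(w)\big)\,\|F(w)-y\|^2
  = 2\,\lambda_{\min}\!\big(K(w)\big)\,\mathcal{L}(w).
```

So the square loss satisfies PL* with $\mu = 2\lambda_{\min}(K(w))$ wherever the tangent kernel is well conditioned, which is the mechanism behind the (S)GD convergence guarantee for wide networks.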
Simple, fast, and flexible framework for matrix completion with infinite width neural networks [PNAS]
Adityanarayanan Radhakrishnan, George Stefanakis, Mikhail Belkin, Caroline Uhler
PNAS, 119(16), 2022.
+ abstract
Matrix completion is a fundamental problem in machine learning that arises in various applications. We envision that our infinite width neural network framework for matrix completion will be easily deployable and produce strong baselines for a wide range of applications at limited computational costs. We demonstrate the flexibility of our framework through competitive results on virtual drug screening and image inpainting/reconstruction. Simplicity and speed are showcased by the fact that most results in this work require only a central processing unit and commodity hardware. Through its connection to semisupervised learning, our framework provides a principled approach for matrix completion that can be easily applied to problems well beyond those of image completion and virtual drug screening considered in this paper.
Matrix completion problems arise in many applications including recommendation systems, computer vision, and genomics. Increasingly larger neural networks have been successful in many of these applications but at considerable computational costs. Remarkably, taking the width of a neural network to infinity allows for improved computational performance. In this work, we develop an infinite width neural network framework for matrix completion that is simple, fast, and flexible. Simplicity and speed come from the connection between the infinite width limit of neural networks and kernels known as neural tangent kernels (NTK). In particular, we derive the NTK for fully connected and convolutional neural networks for matrix completion. The flexibility stems from a feature prior, which allows encoding relationships between coordinates of the target matrix, akin to semisupervised learning. The effectiveness of our framework is demonstrated through competitive results for virtual drug screening and image inpainting/reconstruction. We also provide an implementation in Python to make our framework accessible on standard hardware to a broad audience.
Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation [Acta Numerica, arxiv]
Mikhail Belkin
Acta Numerica, 30, 203-248, 2021.
+ abstract
In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep learning. The two key themes will be interpolation, and its sibling, over-parameterization. Interpolation corresponds to fitting data, even noisy data, exactly. Over-parameterization enables interpolation and provides flexibility to select a right interpolating model. As we will see, just as a physical prism separates colors mixed within a ray of light, the figurative prism of interpolation helps to disentangle generalization and optimization properties within the complex picture of modern Machine Learning. This article is written with belief and hope that clearer understanding of these issues brings us a step closer toward a general theory of deep learning and machine learning.
Classification vs regression in overparameterized regimes: Does the loss function matter? [JMLR, arxiv]
Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, Anant Sahai
Journal of Machine Learning Research, 22(222):1-69, 2021.
+ abstract
We compare classification and regression tasks in the overparameterized linear model with Gaussian features. On the one hand, we show that with sufficient overparameterization all training points are support vectors: solutions obtained by least-squares minimum-norm interpolation, typically used for regression, are identical to those produced by the hard-margin support vector machine (SVM) that minimizes the hinge loss, typically used for training classifiers. On the other hand, we show that there exist regimes where these solutions are near-optimal when evaluated by the 0-1 test loss function, but do not generalize if evaluated by the square loss function, i.e. they achieve the null risk. Our results demonstrate the very different roles and properties of loss functions used at the training phase (optimization) and the testing phase (generalization).
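The "all training points are support vectors" phenomenon is easy to see numerically. Below is a small illustrative sketch (not the paper's code): with $d \gg n$, a hard-margin SVM without bias, solved here by simple dual coordinate ascent, coincides with the least-squares minimum-norm interpolator, and every point sits exactly on the margin.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 500                       # heavily overparameterized: d >> n
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# hard-margin SVM (no bias term), solved by coordinate ascent on the dual:
# maximize sum(alpha) - 0.5 * alpha^T Q alpha subject to alpha >= 0
Q = (y[:, None] * y[None, :]) * (X @ X.T)
alpha = np.zeros(n)
for _ in range(500):
    for i in range(n):
        alpha[i] = max(0.0, alpha[i] - (Q[i] @ alpha - 1.0) / Q[i, i])
w_svm = X.T @ (alpha * y)

# least-squares minimum-norm interpolator of the +/-1 labels
w_min = X.T @ np.linalg.solve(X @ X.T, y)

# with sufficient overparameterization every training point lies on the
# margin (y_i <x_i, w> = 1), i.e. all points are support vectors, and the
# SVM and min-norm solutions coincide
margins = y * (X @ w_svm)
gap = np.linalg.norm(w_svm - w_min)
```

Here the Gaussian-features setting of the paper is simulated directly; in lower-dimensional regimes the two solutions generally differ.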
Memorization of data in deep neural networks has become a subject of significant research interest. We prove that over-parameterized single layer fully connected autoencoders memorize training data: they produce outputs in (a non-linear version of) the span of the training examples. In contrast to fully connected autoencoders, we prove that depth is necessary for memorization in convolutional autoencoders. Moreover, we observe that adding nonlinearity to deep convolutional autoencoders results in a stronger form of memorization: instead of outputting points in the span of the training images, deep convolutional autoencoders tend to output individual training images. Since convolutional autoencoder components are building blocks of deep convolutional networks, we envision that our findings will shed light on the important phenomenon of memorization in over-parameterized deep networks.
Two models of double descent for weak features [SIMODS, arxiv]
Mikhail Belkin, Daniel Hsu, Ji Xu
SIAM Journal on Mathematics of Data Science, 2(4), 1167-1180, 2020.
+ abstract
The "double descent" risk curve was recently proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models. This article provides a precise mathematical analysis for the shape of this curve in two simple data models with the least squares/least norm predictor. Specifically, it is shown that the risk peaks when the number of features p is close to the sample size n, but also that the risk decreases towards its minimum as p increases beyond n. This behavior is contrasted with that of "prescient" models that select features in an a priori optimal order.
Reconciling modern machine learning practice and the bias-variance trade-off [PNAS, arxiv]
Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal
PNAS, 116(32), 2019.
+ abstract
The question of generalization in machine learning---how algorithms are able to learn predictors from a training sample to make accurate predictions out-of-sample---is revisited in light of the recent breakthroughs in modern machine learning technology. The classical approach to understanding generalization is based on bias-variance trade-offs, where model complexity is carefully calibrated so that the fit on the training sample reflects performance out-of-sample. However, it is now common practice to fit highly complex models like deep neural networks to data with (nearly) zero training error, and yet these interpolating predictors are observed to have good out-of-sample accuracy even for noisy data. How can the classical understanding of generalization be reconciled with these observations from modern machine learning practice? In this paper, we bridge the two regimes by exhibiting a new "double descent" risk curve that extends the traditional U-shaped bias-variance curve beyond the point of interpolation. Specifically, the curve shows that as soon as the model complexity is high enough to achieve interpolation on the training sample---a point that we call the "interpolation threshold"---the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models. The double descent risk curve is demonstrated for a broad range of models, including neural networks and random forests, and a mechanism for producing this behavior is posited.
Eigenvectors of Orthogonally Decomposable Functions [SICOMP, arxiv]
Mikhail Belkin, Luis Rademacher, James Voss
SIAM Journal on Computing, 2018. Short version: COLT 2016.
+ abstract
In this paper, we generalize the eigendecomposition of quadratic forms (symmetric matrices) to a broad class of "orthogonally decomposable" functions. We focus on extending two characterizations of eigenvectors: first, that the eigenvectors of a quadratic form arise from the optima structure of the quadratic form on the sphere, and second, that the eigenvectors are the fixed points of the matrix power iteration. We identify a key role of convexity in extending these characterizations to our setting. The generalized power iteration is a simple first order method which we call gradient iteration. Further, our framework captures as special cases recent methods for inferential problems in machine learning in areas including orthogonal tensor decompositions, Independent Component Analysis (ICA), topic modeling, spectral clustering, and Gaussian mixture learning. We provide a complete theoretical analysis of gradient iteration using the structure theory of discrete dynamical systems to show almost sure convergence and fast (super-linear) convergence rates. The analysis extends to the case when the observed function is only approximately orthogonally decomposable, with bounds that are polynomial in dimension and other relevant parameters, such as perturbation size. Our perturbation results can be considered as a non-linear version of the classical Davis-Kahan theorem for perturbations of eigenvectors of symmetric matrices.
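As a quick illustration of the gradient iteration described above: iterate the normalized gradient map of the objective. For a quadratic form $f(u) = \tfrac12 u^\top A u$ the gradient map is $u \mapsto Au$, so the iteration reduces to classical matrix power iteration and its fixed points are eigenvectors. A minimal sketch (the helper name is ours):

```python
import numpy as np

def gradient_iteration(grad_f, u0, n_iter=2000):
    """Generalized power iteration: repeatedly apply the normalized gradient map."""
    u = u0 / np.linalg.norm(u0)
    for _ in range(n_iter):
        g = grad_f(u)
        u = g / np.linalg.norm(g)
    return u

# sanity check on a quadratic form f(u) = 0.5 * u^T A u, where the
# gradient map u -> A u makes this exactly matrix power iteration
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T                                    # symmetric PSD quadratic form
u = gradient_iteration(lambda v: A @ v, rng.standard_normal(5))
```

For genuinely non-quadratic orthogonally decomposable functions the same normalized-gradient map is used, which is where the super-linear convergence results of the paper apply.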
Laplacian Support Vector Machines Trained in the Primal [pdf, bib, code]
S. Melacci, M. Belkin, Journal of Machine Learning Research, vol. 12, pp. 1149-1184, 2011.
+ abstract
In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown state-of-the-art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. Whereas training a LapSVM in the dual requires two steps, using the primal form allows us to collapse training to a single step. Moreover, the computational complexity of the training algorithm is reduced from O(n^3) to O(n^2) using preconditioned conjugate gradient, where n is the combined number of labeled and unlabeled examples. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large datasets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach.
Polynomial Learning of Distribution Families [arxiv]
M. Belkin, K. Sinha, submitted. Short version, FOCS 2010.
+ abstract
The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. However, despite major progress, the general question of polynomial learnability of Gaussian mixture distributions still remained open. The current work resolves the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call "polynomial families" in low dimension. These families are characterized by their moments being polynomial in parameters and include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that parameters of any distribution belonging to such a family can be learned in polynomial time and using a polynomial number of sample points. The result on learning polynomial families is quite general and is of independent interest. To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimations in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions.
On Learning with Integral Operators [pdf, bib]
L. Rosasco, M. Belkin, E. De Vito, Journal of Machine Learning Research, vol. 11, pp. 905-934, 2010.
+ abstract
A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods, are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold. First, we use a technique based on a concentration inequality for Hilbert spaces to provide new simplified proofs for a number of results in spectral approximation. Second, using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator, extending and strengthening results from [28].
Data spectroscopy: eigenspace of convolution operators and clustering [pdf, bib]
Tao Shi, Mikhail Belkin, Bin Yu
The Annals of Statistics, vol. 37, Number 6B (2009), 3960-3984.
+ abstract
This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the Data Spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.
Convergence of Laplacian Eigenmaps [pdf, bib]
M. Belkin, P. Niyogi
preprint, short version NIPS 2008.
+ abstract
Geometrically based methods for various tasks of data analysis have attracted considerable attention over the last few years. In many of these algorithms, a central role is played by the eigenvectors of the graph Laplacian of a data-derived graph. In this paper, we show that if points are sampled uniformly at random from an unknown submanifold ${\cal M}$ of $\R^N$, then the eigenvectors of a suitably constructed graph Laplacian converge to the eigenfunctions of the Laplace-Beltrami operator on ${\cal M}$. This basic result directly establishes the convergence of spectral manifold learning algorithms such as Laplacian Eigenmaps and Diffusion Maps. It also has implications for the understanding of geometric algorithms in data analysis, computational harmonic analysis, geometric random graphs, and graphics.
Towards a Theoretical Foundation for Laplacian-Based Manifold Methods [pdf, bib]
M. Belkin, P. Niyogi
Journal of Computer and System Sciences, Volume 74, Issue 8, pp. 1289-1308, 2008.
Special Issue on Learning Theory (invited).
+ abstract
In recent years manifold methods have attracted a considerable amount of attention in machine learning. However, most algorithms in that class may be termed "manifold-motivated" as they lack any explicit theoretical guarantees. In this paper we take a step towards closing the gap between theory and practice for a class of Laplacian-based manifold methods. These methods utilize the graph Laplacian associated to a data set for a variety of applications in semi-supervised learning, clustering, and data representation. We show that under certain conditions the graph Laplacian of a point cloud of data samples converges to the Laplace-Beltrami operator on the underlying manifold. Theorem 3.1 contains the first result showing convergence of a random graph Laplacian to the manifold Laplacian in the context of machine learning.
Manifold Regularization: a Geometric Framework for Learning from Labeled and Unlabeled Examples [pdf, bib]
M. Belkin, P. Niyogi, V. Sindhwani
Journal of Machine Learning Research, 7(Nov):2399-2434, 2006.
+ abstract
We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide a theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally, we give a brief discussion of unsupervised and fully supervised learning within our general framework.
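The regularized least squares special case (LapRLS) admits a closed-form solution via the Representer theorem. A minimal sketch is below; the `lap_rls` helper, the fully connected heat-kernel graph, and the exact regularization constants are illustrative assumptions based on our reading, not the paper's reference implementation.

```python
import numpy as np

def lap_rls(X, y_labeled, n_labeled, gamma_a=1e-2, gamma_i=1e-1, bw=1.0):
    """Sketch of Laplacian Regularized Least Squares (LapRLS).

    X stacks the labeled points first, then the unlabeled ones. The returned
    coefficients alpha define f(x) = sum_i alpha_i k(x_i, x) over all points.
    """
    n = X.shape[0]
    l, u = n_labeled, n - n_labeled
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * bw ** 2))              # Gaussian kernel matrix
    W = K.copy()
    np.fill_diagonal(W, 0.0)                     # heat-kernel graph adjacency
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
    J = np.diag(np.r_[np.ones(l), np.zeros(u)])  # selects the labeled points
    Y = np.r_[y_labeled, np.zeros(u)]
    # closed-form expansion coefficients from the Representer theorem
    A = J @ K + gamma_a * l * np.eye(n) + (gamma_i * l / (l + u) ** 2) * (L @ K)
    return np.linalg.solve(A, Y), K
```

The intrinsic term fᵀLf penalizes functions that vary across densely connected regions of the graph, which is how the unlabeled points influence the classifier.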
Consistency of Spectral Clustering [pdf, bib]
U. von Luxburg, M. Belkin, O. Bousquet
The Annals of Statistics, Vol. 36, No. 2, 555-586, 2008.
+ abstract
Consistency is a key property of statistical algorithms when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that for increasing sample size, those eigenvectors converge to the eigenvectors of certain limit operators. As a result we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering.
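The normalized variant favored by this analysis is short to write down. A minimal two-cluster sketch (the function name and the sign-split shortcut in place of k-means are our simplifications):

```python
import numpy as np

def normalized_spectral_bipartition(X, bw=1.0):
    """Two-way normalized spectral clustering via the second eigenvector."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2 * bw ** 2))              # Gaussian similarity graph
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # symmetric normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    evals, evecs = np.linalg.eigh(L_sym)         # ascending eigenvalues
    v = D_inv_sqrt @ evecs[:, 1]                 # second eigenvector, rescaled
    return (v > 0).astype(int)                   # sign split gives the two clusters
```

For k > 2 clusters one keeps the k smallest eigenvectors and runs k-means on the rows; the consistency result concerns precisely these eigenvectors as the sample grows.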
Semi-supervised Learning on Riemannian Manifolds [pdf, bib]
M. Belkin, P. Niyogi
Machine Learning, 56, 209-239, 2004. Invited, Special Issue on Clustering.
+ abstract
We consider the general problem of utilizing both labeled and unlabeled data to improve classification accuracy. Under the assumption that the data lie on a submanifold in a high dimensional space, we develop an algorithmic framework to classify a partially labeled data set in a principled manner. The central idea of our approach is that classification functions are naturally defined only on the submanifold in question rather than the total ambient space. Using the Laplace-Beltrami operator one produces a basis (the Laplacian Eigenmaps) for a Hilbert space of square integrable functions on the submanifold. To recover such a basis, only unlabeled examples are required. Once such a basis is obtained, training can be performed using the labeled data set. Our algorithm models the manifold using the adjacency graph for the data and approximates the Laplace-Beltrami operator by the graph Laplacian. We provide details of the algorithm, its theoretical justification, and several practical applications for image, speech, and text classification.
Laplacian Eigenmaps for Dimensionality Reduction and Data Representation [pdf, bib]
M. Belkin, P. Niyogi
Neural Computation, June 2003; 15(6):1373-1396.
+ abstract
One of the central problems in machine learning and
pattern recognition is to develop appropriate
representations for complex data. We consider the
problem of constructing a representation for data
lying on a low-dimensional manifold embedded in a
high-dimensional space. Drawing on the correspondence
between the graph Laplacian, the Laplace Beltrami
operator on the manifold, and the connections to the
heat equation, we propose a geometrically motivated
algorithm for representing the high-dimensional data.
The algorithm provides a computationally efficient
approach to nonlinear dimensionality reduction that
has locality-preserving properties and a natural
connection to clustering. Some potential applications
and illustrative examples are discussed.
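The pipeline the abstract describes (adjacency graph, graph Laplacian, bottom eigenvectors) can be sketched in a few lines of numpy. This is a simplified illustration with 0/1 edge weights rather than heat-kernel weights, and the parameter values are illustrative choices of mine:

```python
import numpy as np

def laplacian_eigenmaps(X, dim=2, n_neighbors=4):
    """Embed X using eigenvectors of the graph Laplacian (0/1 weights)."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        # Connect each point to its nearest neighbors (index 0 is the point itself).
        W[i, np.argsort(sq[i])[1:n_neighbors + 1]] = 1.0
    W = np.maximum(W, W.T)                    # symmetrize the adjacency graph
    d = W.sum(axis=1)
    L = np.diag(d) - W                        # graph Laplacian L = D - W
    # Solve L f = lambda D f via the symmetric form D^{-1/2} L D^{-1/2}.
    Dinv = np.diag(1.0 / np.sqrt(d))
    _, vecs = np.linalg.eigh(Dinv @ L @ Dinv)
    # Drop the constant eigenvector; keep the next `dim` coordinates.
    return Dinv @ vecs[:, 1:dim + 1]

# Embed 3-D points lying on a 1-D curve into the plane.
t = np.linspace(0.0, 1.0, 40)
X3 = np.stack([np.cos(2 * t), np.sin(2 * t), t], axis=1)
emb = laplacian_eigenmaps(X3, dim=2)
```

Only the unlabeled point cloud is needed to compute the embedding, which is what makes the same construction usable for semi-supervised learning.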
Refereed and Invited Conference Proceedings:
xRFM: Accurate, scalable, and interpretable feature learning models for tabular data [arxiv]
Daniel Beaglehole, David Holzmuller, Adityanarayanan Radhakrishnan, Mikhail Belkin
ICLR 2026.
+ abstract
Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation for modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, best practice for these predictive tasks has remained relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that, compared to other methods, including recently introduced tabular foundation models (TabPFNv2) and GBDTs, xRFM achieves the best performance across regression datasets and is competitive with the best methods across classification datasets, outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.
The Features at Convergence Theorem: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations [arxiv]
Enric Boix-Adsera, Neil Mallinar, James B. Simon, Mikhail Belkin
ICLR 2026.
+ abstract
It is a central challenge in deep learning to understand how neural networks learn representations. A leading approach is the Neural Feature Ansatz (NFA), a conjectured mechanism for how feature learning occurs. Although the NFA is empirically validated, it lacks a theoretical basis, and thus it is unclear when it might fail, and how to improve it. In this paper, we take a first-principles approach to understanding why this observation holds, and when it does not. We use first-order optimality conditions to derive the Features at Convergence Theorem (FACT), an alternative to the NFA that (a) obtains greater agreement with learned features at convergence, (b) explains why the NFA holds in most settings, and (c) captures essential feature learning phenomena in neural networks such as grokking behavior in modular arithmetic and phase transitions in learning sparse parities, similarly to the NFA.
Seeds of Structure: Patch PCA Reveals Universal Compositional Cues in Diffusion Models [NeurIPS]
Qingsong Wang, Zhengchao Wan, Mikhail Belkin, Yusu Wang
NeurIPS 2025.
Fast training of large kernel models with delayed projections [NeurIPS, arxiv]
Amirhesam Abedsoltan, Siyuan Ma, Parthe Pandit, Mikhail Belkin
NeurIPS 2025.
+ abstract
Classical kernel machines have historically faced significant challenges in scaling to large datasets and model sizes--a key ingredient that has driven the success of neural networks. In this paper, we present a new methodology for building kernel machines that can scale efficiently with both data size and model size. Our algorithm introduces delayed projections to Preconditioned Stochastic Gradient Descent (PSGD), allowing the training of much larger models than was previously feasible and pushing the practical limits of kernel-based learning. We validate our algorithm, EigenPro4, across multiple datasets, demonstrating a drastic training speed-up over existing methods while maintaining comparable or better classification accuracy.
Task Generalization with Autoregressive Compositional Structure: Can Learning from D Tasks Generalize to D^T Tasks? [ICML, arxiv]
Amirhesam Abedsoltan, Huaqing Zhang, Kaiyue Wen, Hongzhou Lin, Jingzhao Zhang, Mikhail Belkin
ICML 2025.
Emergence in non-neural models: grokking modular arithmetic via average gradient outer product [ICML, arxiv]
Neil Rohit Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin
ICML 2025.
A Gap Between the Gaussian RKHS and Neural Networks: An Infinite-Center Asymptotic Analysis [COLT, arxiv]
Akash Kumar, Rahul Parhi, Mikhail Belkin
COLT 2025.
Unmemorization in large language models via self-distillation and deliberate imagination [NAACL, arxiv]
Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulic
NAACL 2025.
Average gradient outer product as a mechanism for deep neural collapse [NeurIPS, arxiv]
Daniel Beaglehole, Peter Sukenik, Marco Mondelli, Mikhail Belkin
NeurIPS 2024.
More is Better: when Infinite Overparameterization is Optimal and Overfitting is Obligatory [ICLR]
James B. Simon, Dhruva Karkada, Nikhil Ghosh, Mikhail Belkin
ICLR 2024.
Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning [ICML]
Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin
ICML 2024.
On the Nyström Approximation for Preconditioning in Kernel Machines [AISTATS]
Amirhesam Abedsoltan, Parthe Pandit, Luis Rademacher, Mikhail Belkin
AISTATS 2024.
Uncertainty Estimation with Recursive Feature Machines [UAI]
Daniel Gedon, Amirhesam Abedsoltan, Thomas B. Schön, Mikhail Belkin
UAI 2024.
Restricted strong convexity of deep learning models with smooth activations [ICLR, arxiv]
Arindam Banerjee, Pedro Cisneros-Velarde, Libin Zhu, Mikhail Belkin
ICLR 2023.
+ abstract
We consider the problem of optimization of deep learning models with smooth activation functions. While there exist influential results on the problem from the "near initialization" perspective, we shed considerable new light on the problem. In particular, we make two key technical contributions for such models with L layers, m width, and initialization variance. First, we establish an upper bound on the spectral norm of the Hessian of such models, considerably sharpening prior results. Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC) which holds as long as the squared norm of the average gradient of predictors is sufficiently large for the square loss. The RSC based analysis does not need the "near initialization" perspective and guarantees geometric convergence for gradient descent (GD). To the best of our knowledge, ours is the first result on establishing geometric convergence of GD based on RSC for deep learning models, thus becoming an alternative sufficient condition for convergence that does not depend on the widely-used Neural Tangent Kernel (NTK).
Neural tangent kernel at initialization: linear width suffices [UAI]
Arindam Banerjee, Pedro Cisneros-Velarde, Libin Zhu, Mikhail Belkin
UAI 2023.
Cut your Losses with Squentropy [ICML]
Like Hui, Mikhail Belkin, Stephen Wright
ICML 2023.
Aiming towards the minimizers: fast convergence of SGD for overparametrized problems [NeurIPS]
Chaoyue Liu, Dmitriy Drusvyatskiy, Mikhail Belkin, Damek Davis, Yian Ma
NeurIPS 2023.
Benign, Tempered, or Catastrophic: Toward a Refined Taxonomy of Overfitting [NeurIPS]
Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran
NeurIPS 2022.
Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture [NeurIPS, arxiv]
Libin Zhu, Chaoyue Liu, Mikhail Belkin
NeurIPS 2022.
Risk bounds for over-parameterized maximum margin classification on sub-gaussian mixtures [arxiv]
Y. Cao, Q. Gu, M. Belkin
NeurIPS 2021.
+ abstract
Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" (Bartlett et al. (2020)) phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.
Multiple Descent: Design Your Own Generalization Curve [arxiv]
Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi
NeurIPS 2021.
+ abstract
This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized. We show that the generalization curve can have an arbitrary number of peaks, and moreover, locations of those peaks can be explicitly controlled. Our results highlight the fact that both the classical U-shaped generalization curve and the recently observed double descent curve are not intrinsic properties of the model family. Instead, their emergence is due to the interaction between the properties of the data and the inductive biases of learning algorithms.
Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks [arxiv]
Like Hui, Mikhail Belkin
ICLR 2021.
+ abstract
Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the dominant majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. We argue that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our experiments, performance on nearly all non-vision tasks can be improved, sometimes significantly, by switching to the square loss. We posit that training using the square loss for classification needs to be a part of best practices of modern deep learning on equal footing with cross-entropy.
On the linearity of large non-linear models: when and why the tangent kernel is constant [arxiv]
Chaoyue Liu, Libin Zhu, Mikhail Belkin
NeurIPS 2020.
+ abstract
The goal of this work is to shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We present a general framework for understanding the constancy of the tangent kernel via Hessian scaling applicable to the standard classes of neural networks. Our analysis provides a new perspective on the phenomenon of constant tangent kernel, which is different from the widely accepted "lazy training". Furthermore, we show that the transition to linearity is not a general property of wide neural networks and does not hold when the last layer of the network is non-linear. It is also not necessary for successful optimization by gradient descent.
Accelerating Stochastic Training for Over-parametrized Learning [arxiv]
Chaoyue Liu, Mikhail Belkin
ICLR 2020.
+ abstract
In this paper we introduce MaSS (Momentum-added Stochastic Solver), an accelerated SGD method for optimizing over-parameterized networks. Our method is simple and efficient to implement and does not require changing parameters or computing full gradients in the course of optimization. We provide a detailed theoretical analysis for convergence and parameter selection including their dependence on the mini-batch size in the quadratic case. We also provide theoretical convergence results for a more general convex setting. We provide an experimental evaluation showing strong performance of our method in comparison to Adam and SGD for several standard architectures of deep networks including ResNet, convolutional and fully connected networks. We also show its performance for convex kernel machines.
Kernel machines that adapt to GPUs for effective large batch training [arxiv, code]
Siyuan Ma, Mikhail Belkin
SysML 2019.
+ abstract
Modern machine learning models are typically trained using Stochastic Gradient Descent (SGD) on massively parallel computing resources such as GPUs. Increasing mini-batch size is a simple and direct way to utilize the parallel computing capacity. For small batch sizes, an increase in batch size results in a proportional reduction in training time, a phenomenon known as linear scaling. However, increasing batch size beyond a certain value leads to no further improvement in training time. In this paper we develop the first analytical framework that extends linear scaling to match the parallel computing capacity of a resource. The framework is designed for a class of classical kernel machines. It automatically modifies a standard kernel machine to output a mathematically equivalent prediction function, yet allowing for extended linear scaling, i.e., higher effective parallelization and faster training time on given hardware. The resulting algorithms are accurate, principled and very fast. For example, using a single Titan Xp GPU, training on ImageNet with 1.3×10^6 data points and 1000 labels takes under an hour, while smaller datasets, such as MNIST, take seconds. As the parameters are chosen analytically, based on the theoretical bounds, little tuning beyond selecting the kernel and the kernel parameter is needed, further facilitating the practical use of these methods.
Kernel Machines Beat Deep Neural Networks on Mask-based Single-channel Speech Enhancement [arxiv]
Like Hui, Siyuan Ma, Mikhail Belkin
INTERSPEECH 2019.
+ abstract
We apply a fast kernel method for mask-based single-channel speech enhancement. Specifically, our method solves a kernel regression problem associated to a non-smooth kernel function (exponential power kernel) with a highly efficient iterative method (EigenPro). Due to the simplicity of this method, its hyper-parameters such as kernel bandwidth can be automatically and efficiently selected using line search with subsamples of training data. We observe an empirical correlation between the regression loss (mean square error) and regular metrics for speech enhancement. This observation justifies our training target and motivates us to achieve lower regression loss by training a separate kernel model per frequency subband. We compare our method with state-of-the-art deep neural networks on mask-based enhancement for the HINT and TIMIT datasets. Experimental results show that our kernel method consistently outperforms deep neural networks while requiring less training time.
Does data interpolation contradict statistical optimality? [arxiv]
Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov
AISTATS 2019.
+ abstract
We show that learning methods interpolating the training data can achieve optimal rates for the problems of nonparametric regression and prediction with square loss.
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate [arxiv]
Mikhail Belkin, Daniel Hsu, Partha Mitra
NeurIPS 2018.
+ abstract
Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for "overfitted" / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allow for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and weighted k-nearest-neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems. These schemes have an inductive bias that benefits from higher dimension, a kind of "blessing of dimensionality". Finally, connections to kernel machines, random forests, and adversarial examples in the interpolated regime are discussed.
To understand deep learning we need to understand kernel learning [arxiv]
Mikhail Belkin, Siyuan Ma, Soumik Mandal
ICML 2018.
+ abstract
Generalization performance of classifiers in deep learning has recently become a subject of intense study. Deep models, which are typically heavily over-parametrized, tend to fit the training data exactly. Despite this overfitting, they perform well on test data, a phenomenon not yet fully understood. The first point of our paper is that strong performance of overfitted classifiers is not a unique feature of deep learning. Using six real-world and two synthetic datasets, we establish experimentally that kernel classifiers trained to have zero classification error (overfitting) or zero regression error (interpolation) perform very well on test data.
We proceed to prove lower bounds on the norm of overfitted solutions for smooth kernels, showing that they increase nearly exponentially with data size. Since most generalization bounds depend polynomially on the norm of the solution, this result implies that they diverge as data increases. Furthermore, the existing bounds do not apply to interpolated classifiers.
We also show experimentally that (non-smooth) Laplacian kernels easily fit random labels using a version of SGD, a finding that parallels results recently reported for ReLU neural networks. In contrast, as expected from theory, fitting noisy data requires many more epochs for smooth Gaussian kernels. The observation that the ultimate test performance of overfitted Laplacian and Gaussian classifiers is quite similar suggests that generalization is tied to the properties of the kernel function rather than the optimization process.
We see that some key phenomena of deep learning are manifested similarly in kernel methods in the "modern" overfitted regime. We argue that progress on understanding deep learning will be difficult until more analytically tractable "shallow" kernel methods are better understood. The combination of the experimental and theoretical results presented in this paper indicates a need for new theoretical ideas for understanding properties of classical kernel methods.
The power of interpolation: understanding the effectiveness of SGD in modern over-parametrized learning [arxiv]
Siyuan Ma, Raef Bassily, Mikhail Belkin
ICML 2018.
+ abstract
Stochastic Gradient Descent (SGD) with small mini-batch is a key component in modern large-scale machine learning. However, its efficiency has not been easy to analyze as most theoretical results require adaptive rates and show convergence rates far slower than those for gradient descent, making computational comparisons difficult.
In this paper we aim to clarify the issue of fast SGD convergence. The key observation is that most modern architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on test data, these regimes allow for very fast convergence of SGD, comparable in the number of iterations to gradient descent.
Specifically, consider the setting with quadratic objective function, or near a minimum, where the quadratic term is dominant. We show that: (1) Mini-batch size 1 with constant step size is optimal in terms of computations to achieve a given error. (2) There is a critical mini-batch size such that: (a: linear scaling) SGD iteration with mini-batch size m smaller than the critical size is nearly equivalent to m iterations of mini-batch size 1. (b: saturation) SGD iteration with mini-batch larger than the critical size is nearly equivalent to a gradient descent step.
The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization. It is also nearly independent of the data size, implying O(n) acceleration over GD per unit of computation.
We give experimental evidence on real data, with the results closely following our theoretical analyses.
Finally, we show how the interpolation perspective and our results fit with recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.
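The interpolation setting described above is easy to reproduce on a toy problem. The sketch below (illustrative sizes and step sizes of mine, not values from the paper) runs constant-step-size mini-batch SGD on an over-parameterized, realizable least-squares problem, where the training loss can be driven to numerical zero even with mini-batch size 1:

```python
import numpy as np

# Interpolated regime: over-parameterized least squares (d > n) with
# realizable targets, so an interpolating solution exists and the
# training loss can be driven to zero.
rng = np.random.default_rng(0)
n, d = 50, 200
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = X @ rng.normal(size=d)                  # targets from a planted solution

def final_loss(batch_size, lr, n_steps=20_000):
    """Constant-step-size mini-batch SGD on the square loss."""
    w = np.zeros(d)
    for _ in range(n_steps):
        idx = rng.integers(0, n, size=batch_size)
        w -= lr * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
    return float(np.mean((X @ w - y) ** 2))

# Under interpolation, plain SGD with batch size 1 and a constant step
# size converges to (numerically) zero training loss, just as large
# batches do; no step-size decay is needed.
loss_small_batch = final_loss(batch_size=1, lr=0.5)
loss_large_batch = final_loss(batch_size=n, lr=0.5)
```

This is only the qualitative picture; the paper additionally quantifies the critical mini-batch size at which the linear-scaling regime saturates.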
Approximation beats concentration? An approximation view on inference with smooth kernels [arxiv]
Mikhail Belkin
COLT 2018.
Unperturbed: spectral analysis beyond Davis-Kahan [arxiv]
Justin Eldridge, Mikhail Belkin, Yusu Wang
ALT 2018.
Diving into the shallows: a computational perspective on large-scale shallow learning [arxiv, code]
Siyuan Ma, Mikhail Belkin
NIPS 2017 (spotlight).
Graphons, mergeons, and so on! [arxiv, video]
Justin Eldridge, Mikhail Belkin, Yusu Wang
NIPS 2016.
Clustering with Bregman Divergences: an Asymptotic Analysis [link]
Chaoyue Liu, Mikhail Belkin
NIPS 2016.
Back to the future: Radial Basis Function networks revisited [link]
Qichao Que, Mikhail Belkin
AISTATS 2016.
The Hidden Convexity of Spectral Clustering [arxiv]
James Voss, Mikhail Belkin, Luis Rademacher
AAAI 2016.
Learning Privately from Multiparty Data [arxiv]
Jihun Hamm, Paul Cao, Mikhail Belkin
ICML 2016.
A Pseudo-Euclidean Iteration for Optimal Recovery in Noisy ICA [arxiv]
James Voss, Mikhail Belkin, Luis Rademacher
NIPS 2015.
Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering [link]
Justin Eldridge, Mikhail Belkin, Yusu Wang
COLT 2015.
Crowd-ML: A Privacy-Preserving Learning Framework for a Crowd of Smart Devices
J. Hamm, A. Champion, G. Chen, M. Belkin, D. Xuan
ICDCS 2015.
Learning with Fredholm Kernels [link]
Qichao Que, Mikhail Belkin, Yusu Wang
NIPS 2014.
The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures [arxiv]
Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, James Voss
COLT 2014.
Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis [link, code]
James Voss, Luis Rademacher, Mikhail Belkin
NIPS 2013.
Inverse Density as an Inverse Problem: The Fredholm Equation Approach [arxiv]
Qichao Que, Mikhail Belkin
NIPS 2013.
Blind Signal Separation in the Presence of Gaussian Noise [arxiv]
Mikhail Belkin, Luis Rademacher, James Voss
COLT 2013.
Graph Laplacians on Singular Manifolds: Toward understanding complex spaces: graph Laplacians on manifolds with singularities and boundaries [arxiv]
Mikhail Belkin, Qichao Que, Yusu Wang, Xueyuan Zhou
COLT 2012.
Data Skeletonization via Reeb Graphs
X. Ge, I. Safa, M. Belkin, Y. Wang, NIPS 2011.
+ abstract
Recovering hidden structure from complex and noisy non-linear data is one of the most fundamental
problems in machine learning and statistical inference. While such data is often high-dimensional, it is of
interest to approximate it with a low-dimensional or even one-dimensional space, since many important
aspects of data are often intrinsically low-dimensional. Furthermore, there are many scenarios where
the underlying structure is graph-like, e.g., river/road networks or various trajectories. In this paper, we
develop a framework to extract, as well as to simplify, a one-dimensional "skeleton" from unorganized
data using the Reeb graph. Our algorithm is very simple, does not require complex optimizations and
can be easily applied to unorganized high-dimensional data such as point clouds or proximity graphs.
It can also represent arbitrary graph structures in the data. We also give theoretical results to justify
our method. We provide a number of experiments to demonstrate the effectiveness and generality of
our algorithm, including comparisons to existing methods, such as principal curves. We believe that the
simplicity and practicality of our algorithm will help to promote skeleton graphs as a data analysis tool
for a broad range of applications.
An Iterated Graph Laplacian Approach for Ranking on Manifolds
Xueyuan Zhou, Mikhail Belkin, Nathan Srebro, KDD 2011.
+ abstract
Ranking is one of the key problems in information retrieval.
Recently, there has been significant interest in a class of
ranking algorithms based on the assumption that data is
sampled from a low dimensional manifold embedded in a
higher dimensional Euclidean space.
In this paper, we study a popular graph Laplacian based
ranking algorithm [23] using an analytical method, which
provides theoretical insights into the ranking
algorithm going beyond the intuitive idea of
diffusion. Our analysis
shows that the algorithm is sensitive to a commonly used
parameter due to the use of symmetric normalized graph
Laplacian. We also show that the ranking function may diverge to infinity at the query point in the limit of infinite
samples. To address these issues, we propose an improved
ranking algorithm on manifolds using Green's function of
an iterated unnormalized graph Laplacian, which is more
robust and density adaptive, as well as pointwise
continuous in the limit of infinite samples.
We also empirically explore, for the first time in
the ranking literature, two variants from a family
of twice-normalized graph Laplacians. Experimental
results on text and image data support our analysis
and suggest the potential value of twice-normalized
graph Laplacians in practice.
Semi-supervised Learning by Higher Order Regularization [link]
X. Zhou, M. Belkin, AISTATS 2011.
+ abstract
Solutions of several graph Laplacian regularization based semi-supervised learning algorithms were recently shown by Nadler et al. (2009) to degenerate to constant functions with "spikes" at labeled points given infinite unlabeled points in $\R^d$ for $d\ge 2$. These algorithms solve optimization problems by penalizing an energy term $\int_{\Omega}\|\nabla f(x)\|^2 dx$. In this paper, we address this problem by using regularization based on the iterated Laplacian, which is equivalent to a higher order Sobolev semi-norm. Alternatively, it can be viewed as a generalization of the thin plate spline to an unknown domain in high dimension. We also discuss relationships between Reproducing Kernel Hilbert Spaces and Green's functions. Experimental results support our analysis by showing consistently improved results by using iterated Laplacians.
Polynomial Learning of Distribution Families [arxiv, bib]
M. Belkin, K. Sinha, FOCS 2010.
+ abstract
The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. However, despite major progress, the general question of polynomial learnability of Gaussian mixture distributions still remained open. The current work resolves the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call "polynomial families" in low dimension. These families are characterized by their moments being polynomial in parameters and include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that parameters of any distribution belonging to such a family can be learned in polynomial time and using a polynomial number of sample points. The result on learning polynomial families is quite general and is of independent interest. To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimations in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions.
In recent years analysis of complexity of learning Gaussian mixture models from sampled data
has received significant attention in computational machine learning and theory communities. In
this paper we present the first result showing that polynomial time learning of multidimensional
Gaussian Mixture distributions is possible when the separation between the component means is
arbitrarily small. Specifically, we present an algorithm for learning the parameters of a mixture of k
identical spherical Gaussians in n-dimensional space with an arbitrarily small separation between
the components, which is polynomial in dimension, inverse component separation and other input
parameters for a fixed number of components k. The algorithm uses a projection to k dimensions
and then a reduction to the 1-dimensional case. It relies on a theoretical analysis showing that
two 1-dimensional mixtures whose densities are close in the L2 norm must have similar means
and mixing coefficients. To produce the necessary lower bound for the L2 norm in terms of the
distances between the corresponding means, we analyze the behavior of the Fourier transform of a
mixture of Gaussians in one dimension around the origin, which turns out to be closely related to
the properties of the Vandermonde matrix obtained from the component means. Analysis of minors
of the Vandermonde matrix together with basic function approximation results allows us to provide
a lower bound for the norm of the mixture in the Fourier domain and hence a bound in the original
space. Additionally, we present a separate argument for reconstructing variance.
Semi-supervised Learning using Sparse Eigenfunction Bases [pdf, bib]
K. Sinha, M. Belkin, NIPS 2009.
+ abstract
We present a new framework for semi-supervised learning with sparse eigenfunction
bases of kernel matrices. It turns out that when the data has clustered, that
is, when the high density regions are suf.ciently separated by low density valleys,
each high density area corresponds to a unique representative eigenvector.
Linear combination of such eigenvectors (or, more precisely, of their Nystrom
extensions) provide good candidates for good classi.cation functions when the
cluster assumption holds. By .rst choosing an appropriate basis of these eigenvectors
from unlabeled data and then using labeled data with Lasso to select a
classi.er in the span of these eigenvectors, we obtain a classi.er, which has a very
sparse representation in this basis. Importantly, the sparsity corresponds naturally
to the cluster assumption.
Experimental results on a number of real-world data-sets show that our method
is competitive with the state of the art semi-supervised learning algorithms and
outperforms the natural base-line algorithm (Lasso in the Kernel PCA basis).
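A toy numpy sketch of the underlying phenomenon (with invented data and bandwidth, and ordinary least squares standing in for the paper's Lasso step): on two well-separated clusters, the two leading kernel eigenvectors are each supported on one cluster, so a classifier fit on a handful of labeled points in their span labels the whole cloud.

```python
import numpy as np
rng = np.random.default_rng(0)

# two well-separated 1-D clusters; labels follow the clusters
X = np.concatenate([rng.normal(-4, 0.5, 50), rng.normal(4, 0.5, 50)])
y = np.array([-1.0] * 50 + [1.0] * 50)

K = np.exp(-(X[:, None] - X[None, :])**2 / 2)   # Gaussian kernel matrix
_, V = np.linalg.eigh(K)
B = V[:, -2:]                                   # two leading eigenvectors, one per cluster

labeled = [0, 50, 99]                           # a handful of labeled points
c, *_ = np.linalg.lstsq(B[labeled], y[labeled], rcond=None)
pred = np.sign(B @ c)                           # extend to every unlabeled point
acc = (pred == y).mean()
```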
A Note on Learning with Integral Operators
L. Rosasco, M. Belkin, E. De Vito, COLT 2009.
+ abstract
A large number of learning algorithms, for example,
spectral clustering, kernel Principal Components
Analysis and many manifold methods are
based on estimating eigenvalues and eigenfunctions
of operators defined by a similarity function or a
kernel, given empirical data. Thus for the analysis
of algorithms, it is an important problem to
be able to assess the quality of such approximations.
The contribution of our paper is two-fold.
First, we use a technique based on a concentration
inequality for Hilbert spaces to provide new simplified
proofs for a number of results in spectral
approximation. Second, using these methods we
provide several new results for estimating spectral
properties of the graph Laplacian operator extending
and strengthening results from [28].
Constructing Laplace Operator from Point Clouds in R^d
[pdf, bib, code]
M. Belkin, J. Sun, Y. Wang, SODA 2009, pp. 1031--1040.
+ abstract
We present an algorithm for approximating the Laplace-
Beltrami operator from an arbitrary point cloud obtained
from a k-dimensional manifold embedded in the d-
dimensional space. We show that this PCD Laplace (Point-
Cloud Data Laplace) operator converges to the Laplace-
Beltrami operator on the underlying manifold as the point
cloud becomes denser. Unlike previous work, we do
not assume that the data samples are drawn independently
and identically from a probability distribution and do
not require a global mesh. The resulting algorithm is easy
to implement. We present experimental results indicating
that even for point sets sampled from a uniform distribution,
PCD Laplace converges faster than the weighted graph
Laplacian. We also show that using our PCD Laplacian we
can directly estimate certain geometric invariants, such as
manifold area.
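A minimal numpy sketch of the heat-kernel construction underlying such operators (a plain normalized heat-kernel Laplacian on a dense grid, not the paper's PCD algorithm with its local mesh step; the grid size and the parameter t are arbitrary): on the unit circle, sin(θ) satisfies Δ sin = -sin, so the discrete operator should approximately reproduce sin.

```python
import numpy as np

n, t = 1000, 0.01
theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
P = np.column_stack([np.cos(theta), np.sin(theta)])   # point cloud on the unit circle in R^2
f = np.sin(theta)                                     # eigenfunction: Delta f = -f on the circle

D2 = ((P[:, None, :] - P[None, :, :])**2).sum(-1)     # squared chordal distances
W = np.exp(-D2 / (4 * t))                             # heat-kernel weights
Lf = (f - (W @ f) / W.sum(1)) / t                     # approximates -Delta f = sin(theta)

err = np.abs(Lf - np.sin(theta)).max()
```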
Data Spectroscopy: Learning Mixture Models using Eigenspaces of
Convolution Operators
[pdf, bib]
T. Shi, M. Belkin, B. Yu, ICML 2008.
+ abstract
In this paper we develop a spectral framework
for estimating mixture distributions,
specifically Gaussian mixture models. In
physics, spectroscopy is often used for the
identification of substances through their
spectrum. Treating a kernel function K(x, y)
as "light" and the sampled data as "substance",
the spectrum of their interaction
(eigenvalues and eigenvectors of the kernel
matrix K) unveils certain aspects of the underlying
parametric distribution p, such as
the parameters of a Gaussian mixture. Our
approach extends the intuitions and analyses
underlying the existing spectral techniques,
such as spectral clustering and Kernel Principal
Components Analysis (KPCA).
We construct algorithms to estimate parameters
of Gaussian mixture models, including
the number of mixture components, their
means and covariance matrices, which are important
in many practical applications. We
provide a theoretical framework and show encouraging
experimental results.
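A toy illustration of the spectroscopy idea in numpy (made-up data and bandwidth; the paper's actual estimators are more refined): for a well-separated two-component mixture, each of the two leading kernel eigenvectors localizes on one component, and the sample where it peaks lies near that component's mean.

```python
import numpy as np
rng = np.random.default_rng(1)

# two-component Gaussian mixture, well separated relative to the kernel bandwidth
X = np.concatenate([rng.normal(-2, 0.4, 300), rng.normal(2, 0.4, 300)])
K = np.exp(-(X[:, None] - X[None, :])**2 / (2 * 0.3**2))
_, V = np.linalg.eigh(K)

# each leading eigenvector is a bump supported on one component;
# its peak location gives a crude estimate of that component's mean
mu_hat = np.sort(X[np.abs(V[:, -2:]).argmax(axis=0)])
```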
Discrete Laplace Operator for Meshed Surfaces
[pdf, bib]
M. Belkin, J. Sun, Y. Wang, SOCG 2008.
+ abstract
In recent years a considerable amount of work in graphics and geometric optimization used tools
based on the Laplace-Beltrami operator on a surface. The applications of the Laplacian include mesh
editing, surface smoothing, and shape interpolations among others. However, it has been shown [12, 23,
25] that the popular cotangent approximation schemes do not provide convergent point-wise (or even
L^2) estimates, while many applications rely on point-wise estimation. Existence of such schemes has
been an open question [12].
In this paper we propose the first algorithm for approximating the Laplace operator of a surface from
a mesh with point-wise convergence guarantees applicable to arbitrary meshed surfaces. We show that
for a sufficiently fine mesh over an arbitrary surface, our mesh Laplacian is close to the Laplace-Beltrami
operator on the surface at every point of the surface.
Moreover, the proposed algorithm is simple and easily implementable. Experimental evidence shows
that our algorithm exhibits convergence empirically and outperforms cotangent-based methods in providing
accurate approximation of the Laplace operator for various meshes.
Component Based Shape Retrieval Using Differential Profiles.
L. Ding, M. Belkin, Proceedings of ACM International Conference on Multimedia Information Retrieval, 2008.
+ abstract
In this paper, we describe the use of differential profiles, which are computed from 2D shapes smoothed with Gaussian
functions, as the shape features for building a shape retrieval system. We build a global shape component dictionary
from all the shape descriptors collected from shapes available in a database and then represent each shape as a
probabilistic mixture of elements from such a dictionary. Finally, shape retrieval from a given database is simply
done by comparing the mixing coefficients of the model of a query shape and those of known shapes. Our retrieval
experiments are done on both object contour and line drawing collections and show promising results.
The value of labeled and unlabeled examples when the model is imperfect
[pdf, bib]
K. Sinha, M. Belkin, NIPS 2007.
+ abstract
Semi-supervised learning, i.e. learning from both labeled and unlabeled data, has
received significant attention in the machine learning literature in recent years.
Still our understanding of the theoretical foundations of the usefulness of unlabeled
data remains somewhat limited. The simplest and the best understood situation
is when the data is described by an identifiable mixture model, and where
each class comes from a pure component. This natural setup and its implications
were analyzed in [11, 5]. One important result was that in certain regimes, labeled
data becomes exponentially more valuable than unlabeled data.
However, in most realistic situations, one would not expect that the data comes
from a parametric mixture distribution with identifiable components. There have
been recent efforts to analyze the non-parametric situation; for example, "cluster"
and manifold assumptions have been suggested as a basis for analysis. Still,
a satisfactory and fairly complete theoretical understanding of the nonparametric
problem, similar to that in [11, 5], has not yet been developed.
In this paper we investigate an intermediate situation, when the data comes from a
probability distribution which can be modeled, but not perfectly, by an identifiable
mixture distribution. This seems applicable to many situations where, for example,
a mixture of Gaussians is used to model the data. The contribution of this paper is
an analysis of the role of labeled and unlabeled data depending on the amount of
imperfection in the model.
Convergence of Laplacian
Eigenmaps [long tech report]
M. Belkin, P. Niyogi, NIPS 2006.
On the Relation Between Low Density Separation,
Spectral Clustering and Graph Cuts [pdf, bib]
H. Narayanan, M. Belkin, P. Niyogi, NIPS 2006.
+ abstract
One of the intuitions underlying many graph-based
methods for clustering and semi-supervised learning,
is that class or cluster boundaries pass through
areas of low probability density. In this paper we
provide some formal analysis of that notion for a
probability distribution. We introduce a notion of
weighted boundary volume, which measures the length
of the class/cluster boundary weighted by the density
of the underlying probability distribution. We show
that sizes of the cuts of certain commonly used data
adjacency graphs converge to this continuous weighted
volume of the boundary.
Heat Flow and a Faster Algorithm to Compute the
Surface Area of a Convex Body [pdf, bib]
M. Belkin, H. Narayanan, P. Niyogi, FOCS 2006.
+ abstract
We draw on the observation that the amount of heat
diffusing outside of a heated body in a short period
of time is proportional to its surface area, to
design a simple algorithm for approximating the
surface area of a convex body given by a membership
oracle. Our method has a complexity of O*(n^4), where
n is the dimension, compared to O*(n^8.5) for the
previous best algorithm. We show that our complexity
cannot be improved given the current state-of-the-art
in volume estimation.
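The observation itself can be checked in a few lines of numpy (a Monte Carlo toy with made-up parameters, not the paper's algorithm or its complexity analysis): for small t, the fraction of uniformly placed heat "particles" carried outside the body by a Gaussian step of variance 2t is about sqrt(t/pi) * Area/Volume, which turns the escape rate into an area estimate. For the unit disk (Volume pi, perimeter 2*pi):

```python
import numpy as np
rng = np.random.default_rng(0)

t, n = 1e-4, 200_000
# membership oracle for the unit disk; rejection-sample uniform interior points
cand = rng.uniform(-1, 1, size=(4 * n, 2))
pts = cand[(cand**2).sum(1) <= 1][:n]

# one heat step: Gaussian displacement of variance 2t per coordinate
stepped = pts + rng.normal(scale=np.sqrt(2 * t), size=pts.shape)
escaped = ((stepped**2).sum(1) > 1).mean()

# escaped fraction ~ sqrt(t/pi) * Area / Volume for small t
area = escaped * np.pi * np.sqrt(np.pi / t)   # true perimeter is 2*pi
```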
Maximum Margin Semi-Supervised Learning for Structured
Variables [pdf, bib]
Y. Altun, D. McAllester, M. Belkin, NIPS 2005.
+ abstract
Many real-world classification problems involve the
prediction of multiple inter-dependent variables
forming some structural dependency. Recent progress
in machine learning has mainly focused on supervised
classification of such structured variables. In this
paper, we investigate structured classification in a
semi-supervised setting. We present a discriminative
approach that utilizes the intrinsic geometry of
input patterns revealed by unlabeled data points and
we derive a maximum-margin formulation of
semi-supervised learning for structured variables.
Unlike transductive algorithms, our formulation
naturally extends to new test points.
Beyond the Point Cloud: from Transductive to
Semi-supervised Learning [pdf, bib]
V. Sindhwani, P. Niyogi, M. Belkin, ICML 2005.
+ abstract
Due to its occurrence in engineering domains and
implications for natural learning, the problem of
utilizing unlabeled data is attracting increasing
attention in machine learning. A large body of recent
literature has focussed on the transductive setting
where labels of unlabeled examples are estimated by
learning a function defined only over the point cloud
data. In a truly semi-supervised setting however, a
learning machine has access to labeled and unlabeled
examples and must make predictions on data points
never encountered before. In this paper, we show how
to turn transductive and standard supervised learning
algorithms into semi-supervised learners. We
construct a family of data-dependent norms on
Reproducing Kernel Hilbert Spaces (RKHS). These norms
allow us to warp the structure of the RKHS to reflect
the underlying geometry of the data. We derive
explicit formulas for the corresponding new kernels.
Our approach demonstrates state of the art
performance on a variety of classification tasks.
A Co-Regularization Approach to Semi-supervised
Learning with Multiple Views [pdf]
V. Sindhwani, P. Niyogi, M. Belkin,
ICML Workshop on Learning with Multiple Views, 2005.
+ abstract
The Co-Training algorithm uses unlabeled
examples in multiple views to bootstrap classifiers in each view, typically in a greedy
manner, and operating under assumptions
of view-independence and compatibility. In
this paper, we propose a Co-Regularization
framework where classifiers are learnt in each
view through forms of multi-view regularization. We propose algorithms within this
framework that are based on optimizing measures of agreement and smoothness over la-
beled and unlabeled examples. These algorithms naturally extend standard
regularization methods like Support Vector Machines
(SVM) and Regularized Least squares (RLS)
for multi-view semi-supervised learning, and
inherit their benefits and applicability to
high-dimensional classification problems. An
empirical investigation is presented that confirms the promise of this approach.
Linear Manifold Regularization for Large Scale
Semi-supervised Learning
V. Sindhwani, P. Niyogi, M. Belkin, S. Keerthi
ICML Workshop on Learning with Partially Classified
Training Data, 2005
Towards a Theoretical Foundation for Laplacian-Based
Manifold Methods [pdf, bib]
M. Belkin, P. Niyogi, COLT 2005
On Manifold Regularization [pdf]
M. Belkin, P. Niyogi, V. Sindhwani, AISTATS 2005
Limits of Spectral Clustering [pdf] (Outstanding Student Paper Award)
U. von Luxburg, O. Bousquet, M. Belkin, NIPS 2004.
Regularization and Semi-supervised Learning on Large
Graphs [pdf, bib]
M. Belkin, I. Matveeva, P. Niyogi, COLT 2004.
On the Convergence of Spectral Clustering on Random
Samples: the Normalized Case [pdf]
U. von Luxburg, O. Bousquet, M. Belkin, COLT 2004.
Tikhonov Regularization and Semi-supervised Learning
on Large Graphs
M. Belkin, I. Matveeva, P. Niyogi
ICASSP 2004, Special Session: Manifolds and Geometry in
Signal Processing (Invited paper).
Using Manifold Structure for Partially Labeled
Classification [pdf]
M. Belkin, P. Niyogi, NIPS 2002
Laplacian Eigenmaps and Spectral Techniques for
Embedding and Clustering [pdf]
M. Belkin, P. Niyogi, NIPS 2001.
Using Eigenvectors of the Bigram Graph to Infer
Morpheme Identity [pdf]
M. Belkin, J. Goldsmith
Morphological and Phonological Learning: Proceedings of
the 6th Workshop
of the ACL Special Interest Group in Computational
Phonology (SIGPHON), Philadelphia, July 2002, pp.
41-47.
Preprints and Technical Reports:
Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing [arxiv]
Parsa Mirtaheri, Mikhail Belkin
2026.
+ abstract
Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint - an instance of motivated reasoning. We study this phenomenon across multiple LLM families and datasets demonstrating that motivated reasoning can be identified by probing internal activations even in cases when it cannot be easily determined from CoT. Using supervised probes trained on the model's residual stream, we show that (i) pre-generation probes, applied before any CoT tokens are generated, predict motivated reasoning as well as a LLM-based CoT monitor that accesses the full CoT trace, and (ii) post-generation probes, applied after CoT generation, outperform the same monitor. Together, these results show that motivated reasoning is detected more reliably from internal representations than from CoT monitoring. Moreover, pre-generation probing can flag motivated behavior early, potentially avoiding unnecessary generation.
General and Efficient Steering of Unconditional Diffusion [arxiv]
Qingsong Wang, Mikhail Belkin, Yusu Wang
2026.
+ abstract
Guiding unconditional diffusion models typically requires either retraining with conditional inputs or per-step gradient computations (e.g., classifier-based guidance), both of which incur substantial computational overhead. We present a general recipe for efficiently steering unconditional diffusion without gradient guidance during inference, enabling fast controllable generation. Our approach is built on two observations about diffusion model structure: Noise Alignment: even in early, highly corrupted stages, coarse semantic steering is possible using a lightweight, offline-computed guidance signal, avoiding any per-step or per-sample gradients. Transferable concept vectors: a concept direction in activation space once learned transfers across both timesteps and samples; the same fixed steering vector learned near low noise level remains effective when injected at intermediate noise levels for every generation trajectory, providing refined conditional control with efficiency. Such concept directions can be efficiently and reliably identified via Recursive Feature Machine (RFM), a light-weight backpropagation-free feature learning method. Experiments on CIFAR-10, ImageNet, and CelebA demonstrate improved accuracy/quality over gradient-based guidance, while achieving significant inference speedups.
Recent advances in machine learning have led to increased interest in reproducing kernel Banach spaces (RKBS) as a more general framework that extends beyond reproducing kernel Hilbert spaces (RKHS). These works have resulted in the formulation of representer theorems under several regularized learning schemes. However, little is known about an optimization method that encompasses these results in this setting. In this paper, we address a learning problem on Banach spaces endowed with a reproducing kernel. To tackle this challenge, we propose an algorithm based on Mirror Descent. Our approach involves an iterative method that employs gradient steps in the dual space of the Banach space using the reproducing kernel. We analyze the convergence properties of the algorithm under various assumptions and establish two types of results: first, we identify conditions under which a linear convergence rate is achievable; second, we demonstrate a standard convergence rate in a constrained setting. Moreover, to instantiate the algorithm in practice, we introduce a novel family of RKBSs with p-norm (p ≠ 2), characterized by both an explicit dual map and a kernel.
Context-Scaling versus Task-Scaling in In-Context Learning [arxiv]
Amirhesam Abedsoltan, Adityanarayanan Radhakrishnan, Jingfeng Wu, Mikhail Belkin
2024.
+ abstract
Transformers exhibit In-Context Learning (ICL), where models solve new tasks by using examples in the prompt without additional training. We identify and analyze two key components of ICL: (1) context-scaling, where model performance improves as the number of in-context examples increases, and (2) task-scaling, where model performance improves as the number of pre-training tasks increases. While transformers are capable of both context-scaling and task-scaling, standard MLPs with vectorized input are empirically shown to be only capable of task-scaling. We propose a simplified transformer (SGPT) that exhibits both context and task-scaling and is competitive with GPT-2 on a range of ICL tasks. By analyzing a single layer of the proposed model, we identify classes of feature maps that enable context scaling and show that they can implement the Hilbert estimate, a model that is provably consistent for context-scaling. We further show that using the output of the Hilbert estimate along with vectorized input empirically enables both context-scaling and task-scaling with MLPs, providing a simple setting to study context and task-scaling for ICL.
Mechanism of feature learning in convolutional neural networks [arxiv]
Daniel Beaglehole, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin
2023.
+ abstract
Understanding the mechanism of how convolutional neural networks learn features from image data is a fundamental problem in machine learning and computer vision. In this work, we identify such a mechanism. We posit the Convolutional Neural Feature Ansatz, which states that covariances of filters in any convolutional layer are proportional to the average gradient outer product (AGOP) taken with respect to patches of the input to that layer. We present extensive empirical evidence for our ansatz, including identifying high correlation between covariances of filters and patch-based AGOPs for convolutional layers in standard neural architectures, such as AlexNet, VGG, and ResNets pre-trained on ImageNet. We also provide supporting theoretical evidence. We then demonstrate the generality of our result by using the patch-based AGOP to enable deep feature learning in convolutional kernel machines. We refer to the resulting algorithm as (Deep) ConvRFM and show that our algorithm recovers similar features to deep convolutional networks including the notable emergence of edge detectors. Moreover, we find that Deep ConvRFM overcomes previously identified limitations of convolutional kernels, such as their inability to adapt to local signals in images and, as a result, leads to sizable performance improvement over fixed convolutional kernels.
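The average gradient outer product itself is easy to state in code. A stripped-down numpy sketch (a single non-convolutional "neuron" with an invented direction, not the paper's patch-based construction): for f(x) = tanh(w.x), every gradient is a multiple of w, so the AGOP is rank one and its top eigenvector recovers the feature direction.

```python
import numpy as np
rng = np.random.default_rng(0)

d = 10
w = rng.normal(size=d); w /= np.linalg.norm(w)        # hidden feature direction
X = rng.normal(size=(5000, d))

# f(x) = tanh(w . x), so grad f(x) = (1 - tanh(w . x)^2) * w
grads = (1 - np.tanh(X @ w)**2)[:, None] * w[None, :]
agop = grads.T @ grads / len(X)                       # average gradient outer product

top = np.linalg.eigh(agop)[1][:, -1]                  # top eigenvector of the AGOP
align = abs(top @ w)                                  # ~1: AGOP recovers the direction
```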
Linear Convergence and Implicit Regularization of Generalized Mirror Descent with Time-Dependent Mirrors [arxiv]
Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler
2020.
+ abstract
The following questions are fundamental to understanding the properties of over-parameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum? (2) What form of implicit regularization occurs through training? While significant progress has been made in answering both of these questions for gradient descent, they have yet to be answered more completely for general optimization methods. In this work, we establish sufficient conditions for linear convergence and obtain approximate implicit regularization results for generalized mirror descent (GMD), a generalization of mirror descent with a possibly time-dependent mirror. GMD subsumes popular first order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad. By using the Polyak-Lojasiewicz inequality, we first present a simple analysis under which non-stochastic GMD converges linearly to a global minimum. We then present a novel, Taylor-series based analysis to establish sufficient conditions for linear convergence of stochastic GMD. As a corollary, our result establishes sufficient conditions and provides learning rates for linear convergence of stochastic mirror descent and Adagrad. Lastly, we obtain approximate implicit regularization results for GMD by proving that GMD converges to an interpolating solution that is approximately the closest interpolating solution to the initialization in l2-norm in the dual space, thereby generalizing the result of Azizan, Lale, and Hassibi (2019) in the full batch setting.
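For the special case of gradient descent on underdetermined least squares (GMD with the identity mirror), the kind of implicit-bias statement in the abstract can be checked directly in numpy (a toy sketch with arbitrary problem sizes): the iterates converge to the interpolating solution closest to the initialization in l2.

```python
import numpy as np
rng = np.random.default_rng(0)

A = rng.normal(size=(20, 50))            # underdetermined: infinitely many interpolators
b = rng.normal(size=20)
x0 = rng.normal(size=50)

x = x0.copy()
eta = 1 / np.linalg.norm(A, 2)**2        # step size 1/L
for _ in range(5000):
    x = x - eta * A.T @ (A @ x - b)      # plain gradient descent on ||Ax - b||^2 / 2

# closest interpolating solution to the initialization in l2
closest = x0 + np.linalg.pinv(A) @ (b - A @ x0)
```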
Consistency of Spectral Clustering [pdf]
U. von Luxburg, M. Belkin, O. Bousquet
Max Planck Institute for Biological Cybernetics Technical
Report TR-134, 2004. (The Annals of Statistics, to
appear)
+ abstract
Consistency is a key property of statistical
algorithms when the data is drawn from some
underlying probability distribution. Surprisingly,
despite decades of work, little is known about
consistency of most clustering algorithms. In this
paper we investigate consistency of the popular
family of spectral clustering algorithms, which
clusters the data with the help of eigenvectors of
graph Laplacian matrices. We develop new methods to
establish that for increasing sample size, those
eigenvectors converge to the eigenvectors of certain
limit operators. As a result we can prove that one of
the two major classes of spectral clustering
(normalized clustering) converges under very general
conditions, while the other (unnormalized clustering)
is only consistent under strong additional
assumptions, which are not always satisfied in real
data. We conclude that our analysis provides strong
evidence for the superiority of normalized spectral
clustering.
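The normalized algorithm that the analysis favors fits in a few lines of numpy (a toy sketch with invented data; the bandwidth and cluster geometry are arbitrary): build a Gaussian affinity graph, form the normalized graph Laplacian, and split the data by the sign of its second eigenvector.

```python
import numpy as np
rng = np.random.default_rng(0)

X = np.concatenate([rng.normal(-2, 0.4, (100, 2)), rng.normal(2, 0.4, (100, 2))])
W = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2)    # Gaussian affinities
d = W.sum(1)
L = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))          # normalized graph Laplacian

_, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)                     # sign of the 2nd eigenvector

truth = np.array([0] * 100 + [1] * 100)
acc = max((labels == truth).mean(), (labels != truth).mean())
```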
Manifold Regularization: a Geometric Framework for
Learning from Examples
M. Belkin, P. Niyogi, V. Sindhwani
University of Chicago CS Technical Report TR-2004-06,
2004.
Regression and Regularization on Large Graphs
[link]
M. Belkin, I. Matveeva, P. Niyogi
The University of Chicago CS Technical Report
TR-2003-11.
Semi-supervised learning on manifolds
[link]
M. Belkin, P. Niyogi
The University of Chicago CS Technical Report
TR-2002-12.
Laplacian Eigenmaps for Dimensionality Reduction and
Data Representation
[link]
M. Belkin, P. Niyogi
The University of Chicago CS Technical Report TR-2002-01.
The Geometric Basis of Semi-supervised Learning [pdf]
V. Sindhwani, M. Belkin, P. Niyogi.
Book chapter in "Semi-supervised Learning", Chapelle, O.,
B. Schölkopf and A. Zien, Eds., MIT Press, 2006.