Publications of Mikhail Belkin

[Journal papers] -- [Conference papers] -- [Ph.D. Thesis] -- [Book Chapter] -- [Technical Reports]

Journal Papers and preprints:

Laplacian Support Vector Machines Trained in the Primal [pdf, bib, code]
S. Melacci, M.Belkin,
Journal of Machine Learning Research, vol.12, pp. 1149-1184, 2011.

+ abstract

In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi--supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. Whereas training a LapSVM in the dual requires two steps, using the primal form allows us to collapse training to a single step. Moreover, the computational complexity of the training algorithm is reduced from O(n^3) to O(n^2) using preconditioned conjugate gradient, where n is the combined number of labeled and unlabeled examples. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large datasets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach.

Polynomial Learning of Distribution Families [arxiv]
M. Belkin, K. Sinha, submitted. Short version, FOCS 2010.

+ abstract: The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. However, despite major progress, the general question of polynomial learnability of Gaussian mixture distributions still remained open. The current work resolves the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call "polynomial families" in low dimension. These families are characterized by their moments being polynomial in parameters and include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that parameters of any distribution belonging to such a family can be learned in polynomial time and using a polynomial number of sample points. The result on learning polynomial families is quite general and is of independent interest. To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimations in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions.

On Learning with Integral Operators [pdf, bib]
L. Rosasco, M.Belkin, E. De Vito,
Journal of Machine Learning Research, vol.11, pp.905-934, 2010.

+ abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold. First, we use a technique based on a concentration inequality for Hilbert spaces to provide new simplified proofs for a number of results in spectral approximation. Second, using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from [28].

Data spectroscopy: eigenspace of convolution operators and clustering [pdf, bib]
Tao Shi, Mikhail Belkin, Bin Yu
The Annals of Statistics, vol. 37, Number 6B (2009), 3960-3984.

+ abstract: This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the Data Spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.

Convergence of Laplacian Eigenmaps [pdf, bib]
M. Belkin, P. Niyogi
preprint, short version NIPS 2008.

+ abstract: Geometrically based methods for various tasks of data analysis have attracted considerable attention over the last few years. In many of these algorithms, a central role is played by the eigenvectors of the graph Laplacian of a data-derived graph. In this paper, we show that if points are sampled uniformly at random from an unknown submanifold ${\cal M}$ of $\R^N$, then the eigenvectors of a suitably constructed graph Laplacian converge to the eigenfunctions of the Laplace Beltrami operator on ${\cal M}$. This basic result directly establishes the convergence of spectral manifold learning algorithms such as Laplacian Eigenmaps and Diffusion Maps. It also has implications for the understanding of geometric algorithms in data analysis, computational harmonic analysis, geometric random graphs, and graphics.

Towards a Theoretical Foundation for Laplacian-Based Manifold Methods [pdf, bib]
M. Belkin, P. Niyogi
Journal of Computer and System Sciences, 2008.
Volume 74 , Issue 8, pp. 1289-1308 Special Issue on Learning Theory (invited).

+ abstract: In recent years manifold methods have attracted a considerable amount of atten- tion in machine learning. However most algorithms in that class may be termed "manifold-motivated" as they lack any explicit theoretical guarantees. In this pa- per we take a step towards closing the gap between theory and practice for a class of Laplacian-based manifold methods. These methods utilize the graph Laplacian associated to a data set for a variety of applications in semi-supervised learning, clustering, data representation. We show that under certain conditions the graph Laplacian of a point cloud of data samples converges to the Laplace-Beltrami operator on the underlying mani- fold. Theorem 3.1 contains the .rst result showing convergence of a random graph Laplacian to the manifold Laplacian in the context of machine learning.

Manifold Regularization: a Geometric Framework for Learning from Labeled and Unlabeled Examples [pdf, bib]
M. Belkin, P. Niyogi, V. Sindhwani
Journal of Machine Learning Research, 7(Nov):2399-2434, 2006.

+ abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework.

Consistency of Spectral Clustering [pdf, bib]
U. von Luxburg, M. Belkin, O. Bousquet,
The Annals of Statistics 2008, Vol. 36, No. 2, 555-586.

+ abstract: Consistency is a key property of statistical algorithms when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that for increasing sample size, those eigenvectors converge to the eigenvectors of certain limit operators. As a result we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering.

Semi-supervised Learning on Riemannian Manifolds [pdf, bib]
M. Belkin, P. Niyogi
Machine Learning, 56, Invited, secial Issue on Clustering, 209-239, 2004.

+ abstract: We consider the general problem of utilizing both labeled and unlabeled data to improve classification accuracy. Under the assumption that the data lie on a submanifold in a high dimensional space, we develop an algorithmic framework to classify a partially labeled data set in a principled manner. The central idea of our approach is that classification functions are naturally defined only on the submanifold in question rather than the total ambient space. Using the Laplace-Beltrami operator one produces a basis (the Laplacian Eigenmaps) for a Hilbert space of square integrable functions on the submanifold. To recover such a basis, only unlabeled examples are required. Once such a basis is obtained, training can be performed using the labeled data set. Our algorithm models the manifold using the adjacency graph for the data and approximates the Laplace- Beltrami operator by the graph Laplacian. We provide details of the algorithm, its theoretical justification, and several practical applications for image, speech, and text classification.

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation [pdf, bib]
M. Belkin, P. Niyogi
Neural Computation, June 2003; 15 (6):1373-1396.

+ abstract: One of the central problems in machine learning and pattern recognition is to develop appropriate representations for complex data. We consider the problem of constructing a representation for data lying on a lowdimensional manifold embedded in a high-dimensional space. Drawing on the correspondence between the graph Laplacian, the Laplace Beltrami operator on the manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for representing the highdimensional data. The algorithm provides a computationally efficient approach to non dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Some potential applications and illustrative examples are discussed.

Refereed and Invited Conference Proceedings:

Data Skeletonization via Reeb Graphs
X. Ge, I. Safa, M. Belkin, Y. Wang, NIPS 2011.

+ abstract

Recovering hidden structure from complex and noisy non-linear data is one of the most fundamental problems in machine learning and statistical inference. While such data is often high-dimensional, it is of interest to approximate it with a low-dimensional or even one-dimensional space, since many important aspects of data are often intrinsically low-dimensional. Furthermore, there are many scenarios where the underlying structure is graph-like, e.g, river/road networks or various trajectories. In this paper, we develop a framework to extract, as well as to simplify, a one-dimensional "skeleton" from unorganized data using the Reeb graph. Our algorithm is very simple, does not require complex optimizations and can be easily applied to unorganized high-dimensional data such as point clouds or proximity graphs. It can also represent arbitrary graph structures in the data. We also give theoretical results to justify our method. We provide a number of experiments to demonstrate the effectiveness and generality of our algorithm, including comparisons to existing methods, such as principal curves. We believe that the simplicity and practicality of our algorithm will help to promote skeleton graphs as a data analysis tool for a broad range of applications.
An Iterated Graph Laplacian Approach for Ranking on Manifolds
Xueyuan Zhou, Mikhail Belkin, Nathan Srebro, KDD 2011.

+ abstract

Ranking is one of the key problems in information retrieval. Recently, there has been significant interest in a class of ranking algorithms based on the assumption that data is sampled from a low dimensional manifold embedded in a higher dimensional Euclidean space. In this paper, we study a popular graph Laplacian based ranking algorithm [23] using an analytical method, which provides theoretical insights into the ranking algorithm go- ing beyond the intuitive idea of diffusion. Our analysis shows that the algorithm is sensitive to a commonly used parameter due to the use of symmetric normalized graph Laplacian. We also show that the ranking function may diverge to infinity at the query point in the limit of infinite samples. To address these issues, we propose an improved ranking algorithm on manifolds using Green's function of an iterated unnormalized graph Laplacian, which is more robust and density adaptive, as well as pointwise continu- ous in the limit of infinite samples. We also for the first time in the ranking literature empirically explore two variants from a family of twice normalized graph Laplacians. Experimental results on text and image data support our analysis, which also suggest the potential value of twice normalized graph Laplacians in practice.
Semi-supervised Learning by Higher Order Regularization [link]
X. Zhou, M. Belkin, AISTATS 2011.

+ abstract

Solutions of several graph Laplacian regularization based semi-supervised learning algorithms were recently shown by Nadler et al. (2009) to degenerate to constant functions with ``spikes'' at labeled points given infinite unlabeled points in $\R^d$ for $d\ge 2$. These algorithms solve optimization problems by penalizing an energy term $\int_{\Omega}\|\nabla f(x)\|^2 dx$. In this paper, we address this problem by using regularization based on the iterated Laplacian, which is equivalent to higher order Sobolev semi-norm. Alternatively, it can be viewed as a generalization of the thin plate spline an unknown domain in high dimension. We also discuss relationships between Reproducing Kernel Hilbert Spaces and Green's functions. Experimental results support our analysis by showing consistently improved results by using iterated Laplacians.
Polynomial Learning of Distribution Families [arxiv, bib]
M. Belkin, K. Sinha, FOCS 2010.

+ abstract

The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. However, despite major progress, the general question of polynomial learnability of Gaussian mixture distributions still remained open. The current work resolves the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call "polynomial families" in low dimension. These families are characterized by their moments being polynomial in parameters and include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that parameters of any distribution belonging to such a family can be learned in polynomial time and using a polynomial number of sample points. The result on learning polynomial families is quite general and is of independent interest. To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimations in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions.
Toward Learning Gaussian Mixtures with Arbitrary Separation [pdf]
Mikhail Belkin, Kaushik Sinha, COLT 2010

+ abstract

In recent years analysis of complexity of learning Gaussian mixture models from sampled data has received significant attention in computational machine learning and theory communities. In this paper we present the first result showing that polynomial time learning of multidimensional Gaussian Mixture distributions is possible when the separation between the component means is arbitrarily small. Specifically, we present an algorithm for learning the parameters of a mixture of k identical spherical Gaussians in n-dimensional space with an arbitrarily small separation between the components, which is polynomial in dimension, inverse component separation and other input parameters for a fixed number of components k. The algorithm uses a projection to k dimensions and then a reduction to the 1-dimensional case. It relies on a theoretical analysis showing that two 1-dimensional mixtures whose densities are close in the L2 norm must have similar means and mixing coefficients. To produce the necessary lower bound for the L2 norm in terms of the distances between the corresponding means, we analyze the behavior of the Fourier transform of a mixture of Gaussians in one dimension around the origin, which turns out to be closely related to the properties of the Vandermonde matrix obtained from the component means. Analysis of minors of the Vandermonde matrix together with basic function approximation results allows us to provide a lower bound for the norm of the mixture in the Fourier domain and hence a bound in the original space. Additionally, we present a separate argument for reconstructing variance.
Semi-supervised Learning using Sparse Eigenfunction Bases [pdf, bib]
K. Sinha, M.Belkin, NIPS 2009.

+ abstract

We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices. It turns out that when the data has clustered, that is, when the high density regions are suf.ciently separated by low density valleys, each high density area corresponds to a unique representative eigenvector. Linear combination of such eigenvectors (or, more precisely, of their Nystrom extensions) provide good candidates for good classi.cation functions when the cluster assumption holds. By .rst choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with Lasso to select a classi.er in the span of these eigenvectors, we obtain a classi.er, which has a very sparse representation in this basis. Importantly, the sparsity corresponds naturally to the cluster assumption. Experimental results on a number of real-world data-sets show that our method is competitive with the state of the art semi-supervised learning algorithms and outperforms the natural base-line algorithm (Lasso in the Kernel PCA basis).

A Note on Learning with Integral Operators [pdf, bib]
L. Rosasco, M.Belkin, E. De Vito, COLT 2009.

+ abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold. First, We use a technique based on a concentration inequality for Hilbert spaces to provide new simplified proofs for a number of results in spectral approximation. Second, using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from [28].

Constructing Laplace Operator from Point Clouds in R^d [pdf, bib, code]
M. Belkin, J. Sun, Y. Wang, SODA 2009, pp. 1031--1040.

+ abstract: We present an algorithm for approximating the Laplace- Beltrami operator from an arbitrary point cloud obtained from a k-dimensional manifold embedded in the d- dimensional space. We show that this PCD Laplace (Point- Cloud Data Laplace) operator converges to the Laplace- Beltrami operator on the underlying manifold as the point cloud becomes denser. Unlike the previous work, we do not assume that the data samples are independent identically distributed from a probability distribution and do not require a global mesh. The resulting algorithm is easy to implement. We present experimental results indicating that even for point sets sampled from a uniform distribution, PCD Laplace converges faster than the weighted graph Laplacian. We also show that using our PCD Laplacian we can directly estimate certain geometric invariants, such as manifold area.

Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators [pdf, bib]
T. Shi, M. Belkin, B. Yu, ICML 2008.

+ abstract: In this paper we develop a spectral frame- work for estimating mixture distributions, specifically Gaussian mixture models. In physics, spectroscopy is often used for the identification of substances through their spectrum. Treating a kernel function K(x, y) as "light" and the sampled data as "sub- stance", the spectrum of their interaction (eigenvalues and eigenvectors of the kernel matrix K) unveils certain aspects of the un- derlying parametric distribution p, such as the parameters of a Gaussian mixture. Our approach extends the intuitions and analyses underlying the existing spectral techniques, such as spectral clustering and Kernel Prin- cipal Components Analysis (KPCA). We construct algorithms to estimate param- eters of Gaussian mixture models, includ- ing the number of mixture components, their means and covariance matrices, which are im- portant in many practical applications. We provide a theoretical framework and show en- couraging experimental results.

Discrete Laplace Operator for Meshed Surfaces [pdf, bib]
M. Belkin, J. Sun, Y. Wang, SOCG 2008.

+ abstract: In recent years a considerable amount of work in graphics and geometric optimization used tools based on the Laplace-Beltrami operator on a surface. The applications of the Laplacian include mesh editing, surface smoothing, and shape interpolations among others. However, it has been shown [12, 23, 25] that the popular cotangent approximation schemes do not provide convergent point-wise (or even L^2) estimates, while many applications rely on point-wise estimation. Existence of such schemes has been an open question [12]. In this paper we propose the first algorithm for approximating the Laplace operator of a surface from a mesh with point-wise convergence guarantees applicable to arbitrary meshed surfaces. We show that for a sufficiently fine mesh over an arbitrary surface, our mesh Laplacian is close to the Laplace-Beltrami operator on the surface at every point of the surface. Moreover, the proposed algorithm is simple and easily implementable. Experimental evidence shows that our algorithm exhibits convergence empirically and outperforms cotangent-based methods in providing accurate approximation of the Laplace operator for various meshes.

Component Based Shape Retrieval Using Differential Profiles.
L. Ding, M. Belkin, Proceedings of ACM International Conference on Multimedia Information Retrieval, 2008.

+ abstract: In this paper, we describe the use of differential profiles, which are computed from 2D shapes smoothed with Gaussian functions, as the shape features for building a shape retrieval system. We build a global shape component dictionary from all the shape descriptors collected from shapes available in a database and then represent each shape as a probabilistic mixture of elements from such a dictionary. Finally, shape retrieval from a given database is simply done by comparing the mixing coefficients of the model of a query shape and those of known shapes. Our retrieval experiments are done on both object contour and line drawing collections and show promising results.

The value of labeled and unlabeled examples when the model is imperfect [pdf, bib]
K. Sinha, M. Belkin, NIPS 2007.

+ abstract: Semi-supervised learning, i.e. learning from both labeled and unlabeled data has received signi.cant attention in the machine learning literature in recent years. Still our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and the best understood situation is when the data is described by an identi.able mixture model, and where each class comes from a pure component. This natural setup and its implications ware analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation, for example, .cluster. and manifold assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in [11, 5] has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution, which can be modeled, but not perfectly, by an identi.able mixture distribution. This seems applicable to many situation, when, for example, a mixture of Gaussians is used to model the data. the contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.

Convergence of Laplacian Eigenmaps [long tech report]
M. Belkin, P. Niyogi, NIPS 2006.

On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts [pdf, bib]
H. Narayanan, M. Belkin, P. Niyogi, NIPS 2006.

+ abstract: One of the intuitions underlying many graph-based methods for clustering and semi-supervised learning, is that class or cluster boundaries pass through areas of low probability density. In this paper we provide some formal analysis of that notion for a probability distribution. We introduce a notion of weighted boundary volume, which measures the length of the class/cluster boundary weighted by the density of the underlying probability distribution. We show that sizes of the cuts of certain commonly used data adjacency graphs converge to this continuous weighted volume of the boundary.

Heat Flow and a Faster Algorithm to Compute the Surface Area of a Convex Body [pdf, bib]
M. Belkin, H. Narayanan, P. Niyogi, FOCS 2006.

+ abstract: We draw on the observation that the amount of heat diffusing outside of a heated body in a short period of time is proportional to its surface area, to design a simple algorithm for approximating the surface area of a convex body given by a membership oracle. Our method has a complexity of O*(n^4), where n is the dimension, compared to O*(n^8.5) for the previous best algorithm. We show that our complexity cannot be improved given the current state-of-the-art in volume estimation.

Maximum Margin Semi-Supervised Learning for Structured Variables [pdf, bib]
Y. Altun, D. McAllester, M. Belkin, NIPS 2005.

+ abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points.

Beyond the Point Cloud: from Transductive to Semi-supervised Learning [pdf, bib]
V. Sindhwani, P. Niyogi, M. Belkin, ICML 2005.

+ abstract: Due to its occurrence in engineering domains and implications for natural learning, the problem of utilizing unlabeled data is attracting increasing attention in machine learning. A large body of recent literature has focussed on the transductive setting where labels of unlabeled examples are estimated by learning a function defined only over the point cloud data. In a truly semi-supervised setting however, a learning machine has access to labeled and unlabeled examples and must make predictions on data points never encountered before. In this paper, we show how to turn transductive and standard supervised learning algorithms into semi-supervised learners. We construct a family of data-dependent norms on Reproducing Kernel Hilbert Spaces (RKHS). These norms allow us to warp the structure of the RKHS to reflect the underlying geometry of the data. We derive explicit formulas for the corresponding new kernels. Our approach demonstrates state of the art performance on a variety of classification tasks.

A Co-Regularization Approach to Semi-supervised Learning with Multiple Views [pdf]
V. Sindhwani, P. Niyogi, M. Belkin,
ICML Workshop on Learning with Multiple Views, 2005.

+ abstract: The Co-Training algorithm uses unlabeled examples in multiple views to bootstrap classifiers in each view, typically in a greedy manner, and operating under assumptions of view-independence and compatibility. In this paper, we propose a Co-Regularization framework where classifiers are learnt in each view through forms of multi-view regularization. We propose algorithms within this framework that are based on optimizing measures of agreement and smoothness over la- beled and unlabeled examples. These algorithms naturally extend standard regularization methods like Support Vector Machines (SVM) and Regularized Least squares (RLS) for multi-view semi-supervised learning, and inherit their benefits and applicability to high-dimensional classification problems. An empirical investigation is presented that confirms the promise of this approach.

Linear Manifold Regularization for Large Scale Semi-supervised Learning
V. Sindhwani, P. Niyogi, M. Belkin, S. Keerthi
ICML Workshop on Learning with Partially Classified Training Data, 2005

Towards a Theoretical Foundation for Laplacian-Based Manifold Methods [pdf, bib]
M. Belkin, P. Niyogi, COLT 2005

On Manifold Regularization [pdf]
M. Belkin, P. Niyogi, V. Sindhwani, AISTATS 2005

Limits of Spectral Clustering [pdf] Outstanding Student Paper Award
U. von Luxburg, O. Bousquet, M. Belkin, NIPS 2004.

Regularization and Semi-supervised Learning on Large Graphs [pdf, bib]
M. Belkin, I. Matveeva, P. Niyogi, COLT 2004.

On the Convergence of Spectral Clustering on Random Samples: the Normalized Case [pdf]
U. von Luxburg, O. Bousquet, M. Belkin, COLT 2004.

Tikhonov Regularization and Semi-supervised Learning on Large Graphs
M. Belkin, I. Matveeva, P. Niyogi
ICASSP 2004, Special Session: Manifolds and Geometry in Signal Processing (Invited paper).

Using Manifold Structure for Partially Labeled Classification [pdf]
M. Belkin, P. Niyogi, NIPS 2002

Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering [pdf]
M. Belkin, P. Niyogi, NIPS 2001.

Using Eigenvectors of the Bigram Graph to Infer Morpheme Identity [pdf]
M. Belkin, J. Goldsmith
Morphological and Phonological Learning: Proceedings of the 6th Workshop
of the ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia, July 2002, pp. 41-47.

Ph.D. Thesis:

Problems of Learning on Manifolds [pdf, bib]
Ph.D. Dissertation, University of Chicago, Dept. of Mathematics, 2003.
Advisor: Partha Niyogi.

Book chapter:

The Geometric Basis of Semi-supervised Learning [pdf]
V. Sindhwani, M. Belkin, P. Niyogi.
Book chapter in "Semi-supervised Learning", Chapelle, O., B. Schölkopf and A. Zien, Eds., MIT Press, 2006.

Technical Reports:

Consistency of Spectral Clustering [pdf]
U. von Luxburg, M. Belkin, O. Bousquet
Max Planck Insitute for Biological Cybernetics Technical Report TR-134, 2004. (The Annals of Statistics, to appear)

+ abstract

Consistency is a key property of statistical algorithms when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that for increasing sample size, those eigenvectors converge to the eigenvectors of certain limit operators. As a result we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering.
Manifold Regularization: a Geometric Framework for Learning from Examples
M. Belkin, P. Niyogi, V. Sindhwani
University of Chicago CS Technical Report TR-2004-06, 2004.
Regression and Regularization on Large Graphs [link]
M. Belkin, I. Matveeva, P. Niyogi
The University of Chicago CS Technical Report TR-2003-11.
Semi-supervised learning on manifolds [link]
M. Belkin, P. Niyogi
The University of Chicago CS Technical Report TR-2002-12.
Laplacian Eigenmaps for Dimensionality Reduction and Data Representation [link]
M. Belkin, P. Niyogi
The University of Chicago CS Technical Report TR-2002-01.