
Documents 62H30: 6 results


Learning on the symmetric group - Vert, Jean-Philippe (Author of the conference)

Many kinds of data can be represented as rankings or permutations, raising the question of developing machine learning models on the symmetric group. When the number of items in the permutations gets large, manipulating permutations quickly becomes computationally intractable. I will discuss two computationally efficient embeddings of the symmetric group in Euclidean spaces that lead to fast machine learning algorithms, and illustrate their relevance on biological applications and image classification.
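
The abstract does not spell out the two embeddings, but one classical candidate, closely related to the speaker's published work on Kendall kernels for permutations, maps a permutation to its normalized vector of pairwise comparison signs, so that Euclidean inner products equal Kendall's tau correlation. A minimal NumPy sketch (illustrative, not the talk's exact construction):

```python
import numpy as np
from itertools import combinations

def kendall_embedding(sigma):
    """Map a permutation (given as a rank vector of integers) to the
    normalized vector of pairwise comparison signs, one entry per pair
    i < j. The inner product of two such vectors is the Kendall tau
    correlation between the permutations."""
    sigma = np.asarray(sigma)
    pairs = list(combinations(range(len(sigma)), 2))
    phi = np.array([np.sign(sigma[j] - sigma[i]) for i, j in pairs], float)
    return phi / np.sqrt(len(pairs))  # normalize to unit length

# Two rankings of 4 items; 1 discordant pair out of 6 -> (6 - 2) / 6 ≈ 0.667.
a = kendall_embedding([0, 1, 2, 3])
b = kendall_embedding([1, 0, 2, 3])
print(a @ b)
```

The explicit embedding has dimension n(n-1)/2, but the inner product (Kendall's tau) can be evaluated in O(n log n) without ever forming it, which is what makes kernel methods on large permutations tractable.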

62H30 ; 62P10 ; 68T05

Data mining methods based on finite mixture models are quite common in many areas of applied science, such as marketing, to segment data and to identify subgroups with specific features. Recent work shows that these methods are also useful in microeconometrics to analyze the behavior of workers in labor markets. Since these data are typically available as time series with discrete states, clustering kernels based on Markov chains with group-specific transition matrices are applied to capture both persistence in the individual time series and cross-sectional unobserved heterogeneity. Markov chain clustering has been applied to data from the Austrian labor market, (a) to understand the effect of labor market entry conditions on long-run career developments for male workers (Frühwirth-Schnatter et al., 2012), (b) to study mothers' long-run career patterns after first birth (Frühwirth-Schnatter et al., 2016), and (c) to study the effects of a plant closure on future career developments for male workers (Frühwirth-Schnatter et al., 2018). To capture non-stationary effects for the latter study, time-inhomogeneous Markov chains based on time-varying group-specific transition matrices are introduced as clustering kernels. For all applications, a mixture-of-experts formulation helps to understand which workers are likely to belong to a particular group. Finally, it will be shown that Markov chain clustering is also useful in a business application in marketing and helps to identify loyal consumers within a customer relationship management (CRM) program.
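
To make the clustering kernel concrete: each group g gets its own transition matrix P[g], and sequences are assigned to groups by how well that matrix explains their transitions. The cited papers use Bayesian MCMC; the EM sketch below is only an illustration of the same mixture-of-Markov-chains idea, with assumed variable names:

```python
import numpy as np

def fit_markov_mixture(seqs, n_states, n_groups, n_iter=100, seed=0):
    """EM for a mixture of Markov chains: each group g has its own
    transition matrix P[g], capturing cross-sectional heterogeneity.
    seqs: list of 1-D integer arrays (discrete-state time series)."""
    rng = np.random.default_rng(seed)
    w = np.full(n_groups, 1.0 / n_groups)             # mixture weights
    P = rng.dirichlet(np.ones(n_states), size=(n_groups, n_states))
    # Precompute transition counts per sequence: C[i, s, t].
    C = np.zeros((len(seqs), n_states, n_states))
    for i, x in enumerate(seqs):
        np.add.at(C[i], (x[:-1], x[1:]), 1)
    for _ in range(n_iter):
        # E-step: log-likelihood of each sequence under each group.
        logL = np.einsum('ist,gst->ig', C, np.log(P)) + np.log(w)
        r = np.exp(logL - logL.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)             # responsibilities
        # M-step: reweighted transition counts and mixture weights.
        w = r.mean(axis=0)
        P = np.einsum('ig,ist->gst', r, C) + 1e-6     # smooth zero counts
        P /= P.sum(axis=2, keepdims=True)
    return w, P, r
```

A time-inhomogeneous variant, as in the plant-closure study, would additionally index the transition matrices by time, P[g, t], at the cost of more parameters per group.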

62C10 ; 62M05 ; 62M10 ; 62H30 ; 62P20 ; 62F15

In many health studies, interest often lies in assessing health effects on a large set of outcomes or specific outcome subtypes, which may be sparsely observed, even in big data settings. For example, while the overall prevalence of birth defects is not low, the vast heterogeneity in types of congenital malformations leads to challenges in estimation for sparse groups. However, lumping small groups together to facilitate estimation is often controversial and may have limited scientific support.
There is a very rich literature proposing Bayesian approaches for clustering, starting with a prior probability distribution on partitions. Most approaches assume exchangeability, leading to simple representations in terms of Exchangeable Partition Probability Functions (EPPF). Gibbs-type priors encompass a broad class of such cases, including Dirichlet and Pitman-Yor processes. Even though there have been some proposals to relax the exchangeability assumption, allowing covariate dependence and partial exchangeability, limited consideration has been given to how to include concrete prior knowledge on the partition. We wish to cluster birth defects into groups to facilitate estimation, and we have prior knowledge of an initial clustering provided by experts. As a general approach for including such prior knowledge, we propose a Centered Partition (CP) process that modifies the EPPF to favor partitions close to an initial one. Some properties of the CP prior are described, a general algorithm for posterior computation is developed, and we illustrate the methodology through simulation examples and an application to the motivating epidemiology study of birth defects.
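
The CP construction can be summarized in one formula. This is a sketch of my reading of it, with assumed notation: the base EPPF is exponentially tilted toward the experts' initial partition.

```latex
% Centered Partition process (sketch, notation assumed):
% rho_0 = initial expert clustering, psi >= 0 = penalization parameter,
% d(.,.) = a distance between partitions (e.g. variation of information).
p(\rho \mid \rho_0, \psi) \;\propto\; p_0(\rho)\, \exp\{-\psi\, d(\rho, \rho_0)\}
```

Setting psi = 0 recovers the base Gibbs-type prior, while large psi concentrates the prior mass on partitions close to rho_0.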

62F15 ; 62H30 ; 60G09 ; 60G57 ; 62G05 ; 62P10

In this talk we consider high-dimensional classification. We first discuss high-dimensional binary classification by sparse logistic regression, propose a model/feature selection procedure based on penalized maximum likelihood with a complexity penalty on the model size, and derive non-asymptotic bounds for the resulting misclassification excess risk. Implementation of any complexity-penalty-based criterion, however, requires a combinatorial search over all possible models. To find a model selection procedure computationally feasible for high-dimensional data, we consider logistic Lasso and Slope classifiers and show that they also achieve the optimal rate. We further extend the proposed approach to multiclass classification by sparse multinomial logistic regression.

This is joint work with Vadim Grinshtein and Tomer Levy.
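
The complexity-penalized search is combinatorial, which is exactly why the abstract turns to the logistic Lasso. A minimal scikit-learn sketch of that convex relaxation (the synthetic data, solver, and regularization level are illustrative choices, not the authors' setup; Slope is not available in scikit-learn):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# High-dimensional binary classification: many features, few informative.
X, y = make_classification(n_samples=200, n_features=1000,
                           n_informative=10, random_state=0)

# Logistic Lasso: the L1 penalty performs feature selection inside a
# convex optimization, replacing the combinatorial model search.
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
clf.fit(X, y)
print('selected features:', np.sum(clf.coef_ != 0))
```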

62H30 ; 62C20

Community detection is a fundamental problem in network analysis, made more challenging by the overlaps between communities that often occur in practice. Here we propose a general, flexible, and interpretable generative model for overlapping communities, which can be thought of as a generalization of the degree-corrected stochastic block model. We develop an efficient spectral algorithm for estimating the community memberships, which deals with the overlaps by employing the $K$-medians algorithm rather than the usual $K$-means for clustering in the spectral domain. We show that the algorithm is asymptotically consistent when the networks are not too sparse and the overlaps between communities are not too large. Numerical experiments on both simulated networks and many real social networks demonstrate that our method performs very well compared to a number of benchmark methods for overlapping community detection. This is joint work with Yuan Zhang and Ji Zhu.

community detection - networks - pseudo-likelihood
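
A minimal sketch of the spectral step with a simple alternating K-medians in place of K-means (helper names are my own; the paper's actual algorithm additionally handles degree correction and recovers overlapping memberships from the geometry of the embedding):

```python
import numpy as np

def k_medians(X, k, n_iter=50, seed=0):
    """Alternating K-medians: assign points to the nearest center in
    L1 distance, then update each center as the coordinatewise median.
    The median step is robust to the rows produced by nodes belonging
    to several communities, which would drag K-means centers around."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = np.abs(X[:, None, :] - centers[None]).sum(axis=2)  # L1 distances
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = np.median(X[labels == j], axis=0)
    return labels, centers

def spectral_overlap_clustering(A, k):
    """Embed nodes with the k leading eigenvectors of the adjacency
    matrix, row-normalize, then cluster with K-medians."""
    vals, vecs = np.linalg.eigh(A)
    U = vecs[:, np.argsort(-np.abs(vals))[:k]]       # top-k eigenvectors
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    U = U / np.maximum(norms, 1e-12)                 # row normalization
    return k_medians(U, k)
```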

62G20 ; 62H30 ; 65C60

Random forests are among the most popular supervised machine learning methods. One of their most practically useful features is the possibility to derive, from the ensemble of trees, an importance score for each input variable that assesses its relevance for predicting the output. These importance scores have been successfully applied to many problems, notably in bioinformatics, but they are still not well understood from a theoretical point of view. In this talk, I will present our recent work towards a better understanding, and consequently a better exploitation, of these measures. In the first part of my talk, I will present a theoretical analysis of the mean decrease impurity importance in asymptotic ensemble and sample size conditions. Our main results include an explicit formulation of this measure in the case of ensembles of totally randomized trees and a discussion of the conditions under which this measure is consistent with respect to a common definition of variable relevance. The second part of the talk will be devoted to the analysis of finite tree ensembles in a constrained framework that assumes each tree can be built only from a subset of variables of fixed size. This setting is motivated by very high-dimensional problems, or embedded systems, where one cannot assume that all variables fit into memory. We first consider a simple method that grows each tree on a subset of variables selected randomly and uniformly among all variables. We analyze the consistency and convergence rate of this method for the identification of all relevant variables under various problem and algorithm settings. From this analysis, we then motivate and design a modified variable sampling mechanism that is shown to significantly improve convergence in several conditions.
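
The mean decrease impurity (MDI) importance analyzed in the first part is the score scikit-learn exposes as feature_importances_. A minimal sketch (the synthetic data and forest settings are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 3 informative variables among 20; the rest are irrelevant noise.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=3, n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# feature_importances_ is the MDI score: the impurity reduction at each
# split, weighted by the fraction of samples reaching the node, averaged
# over all trees in the ensemble.
for i in np.argsort(forest.feature_importances_)[::-1][:5]:
    print(f'variable {i}: MDI importance {forest.feature_importances_[i]:.3f}')
```

The memory-constrained setting of the second part corresponds to growing each tree from a fixed-size random subset of variables; note this differs from scikit-learn's max_features option, which resamples variables at every split rather than once per tree.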

62H30 ; 68T05
