
Random forests variable importances: towards a better understanding and large-scale feature selection

Multi angle
Authors: Geurts, Pierre (Conference speaker)
CIRM (Publisher)


Abstract: Random forests are among the most popular supervised machine learning methods. One of their most practically useful features is the possibility to derive, from the ensemble of trees, an importance score for each input variable that assesses its relevance for predicting the output. These importance scores have been applied successfully to many problems, notably in bioinformatics, but they are still not well understood from a theoretical point of view. In this talk, I will present our recent work towards a better understanding, and consequently a better exploitation, of these measures. In the first part of the talk, I will present a theoretical analysis of the mean decrease impurity importance in asymptotic ensemble and sample size conditions. Our main results include an explicit formulation of this measure for ensembles of totally randomized trees and a discussion of the conditions under which it is consistent with a common definition of variable relevance. The second part of the talk will be devoted to the analysis of finite tree ensembles in a constrained framework where each tree can be built only from a subset of variables of fixed size. This setting is motivated by very high-dimensional problems, or embedded systems, where one cannot assume that all variables fit into memory. We first consider a simple method that grows each tree on a subset of variables selected uniformly at random among all variables. We analyse the consistency and convergence rate of this method for the identification of all relevant variables under various problem and algorithm settings. From this analysis, we then motivate and design a modified variable sampling mechanism that is shown to significantly improve convergence in several conditions.
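As a concrete illustration of the mean decrease impurity (MDI) measure discussed in the abstract, here is a minimal sketch in plain Python (binary features, entropy impurity; the data and function names are illustrative, not code from the talk). Each totally randomized tree splits on a variable drawn uniformly at random among the not-yet-used variables, and a variable's importance accumulates p(node) times the impurity decrease over all nodes that split on it, averaged over the ensemble.

```python
import math
import random

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def mdi_totally_randomized(X, y, n_trees=200, seed=0):
    """MDI importances from an ensemble of totally randomized trees:
    the split variable at each node is drawn uniformly at random
    among the variables not yet used on the path (binary features)."""
    rng = random.Random(seed)
    p, n = len(X[0]), len(y)
    imp = [0.0] * p

    def grow(indices, variables):
        ys = [y[i] for i in indices]
        if len(set(ys)) <= 1 or not variables:
            return  # pure node or no variable left: stop
        j = rng.choice(variables)
        left = [i for i in indices if X[i][j] == 0]
        right = [i for i in indices if X[i][j] == 1]
        rest = [v for v in variables if v != j]
        if not left or not right:
            grow(indices, rest)  # degenerate split: try other variables
            return
        # contribution of this node: p(node) * impurity decrease
        dec = (entropy(ys)
               - len(left) / len(indices) * entropy([y[i] for i in left])
               - len(right) / len(indices) * entropy([y[i] for i in right]))
        imp[j] += len(indices) / n * dec
        grow(left, rest)
        grow(right, rest)

    for _ in range(n_trees):
        grow(list(range(n)), list(range(p)))
    return [v / n_trees for v in imp]

# XOR example: y = x0 XOR x1, while x2 is irrelevant
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for (a, b, _) in X]
importances = mdi_totally_randomized(X, y)
```

On this toy example the irrelevant variable x2 receives zero importance, and the importances of x0 and x1 sum to H(Y) = 1 bit: the impurity decreases telescope along each root-to-leaf path, which is the kind of decomposition underlying the asymptotic characterization presented in the talk.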

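The memory-constrained setting described in the abstract (each tree grown on only K of the p variables) can be illustrated with a small coupon-collector-style simulation; the function name and parameter values below are illustrative assumptions, not code from the talk. It counts how many trees must be grown, with K-variable subsets drawn uniformly, before every relevant variable has appeared in at least one tree and so has had any chance of being ranked important.

```python
import random

def trees_until_all_relevant_seen(p, K, relevant, seed=0):
    """Number of trees, each grown on K variables sampled uniformly
    without replacement from p candidates, until every relevant
    variable has appeared in at least one tree's subset."""
    rng = random.Random(seed)
    seen = set()
    n_trees = 0
    while not relevant <= seen:
        n_trees += 1
        seen.update(rng.sample(range(p), K))
    return n_trees

# e.g. 5 relevant variables hidden among p = 10000, trees of K = 100 variables
n = trees_until_all_relevant_seen(p=10_000, K=100, relevant={0, 1, 2, 3, 4})
```

With uniform sampling the number of trees needed grows roughly like (p/K) times a logarithmic factor in the number of relevant variables, so in very high dimensions most trees see no relevant variable at all; this inefficiency is what motivates the modified, non-uniform variable sampling mechanism mentioned at the end of the abstract.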
MSC codes:
62H30 - Classification and discrimination; cluster analysis
68T05 - Learning and adaptive systems

    Video information

    Director: Hennenfent, Guillaume
    Language: English
    Publication date: 19/02/16
    Recording date: 02/02/16
    Subcollection: Research talks
    arXiv category: Computer Science ; Machine Learning ; Statistics Theory
    Domain: Probability & Statistics ; Computer Science
    Format: MP4 (.mp4) - HD
    Duration: 00:52:03
    Audience: Researchers
    Download: https://videos.cirm-math.fr/2016-02-02_Geurts.mp4

Meeting information

Meeting name: Thematic month on statistics - Week 1: Statistical learning / Mois thématique sur les statistiques - Semaine 1 : apprentissage
Meeting organizers: Ghattas, Badih ; Ralaivola, Liva
Dates: 01/02/16 - 05/02/16
Meeting year: 2016
Conference URL: http://conferences.cirm-math.fr/1615.html

Citation data

DOI : 10.24350/CIRM.V.18920603
Cite this video: Geurts, Pierre (2016). Random forests variable importances: towards a better understanding and large-scale feature selection. CIRM. Audiovisual resource. doi:10.24350/CIRM.V.18920603
URI : http://dx.doi.org/10.24350/CIRM.V.18920603

Bibliography

  • [1] Châtel, C. (2015). Sélection de variables à grande échelle à partir de forêts aléatoires. Master thesis, École Centrale de Marseille / Université de Liège

  • [2] Huynh-Thu, V. A., Saeys, Y., Wehenkel, L., & Geurts, P. (2012). Statistical interpretation of machine learning-based feature importance scores for biomarker discovery, Bioinformatics 28(13), 1766-1774 - http://dx.doi.org/10.1093/bioinformatics/bts238

  • [3] Huynh-Thu, V. A., Irrthum, A., Wehenkel, L., & Geurts, P. (2010). Inferring regulatory networks from expression data using tree-based methods, Plos ONE 5(9), e12776 - http://dx.doi.org/10.1371/journal.pone.0012776

  • [4] Louppe, G. (2014). Understanding random forests: From theory to practice. PhD thesis, University of Liège - http://arxiv.org/abs/1407.7502v3

  • [5] Louppe, G., Wehenkel, L., Sutera, A., & Geurts, P. (2013). Understanding variable importances in forests of randomized trees. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26 (pp. 431-439). Curran Associates - http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf

  • [6] Louppe, G., & Geurts, P. (2012). Ensembles on random patches. In P.A. Flach, T. De Bie, & N. Cristianini (Eds.), Machine Learning and Knowledge Discovery in Databases (pp. 346-361). Berlin: Springer. (Lecture Notes in Computer Science, 7523) - http://dx.doi.org/10.1007/978-3-642-33460-3_28

  • [7] Marbach, D., Costello, J.C., Küffner, R., Vega, N.M., Prill, R.J., Camacho, D.M., Allison, K.R., Kellis, M., Collins, J.J., & Stolovitzky, G. (2012). Wisdom of crowds for robust gene network inference. Nature Methods, 9(8), 796-804 - http://dx.doi.org/10.1038/nmeth.2016


