
Random forests variable importances: towards a better understanding and large-scale feature selection

Author : Geurts, Pierre (conference speaker)
Publisher : CIRM


Abstract : Random forests are among the most popular supervised machine learning methods. One of their most practically useful features is the possibility to derive, from the ensemble of trees, an importance score for each input variable that assesses its relevance for predicting the output. These importance scores have been applied successfully to many problems, notably in bioinformatics, but they are still not well understood from a theoretical point of view. In this talk, I will present our recent work towards a better understanding, and consequently a better exploitation, of these measures. In the first part of the talk, I will present a theoretical analysis of the mean decrease impurity importance in asymptotic ensemble and sample size conditions. Our main results include an explicit formulation of this measure in the case of ensembles of totally randomized trees and a discussion of the conditions under which this measure is consistent with respect to a common definition of variable relevance. The second part of the talk will be devoted to the analysis of finite tree ensembles in a constrained framework that assumes each tree can be built only from a subset of variables of fixed size. This setting is motivated by very high-dimensional problems, or embedded systems, where one cannot assume that all variables fit into memory. We first consider a simple method that grows each tree on a subset of variables selected randomly and uniformly among all variables. We analyse the consistency and convergence rate of this method for the identification of all relevant variables under various problem and algorithm settings. From this analysis, we then motivate and design a modified variable sampling mechanism that is shown to significantly improve convergence in several conditions.
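The two settings discussed in the abstract can be sketched in a few lines of scikit-learn. This is a hypothetical illustration, not the speaker's code: it shows the mean decrease impurity (MDI) scores exposed as `feature_importances_`, and a simple version of the constrained scheme in which each tree is grown on a fixed-size random subset of variables; the subset size `K` and the synthetic data are assumptions for the example.

```python
# Illustrative sketch (not the speaker's implementation): MDI importances
# from a standard random forest, and a memory-constrained variant where
# each tree is grown on a fixed-size random subset of the variables.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic problem: 20 variables, of which only 5 are actually relevant.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           shuffle=False, random_state=0)

# Standard forest: feature_importances_ is the MDI score analysed in the talk.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
mdi = rf.feature_importances_  # normalised to sum to 1

# Constrained setting: each tree only ever sees K variables, sampled
# uniformly at random, so the full variable set never needs to fit in memory.
rng = np.random.default_rng(0)
n_vars, K, n_trees = X.shape[1], 8, 100
scores = np.zeros(n_vars)
for _ in range(n_trees):
    subset = rng.choice(n_vars, size=K, replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X[:, subset], y)
    scores[subset] += tree.feature_importances_
scores /= n_trees  # average MDI score per variable over the ensemble
```

Under these assumed settings, relevant variables should tend to receive the largest scores in both schemes; the talk analyses when the identification of all relevant variables is consistent and how a non-uniform variable-sampling mechanism can speed it up.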

MSC Codes :
62H30 - Classification and discrimination; cluster analysis
68T05 - Learning and adaptive systems

Information on the Video

Film maker : Hennenfent, Guillaume
Language : English
Available date : 19/02/16
Conference Date : 02/02/16
Subseries : Research talks
arXiv category : Computer Science ; Machine Learning ; Statistics Theory
Mathematical Area(s) : Probability & Statistics ; Computer Science
Format : MP4 (.mp4) - HD
Video Time : 00:52:03
Targeted Audience : Researchers
Download : https://videos.cirm-math.fr/2016-02-02_Geurts.mp4

Information on the Event

Event Title : Thematic month on statistics - Week 1: Statistical learning / Mois thématique sur les statistiques - Semaine 1 : apprentissage
Event Organizers : Ghattas, Badih ; Ralaivola, Liva
Dates : 01/02/16 - 05/02/16
Event Year : 2016
Event URL : http://conferences.cirm-math.fr/1615.html

Citation Data

DOI : 10.24350/CIRM.V.18920603
Cite this video as: Geurts, Pierre (2016). Random forests variable importances: towards a better understanding and large-scale feature selection. CIRM. Audiovisual resource. doi:10.24350/CIRM.V.18920603
URI : http://dx.doi.org/10.24350/CIRM.V.18920603

Bibliography

  • [1] Châtel, C. (2015). Sélection de variables à grande échelle à partir de forêts aléatoires. Master thesis, École Centrale de Marseille / Université de Liège

  • [2] Huynh-Thu, V. A., Saeys, Y., Wehenkel, L., & Geurts, P. (2012). Statistical interpretation of machine learning-based feature importance scores for biomarker discovery, Bioinformatics 28(13), 1766-1774 - http://dx.doi.org/10.1093/bioinformatics/bts238

  • [3] Huynh-Thu, V. A., Irrthum, A., Wehenkel, L., & Geurts, P. (2010). Inferring regulatory networks from expression data using tree-based methods, PLoS ONE 5(9), e12776 - http://dx.doi.org/10.1371/journal.pone.0012776

  • [4] Louppe, G. (2014). Understanding random forests: From theory to practice. PhD thesis, University of Liège - http://arxiv.org/abs/1407.7502v3

  • [5] Louppe, G., Wehenkel, L., Sutera, A., & Geurts, P. (2013). Understanding variable importances in forests of randomized trees. In C. J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K.Q. Weinberger (Eds.), Advances in neural information processing 26 (pp. 431-439). Curran Associates - http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf

  • [6] Louppe, G., & Geurts, P. (2012). Ensembles on random patches. In P.A. Flach, T. De Bie, & N. Cristianini (Eds.), Machine Learning and Knowledge Discovery in Databases (pp. 346-361). Berlin: Springer. (Lecture Notes in Computer Science, 7523) - http://dx.doi.org/10.1007/978-3-642-33460-3_28

  • [7] Marbach, D., Costello, J.C., Küffner, R., Vega, N.M., Prill, R.J., Camacho, D.M., Allison, K.R., Kellis, M., Collins, J.J., & Stolovitzky, G. (2012). Wisdom of crowds for robust gene network inference. Nature Methods, 9(8), 796-804 - http://dx.doi.org/10.1038/nmeth.2016
