Documents  62P10 | records found: 16


I shall classify current approaches to multiple inference according to their goals, and discuss the basic approaches being used. I shall then highlight a few challenges that await our attention: some are simple inequalities, others arise in particular applications.

62J15 ; 62P10


Post-edited  Bayesian modelling
Mengersen, Kerrie (Conference author) | CIRM (Publisher)

This tutorial will be a beginner’s introduction to Bayesian statistical modelling and analysis. Simple models and computational tools will be described, followed by a discussion about implementing these approaches in practice. A range of case studies will be presented and possible solutions proposed, followed by an open discussion about other ways that these problems could be tackled.
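
As a flavour of the kind of simple model such a tutorial typically begins with, here is a minimal sketch (not taken from the talk) of conjugate Beta-Binomial updating for an unknown success probability; all numbers are illustrative.

```python
# Minimal Bayesian example: Beta prior + Binomial likelihood = Beta posterior.
# Counts below are invented for illustration.
from scipy import stats

a_prior, b_prior = 1.0, 1.0          # Beta(1, 1) = uniform prior on p
successes, trials = 37, 120          # hypothetical observed data

a_post = a_prior + successes
b_post = b_prior + trials - successes
posterior = stats.beta(a_post, b_post)

print(f"posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```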

62C10 ; 62F15 ; 62P12 ; 62P10


In recent years, new pandemic threats have become more and more frequent (SARS, bird flu, swine flu, Ebola, MERS, nCoV...), and analyses of data from the early spread more and more common and rapid. Particular interest usually focuses on the estimation of $R_{0}$, and various methods, essentially based on estimates of the exponential growth rate and the generation time distribution, have been proposed. Other parameters, such as the fatality rate, are also of interest. In this talk, various sources of bias arising because observations are made in the early phase of spread will be discussed, and possible remedies proposed.
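
As a hedged illustration of the estimation approach alluded to above, the sketch below fits the exponential growth rate $r$ to idealised early incidence and inverts the Euler-Lotka equation $1/R_{0} = \sum_t e^{-rt} g(t)$ for a discretised generation-time distribution $g$. Data and $g$ are invented for illustration; the talk's point is precisely that such early-phase estimates can be biased.

```python
import numpy as np

days = np.arange(21)
cases = 2.0 * np.exp(0.15 * days)        # idealised early exponential incidence

# growth rate r from a log-linear fit to the early phase
r = np.polyfit(days, np.log(cases), 1)[0]

# discretised generation-time distribution (illustrative, mean ~3.8 days)
g = np.array([0.05, 0.15, 0.25, 0.25, 0.15, 0.10, 0.05])
t = np.arange(1, len(g) + 1)

# Euler-Lotka: 1/R0 = sum_t exp(-r t) g(t)
R0 = 1.0 / np.sum(np.exp(-r * t) * g)
print(f"r = {r:.3f}, R0 = {R0:.2f}")
```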

92B05 ; 92B15 ; 62P10


Differences in disease predisposition or response to treatment can be explained in great part by genomic differences between individuals. This has given birth to precision medicine, where treatment is tailored to the genome of patients. This field depends on collecting considerable amounts of molecular data for large numbers of individuals, which is being enabled by thriving developments in genome sequencing and other high-throughput experimental technologies.
Unfortunately, we still lack effective methods to reliably detect, from these data, which of the genomic features determine a phenotype such as disease predisposition or response to treatment. One of the major issues is that the number of features that can be measured is large (easily reaching tens of millions) relative to the number of samples for which they can be collected (more usually of the order of hundreds or thousands), posing both computational and statistical difficulties.
In my talk I will discuss how to use biological networks, which allow us to understand mutations in their genomic context, to address these issues. All the methods I will present share the common hypothesis that genomic regions involved in a given phenotype are more likely to be connected on a given biological network than not.
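
A minimal sketch of how the stated hypothesis can be tested in practice (not the speaker's specific methods): compare the number of network edges among phenotype-associated genes to the same count for random gene sets of equal size. The network and gene set below are synthetic stand-ins for a real interaction network and a real hit list.

```python
import random
import networkx as nx

G = nx.erdos_renyi_graph(200, 0.05, seed=0)               # stand-in network
hit_genes = set(random.Random(1).sample(range(200), 15))   # hypothetical hits

def internal_edges(G, nodes):
    """Number of network edges with both endpoints in `nodes`."""
    return G.subgraph(nodes).number_of_edges()

observed = internal_edges(G, hit_genes)

# permutation null: same-size random gene sets
rng = random.Random(42)
null = [internal_edges(G, rng.sample(list(G.nodes), len(hit_genes)))
        for _ in range(2000)]
pval = (1 + sum(n >= observed for n in null)) / (1 + len(null))
print(f"observed {observed} internal edges, p = {pval:.3f}")
```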

92C42 ; 92-08 ; 92B15 ; 62P10


We analyse patterns of genetic variability of populations in the presence of a large seed bank with the help of a new coalescent structure called the seed bank coalescent. This ancestral process appears naturally as the scaling limit of the genealogy of large populations that sustain seed banks, if the seed bank size and individual dormancy times are of the same order as the active population. Mutations appear as a Poisson process on the active lineages, and potentially at reduced rate also on the dormant lineages. The presence of ‘dormant’ lineages leads to qualitatively altered times to the most recent common ancestor and non-classical patterns of genetic diversity. To illustrate this we provide a Wright-Fisher model with seed bank component and mutation, motivated by recent models of microbial dormancy, whose genealogy can be described by the seed bank coalescent. Based on our coalescent model, we derive recursions for the expectation and variance of the time to the most recent common ancestor, the number of segregating sites, pairwise differences, and singletons. Commonly employed distance statistics, in the presence and absence of a seed bank, are compared. The effect of a seed bank on the expected site-frequency spectrum is also investigated. Our results indicate that the presence of a large seed bank considerably alters the distribution of some distance statistics, as well as the site-frequency spectrum. Thus, one should be able to detect the presence of a large seed bank in genetic data. Joint work with Bjarki Eldon, Adrián González Casanova, Noemi Kurt, and Maite Wilke-Berenguer.
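
To fix ideas, here is an illustrative Gillespie-type simulation of the time to the most recent common ancestor under a seed bank coalescent. The parameterisation is one common convention (pairs of active lineages coalesce at rate 1, active lineages go dormant at rate $c$ each, dormant lineages wake at rate $cK$ each) and may differ from the talk's exact scaling.

```python
import numpy as np

def tmrca_seed_bank(n_active, c=1.0, K=1.0, rng=None):
    """Time to MRCA with `a` active and `d` dormant ancestral lineages."""
    rng = rng or np.random.default_rng()
    a, d, t = n_active, 0, 0.0
    while a + d > 1:
        rate_coal = a * (a - 1) / 2.0    # coalescence, active pairs only
        rate_sleep = c * a               # active lineage becomes dormant
        rate_wake = c * K * d            # dormant lineage reactivates
        total = rate_coal + rate_sleep + rate_wake
        t += rng.exponential(1.0 / total)
        u = rng.uniform(0.0, total)
        if u < rate_coal:
            a -= 1
        elif u < rate_coal + rate_sleep:
            a, d = a - 1, d + 1
        else:
            a, d = a + 1, d - 1
    return t

rng = np.random.default_rng(0)
print(np.mean([tmrca_seed_bank(10, rng=rng) for _ in range(2000)]))
```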

92D10 ; 60K35 ; 62P10


Faced with data containing a large number of inter-related explanatory variables, finding ways to investigate complex multi-factorial effects is an important statistical task. This is particularly relevant for epidemiological study designs where large numbers of covariates are typically collected in an attempt to capture complex interactions between host characteristics and risk factors. A related task, which is of great interest in stratified medicine, is to use multi-omics data to discover subgroups of patients with distinct molecular phenotypes and clinical outcomes, thus providing the potential to target treatments more precisely.

Flexible clustering is a natural way to tackle such problems. It can be used in an unsupervised or a semi-supervised manner by adding a link between the clustering structure and outcomes and performing joint modelling; in this case, the clustering structure is used to help predict the outcome. This latter approach, known as profile regression, has been implemented recently within a Bayesian nonparametric Dirichlet process (DP) modelling framework, which specifies a joint clustering model for covariates and outcome, with an additional variable selection step to uncover the variables driving the clustering (Papathomas et al., 2012).

In this talk, two related issues will be discussed. Firstly, we will focus on categorical covariates, a common situation in epidemiological studies, and examine the relation between: (i) dependence structures highlighted by Bayesian partitioning of the covariate space incorporating variable selection; and (ii) log-linear modelling with interaction terms, a traditional approach to modelling dependence. We will show how the clustering approach can be employed to assist log-linear model determination, a challenging task as the model space quickly becomes very large (Papathomas and Richardson, 2015).

Secondly, we will discuss clustering as a tool for integrating information from multiple datasets, with a view to discovering structure useful for prediction. In this context several related issues arise. It is clear that each dataset may carry a different amount of information for the predictive task; methods for learning how to reweight each data type for this task will therefore be presented. In the context of multi-omics datasets, the efficiency of different methods for performing integrative clustering will also be discussed, contrasting joint modelling and stepwise approaches. This will be illustrated by an analysis of genomics cancer datasets.
Joint work with Michael Papathomas and Paul Kirk.
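
Not profile regression itself, but a hedged sketch of the Bayesian nonparametric clustering idea it builds on: a truncated Dirichlet process mixture that lets the data choose an effective number of clusters. Data are synthetic stand-ins for covariate profiles.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 4))
               for loc in (-2.0, 0.0, 2.5)])        # three latent profiles

dpgmm = BayesianGaussianMixture(
    n_components=10,                                 # truncation level
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

labels = dpgmm.predict(X)
print("effective clusters:", len(np.unique(labels)))
```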

62F15 ; 62P10


In the first part, I will present various problems related to the statistics of word occurrences in genomes, and will examine in more detail the question of how to detect whether a word occurs with a significantly abnormal frequency in a sequence. In the second part, I will present various extensions to account for the fact that a functional DNA motif is not always a "word", but may have a more complex structure, which requires the development of new statistical methods.
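
A hedged sketch of the basic question in the first part: is the observed count of a word in a sequence unusually high or low relative to a simple first-order Markov model fitted to the sequence itself? The z-score uses a crude Poisson-variance approximation; sequence and word are illustrative.

```python
import math
import random
from collections import Counter

rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(100_000))
word = "GATC"

# observed (overlapping) occurrences of the word
obs = sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))

# first-order Markov model estimated from the sequence itself
pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
base = Counter(seq)

def p_trans(x, y):
    """Estimated transition probability P(y | x)."""
    return pairs[x + y] / base[x]

p_word = (base[word[0]] / len(seq)) * math.prod(
    p_trans(a, b) for a, b in zip(word, word[1:]))
expected = (len(seq) - len(word) + 1) * p_word

z = (obs - expected) / math.sqrt(expected)   # crude Poisson approximation
print(f"observed {obs}, expected {expected:.1f}, z = {z:.2f}")
```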

92C40 ; 62P10 ; 60J20 ; 92C42


Multi angle  Selective inference in genetics
Sabatti, Chiara (Conference author) | CIRM (Publisher)

Geneticists have always been aware that, when looking for signal across the entire genome, one has to be very careful to avoid false discoveries. Contemporary studies often involve a very large number of traits, increasing the challenges of "looking everywhere". I will discuss novel approaches that allow an adaptive exploration of the data, while guaranteeing reproducible results.

62F15 ; 62J15 ; 62P10 ; 92D10


Multi angle  Learning on the symmetric group
Vert, Jean-Philippe (Conference author) | CIRM (Publisher)

Many kinds of data can be represented as rankings or permutations, raising the question of developing machine learning models on the symmetric group. When the number of items in the permutations gets large, manipulating permutations can quickly become computationally intractable. I will discuss two computationally efficient embeddings of the symmetric group into Euclidean spaces leading to fast machine learning algorithms, and illustrate their relevance on biological applications and image classification.
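
One such embedding, sketched here under the assumption that it is of the Kendall type: map a permutation to the ±1 vector of its pairwise-order comparisons, so the normalised inner product between two embeddings equals the Kendall tau correlation between the rankings.

```python
import itertools
import numpy as np
from scipy.stats import kendalltau

def kendall_embed(sigma):
    """Embed a ranking into R^{n(n-1)/2} via signs of pairwise comparisons."""
    sigma = np.asarray(sigma)
    return np.array([np.sign(sigma[j] - sigma[i])
                     for i, j in itertools.combinations(range(len(sigma)), 2)])

a = [0, 2, 1, 3]
b = [1, 0, 2, 3]
phi_a, phi_b = kendall_embed(a), kendall_embed(b)

n_pairs = len(phi_a)
print(phi_a @ phi_b / n_pairs)    # normalised inner product
print(kendalltau(a, b)[0])        # matches Kendall tau
```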

62H30 ; 62P10 ; 68T05


The term ‘Public Access Defibrillation’ (PAD) refers to programs based on the placement of Automated External Defibrillators (AED) in key locations across a city's territory, together with the development of a training plan for users (first responders). PAD programs are considered necessary since the time available for intervention in cases of sudden cardiac arrest outside of a medical environment (out-of-hospital cardiocirculatory arrest, OHCA) is strongly limited: survival potential decreases from a 67% baseline by 7 to 10% for each minute of delay in first defibrillation. However, it is widely recognized that current PAD performance is largely below its full potential. We provide a Bayesian spatio-temporal statistical model for predicting OHCAs. We then construct a risk map for Ticino, adjusted for demographic covariates, that explains and forecasts the spatial distribution of OHCAs, their temporal dynamics, and how the spatial distribution changes over time. The objective is twofold: to efficiently estimate, in each area of interest, the occurrence intensity of the OHCA event, and to suggest a new optimized distribution of AEDs that accounts for population exposure to the geographic risk of OHCA occurrence and that includes both displacement of current devices and installation of new ones.
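
As a deliberately simplified, non-spatial stand-in for the kind of intensity model described (the talk's model is Bayesian and spatio-temporal), the sketch below fits a Poisson regression of per-area OHCA counts on a demographic covariate, with population as an offset so that coefficients act on a per-person rate. All data are synthetic.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_areas = 150
pop = rng.uniform(500, 20_000, n_areas)          # population per area
frac_over65 = rng.uniform(0.1, 0.35, n_areas)    # demographic covariate

true_rate = 1e-4 * np.exp(2.0 * frac_over65)     # per-person OHCA rate
counts = rng.poisson(pop * true_rate)

X = sm.add_constant(frac_over65)
model = sm.GLM(counts, X, family=sm.families.Poisson(),
               offset=np.log(pop)).fit()         # offset makes it a rate model
print(model.params)                               # ~ [log(1e-4), 2.0]
```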

62F15 ; 62P10 ; 62H11 ; 91B30


In many health studies, interest often lies in assessing health effects on a large set of outcomes or specific outcome subtypes, which may be sparsely observed, even in big data settings. For example, while the overall prevalence of birth defects is not low, the vast heterogeneity in types of congenital malformations leads to challenges in estimation for sparse groups. However, lumping small groups together to facilitate estimation is often controversial and may have limited scientific support.
There is a very rich literature proposing Bayesian approaches for clustering starting with a prior probability distribution on partitions. Most approaches assume exchangeability, leading to simple representations in terms of Exchangeable Partition Probability Functions (EPPF). Gibbs-type priors encompass a broad class of such cases, including Dirichlet and Pitman-Yor processes. Even though there have been some proposals to relax the exchangeability assumption, allowing covariate-dependence and partial exchangeability, limited consideration has been given to how to include concrete prior knowledge on the partition. We wish to cluster birth defects into groups to facilitate estimation, and we have prior knowledge of an initial clustering provided by experts. As a general approach for including such prior knowledge, we propose a Centered Partition (CP) process that modifies the EPPF to favor partitions close to an initial one. Some properties of the CP prior are described, a general algorithm for posterior computation is developed, and we illustrate the methodology through simulation examples and an application to the motivating epidemiology study of birth defects.
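
The defining modification can be written compactly; the following is a hedged rendering consistent with the description above, where $p_0$ is a base EPPF (e.g. Dirichlet- or Pitman-Yor-induced), $\rho_0$ the initial expert partition, $d(\cdot,\cdot)$ a distance between partitions, and $\psi \ge 0$ the strength of the centering:

$p(\rho \mid \rho_0, \psi) \;\propto\; p_0(\rho)\, e^{-\psi\, d(\rho, \rho_0)}, \qquad \psi \ge 0,$

so that $\psi = 0$ recovers the base prior and large $\psi$ concentrates the prior near $\rho_0$.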

62F15 ; 62H30 ; 60G09 ; 60G57 ; 62G05 ; 62P10


Low-dimensional compartment models for biological systems can be fitted to time series data using Monte Carlo particle filter methods. As dimension increases, for example when analyzing a collection of spatially coupled populations, particle filter methods rapidly degenerate. We show that many independent Monte Carlo calculations, each of which does not attempt to solve the filtering problem, can be combined to give a global filtering solution with favorable theoretical scaling properties under a weak coupling condition. The independent Monte Carlo calculations are called islands, and the operation carried out on each island is called adapted simulation, so the complete algorithm is called an adapted simulation island filter. We demonstrate this methodology and some related algorithms on a model for measles transmission within and between cities.
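
For readers unfamiliar with the baseline, here is a minimal bootstrap particle filter for a toy one-dimensional state-space model; its weight degeneracy as dimension grows is exactly what motivates island-type constructions. This is not the adapted simulation island filter itself, and the model and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 500                        # time steps, particles
x_true = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):                 # latent AR(1) dynamics, noisy observations
    x_true[t] = 0.9 * x_true[t - 1] + rng.normal(0, 1)
    y[t] = x_true[t] + rng.normal(0, 0.5)

particles = rng.normal(0, 1, N)
filt_mean = np.zeros(T)
for t in range(1, T):
    particles = 0.9 * particles + rng.normal(0, 1, N)     # propagate
    logw = -0.5 * ((y[t] - particles) / 0.5) ** 2          # Gaussian likelihood
    w = np.exp(logw - logw.max())
    w /= w.sum()
    filt_mean[t] = w @ particles
    particles = rng.choice(particles, size=N, p=w)         # multinomial resample

print("mean filtering error:", np.mean(np.abs(filt_mean - x_true)))
```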

60G35 ; 60J20 ; 62M02 ; 62M05 ; 62M20 ; 62P10 ; 65C35


The highly influential two-group model in testing a large number of statistical hypotheses assumes that the test statistics are drawn independently from a mixture of a high probability null distribution and a low probability alternative. Optimal control of the marginal false discovery rate (mFDR), in the sense that it provides maximal power (expected true discoveries) subject to mFDR control, is known to be achieved by thresholding the local false discovery rate (locFDR), i.e., the probability of the hypothesis being null given the set of test statistics, with a fixed threshold.
We address the challenge of optimally controlling the popular false discovery rate (FDR) or positive FDR (pFDR), rather than the mFDR, in the general two-group model, which also allows for dependence between the test statistics. These criteria are less conservative than the mFDR criterion, so they make more rejections in expectation.
We derive their optimal multiple testing (OMT) policies, which turn out to threshold the locFDR with a threshold that is a function of the entire set of statistics. We develop an efficient algorithm for finding these policies, and use it for problems with thousands of hypotheses. We illustrate these procedures on gene expression studies.
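
A hedged sketch of the classical two-group baseline described in the first paragraph: with a $N(0,1)$ null of probability $\pi_0$ and a $N(\mu,1)$ alternative, compute $\mathrm{locFDR}(z) = \pi_0 f_0(z)/f(z)$ and reject where it falls below a fixed threshold. Numbers are illustrative; the OMT policies of the talk instead use a threshold depending on the entire set of statistics.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
pi0, mu, m = 0.9, 3.0, 10_000
is_null = rng.uniform(size=m) < pi0
z = np.where(is_null, rng.normal(0, 1, m), rng.normal(mu, 1, m))

f0 = norm.pdf(z)                       # null density
f1 = norm.pdf(z, loc=mu)               # alternative density
locfdr = pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)

reject = locfdr < 0.2                  # fixed threshold, mFDR-style rule
print("rejections:", reject.sum(), "false:", (reject & is_null).sum())
```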

62F03 ; 62J15 ; 62P10


Inferring causal effects of a treatment or policy from observational data is central to many applications. However, state-of-the-art methods for causal inference suffer when covariates have missing values, which is ubiquitous in applications.
Missing data greatly complicate causal analyses, as they either require strong assumptions about the missing-data generating mechanism or an adapted unconfoundedness hypothesis. In this talk, I will first provide a classification of existing methods according to their main underlying assumptions, which are based either on variants of the classical unconfoundedness assumption or on assumptions about the mechanism that generates the missing values. Then, I will present two recent contributions on this topic: (1) an extension of doubly robust estimators that allows handling of missing attributes, and (2) an approach to causal inference based on variational autoencoders adapted to incomplete data.
I will illustrate the topic on an observational medical database, which has heterogeneous data and a multilevel structure, to assess the impact of the administration of a treatment on survival.
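
As a hedged sketch of the standard doubly robust (AIPW) estimator on complete data, the starting point that contribution (1) extends to missing attributes: the sklearn models below stand in for arbitrary nuisance fits, and all data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))
propensity = 1 / (1 + np.exp(-X[:, 0]))                  # true treatment model
T = rng.uniform(size=n) < propensity
y = X @ np.array([1.0, 0.5, -0.5]) + 2.0 * T + rng.normal(size=n)  # effect = 2

e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]   # propensity model
m1 = LinearRegression().fit(X[T], y[T]).predict(X)           # outcome if treated
m0 = LinearRegression().fit(X[~T], y[~T]).predict(X)         # outcome if control

# AIPW: consistent if either the outcome or the propensity model is correct
tau_aipw = np.mean(m1 - m0
                   + T * (y - m1) / e
                   - (~T) * (y - m0) / (1 - e))
print(tau_aipw)   # close to the true effect of 2
```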

62P10 ; 62H12 ; 62N99


Multiple testing problems are a staple of modern statistics. The fundamental objective is to reject as many false null hypotheses as possible, subject to controlling an overall measure of false discovery, like family-wise error rate (FWER) or false discovery rate (FDR). We formulate multiple testing of simple hypotheses as an infinite-dimensional optimization problem, seeking the most powerful rejection policy which guarantees strong control of the selected measure. We show that for exchangeable hypotheses, for FWER or FDR and relevant notions of power, these problems lead to infinite programs that can provably be solved. We explore maximin rules for complex alternatives, and show they can be found in practice, leading to improved practical procedures compared to existing alternatives. We derive explicit optimal tests for FWER or FDR control for three independent normal means. We find that the power gain over natural competitors is substantial in all settings examined. We apply our optimal maximin rule to subgroup analyses in systematic reviews from the Cochrane library, leading to an increased number of findings compared to existing alternatives.

62F03 ; 62J15 ; 62P10


In high-dimensional sparse regression, pivotal estimators are estimators for which the optimal regularization parameter is independent of the noise level. The canonical pivotal estimator is the square-root Lasso, formulated along with its derivatives as a "non-smooth + non-smooth" optimization problem.
Modern techniques to solve these include smoothing the data-fitting term, to benefit from fast and efficient proximal algorithms.
In this work we focus on minimax sup-norm convergence rates for non-smoothed and smoothed, single-task and multi-task square-root Lasso-type estimators. We also provide some guidelines on how to set the smoothing hyperparameter, and illustrate on synthetic data the interest of such guidelines.
This is joint work with Quentin Bertrand (INRIA), Mathurin Massias, Olivier Fercoq and Alexandre Gramfort.
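
A hedged sketch of the square-root Lasso objective itself, solved with a generic convex solver rather than the smoothed proximal algorithms studied in this work: minimise $\|y - X\beta\|_2/\sqrt{n} + \lambda \|\beta\|_1$, with a pivotal choice of $\lambda$ that does not involve the noise level. Data are synthetic.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 100, 200, 5
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:s] = 1.0
y = X @ beta_true + 0.5 * rng.normal(size=n)

b = cp.Variable(p)
lam = 1.1 * np.sqrt(2 * np.log(p) / n)   # a common pivotal (sigma-free) choice
objective = cp.norm2(y - X @ b) / np.sqrt(n) + lam * cp.norm1(b)
cp.Problem(cp.Minimize(objective)).solve()

print("support recovered:", np.flatnonzero(np.abs(b.value) > 0.1))
```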

62J05 ; 62J12 ; 62P10
