Quick instructions to use FDist2 for detecting null alleles: a follow-up to "A Robust Statistical Method to Detect Null Alleles in Microsatellite and SNP Datasets in Both Panmictic and Inbred Populations" by Girard (2011)
This is not a published paper. It was written to explain how to use the method I described in "A Robust Statistical Method to Detect Null Alleles in Microsatellite and SNP Datasets in Both Panmictic and Inbred Populations".
A Robust Statistical Method to Detect Null Alleles in Microsatellite and SNP Datasets in Both Panmictic and Inbred Populations
Statistical Applications in Genetics and Molecular Biology. 2011. 10(1) Article 9.
Consult the follow-up I made available on my profile (see Papers). It explains quickly and simply how to perform the analysis I described in this paper.
Assessment of power and accuracy of methods developed for detection and frequency-estimation of null alleles.
Co-authored with Bernard Angers.
Genetica 2008. 134(2):187-197.
Checking proportional hazards assumption for covariates in COX proportional hazards model
by Pankaj kumar
A graphical approach for checking the proportionality assumption of the Cox proportional hazards model, as presented in a research paper, is discussed and implemented in SAS.
Microarray Data Analysis - MSc Thesis
MSc Thesis
Supervised by Professor Asoke K. Nandi,
Department of Electrical Engineering and Electronics
University of Liverpool
UK
Microarrays provide enormous amounts of raw genetic data which need to be analyzed. This has encouraged information engineers to collaborate in the analysis of microarrays in order to enhance biology and medicine. The main objective of this research is to extract as much useful biological information as possible by analyzing two microarray data sets: the Stanford University yeast cell-cycle data set and the University of Oxford haemoglobin data set.
Both data sets were investigated by gene clustering whose robustness was enhanced by proposing a new original ensemble clustering method named: Consensus Fuzzy Partition Matrix Binarization (CFPMB). This method relaxes conventional clustering constraints by allowing genes to be assigned to multiple clusters simultaneously or not to be assigned at all. With this property, special genes can be identified to be considered for further analysis.
By using MATLAB, 4 clustering methods were applied over both data sets with many different configurations. 5 validation indices were used to compare and filter the results on which the CFPMB method was applied to generate consensus partitions.
The 384 genes of the yeast data set were grouped into 5 clusters, as the biologists suggested. 3 clusters agreed with the suggestions of the biologists, 1 cluster combined 2 biologically suggested groups, and 1 cluster contained fewer than a dozen oddly expressed genes. Different CFPMB experiments identified up to 40 special genes that are subject to further analysis.
For the haemoglobin data set, Probe Ids (PIDs) are used instead of genes. 34 PIDs are considered significant out of the biologically interesting 50 PIDs picked from the list of 16387 PIDs. The 34 PIDs were grouped into 4 clusters, 2 of which showed statistical significance. The PIDs of these 2 clusters, when combined, can be regrouped according to their biological functions into 2 well-defined groups. This important finding suggests that these 2 biological functions work in collaboration, which was confirmed by the biologists supporting the validity of the proposed approach.
Keywords: Microarrays, yeast cell-cycle, haemoglobin molecules biosynthesis, gene clustering, ensemble clustering, consensus fuzzy partition matrix binarization (CFPMB), k-means, SOMs, hierarchical clustering, SOON clustering, clustering validation indices.
Research Project Interim Report - The Analysis of Microarray Data (DEMO version)
Demo Version
This interim report presents the current progress of the MSc project “The Analysis of Microarray Data”. The project is divided into two independent subprojects, each considering a different microarray data set: the University of Stanford yeast cell-cycle microarray data set and the University of Oxford hemoglobin molecules data set. The main objective of analyzing the first data set is gene discovery through clustering the genes according to the cell-cycle stages at which they show peaks. The second data set analysis aims at identifying the optimal subset of genes that directly influence the target biological function and then discovering their relative patterns.
With MATLAB as a programming language and tool, 4 clustering techniques have been applied over the first data set with 19 different configurations. The clustering results were validated using 5 indices, then a shortlist of 11 configurations was extracted. A novel method, called “combined fuzzy partition matrix formulation and binarization”, is proposed in this research to combine different clustering results. The proposed method has succeeded in reaching a deeper level of understanding of the shortlisted clustering results. Investigating the new method and its originality has been added as an additional theoretical objective of the project.
Currently, the analysis of the first data set is finished except for comparing the results and finalizing the conclusions. The next phase consists of further investigation of the proposed method and undertaking the analysis of the second data set.
Regularization for Cox's Proportional Hazards Model With NP-Dimensionality
Co-authored with Jianqing Fan and Jiancheng Jiang
To appear at 'The Annals of Statistics'.
High-throughput genetic sequencing arrays with thousands of measurements per sample and a large amount of related censored clinical data have increased the need for better measurement-specific model selection. In this paper we establish strong oracle properties of non-concave penalized methods for non-polynomial (NP) dimensional data with censoring in the framework of Cox's proportional hazards model. A class of folded-concave penalties is employed, and both LASSO and SCAD are discussed specifically. We address the question of under which dimensionality and correlation restrictions an oracle estimator can be constructed and grasped. It is demonstrated that non-concave penalties lead to significant reduction of the "irrepresentable condition" needed for LASSO model selection consistency. The large deviation result for martingales, of interest in its own right, is developed for characterizing the strong oracle property. Moreover, the non-concave regularized estimator is shown to achieve asymptotically the information bound of the oracle estimator. A coordinate-wise algorithm is developed for finding the grid of solution paths for penalized hazard regression problems, and its performance is evaluated on simulated and gene association study examples.
Joint Modeling of HCV and HIV Infections among Injecting Drug Users in Italy Using Repeated Cross-Sectional Prevalence Data
Del Fava, Emanuele; Kasim, Adetayo; Usman, Muhammad; Shkedy, Ziv; Hens, Niel; Aerts, Marc; Bollaerts, Kaatje; Scalia Tomba, Gianpaolo; Vickerman, Peter; Sutton, Andrew J.; Wiessing, Lucas; and Kretzschmar, Mirjam (2011) "Joint Modeling of HCV and HIV Infections among Injecting Drug Users in Italy Using Repeated Cross-Sectional Prevalence Data," Statistical Communications in Infectious Diseases: Vol. 3: Iss. 1, Article 1.
DOI: 10.2202/1948-4690.1009
During their injecting career, injecting drug users (IDUs) are exposed to some infections, like hepatitis C virus (HCV) infection and human immunodeficiency virus (HIV) infection, due to their injecting behavioral risk factors, such as sharing syringes or other paraphernalia containing infected blood, or sexual behavior risk factors. If we consider that these IDUs might belong to a social network of people where these behavioral risk factors are spread, then HCV and HIV infections might be associated at both the individual and the population level. In this paper, we study the association between HCV and HIV infection at the population level using aggregate data. Our aim is to define a hierarchy of structured models with which the association between HCV and HIV infection at population level and the time trend of prevalence can be investigated. The data analyzed in the paper are “diagnostic testing data,” which consist of repeated cross-sectional prevalence measurements from 1998 to 2006 for HCV and HIV infection, obtained from a sample of 515 drug treatment centers spread among the 20 regions in Italy, where subjects went for a serum diagnostic test. Since we do not have any individual data, it is not possible to relate these prevalence data to socio-demographic or behavioral risk data. Each region defines a cluster with repeated prevalence data for HCV and HIV infection over time. Several modeling approaches, such as generalized linear mixed models (GLMMs) and hierarchical Bayesian models are applied to the data. First, we test different covariance structures for the region-specific random effects in the GLMM context; second, a hierarchical Bayesian model is used to refit the best GLMM in order to obtain the posterior distribution for the parameters of primary interest.
We found that the correlation at population level between HCV and HIV is approximately 0.68 and the prevalence of the two infections generally decreased over the years, compared to the situation in 1998.
Cluster Analysis of Fish Community Data: New Tools for Determining Meaningful Groupings of Sites and Species Assemblages. In Gido K. and Jackson D.A. (eds.) Community Ecology of Stream Fishes: Concepts, Approaches and Techniques, American Fisheries Society, Bethesda, MD.
Co-authored with D.A. Jackson (lead), and Steve Walker.
Community ecologists face the challenge of summarizing considerable amounts of information regarding species distributions and environmental conditions. Often this challenge is met through the use of multivariate statistical approaches. Stream fish community ecologists, much like the broader ecological community, appear to favor the use of ordination methods over clustering approaches. One potential reason is the development of various tools to help us determine the interpretability or “significance” of ordination axes, whereas ecologists appear unfamiliar with the comparable tools available for examining cluster analysis. We use fish abundance data from two river systems to demonstrate several of these approaches. We demonstrate how the methods may be used to determine the relative strength of groups of sampling locations and species assemblages relative to the background variability. We contrast the methods to demonstrate their relative merits, both advantages and disadvantages, in studies commonly conducted by stream ecologists.
Functional rarefaction: estimating functional diversity from field data. Oikos 117: 286-292.
Co-authored with Steve Walker (lead) and Donald A. Jackson (University of Toronto)
Studies in biodiversity-ecosystem function and conservation biology have led to the development of diversity indices that take species’ functional differences into account. We identify two broad classes of indices: those that monotonically increase with species richness (MSR indices) and those that weight the contribution of each species by abundance or occurrence (weighted indices). We argue that weighted indices are easier to estimate without bias but tend to ignore information provided by rare species. Conversely, MSR indices fully incorporate information provided by rare species but are nearly always underestimated when communities are not exhaustively surveyed. This is because of the well-studied fact that additional sampling of a community may reveal previously undiscovered species. We use the rarefaction technique from species richness studies to address sample-size-induced bias when estimating functional diversity indices. Rarefaction transforms any given MSR index into a family of unbiased weighted indices, each with a different level of sensitivity to rare species. Thus rarefaction simultaneously solves the problem of bias and the problem of sensitivity to rare species. We present formulae and algorithms for conducting a functional rarefaction analysis of the two most widely cited MSR indices: functional attribute diversity (FAD) and Petchey and Gaston’s functional diversity (FD). These formulae also demonstrate a relationship between three seemingly unrelated functional diversity indices: FAD, FD and Rao’s quadratic entropy. Statistical theory is also provided in order to prove that all desirable statistical properties of species richness rarefaction are preserved for functional rarefaction.
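For readers unfamiliar with the rarefaction technique from species richness studies that this abstract builds on, the following is a minimal sketch of classical individual-based rarefaction of species richness (the hypergeometric formula usually attributed to Hurlbert). It is not the functional rarefaction of FAD or FD developed in the paper; the function name `rarefied_richness` and the example abundances are hypothetical and chosen only for illustration.

```python
from math import comb


def rarefied_richness(abundances, n):
    """Expected number of species in a random subsample of n individuals.

    Classical species-richness rarefaction (an illustrative stand-in, not the
    paper's functional rarefaction): for each species with abundance a in a
    sample of N individuals, the probability that it is missing from a
    subsample of size n drawn without replacement is C(N - a, n) / C(N, n).
    """
    total_individuals = sum(abundances)
    if not 0 <= n <= total_individuals:
        raise ValueError("n must lie between 0 and the total number of individuals")
    all_subsamples = comb(total_individuals, n)
    # Sum, over species, the probability that the species appears at least once.
    return sum(
        1 - comb(total_individuals - a, n) / all_subsamples
        for a in abundances
        if a > 0
    )


if __name__ == "__main__":
    # Hypothetical abundances for four species (illustration only).
    sample = [5, 3, 1, 1]
    for size in (1, 3, 6, 10):
        print(size, rarefied_richness(sample, size))
```

Evaluating the function over a range of subsample sizes gives a rarefaction curve; richness estimated this way is comparable across samples of different sizes, which is the property the abstract says is preserved for the functional indices.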
Functional-diversity indices can be driven by methodological choices and species richness. Ecology 90: 341-346.
Co-authored with Steve Walker and Donald A. Jackson
Functional diversity is an important concept in community ecology because it captures information on functional traits absent in measures of species diversity. One popular method of measuring functional diversity is the dendrogram-based method, FD. To calculate FD, a variety of methodological choices are required, and it has been debated whether biological conclusions are sensitive to such choices. We studied the probability that conclusions regarding FD were sensitive, and that patterns in sensitivity were related to alpha and beta components of species richness. We developed a randomization procedure that iteratively calculated FD by assigning species into two assemblages and calculating the probability that the community with higher FD varied across methods. We found evidence of sensitivity in all five communities we examined, ranging from a probability of sensitivity of 0 (no sensitivity) to 0.976 (almost completely sensitive). Variations in these probabilities were driven by differences in alpha diversity between assemblages and not by beta diversity. Importantly, FD was most sensitive when it was most useful (i.e., when differences in alpha diversity were low). We demonstrate that trends in functional-diversity analyses can be largely driven by methodological choices or species richness, rather than functional trait information alone.
Boredom-proneness, loneliness, social engagement and depression and their association with cognitive function in older people: a population study
by Ronan Conroy
In this study, we use data from a population survey of persons aged 65 and over living in the Irish Republic to examine the relationship of cognitive impairment, assessed using the Abbreviated Mental Test, with loneliness, boredom-proneness, social relations, and depression. Participants were randomly selected community-dwelling Irish people aged 65+ years. An Abbreviated Mental Test score of 8 or 9 out of 10 was classified as 'low normal', and a score of less than 8 as 'possible cognitive impairment'. We used clustering around latent variables analysis (CLV) to identify families of variables associated with reduced cognitive function. The overall prevalence of possible cognitive impairment was 14.7% (95% CI 12.4-17.3%). Low normal scores had a prevalence of 30.5% (95% CI 27.2-33.7%). CLV analysis identified three groups of predictors: 'Low social support' (widowed, living alone, low social support), 'personal cognitive reserve' (low social activity, no leisure exercise, never having married, loneliness and boredom-proneness), and 'sociodemographic cognitive reserve' (primary education, rural domicile). In multivariate analysis, both cognitive reserve clusters, but not social support, were independently associated with cognitive function. Loneliness and boredom-proneness are associated with reduced cognitive function in older age, and cluster with other factors associated with cognitive reserve. Both may have a common underlying mechanism in the failure to select and maintain attention on particular features of the social environment (loneliness) or the non-social environment (boredom-proneness).
Fast Fitting of Joint Models for Longitudinal and Event Time Data using a Pseudo-Adaptive Gaussian Quadrature Rule
Rizopoulos, D. (2012). Computational Statistics and Data Analysis 56, 491-501.
Joint models for longitudinal and time-to-event data have recently attracted a lot of attention in the statistics and biostatistics literature. Even though these models enjoy a wide range of applications in many different statistical fields, they have not yet found their rightful place in the toolbox of modern applied statisticians, mainly because they are rather computationally intensive to fit. The main difficulty arises from the requirement for numerical integration with respect to the random effects. This integration is typically performed using Gaussian quadrature rules whose computational complexity increases exponentially with the dimension of the random-effects vector. In this paper we offer a solution to this problem by basing the fit of the model on a pseudo-adaptive Gauss-Hermite rule. The idea behind this rule is to use information about the shape of the integrand by separately fitting a mixed model for the longitudinal outcome. Simulation studies have shown that the pseudo-adaptive rule performs excellently in practice, and is considerably faster than the standard Gauss-Hermite rule.
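To make the integration problem in this abstract concrete, here is a minimal sketch of the standard (non-adaptive) Gauss-Hermite rule for an expectation over a single normal random effect. It does not implement the pseudo-adaptive rule proposed in the paper; the helper names `gauss_hermite_expectation` and `tensor_grid_size` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def gauss_hermite_expectation(g, mu=0.0, sigma=1.0, n_nodes=15):
    """Approximate E[g(b)] for b ~ N(mu, sigma^2) with a Gauss-Hermite rule.

    This is the standard rule only; the pseudo-adaptive recentring and
    rescaling described in the abstract is not implemented here.
    """
    # Nodes and weights for integrals against the weight function exp(-x^2).
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # The change of variables b = mu + sqrt(2) * sigma * x turns the normal
    # density into exp(-x^2) / sqrt(pi).
    b = mu + np.sqrt(2.0) * sigma * nodes
    return float(np.sum(weights * g(b)) / np.sqrt(np.pi))


def tensor_grid_size(n_nodes, dim):
    """Number of integrand evaluations for a full tensor-product rule."""
    return n_nodes ** dim


if __name__ == "__main__":
    # Second moment of a standard normal random effect.
    print(gauss_hermite_expectation(lambda b: b ** 2))
    # Grid size grows exponentially with the number of random effects.
    for dim in (1, 2, 3, 4):
        print(dim, tensor_grid_size(15, dim))
```

The second helper illustrates the cost the abstract refers to: with a fixed number of nodes per dimension, a full tensor-product grid multiplies the number of integrand evaluations by that number for every additional random effect.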
Warm, wet weather associated with increased Legionnaires’ disease incidence in the Netherlands
It has been suggested that warm and humid weather is related to a high incidence of Legionnaires' disease (LD), but no data on this association existed in The Netherlands. The objective of this study was to investigate the short-term effects of the weather on LD in The Netherlands. National LD surveillance and meteorological data were obtained. We analysed the data using Poisson regression, adjusting for long-term trends, and using principal components analysis. The highest weekly incidence of LD occurred when the mean weekly temperature was +17.5°C. Mean weekly relative humidity, temperature and precipitation intensity were associated with LD incidence in the multivariable model. Warm, humid and showery summer weather was found to be associated with higher incidence of LD in The Netherlands. These results may be used to predict an increase in the number of cases of LD in The Netherlands during the summer.
POINT ESTIMATES AND CONFIDENCE INTERVALS: SUMMARY OF RESEARCH RESULTS
Published in Revista De La Facultad Ciencias De La Salud de la Universidad del Cauca.
Popayán, Colombia
Dynamic Predictions and Prospective Accuracy in Joint Models for Longitudinal and Time-to-Event Data
Rizopoulos, D. (2011). Biometrics 67, 819-829.
In longitudinal studies it is often of interest to investigate how a marker that is repeatedly measured in time is associated with a time to an event of interest. This type of research question has given rise to a rapidly developing field of biostatistics research that deals with the joint modeling of longitudinal and time-to-event data. In this paper we consider this modeling framework and focus particularly on the assessment of the predictive ability of the longitudinal marker for the time-to-event outcome. In particular, we start by presenting how survival probabilities can be estimated for future subjects based on their available longitudinal measurements and a fitted joint model. Next, we derive accuracy measures under the joint modeling framework and assess how well the marker discriminates between subjects who experience the event within a medically meaningful time frame and subjects who do not. We illustrate our proposals on a real data set on HIV-infected patients, for whom we are interested in predicting the time to death using their longitudinal CD4 cell count measurements.
A Two-Part Joint Model for the Analysis of Survival and Longitudinal Binary Data with Excess Zeros
Rizopoulos, D., Verbeke, G., Lesaffre, E. and Vanrenterghem, Y. (2008). Biometrics 64, 611-619.
Many longitudinal studies generate both the time to some event of interest and repeated measures data. This article is motivated by a study on patients with a renal allograft, in which interest lies in the association between longitudinal proteinuria (a dichotomous variable) measurements and the time to renal graft failure. An interesting feature of the sample at hand is that nearly half of the patients never tested positive for proteinuria (≥1g/day) during follow-up, which introduces a degenerate part in the random-effects density for the longitudinal process. In this article we propose a two-part shared parameter model framework that effectively takes this feature into account, and we investigate sensitivity to the various dependence structures used to describe the association between the longitudinal measurements of proteinuria and the time to renal graft failure.
Shared Parameter Models under Random Effects Misspecification
Rizopoulos, D., Verbeke, G. and Molenberghs, G. (2008). Biometrika 95, 63-74.
A common objective in longitudinal studies is the investigation of the association structure between a longitudinal response process and the time to an event of interest. An attractive paradigm for the joint modelling of longitudinal and survival processes is the shared parameter framework where a set of random effects is assumed to induce their interdependence. In this work, we propose an alternative parameterization for shared parameter models and investigate the effect of misspecifying the random effects distribution on the parameter estimates and their standard errors.
