clusterProfiler: an R Package for Comparing Biological Themes Among Gene Clusters
The increasing volume of quantitative data generated from transcriptomics and proteomics requires integrative strategies for analysis. Here, we present an R package, clusterProfiler, that automates the process of biological-term classification and the enrichment analysis of gene clusters. The analysis module and visualization module were combined into a reusable workflow. Currently, clusterProfiler supports three species: humans, mice, and yeast. Methods provided in this package can be easily extended to other species and ontologies. The clusterProfiler package is released under the Artistic-2.0 License within the Bioconductor project. The source code and vignette are freely available at http://bioconductor.org/packages/release/bioc/html/clusterProfiler.html
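As a rough illustration of what an enrichment analysis of a gene cluster computes, the sketch below assumes the common hypergeometric over-representation formulation. The function name, argument names, and example counts are hypothetical and are not taken from the package itself.

```python
from math import comb

def enrichment_pvalue(hits, cluster_size, annotated, background):
    """P(X >= hits) under a hypergeometric model (assumed formulation).

    hits         -- genes in the cluster that carry the term
    cluster_size -- number of genes in the cluster
    annotated    -- genes in the background that carry the term
    background   -- total number of background genes
    """
    total = comb(background, cluster_size)
    upper = min(cluster_size, annotated)
    # Sum the probability of observing `hits` or more annotated genes by chance.
    tail = sum(
        comb(annotated, i) * comb(background - annotated, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical counts: 8 of 20 cluster genes carry a term that
# annotates 50 of 1000 background genes.
p = enrichment_pvalue(8, 20, 50, 1000)
```

A small p-value here suggests the term is over-represented in the cluster; in practice such values would also need multiple-testing correction across terms.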
Multiscale integration for spatio-temporal ecoclimatic ecoregioning delineation
This is the abstract presented at IGARSS 2008.
See also the paper:
Leibovici, DG and Jackson, M (2011) Multi-scale Integration for Spatio-Temporal Ecoregioning Delineation. International Journal of Image and Data Fusion, 2(2): 105-119
Non-negative Matrix Factorization: Assessing Methods for Evaluating the Number of Components, and the Effect of Normalization Thereon
by Jose Maisog
2009 Master's thesis; Karthik Devarajan and George Luta, advisors
Non-negative matrix factorization (NMF) is a relatively new method of matrix decomposition which factors an m by n data matrix X into an m by k matrix W and a k by n matrix H, so that X = W * H. Importantly, all values in X, W, and H are constrained to be non-negative. NMF can be used for dimensionality reduction, since the k columns of W can be considered components or latent "parts" into which X has been decomposed. The question arises: how does one choose k? In this thesis, we assess multiple methods for estimating the number of components k in the context of NMF, and we also examine the effects of various types of normalization on this estimate. We conclude that when estimating k, it is best not to perform any normalization. If it is known or assumed that the underlying components are orthogonal or nearly so, then perhaps Velicer's MAP or Minka's Laplace-PCA method might be best to use. However, in the general case where it is unknown whether the underlying components are orthogonal or not, none of the methods for estimating k seemed obviously better than the others.
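The factorization of X into W and H can be sketched as follows. This is a minimal illustration that assumes the Lee-Seung multiplicative update rules for the Frobenius-norm objective; the thesis does not state which NMF algorithm it used, and the helper names and the small example matrix are hypothetical.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(X, W, H):
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry)) ** 0.5

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative m x n matrix X into W (m x k) and H (k x n)."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    # Strictly positive random start keeps every later update non-negative.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, X)
        den = matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(X, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return W, H

# Hypothetical 4 x 3 data matrix, factored with k = 2 components.
X = [[1, 2, 3], [2, 4, 6], [1, 0, 1], [3, 2, 5]]
W, H = nmf(X, k=2)
```

Running this for several values of k and comparing `frobenius_error(X, W, H)` is one naive way to explore the choice of k; the methods assessed in the thesis are more principled than that.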
The Significance of Linear Trends and Clusters of Fault-Related Mesothermal Lode Gold Mineralization
Gold deposits from the Laverton region, Western Australia, form both linear trends and cluster-like distributions in map view, associated with regional-scale shear zone systems. Clusters of deposits are common, and they form 5 × 5 km domains, spaced every 15 to 20 km. In contrast, there is also a linear trend of mineralization that can be traced continuously for ~20 km along a shear zone. In detail, mineralization is found in a range of different rock types and is hosted in fault-shear zone networks that are smaller than the adjacent regional-scale shear zones by several orders of magnitude. A mixture of different fault types makes up these mineralized fault-shear networks, often within the same gold field, although thrust and reverse faults are a common component of the richer deposits (e.g., Sunrise Dam, Wallaby, Granny Smith).
The seismogenic behavior of fault systems during mineralization can explain these observations, illustrated by simple stress transfer models that show the location of aftershock distributions around different fault types. Strike-slip ruptures on regional-scale structures can repeatedly trigger restricted clusters of aftershocks. These aftershock clusters describe vertical, pipe-like domains in three dimensions. In contrast, regional-scale thrust or reverse ruptures trigger continuous, more diffuse domains of aftershocks along the full strike of the rupture. Thus ruptures along regional strike-slip structures are able to trigger ruptures on preexisting thrust faults-shears at the deposit scale. When they do so, the thrust aftershocks generate a vertical domain of elevated permeability over the depth of the crust, where the individual flow paths are likely to be highly tortuous, increasing the potential for fluid-rock interaction or fluid mixing. The existence of linear trends and cluster-like distributions of deposits at Laverton may be consistent with two different ages of mineralization in this gold camp. The results also show that distributions of deposits provide an indication of the kinematics of the regional fault systems that they relate to, and that stress transfer modeling is a simple, non-computationally demanding technique that is easily adapted to exploration for fault-related mineral deposits.
Hierarchical Agglomerative Clustering of English-Bulgarian Parallel Corpora
R. Alfred, E. Paskaleva, D. Kazakov, M. Bartlett, et al. 2007. Hierarchical Agglomerative Clustering of English-Bulgarian Parallel Corpora. RANLP: 24-29.
Multilingual parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the hierarchical agglomerative clustering (HAC) technique to cluster multilingual parallel text from web content. A clustering algorithm taking constraints from parallel corpora potentially has several attractive features. Firstly, training samples in another language provide indirect evidence for a classification or clustering result. Secondly, constraints from both languages may help to eliminate some biased language-specific usages, resulting in classes of better quality. Finally, the alignment between pairs of clustered documents can be used to extract words from each language, which may then be used for other applications. As an example, in this paper we utilise these words for term reduction. We explain the findings that we obtain from the clustering of a significant parallel corpus for a pair of high-density and low-density languages, English and Bulgarian. Preliminary results show that the HAC algorithm can effectively cluster bilingual parallel corpora separately and still produce the same extracted words that best describe these clusters for both English and Bulgarian corpora.
Fuzzy Cluster Analysis of Shipping Accidents in the Bosporus
European Journal of Navigation, Vol. 6, No:3, 32-41
This paper introduces a fuzzy cluster analysis to examine shipping accidents that occurred in the Bosporus in relation to accident types and the localities wherein incidents mostly occur. Fuzzy cluster analysis allows gradual memberships of data points to clusters. This gives the flexibility to express data points belonging to more than one cluster at the same time. Furthermore, these membership degrees offer a much finer degree of detail of the data model. In order to understand the effect of the key factors causing shipping accidents in the Strait of Istanbul, or Bosporus, two different and well-known fuzzy clustering algorithms are employed. The first algorithm is the fuzzy C-means (FCM), which recognizes a known or given number of c hyper-spherical clouds of points in a given p-dimensional data set. In the second algorithm, called Gustafson-Kessel (GK), the cluster prototypes are endowed with a fuzzy covariance matrix in addition to the center vectors for detecting ellipsoidal clusters. The data set of this study consists of the locations and types of shipping accidents in the Strait of Istanbul between 1982 and 2005. Outcomes of the FCM and the GK algorithms with different cluster numbers are compared and reported accordingly.
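The gradual memberships described above can be sketched with a bare-bones fuzzy C-means loop. This is an illustration only: it assumes the textbook FCM updates with fuzzifier m = 2 and Euclidean distance, takes the initial centers as an argument, and uses made-up coordinates rather than the accident data. The GK variant with covariance matrices is not shown.

```python
import math

def fuzzy_c_means(points, centers, m=2.0, iters=50):
    """Alternate membership and center updates; returns (centers, memberships)."""
    power = 2.0 / (m - 1.0)
    U = []
    for _ in range(iters):
        # Membership update: U[j][i] is the degree to which point j
        # belongs to cluster i; each row sums to 1.
        U = []
        for p in points:
            d = [max(math.dist(p, c), 1e-12) for c in centers]
            U.append([1.0 / sum((di / dk) ** power for dk in d) for di in d])
        # Center update: mean of all points weighted by membership ** m.
        centers = [
            tuple(
                sum(U[j][i] ** m * points[j][a] for j in range(len(points)))
                / sum(U[j][i] ** m for j in range(len(points)))
                for a in range(len(points[0]))
            )
            for i in range(len(centers))
        ]
    return centers, U

# Hypothetical 2-D locations forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, U = fuzzy_c_means(points, centers=[(0, 0), (11, 10)])
```

Each row of `U` gives one point's membership degrees across the clusters, so a point lying between two groups would receive intermediate values rather than a hard assignment.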
A fuzzy clustering-based hybrid method for a multi-facility location problem
J Intell Manuf (2009) 20:259–265
A fuzzy clustering-based hybrid method for a multi-facility location problem is presented in this study. It is assumed that the capacity of each facility is unlimited. The method uses different approaches sequentially. Initially, customers are grouped by spherical and elliptical fuzzy cluster analysis methods with respect to their geographical locations. Different numbers of clusters are tried. Then facilities are located at the proposed cluster centers. Finally, each cluster is solved as a single facility location problem. The center of gravity method, which optimizes transportation costs, is employed to fine-tune the facility location. In order to compare the logistical performance of the method, real-world data are gathered. Results of existing and proposed locations are reported.
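The fine-tuning step can be illustrated with the usual center of gravity formula, a demand-weighted average of customer coordinates. The sketch assumes that formulation (which minimizes the weighted sum of squared Euclidean distances); the function name, the coordinates, and the use of demand volumes as weights are hypothetical.

```python
def center_of_gravity(locations, weights):
    """Weighted centroid of (x, y) customer locations."""
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(locations, weights)) / total
    y = sum(w * py for (_, py), w in zip(locations, weights)) / total
    return x, y

# Hypothetical cluster of two customers; the second has three times the demand.
site = center_of_gravity([(0, 0), (10, 0)], [1, 3])
```

The resulting site is pulled toward the customer with the larger weight, which is the intended effect when transport cost scales with shipped volume.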
Movie segmentation into scenes and chapters using locally weighted bag of visual words
Movie segmentation into semantically correlated units is a quite tedious task due to the "semantic gap". Low-level features do not provide useful information about the semantic correlation between shots and usually fail to detect scenes with constantly dynamic content. In the method we propose herein, local invariant descriptors are used to represent the key-frames of video shots, and a visual vocabulary is created from these descriptors, resulting in a visual words histogram representation (bag of visual words) for each shot. A key aspect of our method is that, based on an idea from text segmentation, the histograms of visual words corresponding to each shot are further smoothed temporally by taking into account the histograms of neighboring shots. In this way, valuable contextual information is preserved. The final scene and chapter boundaries are determined at the local maxima of the difference of successive smoothed histograms for low and high values of the smoothing parameter, respectively. Numerical experiments indicate that our method provides high detection rates while preserving a good tradeoff between recall and precision.
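The smoothing and boundary-detection steps can be sketched as below. The abstract does not specify the smoothing kernel or the histogram distance, so this illustration assumes Gaussian weights over shot index and an L1 difference; the function names and toy histograms are hypothetical.

```python
import math

def smooth_histograms(hists, sigma):
    """Replace each shot histogram by a Gaussian-weighted average of all shots."""
    n = len(hists)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-((i - j) ** 2) / (2 * sigma ** 2)) for j in range(n)]
        total = sum(weights)
        smoothed.append([
            sum(w * h[b] for w, h in zip(weights, hists)) / total
            for b in range(len(hists[0]))
        ])
    return smoothed

def boundaries(hists, sigma):
    """Indices of shots that start a new unit (local maxima of the difference)."""
    s = smooth_histograms(hists, sigma)
    # L1 difference between each pair of successive smoothed histograms.
    diffs = [sum(abs(a - b) for a, b in zip(s[i], s[i + 1])) for i in range(len(s) - 1)]
    return [
        i + 1
        for i in range(1, len(diffs) - 1)
        if diffs[i] > diffs[i - 1] and diffs[i] > diffs[i + 1]
    ]

# Toy example: four shots dominated by visual word 0, then four by word 1.
shots = [[1, 0]] * 4 + [[0, 1]] * 4
scene_starts = boundaries(shots, sigma=1.0)
```

Following the abstract, a small `sigma` would be used for scene boundaries and a larger one for chapter boundaries, since heavier smoothing suppresses short-lived changes.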
Trance Depth estimation during Hypnosis Induction, Using Cluster-Time Map
Golnaz Baghdadi, Ali Motie Nasrabadi
16th Iranian Conferences on Biomedical Engineering, 2009
Golnaz_baghdadi@yahoo.com, Nasrabadi@shahed.ac.ir
In hypnosis treatment, if the subject does not reach the required hypnosis depth, the treatment does not work. Therefore, determining the depth level during hypnosis induction is necessary. In this study, using K-means clustering, changes in EEG signal features were mapped over time. Investigating the cluster-time maps shows that the ratio of Beta band relative energy of hypnosis EEG to normal EEG in channel C3 yields a significant cluster-time map. In this map, three hypnotizability groups selected different clusters in a consecutive manner along three-minute time windows of hypnosis induction. Therefore, in this study, by transferring the ratio of Beta band relative energy of hypnosis EEG to normal EEG in channel C3 to the cluster-time map, a novel procedure was introduced for evaluating hypnosis depth during hypnosis suggestion.
Density-based spatial clustering in the presence of obstacles and facilitators
In this paper, we propose a new spatial clustering method, called DBRS+, which aims to cluster spatial data in the presence of both obstacles and facilitators. It can handle datasets with intersected obstacles and facilitators. Without preprocessing, DBRS+ processes constraints during clustering. It can find clusters with arbitrary shapes and varying densities. DBRS+ has been empirically evaluated using synthetic and real data sets and its performance has been compared to DBRS, AUTOCLUST+, and DBCLuC*.
A parallel workflow for real-time correlation and clustering of high-frequency stock market data
We investigate the design and implementation of a parallel workflow environment targeted towards the financial industry. The system performs real-time correlation analysis and clustering to identify trends within streaming high-frequency intra-day trading data. Our system utilizes state-of-the-art methods to optimize the delivery of computationally expensive real-time stock market data analysis, with direct applications in automated/algorithmic trading as well as knowledge discovery in high-throughput electronic exchanges. This paper describes the design of the system, including the key online parallel algorithms for robust correlation calculation and clique-based clustering using stochastic local search. We evaluate the performance and scalability of the system, followed by a preliminary analysis of the results using data from the Toronto Stock Exchange.
Functional similarity analysis of human virus-encoded miRNAs
miRNAs are a class of small RNAs that regulate gene expression via RNA silencing machinery. Some viruses also encode miRNAs, contributing to the complex virus-host interactions. A better understanding of viral miRNA functions would be useful in designing new preventive strategies for treating diseases induced by viruses. To address the question of how viruses modulate host gene expression through their encoded miRNAs, we measured the functional similarities among human viral miRNAs by using a method we reported previously. Higher-order functions regulated by viral miRNAs were also identified by KEGG pathway analysis on their targets. Our study demonstrated the biological processes involved in virus-host interactions via viral miRNAs. Phylogenetic analysis suggested that viral miRNAs have distinct evolution rates compared with their corresponding genomes.
C-DRIVE: Clustering Based on Direction in Vehicular Environment
In Proc. of 4th IFIP NTMS' 2011
In this paper we propose a direction based clustering
algorithm to disseminate the information amongst vehicles. This
protocol can be used to realize efficiency applications in VANETs
such as adaptive traffic signal control by estimating the density of
vehicles on a given road. C-DRIVE uses the direction of the
vehicle to optimize the data dissemination during the formation
of the clusters, thereby enabling better utilization of the
available bandwidth. The simulations show that the proposed
protocol reduces the number of broadcast packets being sent on
the network. We also find that the protocol is efficient in terms of
overheads during the cluster formation.
Keywords: Clustering; VANET
An Improved Co-Similarity Measure for Document Clustering
Co-clustering has been defined as a way to organize simultaneously subsets of instances and subsets of
features in order to improve the clustering of both of them. In previous work, we proposed an efficient co-similarity measure that simultaneously computes two similarity matrices, one between objects and one between features, each built on the basis of the other. Here we propose a generalization of this approach by introducing a notion of pseudo-norm and a pruning algorithm.
Our experiments show that this new algorithm significantly improves the accuracy of the results when using either supervised or unsupervised feature selection data and that it outperforms other algorithms on various corpora.
