A Parallel Workflow for Online Correlation and Clique-finding -- with applications to finance
This thesis investigates how a state-of-the-art Stochastic Local Search (SLS) algorithm for the maximum clique problem can be adapted for and employed within a fully distributed parallel workflow environment. First we present parallel variants of Dynamic Local Search (DLS-MC) and Phased Local Search (PLS), demonstrating how a simple yet effective multiple independent runs strategy can offer superior speedup performance with minimal communication overhead. We then extend PLS into an online algorithm so that it can operate in a dynamic environment where the input graph is constantly changing, and show that in most cases trajectory continuation is more efficient than restarting the search from scratch. Finally, we embed our new algorithm within a data processing pipeline that performs high-throughput correlation and clique-based clustering of thousands of variables from a high-frequency data stream. For communication within and between system components, we use MPI, the de facto standard API for message passing in high-performance computing. We present algorithmic and system performance results using synthetically generated data streams, as well as a preliminary investigation into the applicability of our system for processing high-frequency, real-life intra-day stock market data in order to determine clusters of stocks exhibiting highly correlated short-term trading patterns.
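As a rough illustration of the pipeline's two ideas (thresholding a correlation matrix into a graph, then keeping the best clique from several independent randomized runs), here is a minimal single-process sketch. It is not DLS-MC or PLS: the randomized greedy search is a deliberately simple stand-in, and the names `correlation_graph`, `greedy_clique`, and `best_of_independent_runs` are hypothetical. In a distributed setting each run could be placed on its own MPI rank, with only the final cliques exchanged.

```python
import random
from itertools import combinations


def correlation_graph(corr, threshold):
    """Build adjacency sets: i and j are linked when corr[i][j] >= threshold."""
    n = len(corr)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if corr[i][j] >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_clique(adj, rng):
    """One randomized run: grow a clique until no candidate vertex remains."""
    candidates = list(adj)
    clique = []
    while candidates:
        v = rng.choice(candidates)
        clique.append(v)
        # Keep only vertices adjacent to every member chosen so far.
        candidates = [u for u in candidates if u in adj[v]]
    return clique


def best_of_independent_runs(adj, runs, seed=0):
    """Run several independently seeded searches and keep the largest clique."""
    best = []
    for r in range(runs):
        clique = greedy_clique(adj, random.Random(seed + r))
        if len(clique) > len(best):
            best = clique
    return best


if __name__ == "__main__":
    # Toy matrix (assumed data): variables 0, 1, 2 are mutually correlated.
    corr = [
        [1.0, 0.9, 0.8, 0.1],
        [0.9, 1.0, 0.85, 0.2],
        [0.8, 0.85, 1.0, 0.0],
        [0.1, 0.2, 0.0, 1.0],
    ]
    graph = correlation_graph(corr, threshold=0.7)
    print(sorted(best_of_independent_runs(graph, runs=8)))
```

Because the runs share no state, the only communication needed is the final comparison of clique sizes, which is the property the abstract credits for the low overhead.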
System optimization with artificial neural networks: parallel implementation using transputers
by Ivan Ricarte
Co-authored with Roseli Francelin and Fernando Gomide. Published in IJCNN, 1992
A neural network with a three-layer feedback topology for solving continuous optimization problems has been proposed. A parallel implementation of the proposed neural network is presented. The implementation described here uses a transputer system, which enables solving problems with several variables. Results from this implementation and comparisons with sequential implementation results are also presented.
Analysis of pipelined external sorting on a reconfigurable message-passing multicomputer
by Ivan Ricarte
Co-authored with Bernard Menezes and Ramki Thurimella. Published in Parallel Computing, 1993.
External sorting is a frequent operation in relational database systems, sometimes as a step in important operations such as joins. Therefore, external sorting on a parallel system is a key index of system performance for database applications. However, the problem of external sorting on multicomputers is not as well understood as parallel internal sorting, where keys reside in main memory. In many cases, analysis is performed under assumptions such as unlimited resources (number of processors, amount of memory, network bandwidth) and fully overlapped use of resources, limiting its applicability in practice. [...]
External sorting on a reconfigurable message-passing multicomputer: Experimental results and analysis
by Ivan Ricarte
Co-authored with Bernard Menezes and Ramki Thurimella. Published in MWSCS'1992
In this paper, we report on an actual implementation of the external sorting problem on a multicomputer, with careful attention paid to the overlap between computation and I/O in order to minimize total execution time. The problem is divided into two steps: the first creates multiple sorted runs (Step 1), and the second merges the runs (Step 2). Step 1 was accomplished using pipelined sort; Step 2 was implemented on a tree of processors. We also present an analytical model for Step 1; the execution time predicted by the proposed analytical model is compared with the experimental results.
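The two-step structure described above can be sketched on a single machine as follows. This is only an illustration of run creation followed by merging: it does not reproduce the paper's pipelined sort or its tree of processors, it assumes integer keys stored one per line, and all function names are hypothetical.

```python
import heapq
import os
import tempfile


def create_sorted_runs(values, run_size, directory):
    """Step 1: sort memory-sized chunks and write each one out as a run file."""
    paths = []
    chunk = []

    def flush():
        if not chunk:
            return
        chunk.sort()
        fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
        with os.fdopen(fd, "w") as handle:
            handle.writelines(f"{value}\n" for value in chunk)
        paths.append(path)
        chunk.clear()

    for value in values:
        chunk.append(value)
        if len(chunk) == run_size:
            flush()
    flush()  # write the final, possibly shorter, run
    return paths


def read_run(path):
    """Stream one sorted run from disk, one integer key per line."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def merge_runs(paths):
    """Step 2: k-way merge of the sorted runs without loading them fully."""
    return heapq.merge(*(read_run(path) for path in paths))


def external_sort(values, run_size):
    """Sort an iterable of integers using runs of at most run_size keys."""
    with tempfile.TemporaryDirectory() as directory:
        paths = create_sorted_runs(values, run_size, directory)
        return list(merge_runs(paths))


if __name__ == "__main__":
    print(external_sort([5, 3, 9, 1, 4, 8, 2], run_size=3))
```

Here `run_size` stands in for the amount of main memory available to a node; overlapping computation with I/O, which is the focus of the paper, is not modeled.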
Fast Random Graph Generation
Proc. of the 14th Intl Conf. on Extending Database Technology (EDBT'11), Uppsala, Sweden.
Today, several database applications call for the generation of random graphs. A fundamental, versatile random graph model adopted for that purpose is the Erdős–Rényi Γv,p model. This model can be used for directed, undirected, and multipartite graphs, with and without self-loops; it induces algorithms for both graph generation and sampling, hence is useful not only in applications necessitating the generation of random structures but also for simulation, sampling and in randomized algorithms. However, the commonly advocated algorithm for random graph generation under this model performs poorly when generating large graphs, and fails to make use of the parallel processing capabilities of modern hardware. In this paper, we propose PPreZER, an alternative, data parallel algorithm for random graph generation under the Erdős–Rényi model, designed and implemented in a graphics processing unit (GPU). We are led to this chief contribution of ours via a succession of seven intermediary algorithms, both sequential and parallel. Our extensive experimental study shows an average speedup of 19 for PPreZER with respect to the baseline algorithm.
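To make the contrast with the commonly advocated algorithm concrete, the sketch below shows a per-candidate coin-flip generator next to a well-known sequential alternative that draws geometric skip lengths, so that work is proportional to the number of edges produced. It is not PPreZER and makes no claim about the paper's intermediary algorithms; it assumes a directed graph with self-loops allowed (v × v candidate edges), and the function names are hypothetical.

```python
import math
import random


def er_baseline(v, p, rng):
    """Flip one biased coin per candidate edge: v * v random draws."""
    return [(i, j) for i in range(v) for j in range(v) if rng.random() < p]


def er_skip(v, p, rng):
    """Jump directly to the next accepted edge using geometric skip lengths."""
    total = v * v
    if p <= 0.0:
        return []
    if p >= 1.0:
        return [(i // v, i % v) for i in range(total)]
    log_q = math.log1p(-p)  # log(1 - p), accurate even for small p
    edges = []
    index = -1
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1], keeps log(u) finite
        skip = int(math.log(u) / log_q)  # rejected candidates before next hit
        index += skip + 1
        if index >= total:
            break
        edges.append((index // v, index % v))  # decode the linear edge index
    return edges


if __name__ == "__main__":
    rng = random.Random(42)
    sample = er_skip(1000, 0.001, rng)
    print(len(sample), "edges; about", int(1000 * 1000 * 0.001), "expected")
```

The skip length satisfies P(skip ≥ k) = (1 − p)^k, which is the distribution of the number of consecutive rejections in the baseline, so both generators sample the same model.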
A Data Parallel Minimum Spanning Tree Algorithm for Most Graphics Processing Units
Proc. of the Annual Intl Conf. on Advances in Distributed and Parallel Computing (ADPC'10), Singapore.
We propose a fast data parallel minimum spanning tree algorithm designed for general-purpose computation on graphics processing units (GPUs). Our algorithm is a data parallel version of Borůvka's minimum spanning tree algorithm. Its gist is a synchronization on the central processing unit after each of the parallel iterations that compute the components and the minimum-weight outgoing edge of each. Our implementation uses both BrookGPU and CUDA from NVIDIA as programming environments, and the performance of our algorithm was evaluated in comparison with state-of-the-art algorithms on different types of datasets. The experimental results show that our algorithm substantially outperforms the other algorithms in execution time, with up to tenfold speedup.
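For readers unfamiliar with Borůvka's algorithm, a plain sequential sketch follows. It is not the paper's GPU implementation: each pass of the outer loop corresponds to one iteration that, in a data parallel version, would find every component's cheapest outgoing edge concurrently before synchronizing. Ties are broken by edge index, an assumption added here so that equal weights cannot create cycles; the name `boruvka_msf` is hypothetical.

```python
def boruvka_msf(n, edges):
    """Return minimum spanning forest edges for vertices 0..n-1.

    edges is a list of (weight, u, v) tuples describing an undirected graph.
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    while True:
        cheapest = {}  # component root -> (weight, edge index)
        for index, (weight, u, v) in enumerate(edges):
            root_u, root_v = find(u), find(v)
            if root_u == root_v:
                continue  # edge lies inside one component
            key = (weight, index)
            for root in (root_u, root_v):
                if root not in cheapest or key < cheapest[root]:
                    cheapest[root] = key
        if not cheapest:
            break  # no edge leaves any component: the forest is complete
        # Merge step (the synchronization point in a parallel version).
        for _, index in set(cheapest.values()):
            _, u, v = edges[index]
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v
                chosen.append(edges[index])
    return chosen


if __name__ == "__main__":
    graph = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)]
    forest = boruvka_msf(4, graph)
    print(sorted(forest), sum(weight for weight, _, _ in forest))
```

Each iteration at least halves the number of components, so the outer loop runs O(log n) times.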
Scalable parallel minimum spanning forest computation
Proc. of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming (PPoPP'12), New Orleans, LA, USA
The proliferation of data in graph form calls for the development of scalable graph algorithms that exploit parallel processing environments. One such problem is the computation of a graph's minimum spanning forest (MSF). Past research has proposed several parallel algorithms for this problem, yet none of them scales to large, high-density graphs. In this paper we propose a novel, scalable, Parallel MSF Algorithm (PMA) for undirected weighted graphs. Our algorithm leverages Prim's algorithm in a parallel fashion, concurrently expanding several subsets of the computed MSF. Our effort focuses on minimizing the communication among different processors without constraining the local growth of a processor's computed subtree. In effect, we achieve a scalability that previous approaches lacked. We implement our algorithm in CUDA, running on a GPU, and study its performance using real and synthetic, sparse as well as dense, structured and unstructured graph data. Our experimental study demonstrates that our algorithm outperforms the previous state-of-the-art GPU-based MSF algorithm, while being several orders of magnitude faster than sequential CPU-based algorithms.
An Efficient Hierarchical Parallel Genetic Algorithm for Graph Coloring Problem
! NOMINATED FOR BEST PAPER AWARD AT GECCO 2011 !
R. Abbasian and M. Mouhoub. An efficient hierarchical parallel genetic algorithm for graph coloring problem, 13th Annual Genetic and Evolutionary Computation Conference (GECCO 2011), ACM, pages 521-528, Dublin, Ireland, July 12-16, 2011. Also presented at the International Joint Conferences on Artificial Intelligence (IJCAI 2011), RCRA, July 2011.
Graph coloring problems (GCPs) are constraint optimization problems with various applications including scheduling, timetabling, and frequency allocation. The GCP consists in finding the minimum number of colors for coloring the graph vertices such that adjacent vertices have distinct colors. We propose a parallel approach based on Hierarchical Parallel Genetic Algorithms (HPGAs) to solve the GCP. We also propose a new extension to PGA, namely a Genetic Modification (GM) operator designed for solving constraint optimization problems by taking advantage of the properties between variables and their relations. Our proposed GM for solving the GCP is based on a novel Variable Ordering Algorithm (VOA). In order to evaluate the performance of our new approach, we have conducted several experiments on GCP instances taken from the well-known DIMACS website. The results show that the proposed approach performs well on these instances in both running time and solution quality, where quality is measured by comparing the returned solution with the optimal one.
A Concurrent Dataflow Algorithm for Ray Tracing
Ray tracing has become established as one of the most important and popular rendering techniques for synthesizing photo-realistic images. However, high-quality images require long computation times and memory-consuming scene descriptions. Parallel architectures with distributed memories are increasingly being used for rendering and provide more memory and CPU power. This paper describes a new way to employ such computers efficiently for ray tracing. Here each pixel is processed independently, so a natural method of parallelization is to distribute pixels over the machine nodes. If the entire scene can be duplicated in the memory of each processor (in the absence of global memory), a scheme without dataflow is used; otherwise, the objects composing the scene have to be distributed over processor nodes. In this last case two strategies are applicable to the computation: object dataflow and ray dataflow.
The fastest of these algorithms is the one without dataflow. If we want to deal with realistic pictures described by large scenes, we have to distribute objects among processors.
Accordingly, a choice has to be made between object and ray dataflow. This choice is not easy because of variability in computers, communication networks and scenes, and because algorithms might not be based on the same sequential model. Here, we propose a method which chooses the type of flow to use dynamically. This arose from the development of a parallel ray tracing algorithm that included the two modes of dataflow, i.e. a concurrent dataflow algorithm.
As our scheme allows the use of both algorithms, processors need to exchange objects and rays. The processors’ load and some local parameters are used to choose between ray or object dataflow at any instant. Global information is transmitted by a non-centralized strategy, which assures scalability and the production of relevant messages. Finally, our algorithm offers a dynamic management of concurrent dataflows, which intrinsically assures dynamic load balancing.
Using concurrent dataflow our algorithm gives very encouraging results, as the computation time for rendering pictures and the number of exchanged messages are reduced by significant factors in relation to classical dataflow algorithms.
This concurrent algorithm is scalable, as its performance on a CRAY T3E with 64 processors shows. Finally, the reduction in the volume of data communication reduces the risk of network saturation, which might otherwise compromise the algorithm.
A mixed dataflow algorithm for ray tracing on the CRAY T3E
The ray tracing scheme is one of the most complete and efficient rendering methods. A major drawback of this model is its high computational cost, which limits its practical use. Moreover, the quest for realistic rendering requires larger and larger databases to describe scenes. With the development of distributed memory parallel computers such as the CRAY T3E, the most promising way to improve the production of ray-traced pictures seems to be parallelization, which offers both increased CPU power and memory facilities.
In the ray tracing algorithm, each pixel of the screen is processed independently. A natural way of parallelization is to distribute pixels over the machine nodes. However, since we want to deal with large scenes, objects also have to be distributed among processors, so a modified parallel algorithm is necessary.
Strategies based on object dataflow have been proposed, but their communication load is too high. More efficient algorithms have to reduce the number of messages. Therefore we propose a mixed dataflow approach: each message will contain several pieces of information on both objects and rays. In this way, we hope to limit the communication load and to ensure dynamic load balancing.
A parallel ray tracing algorithm on the T3E using our mixed dataflow approach is implemented. The results are very encouraging since the computation time and the communication flows may be reduced by significant factors. Scalability is globally improved, mainly because saturation occurs for larger problem sizes.
Parallel Discrepancy-based Search: An efficient and scalable search strategy for massively parallel supercomputers providing intrinsic load-balancing without communication
Thierry Moisan, Jonathan Gaudreault, and Claude-Guy Quimper. Parallel Discrepancy-based Search: An efficient and scalable search strategy for massively parallel supercomputers providing intrinsic load-balancing without communication. In Proceedings of the Workshop on Parallel Methods for Constraint Solving (PMCS'11), held with 17th International Conference on Principles and Practice of Constraint Programming (CP'11), 2011.
Backtracking strategies based on the computation of discrepancies have proved themselves successful at solving large problems. They show really good performance when provided with a high quality domain-specific branching heuristic (variable and value ordering heuristic), which is the case for many industrial problems.
We propose a novel approach (PDS) that allows parallelizing a strategy based on the computation of discrepancies (LDS). The pool of processors visits the leaves in exactly the same order as the centralized algorithm would. The implementation allows a natural, intrinsic load balancing to occur (filtering induced by constraint propagation affects each processor in much the same way), although there is no communication between processors. These properties make PDS a scalable algorithm for massively parallel supercomputers with thousands of cores.
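The idea of visiting leaves in a fixed discrepancy order while splitting them among processors without communication can be illustrated with a static round-robin partition. This is a simplified sketch, not the PDS implementation: it assumes a complete binary search tree, orders leaves by number of discrepancies and then by position (an ordering chosen here for illustration), ignores constraint propagation, and lets every worker enumerate the full sequence and skip the leaves that belong to others. The names `lds_leaves` and `leaves_for_worker` are hypothetical.

```python
from itertools import combinations, islice


def lds_leaves(depth):
    """Yield leaves of a binary tree by increasing number of discrepancies.

    A leaf is a tuple of 0/1 branch choices, where 1 marks a discrepancy,
    i.e. a decision that goes against the branching heuristic.
    """
    for count in range(depth + 1):
        for positions in combinations(range(depth), count):
            leaf = [0] * depth
            for position in positions:
                leaf[position] = 1
            yield tuple(leaf)


def leaves_for_worker(depth, worker, workers):
    """Leaves assigned to one worker: every workers-th leaf in the global order.

    Each worker derives its share from its own rank alone, so no messages
    are exchanged to divide the work.
    """
    return islice(lds_leaves(depth), worker, None, workers)


if __name__ == "__main__":
    for rank in range(2):
        print(rank, list(leaves_for_worker(3, rank, 2)))
```

Interleaving the per-worker sequences reproduces the centralized order, and consecutive leaves (which tend to have similar cost) land on different workers, which is one intuition for why the load stays balanced.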
Exploiting Parallelism in Decomposition Methods for Constraint Satisfaction
Constraint Satisfaction Problems (CSPs) are NP-complete in general; however, there are many tractable subclasses that rely on restricting the structure of their underlying hypergraphs. It is a well-known fact, for instance, that CSPs whose underlying hypergraph is acyclic are tractable. Trying to define “nearly acyclic” hypergraphs led to the definition of various hypergraph decomposition methods. An important member of this class is the hypertree decomposition method, introduced by Gottlob et al. It possesses the property that CSPs falling into this class can be solved efficiently, and that hypergraphs in this class can be recognized efficiently as well. Apart from polynomial tractability, complexity analysis has shown that both aforementioned problems lie in the low complexity class LOGCFL and are thus, moreover, efficiently parallelizable. A parallel algorithm has been proposed for the “evaluation problem”; however, all algorithms for the “recognition problem” presented to date are sequential.
The main contribution of this dissertation is the creation of an object oriented programming library including a task scheduler which allows the parallelization of a whole range of computational problems, fulfilling certain complexity-theoretic restrictions. This library merely requires the programmer to provide the implementation of several classes and methods, representing a general alternating algorithm, while the mechanics of the task scheduler remain hidden. In particular, we use this library to create an efficient parallel algorithm, which computes hypertree decompositions of a fixed width.
Another result of a more theoretical nature is the definition of a new type of decomposition method, called Balanced Decompositions. Solving CSPs of bounded balanced width and recognizing such hypergraphs is only quasi-polynomial, but still parallelizable to a certain extent. A complexity-theoretic analysis leads to the definition of a new complexity class hierarchy, called the DC-hierarchy, with the first class in this hierarchy, DC1, precisely capturing the complexity of solving CSPs of bounded balanced width.
