Parallelization of an unstructured multi-grid Navier-Stokes solver for free-surface flow
by Yohei Sato
Co-authored with Takanori Hino. Published in Proceedings of the 22nd Symposium on Computational Fluid Dynamics, Tokyo, 2008.
Multi-core CPUs, which contain multiple processor cores, have recently become popular, and dual-core and quad-core CPUs are now widely used in desktop PCs. Computational Fluid Dynamics (CFD) computation requires a lot of CPU resources, so using multi-core CPUs through OpenMP is an efficient way to run CFD computations. This paper presents a parallelized iterative solution technique for an unstructured multi-grid Navier-Stokes solver using OpenMP programming. Sample calculations of (i) the free-surface flow around a ship and (ii) the flow around a human hand are carried out to evaluate the technique. The results show that the presented technique is efficient, with a parallelization efficiency of more than 80%.
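The abstract does not say how parallelization efficiency is computed. A minimal sketch, assuming the usual definition (speedup divided by the number of threads), is given below; the function names and the timings are hypothetical and are not taken from the paper.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-thread wall time to multi-thread wall time."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, n_threads):
    """Speedup normalised by thread count; 1.0 means ideal scaling."""
    return speedup(t_serial, t_parallel) / n_threads


# Hypothetical timings (assumed, not from the paper):
# 100 s on one core, 30 s on four cores.
eff = parallel_efficiency(100.0, 30.0, 4)
print(f"{eff:.2f}")  # prints 0.83
```

Under this definition, an efficiency above 80% on a quad-core CPU corresponds to a speedup of more than 3.2.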
A High Speed and Performance Optimization Algorithm Based on Gravitational Approach
by Mina Sohrabi
Naji, H.R., Sohrabi, M., Rashedi, E., "A High Speed and Performance Optimization Algorithm Based on Gravitational Approach", Computing in Science and Engineering (IEEE), 2011.
Recently, a novel heuristic search algorithm called the Gravitational Search Algorithm (GSA), which is based on the law of gravity and mass interactions, has been proposed. Although GSA performs well on various optimization problems, calculating the total force on each mass is time-consuming, which slows the optimization.
In this paper we introduce a new approach that improves GSA's speed considerably. Our approach is based on multi-agent systems, where multiple agents are the mechanism used to express parallelism. In multi-agent based GSA, complex problems are decomposed into smaller and simpler components that are handled by different agents in the system. Our experimental results show that our multi-agent based GSA approach provides a high-performance, high-speed optimization methodology that can help scientists in a variety of science and engineering computations.
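To illustrate why the total-force step is costly, the sketch below computes pairwise forces for N masses with a double loop, which is O(N^2) per iteration. It assumes the commonly cited GSA force law (force proportional to G * M_i * M_j / (R_ij + eps), directed from mass i to mass j) and omits the random weighting GSA normally applies to each pairwise term, so it is a simplified illustration and not the authors' implementation. The function name and parameters are hypothetical.

```python
import math


def total_forces(positions, masses, G=1.0, eps=1e-9):
    """Return the total force vector on each mass (simplified GSA-style law).

    Assumption: pairwise force = G * m_i * m_j / (R_ij + eps) * (x_j - x_i),
    without the random weights used in the full algorithm.
    """
    n = len(positions)
    dim = len(positions[0])
    forces = [[0.0] * dim for _ in range(n)]
    for i in range(n):            # every mass ...
        for j in range(n):        # ... interacts with every other mass: O(N^2)
            if i == j:
                continue
            r = math.dist(positions[i], positions[j])
            coeff = G * masses[i] * masses[j] / (r + eps)
            for d in range(dim):
                forces[i][d] += coeff * (positions[j][d] - positions[i][d])
    return forces


forces = total_forces([[0.0], [1.0]], [1.0, 1.0])
```

Each outer-loop iteration only reads shared data and writes its own row of `forces`, which is the kind of independence that a decomposition across agents or threads can exploit.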
A parallel algorithm for the verification of Covering Arrays
Himer Avila-George, Jose Torres-Jimenez, Vicente Hernandez and Nelson Rangel-Valdez
Covering Arrays (CAs) are combinatorial objects that, with a small number of cases, cover a certain level of interaction of a set of parameters. CAs have found application in a variety of fields where interactions among factors need to be identified; some of these fields are biology, agriculture, medicine, and software and hardware testing. In particular, given a positive integer t, a covering array is an N × k matrix over an alphabet of v symbols such that each N × t subarray contains every combination from {0, 1, ..., v-1}^t at least once. The process of ensuring that a CA contains each of the v^t combinations is called verification of the CA. When CAs have many variables or their strength is greater than 3, their verification is computationally very expensive.
In this paper we present an algorithm for CA verification and its implementation details in sequential and parallel computing.
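A minimal sequential sketch of CA verification follows, assuming the standard definition: for every choice of t columns, all v^t symbol combinations must appear in at least one row. The function name is hypothetical and the code is not the algorithm from the paper; it also assumes every symbol lies in 0..v-1.

```python
from itertools import combinations


def is_covering_array(rows, v, t):
    """Check whether `rows` (N rows of k symbols in 0..v-1) has strength t."""
    k = len(rows[0])
    required = v ** t  # number of distinct t-tuples over v symbols
    for cols in combinations(range(k), t):
        # Collect the distinct t-tuples that appear in this column subset.
        seen = {tuple(row[c] for c in cols) for row in rows}
        if len(seen) < required:
            return False
    return True


# Small example: 4 rows, 3 binary columns, strength 2.
ca = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
print(is_covering_array(ca, v=2, t=2))  # prints True
print(is_covering_array(ca, v=2, t=3))  # prints False
```

The number of column subsets grows as C(k, t), which is why verification becomes expensive for many variables or high strength. Each subset is checked independently, so the outer loop is a natural unit of work to distribute in a parallel version.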
Translating Haskell# Programs into Petri Nets
Lecture Notes in Computer Science
Volume 2565, pages 635-649
DOI: 10.1007/3-540-36569-9_43
Haskell# is a concurrent programming environment aimed at parallel distributed architectures. Haskell# programs may be automatically translated to Petri nets, an important formalism for analysis of properties of concurrent and non-deterministic systems. This paper motivates and formalizes the translation of Haskell# programs into Petri nets, providing some examples of their usage.
Avaliação do Desempenho de Operações Coletivas em Memória Distribuída e Compartilhada para Implementações de MPI (Performance Evaluation of Collective Operations in Distributed and Shared Memory for MPI Implementations)
I Concurso de Trabalhos de Iniciação Científica em Arquitetura de Computadores e Computação de Alto Desempenho, WSCAD-CTIC 2007 (participating student: Lucas Pinheiro Queiroz)
The efficiency of collective communication and support for multiprocessing are two factors that currently determine the performance of message-passing libraries implementing the MPI standard. This paper presents a performance evaluation of two such implementations, MPICH and Open MPI, with respect to these factors.
Performance Analysis of Parallel Demographic Simulation
Conference proceeding, 01/2010. In: Proceedings of the 24th European Simulation and Modelling Conference (ESM10)
There has been an increase in the number of papers on parallel simulation applications outside the traditional military and network simulation areas, such as in physical science and management science. One of the new areas in which parallel simulation could be used is demography, specifically for population projection. In this paper, we report the performance evaluation results of a parallel demographic simulation tool called Yades. We investigate the effect of three factors (unbalanced workload, heterogeneous processing speed and heterogeneous communication latency) on performance measures such as time spent executing useful events, time spent on overhead and the number of rollbacks. The results are consistent with what has been reported in other application areas of parallel simulation. Since the application in demography is new, it is useful to quantify the effect of the three factors on performance.
Workload Balancing Methodology for Data-Intensive Applications with Divisible Load
Claudia Rosas, Anna Morajko, Josep Jorba
SBAC-PAD '11 Proceedings of the 2011 23rd International Symposium on Computer Architecture and High Performance Computing
ISBN: 978-0-7695-4573-8, doi: 10.1109/SBAC-PAD.2011.15
Data-intensive applications are those that explore, query, analyze, and, in general, process very large data sets. Generally in High Performance Computing (HPC), the main performance problem associated with these applications is load imbalance or inefficient resource utilization. This paper proposes a methodology for improving the performance of data-intensive applications based on performing multiple data partitions prior to the execution, and ordering the data chunks according to their processing times during the application execution. As a first step, we consider that a single execution includes multiple related explorations on the same data set. Consequently, we propose to monitor the processing of each exploration and use the data gathered to dynamically tune the performance of the application. The tuning parameters included in the methodology are the partition factor of the data set, the distribution of these data chunks, and the number of processing nodes to be used by the application. The methodology has been initially tested using the well-known bioinformatics tool BLAST, obtaining encouraging results (up to 40% improvement).
A Compiler Extension for Parallelizing Arrays Automatically on the Cell Heterogeneous Processor
Joint with Y Gdura, presented at CPC 2012
This paper describes the approaches taken to extend an array
programming language compiler using a Virtual SIMD Machine (VSM)
model for parallelizing array operations on the Cell Broadband Engine heterogeneous
machine. This development is part of ongoing work at the
University of Glasgow for developing array compilers that are beneficial
for applications in many areas such as graphics, multimedia, image processing
and scientific computation. Our extended compiler, which is built
upon the VSM interface, eases the parallelization processes by allowing
automatic parallelisation without the need for any annotations or process
directives. The preliminary results demonstrate significant improvement
especially on data-intensive applications.
Two Alternative Implementations of Automatic Parallelisation
Presented at CPC 2012, joint with Y Gdura, P Keir
This paper is a description of the recent parallelising compilers
from our group at the University of Glasgow. Our group is part of
the Computer Vision and Graphics research group and we have for some
years been developing array compilers because we think these are a good
tool both for expressing graphics algorithms and for exploiting the parallelism
that computer vision applications require. We shall describe the
implementation of two different languages on two different platforms and
we shall compare the performance of these with reference C implementations
running on the same platforms. Finally we shall draw conclusions
both about the viability of the array language approach as compared to
other approaches used in the challenge and also about the strengths and
weaknesses of the two, very different, processor architectures we used.
The SCC and the SICSA Multi-core Challenge
Paper given at the Many Core Applications Research Conference, Potsdam University, December 2011
Abstract—Two phases of the SICSA Multi-core Challenge have
gone past. The first challenge was to produce concordances of
books for sequences of words up to length N; and the second
to simulate the motion of N celestial bodies under gravity. We
took both challenges on the SCC, using C and the Linux Shell.
This paper is an account of the experiences gained. It also gives
a shorter account of the performance of other systems on the
same set of problems, as they provide benchmarks against which
the SCC performance can be compared.
Load balancing in homogeneous pipeline based applications
A. Moreno, E. Cesar, A. Guevara, J. Sorribes, T. Margalef
Parallel Comput. (2011),
doi:10.1016/j.parco.2011.11.001
We propose to use knowledge about a parallel application’s structure that was acquired with the use of a skeleton based development strategy to dynamically improve its performance. Parallel/distributed programming provides the possibility of solving highly demanding computational problems. However, this type of application requires support tools in all phases of the development cycle because the implementation is extremely difficult, especially for non-expert programmers. This work shows a new strategy for dynamically improving the performance of pipeline applications. We call this approach Dynamic Pipeline Mapping (DPM), and the key idea is to free computational resources by gathering the pipeline’s fastest stages and then using these resources to replicate the slowest stages. We present two versions of this strategy, both with complexity O(N log (N)) in the number of pipe stages, and we compare them to an optimal mapping algorithm and to the Binary Search Closest (BSC) algorithm [1]. Our results show that DPM leads to significant performance improvements, increasing the application throughput by up to 40% on average.
Dynamic performance tuning environment
Anna Morajko, Eduardo César, Tomás Margalef, Joan Sorribes and Emilio Luque
EURO-PAR 2001 PARALLEL PROCESSING
Lecture Notes in Computer Science, 2001, Volume 2150/2001, 36-45, DOI: 10.1007/3-540-44681-8_7
Performance analysis and tuning of parallel/distributed applications are very difficult tasks for non-expert programmers. It is necessary to provide tools that automatically carry out these tasks. Many applications have different behavior according to the input data set or even change their behavior dynamically during the execution. Therefore, it is necessary that the performance tuning can be done on the fly by modifying the application according to the particular conditions of the execution. A dynamic automatic performance tuning environment supported by dynamic instrumentation techniques is presented. The environment is completed by a pattern-based application design tool that allows the user to concentrate on the design phase and helps overcome performance bottlenecks on the fly.
This work was supported by the Comisión Interministerial de Ciencia y Tecnología (CICYT) under contract number TIC 98-0433.
A Concurrent Dataflow Algorithm for Ray Tracing
Ray tracing has become established as one of the most important and popular rendering techniques for synthesizing photo-realistic images. However, high-quality images require long computation times and memory-consuming scene descriptions. Parallel architectures with distributed memories are increasingly being used for rendering and provide more memory and CPU power. This paper describes a new way to employ such computers efficiently for ray tracing. Here each pixel is processed independently, so a natural method of parallelization is to distribute pixels over the machine nodes. If the entire scene can be duplicated in the memory of each processor (in the absence of global memory), a scheme without dataflow is used; otherwise, the objects composing the scene have to be distributed over processor nodes. In this last case two strategies are applicable to the computation: object dataflow and ray dataflow.
The fastest of these algorithms is the one without dataflow. If we want to deal with realistic pictures described by large scenes, we have to distribute objects among processors.
Accordingly, a choice has to be made between object and ray dataflow. This choice is not easy because of variability in computers, communication networks and scenes; moreover, the algorithms might not be based on the same sequential model. Here, we propose a method which chooses the type of flow to use dynamically. This arose from the development of a parallel ray tracing algorithm that included the two modes of dataflow, i.e. a concurrent dataflow algorithm.
As our scheme allows the use of both algorithms, processors need to exchange objects and rays. The processors’ load and some local parameters are used to choose between ray and object dataflow at any instant. Global information is transmitted by a non-centralized strategy, which assures scalability and the production of relevant messages. Finally, our algorithm offers a dynamic management of concurrent dataflows, which intrinsically assures dynamic load balancing.
Using concurrent dataflow, our algorithm gives very encouraging results, as the computation time for rendering pictures and the number of exchanged messages are reduced by significant factors in relation to classical dataflow algorithms.
This concurrent algorithm is scalable, as its performance on a CRAY T3E with 64 processors shows. Finally, the reduction of the magnitude of the flow of data communication reduces the risk of network saturation, which might otherwise compromise the algorithm.
A mixed dataflow algorithm for ray tracing on the CRAY T3E
The ray tracing scheme is one of the most complete and efficient rendering methods. A major drawback of this model is its high computational cost, which limits its practical use. Moreover, the quest for realistic rendering requires larger and larger databases to describe scenes. With the development of distributed memory parallel computers such as the CRAY T3E, the most promising way to improve the production of ray-traced pictures seems to be parallelization, which offers both increased CPU power and memory facilities.
In the ray tracing algorithm, each pixel of the screen is processed independently. A natural way of parallelization is to distribute pixels over the machine nodes. However, since we want to deal with large scenes, objects also have to be distributed among processors, so a modified parallel algorithm is necessary.
Strategies based on object dataflow have been proposed, but their communication load is too high. More efficient algorithms have to reduce the number of messages. Therefore we propose a mixed dataflow approach: each message will contain several pieces of information on both objects and rays. In this way, we hope to limit the communication load and to ensure dynamic load balancing.
A parallel ray tracing algorithm on the T3E using our mixed dataflow approach is implemented. The results are very encouraging since the computation time and the communication flows may be reduced by significant factors. Scalability is globally improved, mainly because saturation occurs for larger problem sizes.

