Proceedings of The Second International Conference on Electronics and Software Science (ICESS2016), Japan 2016
A Parallel Approach to Object Identification in Large-scale Images
Young-Min Kang, Tongmyong University, Busan, 48520, Korea (ymkang@tu.ac.kr)
Sung-Soo Kim, ETRI, Daejeon, 34129, Korea (sungsoo@etri.re.kr)
Gyung-Tae Nam, GCSC Inc., Busan, 47607, Korea (gtnam@gcsc.co.kr)
ABSTRACT

As the computing power of processors is drastically improved, the sizes of image data for various applications are also increasing. One of the most basic operations on image data is to identify objects within the image, and connected component labeling (CCL) is the most frequently used strategy for this problem. However, CCL cannot be easily implemented in a parallel fashion because the connected pixels can basically be found only by graph traversal. In this paper, we propose an efficient GPU-based algorithm for object identification in large-scale images, and the performance of the proposed method is compared with that of the most commonly used method implemented with OpenCV libraries. The method was implemented and tested on computing environments with commodity CPUs and GPUs. The experimental results show that the proposed method outperforms the reference method when the pixel density is below 0.7, and that it can be a good solution to object identification in large-scale image data, where rapid computation is highly requested.
KEYWORDS
GPGPU, Object Identification, CCL
1 INTRODUCTION
As the computing power of processors is being drastically improved, the sizes of image
data for various applications are also increasing. Therefore, efficient algorithms for manipulating the large-scale image data are required.

ISBN: 978-1-941968-40-6 ©2016 SDIWC
One of the most basic operations on image data
is to identify objects within the image, and
the connected component labeling (CCL) is the
most frequently used strategy for this problem.
Various image processing techniques can be
easily implemented in parallel fashion, and
GPU parallelism has been successfully exploited in this field. However, CCL cannot be
easily implemented with parallel tasks because
the connected pixels are represented as adjacent nodes in a graph and the adjacency among
all the nodes can be investigated basically only
by graph traversal.
In this paper, we propose a GPU-based efficient algorithm for object identification in
large-scale images and the performance of the
proposed method is compared with that of an
OpenCV-based CCL algorithm.
2 RELATED WORK
Object identification is a fundamental problem in image processing. In many cases, object identification is performed based on CCL. Most CCL algorithms reduce to the traversal of adjacent nodes (pixels) along the edges of the graph that represents the input image. Such traversal approaches are naturally sequential, and parallel implementations of typical CCL algorithms have not been very successful [10].
Even in the early stage of computer vision research, it was found that connectivity cannot be determined by completely parallel tasks [7]. However, the rapid development of general-purpose graphics processing unit (GPU) technologies has made it possible for GPU-based parallel approaches to achieve better performance than traditional CPU-based CCL methods [1, 9].
The basic approach to CCL is to use the union-find algorithm, which can determine whether or not two nodes in an undirected graph are connected [8]. However, methods based on this approach rely on sequential computations. Several methods have been proposed to exploit parallel computing architectures [5, 9], but these methods were applied to relatively small images. Some methods utilized cluster architectures [4]; these methods split the data volume and assign the data segments to different computing units. Parallelism within a single GPU, therefore, cannot be exploited in these methods.
The label equivalence method has been implemented on the GPU [6]; this method iterates to resolve the label equivalences by finding the roots of equivalence trees. However, this approach relies on decision tables, which cannot be efficiently handled on the GPU [10]. Block-based labeling and efficient block processing with decision tables were proposed in [2, 3]. A block equivalence method based on a "scan mask" was also proposed [10]. However, this method also has to iterate the equivalence resolving until it converges to a state where no label update is found.
3.1 Data Initialization
The pixels in an image can be classified into either 'on' pixels or 'off' pixels. The 'on' pixels are regarded as nodes in the graph representation, and it is assumed that an edge exists between two nodes whose positions neighbor each other in the image space.

The goal of CCL algorithms is to assign an identical label to the linked nodes. In order to achieve this goal, each pixel is assigned a unique label in the initialization stage. The simplest method is to assign sequential numbers to the pixels. Suppose we have an image with w × h pixels and the pixel at (x, y) is denoted as p(x, y), where x ∈ [1, w] and y ∈ [1, h]. The 'on' pixel at (x, y) is then labeled with the number x + w(y − 1). Therefore, the labels range from 1 to wh. All the 'off' pixels are labeled −1.
The label map Iλ is an image composed of the label of each pixel, and the label at (x, y) in the label map is denoted by Iλx,y. In other words, the image with initial labels can be described as follows:

Iλ ∈ Z^(w×h),
idx,y = x + w(y − 1) ∈ [1, wh],
p(x, y) = 1 ⇒ Iλx,y = idx,y,
p(x, y) = 0 ⇒ Iλx,y = −1.        (1)
After the initialization is done, the rest of the
algorithm is to merge the positive labels in each
connected component into a single label.
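As a concrete illustration, the initialization stage can be sketched in plain Python (a host-side sketch only; in the paper this runs as a GPU kernel, and the function name `init_labels` is ours):

```python
def init_labels(img):
    """Build the initial label map I_lambda from a binary image (list of rows).

    Following Eq. (1), the 'on' pixel at 1-based position (x, y) receives
    the unique label x + w*(y - 1), so labels range over [1, w*h];
    every 'off' pixel is labeled -1.
    """
    h, w = len(img), len(img[0])
    return [[(x + 1) + w * y if img[y][x] == 1 else -1 for x in range(w)]
            for y in range(h)]

# left column and bottom-right pixel are 'on'
print(init_labels([[1, 0],
                   [1, 1]]))   # [[1, -1], [3, 4]]
```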
3 PROPOSED METHOD

In this section, a GPU-based parallel approach to CCL is proposed. The method is composed of four major tasks: 1) data initialization, 2) computation of column-wise label runs, 3) label merge of connected components, and 4) relabeling. Each task is explained in detail in the following subsections.

3.2 Computing Column-wise Label Runs
In order to merge labels, vertically adjacent positive labels are merged first. A block of contiguous object pixels in a column is called a 'run.' The first stage of the label merge is to find these runs: each run is identified and labeled with a unique number. Each column is assigned to
a CUDA thread and processed in a parallel fashion. Therefore, we have w threads running separately.

[Figure 1. Label runs and equivalence forest: initial labels, the column-wise label runs, and the merged labels in two columns]
The computation within a thread is simple: scan the pixels and change the label of the current pixel to that of the previously scanned pixel if both pixels are 'on.'
Fig. 1 shows how the label assigned to each pixel is updated through the column-wise label run computation described in Algorithm 1. After the update, each label represents the root node of the equivalence tree it belongs to, as shown in Fig. 1.
Algorithm 1: Column-wise label run

kernel vertLabel
    Data: Iλ ∈ Z^(w×h): In, Out
    begin
        col = thread: [1, w]
        for row: h − 1 downto 1 do
            if Iλ[row, col] > 0 and Iλ[row+1, col] > 0 then
                Iλ[row, col] = Iλ[row+1, col]
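A CPU sketch of this kernel in Python (each `col` iteration stands in for one CUDA thread; the function name is illustrative):

```python
def column_runs(labels):
    """Column-wise label runs, as in Algorithm 1.

    Each column is scanned bottom-up; whenever two vertically adjacent
    pixels are both 'on', the upper pixel takes the label of the lower
    one, so every vertical run ends up pointing at its bottom-most
    (largest) label -- the root of its equivalence tree.
    """
    h, w = len(labels), len(labels[0])
    for col in range(w):                # one independent thread per column
        for row in range(h - 2, -1, -1):
            if labels[row][col] > 0 and labels[row + 1][col] > 0:
                labels[row][col] = labels[row + 1][col]
    return labels

labels = [[1, -1],
          [3, 4]]                       # initial labels of a 2x2 image
print(column_runs(labels))              # [[3, -1], [3, 4]]
```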
3.3 Label Merge
[Figure 2. Label merge with two columns]

Once the column-wise label update is finished, the horizontal connectivity must be investigated. Let us suppose, for simplicity, that we have only two columns. In the previous column-wise update, the labels were merged to the largest value in each equivalence tree. If two pixels in a row are connected, the equivalence trees that those pixels belong to should be merged
into a single tree. In order to achieve this, the root node of each tree must be found, and the root node with the smaller label is relabeled to point to the other root with the larger label, as shown in Fig. 2.

Note that the labels of the connected pixels are not directly updated. Fig. 3 shows the merge process. As shown in Fig. 3 (a), two pixels a and b in different equivalence trees are found to be connected. Directly updating the labels of the connected pixels does not successfully merge the equivalence trees, as shown in Fig. 3 (b). The correct label merge is done by comparing and updating the roots of the equivalence trees, as shown in Fig. 3 (c).
In order to perform the two-column label merge for an image with w columns, w/2 column pairs are separately merged with w/2 threads. After every two adjacent column pairs are merged, we iterate the label merge. In the second merge phase, as shown in Fig. 4, we only have to consider the boundaries between the previously merged column pairs, and the computation is reduced to half of the previous one. It is easily noticed that lg w iterations are sufficient for finding the final label equivalence. Let the computational cost of the first iteration be denoted by C(1). The total computational cost for the label merge is then Σ_{i=0}^{lg w − 1} (1/2^i) C(1) = O(C(1)).

[Figure 3. Label merge with two columns: (a) connected pixels found, (b) direct update of pixel labels, (c) root node merge]

Algorithm 2: Label merge algorithm

host callMergeLabel
    Data: Iλ ∈ Z^(w×h): In, Out
    begin
        div = 2
        for i: 0 upto lg w − 1 do
            merge <<h · w/div>> (div, Iλ)
            div = 2 × div

kernel merge
    Data: div ∈ Z: In, Iλ ∈ Z^(w×h): In, Out
    begin
        thread: [1, wh/div]
        nBoundary = w/div
        col = div/2 + (thread % nBoundary) × div
        row = thread / nBoundary
        if Iλ[row, col] > 0 and Iλ[row, col+1] > 0 then
            rootL = findRoot(row, col)
            rootR = findRoot(row, col+1)
            Iλ[min(rootL, rootR)] = Iλ[max(rootL, rootR)]

device findRoot
    Data: row, col ∈ Z: In, label ∈ Z: Out
    begin
        if Iλ[row, col] < 0 then
            return −1
        label = w · row + col
        while Iλ[label] ≠ label do
            label = Iλ[label]
        return label
Algorithm 2 shows the implementation details of our method. The label merge is implemented with one host function, one kernel function, and one device function. In the host function callMergeLabel, we determine the number of boundaries where column pairs are merged and call the kernel function merge with the necessary number of threads. The host function iterates this call lg w times, and the number of threads to be launched decreases as the iteration is repeated. In the i-th call, w · h / 2^i threads are required.
Every thread executes the kernel function merge. In the kernel function, every pair of pixels across the merge boundaries is investigated in a parallel fashion. If both pixels are 'on', the root nodes of the equivalence trees the pixels belong to are found and compared. The label equivalence trees are merged by relabeling the root with the smaller label to have the same label as the other root. The device function findRoot is called in this process to find the root of the pixel currently being investigated.
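The boundary merge and root lookup can be sketched in Python as follows. Note one assumption: the paper's findRoot computes a flat index as w·row + col, while the initial labels of Eq. (1) are 1-based; this sketch keeps the 1-based ids throughout so the stages compose, and all names are ours:

```python
def pos(label, w):
    """Map a 1-based flattened label back to its (row, col) pixel."""
    return (label - 1) // w, (label - 1) % w

def find_root(labels, row, col, w):
    """Follow parent links from pixel (row, col) to its equivalence-tree root."""
    label = labels[row][col]
    if label < 0:
        return -1
    r, c = pos(label, w)
    while labels[r][c] != label:       # a root stores its own label
        label = labels[r][c]
        r, c = pos(label, w)
    return label

def merge_boundary(labels, col):
    """Merge equivalence trees across the boundary between col and col + 1.

    Where both boundary pixels are 'on', the root with the smaller label
    is relabeled to point at the larger root; the pixels themselves are
    left untouched, as in the kernel function merge.
    """
    w = len(labels[0])
    for row in range(len(labels)):
        if labels[row][col] > 0 and labels[row][col + 1] > 0:
            root_l = find_root(labels, row, col, w)
            root_r = find_root(labels, row, col + 1, w)
            lo, hi = min(root_l, root_r), max(root_l, root_r)
            r, c = pos(lo, w)
            labels[r][c] = hi          # smaller root now points at larger root
    return labels

labels = [[3, -1],
          [3, 4]]                      # after the column-wise runs
print(merge_boundary(labels, 0))       # [[3, -1], [4, 4]]
```

After this merge, pixel (0, 0) still carries label 3, but chasing its root now yields 4; the final relabeling pass makes the labels uniform.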
[Figure 4. Label merge iteration: merge boundaries and column pairs for the 1st, 2nd, and 3rd merges]

After the execution of Algorithm 2, the equivalence trees will be obtained. However, the final goal of CCL is to make all the pixels in a connected component have an identical label.
This can be achieved by applying the device
function findRoot to each pixel and updating
its label to be the returned value. This process
can be easily performed in a parallel fashion
because the label updates for any nodes in the
tree do not destroy the equivalence of the nodes
in the tree.
Algorithm 3 describes the operations of the relabeling threads. In total, w · h threads separately call findRoot for their corresponding pixels and update the labels so that the connected pixels have an identical label.
Algorithm 3: Relabeling

kernel relabel
    Data: Iλ ∈ Z^(w×h): In, Out
    begin
        thread: [1, w · h]
        row = thread / w, col = thread % w
        Iλ[row, col] = findRoot(row, col)
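Putting the four stages together, a self-contained end-to-end sketch in Python (a sequential stand-in for the CUDA kernels; `label_components` is our name, 1-based flattened labels are assumed for consistency, and w is assumed to be a power of two as the merge schedule implies):

```python
def label_components(img):
    """End-to-end CPU sketch of the proposed CCL pipeline.

    Runs initialization (Eq. 1), column-wise label runs (Algorithm 1),
    lg(w) rounds of pairwise column merges (Algorithm 2), and the final
    relabeling (Algorithm 3).
    """
    h, w = len(img), len(img[0])
    pos = lambda l: ((l - 1) // w, (l - 1) % w)

    # 1) initialization: unique 1-based ids for 'on' pixels, -1 for 'off'
    lab = [[(x + 1) + w * y if img[y][x] else -1 for x in range(w)]
           for y in range(h)]

    # 2) column-wise label runs: point each run at its bottom-most id
    for col in range(w):
        for row in range(h - 2, -1, -1):
            if lab[row][col] > 0 and lab[row + 1][col] > 0:
                lab[row][col] = lab[row + 1][col]

    def find_root(row, col):
        label = lab[row][col]
        r, c = pos(label)
        while lab[r][c] != label:
            label = lab[r][c]
            r, c = pos(label)
        return label

    # 3) lg(w) merge rounds over boundaries (b, b + 1), 0-based columns
    div = 2
    while div <= w:
        for b in range(div // 2 - 1, w - 1, div):
            for row in range(h):
                if lab[row][b] > 0 and lab[row][b + 1] > 0:
                    lo, hi = sorted((find_root(row, b),
                                     find_root(row, b + 1)))
                    r, c = pos(lo)
                    lab[r][c] = hi
        div *= 2

    # 4) relabeling: every 'on' pixel takes its root's label
    for row in range(h):
        for col in range(w):
            if lab[row][col] > 0:
                lab[row][col] = find_root(row, col)
    return lab

img = [[1, 1, 0, 0],
       [0, 1, 0, 1],
       [0, 0, 0, 1]]
print(label_components(img))
# [[6, 6, -1, -1], [-1, 6, -1, 12], [-1, -1, -1, 12]]
```

The two 4-connected components come out labeled 6 and 12, the largest initial ids they contain.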
4 EXPERIMENTAL RESULTS
The method proposed in this paper was implemented on computing environments with commodity CPUs and GPUs. The experimental results were collected from tests on a system with an i7-3630QM 2.4 GHz CPU and a GeForce GTX 670MX GPU.
Table 1. CCL performance comparison with random density noise patterns (2048×2048 pixels).

density   Grana (ms)   Proposed method (ms)
0.1          22.8          6.3
0.2          30.7          7.0
0.3          46.0          7.7
0.4          51.3          8.6
0.5          47.6         10.2
0.6          43.6         15.2
0.7          35.5         26.1
0.8          28.0         40.8
0.9          20.8         64.6
In order to verify the efficiency of our method, we compared its performance with that of the most commonly used method. The reference method was proposed in [3] and implemented with OpenCV libraries.
In the first experiment, the performance of each method was measured on images with random noise. The random noise was automatically generated, and the density of the noise ranges from 0.1 to 0.9. Table 1 shows the experimental results when 2048×2048 images with random noise are applied. The first column represents the noise density, the second column shows the measured execution time of the reference method in milliseconds, and the third column shows the execution time of the proposed method. As shown in the table, the reference method (denoted by Grana) requires the most execution time when the noise density is around 0.5, while the proposed method requires more time as the density increases.

Table 2 shows similar experimental results, except that the size of the input images is 4096×4096. As shown in the table, the computational cost changes similarly in accordance with the density.

Table 2. CCL performance comparison with random density noise patterns (4096×4096 pixels).

density   Grana (ms)   Proposed method (ms)
0.1          85.7         21.4
0.2         135.3         24.3
0.3         190.4         27.1
0.4         206.1         30.8
0.5         205.4         36.9
0.6         177.6         59.0
0.7         153.1        126.7
0.8         112.1        216.7
0.9          85.6        411.5
Table 3. CCL performance comparison with test images A and B at different image sizes.

Test Image A
image size      512²    1024²    2048²    4096²
Grana (ms)      1.14     3.86    14.39    41.1
Proposed (ms)   0.87     2.46     7.68    26.0
Gain (%)        23.7     36.3     46.6     36.7

Test Image B
Grana (ms)      1.01     3.53    12.35    40.78
Proposed (ms)   0.83     2.35     7.21    24.03
Gain (%)        17.8     33.4     41.6     41.1
Fig. 5 (a) visualizes the experimental results shown in Table 1. As shown in the figure, the computational cost of the proposed method increases as the density increases. However, the proposed method performs far better as long as the density is below 0.7.
Fig. 5 (b), which visualizes the results shown in Table 2, shows similar behavior: the computational cost of the proposed method increases with the density, but the proposed method again performs far better as long as the density is below 0.7.
The CCL algorithms are not actually applied to noise data in practice. In order to measure the performance of the proposed method in more realistic environments, we prepared the two test images shown in Fig. 6. The sizes of the images can be 512², 1024², 2048², or 4096². Test image A has two components, each of which is a long spiral curve that does not touch the other component. The other test image, B, has many scattered stars, and two of them are connected with a star-shaped thin line.
Table 3 shows the execution time required for the reference method and the proposed method applied to the test images at different sizes. The last row of each data set reports the performance gain, i.e., the ratio of the cost reduction achieved by the proposed method to the cost of the reference method. As shown in the table, the performance gain becomes more noticeable as the size of the input image increases.
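For instance, the gain entries in Table 3 follow from this definition (values taken from the table; the helper name is ours):

```python
def gain(t_reference, t_proposed):
    """Performance gain: cost reduction relative to the reference cost, in %."""
    return 100.0 * (t_reference - t_proposed) / t_reference

# test image A, 2048x2048: Grana 14.39 ms vs. proposed 7.68 ms
print(round(gain(14.39, 7.68), 1))   # 46.6
```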
Fig. 7 visually compares the computational costs of the reference method and our method (lines), and the performance gain is also visualized with bars. Fig. 7 (a) shows the result when test image A was used as the input image. The performance gain was the largest when the size of the input image is 2048×2048.
Fig. 7 (b) shows similar results for the other test image. As shown in the figure, the performance gain is again the most noticeable when the size of the image is 2048×2048.
5 CONCLUSION

In this paper, an efficient GPGPU implementation of connected component labeling (CCL) was proposed. The method exploits the data parallelism of GPUs to improve the performance of CCL. Object identification in image data is a fundamental operation, and rapid computation is highly requested as the sizes of the currently available image data rapidly increase. The experimental results show that the proposed method can be a good solution to object identification in large-scale image data.

[Figure 5. CCL execution time on noise patterns with different noise densities: (a) 2048×2048-sized noise patterns, (b) 4096×4096-sized noise patterns]

[Figure 6. Test images for practical labeling: (a) two spiral curves, (b) scattered stars]

ACKNOWLEDGMENT

This work was supported in part by the ETRI R&D Program (Development of Big Data Platform for Dual Mode Batch-Query Analytics, 16ZS1410), and also in part by the NIPA SW Convergence Technologies Enhancement Program (Development of Big Data Processing and Decision Support System for Offshore Maritime Safety, S0142-15-1014).

REFERENCES

[1] P. Chen, H. Zhao, C. Tao, and H. Sang. Block-run-based connected component labelling algorithm for GPGPU using shared memory. Electronics Letters, 47(24):1309–1311, 2011.

[2] C. Grana, D. Borghesani, and R. Cucchiara. Connected component labeling techniques on modern architectures. In International Conference on Image Analysis and Processing, pages 816–824. Springer, 2009.

[3] C. Grana, D. Borghesani, and R. Cucchiara. Optimized block-based connected components labeling with decision trees. IEEE Transactions on Image Processing, 19(6):1596–1609, 2010.

[4] C. Harrison, H. Childs, and K. P. Gaither. Data-parallel mesh connected components labeling and analysis. In Eurographics Parallel Graphics and Visualization Symposium, Llandudno, Wales, 2012.

[5] K. A. Hawick, A. Leist, and D. P. Playne. Parallel graph component labelling with GPUs and CUDA. Parallel Computing, 36(12):655–678, 2010.

[6] O. Kalentev, A. Rai, S. Kemnitz, and R. Schneider. Connected component labeling on a 2D grid using CUDA. Journal of Parallel and Distributed Computing, 71(4):615–620, 2011.
[Figure 7. CCL execution time for test images and measured performance gains: (a) test image A, (b) test image B]
[7] M. Minsky and S. Papert. Perceptrons. MIT Press, 1988.

[8] B. Preto, F. Birra, A. Lopes, and P. Medeiros. Object identification in binary tomographic images using GPGPUs. International Journal of Creative Interfaces and Computer Graphics (IJCICG), 4(2):40–56, 2013.

[9] O. Šťáva and B. Beneš. Connected component labeling in CUDA. In W. W. Hwu (Ed.), GPU Computing Gems, 2010.

[10] S. Zavalishin, I. Safonov, Y. Bekhtin, and I. Kurilin. Block equivalence algorithm for labeling 2D and 3D images on GPU. Electronic Imaging, 2016(2):1–7, 2016.