

Proceedings of The Second International Conference on Electronics and Software Science (ICESS2016), Japan 2016
ISBN: 978-1-941968-40-6 ©2016 SDIWC

A Parallel Approach to Object Identification in Large-scale Images

Young-Min Kang, Tongmyong University, Busan, 48520, Korea (ymkang@tu.ac.kr)
Sung-Soo Kim, ETRI, Daejeon, 34129, Korea (sungsoo@etri.re.kr)
Gyung-Tae Nam, GCSC Inc., Busan, 47607, Korea (gtnam@gcsc.co.kr)

ABSTRACT

As the computing power of processors improves, the sizes of image data used in various applications are also increasing. One of the most basic operations on image data is to identify the objects within an image, and connected component labeling (CCL) is the most frequently used strategy for this problem. However, CCL cannot be easily implemented in a parallel fashion because connected pixels can, in general, be found only by graph traversal. In this paper, we propose an efficient GPU-based algorithm for object identification in large-scale images and compare its performance with that of the most commonly used method, implemented with the OpenCV libraries. The method was implemented and tested on computing environments with commodity CPUs and GPUs. The experimental results show that the proposed method outperforms the reference method when the pixel density is below 0.7. Object identification is a fundamental operation on image data, and rapid computation is increasingly demanded as the sizes of available image data grow. The experimental results show that the proposed method is a good solution for object identification in large-scale image data.

KEYWORDS

GPGPU, Object Identification, CCL

1 INTRODUCTION

As the computing power of processors improves, the sizes of image data used in various applications are also increasing. Therefore, efficient algorithms for manipulating large-scale image data are required.
One of the most basic operations on image data is to identify the objects within an image, and connected component labeling (CCL) is the most frequently used strategy for this problem. Many image processing techniques can be easily implemented in a parallel fashion, and GPU parallelism has been successfully exploited in this field. However, CCL cannot be easily decomposed into parallel tasks because connected pixels are represented as adjacent nodes in a graph, and the adjacency among all the nodes can, in general, be investigated only by graph traversal. In this paper, we propose an efficient GPU-based algorithm for object identification in large-scale images and compare its performance with that of an OpenCV-based CCL algorithm.

2 RELATED WORK

Object identification is a fundamental problem in image processing. In many cases, object identification is performed with CCL. Most CCL algorithms reduce to the traversal of adjacent nodes (pixels) along the edges of the graph that represents the input image. Such traversal is naturally sequential, and parallel implementations of typical CCL algorithms have not been very successful [10]. Even in the early stages of computer vision research, it was found that connectivity cannot be determined by completely parallel tasks [7]. However, the rapid development of general purpose graphics processing unit (GPU) technologies has made it possible for GPU-based parallel approaches to achieve better performance than traditional CPU-based CCL methods [1, 9]. The basic approach to CCL is to use the union-find algorithm, which can determine whether two nodes in an undirected graph are connected [8]. However, methods based on this approach rely on sequential computations. Several methods have been proposed to exploit parallel computing architectures [5, 9].
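For reference, the union-find strategy mentioned above can be illustrated with a minimal sequential sketch. This is a generic disjoint-set structure of our own; the class and method names are not taken from any of the cited implementations:

```python
class UnionFind:
    """Minimal union-find (disjoint-set) structure with path compression."""

    def __init__(self, n):
        self.parent = list(range(n))  # each node starts as its own root

    def find(self, x):
        # follow parent pointers up to the root, compressing the path as we go
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb  # merge the two trees at their roots

    def connected(self, a, b):
        return self.find(a) == self.find(b)
```

Note that `find` is an inherently sequential pointer chase, which is exactly why this classic formulation resists straightforward parallelization.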
However, these methods were applied to relatively small images. Some methods utilized cluster architectures [4]; these methods split the data volume and assign the data segments to different computing units, so parallelism within a single GPU cannot be exploited. The label equivalence method has been implemented on a GPU [6]; it iterates to resolve label equivalence by finding the roots of equivalence trees. However, the approach relies on decision tables, which cannot be handled efficiently on a GPU [10]. Block-based labeling and efficient block processing with decision tables were proposed in [2, 3]. A block equivalence method based on a "scan mask" was also proposed [10]. However, this method must also iterate the equivalence resolution until it converges to a state where no label update occurs.

3 PROPOSED METHOD

In this section, a GPU-based parallel approach to CCL is proposed. The method is composed of four major tasks: 1) data initialization, 2) computation of column-wise label runs, 3) label merge of connected components, and 4) relabeling. Each task is explained in detail in the following subsections.

3.1 Data Initialization

The pixels in an image can be classified into 'on' pixels and 'off' pixels. The 'on' pixels are regarded as nodes in the graph representation, and it is assumed that an edge exists between two nodes whose positions neighbor each other in the image space. The goal of a CCL algorithm is to assign an identical label to linked nodes. In order to achieve this goal, each pixel is assigned a unique label in the initialization stage. The simplest method is to assign sequential numbers to the pixels. Suppose we have an image with w × h pixels, and the pixel at (x, y) is denoted as p(x, y), where x ∈ [1, w] and y ∈ [1, h]. The 'on' pixel at (x, y) is then labeled with the number x + w(y − 1). Therefore, the labels range from 1 to wh. All the 'off' pixels are labeled −1. The label map Iλ is an image composed of the label of each pixel, and the label at (x, y) in the label map is denoted by Iλ_{x,y}. In other words, the image with initial labels can be described as follows:

    Iλ ∈ Z^{w×h}
    id_{x,y} = x + w(y − 1) ∈ [1, wh]
    p(x, y) = 1 ⇒ Iλ_{x,y} = id_{x,y}
    p(x, y) = 0 ⇒ Iλ_{x,y} = −1        (1)

After the initialization is done, the rest of the algorithm merges the positive labels in each connected component into a single label.

3.2 Computing Column-wise Label Runs

In order to merge labels, adjacent positive labels are merged. A block of contiguous object pixels in a column is a 'run.' The first stage of label merging is to find runs; in other words, each run is identified and labeled with a unique number. Each column is assigned to a CUDA thread and processed in parallel, so we have w threads running separately. The computation within a thread simply scans the pixels and changes the label of the current pixel to that of the previously scanned pixel if both pixels are 'on.' Fig. 1 shows how the label assigned to each pixel is updated through the column-wise label run computation described in Algorithm 1. After the update, each label represents the root node of the equivalence tree it belongs to, as shown in Fig. 1.

Figure 1. Label runs and equivalence forest

Algorithm 1: Column-wise label run
  kernel vertLabel
    Data: Iλ ∈ Z^{w×h}: In, Out
    begin
      col = thread : [1, w]
      for row = h − 1 downto 1 do
        if Iλ_{row,col} > 0 and Iλ_{row+1,col} > 0 then
          Iλ_{row,col} = Iλ_{row+1,col}

3.3 Label Merge

Once the column-wise label update is finished, the horizontal connectivity between columns must be investigated. Let us suppose, for simplicity, that we have only two columns. In the previous column-wise update, the labels were merged to the largest value in each equivalence tree. If two pixels in the same row are connected, the equivalence trees to which those pixels belong should be merged into a single tree. In order to achieve this, the root node of each tree must be found, and the root with the smaller label is relabeled to point to the root with the larger label, as shown in Fig. 2. Note that the labels of the connected pixels are not directly updated.

Figure 2. Label merge with two columns

Fig. 3 shows the merge process. As shown in Fig. 3 (a), two pixels a and b in different equivalence trees are found to be connected. Directly updating the labels of the connected pixels does not successfully merge the equivalence trees, as shown in Fig. 3 (b). The correct label merge is done by comparing and updating the roots of the equivalence trees, as shown in Fig. 3 (c).

In order to perform the two-column label merge for an image with w columns, w/2 column pairs are merged separately with w/2 threads. After every two adjacent column pairs are merged, we iterate the label merge. In the second merge phase, as shown in Fig. 4, we only have to consider the boundaries between the previously merged column pairs, so the computation is reduced to half of the previous phase.
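As a sequential illustration (not the authors' CUDA code), the initialization of Eq. (1) and the column-wise runs of Algorithm 1 can be sketched in Python. The function names are ours, the arrays are 0-indexed, and the outer loop over columns stands in for the w parallel CUDA threads:

```python
def init_labels(img):
    """Eq. (1): give each 'on' pixel the unique 1-based label x + w*(y-1);
    'off' pixels get -1. img is a list of h rows of w values in {0, 1}."""
    h, w = len(img), len(img[0])
    return [[(x + 1) + w * y if img[y][x] == 1 else -1 for x in range(w)]
            for y in range(h)]

def column_runs(labels):
    """Algorithm 1: scan each column from the bottom row upward; a pixel in a
    vertical run inherits the label of the pixel below it, so every run ends
    up labeled with its bottom-most (largest) label."""
    h, w = len(labels), len(labels[0])
    for col in range(w):  # each column is one CUDA thread in the GPU version
        for row in range(h - 2, -1, -1):
            if labels[row][col] > 0 and labels[row + 1][col] > 0:
                labels[row][col] = labels[row + 1][col]
    return labels
```

For a 2×2 image with three 'on' pixels, `column_runs(init_labels([[1, 1], [1, 0]]))` merges the left column's run into the label 3 of its bottom pixel, yielding `[[3, 2], [3, -1]]`.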
Figure 3. Label merge with two columns: (a) connected pixels found, (b) direct update of pixel labels, (c) root node merge

It is easily noticed that lg w iterations are sufficient for finding the final label equivalence. Let us denote the computational cost of the first iteration by C(1). The total computational cost of the label merge is then

    Σ_{i=0}^{lg w − 1} (1/2^i) C(1) = O(C(1)).

Algorithm 2 shows the implementation details of our method. The label merge is implemented with one host function, one kernel function, and one device function.

Algorithm 2: Label merge
  host callMergeLabel
    Data: Iλ ∈ Z^{w×h}: In, Out
    begin
      div = 2
      for i = 0 upto lg w − 1 do
        merge<<h·w/div>>(div, Iλ)
        div = 2 × div

  kernel merge
    Data: div ∈ Z: In, Iλ ∈ Z^{w×h}: In, Out
    begin
      thread : [1, wh/div]
      nBoundary = w/div
      col = div/2 + (thread % nBoundary) × div
      row = thread / nBoundary
      if Iλ_{row,col} > 0 and Iλ_{row,col+1} > 0 then
        rootL = findRoot(row, col)
        rootR = findRoot(row, col+1)
        Iλ_{min(rootL, rootR)} = Iλ_{max(rootL, rootR)}

  device findRoot
    Data: row, col ∈ Z: In; label ∈ Z: Out
    begin
      if Iλ_{row,col} < 0 then return −1
      label = w · row + col
      while Iλ_{label} ≠ label do
        label := Iλ_{label}
      return label

In the host function callMergeLabel, we determine the number of boundaries at which column pairs are merged and call the kernel function merge with the necessary number of threads. The host function iterates this call lg w times, and the number of threads decreases as the iteration proceeds: in the i-th call, w · h/2^i threads are required. Every thread executes the kernel function merge, in which every pair of pixels across the merge boundaries is investigated in parallel.
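A sequential Python sketch may help clarify Algorithm 2; again, this is our illustration rather than the paper's CUDA implementation. The label map is flattened into a list `lam` of w·h entries holding 1-based labels (−1 for 'off' pixels); the doubling loop plays the role of the host iteration, and the inner loops stand in for the parallel merge threads:

```python
def find_root(lam, idx):
    """Follow label pointers from pixel index idx (0-based) to the tree root.
    lam[i] stores the (1-based) label of pixel i+1, or -1 for 'off' pixels.
    A root is a pixel whose stored label equals its own id."""
    if lam[idx] < 0:
        return -1
    label = idx + 1                 # the pixel's own 1-based id
    while lam[label - 1] != label:  # chase pointers until the fixpoint
        label = lam[label - 1]
    return label

def merge_columns(lam, w, h):
    """Sketch of Algorithm 2: merge column pairs across boundaries, doubling
    the stride lg w times. Each (row, boundary) pair is one CUDA thread on
    the GPU; here they are visited sequentially."""
    div = 2
    while div <= w:
        for col in range(div // 2 - 1, w - 1, div):  # 0-based boundary columns
            for row in range(h):
                a, b = row * w + col, row * w + col + 1
                if lam[a] > 0 and lam[b] > 0:
                    ra, rb = find_root(lam, a), find_root(lam, b)
                    if ra != rb:
                        # smaller root is re-pointed at the larger root
                        lam[min(ra, rb) - 1] = max(ra, rb)
        div *= 2
    return lam
```

Continuing the small 2×2 example (flattened label map `[3, 2, 3, -1]` after the column-wise runs), one merge pass re-points root 2 at root 3, so all three 'on' pixels share the root 3.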
If the pixels are both 'on', the root nodes of the equivalence trees the pixels belong to are found and compared. The label equivalence trees are merged by relabeling the root with the smaller label so that it carries the same label as the other root. The device function findRoot is called in this process to find the root of the pixel currently being investigated.

Figure 4. Label merge iteration

After the execution of Algorithm 2, the equivalence trees are obtained. However, the final goal of CCL is to make all the pixels in a connected component carry an identical label. This can be achieved by applying the device function findRoot to each pixel and updating its label to the returned value. This process can easily be performed in a parallel fashion because the label update of any node in a tree does not destroy the equivalence of the nodes in that tree. Algorithm 3 describes the operations of the relabeling threads. A total of w · h threads separately call findRoot for the corresponding pixels and update the labels so that connected pixels share an identical label.

Algorithm 3: Relabeling
  kernel relabel
    Data: Iλ ∈ Z^{w×h}: In, Out
    begin
      thread : [1, w · h]
      row = thread / w, col = thread % w
      Iλ_{row,col} = findRoot(row, col)

4 EXPERIMENTAL RESULTS

The method proposed in this paper was implemented on computing environments with commodity CPUs and GPUs. The experimental results were collected on a system with an i7-3630QM 2.4 GHz CPU and a GeForce GTX 670MX GPU.

Table 1. CCL performance comparison with random density noise patterns (2048×2048 pixels).
density    Grana (ms)    Proposed method (ms)
0.1        22.8          6.3
0.2        30.7          7.0
0.3        46.0          7.7
0.4        51.3          8.6
0.5        47.6          10.2
0.6        43.6          15.2
0.7        35.5          26.1
0.8        28.0          40.8
0.9        20.8          64.6

In order to verify the efficiency of our method, we compared its performance with that of the most commonly used method. The reference method was proposed in [3] and implemented with the OpenCV libraries. In the first experiment, the performance of each method was measured on images with random noise. The random noise was automatically generated, and the density of the noise ranges from 0.1 to 0.9. Table 1 shows the experimental results for 2048×2048 images with random noise. The first column gives the noise density, the second column the measured execution time of the reference method in milliseconds, and the third column the execution time of the proposed method. As shown in the table, the reference method (denoted by Grana) requires more execution time when the noise density is around 0.5, while the proposed method requires more time as the density increases.

Table 2. CCL performance comparison with random density noise patterns (4096×4096 pixels).

density    Grana (ms)    Proposed method (ms)
0.1        85.7          21.4
0.2        135.3         24.3
0.3        190.4         27.1
0.4        206.1         30.8
0.5        205.4         36.9
0.6        177.6         59.0
0.7        153.1         126.7
0.8        112.1         216.7
0.9        85.6          411.5

Table 2 shows similar experimental results, except that the size of the input images is 4096×4096. As shown in the table, the computational cost varies similarly with the density.

Table 3. CCL performance comparison with test images A and B at different image sizes.
Test Image A
Image size     512²     1024²    2048²    4096²
Grana (ms)     1.14     3.86     14.39    41.1
Proposed (ms)  0.87     2.46     7.68     26.0
Gain (%)       23.7     36.3     46.6     36.7

Test Image B
Image size     512²     1024²    2048²    4096²
Grana (ms)     1.01     3.53     12.35    40.78
Proposed (ms)  0.83     2.35     7.21     24.03
Gain (%)       17.8     33.4     41.6     41.1

Fig. 5 (a) visualizes the experimental results shown in Table 1. As shown in the figure, the computational cost of the proposed method increases as the density increases; however, the proposed method is far better as long as the density is below 0.7. Fig. 5 (b), which visualizes the results shown in Table 2, shows similar behavior: the cost of the proposed method again increases with the density, but the method remains far better while the density is below 0.7.

Figure 5. CCL execution time on noise patterns with different noise densities: (a) 2048×2048-sized noise patterns, (b) 4096×4096-sized noise patterns

In practice, CCL algorithms are not applied to pure noise data. In order to measure the performance of the proposed method under more realistic conditions, we prepared the two test images shown in Fig. 6. The sizes of the images can be 512², 1024², 2048², or 4096². Test image A has two components, each of which is a long spiral curve that does not touch the other component. The other test image, B, has many scattered stars, two of which are connected by a thin star-shaped line. Table 3 shows the execution times of the reference method and the proposed method applied to the test images at different sizes. The last row of each data set shows the performance gain, computed as the ratio of the computational cost reduced by the proposed method to the cost of the reference method. As shown in the table, the performance gain becomes more noticeable as the size of the input image increases.

Figure 6. Test images for practical labeling: (a) two spiral curves, (b) scattered stars

Fig. 7 visually compares the computational costs of the reference method and our method (lines) and also visualizes the performance gain (bars). Fig. 7 (a) shows the result when test image A was used as the input; the performance gain was largest when the size of the input image was 2048×2048. Fig. 7 (b) shows similar results for the other test image: again, the performance gain is most noticeable when the image size is 2048×2048.

Figure 7. CCL execution time for test images and measured performance gains: (a) test image A, (b) test image B

5 CONCLUSION

In this paper, an efficient GPGPU implementation of connected component labeling (CCL) was proposed. The method exploits the data parallelism of GPUs to improve the performance of CCL. Object identification is a fundamental operation on image data, and rapid computation is increasingly demanded as the sizes of available image data grow. The experimental results show that the proposed method is a good solution for object identification in large-scale image data.

ACKNOWLEDGMENT

This work was supported in part by the ETRI R&D Program (Development of Big Data Platform for Dual Mode Batch-Query Analytics, 16ZS1410), and also in part by the NIPA SW Convergence Technologies Enhancement Program (Development of Big Data Processing and Decision Support System for Offshore Maritime Safety, S0142-15-1014).

REFERENCES

[1] P. Chen, H. Zhao, C. Tao, and H. Sang. Block-run-based connected component labelling algorithm for GPGPU using shared memory. Electronics Letters, 47(24):1309–1311, 2011.
[2] C. Grana, D. Borghesani, and R. Cucchiara. Connected component labeling techniques on modern architectures. In International Conference on Image Analysis and Processing, pages 816–824. Springer, 2009.
[3] C. Grana, D. Borghesani, and R. Cucchiara. Optimized block-based connected components labeling with decision trees. IEEE Transactions on Image Processing, 19(6):1596–1609, 2010.
[4] C. Harrison, H. Childs, and K. P. Gaither. Data-parallel mesh connected components labeling and analysis. In Eurographics Parallel Graphics and Visualization Symposium, Llandudno, Wales, 2012.
[5] K. A. Hawick, A. Leist, and D. P. Playne. Parallel graph component labelling with GPUs and CUDA. Parallel Computing, 36(12):655–678, 2010.
[6] O. Kalentev, A. Rai, S. Kemnitz, and R. Schneider. Connected component labeling on a 2D grid using CUDA. Journal of Parallel and Distributed Computing, 71(4):615–620, 2011.
[7] M. Minsky and S. Papert. Perceptrons. MIT Press, 1988.
[8] B. Preto, F. Birra, A. Lopes, and P. Medeiros. Object identification in binary tomographic images using GPGPUs. International Journal of Creative Interfaces and Computer Graphics (IJCICG), 4(2):40–56, 2013.
[9] O. Štava and B. Beneš. Connected component labeling in CUDA. In W. W. Hwu (Ed.), GPU Computing Gems, 2010.
[10] S. Zavalishin, I. Safonov, Y. Bekhtin, and I. Kurilin. Block equivalence algorithm for labeling 2D and 3D images on GPU. Electronic Imaging, 2016(2):1–7, 2016.