ACCELERATION OF THE EDGE STRENGTH FUNCTION ON GPU USING CUDA

The edge strength function (ESF) generates a family of level curves that evolves under the influence of curvature motions. This function has proved useful for representing images in computer vision for analysis and recognition purposes in different contexts. Computation of the ESF requires solving a partial differential equation and is hence computationally costly. In this work, we present two parallel implementations of the ESF that work on Graphics Processing Units (GPU) using the Compute Unified Device Architecture (CUDA). Both implementations reduce the computation time significantly with respect to their serial counterpart. The implementations differ mainly in the type of memory that is utilized for accessing data: the first approach utilizes the shared memory and the second one utilizes the texture memory. We obtain between 40 and 65 times speedup with the shared memory based implementation, and between 35 and 55 times with the texture memory based implementation, with respect to the single threaded CPU implementation. The amount of speedup changes depending on the data size.


INTRODUCTION
The edge strength function (ESF) was introduced by Tari et al. (1997) as an alternative tool for curve evolution. It is a modified form of the Ambrosio-Tortorelli approximation (Ambrosio and Tortorelli, 2003) to the Mumford-Shah functional (Mumford and Shah, 1989) that uses large diffusion values, and it can be interpreted as a smoothed distance function (Erdem and Tari, 2009). The function generates a family of level curves that evolves under the influence of curvature motions. The ESF has proved useful for extracting essential shape characteristics and has been used in different computer vision tasks, such as object recognition via axis based image representation (Aslan and Tari, 2005), image segmentation (Erdem et al., 2005), part embedding (Keles et al., 2012), and critical point detection (Keles and Tari, 2015). A sample image and an ESF of this image are depicted in Figure 1. This research was performed to improve our preliminary shape grammar implementation, a multi-core evolutionary approach that solves the embedding problem for sketches (Keles, 2015). In that work, we had to use a smoothed version of the standard distance transform instead of the ESF for part embeddings, due to the high computational cost of the ESF. As follow-up research, to solve this efficiency problem, we have designed two efficient parallel solutions for the ESF computation. The motivation for this improvement comes from the fact that the ESF is reported to be more accurate in representing the intrinsic characteristics of shapes than the standard distance transform and its smoothed variants (Keles et al., 2012). In this work, we provide the details of two different parallel implementations of the ESF that run on many-core GPU architectures. We evaluate the performance of the proposed implementations with respect to our serial, single threaded CPU implementation.

The rest of the paper is organized as follows. In Section 1.1, we introduce the edge strength function and its discrete domain solution in detail. Following that, an overview of the related works on GPUs is provided. Then, in Section 2, we present the two CUDA based solutions in detail. Finally, in Section 3, we compare the performances of the two parallel implementations with respect to the single threaded CPU implementation.

The Edge Strength Function
The edge strength function, which will be denoted here as v, generates an exponentially decaying, smoothed distance field. It is defined as the minimizer of the following energy functional (Aslan and Tari, 2005):

E(v) = ∫_Ω ( ρ |∇v|² + v²/ρ ) dΩ    (1)

subject to v = 1 on the shape boundary. Here, ρ is a small number which controls the level of smoothing. On a binary drawing, where each pixel on the drawing is represented as 1 and the rest as 0 in a bounded image domain Ω, v is computed by solving the following PDE (Tari et al., 1997):

∂v/∂t = ∇²v − v/ρ²    (2)

where ∂v/∂n = 0 on ∂Ω. In this equation, ∂Ω is the boundary of the image plane and n is the normal direction on the boundary. The discrete version of Equation (2) can be written using finite difference approximations, where central differences are used for the spatial derivatives and forward differences are used for the temporal derivative. Here, ∇² is the Laplacian operator, which is approximated in the ESF computations using the 3x3 convolution kernel shown below:

0   1   0
1  −4   1
0   1   0

Then, the solution becomes:

v^(n+1)_(i,j) = v^n_(i,j) + ∆t ( (∇²v^n)_(i,j) − v^n_(i,j)/ρ² )    (3)

where i, j ∈ Ω are the spatial indices and n is the iteration number. At the beginning, v^0 is initialized with the original image. During the iterations, the pixel values that belong to the drawing/shape are kept constant at their original value, i.e. 1. For convergence, ∆t needs to be less than 0.25 (Tari et al., 1997); in our experiments, we set ∆t to 0.2.
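The discrete update above can be sketched in plain C. The following is an illustrative serial version, not our test configuration: the clamp-to-edge read realizes the zero-gradient boundary condition, the mask marks the shape pixels that are held at 1, and the parameter values in any usage are the reader's choice.

```c
/* One explicit Euler step of the discrete ESF update:
 * v[i][j] += dt * (laplacian(v)[i][j] - v[i][j] / (rho*rho)),
 * with shape pixels (mask != 0) held fixed at 1. */
static float at(const float *v, int w, int h, int i, int j) {
    /* clamp-to-edge read: realizes the Neumann condition dv/dn = 0 */
    if (i < 0) i = 0; if (i >= h) i = h - 1;
    if (j < 0) j = 0; if (j >= w) j = w - 1;
    return v[i * w + j];
}

void esf_step(const float *v, float *out, const unsigned char *mask,
              int w, int h, float dt, float rho) {
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < w; ++j) {
            if (mask[i * w + j]) { out[i * w + j] = 1.0f; continue; }
            /* 5-point Laplacian, i.e. the 3x3 kernel above with the
             * zero coefficients skipped */
            float lap = at(v, w, h, i - 1, j) + at(v, w, h, i + 1, j)
                      + at(v, w, h, i, j - 1) + at(v, w, h, i, j + 1)
                      - 4.0f * at(v, w, h, i, j);
            out[i * w + j] = v[i * w + j]
                           + dt * (lap - v[i * w + j] / (rho * rho));
        }
}
```

Iterating this step with two buffers swapped after each call produces the exponentially decaying field; the GPU implementations in Section 2 parallelize exactly this per-pixel computation.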

Related Works
Data parallel computation on commodity graphics processors started with the fixed function graphics pipeline in the early 2000s. Early works utilized graphics API functions to solve PDEs for non-linear diffusion (Rumpf and Strzodka, 2001) and level sets (Lefohn et al., 2003).
The extended capabilities of the fixed function graphics pipeline, through programmable vertex and fragment processors, made general purpose programming on GPUs more convenient using shading languages such as Cg (Fernando and Kilgard, 2003), Direct3D's High Level Shading Language (HLSL), and the OpenGL Shading Language (GLSL). Programming a GPU using shading languages was still based on the graphics pipeline, yet it was easier to utilize the computational power of GPUs for general purpose programming. This attracted many researchers in the field to optimize computationally expensive algorithms by redesigning the data structures and the data flow for these GPU architectures (Keles et al., 2006; Rudomin et al., 2005; Ruiz et al., 2008). CUDA provides a programming model which enables the utilization of massive data parallelism through abstractions for thread groups, shared memory, and thread synchronization, with a minimal set of extensions to the C programming language (Owens et al., 2008). This framework provides opportunities for developing massively parallel desktop applications that run on GPUs to solve performance problems in various scientific domains, such as physics (Gremse et al., 2016), material sciences (Jimenez and Ortiz, 2016), and applied mathematics (Reis et al., 2016).

Solving partial differential equations is a computationally demanding task; therefore, there is ongoing research on the design and implementation of PDE solvers that run in parallel. One of the early works that solves a nonlinear diffusion model on GPUs belongs to Rumpf and Strzodka (2001). The aim of this work is smoothing images based on a modified Perona-Malik model (Perona and Malik, 1990). Their solution is based on a finite element scheme implemented using texture hardware, where the graphics pipeline is utilized to modify the data that is stored as a texture through texture processing operations. Sanderson et al. (2009) present a framework that solves PDEs of advection-reaction-diffusion models on the GPU. The aim of this work is to create spatiotemporal patterns that can be used for texture synthesis and the visualization of vector fields. They implemented their GPU based solver with fragment programs, where the data is stored as a texture and the graphics pipeline functions are utilized to access and modify data.
In the literature, more recent works provide solutions to some reaction-diffusion problems using CUDA. Molnar et al. (2011) generated simulations for four different reaction-diffusion problems in 3D; they report 5-40x speedup with reference to their single threaded CPU implementation. Similarly, Holmen and Foster (2014) focused on 3D reaction-diffusion simulations to solve the diffusion of a chemically inert compound using CUDA. Their focus is on improving the performance of a single iteration; they reported an 8.69x speedup with reference to their multi-threaded CPU based implementation.
Recently, D'Ambra and Filippone (2016) presented a GPU based solution to the image segmentation problem that solves the Ambrosio-Tortorelli model. Their approach is based on solving a system of two coupled elliptic PDEs by a generalized relaxation method. In the implementation, they use the GPU extensions of the Parallel Sparse Basic Linear Algebra Subprograms (PSBLAS) library. Although there are GPU based solutions to some reaction-diffusion problems in different domains, to the best of our knowledge, a parallel implementation of the ESF, as it is presented in Tari et al. (1997), has not yet been presented in the literature. The details of our solution are provided in Section 2.

MATERIALS AND METHODS
In this section, we discuss the design of our CUDA based solutions to the ESF computation. Two different approaches are implemented in CUDA with regard to GPU memory utilization: (1) shared memory and (2) texture memory. In both approaches, the same CPU-CUDA implementation is used as a framework; only the kernel resource initializations and the invoked kernels change depending on the configuration. The flow chart of the framework is depicted in Figure 2.
In this framework, the image is first copied from the host (CPU) memory to the device (GPU) memory, to be used in the ESF computations over multiple iterations. The number of iterations is configured by the user in order to evaluate the performance of the algorithms over a range of iteration counts. During these iterations, the evolving ESF of the image is kept in the device memory at all times. The details of the two solutions are made clear in the upcoming sections.
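The host-side control flow described above can be sketched in plain C. In this sketch, cudaMalloc, cudaMemcpy, and the kernel launches are stood in for by ordinary host memory operations and a function pointer, so only the buffer management pattern is shown; run_esf and step_fn are illustrative names, not identifiers from our implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for a kernel launch: one ESF step over n values. */
typedef void (*step_fn)(const float *in, float *out, int n);

void run_esf(const float *host_img, float *host_out, int n,
             int n_iters, step_fn step) {
    /* one upload before the loop; the evolving field then stays
     * in "device" memory for all iterations */
    float *d_a = malloc(n * sizeof *d_a);
    float *d_b = malloc(n * sizeof *d_b);
    memcpy(d_a, host_img, n * sizeof *d_a);
    for (int it = 0; it < n_iters; ++it) {
        step(d_a, d_b, n);                   /* kernel launch in the real code */
        float *t = d_a; d_a = d_b; d_b = t;  /* ping-pong buffer swap */
    }
    memcpy(host_out, d_a, n * sizeof *d_a);  /* single download at the end */
    free(d_a);
    free(d_b);
}
```

The point of the pattern is that no host-device transfer occurs inside the iteration loop; only the two device buffer pointers are swapped between iterations.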

Shared Memory Implementation
Shared memory provides data sharing among the threads in the same block; hence, it is more efficient to load the data that is used by multiple threads in a block into the shared memory instead of accessing it through the global memory. In CUDA supported GPUs, the access times of shared memory are up to 100x faster than those of the global memory; hence, efficiency increases when a thread needs to access multiple data elements in the same kernel. In the ESF computation, each pixel is accessed 9 times for the convolution. Even when the kernel is optimized by excluding the pixels that correspond to the zero coefficients, each pixel is read at least 5 times.
In our solution, the image is semantically represented as a set of tiles. The image contents corresponding to each tile are copied to the shared memory; the copy operation is performed in parallel as well. We represent the two dimensional input image as a one dimensional array in the global memory of the GPU, while the image tiles are represented as 2d arrays in the shared memory.
In our implementation, each thread copies four pixels (i.e. floating point values) from the global memory: the pixels that reside at the upper left, upper right, lower left and lower right corners of the pixel corresponding to the kernel center. When all the threads in a block complete this operation, the content of a tile has been loaded into the shared memory. In order to apply the convolution at each pixel in a tile, we also need to copy the apron pixels to the boundaries of the shared memory. The convolution function in the ESF requires replication of the boundary values for the apron pixels at the image boundary (Figure 3), in order to satisfy the boundary conditions so that the gradients at the image boundary are zero. Moreover, the apron pixels of the inner tiles need to be copied from the content of the neighboring tile boundaries (Figure 4).
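The tile loading with apron replication can be sketched as follows; this is a serial C emulation under stated assumptions: TILE is an illustrative size, load_tile and clampi are hypothetical names, and in the actual CUDA kernel the loop iterations are distributed over the threads of a block rather than executed by one thread.

```c
#define TILE 16  /* illustrative tile edge length */

static int clampi(int x, int lo, int hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Fill a (TILE+2)^2 shared-memory tile, including the one-pixel apron,
 * from the global image. Out-of-image reads are replicated from the
 * image boundary (clamp-to-edge), which keeps the gradients at the
 * image boundary zero. */
void load_tile(const float *img, int w, int h,
               int tile_x, int tile_y,   /* tile origin in pixels */
               float smem[TILE + 2][TILE + 2]) {
    for (int i = 0; i < TILE + 2; ++i)
        for (int j = 0; j < TILE + 2; ++j) {
            /* -1 shifts from apron coordinates to image coordinates */
            int gy = clampi(tile_y + i - 1, 0, h - 1);
            int gx = clampi(tile_x + j - 1, 0, w - 1);
            smem[i][j] = img[gy * w + gx];
        }
}
```

For inner tiles the clamping never triggers, so the apron rows and columns are simply read from the neighboring tiles' boundary pixels in global memory, as described above.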

Texture Memory Implementation
In this approach, a 2d floating point texture is utilized for the Laplacian computation. The advantage of using texture memory is that it is cached on-chip, and the texture caches are optimized for thread access patterns that exhibit spatial locality. Therefore, accessing the cached texture memory improves the kernel performance and reduces the memory traffic considerably.
In this approach, a CUDA array, which has the same width and height as the input image, is allocated and bound to the texture memory. CUDA arrays optimize cache coherence for filtering operations, so that reading the data from the upper and lower rows and the neighboring columns is efficient. Moreover, it is not necessary to handle the tile apron pixels explicitly, because this is done automatically by the tex2D function. Hence, the implementation with texture memory is simpler for the ESF computation. Initially, the input image is in the host memory; it is copied from the host memory to the allocated CUDA array, which is in the device memory.

RESULTS AND DISCUSSIONS
In this section, we present the performances of the two CUDA implementations with respect to our single threaded, serial CPU implementation. We selected two images with different characteristics for the performance evaluations. The first image is a Seljuk pattern from Kayseri (Anatolia) that is composed of thin, crowded line stripes. The second image is the Lena image in binary form, which is composed of large segments of foreground and background parts (Figure 5). Image size is the bottleneck for the ESF computations; hence, we provide the performances of the parallel implementations for 4 different sizes of the test images. In order to evaluate the performances, we use the speedup metric, computed as the ratio of the execution time on the CPU to the one obtained on the GPU:

Speedup = T_CPU / T_GPU    (4)

Here, T_GPU is the total execution time of the ESF kernels on the GPU and T_CPU is the total execution time of the ESF implementation on the CPU. We did not include the memory allocation and deallocation times; we track only the execution times of the kernel calls for a specified number of iterations. All the experiments are repeated 50 times and their averages are reported.
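Equation (4) can be expressed as a one-line helper; the timing values in the usage example below are illustrative, not measurements from our experiments.

```c
/* Speedup as defined in Equation (4): total CPU execution time of the
 * ESF divided by the total execution time of the ESF kernels on the GPU.
 * Both times must be in the same unit (here, milliseconds). */
double speedup(double t_cpu_ms, double t_gpu_ms) {
    return t_cpu_ms / t_gpu_ms;
}
```

For example, a hypothetical CPU time of 6500 ms against a GPU kernel time of 100 ms gives speedup(6500.0, 100.0) = 65.0.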
The experiments on the CUDA kernels are performed on an NVidia GeForce GTX 970 graphics card, and the CPU implementation is tested on a 4 GHz Intel i7 processor. In order to evaluate the speedup performances, we prepared a completely white image as a benchmark. A white image is the most saturated image for the ESF computations; hence, the diffusion computation is not performed for any of the pixels. The speedup obtained from this image demonstrates the level of performance improvement gained by merely subdividing the data into tiles on the GPU. The speedups obtained from our benchmark images in various sizes are depicted in Figure 7.

Experiments with the Seljuk Pattern Images
The crowded and overlapping line stripes in the Seljuk patterns make them ideal for the ESF performance tests: since each tile contains both foreground and background pixels, there is high branch divergence in almost every tile. The execution times for the Seljuk patterns in various sizes are depicted in Table 1. These values are obtained for 50 iterations. The corresponding speedups obtained with the GPU implementations are shown graphically in Figure 8. The execution time of the ESF increases proportionately with the data size for both the CPU and the GPU, yet there is still a slight improvement in speedup. When the size of the data is small, i.e. 4 MB, the speedup does not increase as fast, due to the rather inefficient utilization of the data parallelism of the GPU. Since the ESF kernels are memory-bound, i.e. they access multiple memory locations but do not contain dense arithmetic computations, the memory access cost becomes more dominant in the overall execution time (Table 2).

Experiments with the Lena Images
The execution times for the ESF computation using the Lena images are depicted in Table 3. The test results show that the speedups for the Lena images are relatively higher than the ones obtained from the Seljuk images (Figure 9). This is due to the decrease in branch divergence in the Lena images compared to the Seljuk pattern, because the distribution of foreground and background pixels is relatively more coherent in the Lena images. Note that the bigger Lena images are created by scaling the image. Therefore, when we scale the image, the foreground region is doubled while the tile dimensions in the kernels are fixed. This enables the threads in the same block, hence in the same tile, to execute the same branch most of the time. This is not the case for the complicated line stripes in the Seljuk image, since the Seljuk tiles are extended to bigger sizes without scaling. The tests on the Lena images also show that when the data size is increased to 256 MB, the average speedup does not improve further. Although the speedup at this size is not better than at 64 MB, it is still higher than in the other configurations. We believe that the slight decrease in speedup is due to the change in the tile alignments when the data is doubled in size. For this test image at this size, the speedup surpasses that of the 64 MB configuration when the number of iterations is increased to 400 or more.

CONCLUSION
In this research, we presented two GPU implementations of the ESF using CUDA, namely the shared memory and the texture memory implementations, and compared their performances. The test images were selected considering the domains in which the ESF is utilized most frequently, and the experiments were planned on varied sizes of these images. Our experiments show that the speedup of the shared memory implementation is higher in all the experiments, due to the efficient block level data sharing among the threads in the same tile and the faster access times of shared memory. Although accesses to the texture memory are fast with cached data, there is an overhead of moving data from the global memory to the texture memory in each iteration. Both GPU implementations result in significant speedups with respect to our single threaded CPU implementation; the improvement is at least 35 times in the texture memory based implementation and more than 40 times in the shared memory implementation. Speedup increases as the data size is increased.
The methods developed in this research are utilized in a shape grammar interpreter to improve computation times of part embeddings.Moreover, the proposed CUDA based ESF solutions will also be used in an upcoming research on a novel shape segmentation algorithm.

Figure 1 .
Figure 1. Top: a sample image; bottom: ESF of the image.

Figure 2 .
Figure 2. Flow chart of the ESF computation framework.

Figure 3 .
Figure 3. Apron pixels on the boundaries.

Figure 4 .
Figure 4. Apron pixels of the inner tiles.

Figure 5 .
Figure 5. Test images: on the left, Seljuk pattern from Kayseri (Anatolia); on the right, binary Lena image.

Figure 6 .
Figure 6. On the left, ESF of the Seljuk pattern; on the right, ESF of the Lena image.

Figure 9 .
Figure 9. Speedup obtained with the Lena images, for 200 iterations.

Table 1 .
Execution times of the Seljuk patterns for 50 iterations (in milliseconds). SM: shared memory implementation, TM: texture memory implementation.

Table 2 .
Execution times of the Seljuk patterns for 200 iterations (in ms).

Table 3 .
Execution times of the Lena images for 200 iterations (in ms).