CUDA Based Computation of Quadratic Image Filters

Image processing applications usually requires nonlinear methods due to the nonlinear characteristics of images. Quadratic image filter which is a class of nonlinear image filters are widely used in practice such as noise elimination edge detection and image enhancement. On the other hand, second order products of the pixels make quadratic image filters computationally expensive to implement when compared to linear convolution. In the last decade, CUDA accelerated computing has been widely used in image processing applications to reduce computation times. In this study, an efficient method for the CUDA acceleration of the quadratic image filter has been implemented. For this purpose, alternative algorithms were examined comparatively since the performance of the GPU is sensitive to memory utilization. Because quadratic filter has a large number of coefficients and quadratic terms, the algorithm which utilizes the shared memory for storing image blocks provided the best throughput among the examined methods. Comparative results that were obtained using various images in different sizes show significant accelerations over sequential implementation.


Introduction
Image processing applications such as noise filtering, edge detection, and image enhancement are usually implemented using convolution or correlation operations. Convolution based image filtering involves moving a filtering kernel over the whole image to compute the pixels of output image. Each pixel is computed by multiplying the selected window of pixels from the input image with a filtering kernel. Then the multiplication results are summed and the resulting value is written to target pixel. [1]. Due the nonlinear structure of image contents, performance of linear filters may not be satisfactory in some applications. In such cases, nonlinear filters can be preferred over linear filters [2]. Theoretically, Volterra series approach is usually used to model nonlinear systems using infinite elements. In practice, Volterra model of a nonlinear system is defined by truncating it to a reasonable size. Usually, the preferred approach is to truncate it to include up to second order terms to reduce computational complexity and these filters are usually called as Quadratic Image Filter (QIF) [3]. QIFs are utilized in various research fields such as Gaussian noise or impulsive noise removal due to its edge preserving features [4]- [6]. QIFs are successful in edge detection applications [7], [8], medical image processing applications such as enhancement or noise reduction for mammogram images [9]- [15]. QIFs are utilized for feature extraction to use in face recognition applications [16]- [18]. QIFs requires to compute second order multiplications for computing a filtered pixel [19]- [21]. Hence, required computations considerably larger than a convolutional filter of the same window size. For example, a QIF application using 33 window requires additional 45 number of multiplications for obtaining all possible second degree of multiplications of the input pixels. Also, another 45 number of multiplications for window weights are required. On the other hand linear convolution only requires 9 number of multiplications with windows weights. Recent GPU (Graphics Processing Unit) products provides a good mean to accelerate computation of image processing applications. For this purpose, NVIDIA provides CUDA (Compute Unified Device Architecture) model to utilize GPU efficiently. CUDA enables writing programs for high performance parallel applications. Various image processing algorithms that can be parallelized utilizes the GPU technology for acceleration [22]- [28]. Since pixel the operations of QIF are independent, the algorithm can be parallelized to run on GPU environment.
The focus of the present study is to develop an efficient algorithm based on CUDA for the GPU accelerated computing of QIFs. Sequential implementation of the QIFs usually require long execution durations when running times are investigated on an average desktop processor. Hence, an efficient CUDA based implementation may help utilization of QIFs practical. GPU acceleration is highly dependent on the programming approach and the utilization of GPU memory. Efficient approaches usually requires more complicated algorithm designs and programming approaches. Hence, three alternative implementations from simple to complex were discussed for comparison. As will be shown by experimental results the proposed method provides significant reductions in computation time when compared to sequential execution. Organization of the paper is as follows; in the second section background information for QIFs was given. In the third section, implementation of the alternative algorithms using CUDA kernels were explained in detail. In the fourth section, experimental execution times using sequential implementation and CUDA implementations were presented. Finally conclusions about the method and the results were given.

Background
QIFs, which are in the subclass of Volterra filters are the important alternatives of the linear filters [3], [29]. Eq-1 describes the output equation of the QIF. In this study, only the terms up to second degree were used and the constant term is excluded; In above equation, m and n show the pixel coordinates to be filtered and o(m,n) shows the filtered output pixel. The linear component of the output is represented by o1(m,n) and the quadratic part is represented by o2(m,n). Eq-2 shows the expressions for the outputs o1(m,n) and o2(m,n).
Where M means that a window size of (2M+1)(2M+1) is used for filtering. xm+i,n+j represents a pixel taken from the input image, w 1 i,j and w 2 i,j,k,l represents first order and second order filter weights respectively. The equations for linear and quadratic components can be expressed simpler. For this purpose, linear part of the expression can be redefined using one dimensional summation as given by Eq-3.
Where N is equal to 2M+1. W1 describes weights, Xm,n is the selected window of pixels as given by Eq-4.
For further simplification, above expression can be written as a dot product as below; Similar to above arrangements, the quadratic component can be simplified. Initially it is useful to redefine it in two dimensional form as below; Most of the multiplications of input pixels in Eq-6 is symmetric. This is shown by Eq-7 where the quadratic input terms forms a matrix of (N×N)×(N×N) size.
In order to eliminate symmetric terms in Eq-6, the initial value of the j is set to i as shown by Eq-8; The second order products of input given by Eq-8 do not contain symmetric multiplications. Therefore, the unique multiplications can be rewritten in vector form; Also the corresponding weights used in Eq-8 can be written in vector form as below, Therefore, the quadratic part of the Eq-1 can be expressed as a dot product as below; The number of elements in W2 and X 2 m,n depends on the size of the filter kernel. The number of terms in quadratic vector excluding the symmetric terms for N×N size of kernel can be calculated by Eq-12; The sum of linear and quadratic part gives the total equation to filter a pixel as shown by Eq-13.

Proposed implementation
In the present study, three alternative method for the CUDA implementation of the QIF were investigated. Before addressing the most efficient method described in this study, the two alternative approaches were also discussed for comparison. First method which is the most straightforward implementation shown by Algorithm 1.
Every CUDA thread executes the kernel given by Algorithm 1 and filters a pixel selected according to the block and thread number. As explained in the previous section, once GPU initialized the data that kernel uses transferred to global memory. Before filtering function is executed, the neighborhood of selected pixel from the global memory is copied to global memory to reduce repeated reads from the global memory. Then, kernel filters a pixel selected according to pixelId variable using quadraticFilter() function. In a simpler implementation, this operation can also be discarded image data can be used directly. But due to the computation of quadratic terms, it is obvious that repeated reads of image data from global memory will decrease the performance. Filter functions normalizes the input pixels to 0-1 interval. After linear and quadratic components are computed, it is normalized back to 0-255 interval and it is written to back to pixel location in global memory area where the output image is defined.
Algorithm 2 shows the kernel implementation using Method 2 where block shared memory is utilized to improve the throughput. In addition to input pixels, each threads uses weights from global memory during filtering. On the other hand copying weights to thread local memory is not useful, since each thread reads weights one time and therefore this doesn't change the number of global memory reads. In this case block shared memory were used to reduce the number of global memory reads. Prior to filtering operation, the weights are copied to block shared memory by each thread as shown by the line 11 of Algorithm 2. Therefore the number of global memory accesses is reduced to the number of blocks. Similar to Algorithm 1, this method is also uses thread local memory for storing the neighborhood of pixel to be filtered. Once all weights are copied to shared memory, all threads in the same block read the weights from block shared memory. However, Method 2 requires synchronization of threads in a block to ensure copy operation completed before filtering starts. Above methods store an input pixel and its neighborhood in the local thread memory to reduce the memory reads from global memory. However, each thread reads neighborhood of a pixel to be filtered repeatedly. It is desired that once a pixel is read from global memory, it doesn't required to be read repeatedly. For this purpose an approach that use block shared memory for storing pixels of a block size is used. Method 3 is based on partitioning the input image into sub-blocks and storing every block in the block shared memory. Therefore repetitive accesses to global memory is eliminated. Although the pixels at the edges of blocks are read two times from the global memory, the other pixels are read one time from the global memory. After each of the sub-images are carried to block shared memories, no access to global memory is required to implement filtering.
Method 1 requires reading a mask size of pixels and the filter weights from global memory for filtering a pixel. In Method 2, the weights are copied to block shared memory for each block and therefore reading the weights from the global memory is reduced to the number of blocks. Method 3 also reduces the necessity of reading the neighbor of each pixel for filtering by carrying the sub-part of image to shared memory. The number of sub-blocks is determined by the number of threads in a block as shown by Eq16. In this equation splitSize shows the width and height of a block, borderLines is the number of additional lines on all sides of a sub-block. In the present study, the number of threads in a block is set to 256 and therefore image is split into subimages of the size of 16×16. Since, sub-image in a block also contains the border lines, the sub-image size to be filtered in a 16×16 block is 14×14 for a 3×3 filtering mask. Threads in border lines were used for copying the image. Once subImageSize variable is determined in Eq. 16, blockCols and blockRows which are the number of columns and the number of rows respectively can be determined.
After the parameters defined by Eq. 16 are set, the kernel shown by Algorithm 3 is called to filter image. Before filtering starts, block shared arrays are allocated for weights and sub-image and then weights are copied to block shared memory as in Method 2. After block coordinates and thread coordinates and the corresponding filter coordinates are determined, a sub-image of block size is copied to block shared memory. Each pixel is copied by a thread and all threads are synchronized to ensure that all pixels in a subblock are copied before filtering starts. Both transferring filter weights and a block of input image to shared block memory reduce the global memory accesses significantly.

Experimental Results
Experimental results were obtained by measuring running durations of sequential implementations on CPU and parallel implementation on GPU using the described approaches in the previous section. Sequential execution durations were obtained using a computer that has Intel(R) Core(TM) i7-4710MQ CPU@2.50GHz processor. CUDA based execution durations were also obtained on the same computer which has NVIDIA GeForce GTX 960M video card. The algorithms were written using VC++ Visual Studio 2015™ environment. Color test images used in the performance measurements were read using OpenCV library [30]. Before processing, image data is normalized to 0-1 interval and stored in float data type for both CPU and GPU implementations. The weights of the quadratic filter were determined by optimization approach using training and test images [31]. An example filtering result for QIF using the example weights were given in Figure 1 which also contains original image, noisy image and other filtering results using Gaussian filter, Average filter and Median filter for comparison. Table 1 shows the experimental measurements for execution durations repeated 8 times for the realized methods using a test image 1024×768 and 3×3 filter mask size. GPU execution times for the first run is relatively slow when compared to the rest of the results. In GPU computing this is usually called as warm-up run and discarded in performance measurements [32]. Additional kernel launches can be used to remove warm-up overheads. Because the size of the image has significant effect on the computation durations, the performance results were obtained for various sizes of images.  Table 2 shows average computation times for 30 runs using all approaches and various sizes of images ranging from 640×480 to 3840×2160. According to results, both kernel and kernel+initialization durations are considerably shorter than the sequential execution durations. While Method 1 and Method 2 provide close computation durations, the best acceleration was obtained by using Method 3. Kernel times where the filtering take place are even smaller than the total time which includes initialization times. Method 2 produced always smaller computation times than compared Method 1. This is because, the weights are read from block shared memory in Method 2 while weights are read from global memory in Method 1. Because weights are used one time by each thread for filtering, this doesn't provide significant acceleration like Method 3 where a block of image transferred to block shared memory. All threads uses the image data stored in shared memory. This significantly reduces the number of accesses to global memory and because each thread reads only a pixel from global memory to store to shared memory. On the other hand, the pixels at the edges are shared with blocks and they are read two times by neighbor blocks.

Conclusion
Quadratic filters are computationally expensive to implement due to the computation of all possible second degree multiplications. In addition, these terms are multiplied with weights and linear component is also computed for determining an output pixel. According to experimental evaluations, sequential implementation produces long execution times resulting as the image size is increased. In the present study, a GPU based method for the acceleration of the quadratic filter was introduced. For this purpose three alternative implementations from simple to complex were discussed. The first method simply reads a frame of pixels from global memory to thread local memory and reads weights from global memory and apply quadratic filtering to produce filtered pixel. Method 2 reads filter weights global memory and keep them in shared memory for the threads within the block. Therefore threads in a block reads weights from shared memory instead of global memory. Because each of the weights are read one time, this doesn't provide much improvement over Method 1. Method 3 requires the image be separated into a number of blocks determined by a thread block size. Within each block, a number pixels that is equal to the size of a block is filtered by the threads of that block. The image to be filtered is transferred to shared memory of each block to obtain better performance results. Although Method 3 increases the complexity of code it provides better management of memory for reducing access times, and therefore produces better results than the simpler implementations. The results show that, Method 3 considerably reduces the running duration of QIF when compared to CPU and other implementations with GPU. In the future studies, performance can further be improved using alternative algorithmic and programming approaches.