Ruitao (Toby) Du, Xide Xia
CS205: Computing Foundations for Computational Science
Instructor: Thouis "Ray" Jones
Compute a batch of points at the same time with OpenCL to speed up the processing
Further accelerate the calculation by taking advantage of local buffers
Use an index trick to reuse part of the buffer from previous iterations
Smoothing is one of the most commonly used image processing methods. Smoothing an image or a data set means creating an approximating function that captures the important patterns in the data while leaving out the noise: the data points of a signal are modified so that points higher than their neighbors are reduced and points lower than their neighbors are increased, leading to a smoother signal. There are many image smoothing algorithms, such as Gaussian smoothing, Laplace smoothing, and bilateral smoothing. A bilateral filter is a nonlinear, edge-preserving, noise-reducing smoothing filter for images. However, for high-resolution images it can take a long time to run.
In this project, our goal is to develop an efficient implementation of bilateral filtering. In traditional image processing, only one pixel's value is updated at a time. In our project, we improve efficiency through parallel computing. For example, in the smoothing process a predefined filter A is applied to the input image: the center of A is multiplied with the current pixel and the adjacent elements of A are multiplied with the adjacent pixels of the image, respectively. With the help of parallel programming in Python and OpenCL, we implemented multithreaded versions with local buffers, and we also tried an index trick to further improve performance.
Bilateral filtering is one of the most fundamental operations of image processing. It smooths images while preserving edges by means of a nonlinear combination of nearby image values. It is a non-iterative, local, and simple method that combines gray levels or colors based on both their geometric closeness and their photometric similarity, and it prefers near values to distant values in both domain and range. The basic idea behind bilateral filtering is to update the value of each pixel based on its neighborhood: the output is a weighted sum of the pixels in a neighborhood, where the weights depend both on the spatial distance between the current pixel and its neighbor and on the difference between their intensities. Because of the way the weights are computed, bilateral filtering preserves edges while still averaging away noise.
Domain filtering: the filtered value at a point $\mathbf{x}$ is a weighted average of nearby pixel values, where the kernel $c(\boldsymbol{\xi}, \mathbf{x})$ measures the geometric closeness between the neighborhood center $\mathbf{x}$ and a nearby point $\boldsymbol{\xi}$ (following Tomasi and Manduchi [4]):

$$h(\mathbf{x}) = k_d^{-1}(\mathbf{x}) \iint f(\boldsymbol{\xi})\, c(\boldsymbol{\xi}, \mathbf{x})\, d\boldsymbol{\xi}$$

In order to preserve the DC component, we normalize by

$$k_d(\mathbf{x}) = \iint c(\boldsymbol{\xi}, \mathbf{x})\, d\boldsymbol{\xi}$$

Range filtering: in this case, the kernel $s(f(\boldsymbol{\xi}), f(\mathbf{x}))$ measures the intensity similarity between pixels:

$$h(\mathbf{x}) = k_r^{-1}(\mathbf{x}) \iint f(\boldsymbol{\xi})\, s(f(\boldsymbol{\xi}), f(\mathbf{x}))\, d\boldsymbol{\xi}$$

The normalization constant in this case is

$$k_r(\mathbf{x}) = \iint s(f(\boldsymbol{\xi}), f(\mathbf{x}))\, d\boldsymbol{\xi}$$

The bilateral filter combines domain and range filtering. The combined filter is therefore

$$h(\mathbf{x}) = k^{-1}(\mathbf{x}) \iint f(\boldsymbol{\xi})\, c(\boldsymbol{\xi}, \mathbf{x})\, s(f(\boldsymbol{\xi}), f(\mathbf{x}))\, d\boldsymbol{\xi}$$

with the normalization constant

$$k(\mathbf{x}) = \iint c(\boldsymbol{\xi}, \mathbf{x})\, s(f(\boldsymbol{\xi}), f(\mathbf{x}))\, d\boldsymbol{\xi}$$

Bilateral filtering thus replaces the pixel value at $\mathbf{x}$ with a weighted average of similar and nearby pixel values in $\mathbf{x}$'s neighborhood. Where pixel values in a small neighborhood are similar to each other, it acts essentially as a standard domain filter: it averages away the small, weakly correlated differences between pixel values caused by noise.

This project explores different parallel implementations of bilateral filtering in OpenCL and compares their performance with a serial version in Python. Below are the four methods we implemented. In all of them, we precompute the domain kernel first so that we do not have to recalculate it for every pixel.
Calculate the output pixel by pixel. For each pixel, we compute the intensity differences within a certain neighborhood. To speed up the process, we vectorize the per-pixel calculation with NumPy; a sketch is shown below.
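The following is a minimal sketch of this serial method, assuming a grayscale floating-point image img, a neighborhood radius halo, and Gaussian parameters sigma_s and sigma_i; the function and variable names are illustrative rather than the exact ones in our repository.

```python
import numpy as np

def bilateral_serial(img, halo, sigma_s, sigma_i):
    h, w = img.shape
    # Precompute the domain (spatial) kernel once for the whole image.
    y, x = np.mgrid[-halo:halo + 1, -halo:halo + 1]
    domain = np.exp(-(x**2 + y**2) / (2.0 * sigma_s**2))

    padded = np.pad(img, halo, mode='edge')
    out = np.empty_like(img)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 2 * halo + 1, j:j + 2 * halo + 1]
            # Range kernel: penalize intensity differences to the center pixel.
            rng = np.exp(-(patch - img[i, j])**2 / (2.0 * sigma_i**2))
            weight = domain * rng
            out[i, j] = np.sum(weight * patch) / np.sum(weight)
    return out
```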
Calculate the output pixels in blockwise parallel. We use OpenCL with different work group sizes to parallelize the code: 8×8, 12×12, 16×16, and 20×20. Inside the OpenCL kernel, neighborhood pixels are read directly from global memory; a sketch of such a kernel follows.
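Below is a hedged sketch of what such a kernel can look like. The kernel name bilateral_global and the exact argument list are our illustration, not necessarily the project's; it assumes the host pads the image by halo on each side, passes the flattened precomputed domain kernel, and builds the source with PyOpenCL (for example via cl.Program(ctx, KERNEL_GLOBAL).build()).

```python
KERNEL_GLOBAL = """
__kernel void bilateral_global(__global const float *img,     // padded input
                               __global float *out,           // output (unpadded)
                               __global const float *domain,  // (2*halo+1)^2 spatial weights
                               const int w, const int h,      // output size
                               const int halo, const float sigma_i)
{
    const int gx = get_global_id(0);
    const int gy = get_global_id(1);
    if (gx >= w || gy >= h) return;

    const int pw = w + 2 * halo;                 // padded width
    const float center = img[(gy + halo) * pw + (gx + halo)];

    float acc = 0.0f, norm = 0.0f;
    for (int dy = -halo; dy <= halo; dy++) {
        for (int dx = -halo; dx <= halo; dx++) {
            // Every neighbor access goes straight to global memory here.
            float v = img[(gy + halo + dy) * pw + (gx + halo + dx)];
            float d = v - center;
            float wgt = domain[(dy + halo) * (2 * halo + 1) + (dx + halo)]
                        * exp(-d * d / (2.0f * sigma_i * sigma_i));
            acc += wgt * v;
            norm += wgt;
        }
    }
    out[gy * w + gx] = acc / norm;
}
"""
```

Every neighbor lookup here touches global memory, which is exactly what the local-buffer variant described next avoids.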
Calculate the output pixels in blockwise parallel. We use OpenCL with different work group sizes to parallelize the code: 8×8, 12×12, 16×16, and 20×20. Inside the OpenCL kernel, we first load the whole neighborhood of a work group into a local buffer and then read neighbors from local memory, which saves time on global memory accesses. A sketch of such a kernel follows.
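A hedged sketch of the local-buffer variant is below: each work group cooperatively copies its tile plus the surrounding halo into __local memory once, synchronizes with a barrier, and then reads all neighbors from the tile. The sketch assumes the image width and height are multiples of the work-group size (so the tile load stays in bounds) and that the host allocates a __local buffer of (local_size + 2*halo)² floats; names are illustrative.

```python
KERNEL_LOCAL = """
__kernel void bilateral_local(__global const float *img, __global float *out,
                              __global const float *domain,
                              __local float *tile,           // (lsz+2*halo)^2 floats
                              const int w, const int h,
                              const int halo, const float sigma_i)
{
    const int gx = get_global_id(0), gy = get_global_id(1);
    const int lx = get_local_id(0),  ly = get_local_id(1);
    const int lw = get_local_size(0), lh = get_local_size(1);
    const int pw = w + 2 * halo;                 // padded width
    const int tw = lw + 2 * halo;                // tile width
    const int th = lh + 2 * halo;                // tile height

    // Cooperative load: the work group strides over the tile so that every
    // padded pixel the group will need ends up in local memory exactly once.
    // Assumes w and h are multiples of the work-group size.
    const int tile_x0 = get_group_id(0) * lw;    // tile origin in the padded image
    const int tile_y0 = get_group_id(1) * lh;
    for (int ty = ly; ty < th; ty += lh)
        for (int tx = lx; tx < tw; tx += lw)
            tile[ty * tw + tx] = img[(tile_y0 + ty) * pw + (tile_x0 + tx)];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gx >= w || gy >= h) return;
    const float center = tile[(ly + halo) * tw + (lx + halo)];

    float acc = 0.0f, norm = 0.0f;
    for (int dy = -halo; dy <= halo; dy++) {
        for (int dx = -halo; dx <= halo; dx++) {
            // Neighbor reads now come from fast local memory.
            float v = tile[(ly + halo + dy) * tw + (lx + halo + dx)];
            float d = v - center;
            float wgt = domain[(dy + halo) * (2 * halo + 1) + (dx + halo)]
                        * exp(-d * d / (2.0f * sigma_i * sigma_i));
            acc += wgt * v;
            norm += wgt;
        }
    }
    out[gy * w + gx] = acc / norm;
}
"""
```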
Calculate the output pixels in columnwise parallel. In the previous OpenCL methods, some pixels are copied into the local buffer multiple times. Instead, we reuse the buffer by introducing an index trick, and to increase the fraction of the buffer that is reused we make the work group a long, thin rectangle. Work group sizes are 16×8, 20×8, 24×8, 28×8, 20×4, and 24×4. A sketch of the indexing idea is shown below.
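The serial Python sketch below illustrates only the indexing idea, not the actual kernel: a work group of height step scans down one column of blocks while keeping a circular buffer of step + 2*halo rows, addressed modulo the buffer height, so each move down loads only step new rows instead of the whole window. The function name, the row-level bookkeeping, and the load counting are our illustration of the trick, not a verbatim copy of the OpenCL code.

```python
import numpy as np

def columnwise_reuse(padded, halo, step):
    """Count row loads when a block of height `step` scans down a padded
    column, keeping a circular buffer of step + 2*halo rows."""
    buf_rows = step + 2 * halo
    buf = np.empty((buf_rows, padded.shape[1]), dtype=padded.dtype)

    # Initial window: rows [0, buf_rows) of the padded image.
    for r in range(buf_rows):
        buf[r % buf_rows] = padded[r]
    loads = buf_rows

    top = 0
    while top + step + buf_rows <= padded.shape[0]:
        top += step
        # Only the `step` rows that just entered the window are loaded;
        # the remaining 2*halo rows are reused through the modular index.
        for r in range(top + 2 * halo, top + 2 * halo + step):
            buf[r % buf_rows] = padded[r]
        loads += step
    return loads   # total rows loaded, versus reloading the full window per block
```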
Spatial σ = halo / 2, Intensity σ = 50 (intensity levels)
We use half of the halo (the neighborhood radius) as the spatial sigma because we use a Gaussian kernel: a neighbor at the edge of the window is two sigma away, where the Gaussian weight has already dropped to roughly e^{-2} ≈ 0.14 of the center weight, so pixels outside the window would contribute little anyway.
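As a quick numerical check of this choice (a small illustrative snippet; halo = 10 is just an example value):

```python
import numpy as np

halo = 10               # neighborhood radius (example value)
sigma_s = halo / 2      # the spatial sigma used above

# Domain weight for a neighbor `halo` pixels away (two sigma):
# exp(-halo^2 / (2 * sigma_s^2)) = exp(-2) ≈ 0.135 of the center weight.
edge_weight = np.exp(-halo**2 / (2 * sigma_s**2))
print(edge_weight)      # ~0.135
```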
Apple OpenCL version: OpenCL 1.2
CPU: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
Maximum clock frequency: 2200 MHz
Maximum allocatable memory size: 4294 MB
Maximum work group size: 1024
GPU: Intel Iris Pro
Maximum clock frequency: 1200 MHz
Maximum allocatable memory size: 402 MB
Maximum work group size: 512
To show the performance of our bilateral filtering, we tested it on several different images. Below are the output images for different neighborhood sizes:
We can see that as the neighborhood size increases, the image becomes more blurred. This is because we average the pixel values over more points.
The serial version is much slower than the parallel ones. Even with a small neighborhood, the serial version takes a few seconds to run, while the parallel one takes only about 0.3 seconds. Therefore, our OpenCL versions are much faster than the serial version. However, we want to explore the parallel versions further.
On our machine, the best work group size is 16 by 16. There is something interesting in the graph: 8 by 8 is very fast when the neighborhood is small, but after a neighborhood size of 8 its run time rises dramatically. We think this is because a small work group divides the picture into more blocks, and as the neighborhood size N increases, the run time for each block grows roughly as N²; with more blocks, the total run time therefore increases even more.
This plot is similar to the one without a buffer. The best work group size is still 16 by 16.
For this version, the best work group size is 20 by 8.
From the results, we see that a larger neighborhood requires more runtime. They also show that the parallel versions are much faster than the serial version. Among the three OpenCL versions, using local buffers saves more runtime than reading from global memory. However, when we add the index trick, processing slows down: although the index trick reuses part of the buffer from previous iterations, it requires extra time to compute the indices and reduces the total number of threads.
Bilateral filtering is one of the most fundamental operations of image processing. From our output, we can see that it preserves edges while removing noise from our images. On our machine and with our sample images, the best way to speed up bilateral filtering is the OpenCL version with a local buffer and a work group size of 16 by 16. When the halo size is 10, it takes about 2.5 seconds to finish the filter. This method far outperforms the serial version.
In the future, we could explore more parallel implementations, such as CUDA, to see which is faster. We could also speed up other image filtering algorithms with similar parallelization methods.
Watch our screencast on YouTube.
Get our code on GitHub.
1. Paris, Sylvain, et al. "A gentle introduction to bilateral filtering and its applications." ACM SIGGRAPH 2007 courses. ACM, 2007.
2. Paris, Sylvain, et al. Bilateral filtering: Theory and applications. Now Publishers Inc, 2009.
3. Khronos Group. "OpenCL: The open standard for parallel programming of heterogeneous systems." http://www.khronos.org/opencl.
4. Tomasi, Carlo, and Roberto Manduchi. "Bilateral filtering for gray and color images." Proceedings of the Sixth International Conference on Computer Vision (ICCV). IEEE, 1998.
We thank Ray and all CS205 TFs for providing the wonderful course and all helpful instructions.
This work was done as part of the CS205 course at Harvard. You can find more information and other cool projects here: CS205
Toby Du: ruitaodu at g.harvard.edu
Xide Xia: xidexia at g.harvard.edu