A novel accuracy assessment model for video stabilization approaches based on background motion

Abstract: In this paper, we propose a new accuracy measurement model for video stabilization methods, based on background motion, that can accurately measure the performance of a video stabilization algorithm. The undesired residual motion present in a video can be measured quantitatively by the pixel-by-pixel background motion displacement between two consecutive background frames. First, foregrounds are removed from a stabilized video, and then we find the two-dimensional flow vectors for each pixel between two consecutive background frames. After that, we calculate the Euclidean distance of these two flow vectors for each pixel, which is regarded as the displacement of that pixel. The total Euclidean distance of each frame is then averaged to get a mean displacement per pixel, which is called the mean displacement error (MDE), and finally we calculate the average mean displacement error (AMDE). Our experimental results show the effectiveness of our proposed method.


Introduction
The demand for video stabilization (VS) is increasing over time [1]. Many video processing software programs and camcorders now include a video stabilization facility [2]. Notably, Apple uses gyroscopes on the iPhone and iPad, but the devices cannot discard an unexpectedly large translational motion [3]. Microsoft [4][5][6], Facebook [7], and Adobe [8] have been doing much research in this area, too. In addition, video stabilization has major applications in aerial video surveillance, geo-registration, autonomous vehicle navigation, model-based compression, structure from motion, motion analysis, and mosaicking [1]. The quality of a video depends on the accuracy of video stabilization, and the ultimate aim is to satisfy the human eye. The human eye is very sensitive to the motion frequency, amplitude, spatial image frequency, color, intensity, and context of a video [9][10][11][12]. Currently, a large number of video stabilization algorithms are validated with subjective measurements. Though subjective measurement has some uses in psycho-visual experiments, this evaluation strategy contributes little as a scientific measurement. Unfortunately, state-of-the-art video stabilization algorithms such as Subspace [13] and the L1 optimal camera path for rolling shutter removal [14] are evaluated based on user experience. If spatial consistency is not achieved for a video, the video frames will suffer from noticeable unnatural seams [15]. Researchers observe that the nearest frame shows the smallest alignment error [15], whereas Liu et al. [16] showed that consecutive background pixels have a similarity of over 90%. Safdarnejad et al. [17] mentioned that a predominant foreground and textureless regions are the main causes of video stabilization failure. Most importantly, a foreground can moderately affect the accuracy of video stabilization [13,15,17].
Liu et al. [13] stated that rolling shutter distortion is a noise that cannot be modeled accurately. Grundmann et al. [18] mentioned that, in video stabilization, compensating for the rolling shutter effect [3,4,19] caused by a complementary metal oxide semiconductor (CMOS) camera is a crucial task compared to a video captured by a charge-coupled device (CCD) camera.
Though a video stabilization algorithm tries to remove the unwanted motions present in a video entirely, some undesired motion always remains. The sequences of video frame transformations in video stabilization algorithms follow the background of a reference frame until the reference frame is changed. If a video is perfectly stable, no residual motion will remain in the video except for the foreground motion [20], i.e. the difference between consecutive background frames will be zero. Any deviation from zero background motion is due to inaccurate motion detection and estimation, transformation error, and accumulation error. Motivated by the warping process of video stabilization, in this paper we estimate the background motion, which is the key to judging the stability of a video. Our hypothesis is that, for a nondynamic background, the background pixels at the coordinate $(x_i, y_j, t)$ are equal to the background pixels at the coordinate $(x_i, y_j, t + dt)$ at times $t$ and $t + dt$, respectively. If they are not equal, we calculate the displacement of each background pixel between the frames $I_t(x_i, y_j, t)$ and $I_{t+1}(x_i, y_j, t + dt)$. To compute the background pixel displacement, the two-dimensional motion vectors in the horizontal and vertical directions are calculated between the background frames $I_t(x_i, y_j, t)$ and $I_{t+1}(x_i, y_j, t + dt)$, and then we calculate the Euclidean distance of the horizontal and vertical motion vectors of each pixel, which is the displacement of that pixel. Next, we compute the total sum of the $L_2$ norm distances of all pixels in a frame. Finally, this total is averaged to get the mean displacement error (MDE), which indicates the average motion of each background pixel present in the video.
The overall architecture of our proposed algorithm is shown in Figure 1. The main contribution of this paper is twofold: the proposed method can quantitatively measure undesired motions present in the stabilized video, and we appraise the performance of three state-of-the-art VS algorithms using our proposed MDE or average mean displacement error (AMDE) algorithm. In addition to this, the accuracy of our proposed model is evaluated based on our synthetically created videos.

Related works
The peak signal to noise ratio (PSNR) and the structural similarity (SSIM) index [21] are the two most important video quality assessment metrics, but they do not consider the jitter of the video [2,21]. The PSNR [20][21][22][23] mainly measures the quality of a compressed video pixel by pixel. Morimoto et al. [20] calculated the interframe transform fidelity (ITF) and global transform fidelity (GTF) for VS performance analysis.
Here, fidelity refers to the capability of a motion model to fit the actual motion, ITF is the PSNR between two consecutive frames, and GTF is the PSNR between the ground truth frame and the stabilized frame.
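The ITF described above can be illustrated with a short Python sketch. It assumes 8-bit gray-scale frames held as NumPy arrays and averages the PSNR over all consecutive frame pairs; the averaging, the `peak` parameter, and the function names are assumptions made for illustration, not details taken from [20].

```python
import numpy as np

def psnr(frame_a, frame_b, peak=255.0):
    """PSNR in dB between two equally sized frames."""
    a = np.asarray(frame_a, dtype=np.float64)
    b = np.asarray(frame_b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0.0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(peak ** 2 / mse))

def itf(frames, peak=255.0):
    """Assumed ITF: mean PSNR over consecutive frame pairs."""
    values = [psnr(frames[k], frames[k + 1], peak)
              for k in range(len(frames) - 1)]
    return float(np.mean(values))
```

A higher ITF means that consecutive frames are more alike, which is why it is often read as a smoothness score, although it reacts to any intensity change and not only to camera jitter.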
The mean opinion score (MOS) is calculated as a part of a subjective measurement [10,11,22,23] to check the quality of a video. Niskanen et al. [24] considered divergence, jitter, and blur to measure VS accuracy. Here, jitter refers to a high-frequency motion in a video, whereas the divergence estimates latency between consecutive frames. Later, researchers considered the point spread function (PSF) to measure the blur introduced by the video stabilization method. In the end, their proposed method tests the accuracy of two anonymous VS methods, A and B. The mean square error (MSE) [2] between a reference path and the path after stabilization of a video is used to assess the stability of the VS method. Qu et al. [25] tested their proposed method on synthetically shaking video to measure the accuracy of the three stabilization algorithms of Deshaker, Grundmann, and Qu. Zhang et al. [1] assessed the stability of intensity-based video stabilization and feature-based video stabilization. From the ground truth motion and the estimated motion, the researchers computed the mean square error (MSE) between consecutive frames and the root mean square (RMS) error for the whole sequence. Zhai et al. [26] used a threshold to separate a high-frequency component (jitter) from a low-frequency component (divergence) and decomposed the jitter and divergence using a high-pass filter and a low-pass filter. Furthermore, synthetic jitters were added to a stable video to get an unstable video, and finally the researchers assessed the stability of the video stabilization algorithm based on the MSE of the reference path and the processed path. A major limitation of these methods is that the synthetically created motion and the real undesired motion will not be the same. Zhang et al. [27] claimed that there are no methods that can measure the stability of a video. The researchers claimed that an unstable video will have more average motion than a stable video.
They computed the ratio of the rotation and translation components of a stabilized video to the rotation and translation components of a real video, and called this procedure a heuristic stability approach. Liu et al. [6] suggested an empirical process that assumes that the more energy there is in the low-frequency part of a video, the more stable the video will be. We see that most existing metrics [1,2,21,25,26] for VS approaches are either based on synthetically shaking videos or are heuristic approaches [2,6,24,27].

Design methodology
In this section, we discuss our accuracy assessment model. The model has three basic steps: preprocessing and foreground mask generation, foreground removal and background, and MDE and AMDE model construction. Figure 2 depicts the details of our accuracy measurement model for the VS method.
1) Preprocessing and foreground mask generation: First, an unstable or stabilized video is input into our system and converted into a gray-scale video $I_t(x, y)$. We borrow the idea of foreground mask $F(x, y)$ generation from the SuBSENSE background subtraction algorithm. The foreground and background classification is shown in Eq. (2), where 1 and 0 represent the foreground and background, respectively:

$$F_t(x, y) = \begin{cases} 1 & \text{if } \#\{dist(I_t(x, y), BG_n(x, y)) < R, \ \forall n\} < \#_{min} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Here $F_t(x, y)$ is the foreground mask, $dist(I_t(x, y), BG_n(x, y))$ is the intensity and local binary similarity pattern (LBSP) distance between the current observation $I_t(x, y)$ and $BG_n(x, y)$, $R$ is the current updated decision threshold, and $\#_{min}$ denotes the minimum matching requirement.
2) Foreground removal and background: We assume that $I(x, y)$, $B(x, y)$, and $F(x, y)$ are a gray-scale video frame, a background frame, and a foreground frame, respectively, at the coordinate $(x, y)$. At time $t$, we determine the foreground mask $F_t(x, y)$ as in Eq. (2), and the background model $B_t(x, y)$ for the gray-scale video $I_t(x, y)$ is given by Eq. (3). At time $t + 1$, we get the background $B_{t+1}(x, y)$ of Eq. (4). Therefore, according to Eqs. (3) and (4), we get the background frame sequence $B_1(x, y), B_2(x, y), \ldots, B_n(x, y)$ for the $K$ gray-scale video frames $I_k(x, y)$, where $\forall k \geq 2$ and $\forall n \geq 1$.
Then we calculate the vertical displacement vectors $u$ and the horizontal displacement vectors $v$ between the backgrounds $B_1(x, y)$ and $B_2(x, y)$, $B_2(x, y)$ and $B_3(x, y)$, ..., $B_{n-1}(x, y)$ and $B_n(x, y)$ for each pixel at the coordinate $(x, y)$, where $\forall n \geq 2$.
3) MDE and AMDE model construction: Our proposed MDE or AMDE method models the undesired motion of the background pixels. The optical flow algorithm assumes that pixel intensities do not change much between consecutive frames and that neighboring pixels have similar motion characteristics, so the pixel intensities satisfy

$$I(x, y, t) = I(x + dx, y + dy, t + dt), \qquad (5)$$

where $dx$ and $dy$ are the vertical and the horizontal distances moved by the pixel during the time $dt$ (https://docs.opencv.org/3.3.1/d7/d8b/tutorial_py_lucas_kanade.html). Utilizing the Taylor series, discarding common elements, and dividing by $dt$ in Eq. (5), we get the following equation:

$$I_x u + I_y v + I_t = 0. \qquad (6)$$

Here,

$$I_x = \frac{\partial I}{\partial x}, \quad I_y = \frac{\partial I}{\partial y}, \quad I_t = \frac{\partial I}{\partial t}, \quad u = \frac{dx}{dt}, \quad v = \frac{dy}{dt}.$$

Therefore, we get the optical flow vectors $(u, v)$ for each pixel. For the $L_1$ distance, the MDE is defined as the mean of $|u| + |v|$ over the pixels of a frame. The Manhattan distance ($L_1$ norm) measures the shortest path when moving only horizontally and vertically, whereas the Euclidean ($L_2$ norm) distance is the shortest distance in the plane. In our case, we use the Euclidean ($L_2$ norm) distance. Therefore, the MDE between the background frames $B_1(x, y)$ and $B_2(x, y)$, Eq. (11), is the mean of the per-pixel displacement $\sqrt{u^2 + v^2}$, and the MDE between the background frames $B_2(x, y)$ and $B_3(x, y)$, Eq. (12), is defined in the same way. We continue the MDE calculation as in Eqs. (11) and (12), and finally we determine the MDE between the background frames $B_{N-1}(x, y)$ and $B_N(x, y)$. The average of the MDEs is then calculated for the $N$ video frames to give the AMDE, Eq. (14).

Algorithm 1. Video assessment model without the foreground.
Input: video taken after stabilization
Output: MDE, AMDE
1 read the previous frame from a video;
2 convert the previous frame into gray scale;
3 remove foreground pixels from the previous frame using the SuBSENSE algorithm;

Finally, our video assessment algorithm is organized as follows: Algorithm 1 gives the video assessment model without the foreground. For video assessment with the foreground, lines 3 and 11 of Algorithm 1 are skipped. The displacement is measured in pixels. The MDE in Eq. (11) is the mean displacement of each pixel per frame. If the background pixels are displaced by $d$ pixels from the background frame in Eq. (3) to the background frame in Eq. (4), we can claim that the foreground in Eq. (2) is also displaced by the same distance of $d$ pixels from the video frame $I_t(x, y)$ in Eq. (3) to the video frame $I_{t+1}(x, y)$ in Eq. (4), since all the pixels of a video frame move at the same rate when a camera moves unintentionally. The displacement of each pixel indicates the amount of undesired motion of that pixel. In practice, with a CMOS camera, background and foreground pixels will not all move by the same amount, because the camera has rolling shutter effects. That is why we find the individual background pixel movements and calculate the average movement of the pixels for each frame, which we call the MDE. Therefore, we can conclude that the foreground pixel movement also follows the background MDE or AMDE, as in Eq. (14), on average. In this way, our proposed MDE and AMDE models measure the performance of different video stabilization algorithms, and our experiments support the effectiveness of the proposed method. The displacement is due to estimation, transformation, and accumulation errors that come from different steps of the video stabilization process. Our proposed MDE in Eqs. (11) … $\theta$ about the coordinate origin, the transformed matrix will be as in Eq. (17). For the rotation-translation (RT) composition, the matrix representation is different, as in Eq. (18). To create a more realistic shaking video, each pixel is transformed by scaling, rotation, and translation. To generate an SRT video, the new pixel position $(x', y')$ is computed as $x' = sx\cos\theta - sy\sin\theta + h$, $y' = sx\sin\theta + sy\cos\theta + k$, and its matrix is represented as in Eq. (19). We create all these types of videos to show that our method can detect all of these kinds of motion.
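The three steps of the design methodology can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the foreground masks are taken as given (for example from a background subtractor), foreground pixels are simply set to zero, the dense optical flow estimator is passed in as a hypothetical caller-supplied function `flow_fn(prev, curr)` returning the per-pixel arrays `(u, v)`, and the AMDE is taken as the mean of the MDEs over the consecutive frame pairs.

```python
import numpy as np

def remove_foreground(gray, fg_mask):
    """Assumed background extraction: zero out pixels where the mask is 1."""
    return np.where(np.asarray(fg_mask, dtype=bool), 0, gray)

def mde(u, v):
    """Mean displacement error: mean Euclidean norm of the flow (u, v)."""
    return float(np.mean(np.hypot(u, v)))

def assess(frames, fg_masks, flow_fn, with_foreground=False):
    """Return (list of per-pair MDEs, AMDE) for a list of gray frames.

    flow_fn is a hypothetical dense-flow callable: (prev, curr) -> (u, v).
    """
    def background(frame, mask):
        return frame if with_foreground else remove_foreground(frame, mask)

    mdes = []
    prev = background(frames[0], fg_masks[0])
    for frame, mask in zip(frames[1:], fg_masks[1:]):
        curr = background(frame, mask)
        u, v = flow_fn(prev, curr)   # per-pixel flow between backgrounds
        mdes.append(mde(u, v))       # displacement in pixels
        prev = curr
    amde = float(np.mean(mdes)) if mdes else 0.0
    return mdes, amde
```

Passing `with_foreground=True` corresponds to skipping the foreground removal step. Any dense flow routine that returns one `(u, v)` pair per pixel can be plugged in as `flow_fn`.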

Experimental results
This section explains the experimental results of our proposed method, evaluates the method on synthetically created unstable videos and real videos, and appraises the performance of three state-of-the-art VS approaches using our proposed model. Finally, we compare our method with the PSNR and SSIM.

Dataset
We use the changedetection dataset of 2014 1 to verify that the background motion between consecutive frames will be zero. We also create eight different types of videos with different frame dimensions using the images shown in Figures 3a-3h, based on translation, rotation, the composition of translation and rotation (TR), the composition of rotation and translation (RT), and the composition of scaling, rotation, and translation (SRT), i.e. the transformations of Eqs. (15) to (19). Figures 3a-3d are images collected from the Internet, whereas the images of Figures 3e-3h were captured by a camera. All the images shown in Figure 3 are used only for creating the synthetically shaking videos.
The synthetically created shaking videos are only used to ensure our proposed method's accuracy.
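As an illustration of how such synthetic motion can be generated and what displacement it implies, the following Python sketch applies the SRT mapping $x' = sx\cos\theta - sy\sin\theta + h$, $y' = sx\sin\theta + sy\cos\theta + k$ to every pixel coordinate and averages the resulting Euclidean displacements. Reading $s$ as a uniform scale factor, $\theta$ as the rotation angle in radians, and $(h, k)$ as the translation is an assumption, as is the idea that a ground truth value can be obtained this way; the function names are hypothetical.

```python
import math

def srt_transform(x, y, s=1.0, theta=0.0, h=0.0, k=0.0):
    """Map (x, y) to (x', y') by scaling, rotation about the origin, translation."""
    xp = s * x * math.cos(theta) - s * y * math.sin(theta) + h
    yp = s * x * math.sin(theta) + s * y * math.cos(theta) + k
    return xp, yp

def mean_displacement(width, height, **params):
    """Mean Euclidean distance between each pixel and its transformed position."""
    total = 0.0
    for y in range(height):
        for x in range(width):
            xp, yp = srt_transform(x, y, **params)
            total += math.hypot(xp - x, yp - y)
    return total / (width * height)
```

For a pure translation with $(h, k) = (3, 4)$, every pixel moves by exactly 5 pixels, so `mean_displacement(w, h_, h=3, k=4)` is 5 for any frame size. For rotation or scaling, the displacement grows with the distance from the origin, so the mean depends on the frame dimensions.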
Afterward, we select five categories of videos (regular, quick rotation, crowd, parallax, and running) with foreground objects from the video database. 2 All the categories of videos are very complex except the regular category. The quick rotation category has very large motion compared with the other categories. We select three digital video stabilization software programs: VirtualDub-Deshaker, 3 Google YouTube Stabilizer, 4 and Adobe Premiere Pro Warp Stabilizer. 5 Adobe Premiere Pro generally applies the subspace video stabilization [13] approach and the YouTube Stabilizer uses the L1 optimal camera path algorithm [14]. The three digital VS (DVS) approaches are compared by our method on the same dataset. 2 Finally, our proposed approach is compared with the PSNR and the SSIM using the MOS of the videos stabilized by the DVS algorithms.

Background motion and our method's robustness
To evaluate our algorithm, we carried out extensive experiments. Our hypothesis is that the nondynamic background motion of a video will be zero if the video is perfectly stable. In Table 1, the background motions (AMDE) of all the videos are close to zero. The motions are not exactly zero because each BG 6 of the videos is not completely static. We find MDEs of 0.00019, 0.00051, 0.00034, and 0.00016 between two identical video frames for four different cases. Therefore, the above experiments indicate that nondynamic background motions between consecutive frames will be approximately zero.

Table 1 (excerpt; category / video, AMDE): Baseline / office, 0.0782; Shadow / backdoor, 0.0816.

Based on this motionless background criterion, we can measure the stability of a video. We also analyze the unstable and stabilized videos of the four different categories provided in the above mentioned dataset. The analysis is tabulated in Table 2 and depicted in Figure 4. All the videos have undesired background motions. The background motion of the simple case video is less than in the other cases. For the crowd and the running videos, the unwanted background motions are larger than for the others. The observation is that it is easier to discard a simple motion than a complex motion. In Figure 5, we depict the frame-by-frame MDEs of an unstable and a stabilized video named 8.avi of the regular category provided online. 2 Clearly, the MDEs of the unstable video are higher than the MDEs of the stabilized video. Therefore, we can check the stability of a video based on the background motion.

Figure 5. Unstable BG and stabilized BG for frames.
Afterward, to test our method's accuracy, we compare the ground truth values with the measured values for the synthetically shaking videos. The ground truth values and the measured values are tabulated for the cases of translation, rotation, TR, RT, and SRT of Eqs. (15) to (19). The average errors in the case of a single translation and a single rotation are 0.023 and 0.084 (Table 3 and Table 4), respectively, whereas the average errors of both composite transformations TR and RT are 0.12 (Table 5 and Table 6).
Though the composite translation and rotation (TR) and the composite rotation and translation (RT) are not commutative, our results (Table 5 and Table 6) show no noticeable differences. In the case of the SRT video, the average error is 0.13 (Table 7). Moreover, the error increases for the composite transformations compared to the single translation and the single rotation. In the above mentioned five cases, the differences between the ground truth AMDEs and our measured AMDEs are very small and negligible. From Tables 3 to 7, we conclude that translational displacement is easier to detect than scaling and rotational displacement. Therefore, our proposed method is robust and can detect the background motion, which quantifies the accuracy of the VS methods.

… Figure 6 due to the large motion. The red line shows the filled image area and the blue line indicates the unfilled area. Our proposed method only considers the filled image area when assessing the stability of a video. The proposed MDE calculates the total motion pixel by pixel between two consecutive frames, then divides by the frame size. In the case of Figure 6, the displacements of the black portion (the unfilled area) will be zero. Therefore, we only estimate the displacement of the filled image area and divide the total displacement by the size of the filled image area. Another important issue is that the frame rate should be the same when comparing DVS algorithms.
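The filled-area normalization described above can be sketched as follows. The sketch assumes that the per-pixel flow components `u` and `v` are NumPy arrays and that a boolean mask of the filled image area is available; how that mask is obtained (for example by thresholding the black border) is not specified in the text and is left to the caller. The function name is hypothetical.

```python
import numpy as np

def mde_filled(u, v, filled_mask):
    """MDE restricted to the filled image area.

    The total displacement inside the mask is divided by the number of
    filled pixels instead of by the full frame size.
    """
    filled = np.asarray(filled_mask, dtype=bool)
    count = int(filled.sum())
    if count == 0:
        return 0.0  # no filled pixels: nothing to measure
    displacement = np.hypot(u, v)
    return float(displacement[filled].sum() / count)
```

Dividing by the full frame size would dilute the score whenever a stabilizer leaves a large unfilled border, because those pixels contribute zero displacement. Restricting both the sum and the divisor to the filled area avoids that bias.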

Comparison of three DVS methods
The performance of the three DVS methods (the VirtualDub-Deshaker video stabilizer, the YouTube stabilizer, and the Adobe warp stabilizer) is measured in terms of the background motion using our proposed model. The dataset is taken from a website. 2 The experiment is carried out on five different categories of videos without the foreground. The summary of the performance is shown in Tables 8 to 12. All the methods fail to discard the undesired motion entirely. The three DVS methods reduce more undesired motion for the regular case videos than for the other cases. The average background motion is less than 0.65 pixel AMDE for the regular case videos, whereas the other four cases of videos have more than 1.30 pixel AMDE background motion. On average, the YouTube stabilizer shows the best performance for all cases, while the VirtualDub-Deshaker stabilizer gains better performance and the Adobe warp stabilizer scores well in all cases.

Our method compared with the other methods
We compare our proposed method with the PSNR (https://docs.opencv.org/2.4/doc/tutorials/highgui/video-input-psnr-ssim/video-input-psnr-ssim.html) and SSIM [21] based on the MOS of the five different categories of videos stabilized by the three DVS algorithms. To the best of our knowledge, there is no dataset with the MOS for video stabilization. Liu only provides some unstable and stabilized videos of different categories at the url, 2 but the MOS is not included there. Therefore, for the MOS, we use the Double Stimulus Continuous Quality Scale (DSCQS) 8 of the ITU-R standard to assess the perceptual quality in terms of the stability of a video. According to the DSCQS, when taking a test, an input video and a processed video are displayed in consecutive windows. Then the observer directly assesses them and gives a score according to the given scale. A rating scale of 1-5 is used, where we designate 1 as excellent (perfect stability quality) and 5 as bad (very difficult to understand the stability quality). The other scores of 2, 3, and 4 indicate good (very satisfactory), fair (requires more stability), and poor (hard to understand the stability), respectively. We recruited 22 observers, of whom two were female and the rest were male. All the observers were students from five different labs who came from two different countries. Some of the observers had experience in video and image processing, computer vision, and video stabilization, and the rest were from artificial intelligence, cloud computing, and big data analysis labs. For a single unstable input video, the three videos stabilized by the three DVS algorithms were displayed consecutively. The participants were asked to watch the videos carefully and rate the overall stability according to the above mentioned rating scale. Finally, the scores collected from the 22 participants were averaged to get the final MOS value. The Spearman rank correlation coefficient (S-corr.) was used to estimate the correlation of our proposed method, the PSNR, and the SSIM with the MOS of the stabilized videos separately. The S-corr. shows the correlation between two variables. When the two variables increase together, the S-corr. is positive. Alternatively, if one variable increases while the other decreases, the S-corr. is negative. An S-corr. of zero means that there is no correlation between the two variables. When the two variables are perfectly monotonically correlated, the S-corr. becomes 1 (or -1 for a perfectly decreasing relation).
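The Spearman rank correlation can be computed with a few lines of Python, shown here as a self-contained sketch that ranks both variables (using average ranks for ties) and then takes the Pearson correlation of the ranks. The numbers in the usage note are made-up illustrations, not values from our experiments.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)
```

For example, `spearman([1.2, 2.5, 3.9], [0.4, 0.9, 1.6])` gives 1.0 because both lists have the same ordering, while swapping the first two entries of the second list gives 0.5. With only three items per video and no ties, the coefficient can only take the values -1, -0.5, 0.5, and 1.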
The comparisons of our proposed method with the PSNR and SSIM are tabulated in Tables 8 to 12 in terms of the correlation with the MOS. The correlation is computed for each video of each category separately. In addition, the averages of the MOS, our MDE, the PSNR, and the SSIM are computed, and the S-corr. values for these averages are also estimated. Among the three methods, our method gains the best correlation results, the PSNR is in second position, and the SSIM shows the lowest correlation across the five categories, where each category contains five different videos. Our proposed method has the perfect S-corr. value of 1 in all cases. The PSNR achieves the S-corr. value of 1 only for two videos each of the regular case and the quick rotation case, while the SSIM gives one S-corr. value of 1 for each case, and both the PSNR and SSIM score the S-corr. value of 0.5 for the rest of the videos. The three methods obtain an average S-corr. value of 1 only for the quick rotation case. The PSNR is usually used to check the image reconstruction quality and the SSIM estimates the image structural degradation. These two methods can only measure the distortion of an image. When an unstable video is stabilized, most of the pixels change their coordinate positions. Therefore, the PSNR and SSIM fail to estimate the stability of a video, whereas our proposed method can account for the stability criterion. Our experimental results and MOS survey data can be found online. 9

Conclusion and future work
We have proposed a novel background motion-based performance metric that can measure the pixel-wise motion errors of a stabilized video. Our hypothesis is that background pixels do not move much between consecutive frames after video stabilization if a video is nearly perfectly stabilized. Similarly, the unwanted motions of foreground pixels always follow the background pixels' movement. To get a more accurate result, we remove the foreground pixels from a stabilized video. Afterwards, the 2-dimensional motions are detected, and then the $L_2$ norm distance displacement is computed. As a result, our proposed algorithm detects temporal inconsistency effectively and distinguishes clearly between a stable and an unstable video (Figure 5). Moreover, the small differences between the experimental results and the ground truths (Tables 3 to 7) show our proposed method's robustness. Our proposed model sufficiently differentiates the three DVS methods (Tables 8 to 12). The distinct results for the three algorithms in the case of real videos and our method's validation with the synthetically shaking videos show the effectiveness of our algorithm. Most importantly, our proposed MDE method shows the best correlation scores with the MOS compared to the PSNR and the SSIM (Tables 8 to 12).
Our proposed algorithm neither depends on synthetically shaking video nor relies on heuristic or empirical assumptions. In the future, we plan to consider the blurring effect introduced by video stabilization and the unfilled spatial areas, in addition to the stability criterion, to measure the accuracy of DVS methods more discriminatively.