Smoke detection from foggy environment based on color spaces

Detecting smoke in videos captured by outdoor surveillance cameras is one of the useful outcomes of Internet of Things (IoT) applications. The potential benefit increases when deep learning (DL) architectures are involved. However, an inherent difficulty is detecting smoke in the presence of natural events such as fog. The effect of color spaces on detection performance has not yet been fully evaluated for those architectures. Moreover, the energy and memory requirements of DL architectures may not meet the demands of IoT implementations. Therefore, in this work, a DL architecture with a suitable color space model, applicable to IoT implementations, is proposed to detect smoke in videos of foggy environments. Using a collection of videos that include smoke samples, a comparison with popular and state-of-the-art DL architectures showed that the proposed architecture performs best in terms of both accuracy and memory usage.


Introduction
Fire detection is one of the major parts of early warning systems for the safety of the environment and people. In its initial stage, a fire emits visible smoke, which may or may not be accompanied by a flame. Today, electronic smoke detectors are in common use in indoor environments, while cameras are becoming a major part of detection systems. In outdoor or wildlife settings, early detection systems make use of camera-based surveillance systems with automated and intelligent capabilities. Those systems are now being considered within the Internet of Things (IoT) framework, which requires energy-efficient approaches when they are interconnected with other, larger systems, for example in a smart city concept [1].
During the last two decades, the number of studies on computer vision for fire detection systems has increased [2,3]. Since smoke is visible before the flame, a majority of the works have also included smoke detection approaches, especially for the detection of wildfire [4][5][6][7]. Those detection systems depend on extracting reliable and effective features reflecting texture, shape, color, movement, energy, and frequency [8] that help to deal with real-world conditions such as fog, rain, or snow. In particular, color information has been shown to be useful and easily applicable [9][10][11][12][13][14][15].
Recent studies have focused on deep learning (DL) techniques, which do not necessarily require a handcrafted feature extraction step. It is therefore more convenient to build end-to-end systems that use the image or video data directly, without the need to extract features from them. In this manner, models based on convolutional neural networks (CNNs) have been considered [16][17][18]. Well-known baseline models such as VGG-16 [19], AlexNet [20], GoogLeNet [21], ResNet [22], DenseNet [23], and MobileNet [24] have been applied. DL methods for object recognition such as You Only Look Once (YOLO), which detect regions of interest in the images, have been used in an embedded system [25]. Apart from those general models, several smoke detection studies have made use of other specific CNN architectures. Among them, a Deep Normalization and Convolutional Neural Network (DNCNN) has been proposed for smoke detection, which handles feature extraction and smoke recognition at the same time [26]. The Faster R-CNN [27] model has been used to detect smoke in forest fires with data augmented by synthetic images [28]. A deep multiscale CNN (DMCNN) built by stacking basic blocks has been proposed [29] as a lightweight model for smoke recognition. Likewise, the VGG-16 and ResNet50 network architectures have been fused into a deep network to improve feature expression ability while increasing the depth of the whole network [30]. Specialized network models such as FireNet [31] are further examples of lightweight models suitable for mobile and embedded applications. Energy-efficient network models with similar aims have also been presented [32][33][34][35][36].
Other recent examples of CNN models for smoke detection include temporal evolution or combinations of networks such as the two-stage training of a Deep Convolutional Generative Adversarial Neural Network (DCGAN) [37], dilated CNN [38], deep saliency network [39], and deep dual-channel CNN based solutions [40,41].
On the other hand, IoT is becoming more and more associated with the digitalization of environments for ease of control and response to events. IoT technology helps to detect fire incidents in forests or in other areas by measuring real-world information such as temperature, gas levels, humidity, and wind direction and speed. Today, computer vision-based techniques are replacing conventional fire detection by overcoming the shortcomings of sensor-based methods [14]. Therefore, for the image/video case, some of the aforementioned studies [25][31][32][33][34] have presented their work with a view to minimizing resources so as to be applicable to IoT implementations.
However, most of those works do not consider the effect of color spaces in camera recordings. Based on this motivation, we propose an energy-efficient, CNN-based smoke detection architecture that operates on a suitable color space. The novelty of this work lies in combining color space models with DL architectures in order to determine which combination performs best in smoke detection. This is achieved by modifying the DL architectures and using artificially generated foggy images to distinguish smoke in a foggy environment. Another aim is to determine the structure that requires the fewest resources for IoT applications. A large set of videos is collected for this purpose, and evaluations are performed for validation.
In the next section, the color space models are reviewed and the video collections gathered for use with the DL architectures are summarized. Section 3 presents the proposed structure, including the data preparation and the modification of the DL architectures. Section 4 presents the evaluations and a performance comparison with state-of-the-art results. The final section summarizes the results and concludes the paper.

Smoke Detection
Smoke detection from image or video features varies with the properties of the image texture, the segmentation applied to obtain shape information, the color space representation, the changes between consecutive frames due to movement, and other fundamental signal-level features based on energy or frequency [8]. The color spaces used in this work are briefly described below. Then the deep learning architectures and the video sources used in this work are listed.

Color spaces
In computer graphics, based on tristimulus representation theory, color spaces are representations of color in three-dimensional linear spaces, most commonly as intensity channels of the red, green, and blue colors, known as RGB. However, there are different interpretations in which the selected components are transformed into different color models. Perceptual color models use hue and saturation, which refer to chromaticity, together with brightness information. The most common is HSV (hue, saturation, value), also referred to as HSB (hue, saturation, brightness) and closely related to HSI (hue, saturation, intensity). A convenient model for representing brightness in videos is YUV, which uses luminance and chrominance components. The Y component is called the luma, and the remaining components are referred to as the chrominance, specifically the blue-based chrominance and the red-based chrominance as in the YCbCr color space. A uniform color space is L*a*b*, defined by the International Commission on Illumination (CIE), which is obtained by a transformation of the reference points and the lightness value and describes a color on the red-green chrominance (a*) and on the yellow-blue chrominance (b*) [42,43].
A color in a color model is described by numbers indicating how much of each component, such as a primary color or brightness, is included. In digital imaging, these component values are often in the range 0 to 255, corresponding to an 8-bit resolution. For each image taken from the video sequences, the RGB component values of each pixel can be expressed in other color spaces by suitable transformations. In the following, the transformation formulas converting from RGB to YUV, HSV, and L*a*b* are presented, respectively.
From RGB to YUV:
\[
\begin{aligned}
Y &= 0.299R + 0.587G + 0.114B \\
U &= -0.1687R - 0.3313G + 0.5B + 128 \\
V &= 0.5R - 0.4187G - 0.0813B + 128
\end{aligned}
\]
While most cameras work with RGB intensity values, these transformations make it straightforward to convert the image pixel information irrespective of the camera.
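As an illustration, the per-pixel conversions can be sketched in Python. This is a minimal sketch and not the implementation used in this work: the YUV coefficients are the common full-range values with a 128 offset on the chrominance components, which is an assumption about the intended formula; the HSV conversion relies on the standard-library colorsys module; and the function names rgb_to_yuv and rgb_to_hsv_degrees are hypothetical.

```python
import colorsys


def rgb_to_yuv(r, g, b):
    """Convert one 8-bit RGB pixel to offset YUV (YCbCr-style) floats."""
    # Luma: weighted sum of the three channels (weights sum to 1).
    y = 0.299 * r + 0.587 * g + 0.114 * b
    # Chrominance: colour differences shifted by 128 to stay non-negative.
    u = -0.1687 * r - 0.3313 * g + 0.5 * b + 128
    v = 0.5 * r - 0.4187 * g - 0.0813 * b + 128
    return y, u, v


def rgb_to_hsv_degrees(r, g, b):
    """Convert one 8-bit RGB pixel to (hue in degrees, saturation, value)."""
    # colorsys expects and returns components in the range [0, 1].
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v


if __name__ == "__main__":
    pixel = (200, 200, 200)  # illustrative light gray, smoke-like pixel
    print(rgb_to_yuv(*pixel))
    print(rgb_to_hsv_degrees(*pixel))
```

For a gray pixel, the chrominance components stay at the 128 offset and the HSV saturation is zero, so the brightness is carried entirely by the Y and V (value) components, respectively.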

Deep learning architectures
Recent advances in both machine learning and computer hardware have contributed to efficient methods for training deep neural networks. Instead of fully connected hidden layers, a CNN typically has alternating convolution and pooling layers. Following the record-breaking success of AlexNet at the 2012 ImageNet Large-Scale Visual Recognition Challenge, several research groups have achieved lower error rates with a higher number of layers, as in GoogLeNet, ResNet, and DenseNet. Furthermore, advanced modules combine different deep learning architectures with the aim of increasing performance. In Inception, many mini-network modules are built, in which the outputs of multiple convolution filters of different sizes are concatenated. On the other hand, it is known that increasing the number of layers makes a network prone to the problem of vanishing gradients. The ResNet structure offers a solution to this problem. By combining Inception with ResNet, lower error rates have been obtained. Another solution is to make shortcuts between the input and output layers through transition layers, as in the DenseNet architecture. Further improvements have been proposed in Xception, where depthwise and pointwise convolutions are used instead of conventional convolutional layers. An important achievement is the MobileNet architecture, in which mobile models with an inverted residual structure are built with shortcut connections between thin bottleneck layers. This results in lower resource requirements while improving the performance.
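The saving offered by depthwise and pointwise convolutions can be made concrete by counting weights. The sketch below uses the standard weight counts for a k x k convolution with biases ignored; the function names and the example layer sizes are illustrative assumptions and are not taken from the architectures compared in this work.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights of a conventional convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    """Weights of a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = kernel_size * kernel_size * in_channels  # one filter per input channel
    pointwise = in_channels * out_channels               # 1x1 mixing across channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
    standard = standard_conv_params(3, 32, 64)
    separable = separable_conv_params(3, 32, 64)
    print(standard, separable, standard / separable)
```

For this hypothetical layer the separable form needs roughly one eighth of the weights, which indicates why such layers suit memory-constrained implementations.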

Video sources
The increasing use of cameras for wildfire detection has made many video sources available for forest fire/smoke detection. However, most studies have presented their results with a limited number of available videos. Today, with the increasing number of available videos and DL architectures, the problem can be investigated with a larger amount of image data and with recent modules. As deeper learning structures may achieve better accuracy when handling more data, a substantial number of videos is collected in this work. The video sources used in this study are listed in Table 1.

Proposed Structure for Smoke Detection
Many DL architectures have been used for fire/smoke detection from videos, as mentioned in the previous sections. However, real-world weather conditions may degrade their performance. In the case of fog, our proposal relies on the effectiveness of color spaces, which has not yet been reported for some of the DL architectures. On the other hand, the performance of DL structures depends heavily on the availability of a massive amount of data. As there is a lack of smoke images in foggy environments, our proposal includes adding artificial fog to the images. Therefore, the structure is defined not only to decide whether an image contains smoke, but is extended to distinguish four possible outputs: smoky, foggy smoky, normal, and foggy.
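The four-way decision can be sketched as a softmax over four scores followed by selecting the most probable class. This is an illustrative sketch only: the class order, the function names softmax and predict_class, and the example scores are assumptions and do not come from the trained models.

```python
import math

# Assumed class order; the order used in the trained models is not specified.
CLASSES = ("smoky", "foggy smoky", "normal", "foggy")


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict_class(scores):
    """Return the label with the highest softmax probability."""
    probabilities = softmax(scores)
    best = max(range(len(CLASSES)), key=lambda i: probabilities[i])
    return CLASSES[best]


if __name__ == "__main__":
    example_scores = [0.1, 2.0, -1.0, 0.3]  # hypothetical network outputs
    print(softmax(example_scores))
    print(predict_class(example_scores))
```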
The artificially generated foggy images are produced from every channel of the 8-bit RGB image. The image is first brightened by adding a brightness value of 100 to its pixel values. The highest brightness value of each RGB color channel is then scaled with the highest brightness value of the brightened image.
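One possible reading of this procedure is sketched below in plain Python. It is an assumption-laden sketch rather than the exact implementation: the scaling step is interpreted as multiplying each brightened channel by the ratio of the original channel maximum to the brightened channel maximum, and the function name add_synthetic_fog and the list-of-tuples image layout are hypothetical.

```python
def add_synthetic_fog(pixels, offset=100):
    """Brighten and rescale a list of 8-bit (R, G, B) tuples to mimic fog.

    Assumed interpretation: every channel is raised by `offset`, then scaled
    so that its maximum returns to the maximum of the original channel.
    """
    if not pixels:
        return []
    fogged_channels = []
    for channel in zip(*pixels):  # iterate over the R, G, and B channels
        brightened = [value + offset for value in channel]
        # The brightened maximum is at least `offset`, so it is never zero.
        scale = max(channel) / max(brightened)
        fogged_channels.append(
            [min(255, round(value * scale)) for value in brightened]
        )
    return list(zip(*fogged_channels))  # back to a list of (R, G, B) tuples


if __name__ == "__main__":
    image = [(0, 0, 0), (255, 255, 255), (100, 50, 200)]  # illustrative pixels
    print(add_synthetic_fog(image))
```

Under this reading, dark pixels are lifted while the brightest pixels keep their value, which compresses the contrast in the way a hazy scene would.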
An example of a real image and its modified version with synthetically generated fog is presented in Figure 1. Note that the increase in brightness may add further difficulty to the detection of smoke.
Figure 1. An example of a real image [46] (left) and its synthetically generated foggy version (right).
Irrespective of whether the images are real or include artificial fog, the RGB images are converted to the other color spaces using equations (1) to (4). Our experiments indicated that the HSV color space is effective in detecting smoke regions in the image. To illustrate this effectiveness, a comparison of an example image in RGB, YUV, L*a*b*, and HSV is presented in Figure 2. Note that the images presented for the other color spaces are displayed as their corresponding representations in the RGB color space. In brief, the proposed structure relies on the HSV color space.

Experimental Results
We performed experiments using the images collected from the video sources given in Table 1. A total of 188 video files were converted into image sequences comprising 72220 images in total. As those videos have varying durations and frame rates, a selection was performed following a similar work [32]. The whole data set was separated into three groups, with 20% for training, 30% for validation, and the remaining 50% for testing, in order to allow comparison with existing studies [32,33]. The number of images in each group is displayed in Figure 3.
To accommodate the four outputs, the output softmax layer of the DL architectures is replaced with the proposed scheme, as illustrated in Figure 4. The performance of the proposed structure is then compared with the selected and correspondingly modified architectures of VGG-16, VGG-19, InceptionV3, InceptionResNetV2, Xception, DenseNet169, DenseNet201, and MobileNetV2. In all of the architectures, the stochastic gradient descent (SGD) optimizer was used with a learning rate of 0.001. The batch size was 16 and the number of epochs was 30.
The results are given in Table 2. Notice that RGB has the poorest performance, with no highest score in any of the architectures, while HSV has the highest score in most of the DL architectures. The confusion matrix corresponding to the best performance, HSV with MobileNetV2, is shown in Table 3. The rate of misclassification between foggy and smoky images is similar to that of misclassifying a normal image as a smoky image.
A comparison of memory usage for the DL architectures used in this study is given in Table 4. MobileNetV2 has a considerably lower memory requirement. The performance of the proposed method is compared with state-of-the-art architectures using similar parameters, and a summary is presented in Table 5. Note that the other studies not listed here that utilize any of the color spaces classify only between smoke and non-smoke.
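The 20/30/50 partition of the image set can be sketched as follows. This is a minimal sketch under stated assumptions: images are identified by their index, the split is a seeded random shuffle over all images rather than a per-video or per-class selection, and the function name split_indices and the seed value are hypothetical.

```python
import random


def split_indices(n_items, train_pct=20, val_pct=30, seed=0):
    """Shuffle item indices and split them into train, validation, and test.

    Percentages are integers so that the group sizes are computed exactly;
    whatever is left after training and validation goes to the test group.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = n_items * train_pct // 100
    n_val = n_items * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_indices(72220)
    print(len(train), len(val), len(test))
```

With 72220 images, this yields 14444 training, 21666 validation, and 36110 test indices; the per-group counts actually used in this work are those reported in Figure 3.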

Conclusions
The detection of smoke in captured video images is an important step in preventing fire and its consequences. While a binary detection scheme gives satisfactory results, real-world situations require detection systems to work in harsh conditions. One of the major conditions affecting smoke detection is fog, which shares almost the same visible color information. Thus, our proposal is to benefit from color spaces in detecting smoke in foggy environments. This is accomplished with DL architectures using a large amount of image data collected and combined for this purpose. As the data on smoke in foggy environments is too sparse for a fair comparison, artificially generated images are used.
The results demonstrated the efficiency of the proposed structure, the modified MobileNetV2 architecture with the HSV color space. Compared with well-known DL architectures and similar works, it achieved the best accuracy score while requiring less memory. This is expected to be important especially in IoT applications, as DL architectures are becoming saturated in accuracy but might still be made more compact for future lightweight implementations.