FEATURE EXTRACTION FROM SATELLITE IMAGES USING SEGNET AND FULLY CONVOLUTIONAL NETWORKS (FCN)

Object detection and classification are among the most popular topics in photogrammetry and remote sensing studies. With technological developments, a large number of high-resolution satellite images have become available, making it possible to distinguish many different objects. Despite these developments, the need for human intervention in object detection and classification remains a major problem. Machine learning has so far been the preferred way to reduce this need. Although it has been successful, human intervention is still required. Deep learning offers a way to address this problem because, unlike traditional machine learning methods, it carries out the learning process on raw data. Although deep learning has a long history, its popularity has increased in recent years mainly because sufficient data for training and hardware capable of processing that data have become available. In this study, the performance of two convolutional neural network architectures used for object segmentation and classification on images, SegNet and Fully Convolutional Networks (FCN), was compared. The two models were trained on the same training dataset and evaluated on the same test dataset. The results show that, for building segmentation, there is no large difference in accuracy between the two architectures, but FCN is more accurate than SegNet by about 1%. However, this result may vary with the dataset used for training.


INTRODUCTION
Building detection from remote sensing and photogrammetric images has been one of the most challenging tasks, attracting important development and research efforts in recent years (Vakalopoulou et al., 2015). In the remote sensing field, the detection of buildings, along with applications such as urban planning, land cover/use analysis and automatic generation or updating of maps, is a long-standing problem (Wu et al., 2018a).
Buildings, which are the most significant places for human life, are key elements in the mapping of urban areas (Chen et al., 2019). Due to the rapid changes in urban areas, it is important to create and update the location information of buildings (Wu et al., 2018b). Remote sensing has been an effective technology for accurate detection and mapping of buildings due to its capability for high-resolution imaging over large areas and its fast, high-accuracy data acquisition (Chen et al., 2019; Comert et al., 2019). Unfortunately, automatic building detection on aerial images is usually limited by inadequate detection and segmentation accuracy (Chen et al., 2019). Most tasks still require a great deal of manual intervention by experts.
In recent years, as a consequence of the development of imaging sensors and corresponding platforms, a rapid increase in the availability and accessibility of very high-resolution (VHR) remote sensing images has made this problem more and more urgent (Ma et al., 2017). In the literature, satellite images have been used widely for the classification of urban areas (Sevgen, 2019). Building extraction from satellite and aerial images is not an easy task because of complex backgrounds, different lighting conditions and external factors that reduce the visibility or separability of buildings (Akbulut et al., 2018).
Recent progress in the field of computer vision (CV) indicates that, with the help of sufficient computing power and large training datasets (Cordts et al., 2016; Deng et al., 2009; Everingham et al., 2010; Lin et al., 2014), deep learning methods such as Convolutional Neural Networks (CNNs) (LeCun et al., 1989) can considerably improve the performance of object detection and segmentation tasks from high-resolution imagery (He et al., 2016; Krizhevsky et al., 2012). Neural networks can deal with complex problems to reach accurate solutions (Tasdemir & Ozkan, 2019). This situation strongly indicates that deep learning will play a critical role in promoting the accuracy of building segmentation toward practical applications of automatic mapping of features (Chen et al., 2019).
Since AlexNet overwhelmingly won the ImageNet Large-Scale Visual Recognition Challenge 2012 (LSVRC-2012) (URL-1), CNN-based algorithms have become the go-to standard in many computer vision tasks, such as image classification, object detection, and image segmentation (Wu et al., 2018a). In the beginning, researchers mainly applied patch-based CNN methods to detect, classify or segment buildings in aerial or satellite images and significantly improved the performance (Guo et al., 2016). However, because of the extreme memory costs and low computational efficiency of these patch-based methods, Fully Convolutional Networks (FCNs) have eventually attracted more attention in this area (Wu et al., 2018a).
In this study, a comparison was made between the SegNet and Fully Convolutional Networks (FCN) architectures. The Inria Aerial Image Labeling Dataset, which consists of 180 training images (with corresponding labels) and 180 test images, was used. Two models based on these architectures were trained using the prepared dataset and their performances were evaluated. Model creation and object segmentation were performed in a Python environment on Google Colab.

Dataset
The dataset selected for this study is the "Inria Aerial Image Labeling Dataset" (Maggiori et al., 2017). This dataset features:
- Coverage of 810 km² (405 km² for the training set and 405 km² for the testing set),
- Aerial (in color and orthorectified) imagery with a spatial resolution of 30 cm,
- Label images for two semantic classes: building and not building (Maggiori et al., 2017).
The images from the dataset cover dissimilar urban settlements, ranging from densely populated areas (e.g., Vienna) to less dense rural areas (e.g., Austrian Tyrol) (Fig 1) (Maggiori et al., 2017). The purpose of this is to improve the generalization power of the models (Maggiori et al., 2017). For example, while Chicago imagery may be used for training, the model should label images over other regions with varying conditions, urban landscape and time of the year (Maggiori et al., 2017).
Figure.1 Chicago-5 sample image and corresponding label image (Maggiori et al., 2017)
In this study, only images from the training set were used, because the test set has no corresponding label images and its predictions therefore cannot be compared with labels.
The training set contains 180 color images of size 5000×5000 pixels, each covering a surface of 1500 m × 1500 m (Maggiori et al., 2017). There are 36 tiles for each of the following regions: Austin, Chicago, Kitsap County, Western Tyrol and Vienna. The format of the images is GeoTIFF. The pixels of the label images have the value 255 for the building class and 0 for the not building class (Maggiori et al., 2017).
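The figures quoted above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the variable names are illustrative.

```python
# Figures quoted from the dataset description in the text.
TILE_SIDE_PX = 5000        # each tile is 5000 x 5000 pixels
RESOLUTION_CM = 30         # spatial resolution: 30 cm per pixel
TRAIN_TILES = 180
TILES_PER_REGION = 36

# Work in integer centimetres and metres to avoid floating-point noise.
tile_side_m = TILE_SIDE_PX * RESOLUTION_CM // 100      # side of one tile in metres
tile_area_km2 = (tile_side_m ** 2) / 1_000_000         # area of one tile in km^2
train_area_km2 = TRAIN_TILES * tile_area_km2           # area of the training set
n_regions = TRAIN_TILES // TILES_PER_REGION            # number of regions

print(tile_side_m, tile_area_km2, train_area_km2, n_regions)
```

The result reproduces the 1500 m tile side and the 405 km² training coverage given in the dataset description, and 180 tiles at 36 per region correspond to five regions.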
To prepare the datasets for training and testing of the models, images from the training set and their corresponding label images were selected and divided into patches of 224×224 pixels. Patching reduces the computational cost and avoids the loss of resolution that resizing the images would cause. This patch size was chosen because the architectures used work with images of this size.
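The patching step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes non-overlapping patches and assumes that the pixels left over at the right and bottom edges are discarded, since the text does not say how the edges were handled. The function name `patch_boxes` is hypothetical.

```python
def patch_boxes(height, width, patch=224):
    """Yield (top, left, bottom, right) boxes of non-overlapping patches.

    Assumption: pixels that do not fill a whole patch at the right and
    bottom edges are dropped; the source does not describe edge handling.
    """
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            yield (top, left, top + patch, left + patch)

boxes = list(patch_boxes(5000, 5000))
per_side = 5000 // 224               # whole patches along one side
leftover = 5000 - per_side * 224     # pixels dropped along each side
print(len(boxes), per_side, leftover)
```

Under these assumptions one 5000×5000 tile yields 22 × 22 = 484 patches, so five tiles give 2420 candidate patches before any filtering.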
To create the training dataset, 5 images and their corresponding label images were selected (Austin9, Chicago25, Kitsap18, Tyrol_w21 and Vienna15). For the test dataset, another 5 images were selected (Austin1, Chicago2, Kitsap30, Tyrol_w29 and Vienna9). During these selections, the distribution of rural and urban areas was considered. Patches with no buildings or only a few buildings were removed from the datasets. Consequently, a total of 1500 images and label images for the training dataset and 300 images and label images for the test dataset were generated (Fig 2).
Figure.2 Sample image and corresponding label image from training dataset
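Removing patches with no or few buildings can be sketched as a threshold on the share of building pixels in each label patch. The text does not state the criterion that was actually applied, so the threshold `min_fraction` and the function names below are hypothetical; only the label value 255 for buildings comes from the dataset description.

```python
BUILDING = 255  # label value of the building class, as stated in the text

def building_fraction(label_patch):
    """Return the share of pixels in a 2-D label patch that are buildings."""
    total = sum(len(row) for row in label_patch)
    hits = sum(value == BUILDING for row in label_patch for value in row)
    return hits / total if total else 0.0

def keep_patch(label_patch, min_fraction=0.05):
    """Keep a patch only if buildings cover at least `min_fraction` of it.

    `min_fraction` is a hypothetical threshold; the source gives no value.
    """
    return building_fraction(label_patch) >= min_fraction

empty = [[0] * 4 for _ in range(4)]            # patch without buildings
dense = [[255, 255, 0, 0] for _ in range(4)]   # half of the patch is building
print(keep_patch(empty), keep_patch(dense), building_fraction(dense))
```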

Methodology
The SegNet and FCN neural network architectures were used to train models on the prepared training dataset.

SegNet
SegNet is a CNN architecture developed at the Machine Intelligence Lab of the University of Cambridge as a deep learning architecture better suited to image segmentation tasks (Badrinarayanan et al., 2017). SegNet has an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer (Bozkurt, 2018).
The encoder network consists of 13 convolution layers, corresponding to the first 13 convolution layers of VGG16, a network pre-trained for object classification (Badrinarayanan et al., 2017). As described by Badrinarayanan et al. (2017), convolutions and max-pooling are performed in this network. The fully connected layers are discarded in order to retain higher-resolution feature maps at the deepest encoder output. This significantly reduces the number of parameters in the SegNet encoder network compared to other architectures.
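The layout of the 13 encoder convolutions can be written down as a small table. The sketch below uses the standard VGG16 block structure (five blocks of 3×3 convolutions, each followed by 2×2 max-pooling) and traces the feature-map size for a 224×224 input; the names `VGG16_BLOCKS` and `encoder_trace` are illustrative.

```python
# (number of 3x3 convolutions, output channels) for the five VGG16 blocks;
# each block is followed by one 2x2 max-pooling with stride 2.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def encoder_trace(side=224):
    """Return (spatial size, channels) after each encoder block."""
    trace = []
    for _n_convs, channels in VGG16_BLOCKS:
        side //= 2                  # pooling halves height and width
        trace.append((side, channels))
    return trace

n_conv_layers = sum(n for n, _ in VGG16_BLOCKS)
print(n_conv_layers)        # 13
print(encoder_trace(224))
```

For a 224×224 patch the deepest encoder output is a 7×7 map with 512 channels, which is where VGG16 would normally attach its fully connected layers.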
Within the SegNet architecture, each encoder layer has a corresponding decoder layer (Badrinarayanan et al., 2017). Thus, the decoder network also has 13 layers (Badrinarayanan et al., 2017). The output of the last decoder layer is fed to the classifier, which produces class probabilities for each pixel (Badrinarayanan et al., 2017). An illustration of the SegNet architecture is shown in Fig 3.
Figure.3 SegNet architecture (Du et al., 2018)
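The text does not spell out how each decoder layer relates to its encoder layer. In the SegNet design described by Badrinarayanan et al. (2017), the encoder stores the positions of the maxima chosen during max-pooling and the decoder reuses them to upsample. The following pure-Python sketch illustrates that idea on a single 4×4 feature map; it is a toy example, not the implementation used in this study.

```python
def max_pool_with_indices(fmap):
    """2x2 max-pooling (stride 2) that also records where each maximum was."""
    pooled, indices = [], []
    for r in range(0, len(fmap) - 1, 2):
        pooled_row, index_row = [], []
        for c in range(0, len(fmap[0]) - 1, 2):
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices

def unpool(pooled, indices, height, width):
    """Put each pooled value back at its recorded position; zeros elsewhere."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [6, 2, 3, 1]]
pooled, indices = max_pool_with_indices(fmap)
sparse = unpool(pooled, indices, 4, 4)
print(pooled)
print(sparse)
```

The unpooled map is sparse; in SegNet it is then passed through trainable convolutions in the decoder to produce dense feature maps.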

Fully Convolutional Networks (FCN)
Fully Convolutional Networks (FCNs) are used for semantic segmentation of images, analysis of multimodal medical images, and classification and segmentation of high-resolution and multispectral satellite images (Long et al., 2015). In 2015, Long et al. adapted modern classification networks (AlexNet, VGGNet and GoogLeNet) into FCNs and transferred their learned representations to the segmentation task by fine-tuning. They then defined a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations (URL-2) (Fig 4).
Figure.4 FCN architecture (De Souza, 2017)
FCNs are built from locally connected convolutional, pooling and convolutional transpose layers (Long et al., 2015). No dense layer is used in this architecture (URL-3). The absence of dense layers makes it possible to feed the network with inputs of variable size (URL-3). An FCN has 2 parts:


- Downsampling path
- Upsampling path (URL-4)
As described in URL-4, the downsampling path extracts and interprets the context. It consists of convolutional and max-pooling layers. The upsampling path enables precise localization of features. It consists of convolutional, convolutional transpose and concatenation layers. Concatenation layers are used for skip connections. A skip connection is a connection that bypasses at least one layer. Skip connections are often used to transfer local information from the downsampling path to the upsampling path.
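How the two paths and the skip connections fit together can be illustrated by tracing tensor shapes through a toy network. The sketch below is only shape bookkeeping under stated assumptions: the channel counts are hypothetical, each pooling step halves and each transposed convolution doubles the spatial size, and the input size is assumed to be divisible by 2 raised to the number of stages. It does not reproduce the exact model trained in this study.

```python
def trace_fcn(height, width, channels=(32, 64, 128)):
    """Trace (height, width, channels) through a toy down/up-sampling network.

    `channels` is a hypothetical choice; height and width are assumed to be
    divisible by 2 ** len(channels) so that every skip connection lines up.
    """
    shapes, skips = [], []
    h, w = height, width
    # Downsampling path: convolution (keeps size), then 2x2 max-pooling.
    for ch in channels:
        skips.append((h, w, ch))        # feature map saved for the skip
        h, w = h // 2, w // 2
        shapes.append(("down", h, w, ch))
    # Upsampling path: a transposed convolution doubles the size, then the
    # matching saved feature map is concatenated along the channel axis.
    for ch in reversed(channels):
        h, w = h * 2, w * 2
        skip_h, skip_w, skip_ch = skips.pop()
        assert (h, w) == (skip_h, skip_w), "skip shapes must match"
        shapes.append(("up", h, w, ch + skip_ch))
    return shapes

for size in (224, 256):                 # no dense layer: input size may vary
    print(size, trace_fcn(size, size)[-1])
```

Because no layer depends on a fixed number of input pixels, the same function works for both input sizes, which mirrors the point about variable inputs made above.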

STUDY
In this study, all training and testing processes were conducted on Google Colab. Google Colab is a free Jupyter notebook environment that gives users access to a free Tesla K80 GPU. It runs in the cloud and stores its notebooks and data on Google Drive.

Training and Testing
To train the models, the images were loaded into the network. The training dataset was then split according to an 85% / 15% training/validation ratio, giving 1275 and 225 images respectively.
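The 85% / 15% split can be sketched with the standard library. The text does not say whether the split was random or which tool performed it, so the shuffling and the `seed` value below are assumptions made for illustration.

```python
import random

def split_indices(n_samples, val_ratio=0.15, seed=0):
    """Shuffle sample indices and split them into training and validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # `seed` is a hypothetical choice
    n_val = round(n_samples * val_ratio)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(1500)
print(len(train_idx), len(val_idx))
```

With 1500 patches this reproduces the 1275 / 225 division reported above.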
For training, the "Adam" optimizer was used to update model parameters with a fixed learning rate of 0.001. Both models were trained for 50 iterations with a batch size of 16 using the same hyperparameters. To test the trained models, the test dataset, which was prepared separately from the training dataset, was used.
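For readers unfamiliar with Adam, the update rule can be sketched for a single scalar parameter. Only the learning rate of 0.001 and the batch size of 16 come from the text; `beta1`, `beta2` and `eps` are the defaults proposed in the original Adam paper and are assumed here, and the quadratic loss is a stand-in for a network loss.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; return (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimise f(w) = (w - 3)^2 as a stand-in for a network loss.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 6):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t)

steps_per_pass = math.ceil(1275 / 16)   # batches needed to see every image once
print(w, steps_per_pass)
```

Each early step moves the parameter by roughly the learning rate regardless of the gradient's magnitude, which is the characteristic behavior of Adam. If the 50 iterations reported above are full passes over the 1275 training images, each pass takes 80 batches of 16.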

RESULTS
The final accuracy results are shown in Fig 6. The model that uses the FCN architecture reached 94.39% training accuracy and 90.55% validation accuracy, whereas the model that uses the SegNet architecture reached 95.49% training accuracy and 89.49% validation accuracy. According to the validation accuracy results, the FCN model is more accurate than the SegNet model by about 1 percentage point (1.06).
When the training and validation accuracies of the models were compared, it was seen that the FCN model has the higher validation accuracy and the SegNet model has the higher training accuracy.
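The text reports accuracy without defining it. Assuming it is pixel-wise accuracy, that is, the share of pixels whose predicted class matches the label, it could be computed as in the following sketch. The function name and the toy arrays are illustrative.

```python
def pixel_accuracy(predicted, label, building=255):
    """Share of pixels whose predicted class equals the label class."""
    correct = total = 0
    for pred_row, label_row in zip(predicted, label):
        for pred_value, label_value in zip(pred_row, label_row):
            correct += (pred_value == building) == (label_value == building)
            total += 1
    return correct / total

label = [[255, 255, 0, 0],
         [255, 255, 0, 0]]
predicted = [[255, 0, 0, 0],
             [255, 255, 255, 0]]
print(pixel_accuracy(predicted, label))   # 6 of 8 pixels agree -> 0.75
```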

Figure.6 Training and validation accuracy results of models
When the differences between the training and validation accuracies of the models are examined, the model that uses the SegNet architecture has the larger gap (6.00 percentage points, against 3.84 for FCN). This shows that the SegNet model performs better on the training data than on the validation data, which may indicate overfitting. For the model that uses the FCN architecture the gap is smaller, which suggests that it generalizes better to unseen data than the SegNet model.
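The gaps discussed above follow directly from the reported accuracies. The short sketch below recomputes them; all four numbers are taken from the results in the text.

```python
# Accuracies in percent, copied from the results reported in the text.
results = {
    "FCN":    {"train": 94.39, "val": 90.55},
    "SegNet": {"train": 95.49, "val": 89.49},
}

# Difference between training and validation accuracy for each model.
gaps = {name: round(acc["train"] - acc["val"], 2)
        for name, acc in results.items()}
# Difference between the two models on the validation set.
val_difference = round(results["FCN"]["val"] - results["SegNet"]["val"], 2)

print(gaps, val_difference)
```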
Finally, building segmentation was performed on the prepared test dataset using the trained models. Examples of test, label and segmented images are shown in Fig 7 and Fig 8.
Figure.8 Segmentation results for test image 165 (panels: Test Image, Label Image, FCN, SegNet)
International Journal of Engineering and Geosciences (IJEG), Vol. 5, Issue 3, pp. 138-143, October, 2020

CONCLUSIONS
In this study, building segmentation from high-resolution images was carried out using the SegNet and FCN neural network architectures, and the two architectures were compared. Models were trained and tested using datasets prepared from images of the Inria Aerial Image Labeling Dataset.
It was observed that the model that uses the FCN architecture gives more accurate results: it has a higher validation accuracy and a smaller difference between training and validation accuracies. This can also be observed in the predicted segmentation results.
Further studies could include more datasets and different neural network architectures for comparison. The dataset could be extended with the unused images from the Inria dataset, since more training data would likely improve the performance of the models. In this study, default settings were used for the hyperparameters. Hyperparameter tuning could improve the performance of the models, because hyperparameter optimization is important for achieving maximum performance.