Autonomous Navigation in Search and Rescue Simulated Environment using Deep Reinforcement Learning

— Human-assisted search and rescue (SAR) robots are increasingly being used in zones of natural disasters, industrial accidents, and civil wars. Due to complex terrains, obstacles, and uncertainty about the time available, these robots need a certain level of autonomy to act independently on certain SAR tasks. One of these tasks is autonomous navigation. Previous approaches to developing autonomous or semi-autonomous navigating SAR robots use heuristics-based methods. These algorithms, however, require environment-related prior knowledge and sufficient sensing capabilities, which are hard to maintain due to restrictions of size and weight in highly unstructured environments such as collapsed buildings. This study approaches the problem of autonomous navigation using a modified version of the Deep Q-Network (DQN) algorithm. Unlike the classical use of full game-screen images to train the agent, our approach uses only the images captured by the agent's low-resolution camera to train it to navigate through an arena, avoid obstacles, and reach a victim. This is a much more relevant way of making decisions in complex, uncertain contexts, since in real-world SAR scenarios it is almost impossible for SAR teams to have complete information about the area. We simulated a SAR scenario consisting of an arena full of randomly generated obstacles, a victim, and an autonomous SAR robot. The simulation results show that the agent was able to reach the victim in 56% of the evaluation episodes after 400 episodes of training.


I. INTRODUCTION
Every year, around 60,000 people die worldwide in natural disasters. The majority of these deaths are caused by building collapses in earthquakes [1]. Only a small fraction of the victims of urban disasters survive. According to [2,3], 80% of survivors of urban disasters are surface victims, i.e., people lying on the surface of the rubble or readily visible, whereas only 20% of survivors come from the interior of the rubble, even though the interior is often where the majority of victims are located. The complexity of the collapses and the need for a fast response after the disaster require multiple tasks to be performed simultaneously. One approach to this problem is the use of robots with a variety of capabilities, such as unmanned ground vehicles (UGVs) and unmanned aerial vehicles (UAVs), to assist human search and rescue experts. The number of human experts is limited and cannot cover controlling the robots in addition to the other critical tasks that require human intervention and moral judgment. Hence, these robots need a certain level of autonomy so that they can navigate through a SAR area and perform some tasks without human control. In [4], Murphy et al. describe the tasks that can be addressed. Those tasks include (a) reaching small and dangerous places (voids) that humans and dogs cannot reach, and (b) exploring the interior and exterior of the hot zone and extending wireless communication ranges.
Navigation is one of the fundamental tasks of mobile robots. The algorithms used to approach this task are divided into two sets [5]:
• Greedy heuristics-based algorithms that operate using a map, which needs to be provided or extracted from the environment. After generating such a map, the central system performs path planning, i.e., finding the optimal or suboptimal path to reach the target; the robots then navigate along the specified path. Autonomous mobile robots that use these techniques need to localize themselves within the generated map at the start of the planned trajectory and then monitor their motion along the path using sensors (e.g., wheel encoders and inertial sensors), which are hard to maintain on board a small mobile robot navigating through small voids within rubble.
• Algorithms that operate without the use of maps (e.g., DistBug). These algorithms are often used if a path has to be traveled only once and therefore does not necessarily have to be optimal. Although the Bug family of algorithms uses limited sensing and computation capabilities, they tend to rely on perfect position estimates. In [6], McGuire et al. concluded that the performance of Bug algorithms is sensitive to sensor noise, which makes them hard to implement in real-world uncertain, noisy contexts.
In [7], the authors studied different search methods based on greedy heuristics and the partially observable Markov decision process (POMDP) to design autonomous control of UAVs acting in search and rescue contexts. Their evaluation results showed the great potential of POMDPs in SAR scenarios; they also acknowledged that using these methods requires high computational power, as the cost grows exponentially (i.e., the curse of dimensionality), and that existing algorithms were not yet sufficient to tackle real-world problems. With the rise of deep learning and expanding computational power, researchers have been able to come up with algorithms that tackle the curse of dimensionality more efficiently and solve problems that were intractable in the past, such as playing Atari games at a superhuman level [8] and defeating a 9-dan Go player [9].
The main purpose of this study is to approach the problem of autonomous navigation using a modified version of DQN that relies only on the agent's camera, without any environment-related prior knowledge or any information gathered from another source (e.g., human experts, other robots, maps). This decentralized approach, using only the agent's limited vision-based capabilities, is a relevant strategy in deep voids where wireless connections are hard to maintain, and in any other situation where not enough outside information is available for decision making. The DQN algorithm modified and used in this work is a valid method for solving real-world problems that can be interpreted as a POMDP.
The rest of the paper has the following structure. Section II gives an overview of related work. Section III describes the theoretical foundations of the learning algorithms used. Section IV presents the simulation setup and evaluation methodology. Section V presents the results and discussion. Finally, Section VI concludes the paper and outlines future directions.

II. RELATED WORKS
One of the applications of autonomous navigation is creating 3D visualizations of buildings and large-scale construction sites using scanning technologies. The scanning process has often been carried out using scanners handled by people [10] or scanners on board vehicles controlled by people [11]. In [12], Kim et al. assisted the scanning process with UAVs that generated an initial 3D map of the area; based on this map, the optimal scanning locations are estimated and the optimal navigation path is determined. A robot equipped with a laser scanning system then follows the path and obtains high-resolution scans at the estimated locations. Although such techniques are powerful for navigating through large construction sites, they cannot be used to assist small UGVs navigating through small-scale indoor environments. These robots must perform navigation without any outside assistance due to the limitations of wireless sensing technologies.
The problem of autonomous navigation in search and rescue contexts has previously been approached using heuristics-based methods. A heuristic function ranks a set of options available at one point and decides which one to choose according to the ranking. These approaches require situation-related prior knowledge and a wide range of sensing abilities. In the EU-funded project ICARUS [13], several robots with different structures and capabilities were designed to assist SAR experts. The small unmanned ground vehicles (UGVs) used to navigate through narrow voids were provided with only a very limited level of autonomy. This is caused by the use of heuristic algorithms that require a massive amount of sensing and power, which cannot be added to the vehicles due to the restrictions of size and weight in highly unstructured environments, such as collapsed buildings. A similar approach was carried out in the SHERPA project [14], whose goal was to develop a mix of ground and aerial robots to support SAR teams in real-world hostile environments.
DQN was implemented in [15] to give a UAV the ability to navigate independently towards a specified goal in a simulated environment. The authors evaluated their work by comparing the success rate and the average number of steps undertaken by the trained agent and by a random decision-making agent. This research resulted in an 8% success rate for the trained agent compared to a 5% success rate for the random-action-taking agent. Although the algorithm showed some success, as the trained agent outperformed the random agent in terms of success rate, it is clear that it can be further developed.
This study approaches autonomous navigation using a modified version of the DQN algorithm that relies only on the agent's camera. Unlike the previous methods, this approach does not require a map to be generated or any extra sensors to monitor the agent's motion. While the recent implementation of DQN for autonomous navigation [15] used the CNN proposed in the original DQN work, the convolutional neural network (CNN) used in this work for training the agent to make decisions based on the captured images has been designed with simplicity in mind to suit the simulated scenario.

III. THEORETICAL FOUNDATIONS
A. Reinforcement Learning
RL is the machine learning paradigm concerned with training agents to act autonomously to reach certain goals in an environment. A typical reinforcement learning scenario consists of the following components: an agent (the decision-maker) and an environment whose state changes according to the agent's actions. As the agent explores the environment, it receives a reward for each action it takes. Hence, by trial and error, it learns to maximize this reward.
Formally, an RL scenario is modeled as a Markov decision process (MDP), which consists of a tuple (S, A, R, T, γ), where S is the set of states; A is the set of actions (alternatively A(s): the set of actions that can be taken at state s); R(s) is the reward function, which returns the immediate reward; T(s, a, s') ~ P(s' | s, a) is the probability that taking action a will change the environment state from s to s'; and finally, γ ∈ [0,1] is the discount factor, i.e., how much we value future reward over the immediate one.
The main problem of an MDP is to find the optimal policy π*. A policy π(s) is a function that specifies the action that should be taken at a certain state. Thus the optimal policy π* is the policy that maximizes the expected long-term reward (i.e., the reward from all states):

π* = argmax_π E[ Σ_{t ≥ 0} γ^t R(s_t) | π ]
A POMDP scenario can be described as follows: at each time step the environment is in some state s ∈ S, and the agent takes an action a ∈ A, which changes the environment state from s to s' according to the transition probability T. After one time step, the agent receives an observation o from the environment state s'. The observation depends on the current state s' and the previous action a according to the probability Ω(o, s', a). The goal of the agent is to find the optimal policy π* that maximizes the expected long-term reward. As the states in a POMDP are not fully observable, the agent only receives partial information about the environment. This generalization is more realistic than an MDP, and it can describe a variety of real-world problems. Hence it is an appropriate framework to describe the simulated scenario.
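For reference, the definitions above correspond to the standard POMDP formulation and discounted objective written below; this is standard textbook notation using the symbols already introduced, not an equation reproduced from the paper itself.

% POMDP components with the symbols used above
\[
  T(s, a, s') = P(s' \mid s, a), \qquad \Omega(o, s', a) = P(o \mid s', a)
\]
% objective: the policy maximizing the expected discounted long-term reward
\[
  \pi^{*} = \arg\max_{\pi} \; \mathbb{E}\!\left[ \sum_{t \ge 0} \gamma^{t} R(s_t) \,\middle|\, \pi \right]
\]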

B. Q-Learning
Reinforcement learning methods fall into two categories: policy-based and value-based methods. Policy-based methods aim at finding the optimal policy π* by starting with a parameterized policy and updating its parameters using gradient-based or gradient-free methods. Value-based methods, on the other hand, use a value function that estimates the expected long-term reward of executing an action at a particular state and hence indicates which action to take. One of the most well-known value-based methods is Q-learning. It can find the optimal policy in finite, stationary, and fully observable environments with discrete actions. The algorithm starts by initializing a table called the Q-table with arbitrary Q-values. A Q-value is a value that corresponds to how good an action is at a specific state. Then, at each state, an action a is selected, and its corresponding Q-value is updated using the Q-function:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

where Q(s, a) is the Q-value of executing action a at state s, α ∈ (0,1] is the learning rate, and r is the immediate reward. Although Q-learning can find the optimal policy in small finite environments, it performs poorly in non-stationary, partially observable, complex environments.
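As an illustration of the tabular update above, a minimal Q-learning sketch in Python/NumPy is given below. It is a generic textbook implementation, not code from this study; the environment interface (env.reset/env.step) and the hyperparameter values are placeholders.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    q_table = np.zeros((n_states, n_actions))         # arbitrary initial Q-values
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection over the current Q-table row
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(q_table[state]))
            next_state, reward, done = env.step(action)
            # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
            td_target = reward + gamma * np.max(q_table[next_state])
            q_table[state, action] += alpha * (td_target - q_table[state, action])
            state = next_state
    return q_table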

C. Deep Q-Network
With the rise of deep learning, and particularly in [8], Mnih et al. introduced the Deep Q-Network (DQN) algorithm, which gave birth to the field of deep reinforcement learning. DQN was the first algorithm to reach superhuman performance in playing Atari games from pixels. Instead of using a table to track the Q-values, the authors used a convolutional neural network as a function approximator to predict Q-values from states. One of the problems RL algorithms have when using artificial neural networks as function approximators is instability. The DQN algorithm uses two separate models to stabilize learning, namely the main model and the target model; the main model's weights are copied to the target model every c steps. DQN also keeps a memory buffer termed experience replay to store the transitions. At every training step, DQN uses the main model to predict the action to take; the agent then executes the action and observes a new state together with the immediate reward. After adding the new transition to the experience replay, DQN takes random samples from the experience replay and uses them as data to fit the main model by minimizing the mean squared error (MSE) between the main and target models' estimates. The training process is visualized in Fig. 1.
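A condensed sketch of this training loop is shown below. It is an illustrative re-implementation of the generic DQN update using TensorFlow/Keras and NumPy; the placeholder architecture, buffer size, and hyperparameter values are assumptions and not the exact settings used in this study.

import random
from collections import deque
import numpy as np
import tensorflow as tf

GAMMA, BATCH_SIZE, TARGET_SYNC_STEPS = 0.95, 32, 1000   # placeholder hyperparameters

def build_model(input_shape, n_actions):
    # placeholder architecture; any small Q-network can be used here
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=input_shape),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_actions, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")          # MSE loss, as in DQN
    return model

main_model = build_model((32, 32, 3), n_actions=3)
target_model = build_model((32, 32, 3), n_actions=3)
target_model.set_weights(main_model.get_weights())
replay_buffer = deque(maxlen=10_000)                     # experience replay memory

def train_step(step):
    # fit the main model on a random minibatch sampled from the replay buffer
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(replay_buffer, BATCH_SIZE)     # (s, a, r, s', done) tuples
    states = np.array([t[0] for t in batch])
    next_states = np.array([t[3] for t in batch])
    q_values = main_model.predict(states, verbose=0)
    next_q = target_model.predict(next_states, verbose=0)
    for i, (_, action, reward, _, done) in enumerate(batch):
        q_values[i, action] = reward if done else reward + GAMMA * np.max(next_q[i])
    main_model.fit(states, q_values, verbose=0)
    if step % TARGET_SYNC_STEPS == 0:                    # copy weights every c steps
        target_model.set_weights(main_model.get_weights())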

D. The ε-Greedy Policy
The ε-greedy policy is a method used to balance the exploration and exploitation performed by the agent; that is, deciding at a certain point whether it should take a random action to explore unexplored space, or whether it should use the trained model's action-value estimations to predict the best action. It keeps track of a parameter ε, which represents the probability of choosing a random action. Similar to the original DQN, in this work ε starts at 1 and decreases linearly throughout the training until it reaches a fixed value.
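A minimal sketch of this policy with linear decay is shown below. The start and end values (1.0 and 0.1) and the 25-episode decay horizon follow the description in Section IV; the function and parameter names are illustrative.

import numpy as np

def epsilon_by_episode(episode, eps_start=1.0, eps_end=0.1, decay_episodes=25):
    # linear decay from eps_start to eps_end, then held fixed
    fraction = min(episode / decay_episodes, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def select_action(model, state, epsilon, n_actions):
    if np.random.rand() < epsilon:                           # explore
        return np.random.randint(n_actions)
    q_values = model.predict(state[np.newaxis], verbose=0)   # exploit
    return int(np.argmax(q_values[0]))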

IV. SIMULATION SETUP AND METHODOLOGY
The scenario was simulated using Webots [16], an open-source robot simulator widely used for educational and professional purposes. It includes a complete toolbox to design environments, robots, and simulation experiments. It can execute controllers written in compiled C/C++ and Java or in interpreted Python and MATLAB. In this study, the agent controller was written in Python to take advantage of the required libraries (i.e., NumPy, TensorFlow, and Keras).
An e-puck robot was used for its motion simplicity. It has two motors, one for the left wheel and one for the right wheel. It also has a color camera with a maximum resolution of 640x480, eight light sensors, and eight distance sensors. For this research, only the camera and the motors have been used.
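A minimal sketch of such a controller is given below. It assumes the standard Webots Python API (a recent release that provides Robot.getDevice) and the default e-puck device names; the wheel velocity value and the comment about feeding the policy are placeholders, not the actual controller of this study.

from controller import Robot   # Webots controller API
import numpy as np

robot = Robot()
timestep = int(robot.getBasicTimeStep())

# only the camera and the two wheel motors are used in this study
camera = robot.getDevice("camera")
camera.enable(timestep)
left_motor = robot.getDevice("left wheel motor")
right_motor = robot.getDevice("right wheel motor")
for motor in (left_motor, right_motor):
    motor.setPosition(float("inf"))        # switch the motors to velocity control
    motor.setVelocity(0.0)

while robot.step(timestep) != -1:
    # the camera returns raw BGRA bytes; reshape them into a height x width x 4 array
    image = np.frombuffer(camera.getImage(), dtype=np.uint8).reshape(
        (camera.getHeight(), camera.getWidth(), 4))
    # ... feed the image to the learned policy and map the chosen action to
    # wheel velocities; constant forward motion is used here as a placeholder ...
    left_motor.setVelocity(3.0)
    right_motor.setVelocity(3.0)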
At the beginning of the simulation, a square arena is created. The center of the arena is at the origin of the coordinate system, and the objects can be placed within [-1, 1] along both the X and Z dimensions, whereas the Y dimension remains fixed. At every episode, the agent and the obstacles are placed randomly. The victim robot is placed uniformly at random within the upper-left part of the arena, between [-1, -0.5] along both the X and Z dimensions. Similarly, the agent is placed uniformly at random within the lower-right part of the arena, between [0.5, 1] along both the X and Z dimensions. The obstacles (i.e., 20 boxes) are positioned randomly according to a normal distribution with zero mean and a standard deviation of 0.5. The objects are positioned in this manner to ensure that the obstacles will get in the agent's way. Fig. 2 shows a view of the arena at the beginning of an episode.
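The placement scheme described above can be summarized by the following illustrative NumPy sketch. The corner ranges, obstacle count, and distributions follow the description; the clipping of obstacle positions to the arena bounds is an assumption not stated in the paper.

import numpy as np

rng = np.random.default_rng()

def place_objects(n_obstacles=20):
    # victim: uniform within the upper-left corner, x and z in [-1, -0.5]
    victim_xz = rng.uniform(-1.0, -0.5, size=2)
    # agent: uniform within the lower-right corner, x and z in [0.5, 1]
    agent_xz = rng.uniform(0.5, 1.0, size=2)
    # obstacles: normal distribution with mean 0 and standard deviation 0.5;
    # clipping to the arena bounds is an assumption
    obstacles_xz = np.clip(rng.normal(0.0, 0.5, size=(n_obstacles, 2)), -1.0, 1.0)
    return victim_xz, agent_xz, obstacles_xz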
In this work, DQN has been modified to achieve more success in the simulated scenario. These modifications are as follows:
• The CNN used in the original DQN performed poorly in the simulated scenario: at an early stage of training, the agent starts to memorize one action regardless of the image fed into the network, and hence it rarely reaches the victim robot. Instead, a simpler CNN has been used in this work, which showed much better results (an illustrative sketch is given after this list). Fig. 3 shows the architecture of the modified CNN, which is used to predict the best action to be executed according to the image fed in.
• The agent was trained for 400 training episodes executed in 8 identical parts of 50 episodes each. Within each training part, the ε-greedy policy's ε starts from 1 and decays until it reaches 0.1 after 25 episodes, i.e., half of a training part, as illustrated in Fig. 4.
• While in the original DQN a stack of 4 frames is fed into the CNN as one observation, for simplicity we use only one image. Fig. 5 shows an image captured by the agent during training.
• The use of a simple distance-dependent reward function of the form r = d_{t-1} − d_t, where d_{t-1} is the distance between the agent and the victim at the previous step, and d_t is the distance between the agent and the victim at the current step.
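Since the exact layer configuration of the modified network is only given in Fig. 3, the following Keras sketch is purely illustrative of the "simpler CNN" idea: the layer sizes, input resolution, and number of actions are assumptions, not the architecture used in this study. The reward helper follows the distance-based formulation above.

import tensorflow as tf

def build_simple_cnn(input_shape=(32, 32, 3), n_actions=3):
    # deliberately small network compared to the original DQN CNN (illustrative)
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="linear"),   # one Q-value per action
    ])

def distance_reward(previous_distance, current_distance):
    # positive when the agent moves closer to the victim, negative when it moves away
    return previous_distance - current_distance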

V. RESULTS AND DISCUSSION
Several evaluation methods were used to test the suitability of the original CNN architecture proposed in [8]. The original CNN was designed to generalize the learning process over many distinct Atari games, while in our scenario the agent has to learn to interact with a single environment. Due to the complexity of the model, the agent starts to memorize behaviors at a very early stage, and thus at one point it takes only one action regardless of the image captured. This results in poor performance in terms of navigating towards the target. Fig. 6 illustrates the number of steps taken by the agent to reach the victim in a single episode when using the original CNN. If the agent does not reach the victim during the episode, the finishing step is set to the maximum number of steps used (i.e., 1000). This is due to the low ε value of 0.1, which limits the agent's exploration. The curve shows that, using the original approach, the agent was only able to reach the victim before episode 25 (i.e., the episode at which the randomness is minimized), which clearly demonstrates that the agent is not able to reach the victim when relying on the original complex CNN and hence performs poorly in the simulated environment.
Using the modified CNN proposed in the methods section, the agent showed better results when evaluated in 1000-step episodes. Fig. 7 illustrates the number of steps taken by the agent to reach the victim in a single episode when using the modified CNN. If the trained agent does not reach the victim during the episode, the finishing step is set to the maximum number of steps used (i.e., 1000). The curve shows that, using the modified approach, the agent was able to reach the victim in most of the episodes, while also reaching the victim in fewer finishing steps on average after episode 25. This clearly demonstrates that the agent is able to reach the victim more efficiently relying only on the model trained with the modified CNN. The finishing step decreases as the training progresses.
Since ε decays linearly, this shows that the agent performs better when it predicts actions based on the trained model. Table 1 shows the training results of the agent during the different parts of the training. Using the methods proposed above, the agent was trained on an Intel Core(TM) i5-3230M CPU. Training for 400 episodes lasted approximately 21 hours and 7 minutes. The training episodes were executed in 8 parts to ensure the stability of the simulation physics engine (i.e., after training for more than 50 episodes, the physics engine starts to produce delays, and the simulation information stored in RAM reaches its limits).
After the training, the agent's performance was evaluated using the same evaluation method used in [15]. The trained agent's success rate and the number of steps it took to finish the simulation were calculated and compared with those of a fully random-action-taking agent. Table 2 shows a comparison between the evaluation results of this work and the previous study. (The success rate is the percentage of successful episodes, i.e., episodes in which the agent reaches the victim, whereas the minimum steps value is the minimum number of steps at which the agent reached the victim.)
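For clarity, the reported metrics can be computed as in the short sketch below. The code is illustrative: episode_steps is a hypothetical list holding the finishing step of each evaluation episode, an episode that hits the step cap counts as a failure, and whether the average is taken over successful episodes only is an assumption.

MAX_STEPS = 1000   # step cap per evaluation episode

def evaluate(episode_steps):
    # episodes that finish below the cap are counted as successes
    successes = [s for s in episode_steps if s < MAX_STEPS]
    success_rate = 100.0 * len(successes) / len(episode_steps)
    minimum_steps = min(successes) if successes else None
    average_steps = sum(successes) / len(successes) if successes else float("nan")
    return success_rate, average_steps, minimum_steps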
While the random agent was not able to reach the victim even once, after 400 episodes of training the trained agent reached a 56% success rate over 40 evaluation episodes. That means the e-puck robot was able to reach the victim in 56% of the evaluation episodes. Moreover, it was able to reach the victim in as few as 21 steps in at least one episode, which is a strong result given the complexity of the environment. The poor performance of the random agent in our study shows the complexity of our simulated scenario compared with the simulation scenario in [15], which further underlines how well the trained agent performed.
Compared with the results presented in [15], this work achieves a significant improvement in terms of the difference between the results of the random agent and those of the trained one. This is due to the use of a simpler CNN that learns to navigate through obstacles in similar settings, instead of the original CNN that generalizes the learning over many Atari games. Another factor is the version of the ε-greedy policy used, that is, partitioning the training into parts with an ε that starts from 1 at each part and linearly decays until it reaches 0.1. With this method, the agent has been able to explore the changing state of the environment across episodes more efficiently.

VI. CONCLUSION AND FUTURE WORK
This paper presented a SAR simulation scenario in which an autonomous e-puck robot's main goal is to explore an area full of randomly generated obstacles and learn to navigate through them to reach a certain victim without any use of environment-related prior knowledge. The scenario, and the environment in general, can be used to test the suitability and efficiency of different deep reinforcement learning methods for solving autonomous SAR problems. A modified DQN algorithm was implemented using several new approaches. These include (a) the use of a modified CNN instead of the original one, a distance-dependent reward function, and the agent's low-resolution camera images to train the agent; (b) feeding one frame to the CNN as an observation instead of a stack of four images; and finally (c) training the agent in 8 parts, where in each part the ε-greedy policy's ε decays linearly from 1 until it reaches 0.1 at the middle of the training part. Using these methods, our modified DQN has shown strong results compared to previous work when evaluated using the same evaluation methods and metrics, reaching a success rate of 56% with an average of 242 steps.
Despite the work's success, a variety of future work can be conducted to increase efficiency and real-world applicability. The scenario can be further developed to become more realistic, e.g., by placing a variety of obstacles and increasing the number of obstacles and victims. A more complex CNN can be designed to train the agent to navigate in the developed, more complex scenario. The reward function can be optimized to depend on other factors, such as the distance between the agent and the nearest obstacle.