INTEGRATION OF CUSTOM STREET VIEW AND LOW-COST MOTION SENSOR

ABSTRACT
Virtual reality is an artificial, computer-generated environment, generally referred to as a virtual reality environment, which can be navigated and interacted with by a user. Street View, released by Google in 2007, is an ideal tool for discovering places and locations. This service provides not only spatial information but also a virtual reality environment for the user. Since the service is available only in certain locations, Google enables users to create a street view from custom panoramic images with the help of the Google Maps Application Programming Interface (API) for JavaScript. This study aims to integrate body motions with a custom street view service created for the Yildiz Technical University Davutpasa Campus, which has a historical setting and a large area to explore. A Microsoft Kinect for Xbox 360 motion sensor, together with the Flexible Action and Articulated Skeleton Toolkit (FAAST) interface, has been employed for this purpose. The integration provides a low-cost alternative for the virtual reality experience. The proposed system can be implemented for virtual museums, heritage sites or planetariums consisting of panoramic images.


INTRODUCTION
Virtual reality is a virtual environment that can be navigated and interacted with, e.g. by moving around and exploring the scene or by selecting and manipulating objects (Gutierrez et al., 2008). As photogrammetric technologies advance, researchers are able to represent reality more accurately. Researchers and scientists can generate affordable and satisfying walk-through representations of complex facilities by using panoramic images (Chapman & Deacon, 1998; Le Yaouanc et al., 2010) or 3D data (Yemenicioglu et al., 2016).
Street View is a technology that consists of street-level, 360-degree panoramic images and provides mobile and desktop clients with a virtual reality environment in which users can virtually explore streets and cities (Anguelov et al., 2010). A panoramic image is defined as a picture of an area providing an unlimited view in all directions (Amiri Parian & Gruen, 2010). The omnidirectional view of an area gives an overall understanding of the environment (Fangi & Nardinocchi, 2013); therefore, the Street View service is used in various applications. Rundle et al. (2011) used Google Street View to audit neighbourhood environments. Hanson et al. (2013) used Google Street View to assess the severity of pedestrian crashes. Kelly et al. (2012) and Curtis et al. (2013) used Google Street View to observe the built environment.
Human-computer interaction is an indispensable need for seamless communication (Isikdag, 2020). The mouse, keyboard and joystick are the most commonly used tools for navigation in desktop virtual reality applications, but the use of human body motions can provide a better understanding and interpretation of the virtual environment (Roupé et al., 2014). Human action recognition is a challenging subject in the computer vision community, which aims to understand human gestures from video and image sequences (Tran et al., 2012; Zhou et al., 2009). A new way to tackle this task arose with the release of depth cameras that allow acquiring dense, three-dimensional scans of a scene in real time (Schwarz et al., 2012). However, such devices (e.g. time-of-flight cameras) did not become widespread, due to their high prices, until the release of Microsoft Kinect for Xbox 360 in November 2010.
Microsoft Kinect is depth-sensing hardware designed to change the way people play games: it enables users to play video games with body motions. Playing games without controllers brought a new perspective, and users adapted Kinect to other applications. Depth maps acquired with Microsoft Kinect are widely used in computer vision applications. Bakirman et al. (2017) employed Kinect for human face modelling. Yue et al. (2014) and Izadi et al. (2011) used depth images captured with Kinect to reconstruct 3D environments. Xia et al. (2011) and Shotton et al. (2011) proposed different human detection methods using depth information derived from Kinect. Raheja et al. (2011) used Kinect depth images to track fingertips and palm centres.
This study aims to create a virtual reality environment for the Yildiz Technical University Davutpasa Campus by developing a custom Google Street View service with the Google Maps JavaScript API and integrating this service with Microsoft Kinect and the FAAST software (Suma et al., 2013) so that it can be navigated with human body motions.

MATERIALS AND METHODS
The study area is the Yildiz Technical University Davutpasa Campus, a historical site located in Istanbul, Turkey. The location was used as a military base during the Ottoman Empire and was called Davutpasa Barracks, believed to have been built in 1832. In 1999, Davutpasa Barracks was turned into a campus and became part of Yildiz Technical University. The campus covers 1.75 square kilometres, making it a large area to explore. Therefore, this study aims to create a virtual reality environment to explore and learn about the campus.
In this study, Microsoft Kinect for Xbox 360 is used for sensing human body motions. Kinect has two types of drivers for functioning on a PC: the Kinect for Windows SDK released by Microsoft, and the driver released by OpenNI, an organization founded by three members, PrimeSense (the developer of the base technology behind Kinect), ASUS and Willow Garage. In this study, Kinect for Windows SDK v1.8 is used.
Figure 1. (a) Microsoft Kinect, (b) projected infrared pattern (Roborealm, 2016), (c) depth image.
Microsoft Kinect consists of an infrared camera, an RGB camera and an infrared laser projector (Fig. 1a). It measures the distance from the sensor to the environment using the structured light principle (Freedman et al., 2013). A pattern known to the device is projected into the scene by the infrared laser projector (Fig. 1b). The projected pattern is captured by the infrared camera, a monochrome complementary metal oxide semiconductor (CMOS) sensor. Since the relative geometry between the infrared projector and the infrared camera is known, the depth map can be produced by 3D triangulation (Fig. 1c). In the depth image, dark red and light green represent small and large distances from the sensor, respectively.
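The triangulation step can be illustrated with the standard disparity-to-depth relation for a projector-camera pair. The following minimal sketch assumes a pinhole model; the focal length and baseline values are placeholders for illustration, not the calibrated parameters of the Kinect sensor.

    // Structured-light triangulation sketch: with a known baseline b between
    // projector and camera and a focal length f (in pixels), the depth Z of a
    // scene point follows from the observed disparity d of the pattern.
    function depthFromDisparity(f, b, d) {
      if (d <= 0) return Infinity; // zero disparity: point at infinity
      return (f * b) / d;          // Z = f * b / d
    }

    // Placeholder values for illustration only (not Kinect's calibration):
    var f = 580;   // focal length in pixels
    var b = 0.075; // projector-camera baseline in metres
    console.log(depthFromDisparity(f, b, 20)); // ~2.18 m for a 20-pixel disparity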
FAAST (Flexible Action and Articulated Skeleton Toolkit), developed by the University of Southern California Institute for Creative Technologies, is a toolkit that lets users control video games and virtual reality environments with human motion using the Kinect for Windows SDK or OpenNI (Suma et al., 2013).
Street View is a technology that presents street-level panoramic images around the world via the Google Earth software or Google Maps. In 2007, Google released Street View for five American cities, and the coverage area grew rapidly in the following years (Fig. 2). With the release of the Google Maps JavaScript API v3, Google provided users with a Street View service; thus, third-party users can present custom street view services on personal websites with the Google interface. Google also enables users to link custom Street View services with Google's existing street view panoramas.
Street view consists of 360-degree spherical panoramas, which are obtained using spherical video cameras. Panoramic images used in Street View must conform to the equirectangular (plate carrée) projection, in which meridians, parallels and the two poles are straight lines, and the images must have a 2:1 aspect ratio. In this study, 360-degree video streams were captured using a Ladybug 2 spherical video camera developed by Point Grey Research Inc. This camera has six fisheye lenses with Sony ICX204 sensors (Point-Grey, 2014). The panoramic image is created from the raw images captured by these six cameras. The image stitching process is outlined in Fig. 3. Six images are captured synchronously. The images can be compressed as JPEG to reduce the time needed to transfer files to the PC (Akcay et al., 2017); in this case, the images are uncompressed on the PC to obtain the raw images again. The raw images are converted to RGB with a selected interpolation technique (Fig. 4a); in this study, we used the rigorous colour processing technique, which provides the best colour quality. The RGB images are rectified and mapped to polygon meshes whose geometric vertices are arranged in a three-dimensional coordinate system (Fig. 4b). Since all images were captured outdoors, a 20-metre virtual sphere was utilized. We also used a blending width of 100 and applied brightness correction for darker areas.
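To make the equirectangular requirement concrete, the following sketch maps a viewing direction on the panorama sphere to pixel coordinates in a 2:1 image; this is the standard plate carrée mapping, not code from the stitching software.

    // Plate carrée (equirectangular) mapping: longitude and latitude on the
    // panorama sphere map linearly to x and y in a 2:1 image.
    // lon in [-PI, PI), lat in [-PI/2, PI/2]; width must equal 2 * height.
    function sphereToEquirect(lonRad, latRad, width, height) {
      var x = ((lonRad + Math.PI) / (2 * Math.PI)) * width;
      var y = ((Math.PI / 2 - latRad) / Math.PI) * height;
      return { x: x, y: y };
    }

    // With the 5400 x 2700 panoramas used in this study, the point straight
    // ahead (lon = 0, lat = 0) lands at the image centre:
    console.log(sphereToEquirect(0, 0, 5400, 2700)); // { x: 2700, y: 1350 }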

RESULTS AND DISCUSSION
In this study, 561 images were captured from 11 spherical video streams on 3 February 2009 on the campus. Weather conditions were mostly cloudy, so the images were not as bright as expected.
The Google Maps JavaScript API requires images to be in the equirectangular projection, which has a 2:1 aspect ratio. Therefore, spherical images with a resolution of 5400 x 2700 pixels were used.
The street view was created from 447 selected images using the Google Maps JavaScript API, following the workflow shown in Fig. 5. Images were given IDs based on their paths. The API's Street View service does not work on local PCs for security reasons; hence, all images were uploaded to a server. The street view container is set up in HTML, including street view options such as zoom levels and the starting panorama. Subsequently, the street view object and the street view link object are created with the API library, followed by a function that serves the custom panoramas. This function determines the panorama image size and the custom panorama URLs. Within this function, all panoramic images are defined by their image IDs and locations using a switch-case statement. Finally, links that allow moving from one panorama to another are created for each panorama (case). An HTML file was created for each different starting location. The final look of the created street view page can be seen in Fig. 6.
With FAAST, a specified human motion can trigger a keyboard command. In this study, six moves were defined and assigned to keyboard commands. The body motions, with their corresponding keyboard commands and functions, are listed in Table 2.
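A minimal sketch of such a custom panorama provider is given below, assuming the Google Maps JavaScript API v3 has been loaded in the page; the panorama IDs, coordinates, server URL and link headings are placeholders, not the actual values used for the campus service.

    // Minimal custom Street View provider (Google Maps JavaScript API v3).
    // Panorama IDs, coordinates and the image server URL are placeholders.
    function initPanorama() {
      var panorama = new google.maps.StreetViewPanorama(
        document.getElementById('street-view'),
        { pano: 'pano_001', visible: true });
      // Resolve unknown pano IDs through our own provider function.
      panorama.registerPanoProvider(getCustomPanorama);
    }

    // Returns the panorama image URL for a given pano ID.
    function getCustomPanoramaTileUrl(pano, zoom, tileX, tileY) {
      return 'https://example.com/panos/' + pano + '.jpg'; // placeholder server
    }

    // Defines each panorama (location, neighbour links, image tiles) by its ID.
    function getCustomPanorama(pano) {
      switch (pano) {
        case 'pano_001':
          return {
            location: {
              pano: 'pano_001',
              description: 'Davutpasa Campus', // placeholder description
              latLng: new google.maps.LatLng(41.022, 28.889) // placeholder
            },
            // Links let the user move to a neighbouring panorama.
            links: [{ heading: 90, description: 'Next', pano: 'pano_002' }],
            copyright: 'Imagery (c) the authors',
            tiles: {
              tileSize: new google.maps.Size(5400, 2700),
              worldSize: new google.maps.Size(5400, 2700),
              centerHeading: 0, // heading at the image centre, in degrees
              getTileUrl: getCustomPanoramaTileUrl
            }
          };
        // ... one case per panorama (447 in this study)
      }
    }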
Each input has a different type and number of descriptors. For example, the turn right, turn left, lean backwards and lean forwards moves have five descriptors. The first descriptor defines the type of move (turn, lean or jump); if the turn right move is defined, the first descriptor is 'Turn'. The second descriptor defines the direction the body turns to; in this case, it is 'Right'. The third descriptor determines whether the move is an upper limit ('at most') or a lower limit ('at least') constraint; in this scenario, it is 'At least'. The fourth and fifth descriptors define the move's measure and unit; twenty-five degrees is a sufficient measure for the turn right move.
Thus, the turn right move occurs when the user's body turns at least twenty-five degrees to the right. The turn left, lean backwards and lean forwards moves are defined in the same way; the moves and descriptors are listed in Table 3. The right foot forwards and right foot backwards moves have a different set of descriptors, because these moves involve the relation between two body parts. The first of six descriptors defines the first body part; if the right foot forwards move is defined, the first descriptor is 'Right Foot'. The second descriptor is the relationship type with the second body part, such as 'to the right of', 'above', etc. The third descriptor defines the second body part that the first body part is related to. The fourth, fifth and sixth descriptors are the same as the third, fourth and fifth descriptors of the first set of moves, respectively. As a result, the right foot forwards move occurs when the user's right foot is at least 25 centimetres in front of the user's torso. The right foot backwards move is defined in the same manner, and its descriptors are listed in Table 4.
Each input has an output command consisting of four descriptors. The first descriptor determines whether the command button will be 'pressed once' or 'held down'. The second descriptor is the keyboard command itself. The third descriptor specifies when the keyboard command will end, and the last descriptor defines the measure of the third descriptor. All output commands and their descriptors are listed in Table 5. Once the input and output descriptors are assigned in the software, the street view can be controlled with body movements by starting the FAAST emulator. For example, the move required to turn the street view angle to the right is shown in Fig. 7: Fig. 7a shows the current state of the street view, the turn right move occurs as seen in Fig. 7b, and the street view rotates to the right as long as the move continues (Fig. 7c).
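To summarise the descriptor scheme, the bindings described above can be restated as a simple data structure. The sketch below is illustrative only: FAAST is configured through its graphical interface, this is not its configuration format, and the keyboard keys shown are placeholders for the actual commands listed in Table 2.

    // Illustrative restatement of the input/output descriptor scheme as data.
    // NOT FAAST's configuration format; key names are placeholders.
    var bindings = [
      // Single-body-part inputs: [type, direction, limit, measure, unit]
      { input: ['Turn', 'Right', 'At least', 25, 'degrees'],
        output: { mode: 'hold', key: 'RIGHT_ARROW' } },   // rotate view right
      { input: ['Turn', 'Left', 'At least', 25, 'degrees'],
        output: { mode: 'hold', key: 'LEFT_ARROW' } },    // rotate view left
      // Two-body-part inputs: [part1, relation, part2, limit, measure, unit]
      { input: ['Right Foot', 'in front of', 'Torso', 'At least', 25, 'cm'],
        output: { mode: 'press once', key: 'UP_ARROW' } } // placeholder key
    ];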

CONCLUSION
Video game technologies have improved rapidly around the world, and they can be integrated into engineering applications. This study presented one such integration: a virtual reality environment for the Yildiz Technical University Davutpasa Campus was created by developing a custom street view service from panoramic images with the Google Maps JavaScript API v3 and integrating it with human body motions. Human body motions, sensed with Microsoft Kinect, are used to navigate the virtual reality environment, which creates a richer experience. Thus, an alternative low-cost way to control a Google street view service, achieving a better virtual reality environment and providing information about the study area, is proposed. We plan to implement the proposed framework for an indoor application in the future. The system can also be implemented for other applications, e.g. virtual museums, heritage sites or planetariums, which would also contribute to the preservation and documentation of cultural heritage.