A Manhattan distance based hybrid recommendation system

ABSTRACT


Introduction
In the last 15 years, the Internet has evolved from a group work tool that supported scientists at CERN into a global knowledge space with more than a billion users. With the spread of the Internet, many opportunities have emerged, such as sharing information and ideas with other users [3].
However, users then encountered a new problem. The amount of data and the number of items have increased greatly, leading to data overload, and finding what a user is actually looking for has become difficult. Because items and information need to be filtered and sorted, developers turned to recommendation systems as a solution.
With the rapid development of e-commerce, online tourism websites have become a common way to book tourism services [17]. Nowadays, people widely use online reservation systems to plan their holidays. The large number of options makes it difficult for hotel customers to decide when and where to go. In addition, because of the wealth of information available in online reservation systems, customers may miss an option that suits them better. In this sense, recommendation systems play a major role in customers' choices [22]. Recommendation systems are useful for both service providers and users [5]. They decrease the transaction costs of finding and choosing products in an online shopping environment [6]. Recommender systems also have benefits for businesses. First, revenue: recommendation algorithms are studied as a way to increase revenue by increasing the number of sales for companies with online customers, such as Amazon. Second, personalization: data collected indirectly can be used to ensure that the services of the website suit the user's preferences. Third, discovery: recommending items such as products, movies, and songs increases the chance that users revisit a web page once they find something they like [4].
Recommender systems use the following types of filtering to create an effective recommendation engine: content-based, collaborative, and hybrid [4].
In this study, real data from SeturTech, a technology company that develops systems that offer hotels and holidays to users, are used. A hybrid approach is applied to these data to recommend hotels to users. The hybrid recommendation system is developed by combining a content-based and a collaborative filtering recommendation system in order to increase the real-life performance of hotel recommendation. Unlike similar studies in the literature, a rich hotel attribute list is used in the content-based method. In addition, in the collaborative filtering method, the interaction matrix is built from the number of times each guest chose each hotel, so it resembles a user-product score matrix.
There are different distance metrics for calculating the similarity of customers. The cosine distance and cosine similarity metrics are mainly used to find similarities between two data points. As the cosine distance between the data points increases, the cosine similarity decreases, and vice versa. Thus, points closer to each other are more similar than points that are far from each other. Cosine similarity is given by cos θ, and cosine distance is 1 - cos θ.
The Manhattan distance metric, also known as city block distance or taxicab geometry, is used when the distance between two data points must be calculated along a grid-like path.
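The two families of metrics mentioned above can be compared on a small example. The following is a minimal sketch in plain Python; the function names and the toy vectors are illustrative assumptions and are not taken from the study's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cos(theta)."""
    return 1.0 - cosine_similarity(a, b)

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (city block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Toy vectors (assumed for illustration only).
u = [1, 0, 2]
v = [2, 0, 4]  # same direction as u, larger magnitude
w = [0, 3, 0]  # orthogonal to u

print(cosine_distance(u, v))     # approximately 0: same direction
print(cosine_distance(u, w))     # 1.0: orthogonal vectors
print(manhattan_distance(u, v))  # 3: magnitudes differ
```

The example illustrates one practical difference: cosine distance ignores magnitude, whereas Manhattan distance grows with the size of the coordinate differences, which matters when vector entries are counts such as numbers of visits.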
The RFM method is used to determine customer profiles. Recency is the time elapsed since the user's last purchase, measured in days. The more recent the purchase, the higher the customer's score.
Frequency indicates how many times the user visits in a given time period.
Monetary represents the amount of money a user has spent in a given period of time. The score is based on the transaction amount: the higher the transaction amount, the higher the score.
The rest of this paper is structured as follows. In Section II, the developed hotel recommendation system and the three types of recommendation system filtering methods are introduced.
In Section III, the RFM analysis of the hotel data is reviewed. In Section IV, the results of the tests are given. Finally, we conclude in Section V.

Recommendation System Creation
Recommender systems are a major subfield of data mining, and there are two main approaches: 1) non-customized and 2) customized. Each approach uses different machine learning techniques. A non-customized recommender system gives the same item recommendations to all users without using individual user data; users' interests are not considered. In contrast, a customized (personalized) recommendation system considers the preferences or interests of each user and thus recommends items to the user more effectively.
There are three basic approaches in customized recommender systems:
1) Content-based filtering approach: Characteristics are derived from the information items.
2) Collaborative filtering approach: Characteristics are derived from the user's environment.
3) Hybrid filtering approach: Content-based and collaborative filtering approaches each have their pros and cons. To cope with their disadvantages, the hybrid approach, a combination of both, is used [1].

Content Based Filtering
The content-based filtering method, also called "cognitive filtering", works from user profiles created at the outset. Such a profile is created when the user opens an account and logs into the system.
A profile contains information about a user and their tastes. Profile creation requires initial information about the user, and for this reason recommendation systems often present a survey.
The recommendation engine compares the items that a user has rated positively with each other and determines their similarities [3]. The more a user interacts with the system, the stronger the user profile becomes. In content-based filtering, the recorded information of the user alone is sufficient; data from other, similar users is not needed [1]. For instance, if a user liked a website containing the words "car", "engine", and "petrol", the pages proposed by content-based filtering (CBF) will be relevant to the automotive world [8].
In terms of high-level architecture, content-based recommendation systems consist of three basic parts. First, the system preprocesses items with a content analyzer. After that, a profile learner gathers information about the users. Finally, the filtering component produces suitable suggestions [12]. The three parts are detailed below:
Content Analyzer: The content analyzer prepares the available data for the next step by converting the information from its original form into a more abstract and usable one. For instance, it can accept a web page as input and convert it into a keyword vector [9].
Profile Learner: The profile learner is a module specific to each user. It takes the preprocessed information from the content analyzer and generalizes it to model the user's preferences [9].
Filtering Component: The last step of the content-based filtering method is the filtering component. It recommends relevant items to the user based on the user profile [9].
Building a model of the user's preferences from the user's history is a form of classification learning. Items are divided into binary categories such as "user likes" and "user dislikes". For example, if a user buys a product, it is a sign that the user likes that product. However, if the user buys the product and returns it, this is a sign that the user does not like the product. In general, implicit methods can collect large amounts of data, with some uncertainty as to whether the user actually likes the item [10].
In this study, the features of the hotels that customers visited before are taken as the basis for creating user profiles. For example, in Table 1, User 1 has gone to hotels A and B. The capacities of these hotels are 300 and 400 people, respectively, their distances from the sea are 200 and 100 meters, and breakfast is served in both. When User 1's profile is calculated, this content information is used and the values of the relevant rows in Table 1 are averaged. Thus, the hotels that User 1 visits have an average capacity of 350 people and an average distance from the sea of 150 meters, and User 1 looks for breakfast service. User 2, on the other hand, displays a different profile by choosing hotels with less capacity that are farther from the sea. As a result, the profiles of these users are as in Table 2. Using Microsoft Excel, the hotels selected by each customer and the features of those hotels are combined in a single table.
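The profile calculation for User 1 can be reproduced with a few lines of Python. This is a minimal sketch that uses only the values quoted in the text for hotels A and B; the dictionary layout and the names `hotels`, `visits`, and `build_profile` are assumptions made for illustration (the study itself performs this step in Excel).

```python
# Hotel features for User 1's hotels, as quoted in the text.
hotels = {
    "A": {"capacity": 300, "sea_distance_m": 200, "breakfast": 1},
    "B": {"capacity": 400, "sea_distance_m": 100, "breakfast": 1},
}

# Hotels visited by each user (only User 1 is specified in the text).
visits = {"user_1": ["A", "B"]}

def build_profile(visited, hotel_features):
    """Average every feature over the hotels a user has visited."""
    feature_names = next(iter(hotel_features.values())).keys()
    return {
        name: sum(hotel_features[h][name] for h in visited) / len(visited)
        for name in feature_names
    }

profile_1 = build_profile(visits["user_1"], hotels)
print(profile_1)
# {'capacity': 350.0, 'sea_distance_m': 150.0, 'breakfast': 1.0}
```

The averaged binary feature (breakfast) stays at 1.0 because both hotels offer it; a value between 0 and 1 would indicate that only some of the visited hotels have the feature.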
The hotel list data are cleaned in preprocessing steps. Some hotels selected by customers are not in the hotel feature list; they were detected using Excel and removed from the list.
The preprocessing steps consist of correcting missing data (e.g., hotel names) and removing inconsistent data (e.g., number of rooms). The currency code and foreign currency sales amount columns, which are not used in the recommendation system, are deleted, and comparisons are made in Turkish Lira.
The names of districts and towns are removed, leaving only the cities. Branch name, code, and type; sales, entry, and exit dates; and the number of overnight guests are also removed. Then, the 8 most important hotel features (pool, beach, breakfast, etc.) are determined by eliminating the features with fewer than 600 customer selections. Reducing the size of the data in this way shortens the running time of the similarity calculation between profiles and hotels.
The average of the features of the hotels selected by each customer is calculated using a pivot table in Excel. Customers whose selected hotels all have a binary value of 0 for every feature are removed from the data set.
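The two reduction steps described above (dropping rarely selected features, then dropping customers whose remaining feature values are all 0) could be sketched in Python as follows. Only the threshold of 600 comes from the text; the selection counts, the feature name "Pet Friendly", the customer IDs, and the function names are hypothetical.

```python
MIN_SELECTIONS = 600  # threshold stated in the text

# Hypothetical selection counts per hotel feature (illustrative numbers only).
selection_counts = {"Pool": 4200, "Fitness": 950, "Pet Friendly": 120}

# Hypothetical customer profiles: average binary feature values per customer.
profiles = {
    "c1": {"Pool": 1.0, "Fitness": 0.5, "Pet Friendly": 0.0},
    "c2": {"Pool": 0.0, "Fitness": 0.0, "Pet Friendly": 1.0},
}

def select_features(counts, min_selections=MIN_SELECTIONS):
    """Keep only the features selected at least min_selections times."""
    return [name for name, n in counts.items() if n >= min_selections]

def drop_empty_profiles(all_profiles, features):
    """Restrict profiles to the kept features and drop all-zero customers."""
    kept = {}
    for customer, profile in all_profiles.items():
        reduced = {f: profile.get(f, 0.0) for f in features}
        if any(value != 0 for value in reduced.values()):
            kept[customer] = reduced
    return kept

features = select_features(selection_counts)
clean = drop_empty_profiles(profiles, features)
print(features)  # ['Pool', 'Fitness']
print(clean)     # {'c1': {'Pool': 1.0, 'Fitness': 0.5}}
```

Customer `c2` disappears because its only non-zero feature was removed by the threshold, which mirrors the removal of all-zero rows described in the text.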

I. New user issue: This problem is caused by a lack of information. When a new user enters the system, the system does not have enough data about the user's profile and preferences, so a suitable profile cannot be created. As a result, the recommendations may not be good enough [1].

II. Excessive specialization: This problem arises because the recommendation system recommends similar types of products based on the available historical data, which is a limitation when the user wants to try something new. For example, even if a user would like fiction movies if they were suggested, the system will only suggest new comedy movies because the user only liked comedy movies in the past. Excessive specialization can therefore narrow the scope of recommendations [1].

Collaborative Filtering
Collaborative filtering (CF) is a popular recommendation approach. It was first described and explained in 1997 by Paul Resnick and Hal Varian [7].
If two users have rated common items similarly, they are assumed to have similar tastes. Such users form a group, or so-called neighborhood.
A user receives recommendations for items that they have not previously rated but that have been rated positively by users in their neighborhood [15].
In CF systems, a recommendation is made to a user based on the past ratings of all users collectively. The Grundy system was the first recommendation system to propose the use of stereotypes as a mechanism for creating user models from limited information about each user. The system builds a model of the individual user and recommends relevant books to each user [1]. Video Recommender [13] and amazon.com [14] are examples of collaborative filtering.
The accuracy of collaborative filtering suggestions improves as the amount of data on items increases [11].
Collaborative filtering approaches can be divided into three subgroups:
I. User-based approach: This approach was proposed by University of Minnesota Professor Jonathan L. Herlocker in the late 1990s [15]. In this approach, users play the main role. A subset of users is selected based on their similarity to the active user, so customers with the same taste form a group. The user is given suggestions based on the items evaluated by other users in the group [3]. A weighted combination of their ratings is then used to estimate the rating for the user [1].

II. Item-based approach: This approach was proposed by University of Minnesota researchers in 2001 [16]. As a system grows, the number of users increases, and so does the complexity of finding similar users. Therefore, item-item collaborative filtering was proposed as an alternative to finding similar users [1]. The system creates neighborhoods according to the tastes of the users and then generates suggestions from items found in a user's preferred neighborhood [3].

III. Model-based approach: The system makes recommendations by estimating the parameters of statistical models of user ratings. In this approach, a model is precomputed from the available data, so the system can respond quickly to the user's preferences when a user query arrives. Thanks to this approach, the system can be visualized more accurately and errors can be reduced. The most commonly used methods are matrix factorization (MF) and singular value decomposition (SVD) [1].
In this study, when the CF method is used, the number of times a customer went to each hotel is taken as the score the customer gave to that hotel. To reduce the number of hotels, hotels selected fewer than 10 times in total by customers are removed from the dataset.
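Building such an interaction matrix from raw visit records can be sketched as follows. This is an illustrative outline in plain Python, not the study's implementation (which uses Excel pivot tables); the record format, the customer and hotel IDs, and the function name `build_interaction_matrix` are assumptions. The default threshold of 10 comes from the text, and the toy call lowers it to 2 only so that the small example shows the filter in action.

```python
from collections import Counter, defaultdict

def build_interaction_matrix(visit_log, min_hotel_selections=10):
    """Return {customer: {hotel: visit_count}} for sufficiently popular hotels.

    visit_log is a list of (customer_id, hotel_id) pairs, one per stay.
    Hotels chosen fewer than min_hotel_selections times in total are dropped.
    """
    hotel_totals = Counter(hotel for _, hotel in visit_log)
    kept_hotels = {h for h, n in hotel_totals.items() if n >= min_hotel_selections}

    matrix = defaultdict(Counter)
    for customer, hotel in visit_log:
        if hotel in kept_hotels:
            matrix[customer][hotel] += 1  # visit count acts as the score
    return {customer: dict(row) for customer, row in matrix.items()}

# Hypothetical visit records: c1 stayed at H1 twice, c2 at H1 and H2 once each.
visit_log = [("c1", "H1"), ("c1", "H1"), ("c2", "H1"), ("c2", "H2")]

matrix = build_interaction_matrix(visit_log, min_hotel_selections=2)
print(matrix)  # {'c1': {'H1': 2}, 'c2': {'H1': 1}}
```

Hotel `H2` is removed because it was selected only once in total, so it never appears as a column of the resulting matrix.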

Limitations of collaborative filtering
I. Cold-start problem: This problem refers to having insufficient information to give the user a recommendation. Collaborative filtering depends entirely on similar neighbors in the system, but in the first stage these neighbors are not yet present or known to the system; this is known as the cold-start issue. It reduces the performance of the recommendation system [7] and can be avoided with the hybrid approach [1].

II. First-rater problem: The system cannot recommend an item that has not been rated before. When new items enter the system, few users have interacted with them, so there are not enough ratings for these items. The problem can be solved with a hybrid approach.

III. Sparsity: Sparsity is a problem of lack of information. Collaborative recommendation systems often build users' neighborhoods from their profiles, and sparsity occurs when users do not rate items [3]. If a user has rated only a few items, it is difficult to determine their taste, and the user may be placed in the wrong neighborhood [1].

IV. Popularity bias: Sometimes a user has unique tastes compared to all other users of the system, and the system cannot recommend products to such a user. This is known as the "popularity bias" issue. It can be solved with a hybrid approach [1].

Hybrid Approach
For better results, some recommendation systems combine the collaborative and content-based approaches to take advantage of each. With the hybrid approach, limitations of the content-based and collaborative approaches, such as the cold-start problem, can be avoided. The two approaches can be combined in different ways:
• Applying both methods separately and combining the results.
• Incorporating some content-based features into a collaborative approach.
• Incorporating some collaborative features into the content-based approach.
Bobadilla et al. classified the combination of collaborative filtering and content-based filtering into four groups of hybrid methods, as shown in Figure 1 [9]. Figure 1.a shows the hybrid method combining CF and CBF with a weighting method. Figure 1.b shows methods that use CBF to extract features and pass the recommendation to CF.
Figure 1.c shows a combined model that uses CF and CBF to obtain the outputs of another classifier, such as a probability model.
Figure 1.d shows a model for CBF using output from CF. For example, user ratings can help CBF better identify users [9].
Probability methods are used in hybrid filtering; examples include genetic algorithms, neural networks, and Bayesian networks [8].
The SeturTech dataset includes user properties and user-hotel preferences. First, the user profiles are analyzed with RFM on a single data set. A user-hotel preference matrix is created from the same data. Then, a user-profile-based hotel preference matrix is created. Using the Manhattan distance method, a similarity-based recommendation system is built. For example, if user-2 is found to be the most similar to user-1, a hotel that user-2 visited but user-1 did not is recommended to user-1.
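The final recommendation step (find the most similar user, then suggest hotels that user visited and the target user did not) could look like the sketch below. The visit-count vectors, user and hotel IDs, and the function name `recommend` are assumptions for illustration; skipping users at distance 0 follows the filtering of identical customers described in the Manhattan Distance section.

```python
def manhattan_distance(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def recommend(target, visit_counts, hotel_names):
    """Return (nearest_user, hotels visited by that user but not by target)."""
    candidates = [
        (manhattan_distance(visit_counts[target], row), user)
        for user, row in visit_counts.items()
        if user != target
    ]
    # Users at distance 0 made identical choices and offer nothing new.
    candidates = [(d, user) for d, user in candidates if d > 0]
    if not candidates:
        return None, []
    _, nearest = min(candidates)
    suggestions = [
        hotel
        for hotel, mine, theirs in zip(
            hotel_names, visit_counts[target], visit_counts[nearest]
        )
        if mine == 0 and theirs > 0
    ]
    return nearest, suggestions

# Hypothetical user-hotel visit counts (columns follow the order of `hotels`).
hotels = ["H1", "H2", "H3"]
visit_counts = {
    "user_1": [2, 0, 0],
    "user_2": [2, 1, 0],
    "user_3": [0, 0, 3],
}

print(recommend("user_1", visit_counts, hotels))  # ('user_2', ['H2'])
```

Here user_2 is at distance 1 from user_1 and user_3 at distance 5, so user_2 is the nearest neighbor and hotel H2 is the suggestion.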

Manhattan Distance
Manhattan distance is a metric in which the distance between two points is calculated as the sum of the absolute differences between their Cartesian coordinates. In two dimensions, it is simply the sum of the absolute difference between the x-coordinates and the absolute difference between the y-coordinates. For two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$,

$$d(p, q) = \sum_{i=1}^{n} |p_i - q_i| \qquad (1)$$

so that, for example, ManhattanDistance[{a, b, c}, {x, y, z}] = |a - x| + |b - y| + |c - z|.
The Manhattan distance given in Equation (1) is used to calculate the similarity between customers. The calculation is done in Python. Each of the 8 groups is saved as a CSV file, and the Manhattan distance is calculated within it. Since customers with a smaller Manhattan distance between them are considered more similar, the Manhattan distance from each customer to every other customer is obtained. Customers whose Manhattan distance is 0 have gone to the same hotels, so zero distances are filtered out in the code, and the outputs are saved.
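A possible shape for this calculation is sketched below: read one group's CSV, compute the Manhattan distance between every pair of customers, discard zero distances, and keep the closest remaining customer. The CSV layout (a customer column followed by one column per hotel), the sample values, and the function names are assumptions; the study's actual script and file format are not given in the text.

```python
import csv
import io

# Hypothetical contents of one group's CSV file (visit counts per hotel).
SAMPLE_CSV = """customer,H1,H2,H3
c1,2,0,0
c2,2,1,0
c3,2,0,0
c4,0,0,3
"""

def load_vectors(csv_text):
    """Parse CSV text into {customer_id: [visit counts]}."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return {row[0]: [int(value) for value in row[1:]] for row in reader}

def manhattan_distance(a, b):
    """Equation (1): sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_neighbours(vectors):
    """Map each customer to (distance, closest customer), ignoring distance 0."""
    result = {}
    for customer, vec in vectors.items():
        candidates = [
            (manhattan_distance(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != customer
        ]
        candidates = [(d, other) for d, other in candidates if d > 0]
        result[customer] = min(candidates) if candidates else None
    return result

nearest = nearest_neighbours(load_vectors(SAMPLE_CSV))
print(nearest["c1"])  # (1, 'c2'): c3 is identical to c1 and is skipped
print(nearest["c4"])  # (5, 'c1')
```

With a real file, `load_vectors` would receive the text read from disk instead of the embedded sample. Ties are broken by customer ID here because tuples are compared element by element, which is a choice of this sketch rather than something the text specifies.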

RFM Analysis of Customer Features
RFM analysis is based on the following assumptions:

1. Customers who have made recent purchases are more likely to purchase again than customers who have not purchased recently.
2. Customers who make more frequent purchases are more likely to repurchase the company's products.
3. Customers with a higher total purchase amount are more likely to purchase again.
For the RFM analysis of the SeturTech dataset, each customer's invoice number, sales date, and sales amount are used. A customer may have made more than one purchase. Since RFM is about customers, sales amounts are aggregated by customer ID.
The data are grouped with the pivot table feature of Excel, which supports operations such as sorting, summing, and averaging.
In the pivot table, customers are added to the "rows" field, so each customer appears only once.
For the "Recency" calculation, it is necessary to know when each customer last visited the site. The maximum of the "Sales Date" column is selected in the pivot table, which gives each customer's last visit.
For "Frequency", information about how often the customer visits the website is needed, so the number of invoices belonging to the same customer is counted.
"Monetary" requires the customer's total purchase amount. The total amount spent by the same customer is calculated with the pivot table.
To calculate RFM scores, points from 1 to 5 are assigned (e.g., the frequency value of a frequent customer should be 5). Recency, frequency, and monetary matrices are created with formulas in Excel. For recency, for example, the customer whose last visit was longest ago takes the value 1.
The time elapsed since the last visit is divided into groups according to percentiles. The most recent visitors fall beyond the 80% slice and take the value 5. Likewise, the customer who spends the least money gets 1 as the monetary value. The RFM scores of all customers are determined from these calculations.
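The per-customer aggregation described in this section (latest sales date, number of invoices, total amount) is done with an Excel pivot table in the study; an equivalent sketch in plain Python is given here for readers who prefer code. The sample rows, invoice numbers, reference date, and the function name `rfm_table` are hypothetical.

```python
from datetime import date

# Hypothetical sales rows: (customer_id, invoice_no, sales_date, amount in TL).
sales = [
    ("c1", "inv-001", date(2021, 5, 1), 1200.0),
    ("c1", "inv-002", date(2021, 8, 15), 800.0),
    ("c2", "inv-003", date(2020, 7, 10), 3000.0),
]

def rfm_table(rows, reference_date):
    """Aggregate raw sales rows into recency, frequency, and monetary values."""
    grouped = {}
    for customer, invoice, sold_on, amount in rows:
        entry = grouped.setdefault(
            customer, {"last": sold_on, "invoices": set(), "monetary": 0.0}
        )
        entry["last"] = max(entry["last"], sold_on)  # latest sales date
        entry["invoices"].add(invoice)               # distinct invoices
        entry["monetary"] += amount                  # total spend
    return {
        customer: {
            "recency_days": (reference_date - entry["last"]).days,
            "frequency": len(entry["invoices"]),
            "monetary": entry["monetary"],
        }
        for customer, entry in grouped.items()
    }

rfm = rfm_table(sales, reference_date=date(2021, 9, 1))
print(rfm["c1"])  # {'recency_days': 17, 'frequency': 2, 'monetary': 2000.0}
```

The reference date plays the role of "today" when recency is measured in days; the text does not state which date the study used, so the one above is purely illustrative.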

Customer Segmentation
To achieve its business goals, a company can use customer segmentation to target its marketing efforts and resources to valuable and loyal customers [18].
To prepare a recommendation system using collaborative filtering, customers are divided into specific groups. Since applying this system to customers in the lost-customer class would not yield efficient results, focus groups need to be determined. For this purpose, customer profiles are determined first. The focus groups are selected according to the recency, frequency, and monetary scores assigned by the RFM analysis. Thus, 6 different customer groups emerged.
Customers in the SeturTech data are divided into the following six segments:
1) Loyal
2) Potential Loyal
3) Promising
4) Hesitant
5) Need Attention
6) Detractors
This customer segmentation is determined according to the RFM scores given in Table 3. The loyal, potential loyal, and promising groups consist of active customers who visited the company recently, shop from the company frequently, and spend a lot when shopping.
Hotel scores created by collaborative filtering are added to the customers in each group. Hotel features and sales data features that are not needed for this study are removed from these two data sets.
The following features of the sales dataset, which do not affect the results of the analyses, are not considered in this study:
1) Currency code and foreign currency sales amount columns are deleted, and comparisons are made in TL.
2) The names of districts and towns are removed, leaving only the cities.
3) Branch name, code, and type are removed.
4) Sales, entry, and exit dates and the number of overnight guests are excluded because they will not be used in the classification and RFM methods.
Of the hotel features, 19 that contain relatively little data are not considered in this study, and the following 8 features are used:
1) Child-Baby Friendly
2) Pool
3) Pool-Summer
4) Other Beach
5) Spa-Thermal Hotel
6) Sauna-Turkish Bath
7) Fitness
8) Room Breakfast
From these two data sets, only the data to be used in this study are retained; the sales data are sorted by customer code, and the hotel feature data are kept to be combined with the customers' choices.
After the hotels selected by each customer code are determined, the 1 or 0 value of each feature of the selected hotel is looked up from the hotel feature data set with "=VLOOKUP()" in Microsoft Excel.
The IFERROR formula reveals that some of the hotels selected by customers are not in the hotel feature list; these are deleted from the list.
For content-based filtering, the average value of each feature over the hotels selected by a customer is taken with the pivot table, and this average is assigned as the value the customer gives to that hotel feature. A brief section of the result matrix is given in Table 6.

Data Analysis Results by Using Collaborative Filtering Method
For collaborative filtering, the score given by each customer to each hotel is needed. The number of visits by a customer to a hotel is taken as the score given by that customer to that hotel, and the result matrix is obtained. Additionally, to increase reliability, customers whose selected hotels do not appear in the "hotel features" dataset (whole row is 0) are removed.
In applying the collaborative filtering method, the total number of selections of each hotel is calculated with the pivot table, and hotels with fewer than 10 visits by customers are excluded from the matrix in order to obtain clearer and more reliable results.
The result is a matrix with customers in the rows, hotels in the columns, and the scores given by the customers to each hotel as entries, as seen in Table 7.

Data Analysis Results by Using RFM Method
Since the values calculated with RFM analysis are recency, frequency, and monetary, the following values from the 2 separate data sets are used:
1) Customer code
2) Billing information
3) Sales date
4) Sales price
To analyze these values, a pivot table is used with the customer code as the row field, and the date of the last sale, the number of invoices issued, and the total sales price are determined for each customer.

Recency Scores:
A recency matrix is created to find the customers' recency scores. The recency matrix is divided into percentiles with the PERCENTILE.INC formula. The customer whose last visit was longest ago is identified and scored 1; the matrix is then divided at the following percentiles, in order: 20, 40, 60, and 80. The recency scores are assigned as in Table 8. According to the recency matrix (Table 8), R-Scores are assigned to customers with the VLOOKUP formula.
Examples of the R-Scores entered into the RFM matrix are shown in Table 11.
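The percentile-based recency scoring can be approximated in Python as follows. The helper `percentile_inc` mimics the linear interpolation of Excel's PERCENTILE.INC; the sample data, the function names, and the tie rule (a value must strictly exceed a cut point to fall into the next slice) are assumptions, since the exact lookup table of the study (Table 8) is not reproduced in the text.

```python
def percentile_inc(values, percent):
    """Inclusive percentile with linear interpolation, like PERCENTILE.INC."""
    data = sorted(values)
    rank = percent * (len(data) - 1) / 100  # 0-based fractional position
    lower = int(rank)
    if lower + 1 >= len(data):
        return float(data[-1])
    fraction = rank - lower
    return data[lower] + fraction * (data[lower + 1] - data[lower])

def recency_score(days, cut_points):
    """Score 5 for the most recent visitors down to 1 for the least recent.

    Assumed tie rule: a value must strictly exceed a cut point to drop a level.
    """
    exceeded = sum(days > cut for cut in cut_points)
    return 5 - exceeded

# Hypothetical days-since-last-visit values for 11 customers: 10, 20, ..., 110.
days_since_last_visit = list(range(10, 120, 10))
cuts = [percentile_inc(days_since_last_visit, p) for p in (20, 40, 60, 80)]

print(cuts)                      # [30.0, 50.0, 70.0, 90.0]
print(recency_score(10, cuts))   # 5: visited most recently
print(recency_score(110, cuts))  # 1: last visit was longest ago
```

Frequency and monetary scores can be derived the same way, except that larger values should receive higher scores, so the score would be `1 + exceeded` rather than `5 - exceeded`.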
Frequency Scores:
The PERCENTILE.INC formula is also used to assign frequency values to customers. The customer who visits the company least often gets 1 as the frequency value. After the customer with the fewest visits is identified with the pivot table, the frequency values are divided at the following percentiles, in order: 20, 40, 60, 80, 90, and 95. The frequency matrix is given in Table 9. As seen in Table 9, the frequency value is 1 up to the 80% slice, which means that at least 80% of customers shopped from the company only once.
According to the frequency matrix (Table 9), F-Scores are assigned to customers with the VLOOKUP formula.
Examples of the F-Scores entered into the RFM matrix are shown in Table 11.

Monetary Scores:
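According to the total sales amount obtained with the pivot table in the RFM table, the monetary matrix is created with the PERCENTILE.INC formula and monetary scores are assigned. The results are given in Table 10. According to the monetary matrix (Table 10), M-Scores are assigned to customers with the VLOOKUP formula. Examples of the M-Scores entered into the RFM matrix are shown in Table 11.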

Customer Similarities
For each focus group, the Manhattan distance is calculated in Python and the customers closest to each other are determined. Customers who have made exactly the same choices are removed before the outputs are obtained. Table 12 shows an example of the customers who most closely resemble each customer.

Conclusions
Recommendation systems are an area of industrial practice with content-based, collaborative, and hybrid approaches that help companies increase growth and productivity.
Recommendation systems provide access to personalized information on the web and have progressed in the last 10 years.
They have created new options for information search and filtering. For instance, online stores have increased their profits and music lovers have discovered new songs.
Besides the positive effects of recommendation systems on customers, there are some limitations and deficiencies. This paper has reviewed three recommendation approaches in detail, as well as Recency, Frequency, and Monetary analysis.

Figure 1. (a) Hybrid method combining CF and CBF with a weighting method.

Figure 2. Hybrid approach to SeturTech Data Set

Table 1. User-Hotel Features Example

Table 2. User Profiles with CB

Table 3. RFM Scores of Customer Profiles

Table 5. A Brief Example from the Sales Data Set

Table 6. A Brief Example of Result Matrix of Content Based Filtering

Table 7. A Brief Example of Result Matrix of Collaborative Filtering

Table 10. Monetary matrix, according to which M-Scores are assigned to customers by the VLOOKUP formula. Examples of the result of the M-Score processed into the RFM matrix are shown in Table 11.

Table 11. Recency, Frequency, and Monetary Scores of the Customers

Table 12. Output of Manhattan Distances