Improvement of Football Match Score Prediction by Selecting Effective Features for Italy Serie A League

Football is one of the most popular sports in the world in terms of number of fans, a popularity that arises in part from the unpredictable nature of the game. People become increasingly attached to the sport because of the excitement and joy it creates. Match result prediction is a very challenging problem, and attempts to solve it have recently become very popular. Machine learning methods are used to predict both the result of this unpredictable game and the in-match events that affect it. This study presents our work on finding the most effective features for match result prediction, using match statistics from 2027 Italian Serie A League matches played between the 2014-2015 and 2019-2020 seasons, with 54 features for each match. Feature selection tests were conducted to predict the result of a football match and to determine the most important factors. Features were selected using the ANOVA method, and 28 of the 54 features were determined to be effective in predicting match results. Classification was then performed with logistic regression, and fairly high success rates were achieved: 88.85% with all features and 89.63% with the 28 selected features. These results suggest that feature selection increases success in match result prediction.


Introduction
In recent years, football has continued to attract the attention of people of all ages and of different social and cultural backgrounds, and it remains one of the sports with the largest number of spectators and fans worldwide. Outcome prediction in football is regarded as a rather difficult problem because many of the factors that affect the result cannot themselves be predicted. There are a great many football teams at different levels in regional and national leagues in all countries of the world [1]. Success in a single match does not mean that a team will be successful in its other league matches, and a good team can sometimes be defeated by a weaker one. Ball possession, shots on goal, fouls, corner kicks and many other factors that arise during the match affect its outcome [2]. Although these factors make results difficult to predict, researchers in academia and industry have sought to obtain good results in football match prediction [3]. Various machine learning techniques and statistical methods are used to predict results and to analyse the factors affecting the outcome of a match [4,5]. Over the history of football, thousands of matches have been played, and a large amount of statistical data about these matches can be accessed through sports websites.
In one study, a naive Bayesian method was used to predict the results of Tottenham Hotspur Football Club for the period 1995-1997. The authors noted that the Bayesian network outperformed other machine learning techniques such as nearest neighbours and decision trees. By focusing on a specific team and a specific time period, they achieved an error rate of 40.79% [6].
Another study, which used Bayesian networks to predict the results of matches played by the Spanish team Barcelona, took into account weather conditions, the psychological state of the players and whether any of the main players were injured. It achieved a 92% accuracy rate over 20 matches for a single season and a single team [7]. In another study, which used an ordered probit regression model to predict the outcome of a football match, the authors showed that less obvious factors, such as the distance the away team travelled to the match, had an effect on the outcome. However, that study focused on the economic gains and price efficiency of the fixed-odds betting market rather than on match prediction itself [8,9].
New approaches based on artificial neural networks are being applied to unravel the secrets of football. In some studies, artificial neural networks were even reported to be the best predictive model among machine learning methods [10]. However, over the years this claim has been superseded by further developments in neural networks. In one study, outcome prediction was performed with convolutional neural networks, a deep learning method, using images taken during the match [11]. In another study, deep neural networks were used to predict football match outcomes [12]. Match results from different leagues have also been predicted with various machine learning methods. On match data of English Premier League teams, multiple linear regression, artificial neural network (ANN) and discriminant function analysis [13], Bayesian network, expert Bayesian network, decision tree and k-nearest neighbour [6], and ANN [14] methods have been used for match result prediction.
Football match result prediction is a multi-class classification problem, and most studies use either 2 or 3 classes. The 2-class formulation uses the classes home win and away win; the 3-class formulation uses draw, home win and away win. In one study that collected match data from various leagues, Long Short-Term Memory neural network (LSTM NN) classification and LSTM NN regression methods were used, and different results were obtained with different numbers of classes: 70.2% success in tests with 2 classes and 52.5% in tests with 3 classes [15]. Studies that predicted 2-class match results were observed to obtain higher success rates than those using 3 classes [16][17]. In the English Premier League, 69.5% success was achieved with a 2-class classification using 4 features and the logistic regression method on data from 2280 matches. When the studies in the literature are examined in detail, three factors are seen to affect prediction success: the number of matches in the league data, the number of features and the method used. As for the number of matches, there are studies that use data from more than 200 thousand matches [18][19][20] as well as studies that use fewer than 100 matches [16,6,21]. The most important factor in predicting match results is the set of events that occur during the match. The statistics obtained from these unpredictable events are used as features in the classification problem. Some of these features may be meaningless for match result prediction. Such features are removed from the feature set, and the remaining meaningful data are given as input to the classification algorithms. The literature includes studies that use more than 100 features [22], [15] as well as classifications made with only 4 features [23,19].
The appropriate set of features can vary with the data of each league, and there is no standard [1]. Trial and error or data simplification methods can be used to find the most effective features. Studies using 3 classes on the same data obtained similar results with different methods and numbers of features: 52.4% classification success with the XGBoost regression method using 66 features [18], 51.5% with the hybrid Bayesian network method using 4 features [19], and 51.9% with the k-nearest neighbour method using 8 features [20]. Based on this information, the feature selection phase emerges as an extremely important factor in match result prediction.
In a study covering the period 1997-2003, tests were carried out with the multiple logistic regression method on match records from the Australian Football League. The authors achieved 66.7% accuracy and stated that the key variables were the team's offensive strength, home advantage, distance travelled and familiarity with the ground [24].
Another study used a Bayesian hierarchical model to predict the results of matches played in the Serie A League in the 1991-1992 season. It showed that the most effective features for prediction are the home advantage, team attack and team defence variables [25]. In a similar study, the most effective features were shown to be attack and defence [26].
In another study, which used ANN and logistic regression, 95% success was achieved in match result prediction with match data from the 2014-2015 English Premier League season. The authors showed that the most effective features in the classification were the home and away teams, goals, shots, corners, odds, attack strength, players' performance index, managers' performance index, managers' wins and teams' win streaks [27]. Feature selection has also been performed in the literature using entropy, probability distributions and various algorithms [28].
This study uses football match data from the Serie A League, the top level of the Italian football leagues, which consists of 20 teams. The matches were played between the 2014-2015 and 2019-2020 seasons, and the data are used to answer the question of which features are the most effective for predicting match results. The features affecting the match result were determined using the logistic regression method on a dataset of 2027 football matches, each described by a total of 55 attributes: 54 features and one match result. The dataset was created by web scraping. With this dataset, tests were carried out by using ANOVA to select, from the 54 features that affect the outcome of the match, those most effective for result prediction.
The rest of the article is organized as follows. Section 2 provides information about the dataset used, its features, and the methods we use. Section 3 presents the tests carried out to determine the effective features and predict results. Section 4 gives a performance analysis of the results obtained.

Material and Methods
This section gives information about the data used in the study. The data in the dataset underwent several processing steps before being used: match data from all seasons were combined and checked for missing values. After these operations, the effective features of each match were selected. Following feature selection, classification was performed to predict the match result. Classification was carried out separately for the selected features and for all features, and the test results are shown in Section 3. The ANOVA method used for feature selection and the logistic regression method used for classification are also introduced in this section. The operations carried out are shown as a flow diagram in Figure 1.

Figure 1. Match score prediction flow diagram
In this study, the analysis and classification processes were carried out in MATLAB. A computer with an Intel i5-10200H CPU, 8 GB of RAM and a GTX 1650 Ti graphics card was used.

Dataset
The dataset includes data from matches played in the Italian Serie A League. The data were collected by web scraping, which is a useful method for collecting data for research [29]. Match data were taken from the scoreboard.com website [30] by this method. The dataset includes a total of 2027 football matches, 54 features for each match, and a match result class accompanying these features. Table 1 contains the list of features in the dataset. Besides the 54 features in the dataset, the match results are also an important factor in outcome prediction. The proportions of match results for the 2027 matches in the dataset are shown in Figure 2. Draw: the match ends with an equal score. Home Win: the home team scores more goals than the away team. Away Win: the away team scores more goals than the home team.

Feature Selection
Feature selection is the set of operations used to select the features relevant to solving a problem, discard unnecessary ones, and thereby increase classification success. Large amounts of data are often used in an effort to increase classification accuracy, but it is quite difficult for algorithms to work with large datasets. Therefore, irrelevant features are discarded and pre-processing steps are applied to reduce the number of features and the amount of data. With a correct selection of features, learning speed can usually be increased, and classification success can be improved relative to the amount of data [28].
The ANOVA (analysis of variance) method was used for feature selection in this study. ANOVA is used to analyse how independent variables interact among themselves and how these interactions affect the dependent variable [31].
The dependent variables here are the features obtained from match statistics, while the independent variable is the match result.

Logistic Regression
Logistic regression is one of the statistical models most frequently used in studies. In logistic regression, the dependent variable is estimated from one or more independent variables, and the model describes the relationship between the dependent and independent variables. The variables do not need to be normally distributed [32]. The values predicted by logistic regression lie between 0 and 1 because they are probabilities: logistic regression predicts the probability of each outcome, not the outcome itself [33].
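The point that logistic regression outputs probabilities rather than outcomes can be illustrated with a minimal Python sketch. It assumes a multinomial (softmax) formulation for the three classes draw, home win and away win; the text does not state which multi-class formulation was used in the study, and the weights, biases and feature values below are made-up illustrative numbers, not fitted parameters.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(features, weights, biases):
    """Class probabilities of a multinomial logistic regression model.

    weights holds one row of coefficients per class, biases one intercept
    per class. Both are assumed to come from a previously fitted model.
    """
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical two-feature example with classes ordered [draw, home, away].
example_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
example_biases = [0.0, 0.0, 0.0]
example_probs = predict_proba([2.0, 0.0], example_weights, example_biases)
predicted_class = max(range(3), key=lambda i: example_probs[i])
```

The predicted class is simply the one with the highest probability; every probability stays strictly between 0 and 1, which matches the bound described above.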

ANOVA (Analysis of Variance)
ANOVA is a statistical analysis method used to test whether the mean values of more than two groups in a dataset are equal. The dataset is divided into groups according to the label assigned to each value, and the mean values of these groups are compared [34].
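As a rough illustration of the group-mean comparison described above, the following Python sketch computes the one-way ANOVA F statistic of a single feature grouped by match result and ranks features by that score. It is only a sketch under stated assumptions: the study itself was carried out in MATLAB, the criterion or threshold used to keep 28 features is not given in the text, and the feature names and values here are hypothetical.

```python
from collections import defaultdict

def anova_f_score(values, labels):
    """One-way ANOVA F statistic of one feature, grouped by class label."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(float(v))
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more samples than groups")
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (means[y] - grand_mean) ** 2 for y, g in groups.items())
    # Variation of the samples around their own group mean.
    ss_within = sum((x - means[y]) ** 2 for y, g in groups.items() for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    if ms_within == 0:
        return float("inf") if ms_between > 0 else 0.0
    return ms_between / ms_within

def rank_features(columns, labels):
    """Return feature names sorted from highest to lowest F statistic."""
    scores = {name: anova_f_score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: nine matches, three per result class.
results = ["draw"] * 3 + ["home"] * 3 + ["away"] * 3
toy_columns = {
    "noise": [1, 2, 3, 1, 2, 3, 1, 2, 3],       # same mean in every class
    "shots_home": [1, 2, 3, 2, 3, 4, 5, 6, 7],  # class means differ
}
ranking = rank_features(toy_columns, results)
```

A feature whose group means differ strongly relative to its within-group spread receives a large F value and is ranked first; a feature with identical group means receives an F value of zero.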

Confusion Matrix
Classification performance is measured with a confusion matrix, which records the correctly and incorrectly recognized instances for each class [35]. An example confusion matrix for a two-class classifier is given in Table 2 [36]. Performance criteria for classification methods are provided with their formulas in Table 4 [37][38][39]. These performance criteria give the accuracy rates of the classification as percentages.
The main aim of the experiments is to select the most effective features derived from match statistics for result prediction and to examine the contribution of these features to classification success. The logistic regression method was used for classification. The classification result obtained using all the features in the dataset was compared with the classification result obtained with the selected features. In this way, it is envisaged that faster classification can be achieved on large datasets by considering only these effective features.
First, data pre-processing was performed to ensure that the data could be processed smoothly. Feature selection was then carried out with ANOVA. However, it was observed that for some statistics only the home team's feature or only the away team's feature was selected. To ensure data integrity, the features coloured in Table 4 are also treated as selected features and added to the list, so that every feature in the table is included for both teams. This outcome holds for the dataset used here and may differ for other datasets. The features selected with ANOVA are shown in Table 5. All 2027 football matches in the dataset, each with 54 features and the match result class, were then classified by logistic regression. The results of this classification are shown in the confusion matrix in Table 6. Calculations based on the data in Table 6 show that 77.90% of draws, 93.31% of home wins and 91.51% of away wins were correctly classified. The overall classification success was 88.85%. Classification success for draws is lower than for home or away wins because draws are difficult to classify: the match statistics of drawn matches resemble those of the other outcomes, and draws account for 25.46% of the 2027 matches.
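The per-class percentages and the overall success rate reported above are the kind of quantities that can be read off a confusion matrix. The Python sketch below shows one way to do this, assuming that rows are actual classes and columns are predicted classes, and that the overall success is the share of all matches classified correctly. The counts are made up for illustration and are not the values of Table 6.

```python
def per_class_recall(cm):
    """Share of each actual class (row) that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

def overall_accuracy(cm):
    """Share of all instances that lie on the diagonal of the matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

# Hypothetical counts; rows and columns ordered [draw, home win, away win].
toy_cm = [
    [80, 10, 10],  # actual draws
    [5, 90, 5],    # actual home wins
    [4, 6, 90],    # actual away wins
]
toy_recall = per_class_recall(toy_cm)
toy_accuracy = overall_accuracy(toy_cm)
```

Because the classes have different sizes, the overall accuracy is a weighted rather than a simple average of the per-class rates.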
Data simplification can increase classification success in some cases and decrease it in others. Whether classification success is high after the important features have been selected depends on how much the effective and non-effective features each contribute to classification. However, it should not be forgotten that it cannot be concluded that non-effective features never affect the outcome of a match, because the outcome of a football match and the events during it are unpredictable. The features identified as non-effective are non-effective for the dataset used in our study.
Some of the selected effective features were not selected for both teams. To ensure data integrity, the missing counterpart features were added so that each selected statistic is present for both teams, which increased the number of selected features to 28. These selected features were given as input to the logistic regression method, and the match outcome was predicted as the output. The confusion matrix of the results obtained is shown in Table 7. The success achieved by classification with the selected features is 89.63%. Based on this result, it can be said that non-effective features adversely affect classification success.
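The completion step described above, in which a feature selected for only one team is also added for the other, could be sketched in Python as follows. The `_home` and `_away` suffix naming is a hypothetical convention chosen for illustration; the actual feature names of the dataset are listed in Table 1.

```python
def complete_home_away_pairs(selected, home_suffix="_home", away_suffix="_away"):
    """Add the missing home or away counterpart of every selected feature.

    The original selection order is kept and added counterparts are appended.
    """
    completed = list(selected)
    seen = set(selected)
    for name in selected:
        if name.endswith(home_suffix):
            partner = name[: -len(home_suffix)] + away_suffix
        elif name.endswith(away_suffix):
            partner = name[: -len(away_suffix)] + home_suffix
        else:
            continue  # feature is not team-specific, nothing to pair
        if partner not in seen:
            completed.append(partner)
            seen.add(partner)
    return completed

# Hypothetical selection in which two statistics were chosen for one team only.
toy_selected = ["yellow_card_home", "recoveries_away", "shots_home", "shots_away"]
toy_completed = complete_home_away_pairs(toy_selected)
```

In this toy example the two unpaired statistics gain their counterparts, while the already paired `shots` features are left unchanged.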
Overall classification success increased by 0.78 percentage points in the classification with selected features. Classification success was 79% for matches ending in a draw, 94% for home team wins and 92% for away team wins. The biggest increase in classification success was observed for matches ending in a draw, and the increase in overall classification success can be attributed to this improvement in the classification of draws, because draws are the most difficult outcome in match result prediction and negatively affect classification results. Classification success by number of features is shown in Figure 4.
In this study, the effective features in the dataset were selected using data simplification methods. Imbalances can occur when certain features that do not contribute to classification are removed. For example, the yellow card feature of the home team is an effective feature in classification, while the yellow card feature of the away team may not be. For the sake of data integrity, however, 3 features that were not found to be effective (yellow card away, recoveries home, and block crosses home) were added and used as input data in the classification. As a result of the tests, the success rate was 88.85% in the classification performed with all 54 features and 89.63% in the classification performed with the 28 features remaining after feature selection. As can be seen in other studies in the literature, the match result can be treated as a 2-class problem in order to increase classification success. However, considering only the cases in which the home or the away team wins does not correspond to real life. Of the 54 features used in the classification, 26 were removed because they do not contribute to the classification.
It can be said that the feature selection carried out in this study was successful in terms of match result prediction.