Sales History-based Demand Prediction using Generalized Linear Models

For commercial enterprises, accurately predicting demand from existing sales data is important. To increase profitability, companies need to be able to perform predictive analytics as part of their decision support systems. In predictive data analytics, regression models are used to predict a numerical dependent variable such as sales amount. In this category, linear models are simple and easy to interpret, and they also permit generalization to a very powerful and flexible family of models called generalized linear models (GLM). The generalization potential over simple linear regression is twofold: first, GLM relaxes the assumption of normally distributed error terms; second, it does not restrict the link function between the set of predictor variables and the dependent variable to the identity function. In this work, the sales amount prediction problem is modeled with GLM. To optimize the model fit, an exploratory analysis of a company's sales data was performed, and the distribution of the dependent variable, sales amount, was found to be a gamma distribution. Consequently, the inverse link function, the canonical link for a gamma-distributed dependent variable, was used. Experimental results are compared with other regression models and classification algorithms, using the MSE and AIC metrics for model selection. The results show that GLM outperforms linear regression. Among the classification algorithms, random forest and GLM were the top performers. Furthermore, categorization of the predictor variables was found to improve the model fit.


Introduction
Demand prediction is a vital activity for commercial companies. Companies should manage current resources well and plan for future needs in order to keep their competitive advantage and reduce costs. The uncertainty of the future makes prediction hard. There are various methods for demand prediction, yet the research community is still in search of more effective prediction techniques.
Sales demand prediction can vary with the prediction horizon (short, intermediate, or long range), the characteristics of the good (durable or not), the type of response variable (binary, categorical, or numerical), and the model choice (parametric vs. nonparametric).
Among existing models, linear models are simple and easy to interpret, yet permit ready generalization to very powerful and flexible families of models. Generalized linear models (GLM) [1] represent such a family. In this work, we predict demand by making a novel adaptation of GLM for unique company data.
To clarify the idea, it's useful to explain simple linear regression in conjunction with GLM: Simple linear regression is commonly used to predict a numerical response variable like sales amount. It has some assumptions to simplify the theory of analysis. One of the assumptions is regarding error terms. Linear regression models assume that the error terms are normally distributed. The second assumption is such that response variables are independent normal random variables.
In some real world applications, error terms and response variables may not have normal distribution. In that case GLM can be used instead of linear regression models. GLM relax the assumption of normally distributed error terms. Moreover, GLM can be used for predicting the expected value of a response variable which has a distribution from the exponential family. Whenever the response variable is no more normally distributed, a constant change in a predictor variable does not lead to a constant change in the response variable. Thus, the relationship between the set of predictor variables and the response variable could be represented by a set of link functions rather than the identity function.
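As a small numeric illustration of this point (with made-up coefficients), under the inverse link a constant step in a predictor produces varying steps in the expected response:

```python
import numpy as np

# Inverse link (canonical for the gamma family): 1/E[Y] = a + b*x,
# so E[Y] = 1/(a + b*x). Coefficients below are hypothetical.
a, b = 0.5, 0.1
x = np.array([1.0, 2.0, 3.0, 4.0])
mu = 1.0 / (a + b * x)       # expected response under the inverse link

steps = np.diff(mu)          # change in E[Y] per unit step in x
print(steps)                 # the steps differ, unlike under the identity link
```

Under the identity link, `steps` would all be equal to `b`; here each unit increase in `x` changes the expected response by a different amount.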
GLM can provide a solution for different types of response variable distributions. For a binary response variable, the two popular link functions are logit and probit. In demand prediction, besides estimating the amount of demand, estimating the presence of demand can be crucial as well. Linear models that use the logit link function to predict the probability of demand are therefore in common use. [2] compares and contrasts the probit and logit link functions through an example case. [3] performs an empirical study on the cigarette demand problem, which is traditionally modeled as a mixed distribution: a logit specification to predict the decision to smoke and OLS to estimate smoking intensity. He tried to model smoking intensity in a population, fitting the problem both with the ordinary least squares (OLS) method and with GLM. The results were compared to understand the magnitude of the prediction bias caused by omitting error terms during data transformation. They show that OLS overestimates the effect of price on cigarette demand compared to GLM: OLS applies a logarithmic transformation to the response variable itself, whereas GLM applies the logarithm to the expected value of the dependent variable. In other words, OLS with a logarithmic transformation carries a constant-variance assumption that does not hold in this setting, while GLM, with its non-constant variance assumption, performs better.
[4] conducts a study to predict voting behavior for Obama or Romney in the 2012 American national election. The study is based on a logit model to evaluate the dichotomous dependent variable. The method performs well when the data set is sufficiently large.
We have five years' (2010-2014) sales data for a cooling company. The data set consists of the variables sale amount, date of sale, item price, and air temperature.
In our sales demand prediction problem, the response variable, sale amount, is found to be gamma distributed. Thus, GLM with a gamma-distributed dependent variable is used. The canonical link function in this case is the inverse link, and it is also found to perform best in our experiments.
In our solution scheme, the combinations of three different predictor variables, "Days", "Temperature", and "Price", are analyzed. Then, the predictor variables are transformed into categorical variables to investigate the effect of categorization. The categorical predictor variables give better fitting results than their non-categorical counterparts. Thus, categorization provides more accurate prediction, mostly due to variance reduction on the predictor variables.
When the GLM result is compared with the other predictive data analysis techniques, our findings are as follows: Within the scope of regression techniques, GLM gives better fitting results than the linear model. Within the scope of classification techniques; in single predictor variable cases Random Forest is the best, but when "price" predictor variable is used in conjunction with other variable(s), GLM outperforms the others. The model fitting results are evaluated with respect to MSE and AIC metrics.
Our contribution is twofold. Although GLM is a long-established modeling technique, its use in data mining is relatively limited, almost restricted to logistic regression. In fact, GLM comprises a set of models that can be configured with respect to the inherent characteristics of the data; this is the source of its generality. In our work, we used unique company data for sales demand prediction and adapted GLM to the data distribution (GLM with a gamma-distributed dependent variable). Moreover, we performed a comparative analysis against the linear model and other data mining algorithms, considering the effects of feature selection and categorization.

Material and Method
In this section, we first explain GLM along with its adaptation to the unique company sale demand prediction problem. Then, we describe the classification and regression algorithms in our comparison base. Finally, we briefly describe model evaluation with respect to the sampling techniques (cross-validation and hold-out sampling) and the evaluation metrics, MSE and AIC.

GLM
Linear regression models make some assumptions to simplify the theory of analysis. One assumption concerns the error terms: linear regression models assume they are normally distributed. The second is that the response variables are independent normal random variables [5]. In some real-world applications, error terms and response variables may not be normally distributed. In that case generalized linear models (GLM) can be used instead of linear regression models. GLM relax the assumption of normally distributed error terms. Moreover, GLM can be used for predicting the expected value of a response variable that has a distribution from the exponential family, provided the individual values of the response variable are independent of each other. The link function is the GLM component that connects the parameters of the response variable distribution with the linear model [1]. So, if an appropriate link function exists for fitting GLM, its goodness of fit may exceed that of linear regression models. In other words, the issue is to find the functional form-link function pair that is in accordance with the distribution of the left-hand-side variable's expected value.
The gamma distribution, which is a member of the exponential family, is widely used to model physical quantities that take positive values. Sale amount is such a quantity and can be modeled as a random variable Y ∼ Gamma(α, β), where α is the shape parameter and β is the scale parameter. Our model fitting results confirm that the sale amount distribution is best fit by a gamma distribution. The probability density function of the gamma distribution is

f(y; α, β) = y^(α−1) e^(−y/β) / (Γ(α) β^α),  y > 0.

The expected value and variance are

E[Y] = αβ,  Var(Y) = αβ².

As seen from these formulas, when the scale parameter β is not varying much, the expected value of the response variable depends mainly on the shape parameter α. The task then becomes finding an appropriate link function connecting the mean to the right-hand-side linear model. In accordance with this, the canonical parameter for the gamma distribution is −1/μ. If the link function is chosen to be the function expressing the canonical parameter of the distribution being used as the linear sum, the fit becomes better. So the canonical link function is the inverse link [6].
The two common link functions for GLM are the inverse link (canonical) and the log link, where the former takes the inverse of the expected value of the response variable while the latter takes its logarithm. We used the inverse (canonical) link for GLM fitting, as it gave the better performance (Equation 4):

1/E[Y] = a + bX1 + cX2    (4)

In Equation 4, a + bX1 + cX2 is our functional form, X1 and X2 are the predictor variables "Day" and "Temperature", and the expected value of the response variable Y (sale amount) is connected to the functional form through its inverse, i.e. E[Y] = 1/(a + bX1 + cX2).
In [6], GLM with a Gamma-distributed dependent variable is analyzed through different combinations of link functions and functional forms.
Two alternative GLM fittings are proposed there:
• identity link with an inverse term in the functional form, which can be represented as E[Y] = a + b(1/X1) + c(1/X2);
• inverse link with an inverse term in the functional form (log-link equivalent), which can be stated as 1/E[Y] = a + b(1/X1) + c(1/X2).
We applied all these variations. As stated before, the inverse link function along with the standard functional form performed best (Equation 4).

GLMNet
GLMNet is a regularized version of GLM. Regularization counters overfitting by adding penalty terms to the cost function of the learning model; these terms push the model parameters toward a prior value. In the case of GLMNet there are two such terms, the ℓ1 penalty (the lasso) and the ℓ2 penalty (ridge regression).
The target problems in this context are linear, two-class logistic, and multinomial regression.
The ℓ1 (lasso) penalty is used to obtain a sparse solution for regression models, while ridge regression (ℓ2) shrinks the estimated coefficients by penalizing their magnitude. The mixture of the ℓ1 and ℓ2 penalties is known as the elastic net regularized regression method, which is GLMNet [7].
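A minimal sketch of elastic net regression, here via scikit-learn's ElasticNet on made-up data; the l1_ratio parameter mixes the ℓ1 and ℓ2 penalties:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
# Only the first two features matter; the rest are noise (made-up data).
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# l1_ratio mixes the penalties: 1.0 = pure lasso, 0.0 = pure ridge.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)   # coefficients on irrelevant features are shrunk toward zero
```

The lasso component drives the noise coefficients to (near) zero, giving the sparse solution mentioned above, while the ridge component stabilizes the remaining estimates.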

Gradient boosting method (GBM)
Boosting is a machine learning method that produces a combined strong classifier out of weak learners. The weak learner algorithm is run on the dataset and, according to a loss function, an updated version of the weak learner is introduced. The data distribution is updated so that the misclassified points in the dataset get higher weights, and the updated weak learners are run repetitively in this manner. The result is a combination of weak classifiers weighted with respect to the loss function outcome per iteration. The basic assumption in a boosting scheme is that the selected weak learner is at least better than a random classifier. Different boosting methods vary mainly in two components, the cost function and the weak learner. In the Gradient Boosting Method, each weak learner is fit in the direction of the negative gradient of the loss function [8].
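A brief sketch of gradient boosting for regression on synthetic data, using scikit-learn's GradientBoostingRegressor (parameters chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# Each stage fits a small tree to the negative gradient of the squared
# loss, i.e. the current residuals.
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=2)
gbm.fit(X, y)
print(gbm.score(X, y))   # in-sample R^2
```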

Principal component regression (PCR)
Principal component analysis (PCA) was invented by Karl Pearson [9]. With the PCA method, the independent variables are projected onto new variables such that the sample variance is maximized and the resulting linear combinations are mutually uncorrelated. The resulting variables are called principal components. Principal component regression (PCR) predicts the dependent variable using linear regression on the principal components [10].
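PCR can be sketched as a PCA step followed by linear regression, e.g. with a scikit-learn pipeline on made-up, deliberately correlated predictors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
# Deliberately correlated predictors: x2 is nearly a copy of x1.
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.1, size=200)

# PCR: project onto the leading principal component, then regress on it.
pcr = make_pipeline(PCA(n_components=1), LinearRegression()).fit(X, y)
print(pcr.score(X, y))
```

A single component suffices here because the two predictors carry essentially the same information, which is exactly the situation PCR is designed for.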

Support vector machine (SVM)
[11] introduced the support vector machine learning method.
The method is based on support vectors, which define decision boundaries on the training set. One desired characteristic of these decision boundaries is a large margin, as a small margin causes model overfitting. Every such decision boundary can be associated with two hyperplanes, and the task is to find the maximal margin classifier that separates them. The default classifier works with linear decision boundaries on the binary classification problem. Support vector machines generalize this to more complex surfaces by transforming a linear decision surface into a nonlinear one.
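A minimal illustration of the nonlinear generalization, using scikit-learn's SVR (the regression variant) with an RBF kernel on synthetic data:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

# The RBF kernel maps the data into a space where a linear decision
# surface corresponds to a nonlinear fit in the original space.
svr = SVR(kernel="rbf", C=10.0).fit(X, y)
print(svr.score(X, y))
```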

Random forest (RF)
Random forest was introduced by [12]. It is a combination of multiple decision trees. In classifying a new instance, majority voting is applied on component decision trees. In order to construct every individual decision tree, a random training set using sampling with replacement is generated out of the original one. The performance of random forest is mainly dependent on the correlation between component trees and the strength of each individual tree.
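A short sketch with scikit-learn's RandomForestRegressor on synthetic data; the feature importances reflect how strongly each predictor is used across the bootstrapped trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(300, 3))
# The third feature is irrelevant by construction.
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.2, size=300)

# Each tree is grown on a bootstrap sample of the training set;
# predictions are averaged (majority voting in the classification setting).
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)
```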

Conditional inference trees (Ctree)
Another predictive method, similar to multivariate adaptive regression splines (MARS), is the Conditional Inference Tree (CTree) method, which also systematically tries combinations of variables to select the right predictors. It is a tree-structured regression model: CTree builds a decision tree by generating splits iteratively, each split on the variable most significantly related to the response, as evaluated by p values. Iteration stops when no remaining variable has a significant p value [13].

Ensemble learning
In ensemble learning, a set of classifiers is combined to make a better prediction. The component classifiers can be identical or diverse. In general, the aggregate opinion of diverse classifiers is better in reducing variance. The aggregate opinion is formed using the weighted average of individual votes [14].
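A minimal sketch of such a weighted-average ensemble using scikit-learn's VotingRegressor on made-up data (the weights here are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)

# Weighted average of two diverse base models (weights are arbitrary).
ens = VotingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    weights=[0.6, 0.4],
).fit(X, y)
print(ens.score(X, y))
```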

Cross-validation
It is a sub-sampling technique in which the existing data are split into training and test sets. The model is trained on the training part and validated on the test part. In k-fold cross-validation, the data are divided into k equal parts. In every iteration, the i-th part is used as the test set and the remaining parts as the training set. After k iterations, the average performance over the k test sets is reported as the resultant performance.
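The k-fold procedure can be sketched as follows (5 folds, synthetic data, scikit-learn utilities):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=100)

# 5-fold CV: each fold serves once as the test set; errors are averaged.
errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
print(np.mean(errors))
```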

Hold-out sampling
In hold-out sampling, a separate validation set is utilized in order to assess the predictive performance of the model on unseen data.

Akaike information criterion (AIC)
Akaike Information Criterion (AIC) is a metric that is used to evaluate the goodness of fit of a model; models of different complexity can be compared using this metric. AIC is formulated as follows:

AIC = 2p − 2l

In the formulation, l is the log-likelihood term that describes how well the model explains the data, and p represents the model complexity in terms of the number of parameters. Lower AIC values are preferred; AIC favors simpler models that explain the data well.
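A toy comparison under this formulation (the log-likelihood values are invented for illustration):

```python
# AIC = 2p - 2l: a richer model must raise the log-likelihood by more than
# one unit per extra parameter to be preferred. Values below are invented.
def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

simple = aic(log_likelihood=-120.0, n_params=3)   # -> 246.0
richer = aic(log_likelihood=-119.5, n_params=6)   # -> 251.0
print(simple, richer)  # the simpler model has the lower (better) AIC
```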

Mean squared error (MSE)
It is a measure of the deviation of model predictions from the real data points. The sum of squares of individual errors is taken and normalized with respect to the number of data points. The sum of squared errors (SSE) and MSE are calculated as

SSE = Σ_i (y_i − ŷ_i)²,  MSE = SSE / n,

where y_i is a real data point and ŷ_i is the model-estimated one.

Data set
We have five years' (2010-2014) sales data for a cooling company. There are more than twelve types of products, sold in almost all cities of Turkey. The total number of sales in the product database is 185986. The most popular product is A, with 134247 sales in total. For our demand prediction problem, we referred to product A sales in the city of Istanbul; the total number of sales meeting this criterion is 12788. We queried the product database in order to filter it with respect to this criterion. As a result, our sales records include date, sale amount, product item price, city name, and product code. Using the date and city information, we added air temperature as an additional feature to the data set.
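As a minimal sketch of the MSE computation described in this section (the observed and estimated values are hypothetical):

```python
import numpy as np

def mse(y_true, y_pred):
    # SSE = sum((y_i - yhat_i)^2); MSE normalizes by the number of points.
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return np.sum(residuals ** 2) / len(residuals)

# Hypothetical observed vs. model-estimated values:
print(mse([3.0, 5.0, 7.0], [2.5, 5.5, 7.0]))  # -> 0.1666...
```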

Goodness of fit tests
As part of exploratory data analysis, we calculated some summary statistics. One such statistic is the total product sales amount for every product sorted with respect to city. Then, we based our demand prediction on the most popular product sold in Istanbul, which had the highest number of product sales.
We performed goodness of fit tests on the collected data to determine its distribution ( Table 1). The results are acquired according to the Chi-Squared fit test.
The null hypothesis in the case of 2010 sales amount data can be stated as follows: H 0 =There is no difference between Sales2010 data distribution and the theoretical Log-logistic distribution.
The alternative one is: H 1 =There is difference between Sales2010 data distribution and the theoretical Log-logistic distribution.
Given the obtained p value (0.21991), rejecting H0 would carry roughly a 22% probability of error. Thus, we cannot reject that the Sales2010 data fit the log-logistic distribution.
In a similar way; as all p values for the other years' sales data are greater than 0.05, we can conclude that Sales2011, Sales2012, Sales2013, Sales2014 data fit to Gamma Distribution, Gamma (3P) Distribution, Lognormal Distribution, and Gamma Distribution respectively.
Please be reminded that in goodness-of-fit tests, Type II errors (failing to reject a false null hypothesis) are more common than Type I errors (incorrectly rejecting a true null hypothesis, whose probability is given by the p value).
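The chi-squared goodness-of-fit procedure used in this section can be sketched as follows, on a synthetic gamma sample rather than the company data (the binning choices are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
sample = rng.gamma(shape=2.0, scale=3.0, size=1000)   # synthetic "sales" data

# Fit a gamma distribution, then compare observed bin counts with the
# counts expected under the fitted distribution (10 equiprobable bins).
params = stats.gamma.fit(sample, floc=0)
edges = np.quantile(sample, np.linspace(0.0, 1.0, 11))
observed, _ = np.histogram(sample, bins=edges)
expected = len(sample) * np.diff(stats.gamma.cdf(edges, *params))

chi2 = np.sum((observed - expected) ** 2 / expected)
dof = len(observed) - 1 - 2        # bins - 1 - fitted parameters (shape, scale)
p_value = stats.chi2.sf(chi2, dof)
print(p_value)   # a large p value gives no evidence against the gamma hypothesis
```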

Discretization
In order not to disregard the effect of discretization on the performance of GLM model fitting, we prepared the discretized versions of our predictor variables. We calculated the quantiles for the temperature and price predictor variables. Then, we converted the raw values into categories with respect to first, second, third, and fourth quartiles ( Table 2 and 3).
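Quartile-based discretization of this kind can be sketched with pandas' qcut on made-up temperature values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
temperature = pd.Series(rng.uniform(-5, 35, size=365))   # made-up daily values

# Quartile-based discretization: each raw value is mapped to its quartile.
temperature_cat = pd.qcut(temperature, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(temperature_cat.value_counts())
```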

Results
In our modeling, we investigated the response variable with respect to three predictor variables, "Date", "Temperature" and "Price". We considered all combinations of these three variables in our experiments. Additionally, we transformed them into categorical variables to test the effect of categorization. In the following part, we first give the empirical results for non-categorical predictor variables then for their categorical counterparts.

Incremental addition of non-categorical predictor variables
Day, temperature, and price are considered as predictor variables while constructing the fitted model to sales data for the year 2014. Predictor variables are added to the model incrementally for testing their effects on model fitting in a controlled way.
To assess the performance of model fit, p values are calculated with respect to the goodness of fit null hypothesis. The null hypothesis can be stated as follows: H 0 =There is no difference between observed and fitted values.
When we analyze the results given in Table 5, the p values and the AIC metric indicate that the model based on the "Day" and "Temperature" predictor variables gives the best result. The calculated p value, 0.6185, means that we cannot reject the null hypothesis; that is, the model fit is statistically valid.

Incremental addition of categorical predictor variables
This time, GLM fitting results are interpreted with categorical sales data for the year 2014. The results are presented in Table 6.
The model constructed with "Quarter", "Temperature", and "Price" gives the best result when we consider the MSE, AIC, and p values; so "Quarter", "Temperature", and "Price" are added to the model as predictor variables. Table 7 presents the comparison of GLM results with non-categorical and categorical predictors. The model constructed with the categorical independent variables gives better results. The difference between the two models can be explained by the effect of the "Quarter" predictor variable: it carries seasonal information, which is especially relevant for the sale of cooling goods.
Table 6. GLM fitting results for categorical sales 2014 data.

Ordinary least squares estimator (OLS) versus GLM
For all combinations of predictor variables, GLM outperforms OLS. To illustrate how they differ, the OLS- and GLM-fitted models constructed with "Day" and "Temperature" are visualized in Figure 2 and compared in Table 8. To further analyze the difference between OLS and GLM model fitting, we can focus on their variances (Figures 3 and 4, respectively). The OLS method assumes that the residuals have the same variance, which is named homoscedasticity; this constant variance can be observed in Figure 3. The GLM-fitted model has non-constant variance across the entire range of values, which is called heteroscedasticity (Figure 4). Fitting the sales data with GLM yields less variance than the OLS method, since GLM allows a different variance for each estimated response value. Table 8 compares the two fitted models: the GLM-fitted model gives better AIC and MSE results. In addition, the residual-to-null deviance ratio is better for GLM, meaning that adding the "Day" and "Temperature" predictors decreases the null deviance more in the GLM-fitted model than in the OLS-fitted model. Comparing the two models on AIC, MSE, and the residual-to-null deviance ratio, the GLM-fitted model represents the observed data better than the OLS-fitted model.
The validation of the selected model is an important step of model building. In this section we validate the fitted models constructed with non-categorical predictor variables. Hold-out samples, i.e., data not used in fitting a model, are used for validation: Istanbul sales serve as the training data and Izmir sales as the validation set.
Table 9 gives the comparison of the selected data mining methods and GLM with non-categorical predictor variables on the hold-out samples. As seen from Table 9, GLM gives the best result with the "Days" and "Temperature" predictors and with the "Days", "Temperature", and "Price" predictors. The GBM, PCR, SVM, and Random Forest models give better results in some other variable combinations (marked in bold).
Table 10 gives the comparison of the selected data mining methods and GLM with categorical predictor variables. In some variable combinations the RF-fitted model is the best, while in the remaining ones the GLM-fitted model outperforms the others. To characterize those combinations further: in single-predictor cases RF is the best, but when the "Price" predictor is used in conjunction with other variable(s), GLM is the top performer. The best values in every column are marked in bold.
Meanwhile, as an alternative we used cross-validation on the model building based on Istanbul sales data. The obtained results are in accordance with the results from the hold-out sampling.

Discussion and Conclusion
To deal with demand prediction, various techniques of regression analysis and data mining are used under the predictive methods. The purpose of this work is to make history-based demand prediction of sales by using generalized linear models.
The data set used for the analysis is real company data. In our modeling, the response variable is the sale amount, and the date of sale, item price, and air temperature are selected as the predictor variables. The distribution of the sale amount for the real customer data is discovered to be the gamma distribution. Because of this response variable distribution, the GLM method is used with the gamma distribution, a member of the exponential family, together with the inverse link function. Investigating the data and choosing the GLM setting in accordance with the response variable distribution, along with an appropriate link function, are crucial steps for characterizing the data through GLM modeling.
In any modeling case, the target model is fitted to the whole set of data points. Thus, the variance of data points and the bias of them from their real population should be considered. In order to govern the total variance due to the predictor variables, if there are a few variables like our case, all the possible combinations of them should be taken into consideration. If the original data types of variables and their ranges cause a lot of variance, it can be an option to apply categorization to the variables in order to better identify their impact on the response variable.
We analyzed the combinations of three different predictor variables, "Days", "Temperature", and "Price". Then, the predictor variables were transformed into their categorical counterparts to investigate the effect of categorization. The categorical predictor variables gave better fitting results than the non-categorical ones. Thus, categorization provides more accurate prediction, mostly due to variance reduction on the predictor variables.
When the GLM result is compared with the other predictive data analysis techniques, our findings are as follows: Within the scope of regression techniques, GLM is compared with the linear model both for default and inverse response variables. As a result of the comparison, GLM gives better fitting results than the linear model in both cases. Within the scope of classification techniques, RF and GLM are the top performers. When single predictor variables are used RF is the best while in the case of the usage of "price" with any other predictor variable(s) GLM is the best. The model fitting results are evaluated with respect to MSE and AIC metrics.
Classifier performance depends greatly on the characteristics of the data to be classified. Various classification algorithms are compared in order to find out the characteristics of data that explain their comparative performances. However, it's still an open problem.
Attribute error and concept size are good features (characteristics of data) to explain the performance of classification algorithms. Concept size is the proportion of concept space covered by positive instances while attribute error is the random substitution of attribute values [15].
In explaining the performance of RF, it can be said that RF is robust to attribute error as it performs random selection of a subset of features to grow each tree.
As for GLM, its performance can be attributed to our model adaptation in which we took the dependent variable, "sale demand" distribution (Gamma) into consideration.
Further research on this topic could take the following directions:
• Different discretization methods for categorizing the predictor variables can be used.
• Hybrid models [16] can be constructed, such as a GLM formulation with an additive time series component.