Market Basket Analysis of Basket Data with Demographics: A Case Study in E-Retailing

Businesses face a high degree of competition that necessitates customer-focused strategies in most industries. In a digitalized business environment, the implementation of such strategies often requires the analysis of customer data. Market basket analysis is a well-known method in marketing that examines basket data to discover useful information about customers’ purchase intentions. The analysis has been a playground for data mining researchers who aim to overcome its practical challenges. Our study extends the conventional basket analysis by incorporating demographic variables along with purchase transactions. With this modification, we provide an example of the extraction of segment-specific rules that relate product-level purchase decisions to gender, location, and age group. For this purpose, we present a case study on monthly basket data obtained from an e-retailer in Turkey. Our findings demonstrate association rules that might guide marketing practitioners who need to discover segment-specific purchase patterns to design personalized promotions.


Introduction
In a competitive environment, businesses aim to satisfy their customers while maintaining profitable relationships in the long run. Retaining existing customers with customer-centric practices is crucial for businesses due to the high cost of acquiring new customers (Bodapati, 2008). Customer Relationship Management (CRM) is often considered as a business strategy to achieve long-term customer satisfaction through a customer-oriented approach.
CRM strategy focuses on relationships established with customers and necessitates analyzing customer activities to understand their needs and behaviors (Winer, 2001). With this perspective, a business needs to keep track of its customers' actions, explore what they like, analyze their purchase patterns, and investigate their customers' needs, preferences, and behaviors (Tsiptsis & Chorianopoulos, 2009). In this regard, organizations have been utilizing information to establish long-lasting profitable relationships with their customers by offering better services (Sota et al., 2018).
Over the last decades, e-commerce has transformed how businesses and consumers interact with each other. As a side effect, marketing efforts for online shopping have become more and more critical for businesses (Aksoy, 2008). IT infrastructure facilitates capturing, storing, and rapidly accessing customer data, and provides a useful framework to implement data-driven marketing strategies. The need to extract useful information from large data stacks has led to the introduction of data mining techniques (Han et al., 2012).
Data mining is a semi-automatic process to discover interesting patterns and statistically significant structures within datasets (Vahaplar & İnceoğlu, 2001). This process involves the use of techniques for deriving meaningful and useful information from large and unstructured datasets (Dalal, 2012; Ching & Pong, 2002). In a broader context, data mining is often described as an essential step in the Knowledge Discovery Process (Bramer, 2016:2), which involves cleaning, integration, selection, and transformation of data before the application of a data mining algorithm, followed by pattern evaluation and presentation (Han et al., 2012). Data mining techniques allow the discovery of previously unknown patterns and relationships within large data sets (Linoff & Berry, 2011; Marakas, 2003) and are widely employed in interdisciplinary studies across various domains (Savaş et al., 2012), including finance, e-commerce, medicine, business, and education. In this manner, the data mining field stands at the intersection of computer science, machine learning, database management, applied mathematics, and statistics (Emel & Taşkın, 2005).
Data mining applications help to estimate future trends that affect decision-makers (Hudairy, 2004). The patterns revealed through data mining might help marketers to decide on the marketing mix, to determine new product opportunities, and to predict customer behaviors (Strauss & Frost, 2009). Moreover, segmentation, classification, and forecasting techniques in data mining are applied to various business problems to leverage data in implementing customer-oriented strategies.
Alphanumeric Journal Volume 9, Issue 1, 2021
Among a variety of industries and sectors, retailing is one of the hotspots for data mining applications. In a study by Anderson et al. (2007), it was emphasized that retailers that aim to respond to the needs of their customers shift towards data-driven decisions. Data mining techniques have been widely adapted for retailing problems, including cross-selling, market basket analysis, risk management, fraud detection, customer acquisition, customer retention (Bala, 2008), shelf placement, and stock management (Hormozi & Giles, 2004). Basket data has been a valuable resource for data mining applications in retailing, and is often analyzed for customer segmentation and extraction of purchase patterns (Griva et al., 2018). Notably, market basket analysis has often been addressed as a data mining problem with the objective of discovering relationships, or association rules, which represent hypotheses on purchase intentions. Such findings have been exploited in facility layout design (Halim et al., 2019) and online/mobile recommendations (Osadchiy et al., 2019) with the objective of increasing cross-sales.
As remarked by Dippold and Hruschka (2013a), prior research often focuses on cross-category purchase decisions extracted from basket data, where most exploratory models exclude demographics and marketing mix elements. However, it might be argued that the technique might be further applied to other product attributes in the retailing context, including brand (Kabasakal, 2020). Kooti et al. (2016) conducted a study to analyze the differences in customer purchase decisions and emphasized that an extensive set of attributes, including geo-location and demographics, might help to predict online shopping decisions. Moreover, Zhang and Pennacchiotti (2013) noticed that social media profile data, including gender and age, are useful in predicting purchase decisions in e-commerce. Due to their influence on purchase behaviors, customer profiles and demographics have been further utilized for product recommendations. With this motive, our study aims to analyze basket data and demographics together using the association rule mining technique. Along with a case study, our study presents association rules which help to relate purchase intentions with demographic variables. In the following sections, an introduction of market basket analysis and the rule mining technique is provided. Subsequently, our case study is presented. Our findings involve prominent rules that were initially chosen by interestingness measures, then categorized with the inclusion of demographic attributes.

Market Basket Analysis
Market basket analysis is a popular technique for marketers that might be useful to design customer-focused strategies for businesses (Özçakır & Çamurcu, 2007). The analysis extracts clues on customers' purchase intentions by discovering interrelated categories. The analysis depends on the assumption that the customers' purchase decisions across product categories might not be independent, and thus follow similar patterns (Dippold & Hruschka, 2013b).
From a broader perspective, market basket analysis might be categorized into two types of models: exploratory models are designed to discover cross-category purchase patterns, while explanatory models explore the effects of marketing mix variables right after the extraction of purchase patterns (Solnet et al., 2016). However, the use of the analysis in e-retailing is often aimed at the discovery of cross-category purchase patterns for improved recommendations.
The findings in a market basket analysis typically imply which products could be sold together. Such results typically involve complementary products (Winer, 2001). Purchase patterns discovered by the analysis might be utilized to design sales promotions (Chen et al., 2006). Furthermore, e-commerce web sites might provide relevant products for online users instantaneously.
The discovery of frequently purchased items has been commonly revisited as a frequent pattern mining problem in data mining studies (Aggarwal, 2015). The Apriori algorithm (Agrawal et al., 1993) is a well-known algorithm proposed by R. Agrawal and colleagues to discover the association rules that represent purchase patterns in a supermarket dataset. The primary advantage of the algorithm lies in its ability to scale efficiently for large input sizes (Kronberger & Affenzeller, 2011). Alternatively, the Eclat (Equivalence Class Transformation) algorithm is widely employed in rule mining due to its high performance in smaller datasets (Şimşek-Gürsoy et al., 2019). As another alternative, the FP-Growth algorithm generates FP-trees to achieve better data compression in item-set discovery (Kotu & Deshpande, 2015:206).
The purchase behaviors are formulated as association rules. An association rule X → Y for two sets of items X and Y represents a purchase pattern that indicates the purchase of Y along with X. In such representation, the antecedent (X) and the consequent (Y) denote discrete sets of items.
The significance of association rules and item-sets is assessed with several measures. The support criterion for a set of items indicates the fraction of all records that involve that set of items (Aggarwal, 2015). The confidence measure is used in rule mining to assess the importance of a rule. The confidence for the rule X → Y is calculated by the ratio of transactions that involve X and Y together among all observations that involve X, as in the following (Aggarwal, 2015):
$$\mathrm{confidence}(X \rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}$$
The support and confidence are among the criteria that indicate the usefulness of association rules (Bayardo Jr & Agrawal, 1999). Rules measured over a threshold for both measures are often described as 'strong' in most studies. On the other hand, a high confidence score might be misleading in some cases; thus, the lift measure that signifies the correlation among the itemsets is useful to choose interesting rules (Han et al., 2012). The lift measure for the rule X → Y can be formulated as follows:
$$\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{confidence}(X \rightarrow Y)}{\mathrm{support}(Y)} = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)\,\mathrm{support}(Y)}$$
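The three measures described above can be illustrated with a minimal sketch. The function names and the toy baskets below are assumptions made for illustration only; they are not taken from the study's software or data.

```python
from typing import Iterable, List, Set


def support(baskets: List[Set[str]], items: Iterable[str]) -> float:
    """Fraction of all baskets that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for basket in baskets if wanted <= basket) / len(baskets)


def confidence(baskets: List[Set[str]], x: Iterable[str], y: Iterable[str]) -> float:
    """Share of the baskets containing X that also contain Y."""
    return support(baskets, set(x) | set(y)) / support(baskets, x)


def lift(baskets: List[Set[str]], x: Iterable[str], y: Iterable[str]) -> float:
    """Confidence of X -> Y relative to the overall support of Y."""
    return confidence(baskets, x, y) / support(baskets, y)


# Hypothetical baskets, invented for this example.
toy_baskets = [
    {"vegetables", "fresh fruits", "milk"},
    {"vegetables", "fresh fruits"},
    {"milk", "eggs"},
    {"vegetables", "milk"},
]

print(support(toy_baskets, {"vegetables"}))
print(confidence(toy_baskets, {"vegetables"}, {"fresh fruits"}))
print(lift(toy_baskets, {"vegetables"}, {"fresh fruits"}))
```

A lift above 1 indicates that the consequent appears more often in baskets with the antecedent than in baskets overall, which is why lift is used alongside confidence when choosing interesting rules.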

Methodology
In most studies on market basket analysis, rules are entirely discovered from the purchase history. Particularly, the rule mining technique (Agrawal et al., 1993) handles transactional data with binary attributes where each attribute signifies the purchase of a product. Our approach extends the basket data by integrating additional variables about the customers and the orders. However, our study sticks to the original Apriori algorithm for rule mining, after several steps of data preparation.
Market basket analysis typically formulates products within baskets in binary form regardless of the quantity, and such relations are occasionally represented in a table of binary relationships. In Table 1, each basket is represented as a column, and each row corresponds to an item (product or category). The cells at the intersection of columns and rows hold binary values. In this notation, a value of 1 corresponds to the presence of an item within a basket, and 0 corresponds to the opposite.
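The binary representation described here, with items as rows and baskets as columns, could be built as in the following sketch. The basket identifiers and items are hypothetical, and quantities are ignored on purpose.

```python
# Hypothetical baskets: identifier -> set of items (quantities are ignored).
baskets = {
    "B1": {"milk", "eggs"},
    "B2": {"milk"},
    "B3": {"eggs", "tea"},
}

basket_ids = sorted(baskets)                  # columns of the table
items = sorted(set().union(*baskets.values()))  # rows of the table

# One row per item; 1 if the item is present in the basket, 0 otherwise.
matrix = {
    item: [1 if item in baskets[b] else 0 for b in basket_ids]
    for item in items
}

for item in items:
    print(item, matrix[item])
```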
To examine demographic variables in rule mining, we split categorical values into binary attributes to form a binary representation. All items and transactions represented in the table above were stored in a relational database in SQLite. The transactional data initially involved purchased products for each basket, where each product was assigned a category attribute. Accordingly, the products were reduced into product categories to obtain generalized rules. Moreover, the basket-item relations were extended by including the demographic variables for analysis. For this purpose, a query was executed to prepare a combined transaction list that involves pairs of {Basket, Product Category}, {Basket, Gender}, {Basket, Age-Group} and {Basket, Location}. By this means, we assert that our transactional basket data represents the customers, even partially.
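A combined transaction list of this kind might be produced with a query along the following lines. This is only a sketch: the table names, column names, and the `Gender=`, `Age=` and `Location=` prefixes are assumptions, since the actual schema and query of the study are not given.

```python
import sqlite3

# In-memory database with a hypothetical schema and a single sample order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, gender TEXT, age_group TEXT);
CREATE TABLE orders (basket_id INTEGER PRIMARY KEY, customer_id INTEGER, location_id INTEGER);
CREATE TABLE order_items (basket_id INTEGER, category TEXT);
INSERT INTO customers VALUES (1, 'Female', '35-44');
INSERT INTO orders VALUES (100, 1, 54);
INSERT INTO order_items VALUES (100, 'Vegetables'), (100, 'Fresh Fruits'), (100, 'Vegetables');
""")

# UNION merges the four kinds of {Basket, Item} pairs and drops duplicates,
# so a category bought several times in one basket appears only once.
COMBINED_QUERY = """
SELECT basket_id, category AS item FROM order_items
UNION
SELECT o.basket_id, 'Gender=' || c.gender
  FROM orders o JOIN customers c ON c.customer_id = o.customer_id
UNION
SELECT o.basket_id, 'Age=' || c.age_group
  FROM orders o JOIN customers c ON c.customer_id = o.customer_id
UNION
SELECT basket_id, 'Location=' || location_id FROM orders
ORDER BY basket_id, item
"""

rows = conn.execute(COMBINED_QUERY).fetchall()
for basket_id, item in rows:
    print(basket_id, item)
```

Prefixing the demographic values keeps them distinguishable from product categories once everything is treated as an item of the basket.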
The subsequent step of our methodology involves the application of the Apriori algorithm to extract association rules. Rule mining was conducted by the 'RuleGenerator' software, which is a custom implementation of the Apriori algorithm introduced in Kabasakal (2020). The findings were separated into several groups based on the presence of demographic variables.
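The separation of rules by the presence of demographic variables could be sketched as follows. The item prefixes and the group names are assumptions made for this example and do not describe how 'RuleGenerator' labels its output.

```python
from typing import Iterable, Set

# Hypothetical prefixes that mark demographic items in a rule.
DEMOGRAPHIC_PREFIXES = {
    "gender": "Gender=",
    "age": "Age=",
    "location": "Location=",
}


def rule_groups(antecedent: Iterable[str], consequent: Iterable[str]) -> Set[str]:
    """Return the demographic groups a rule belongs to.

    A rule without any demographic item is labeled 'cross-category'.
    """
    items = set(antecedent) | set(consequent)
    groups = {
        name
        for name, prefix in DEMOGRAPHIC_PREFIXES.items()
        if any(item.startswith(prefix) for item in items)
    }
    return groups or {"cross-category"}


print(rule_groups({"Gender=Male", "Sugar"}, {"Tea"}))
print(rule_groups({"Vegetables"}, {"Fresh Fruits"}))
```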
We suggest that the main advantage of our approach is the opportunity to identify behavioral purchase patterns by specific customer segments. Such rules can be exemplified as "women who purchase X also purchase Y", or "customers of age 25-34 who purchase X also purchase Y". The resulting rules, including demographic variables on the left-hand side, might help to develop segment-specific offers. If a specific customer segment prefers a group of products more often than the others, our approach might demonstrate the difference with segment-specific association rules.

Case Study
In this study, we present a case study of a market basket analysis combined with demographic variables. The data was obtained from Adepo Sanal Market (adepo.com) for analysis. Since its foundation in 2001, adepo.com had been a substantial e-retailer in İzmir, Turkey. The company offered products of a variety of categories such as groceries, beverages, cleaning supplies, and household items; and had been in service until the end of December 2015. The dataset consists of 3163 purchase records, all of which were ordered in November 2013 by a total of 1717 online customers. Moreover, our dataset involves demographic variables that consist of gender and year of birth. The demographics had been provided by customers voluntarily in a membership form during their registration. Additionally, the delivery location was available in our dataset for each order.
Online customers were split into five segments according to their age. For this purpose, the years of birth selected in membership pages were used to calculate the customers' age as of November 2013. Additionally, the location data of orders were available for each purchase in the dataset. With the additional variables included in the analysis, we argue that our study differs from most studies. Table 3 demonstrates the distribution of customers grouped by their location. We should note that the numbers arise from the limited dataset analyzed in our study, which only involves the orders delivered in November 2013.
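The age segmentation could be sketched as below. The study states that five segments were used and mentions the groups 25-34 and 35-44; the remaining boundaries in this example are assumptions, as is the function name.

```python
REFERENCE_YEAR = 2013  # ages are calculated as of November 2013


def age_group(year_of_birth: int, reference_year: int = REFERENCE_YEAR) -> str:
    """Map a year of birth to one of five age segments.

    Only '25-34' and '35-44' are named in the text; the other three
    boundaries are assumed for illustration.
    """
    age = reference_year - year_of_birth
    if age < 25:
        return "<25"
    if age < 35:
        return "25-34"
    if age < 45:
        return "35-44"
    if age < 55:
        return "45-54"
    return "55+"


print(age_group(1975))
print(age_group(1988))
```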

[Table 3: distribution of customers by location. Columns: Location, Customers. Recoverable entry: Location 54, 294 customers.]
Before analyzing our dataset for rule extraction, a preprocessing phase was required to eliminate products rewarded by the e-retailer. In particular, the company occasionally had a promotion for online customers whose purchase total exceeds a threshold. Accordingly, 2199 orders by 1291 customers had been rewarded with bottled water. The expert opinion by an IT specialist working for the e-retailer was to ignore such products. Moreover, a study by Häubl and Trifts (2000) describes online shopping as a two-step decision making process in which the customers initially screen a set of relevant products, then make a purchase through the examination of products based on their important attributes. From this perspective, it can be argued that free items offered by the e-retailers might be purchased by any customer, without proper consideration of the product attributes. Accordingly, our data preprocessing stage involved the elimination of gifts from the basket data.
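The elimination of gifts might look like the following sketch. It assumes that each order line carries an `is_gift` flag, which is a hypothetical field; how rewarded items were actually marked in the e-retailer's data is not stated.

```python
from typing import Dict, List


def drop_gift_lines(order_lines: List[Dict]) -> List[Dict]:
    """Keep only the order lines that the customer actually chose to buy."""
    return [line for line in order_lines if not line["is_gift"]]


# Hypothetical order lines: bottled water appears once as a gift and once
# as a regular purchase; only the gift line should be removed.
order_lines = [
    {"basket_id": 100, "category": "Vegetables", "is_gift": False},
    {"basket_id": 100, "category": "Bottled Water", "is_gift": True},
    {"basket_id": 101, "category": "Bottled Water", "is_gift": False},
]

purchased = drop_gift_lines(order_lines)
print(purchased)
```

Filtering on a flag rather than on the product category keeps bottled water that a customer bought deliberately in the basket data.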
As the Apriori algorithm requires, a support threshold should be set for the pruning of the infrequent item-sets. The minimum support threshold was set as 2% for our analysis. Additionally, the confidence threshold required to remove redundant rules was set as 25%. With such parameters, the analysis resulted in a total of 2956 association rules.
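The role of the two thresholds can be seen in a compact sketch of Apriori-style rule mining. This is a generic illustration, not the 'RuleGenerator' implementation used in the study; the function names and toy baskets are assumptions, and only the threshold values of 2% and 25% come from the text.

```python
from itertools import combinations

MIN_SUPPORT = 0.02      # minimum support threshold stated in the text
MIN_CONFIDENCE = 0.25   # minimum confidence threshold stated in the text


def frequent_itemsets(baskets, min_support):
    """Level-wise search: return {itemset: support} for all frequent itemsets."""
    n = len(baskets)
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            supp = sum(1 for basket in baskets if c <= basket) / n
            if supp >= min_support:
                current[c] = supp
        frequent.update(current)
        k += 1
    return frequent


def generate_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    rules = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                y = itemset - x
                conf = supp / frequent[x]
                if conf >= min_confidence:
                    rules.append((x, y, supp, conf, conf / frequent[y]))
    return rules


# Hypothetical baskets that mix product categories with demographic items.
toy_baskets = [
    frozenset({"Gender=Female", "Vegetables", "Fresh Fruits"}),
    frozenset({"Gender=Female", "Vegetables", "Milk"}),
    frozenset({"Gender=Male", "Sugar", "Tea"}),
    frozenset({"Gender=Male", "Sugar", "Tea", "Milk"}),
    frozenset({"Gender=Female", "Vegetables", "Fresh Fruits", "Milk"}),
]

rules = generate_rules(frequent_itemsets(toy_baskets, MIN_SUPPORT), MIN_CONFIDENCE)
print(len(rules), "rules found")
```

With only five toy baskets, a 2% support threshold prunes nothing; on a dataset of several thousand baskets, it removes itemsets that occur in only a small number of orders.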

Findings
The rule mining technique initially discovers frequent item patterns that correspond to the most popular items in the basket data. As in the conventional market basket analysis, our findings involve cross-category purchase patterns. Due to the inclusion of demographic attributes in our analysis, the frequent patterns in Table 4 list combinations of demographic attributes and purchased items. According to Table 4, the top two itemsets suggest that females often purchase vegetables or milk. Moreover, the third itemset indicates the frequent purchase of vegetables and fresh fruits together, regardless of gender. On the other hand, a drawback of the inclusion of demographics was the discovery of irrelevant itemsets. Such findings, ranked as 4, 12, 13, and 15 in Table 4, merely identify the count of orders by particular groups of customers.
The inclusion of demographic variables in the analysis has also enabled the discovery of association rules, which occasionally point to a limited group of customers. Firstly, we present the top-10 rules in terms of the lift measure in Table 5. Among the top-10 rules, two involved demographic variables, while the remaining signify cross-category purchase patterns. A cross-category relation, as in the last row, suggests an observation of "purchase of lentil is 6.27 times more frequent in baskets that involve cracked wheat". Moreover, the confidence for the same rule signifies that 34.90% of the baskets that involve cracked wheat also involve lentil. The support measure in this rule signals that such a pattern relates to 6.07% of baskets that include cracked wheat. On the other hand, the rules ranked 8th and 9th in Table 5 involve the gender variable, which signifies the presence of a particular purchase pattern within a specific customer group. To explore such findings further, top rules which involve the gender, age, and location variables have been filtered and presented separately in Tables 6-8.

[Table columns: Rule, Antecedent (X), Consequent (Y), Support, Confidence, Lift. Surviving row fragment: 1, Coffee Cream.]
Table 6. Association rules that demonstrate relations among purchase decisions and gender
The rules that involve gender above help to discover interesting purchase patterns observed for males and females, separately. For instance, the 6th rule in Table 6 indicates that 65.4% of male customers who purchase sugar also purchase tea. Moreover, the lift for the rule suggests that male customers purchasing sugar are 3.77 times more likely to purchase tea, compared with others. Arguably, such a finding suggests a noticeable difference in tea consumption across females and males. On the other hand, one could argue that purchase decisions often originate from family members; thus, deriving such conclusions might be misleading due to the lack of a variable representing the family size. Nevertheless, such a rule might still be taken into consideration to recommend tea for males who have already added sugar into their basket when shopping.
In addition to gender, age groups and delivery locations were also evident in the results. It could be argued that the inclusion of the age variable might help to demonstrate how purchase behaviors come forward depending on age groups. Accordingly, interesting rules involving age groups are listed in Table 7. The most interesting rule at the top of Table 7 suggests that customers of an age between 35 and 44 who purchase yogurt are likely to purchase milk and eggs, too. Moreover, the customers described in the antecedent had purchased milk & eggs approximately three times more often than others. Furthermore, the third rule signifies a cross-category purchase pattern that is more common in females of age 35-44. Accordingly, 72.45% of females between the age of 35-44 who purchase yogurt and vegetables have purchased fresh fruits. Moreover, the purchase of fresh fruits was found 2.74 times more frequent in this group than in other customers. As demonstrated in this rule, our approach might discover relationships among purchase behaviors and multiple demographic variables.
As the final set in our findings, Table 8 shows the associations between the delivery locations and the product categories. As mentioned before, locations were represented with identifier numbers to prevent revealing an overall location-based sales report. As an example, compared to the rule "Vegetables → Fresh Fruits", the fifth rule "[Location 54] and Vegetables → Fresh Fruits" provides location-specific metrics about the purchase decisions. Moreover, the latter rule has a lift of 2.10, which is higher than the lift of 2.00 in the former.
An interesting detail in our findings was the dominance of Location-54 compared to other locations. Among the 2956 rules, 158 involved a location variable. Among those 158 rules, Location-54 was present in 75, while the remaining 83 rules involved the other locations. Moreover, the average lift for rules with Location-54 was 1.33, whereas the average lift was found 1.18 in other segment-specific rules. Based on this difference, it might even be argued that the rules regarding Location-54 indicate purchase characteristics which differentiate Location-54 from other locations.
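A comparison of this kind could be computed as in the sketch below. The rule records and their lift values are invented for illustration and are not the study's results; the `Location=` prefix and the choice of other location rules as the comparison group are likewise assumptions.

```python
from statistics import mean


def summarize_location_rules(rules, focus="Location=54"):
    """Count location rules with and without `focus` and average their lift."""
    location_rules = [
        r for r in rules
        if any(item.startswith("Location=") for item in r["items"])
    ]
    focus_lifts = [r["lift"] for r in location_rules if focus in r["items"]]
    other_lifts = [r["lift"] for r in location_rules if focus not in r["items"]]
    return {
        "focus_count": len(focus_lifts),
        "other_count": len(other_lifts),
        "focus_mean_lift": mean(focus_lifts) if focus_lifts else None,
        "other_mean_lift": mean(other_lifts) if other_lifts else None,
    }


# Made-up rules: `items` holds every item of the rule, on either side.
toy_rules = [
    {"items": {"Location=54", "Vegetables", "Fresh Fruits"}, "lift": 1.4},
    {"items": {"Location=54", "Milk", "Eggs"}, "lift": 1.2},
    {"items": {"Location=12", "Milk", "Yogurt"}, "lift": 1.1},
    {"items": {"Vegetables", "Fresh Fruits"}, "lift": 2.0},
]

summary = summarize_location_rules(toy_rules)
print(summary)
```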

Conclusion
Customer data is an essential resource for analyses to implement customer-oriented strategies. For this purpose, purchase records have been extensively analyzed in prior research for a variety of problems. This study extends the conventional market basket analysis with the inclusion of demographic variables and presents findings that indicate segment-specific behaviors.
The data examined with market basket analysis consists of purchase records as well as delivery location, gender, and age group. The underlying motive to integrate those attributes was the opportunity to extract patterns that connect purchased products with demographic variables. The justification of this idea lies in the potential differences in purchase intentions across different customer segments. In this regard, our study aims to present a broadened use of conventional market basket analysis with demographics and discover segment-related purchase patterns.
Among the association rules discovered in our study, interesting results were chosen based on the lift and confidence measures and presented in Tables 5-8 separately. It could be argued that such rules might be useful for practitioners, especially when launching segment-specific campaigns. Moreover, the findings might be utilized to develop customized offers in e-retailing. We argue that our approach might result in more specific rules and lead to more-detailed purchase patterns in datasets with more demographic variables.
The consumer-oriented paradigm in the marketing context emphasizes an understanding of consumer behaviors and the adoption of more customer-focused practices. The model proposed in this study aims to contribute to prior research with the inclusion of customer demographics in basket data for market basket analysis. Besides, the assessment of demographic cross-category association rules in recommender systems might be explored in further studies.