Association Rule Mining on the Extended Bakery dataset
Introduction
We used the Extended Bakery Dataset's 75,000 receipt data from apriori.zip which can be found at this website.
We also used the EB-build-goods.sql in order to convert the product ID to their names. The original file can be found here.
Objective
What is the domain and what are the potential benefits to be derived from association rule mining?
The domain of association rule mining is that it is a mining method that specialises in finding frequent patterns, associations, correlations or causal structures in the ExtendedBakery data set that is provided. With associative rule mining we can possible get to improve the inventory management, customer buying prediction and time related sales.
Association rules mining or what is sometimes referred to as 'Market Basket Analysis' is among the preeminent component used in data mining to find useful insights to a particular domain. It is a rule-based machine learning method designed to discover frequent co-occurring associations among a collection of items in transaction and even in relational databases. Normally, data produced in transactions are categorical(non-numeric) data which makes association rules mining a pertinent method because it handles these forms of data well when searching for interesting discoveries.
An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in the data. A consequent is an item that is found in combination with the antecedent. The strength of an association rule can be measured in terms of its support and confidence. Support is an indication of how frequently the items appear in the database. This is of interest due to the fact that if a rule is measured to be very low in support, it is likely to be uninteresting from a business perspective. For example, it may prove unprofitable to promote items that customers seldom buy together. Confidence, on the other hand indicates the number of times the if/then statements have been found to be true. It essentially measures the reliability of the inference made by a rule.
The classic example of the Beer and Diapers association that is often mentioned in data mining books. The association suggests that a strong relationship exists between the sale of diapers and beer because many customers who buy diapers also buy beer. This can help retailers to learn about purchasing behavior of their customers. Such information can also be utilized to support a variety of business-related applications such as marketing promotions, inventory management, and customer relationship management.
Association mining has been broadly used in many application domains besides the business field in the last years the application areas have increased significantly. Some recent applications are the discovery of patterns in biological databases, extraction of knowledge from software engineering metrics and the optimization of user's profiles for web system personalization. An example would be a case for Walmart in 2004 when a series of hurricanes crossed the state of Florida. Walmart mined their massive retail transaction database to see what their customers really wanted to buy prior to the arrival of a hurricane. They found one particular item that increased in sales by a factor of 7 over normal shopping days. That was a huge Lift factor for a real-world case. That one item was not bottled water, or batteries, or beer, or flashlights, or generators, or any of the usual things that we might imagine. The item was strawberry pop tarts. Therefore, Walmart stocked their stores with tons of strawberry pop tarts prior to the next hurricanes, and they sold them out. That is a win-win: Walmart wins by making the sell, and customers win by getting the product that they most want.
Now that it has been established what association rules are and how it is utilised, we are able to continue on how would one would be able to apply this to the current Extended Bakery dataset. The question now comes to, what good may come from the relationships and rules that will be found? Overall, many growth inducing outcomes can occur from the gleaning and utilization of rules. First of all, this would result in the owners of the business being much more knowledgeable on the subject of their own business. After pondering this information, one can also form many courses of action based on solid theories backed by proof rather than on simple guesses. The business may increase its sales of specific items of their shop by displaying items with higher correlation together. This should boost the amount of sales of this set, as it increases awareness of the presence of the other item/s in the set. They can also choose sets of items that sell well together to offer as a much more desirable form of promotion, for example set meals and coupon discounts. This option is viable as it would draw more customers into the shop, and thus increase sales and awareness of the shop. In conclusion, using the association rules may result in higher sales and also increases awareness of the store, both of the potential customers and the owner's.
Dataset description
There are 3 files in the 75000 receipt folder. All the files are in .csv format. First, there is the sparse vectors file. The file name is 75000-out1.csv. Each row starts with a receipt ID followed by the food ID of the pastries and/or drinks sold in that receipt.
The next file is the full binary vector file. The file name is 75000-out2.csv. There are 51 columns. The first column is the receipt ID and the following 50 columns signifies the absence or presence of the item on the given receipt. For example, the second column would be 1 to represent if food ID = 0 was bought on the given receipt. The third column would be 0 to represent if food ID = 1 was not bought on the given receipt and so on.
The last file is the items table file. The file name is 75000i.csv. There are 3 columns. The first column is the receipt ID. The second column is the quantity of item purchased. The third column is the food ID of the item purchased. For example, if receipt ID = 1 bought two of food ID = 2 and food ID = 4 each, there would be two rows. The format will be like:
1 2 2
1 2 4
Preprocessing
Let’s start with the main file, 75000-out1.csv. As stated above, each row starts with a receipt ID followed by the food ID of the pastries and/or drinks sold in that receipt.
We want to eliminate some rows. The rows which are redundant are rows where only has one item because there will be no significant rules that can be generated if the receipt only contains one item. If the row only has one purchase, then the third column (V3) will be NA because the first column (V1) would be the receipt ID and the second column (V2) would be the first item purchase. Therefore, all rows which has their third column (V3) as NA will be rows which only has one item purchased.
Besides that, since we do not want our association rule mining algorithm to consider the receipt ID as part of the transaction we have to remove the first column of the dataset.
From 75000 initial transactions, now we have 71408 transactions left.
It would be quite meaningless to look at rules based on only their food ID. It would be quite hard to decipher rules like 1 -> 20 because looking at the food ID only instead of the food name will tell you nothing. Therefore, all the food ID should be converted to their names. To do so, we have to use the EB-build-goods.sql file to find out which food ID corresponds to which food name. The file can be opened in any text editors. The contents of the file are SQL statements such as:
insert into goods values (0,‘Chocolate’,‘Cake’,8.95,‘Food’);
We want to convert this into a .csv file to load it into R to perform the conversion. We cleaned up all the SQL statements using the Replace function in Sublime Text 3 so that each row is in the following format:
0,Chocolate Cake
We then save that file as food.csv. The .csv file can be found here. After converting all the food ID to food name, it is saved into a new file called foodUpdated.csv
Rule Mining Process
Since the dataset is still quite small, the Apriori algorithm would be enough to generate our rules rather than the FP tree. We want to generate rules with somewhat high confidence and applicable to a small crowd of people. The parameters for our algorithm is a minimum support of 0.02 (1428 receipts) and a minimum confidence of 0.5. We set the minimum length of the rule to be 2 to ensure that there are at least 2 items involved for all the rules generated. The algorithm generated 116 rules and completed all the processing in 0.07s in our local computer.
Seems like there are 116 rules generated. Now, to remove redundant rules (if any).
Looks like there were no redundant rules.
Summary and selection of rules
All the rules (except the last rule) have lifts of more than 5.5. This means that the correlation between all the items in their respective rules are very high.
From all the rules above, the rules which stand out are rules 1, 2, 10, 13, 58. The items in these rules seems to be appearing frequently in many other smaller rules as well. These items are Lemon Lemonade, Raspberry Lemonade, Lemon Cookie, Raspberry Cookie and Green Tea. We should also keep in mind that the antedecent and consequent seem to be switching places (a -> b and b -> a) in order to form new rules. Let us remove the rules based on their itemset while keeping in mind that the consequent and antedecent are interchangeable with only minor variations in support, confidence and lift. This is to make it easier to read. We will also sort the rules by support and lift.
Looks like a lot of rules were removed because they have belong to the same itemset. Take a look at rule number 2.
{Green Tea, Lemon Cookie, Raspberry Cookie, Raspberry Lemonade} => {Lemon Lemonade}
This was the rule we were discussing previously with the 5 items: Lemon Lemonade, Raspberry Lemonade, Lemon Cookie, Raspberry Cookie and Green Tea. Many other smaller rules (rules 1, 3, 4, 5 and many others) recycles these items as well. Thus, this is one rule that we could show our customer. Let us take a look when the rules are sorted by support.
Let us take rule 1, rule 2 and rule 3 because their support level is the highest and rule 1 and 2 have an acceptable level of high confidence (>0.55). We chose the 3 rules with the highest support levels because we want the items that many people are buying.
{Apricot Danish} => {Cherry Tart}
{Marzipan Cookie} => {Tuile Cookie}
{Chocolate Coffee} => {Chocolate Cake}
Recommendation Part 1
First, let us explain this rule: {Green Tea, Lemon Cookie, Raspberry Cookie, Raspberry Lemonade} => {Lemon Lemonade}
Please keep in mind that we removed some rules based on the itemset and the antecedent and consequent are interchangeable. There are 2 choices that we would like to advice to the bakery manager:
The manager should bundle the items together. It could be be any combination that he likes because there are smaller sized rules which is a subset of the 5 items. For example, he could bundle Lemon Cookies + Raspberry Cookies + Green Tea or Lemon Cookies + Raspberry Cookies + Raspberry Lemonade or Lemon Cookies + Raspberry Cookies + Lemon Lemonade, and so on. He could even bundle them all in a package if he wanted to because the rule holds true even for 5 items. But how much of each item should he include per bundle if he were to bundle all 5 items together? We shall try and answer that question later.
The manager could bundle the cookies together but exclude the drinks. Perhaps, he could place the drinks nearby the cookies so that they could just grab a drink after purchasing the cookies. He could experiment by slightly increasing the price of Green Tea if he wanted to because since Green Tea is already on the LHS of the rule, it means that people are already buying it together with the two types of cookies. We chose Green Tea because from the EB-build-goods.sql file, we saw that the original price of Green Tea is $1.85 while both the Lemonades are at $3.25. Since Green Tea is that much cheaper than both the Lemonades, maybe a slight increase of price will not be a deterrant for buying Green Tea.
Next, we will explain the three 2 items rules:
{Apricot Danish} => {Cherry Tart}
{Marzipan Cookie} => {Tuile Cookie}
{Chocolate Coffee} => {Chocolate Cake}
We would recommend the manager to produce coupons for the two items rules. For example, the coupons could be:
Buy 5 Apricot Danishes for 1 free Cherry Tart
Buy Chocolate Cake and Chocolate Coffee for $ X.XX
Buy 10 Marzipan Cookies and 10 Tuile Cookies for $ X.XX
Recommendation Part 2 (How much to bundle?)
Let us explore on how many items should the manager bundle together for the rule if he were to bundle all 5 items together:
{Green Tea, Lemon Cookie, Raspberry Cookie, Raspberry Lemonade} => {Lemon Lemonade}
An average of 3 items each for all 5 items were purchased whenever all of them were bought in a single receipt. This would mean that we should bundle 3 Lemon Cookies, 3 Raspberry Cookies, 3 Green Tea, 3 Raspberry Lemonade and 3 Lemon Lemonade in a single package. This is a very nice coincidence because there are 3 kinds of flavours (Lemon, Raspberry and Green Tea) in the package.
The R Script for the exploratory exercise can be found here.