1 Data
Here we use the BASKETS.txt data file to discover association rules between products from customers' purchase records.
> shop=read.table("D:/Desktop/BASKETS.txt",header = TRUE,sep = ",")
2 The Apriori algorithm
> install.packages("arules")
> install.packages("arulesViz")
> library(arules)
> library(arulesViz)
> shopData = as(shop, "transactions")
Warning message:
Column(s) 1, 2, 3, 4, 5, 6, 7 not logical or factor. Applying default discretization (see '? discretizeDF').
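The warning above appears because the non-logical columns are discretized or converted into items automatically. One way to avoid this, when only product-to-product rules are wanted, is to coerce just the logical product columns. The sketch below is a minimal illustration on a toy data frame; `shop_toy`, `is_product`, and `basket_toy` are hypothetical names, and the values are made up rather than taken from BASKETS.txt.

```r
library(arules)

# Toy stand-in for the real data (hypothetical values, far fewer columns and rows)
shop_toy <- data.frame(
  cardid   = c(39808, 67362, 10872, 26748),
  sex      = c("M", "F", "M", "F"),
  fruitveg = c(TRUE, FALSE, TRUE, TRUE),
  fish     = c(TRUE, FALSE, FALSE, TRUE),
  beer     = c(FALSE, TRUE, TRUE, FALSE)
)

# Identify the logical (product flag) columns and keep only those
is_product <- vapply(shop_toy, is.logical, logical(1))

# Logical columns become one item each, named after the column
basket_toy <- as(shop_toy[, is_product], "transactions")

itemLabels(basket_toy)
```

With this approach no discretization is applied, because every remaining column is already logical. The trade-off is that customer attributes such as sex or age can no longer appear in any rule.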
> fit_apriori=apriori(shopData,parameter = list(support = 0.1,confidence = 0.8,minlen = 1))
> fit_apriori
set of 53 rules
> inspectDT(sort(fit_apriori,by = "lift"))
Note: some rules have a consequent such as sex=M, which is clearly not what we want. Besides inspectDT(), rules can also be printed directly with inspect().
3 Filtering rules
> itemLabels(shopData)
[1] "cardid=[1.02e+04,4.22e+04)" "cardid=[4.22e+04,7.86e+04)"
[3] "cardid=[7.86e+04,1.1e+05]" "value=[10,22.6)"
[5] "value=[22.6,36.1)" "value=[36.1,49.9]"
[7] "pmethod=CARD" "pmethod=CASH"
[9] "pmethod=CHEQUE" "sex=F"
[11] "sex=M" "homeown=NO"
[13] "homeown=YES" "income=[1.02e+04,1.68e+04)"
[15] "income=[1.68e+04,2.36e+04)" "income=[2.36e+04,3e+04]"
[17] "age=[16,26)" "age=[26,39)"
[19] "age=[39,50]" "fruitveg"
[21] "freshmeat" "dairy"
[23] "cannedveg" "cannedmeat"
[25] "frozenmeal" "beer"
[27] "wine" "softdrink"
[29] "fish" "confectionery"
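Note: items 1 to 9 (cardid, value, pmethod) should not take part in association rules; items 10 to 19 (sex, homeown, income, age) may appear only on the left-hand side of a rule.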
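As an alternative to filtering after mining, the same constraints can be passed to apriori() through its `appearance` argument, so that unwanted rules are never generated. The following is a sketch on toy data, not the tutorial's own method; the data frame `toy`, its values, and the names `excluded` and `lhs_only` are assumptions made for illustration.

```r
library(arules)

# Toy data (hypothetical values) mixing customer attributes and product flags
toy <- data.frame(
  pmethod   = factor(c("CARD", "CASH", "CARD", "CASH", "CARD", "CASH")),
  sex       = factor(c("M", "M", "F", "M", "F", "M")),
  beer      = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  cannedveg = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
  fish      = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)
toy_trans  <- as(toy, "transactions")
item_names <- itemLabels(toy_trans)

# Items that must never appear, and items allowed only on the left-hand side
excluded <- grep("^pmethod=", item_names, value = TRUE)
lhs_only <- grep("^sex=", item_names, value = TRUE)

toy_rules <- apriori(
  toy_trans,
  parameter  = list(support = 0.3, confidence = 0.6, minlen = 2),
  appearance = list(none = excluded, lhs = lhs_only, default = "both"),
  control    = list(verbose = FALSE)
)

inspect(toy_rules)
```

On the real data, `excluded` would correspond to itemLabels(shopData)[1:9] and `lhs_only` to itemLabels(shopData)[10:19]. Constraining the search this way also reduces the number of candidate itemsets, which may matter on larger data sets.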
> rules_sub <- subset(fit_apriori, subset = rhs %in% itemLabels(shopData)[20:30] & lhs %in% itemLabels(shopData)[10:30] & lift>3)
> inspect(rules_sub)
lhs rhs support confidence coverage lift count
[1] {homeown=NO,
age=[16,26),
fish} => {fruitveg} 0.111 0.9098361 0.122 3.042930 111
[2] {homeown=NO,
age=[16,26),
fruitveg} => {fish} 0.111 0.9568966 0.116 3.277043 111
[3] {income=[1.02e+04,1.68e+04),
frozenmeal,
beer} => {cannedveg} 0.138 0.9718310 0.142 3.207363 138
[4] {income=[1.02e+04,1.68e+04),
cannedveg,
beer} => {frozenmeal} 0.138 0.9787234 0.141 3.240806 138
[5] {income=[1.02e+04,1.68e+04),
cannedveg,
frozenmeal} => {beer} 0.138 0.9387755 0.147 3.204012 138
[6] {sex=M,
frozenmeal,
beer} => {cannedveg} 0.141 0.9527027 0.148 3.144233 141
[7] {sex=M,
cannedveg,
beer} => {frozenmeal} 0.141 0.9400000 0.150 3.112583 141
[8] {sex=M,
cannedveg,
frozenmeal} => {beer} 0.141 0.9276316 0.152 3.165978 141
[9] {sex=M,
income=[1.02e+04,1.68e+04),
beer} => {frozenmeal} 0.136 0.9714286 0.140 3.216651 136
[10] {sex=M,
income=[1.02e+04,1.68e+04),
frozenmeal} => {beer} 0.136 0.9510490 0.143 3.245901 136
[11] {sex=M,
income=[1.02e+04,1.68e+04),
beer} => {cannedveg} 0.136 0.9714286 0.140 3.206035 136
[12] {sex=M,
income=[1.02e+04,1.68e+04),
cannedveg} => {beer} 0.136 0.9714286 0.140 3.315456 136
[13] {sex=M,
income=[1.02e+04,1.68e+04),
frozenmeal} => {cannedveg} 0.137 0.9580420 0.143 3.161855 137
[14] {sex=M,
income=[1.02e+04,1.68e+04),
cannedveg} => {frozenmeal} 0.137 0.9785714 0.140 3.240303 137
[15] {sex=M,
income=[1.02e+04,1.68e+04),
frozenmeal,
beer} => {cannedveg} 0.136 1.0000000 0.136 3.300330 136
[16] {sex=M,
income=[1.02e+04,1.68e+04),
cannedveg,
beer} => {frozenmeal} 0.136 1.0000000 0.136 3.311258 136
[17] {sex=M,
income=[1.02e+04,1.68e+04),
cannedveg,
frozenmeal} => {beer} 0.136 0.9927007 0.137 3.388057 136
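Note: subset() extracts the rules that meet our requirements. rhs denotes the right-hand side (consequent) of a rule and lhs the left-hand side (antecedent). Keep in mind that lhs %in% x is true when the left-hand side contains at least one of the items in x, so it does not by itself exclude other items.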
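The matching operators used inside subset() differ in how strictly they compare an itemset with a list of items. The sketch below contrasts `%in%` (any listed item present), `%ain%` (all listed items present), and `%oin%` (only listed items present) on three hand-made itemsets. The itemsets and the `allowed` vector are hypothetical, and `%oin%` assumes a reasonably recent arules release.

```r
library(arules)

# Three hand-made itemsets (hypothetical), stored as transactions
toy_sets <- as(
  list(
    c("sex=M", "beer"),
    c("beer", "fish"),
    c("sex=M", "value=low", "beer")
  ),
  "transactions"
)
allowed <- c("sex=M", "beer", "fish")

# TRUE when the itemset contains at least one of the listed items
any_match <- toy_sets %in% allowed

# TRUE when the itemset contains every listed item
all_match <- toy_sets %ain% c("sex=M", "beer")

# TRUE when the itemset contains nothing outside the listed items
only_match <- toy_sets %oin% allowed

data.frame(any_match, all_match, only_match)
```

The third itemset contains value=low, so it passes `%in%` but fails `%oin%`. To strictly enforce "only items 10 to 30 on the left-hand side", a condition of the `%oin%` kind (or a negated `%in%` on the unwanted items) is the closer fit.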
> rules_sub1 <- subset(fit_apriori, subset = rhs %in% itemLabels(shopData)[20:30] & lhs %in% "age=[16,26)" & lift>3)
> inspect(rules_sub1)
lhs rhs support confidence coverage lift count
[1] {homeown=NO, age=[16,26), fish} => {fruitveg} 0.111 0.9098361 0.122 3.042930 111
[2] {homeown=NO, age=[16,26), fruitveg} => {fish} 0.111 0.9568966 0.116 3.277043 111
Note: this keeps only the rules whose left-hand side contains age=[16,26).
> rules_sub2 <- subset(fit_apriori, subset = rhs %in% itemLabels(shopData)[20:30] & !(lhs %in% itemLabels(shopData)[1:19]) )
> inspect(rules_sub2)
lhs rhs support confidence coverage lift count
[1] {frozenmeal, beer} => {cannedveg} 0.146 0.8588235 0.170 2.834401 146
[2] {cannedveg, beer} => {frozenmeal} 0.146 0.8742515 0.167 2.894873 146
[3] {cannedveg, frozenmeal} => {beer} 0.146 0.8439306 0.173 2.880309 146
Note: this keeps only the rules between products, with no customer attributes on either side.
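The columns printed by inspect() are related to one another: confidence is support divided by coverage, and lift is confidence divided by the support of the right-hand side. As a quick check, the sketch below recomputes these quantities from the three product-only rules listed above; the numbers are copied from that output and the variable names are chosen here for illustration.

```r
# Values copied from the inspect(rules_sub2) output above
support  <- c(0.146, 0.146, 0.146)
coverage <- c(0.170, 0.167, 0.173)
lift     <- c(2.834401, 2.894873, 2.880309)

# confidence = support(lhs and rhs together) / support(lhs)
confidence <- support / coverage
round(confidence, 7)   # 0.8588235 0.8742515 0.8439306

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support <- confidence / lift
round(rhs_support, 3)  # 0.303 0.302 0.293
```

The recomputed confidences match the printed column, and the implied right-hand-side supports suggest that cannedveg, frozenmeal, and beer each appear in roughly 30% of the baskets. A lift near 2.9 therefore means these items occur together almost three times as often as independence would predict.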