While I was thinking about basket analysis last week I started wondering what kind of visualisations would be useful for this problem, and I remembered NodeXL (which I blogged about earlier this year). After all, isn’t basket analysis pretty similar to network analysis? It’s all just connections between things…
Anyway, taking the PowerPivot model and calculated measures from my last post, I started to look at how to import that data into the NodeXL Excel template. The first problem was how to present the data in a tabular form and the answer was the new Excel 2010 flattened pivot table that Kasper and Rob have blogged about before, with subtotals turned off. This gave an output that looked like this:
The second problem was slightly more tricky. NodeXL wants a set of edges to draw its graph, and in our case an edge will represent an instance where two products were bought together. However in the results above we have rows where the two products are identical, eg showing the number of customers who bought Apples with Apples, and we also have rows that show the same data in different ways, eg one row showing the number of customers who bought Apples and Cake and another row showing the number of customers who bought Cake with Apples. We need to get rid of these unwanted rows and we can do that with Excel 2010’s ability to use custom MDX to generate a named set. Here’s the set expression I used:
, [Product Bought With].[And Product]).NEXTMEMBER : NULL})
What I’m doing here is to taking the set of all Products, then using the Generate function to crossjoin each Product with the set of all Products from the And Product dimension, starting from the And Product that’s immediately after the And Product that has the same name as the current Product, to the end of the level.
This gives us the following set of rows:
We can now paste this into the Edges worksheet of the NodeXL template, so that NodeXL can work its magic. I’m not going to pretend to be an expert on NodeXL and indeed this data isn’t the most exciting in the world to visualise, but once I’d got the graph drawn a quick look through the NodeXL tutorial (and especially the section on analysing voting patterns in the Senate) showed me how to use the value of the measure ‘Customers buying both Products’ to control the opacity of the lines in the graph. And here it is:
From this we can see clearly that Bread and Cake were never bought together and that Apples and Bread were bought together more often that any other combination. Job done! With real data, I think NodeXL would prove very useful indeed for this kind of analysis and it would be great if NodeXL could work direct with data in PowerPivot (hint, hint) in the future. If anyone out there does try using NodeXL with their data for basket analysis, I’d be very interested to hear from them…