Analysing SQLBits 7 Data, Part 1: Session Selections

As I’ve said before, I’m involved with the organisation of the SQLBits conferences here in the UK and at the moment the SQLBits committee is busy preparing for SQLBits 8 in April (make sure you come – it’s going to be great!). This eats up a lot of my spare time – spare time that I usually spend blogging – so I thought I’d kill two birds with one stone and blog about some of the BI-related stuff I’m doing for SQLBits (I’ve done this before but there’s plenty more mileage in this subject). It turns out a lot of the things SQLBits needs to do requires classic ‘self-service BI’: solve a business problem as best you can with whatever data and tools are to hand. It’s good to see things from the end user’s point of view for a change!

First of all, let’s take a look at scheduling: how can we make sure that we don’t run two sessions in the same time slot that are interesting to the same type of attendee? If attendees are put in a situation where they are forced to choose between two sessions they want to see they won’t be happy – we want to be able to create a schedule where there are as few difficult choices as possible. Unfortunately we don’t collect data about which sessions attendees actually go to, and even if we did it would be no use because of course by the time the session runs it’s too late to fix the agenda. However, well before the conference we allow people to vote for the ten sessions out of all those that have been submitted that they’d like to see (voting has just opened for SQLBits 8, incidentally), and we use this data to help us decide which ones make it onto the agenda; we can therefore use this data to help avoid overlaps.

This data can be visualised very effectively using NodeXL. To do this, I ran a SQL query on the SQLBits database that gave me every combination of two sessions that had been picked by the same user, so for example if a user had selected sessions A, B and C my query returned the pairs A-B, A-C and B-C. This gave me my list of edges for the graph and for the size of the edges I used the number of times the combination of sessions occurred, so I could see the most popular combinations. Unfortunately with 107 sessions on the list and thousands of edges, I got something that looked like one of my four-year-old daughter’s scribbles rather than a useful visualisation, so I decided to filter the data and look at one session at a time. Here’s what I got for my session ‘Implementing Common Business Calculations in DAX’:


Still not great, but at least with the thicker lines you can see where the strongest relationships are and when you select these relationships it highlights them and the nodes on either end, so you can read the names of the sessions. I then realised you could use the ‘dynamic filters’ functionality to filter out the weaker relationships, making it even easier to pick out the strongest ones:


So we can now see that the strongest relationships were with the sessions “You can create UK maps with SSRS 2008 R2” and “Data Mining with SQL Server 2008”. I’m still getting to grips with NodeXL which, I have to say, I like more and more and which deserves more visibility in the MS BI world.

Anyway, since this is a basket analysis problem I also thought of using the Data Mining Addin for Excel, but since I have Office 2010 64-bit I couldn’t. Luckily though the nice people at Predixion do have a version of their addin that works on 64-bit, and they gave me another eval license to use on my data. Getting useful results out of Predixion turned out to be ridiculously easy: I just copied the raw data into Excel, clicked the ‘Shopping Basket Analysis’ button on the ribbon and it spat out a pair of nicely-formatted reports. The first shows ‘Shopping Basket Recommendations’, ie if you select one session it recommends another one you might like:


And the second shows the most commonly-occurring ‘bundles’ of sessions that were picked together:


It almost feels too easy… but I think you can see that the results look correct and to be honest it’s much easier to do something useful with this than the NodeXL graph. When we close the voting for SQLBits 8 I’ll repeat the exercise and hand the results over to Allan, who’s in charge of speakers, and he’ll be able to use them to put together our agenda for Saturday April 9th.

Basket Analysis, PowerPivot and NodeXL

While I was thinking about basket analysis last week I started wondering what kind of visualisations would be useful for this problem, and I remembered NodeXL (which I blogged about earlier this year). After all, isn’t basket analysis pretty similar to network analysis? It’s all just connections between things…

Anyway, taking the PowerPivot model and calculated measures from my last post, I started to look at how to import that data into the NodeXL Excel template. The first problem was how to present the data in a tabular form and the answer was the new Excel 2010 flattened pivot table that Kasper and Rob have blogged about before, with subtotals turned off. This gave an output that looked like this:

The second problem was slightly more tricky. NodeXL wants a set of edges to draw its graph, and in our case an edge will represent an instance where two products were bought together. However in the results above we have rows where the two products are identical, eg showing the number of customers who bought Apples with Apples, and we also have rows that show the same data in different ways, eg one row showing the number of customers who bought Apples and Cake and another row showing the number of customers who bought Cake with Apples. We need to get rid of these unwanted rows and we can do that with Excel 2010’s ability to use custom MDX to generate a named set. Here’s the set expression I used:

, {[Product].[Product].CURRENTMEMBER}
, [Product Bought With].[And Product]).NEXTMEMBER : NULL})

What I’m doing here is to taking the set of all Products, then using the Generate function to crossjoin each Product with the set of all Products from the And Product dimension, starting from the And Product that’s immediately after the And Product that has the same name as the current Product, to the end of the level.

This gives us the following set of rows:

We can now paste this into the Edges worksheet of the NodeXL template, so that NodeXL can work its magic. I’m not going to pretend to be an expert on NodeXL and indeed this data isn’t the most exciting in the world to visualise, but once I’d got the graph drawn a quick look through the NodeXL tutorial (and especially the section on analysing voting patterns in the Senate) showed me how to use the value of the measure ‘Customers buying both Products’ to control the opacity of the lines in the graph. And here it is:

From this we can see clearly that Bread and Cake were never bought together and that Apples and Bread were bought together more often that any other combination. Job done! With real data, I think NodeXL would prove very useful indeed for this kind of analysis and it would be great if NodeXL could work direct with data in PowerPivot (hint, hint) in the future. If anyone out there does try using NodeXL with their data for basket analysis, I’d be very interested to hear from them…

%d bloggers like this: