Analysing SQLBits 7 Data, Part 1: Session Selections

As I’ve said before, I’m involved with the organisation of the SQLBits conferences here in the UK and at the moment the SQLBits committee is busy preparing for SQLBits 8 in April (make sure you come – it’s going to be great!). This eats up a lot of my spare time – spare time that I usually spend blogging – so I thought I’d kill two birds with one stone and blog about some of the BI-related stuff I’m doing for SQLBits (I’ve done this before but there’s plenty more mileage in this subject). It turns out that a lot of the things SQLBits needs to do require classic ‘self-service BI’: solving a business problem as best you can with whatever data and tools are to hand. It’s good to see things from the end user’s point of view for a change!

First of all, let’s take a look at scheduling: how can we make sure that we don’t run two sessions in the same time slot that are interesting to the same type of attendee? If attendees are forced to choose between two sessions they want to see, they won’t be happy – we want to create a schedule with as few difficult choices as possible. Unfortunately we don’t collect data about which sessions attendees actually go to, and even if we did it would be no use because, of course, by the time a session runs it’s too late to fix the agenda. However, well before the conference we allow people to vote for the ten sessions they’d most like to see out of all those that have been submitted (voting has just opened for SQLBits 8, incidentally). We use this data to help us decide which sessions make it onto the agenda, so we can also use it to help avoid overlaps.

This data can be visualised very effectively using NodeXL. To do this, I ran a SQL query on the SQLBits database that gave me every combination of two sessions that had been picked by the same user, so for example if a user had selected sessions A, B and C my query returned the pairs A-B, A-C and B-C. This gave me my list of edges for the graph and for the size of the edges I used the number of times the combination of sessions occurred, so I could see the most popular combinations. Unfortunately with 107 sessions on the list and thousands of edges, I got something that looked like one of my four-year-old daughter’s scribbles rather than a useful visualisation, so I decided to filter the data and look at one session at a time. Here’s what I got for my session ‘Implementing Common Business Calculations in DAX’:


Still not great, but at least with the thicker lines you can see where the strongest relationships are and when you select these relationships it highlights them and the nodes on either end, so you can read the names of the sessions. I then realised you could use the ‘dynamic filters’ functionality to filter out the weaker relationships, making it even easier to pick out the strongest ones:


So we can now see that the strongest relationships were with the sessions “You can create UK maps with SSRS 2008 R2” and “Data Mining with SQL Server 2008”. I’m still getting to grips with NodeXL which, I have to say, I like more and more and which deserves more visibility in the MS BI world.
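For anyone who wants to build the same kind of edge list without going through SQL, here is a minimal sketch in Python of the pairing logic described above. It is not the query I ran against the SQLBits database: the sample votes, the function names and the `min_weight` parameter are all made up for illustration, and the threshold is only a rough stand-in for what NodeXL's dynamic filters do.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample votes (user -> sessions picked); not real SQLBits data.
votes = {
    "user1": ["A", "B", "C"],
    "user2": ["A", "B"],
    "user3": ["B", "C"],
}

def count_session_pairs(votes):
    """Count how often each unordered pair of sessions was picked by the same user."""
    pair_counts = Counter()
    for sessions in votes.values():
        # Sort and de-duplicate so A-B and B-A are counted as the same edge.
        for pair in combinations(sorted(set(sessions)), 2):
            pair_counts[pair] += 1
    return pair_counts

def edges_for_session(pair_counts, session, min_weight=1):
    """Keep only the edges that touch one session and meet a minimum weight."""
    return {
        pair: weight
        for pair, weight in pair_counts.items()
        if session in pair and weight >= min_weight
    }

pair_counts = count_session_pairs(votes)
strong_edges = edges_for_session(pair_counts, "A", min_weight=2)
```

Each key in `pair_counts` is an edge and each value is the weight you would use for the line thickness; raising `min_weight` plays the same role as filtering out the weaker relationships in the graph.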

Anyway, since this is a basket analysis problem I also thought of using the Data Mining Addin for Excel, but since I have Office 2010 64-bit I couldn’t. Luckily though the nice people at Predixion do have a version of their addin that works on 64-bit, and they gave me another eval license to use on my data. Getting useful results out of Predixion turned out to be ridiculously easy: I just copied the raw data into Excel, clicked the ‘Shopping Basket Analysis’ button on the ribbon and it spat out a pair of nicely-formatted reports. The first shows ‘Shopping Basket Recommendations’, i.e. if you select one session it recommends another one you might like:


And the second shows the most commonly-occurring ‘bundles’ of sessions that were picked together:


It almost feels too easy… but I think you can see that the results look correct and to be honest it’s much easier to do something useful with this than the NodeXL graph. When we close the voting for SQLBits 8 I’ll repeat the exercise and hand the results over to Allan, who’s in charge of speakers, and he’ll be able to use them to put together our agenda for Saturday April 9th.
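If you don't have a basket analysis addin to hand, the general idea behind the two reports can be roughly approximated with simple co-occurrence counts. The sketch below is only an illustration: I don't know what algorithm Predixion actually uses, the baskets are invented, and `recommend` and `top_bundles` are hypothetical helper names. The recommendation score is the share of voters for one session who also picked the other.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set is one voter's selected sessions (invented data).
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "D"},
]

def count_baskets(baskets):
    """Return (votes per session, votes per unordered pair of sessions)."""
    session_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        session_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return session_counts, pair_counts

def recommend(baskets, session, top_n=3):
    """Rank other sessions by the fraction of this session's voters who also chose them."""
    session_counts, pair_counts = count_baskets(baskets)
    scores = {}
    for (a, b), n in pair_counts.items():
        if session == a:
            scores[b] = n / session_counts[session]
        elif session == b:
            scores[a] = n / session_counts[session]
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

def top_bundles(baskets, top_n=3):
    """Return the most frequently co-selected pairs of sessions."""
    _, pair_counts = count_baskets(baskets)
    return pair_counts.most_common(top_n)

recommendations_for_a = recommend(baskets, "A")
best_bundle = top_bundles(baskets, top_n=1)
```

This only looks at pairs, so it is much cruder than a proper association rules model, but it is enough to flag sessions that probably shouldn't share a time slot.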