Market basket analysis is a fairly well understood problem for SSAS – it’s best handled using many-to-many relationships, as shown in the excellent The Many-to-Many Revolution white paper. However the other week I was asked a question that was an interesting variation on this problem, and which shows up one of the limitations of the many-to-many approach.
Let me show you an example. The following PivotTable on the Adventure Works cube shows the number of distinct customers who bought products in the product category Bikes:
The Customer Count measure is a distinct count measure, the Country hierarchy from the Customer dimension is shown on rows, and the slicer here shows the Category hierarchy from the Product dimension. If you select Bikes and Clothing in the slicer, like so:
… the measure values are larger because what you’re seeing now is the number of distinct customers who bought products in either the category Bikes or the category Clothing.
But what if you want to see the number of customers who bought Bikes AND Clothing? Well, setting up a many-to-many relationship will give you this (and a lot more useful stuff too) but there are some limitations with this approach:
- Adding the m2m relationship and the extra measure groups it needs increases processing time
- You have to select your two product categories from two different (but identical) hierarchies
- You can only select two product categories – if you want to add a third to your AND filter you need to add a third m2m relationship, and so on. It’s not very flexible.
It is possible to get around these issues with a pure MDX approach, and apply an AND filter that works with however many product categories that the user selects.
The key to understanding how this works is to see how you would write an MDX calculated member that does an AND filter on two product categories:
Here I’ve used two nested NonEmpty() functions to filter the set of all customers, first to get the ones that bought Clothing, then to filter these customers again to get the ones that bought Bikes; there’s also an EXISTING there to get only the ones in the current country.
Once again, to add more product categories to the AND filter you need to add more nested NonEmpty() functions… which means you need a way to dynamically generate the code, which of course you can do using StrToSet(). Now normally I avoid using StrToSet() inside MDX calculations because it can cause serious performance problems but in this case it’s the best choice.
The following named set and calculated measure show how to solve the problem for Adventure Works:
The dynamic named set is there to help work out what is selected in the slicer in my Excel worksheet, as shown in Mosha’s old post here. The calculated measure is where all the interesting stuff happens: it uses the ability of the Generate() function to iterate over a set (in this case the dynamic named set), evaluate a string expression and concatenate these strings. The output of this is a series of nested NonEmpty()s, which then goes to StrToSet() to be evaluated, and then the contents of that set are counted.
This technique does not replace using a m2m relationship, because it will not allow you to show a query with product category on rows and columns and the number of customers who bought each combination shown. However, if all you want to do is show the number of customers who have bought one specific combination, this has a lot of advantages.
For all of your MDX fans out there, here’s another approach I came up with which doesn’t perform quite as well but is so much fun I had to share it:
I’m using Generate() here again, but this time I’m using it with inline named sets (which are even worse for performance than StrToSet) in a way similar to this classic post. The theory is the same though: it’s a way of calling NonEmpty() multiple times over a set with an arbitrary number of items in.
Here’s what the calculation does:
- Inside the first Intersect() I’m declaring an inline named set called MYCUSTOMERS, then doing the Intersect() with an empty set to return… and empty set. But I’ve managed to declare my inline named set, which is the point here.
- I’m then iterating over the same dynamic named set shown in my earlier example using Generate(), and:
- For all but the last product category in that set, I’m doing the NonEmpty() on the contents of MYCUSTOMERS on the current product category and then overwriting the contents of MYCUSTOMERS with the output. However, for these iterations of Generate() I’m returning an empty set.
- For the last product category in the set I’m actually returning the output of NonEmpty() over MYCUSTOMERS for the current product category. The reason I’m only returning something other than an empty set on the last iteration of Generate() is that Generate() returns the union of all the sets returned by each iteration, and that would give me the wrong results! I only want to output the set from the final iteration.