One of the more popular blog posts from my archives (86 comments so far) is the one I wrote on “Counting New and Returning Customers in MDX”. The trouble with all of the calculations in there is that they execute in cell-by-cell mode, and therefore perform quite badly.
For example, take the following query on Adventure Works to find the number of returning customers (customers who have bought from us today and have also bought something before in the past):
On a cold cache this takes 47 seconds on my laptop and a quick look in Profiler shows this executes in cell-by-cell mode. In the comments on the original post Deepak Puri suggested an alternative approach using the Customer Count distinct count measure:
Interestingly, this performs even worse than the previous query (although I would have expected it to be better). So how can we write a query that returns in a reasonable amount of time?
I haven’t found a way to do this for a calculated measure defined on the server, to be used in a true ad hoc query environment like Excel (any suggestions welcome – please leave a comment if you can do it), but I have got a way of optimising this calculation for scenarios where you have control over the MDX being used, such as in SSRS.
Here’s the query:
On my laptop, this query executes in around 5 seconds on a cold cache. The reason it’s so much faster is also the reason it can’t be used in ad hoc queries – it uses named sets to find all the combinations of customer date needed by the query in one operation. Here’s a step-by-step explanation of how it works:
- First of all, the customerdates set gets a set of tuples containing every single combination of day and customer where a purchase was made, using a simple Nonempty().
- Next, the nondistinctcustomers set takes the customerdates set and removes the dates, so what we are left with is a list of customers. It’s not a list of distinct customers, however – a given customer may appear more than once. This still represents a list of the customers that bought something each day, it’s just that we no longer have any information about which day we’re looking at.
- The customercountsum measure allows us to take the list of customers in the nondistinctcustomers set and find out which customers bought something in any given day. It’s a running sum of the Customer Count measure. This is a distinct count measure, and usually you wouldn’t use the Sum() function on a distinct count, but it’s important we do here. How is it used? For example, let’s imagine we had just three days of data: on the first day we had three customers, on the second four customers and on the third five customers. That would result in the nondistinctcustomers set containing twelve (not necessarily distinct) customers. We can then use the running sum of a distinct count of customers to find out the index of the item in nondistinctcustomers that is the last customer in the list for each day. So on day two we would have a running sum of seven, and therefore the seventh item in nondistinctcustomers gives us the last customer in the list for that day.
- Finally, the Returning Customers V3 measure gives us the number of returning customers each day. It uses the customercountsum measure to find the subsets of the nondistinctcustomers set that represent the customers that bought on the current day and the customers that bought on all days up to yesterday, then uses the Intersect() function to find the returning customers.