# Avoiding an expensive TOPCOUNT

An interesting MDX problem came up the other day, the solution to which I thought would be worth sharing. It involved a query which for a set of months showed the top n products plus a member which showed the sum of all other products outside the top n – the Foodmart 2000 equivalent is something like this:

WITH
SET MYDATES AS ‘{[Time].[1997].[Q1].[1]:[Time].[1997].[Q2].[5]}’
MEMBER PRODUCT.OTHERS AS ‘([Product].[All Products], TIME.CURRENTMEMBER) – SUM(TOPCOUNT([Product].[PRODUCT NAME].MEMBERS, 3, (MEASURES.[UNIT SALES], TIME.CURRENTMEMBER)), (MEASURES.[UNIT SALES], TIME.CURRENTMEMBER))’
SELECT MEASURES.MEMBERS ON 0,
GENERATE(
MYDATES,
CROSSJOIN(
{TIME.CURRENTMEMBER},
{[Product].[All Products], TOPCOUNT([Product].[PRODUCT NAME].MEMBERS, 3, (MEASURES.[UNIT SALES], TIME.CURRENTMEMBER))
, PRODUCT.OTHERS}
) )
ON 1
FROM SALES

The question was, could the query be optimised? What struck me was that in the query the top n was worked out in two places: first in the rows axis, then in the calculated member. And doing a TOPCOUNT is usually a very expensive operation. So how could you make sure you only did it once? At first I thought you might be able to declare a set containing the top n inside the GENERATE statement and then reference it from the calculated member, but I couldn’t make that work. String manipulation turned out to be the key and the Foodmart 2000 version of the optimised query is as follows:

WITH
SET MYDATES AS ‘{[Time].[1997].[Q1].[1]:[Time].[1997].[Q2].[5]}’
MEMBER MEASURES.MYTOPSTR AS ‘SETTOSTR(TOPCOUNT([Product].[PRODUCT NAME].MEMBERS, 3, (MEASURES.[UNIT SALES], TIME.CURRENTMEMBER)))’
MEMBER PRODUCT.OTHERS AS ‘[Product].[All Products] – SUM(STRTOSET(CSTR(MEASURES.MYTOPSTR)), (MEASURES.[UNIT SALES], TIME.CURRENTMEMBER))’
SELECT MEASURES.MEMBERS ON 0,
GENERATE(
MYDATES,
CROSSJOIN(
{TIME.CURRENTMEMBER},
STRTOSET(CSTR(MEASURES.MYTOPSTR))
, PRODUCT.OTHERS}
) )
ON 1
FROM SALES

What I did was create a calculated measure which returned the top n set in string form; using STRTOSET you could then reference it from both the rows axis and the OTHERS calculated member. The reason this second query is faster than the first – and the guy who came to me with the problem says that the second version runs in almost a quarter of the time – is that the values returned by calculated members are cached so the second time that Measures.MyTopStr is called it doesn’t need to do the topcount again.

The one drawback to this approach is something that is common to all MDX that relies on STRTOSET and other similar functions – if you are trying to turn a very large set into a string, you may exceed the maximum string length and get an error. However if, as in this example, you’re only dealing with sets containing a few members you’ll never run into this problem.

# Handling missing members

This is something of an FAQ on the newsgroups, but since someone asked me about it again recently I thought I’d write up the answer up for posterity: in an MDX query, how do you deal with the problem of a query breaking because members which were once present in a cube are no longer present? For example, you have a query that references the member [Time].[2002].[Q1].[M1] but in a freak dimension-processing accident you suddenly find that this member no longer exists in your cube. In this scenario, trying to run your query will raise an error.

Now in any well-run data warehouse this shouldn’t really happen but in the real world it does. In AS2K MDX there isn’t an elegant solution to this problem – you have to use the VBA ISERROR function as follows:

WITH
MEMBER MEASURES.TEST AS
‘IIF(VBA!ISERROR(STRTOVALUE("MEASURES.[UNIT SALES]")), 1,0)’
MEMBER MEASURES.TEST2 AS
‘IIF(VBA!ISERROR(STRTOVALUE("MEASURES.[THIS WONT WORK]")), 1,0)’
SELECT {MEASURES.TEST, MEASURES.TEST2} ON 0 FROM SALES

In Yukon, though, it’s handled much better: dimensions have a property called MDXMissingMemberMode, which you can set to either ‘Error’ which reproduces the AS2K behaviour, or ‘Ignore’ which means that the member is ignored if the dimension part of the unique name is valid but the name or key part is not. The default behaviour is ‘Ignore’.

I have to say that this caused some head-scratching here when we first came across it: when you’re writing MDX manually, sometimes you don’t spot the fact that a member isn’t appearing in your results because you’ve misspelled its name. Overall though, I think the benefits of this new functionality outweigh any potential confusion.

# Rant: Reporting Services and Analysis Services

I’m just finishing off my presentation for PASS, in the course of which I’ve come up with a whole bunch of topics to blog about in the future, but now I’ve got some more spare time I’ve started to look at Reporting Services 2005 in more detail. Unfortunately this has also reminded me of something that has irritated me for a long time and which I need to get off my chest…

Now, before I get all steamed up I should say that I have done quite a bit of work with Reporting Services and like it a lot – its extensibility and programmability mean you can do a lot of cool stuff with it. However in my opinion it has one big design flaw, a flaw that Yukon doesn’t really seem to be doing much to correct, and which makes me roll my eyes and tear my hair whenever I think about it: Reporting Services just doesn’t support Analysis Services properly.

Before the comments start, yes, I have seen all the new features in RS2005 to do with AS support, but in my mind the problem is a fundamental one. And it’s that RS2005 still expects to receive a flattened rowset when you’re querying AS. Why is this a problem? Well, MDX gives you all this rich functionality to decide how you want to lay out your resultset, crossjoining as many dimensions you want on rows and columns, slicing by what you want, and RS then expects you to ignore all this and make MDX behave like SQL. You have to put measures on columns (what happens if you want to put measures in the WHERE clause then?) and all the rest of your dimensions on rows or in the WHERE clause, and then use the matrix control to recreate what you used to be able to do with a simple crossjoin. In RS2000 it was only recommended that you write your queries this way, and you could ignore this advice if you liked; in RS2005 this rule is actually enforced! What about backwards compatibility? Argh! It’s clumsy, it’s restrictive, and it’s a waste of time.

All I want is to be able to enter any MDX statement I want, assign it to a matrix-like control that understands multidimensional resultsets, and be able to display the results in a report. Not difficult, surely? I know it might break the purity of the design, but why should every data provider be forced to dumb itself down to the level of SQL? After all, isn’t one of the unique selling points of AS that MDX allows you to get round exactly this kind of inherent limitation in SQL? To me it’s unbelievable that the two Microsoft flagship BI products don’t understand each other on this basic level.

OK, rant over. Is anyone listening?

# Improving performance of VPC

Great tip from Roman Rehak here, if like me you’re running Yukon on Virtual PC. NTFS compression certainly hasn’t harmed my performance and it’s saved me 4Gb of disk space!

# Finding the currentmember on all dimensions, take#3

Following on from Mosha’s comment revealing the new .CurrentOrdinal function, I thought I’d mention how this solves the problem even though, for me, it’s still not quite resolved…

After a bit of grovelling, Mosha sent me the following example of how .CurrentOrdinal works:

with member measures.x as

generate(strtoset("{"+rept("measures.defaultmember,", dimensions.count-1)+"measures.defaultmember}") as a,

dimensions(a.currentordinal-1).name + ",")

select x on 0 from [Sales]

The reason .CurrentOrdinal is the key is that it allows you to create a set containing n instances of the same member and still know what iteration you were on inside a GENERATE statement. This frees you from the restriction present in my first attempt at this problem, which relied on a level being present in your cube which had at least as many members on it as there were dimensions. Remember that without this new function, you could only use the RANK function to try to find the iteration number and this only worked if each of the members in the set were unique.

The problem I had was that the vba REPT function doesn’t work on my Yukon install, and I couldn’t find another way of generating a set containing n instances of the same member. However it seems to work for Mosha so I can only assume it’s a bug in the February CTP of some sort, and in later builds we’ll all be able to use this technique.

# More on querying dimensions

Apart from Mosha’s putting a dampener on my excitement (should I really be getting excited by MDX anyway? Is it healthy?) about being able to query a dimension directly, by saying that in the future only database administrators will be able to do it after the next CTP, there was another interesting comment on yesterday’s entry – why is being able to do this any more efficient than doing the following:

SELECT MYDIMENSION.MEMBERS ON 0, {} ON 1 FROM MYCUBE

Hmm, well, good question. To be honest I’d not thought of doing this – I’d always done

SELECT MYDIMENSION.MEMBERS ON 0 FROM MYCUBE

which of course does bring back unwanted data. You learn something new every day! In the future I suppose the first of these queries is going to be the best option available to non-database-admins, but I was curious to see if querying a dimension was any more efficient than querying a cube. Now I’m sure a certain someone out there with intimate knowledge of how the AS engine handles each of these queries will be able to give a definitive answer; in the meantime I fired up Profiler to see if I could see any difference between the two approaches. Unfortunately I don’t have much experience about the data Profiler returns and I have to say that what it did return didn’t seem that relevant, but looking at the durations of queries it seemed that querying a dimension directly was slightly more efficient although the differences were too tiny to be noticeable.

# Querying a dimension not a cube

A small but significant bit of new functionality in AS2005, pointed out by Ohjoo Kwon on the beta newsgroup: you can now run MDX queries against dimensions as well as cubes. So, for example, you can do the following:

SELECT [MYDIMENSION].MEMBERS ON 0 FROM [\$MYDIMENSION]

Why is this useful? Anyone who has written any kind of client interface for AS2K and used the ADOMD catalog object will know how slow it is, and that it’s easier to write queries instead; and similarly in Reporting Services very often you have to write queries to populate lists of valid values for parameters. The problem is that when you run a query on a cube you always have to bring back at least one measure’s worth of data, which is a bit of a waste and potentially time- and resource-consuming. Now you don’t have to bring back any measure data because you can query the dimension instead!