ASSP 1.2

Darren Gosbell has just done a new release of the Analysis Services Stored Procedure Project; he’s blogged about it here:
And of course, you can download it from Codeplex here:
It includes the partition healthcheck code that I blogged about here, as well as various other cool stuff.

M2M Dimension Optimisation Techniques Paper

Another new paper from the SQLCat team, this time on techniques that can be used to optimise cubes that use many-to-many relationships:
I’ve only skimmed through it so far, but I thought I’d mention two other tricks to try with m2m dimensions. One is breaking up large dimensions if you don’t ever need to see them, which I blogged about in detail here:!7B84B0F2C239489A!777.entry, and which can be very powerful. The other is introducing other dimensions to the intermediate measure group: for example, imagine you have a measure group showing the relationship between Products and the Components that make them up, you could try adding the Time dimension to that measure group; even though it will increase the size of the measure group because you are repeating the Product/Component combinations for each Time period, if all of your queries are sliced by a Time period and you know that the majority of your Products were only sold for a short space of time then it can be beneficial, although if neither of these conditions is true then doing this can have a negative impact on performance too.

SQLBits II – Last Chance to Submit a Session

It’s getting close to the deadline to submit a session for SQLits II next year. I know what you’re thinking, you’d like to attend but don’t have the time to prepare a presentation. Aren’t you tired of hearing the same old people (eg me) present though? Wouldn’t you like to share your hard-won knowledge with the community? Get your 45 minutes of fame and adoration? Of course you would. So why not go to

…and submit something right now? I’d especially like to see some more BI sessions.

BTW, if you’re on Facebook you can join the SQLBits group to stay up-to-date with all the latest news and meet up with other people who’ll be attending.

Amazon SimpleDB… for BI?

I’ve just seen that Amazon has released a new web service database called SimpleDB:
Note that it’s not a relational database, but has a very flexible attribute-based model; it also claims to need no tuning, data modelling or any of the boring stuff that us database people do for a living. While claims like that automatically make me somewhat sceptical, I also think about Google Bigtable and the new generation of COP databases and think, well maybe… Slap an MDX-like interface over this, make it accessible from Excel, and you’ve got something interesting and disruptive from a BI point of view.

Comparing MDX Query Results With Tablediff

Recently I was working on a project where the customer wanted to upgrade from SP1 to SP2 but found that when they did so the results of some of the calculations on their cube changed. Unfortunately these changes weren’t easy to spot: their MDX Script was >1500 lines long and the bugs they were running into were dependent on the query being run and whether the cache was cold or warm (in fact one of them was the bug I blogged about here). When testing various workarounds and hotfix builds, the problem I ran into was a problem I’ve run into before on numerous occasions: when I changed something on the MDX Script, how did I know it didn’t break something else?

In the past what I’ve done was take a few MDX queries, run them, copy and paste the results into Excel, make my changes and then rerun the same queries and paste the new results into Excel again and use Excel formulas to compare the old and new resultsets. However in this case this approach just wasn’t feasible and I started looking for a better way of comparing large numbers of MDX resultsets. Now there is definitely a tool to be written here, probably one which runs batches of queries, serialises the cellsets and saves them to disk, and then compares the cellsets and highlights any cells with differences, but I didn’t have time to write it. I was talking the problem over with Didier Simon of PSS, though, and he made a great suggestion – use tablediff.exe, the command line utility you get with SQL Server to compare relational tables. You can read more about it here:

This turned out to be the key. I cannibalised bits of my Analysis Services cache warmer and created two SSIS packages, the first of which ran a batch of MDX queries retrieved from a relational table (originally saved from a Profiler trace) through SQL Server using OpenRowset and saving the results in a table using a SELECT INTO. Here’s what the control flow looked like:


And here’s an example of the kind of SQL query I was generating:

select identity(int, 1,1) as idcolumn, * into Query1
from openrowset(‘MSOLAP’, ‘Datasource=localhost; Initial Catalog=Adventure Works DW’,
‘ with cell calculation test for ”(*)” as ”cstr(round(measures.currentmember,5))”
select Measures.[Internet Sales Amount] on 0,
[Date].[Calendar Year].Members on 1
from [Adventure Works]’)

A couple of things to note – Tablediff needs a column which can uniquely identify each row in the tables its comparing, hence the identity column. I also found that I was running into the precision problems described here and as a result, tablediff was flagging up cells as having different values when in fact the values were differing only by a miniscule amount, so I created a calculated cell assignment in the WITH clause which rounded up every value to five decimal places. I could do this because I knew none of my queries had WITH clauses, but another solution would have been to create a session calculated cell assignment which did the same thing.

After running this package I had a SQL Server database with as many tables containing results as I had queries. I could then make my changes to the cube and rerun the same package to dump a new set of results to a different database, and then run my second package to compare the results. This was, again, a fairly simple package to put together:


All I did was use a ForEach loop to loop over the tables in the database containing the first set of results, and then use an Execute Process task to call tablediff to compare each table with the equivalent table in the second database; if it found any, I used the -et argument to save the output to another table. I also used the -c argument to get column-level differences and the -b argument to specify the number of bytes to compare for large objects, necessary because the SELECT INTO creates an ntext column for all the member names on the Rows axis.

Overall, it’s not an ideal solution (coding a proper app is definitely the way to go) but it did the job, and hopefully this will be useful to someone else…

Partitions and Aggregations

Something that’s easy to forget (at least I’m always forgetting it) when creating an aggregation design is the impact that partitioning can have on it. Aggregations are only created on a per-partition basis – you can’t aggregate across partitions – so that means, for example, when you’re partitioning by Month then there’s little point building aggregations at granularities higher than Month; after all, if you only have one Month of data in your partition and you aggregate that data to the Month, Quarter or Year levels then the figures for each aggregation will be identical! In fact building aggregations above the granularity of Month can be a bit of a waste: an aggregation at Month granularity can be used by queries at Quarter and Year, but an aggregation for the same partition at Year granularity that is in all other respects identical can only be used by queries at the Year granularity.

The best thing to do to ensure this doesn’t happen is to set the AggregationUsage property to None for all attributes above the granularity that you’re partitioning by, and also to make sure your partition row counts are set correctly, before you run either of the Aggregation Design wizards. This should ensure that your aggregation design properly reflects your partitioning strategy.

One of the questions I’ve always meant to research further on this topic is whether partitioning could hurt query performance, given that it effectively prevents certain aggregations being built. Taking the example we’ve been using so far, what happens if you’re partitioning by Month and you’ve got a slow query at Year granularity that you’d like to build an aggregation for – would partitioning by Year rather than Month, so you could build a true Year-level aggregation, make sense? Well, I’ve just had a chance to test this out on a fairly large cube (with approx 100 million rows in the main fact table, and Product and Customer dimensions with 1 million+ members on their key attributes) and interestingly partitioning seems to make no difference at all. I created two measure groups, one partitioned by Month and one partitioned by Year, and then created one aggregation on the former at Month/Product and one aggregation on the latter at Year/Product. I then ran a query that returned the top 10 Products at Year on each measure group and the performance was identical. Clearly this is not something to be worried about, thank goodness…

[Thanks are due to Eric Jacobsen of the SQLCat team for talking through some of these issues after the BI Conference earlier this year. He accurately predicted that partitioning wouldn’t hurt performance because the AS query engine is very efficient at merging the data caches of partitions.]

%d bloggers like this: