Analysis Services Load Testing Tool

It’s all happening on Codeplex at the moment: via Russell Christopher and Patrice Truong I’ve just seen that the dev team have released their Analysis Services load testing tool to the community:
 
I like this new-found enthusiasm on the part of Redmond for sharing code like this with the rest of us…

MDX Script Performance Analyser

I do a lot of performance tuning as part of my consultancy work, and quite often when I start looking at a customer’s cube I find that for any given query that needs to be tuned there are several (sometimes hundreds of) calculations which affect the cells in the query and which could be the cause of performance problems. To help me work out which calculations are the ones that need to be looked at I put together a tool – the MDX Script Performance Analyser – which I’ve just got round to putting up on Codeplex so it can be shared:
 
Basically what it does is this:
  • First of all, you connect to the cube that your query runs against
  • Then you enter your query in the text box at the top of the screen and hit ‘Run Query’
  • This then starts the following process:
    • The tool reads the cube’s MDX Script and splits it up into its constituent statements, storing them in an array
    • It executes a Clear Cache command to ensure that all queries are run on a cold cache
    • It executes a Clear Calculations command so that for the current session the cube appears as though its MDX Script contains no commands
    • For each statement in the array of statements from the MDX Script, it then:
      1. Executes the next statement from the MDX Script within the session, so that the cube now acts as though its MDX Script contains only this statement and all previously executed statements
      2. Runs the query you entered in the textbox
      3. Stores how long the query took to run, plus other interesting metrics
    • Once the query has run on the equivalent of the entire MDX Script in the cube, a report is generated which contains graphs and charts illustrating the data captured earlier
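The loop above can be sketched in Python. This is only an illustration of the control flow: the real tool talks to Analysis Services via ADOMD.NET, so all the functions stubbed out below (`clear_cache`, `clear_calculations`, and so on) are hypothetical stand-ins, not the tool’s actual API.

```python
import time

def clear_cache():
    pass  # would send a ClearCache XMLA command to the server

def clear_calculations():
    pass  # would clear the session's calculations so the MDX Script appears empty

def execute_statement(stmt):
    pass  # would execute one MDX Script statement within the session

def run_query(query):
    # would run the user's query and return elapsed seconds; stubbed here
    start = time.perf_counter()
    return time.perf_counter() - start

def profile_mdx_script(script_statements, query):
    """Measure cumulative query time as MDX Script statements are added back one by one."""
    clear_cache()
    clear_calculations()
    timings = []
    for stmt in script_statements:
        execute_statement(stmt)      # cube now behaves as if the script contains this + all prior statements
        duration = run_query(query)  # time the user's query at this point
        timings.append((stmt, duration))
    return timings
```

Each entry in the returned list pairs a script statement with the query duration measured once that statement (and everything before it) is in effect.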

As a result you get to see how much each calculation contributes to the overall time taken by the query: when you see a big jump upwards in the query time graph (one that isn’t followed by a big fall afterwards, which can happen with scoped assignments), the calculation at that point in the MDX Script is an expensive one.
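Reading the graph is really just a matter of finding the largest step-to-step increase in the cumulative timings. A minimal sketch, with made-up numbers:

```python
def biggest_jump(timings):
    """Return the index of the statement that added the most query time.

    timings[i] is the cumulative query duration after statement i has been
    added back to the MDX Script; the first delta is measured from zero.
    """
    deltas = [timings[0]] + [b - a for a, b in zip(timings, timings[1:])]
    return max(range(len(deltas)), key=lambda i: deltas[i])

# Invented example: the statement at index 3 takes the query
# from 0.7s to 4.9s, so that's the calculation worth tuning.
print(biggest_jump([0.5, 0.6, 0.7, 4.9, 5.0]))  # → 3
```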

As you’ll see if you download the source code, it’s in a very basic state at the moment, but it works most of the time for me and has come in very handy on a number of occasions. There are a lot of changes and improvements I’d like to make (such as recording the total number of cells returned by the query, the total number of non-empty cells and the number of Query Subcube events at each step), and if anybody out there has other suggestions or would like to sign up as a developer I’d be only too happy to hear from you.

I also need to thank Greg Galloway for making some invaluable contributions to the code and for fixing at least one critical bug for me. This is probably also a good point to mention another project on Codeplex that Greg and Darren Gosbell have been working on over the last few months, BIDS Helper:
http://www.codeplex.com/bidshelper

… which contains some really useful extra functionality for people working with AS in BI Development Studio.

Office User Group

Just found out about an event run by the UK Office User Group on ‘Integration, Collaboration and Visualisation’ next Friday at Microsoft in Reading, which unfortunately I won’t be able to attend but which looks pretty interesting:
 
 

Silverlight

Reading all the fuss about the Silverlight announcement (see http://www.microsoft.com/presspass/press/2007/apr07/04-15WPFEPR.mspx) makes me wonder how long it’s going to be before we have a Microsoft BI client tool that makes use of it. I’m seeing more and more of Crystal Xcelsius in the marketplace (which is based on Flash, the technology Silverlight is competing with) and business people seem to be very impressed with some of the whizzy animations you can do with it – see, for example, Janne Pyykkö’s demo here http://jpbi.blogspot.com/2007/04/crystal-xcelcius-playful-3d-demo.html. Surely it can only be a matter of time…?
 
UPDATE: James Wakefield has just sent me this link to an article he wrote on hooking Xcelsius up to any XMLA compliant OLAP database:

Microsoft BI Conference Session List Published

UPDATE: How does meeting up at 6pm by the registration booth at the conference centre sound to everyone? If you can’t make it then, I’m sure we can catch up later at the party. We’ve got quite a few people joining us (apart from those below) so it should be a good evening… assuming I’m not suffering from jet lag too badly!
 
At last, they’ve published the session list for the BI Conference:
I’ve already registered but I was a bit worried having done so without knowing what the content was going to be like. Luckily it looks pretty good (even if my session got turned down :-(, but at least it means I won’t be stressing about it for the next few weeks), with the majority of speakers being from Microsoft – including Mosha speaking on… Integration Services!
 
Anyway, if you see me walking around stop me and say hello; I’ll also be hanging around on the Solid Quality stand at some point, probably Thursday afternoon. Does anyone fancy going out on the Tuesday night for a few drinks after the welcome reception? Leave a comment or drop me an email and maybe we can sort something out.

White Paper on Designing Cubes for Excel 2007

There’s a new white paper available on designing cubes for use with Excel 2007:
 

OR Queries in MDX

I’m always happy to hear from readers of this blog (you can find my contact details at http://www.crossjoin.co.uk/contact.html) and happy to take requests for subjects to blog about. For example I got an email from Joseph Boschert the other week asking if I could post something about OR queries in MDX and since this is a subject that comes up on a fairly regular basis and can be quite confusing I agreed.

The simplest kind of OR query is where you want to display or slice by two members on the same hierarchy. For example in Adventure Works if I wanted to see Internet Sales where the year was 2003 or 2004 then I could write a query something like this:

SELECT [Measures].[Internet Sales Amount] ON 0,
{[Date].[Calendar Year].&[2003], [Date].[Calendar Year].&[2004]}
ON 1
FROM [Adventure Works]

and if I wanted to see an aggregated value I could either put the set containing 2003 and 2004 in the WHERE clause, as follows:

SELECT [Measures].[Internet Sales Amount] ON 0
FROM [Adventure Works]
WHERE({[Date].[Calendar Year].&[2003], [Date].[Calendar Year].&[2004]})

or in a subselect:

SELECT [Measures].[Internet Sales Amount] ON 0
FROM (SELECT
{[Date].[Calendar Year].&[2003], [Date].[Calendar Year].&[2004]} ON 0
FROM [Adventure Works]
)

Both queries should return the same result, although of course the WHERE clause and subselects are not equivalent (see Mosha’s posting here for an explanation of how they differ). For the sake of simplicity I’ll stick to using WHERE clauses for the rest of this post.

Things become a little more complex when you want to do an OR over two different hierarchies on the same dimension because the default behaviour here in MDX is to AND. For example, what if I was interested in seeing the value of Sales made in 2003 or on a Friday? The following query gives me the value of Sales made in 2003 and on a Friday:

SELECT [Measures].[Internet Sales Amount] ON 0
FROM [Adventure Works]
WHERE([Date].[Calendar Year].&[2003],[Date].[Day Name].&[6])

How do I get the OR? Well, the answer is this:

SELECT [Measures].[Internet Sales Amount] ON 0
FROM [Adventure Works]
WHERE(
{([Date].[Calendar Year].&[2003],[Date].[Day Name].[All Periods])
,
([Date].[Calendar Year].[All Periods],[Date].[Day Name].&[6])})

…but let’s take some time to understand why this works. The set in the WHERE clause contains two tuples, the first representing all Day Names in 2003 and the second representing Fridays in all Calendar Years. This is clearly what we want to pass into the OR, but isn’t there a danger of double-counting here? Actually, no, because the WHERE clause is pretty clever. If you were to write the following query you’d see a different, incorrect, higher value:

WITH MEMBER MEASURES.DOUBLECOUNT AS
([Date].[Calendar Year].&[2003],[Date].[Day Name].[All Periods],[Measures].[Internet Sales Amount])
+
([Date].[Calendar Year].[All Periods],[Date].[Day Name].&[6],[Measures].[Internet Sales Amount])
SELECT MEASURES.DOUBLECOUNT ON 0
FROM [Adventure Works]

…because by summing the tuples together we’re adding the value of Sales on Fridays in 2003 twice, i.e. double-counting it. Luckily the WHERE clause doesn’t do a simple sum: it looks at the area described by the set and doesn’t count any overlapping cells twice. You can visualise this area by running this query:

SELECT {} ON 0,
UNION(
CROSSJOIN({[Date].[Calendar Year].&[2003]},[Date].[Day Name].[Day Name].MEMBERS)
,
CROSSJOIN([Date].[Calendar Year].[Calendar Year].MEMBERS,{[Date].[Day Name].&[6]})
)
ON 1
FROM [Adventure Works]

It’s basically a set of tuples that contains all days in 2003 and Fridays in all other years. Incidentally, if we needed to create a calculated member that didn’t double-count we’d need to sum up this same set as follows:

WITH MEMBER MEASURES.NOTDOUBLECOUNTED AS
AGGREGATE(
UNION(
CROSSJOIN({[Date].[Calendar Year].&[2003]},[Date].[Day Name].[Day Name].MEMBERS)
,
CROSSJOIN([Date].[Calendar Year].[Calendar Year].MEMBERS,{[Date].[Day Name].&[6]})
)
, [Measures].[Internet Sales Amount])
SELECT {MEASURES.NOTDOUBLECOUNTED} ON 0
FROM [Adventure Works]

You could also use a query something like this:

WITH MEMBER MEASURES.NOTDOUBLECOUNTED AS
([Date].[Calendar Year].&[2003],[Date].[Day Name].[All Periods],[Measures].[Internet Sales Amount])
+
([Date].[Calendar Year].[All Periods],[Date].[Day Name].&[6],[Measures].[Internet Sales Amount])
-
([Date].[Calendar Year].&[2003], [Date].[Day Name].&[6],[Measures].[Internet Sales Amount])
SELECT MEASURES.NOTDOUBLECOUNTED ON 0
FROM [Adventure Works]

…which explicitly subtracts the tuple containing the double-counted value and may be slightly faster, but makes the potentially dangerous assumption that everything is additive.
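The difference between summing the two tuples and aggregating over the unioned set is just inclusion-exclusion, and the same point can be made with plain Python sets. The sales figures below are invented for illustration, not real Adventure Works values:

```python
# Fact rows keyed by (year, day name): invented values for illustration only.
sales = {
    (2003, "Friday"): 10, (2003, "Monday"): 5,
    (2004, "Friday"): 7,  (2004, "Monday"): 3,
}

year_2003 = {k for k in sales if k[0] == 2003}
fridays   = {k for k in sales if k[1] == "Friday"}

# Naive sum of the two tuples: Friday 2003 is counted twice.
double_counted = sum(sales[k] for k in year_2003) + sum(sales[k] for k in fridays)

# What the WHERE clause effectively does: aggregate over the union of the areas.
correct = sum(sales[k] for k in year_2003 | fridays)

# Inclusion-exclusion: subtract the overlap; only valid if the measure is additive.
via_subtraction = double_counted - sum(sales[k] for k in year_2003 & fridays)

print(double_counted)    # 32 = (10+5) + (10+7)
print(correct)           # 22 = 10 + 5 + 7
print(via_subtraction)   # 22
```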

The last scenario we need to mention is ORing across hierarchies from different dimensions, for example if I wanted to see Sales where the transaction was either made in 2003 or made to a Customer in the US. Again, thankfully, the WHERE clause ‘just works’ (unlike the example of two hierarchies on the same dimension, auto-exists doesn’t come into play here) so the following query doesn’t double-count:

SELECT [Measures].[Internet Sales Amount] ON 0
FROM [Adventure Works]
WHERE(
{([Date].[Calendar Year].&[2003],[Customer].[Country].[All Customers])
,
([Date].[Calendar Year].[All Periods],[Customer].[Country].&[United States])})

Before I finish, I should also mention that the Analysis Services Stored Procedure Project has a very useful sproc, written by Darren Gosbell, that makes it easy to generate the kind of asymmetric sets of tuples we’ve been dealing with here:
http://www.codeplex.com/ASStoredProcedures/Wiki/View.aspx?title=AsymmetricSet&referringTitle=Home

It’s BI Evening time again…

Yes folks, in conjunction with the UK SQL Server Community I’m proud to announce another BI Evening on the 26th of April and – by popular demand – it’s going to be in central London. We’ve got a star-studded line-up of speakers too: David Francis talking about PerformancePoint Monitoring and Analytics; the legendary Allan Mitchell on dimension and fact table ETL methods with SSIS; and all the way from the US of A (though not just for this event, he was coming here anyway) Reed Jacobson on PerformancePoint Planning and Budgeting. You’ll detect a bit of a PerformancePoint theme going on here…
 
Anyway, here’s the link to register:
Spaces are limited so you’d better hurry. There’s also a non-BI SQL Server event going on at the same time, and since both are being sponsored by those nice people at Red-Gate software I’m told there will be all sorts of goodies to give away.

Optimising Many-to-Many Relationships

Great post here from Erik Veerman on a way of optimising some many-to-many relationships:
I blogged about something similar, but applicable in slightly different scenarios, here:
In general the idea is that if you can reduce the size of any dimensions and measure groups involved in a m2m relationship then you’ll get better performance. I believe partitioning your intermediate measure group appropriately can also help.
 

AS2005 Design Best Practices White Paper

Via Darren Gosbell, I see that the best practices included in the forthcoming SQL2005 Best Practices Analyzer have been turned into a white paper which you can download here:
 
Most of the recommendations are fairly obvious, some were new to me, some provoked a ‘they need to fix that’ response, and some I wasn’t sure I agreed with 100%. For example:
 
Avoid creating diamond-shaped attribute relationships
I quite often find myself designing diamond-shaped relationships in my attributes; the example they give in the text isn’t particularly good, but Day->Month->Year and Day->Week->Year is very common. OK, most of the time I design user hierarchies along these paths anyway, but this need to design user hierarchies just for the sake of performance, rather than because users actually want to see these hierarchies (and this isn’t the first time I’ve heard this recommended), is something I don’t feel particularly happy about. It would be nice if we could make the choice between using attribute hierarchies and user hierarchies based on ease-of-use alone.
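To make the diamond concrete: if you model attribute relationships as a directed graph, Day->Month->Year plus Day->Week->Year gives two distinct paths from Day to Year, which is exactly the shape the recommendation warns about. A quick sketch of that check:

```python
def count_paths(graph, start, end):
    """Count distinct paths from start to end in a DAG of attribute relationships."""
    if start == end:
        return 1
    return sum(count_paths(graph, step, end) for step in graph.get(start, []))

# Day -> Month -> Year plus Day -> Week -> Year: the classic diamond.
relationships = {
    "Day": ["Month", "Week"],
    "Month": ["Year"],
    "Week": ["Year"],
}

# More than one distinct path between two attributes means a diamond shape.
print(count_paths(relationships, "Day", "Year"))  # → 2
```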
 
Avoid including unrelated measure groups in the same cube
This is the multiple measure groups and one cube vs multiple cubes with one measure group question that has been kicking around for at least a year. We were promised it was going to be addressed in the AS2005 Performance Guide but it wasn’t; when are we going to get some details published? In my experience I have seen a slight improvement in performance when you split cubes up in this way but nothing major, yet clearly there must be some reason for MS to keep recommending this so perhaps I’ve not come across the right scenario yet.
 
Avoid having very big intermediate measure groups or dimensions of many-to-many dimensions
Obviously, having large intermediate measure groups and/or dimensions in m2m relationships is going to slow things down, but in my experience I’ve always been pleasantly surprised by the performance of m2m dimensions. I’ve seen intermediate measure groups with well over 1 million rows in them perform really well…
 
Avoid having partitions with more than 20 million rows
As an aside, I recently noticed that in Eric Jacobsen’s recent blog entry on partitioning here he says you can also think about 2GB (rather than a number of rows) as a rough guideline for the maximum size of a partition, and more interestingly that you should not have more than about 2000 partitions in your measure group. I asked him about this and it turns out that there are some internal limitations in AS at the moment, which hopefully will be fixed soon, that slow down performance when you have large numbers of partitions. This does suggest that, roughly speaking, there’s currently a maximum size for a MOLAP cube to perform well of around 4TB or 40 billion rows… not that I’ve ever seen an AS2005 MOLAP cube that big, but interesting to note.
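Putting the two guidelines together gives the back-of-the-envelope maximum, using the figures quoted above:

```python
MAX_PARTITIONS = 2000                  # rough limit per measure group
MAX_ROWS_PER_PARTITION = 20_000_000    # the 20 million row guideline
MAX_BYTES_PER_PARTITION = 2 * 1024**3  # the 2GB guideline

max_rows = MAX_PARTITIONS * MAX_ROWS_PER_PARTITION
max_bytes = MAX_PARTITIONS * MAX_BYTES_PER_PARTITION

print(max_rows)             # 40000000000 rows, i.e. 40 billion
print(max_bytes / 1024**4)  # 3.90625 TiB, i.e. roughly 4TB
```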
 
Do set the Slice property on partitions that are ROLAP or partitions that use proactive caching
Based on Eric’s article mentioned in the last point, I think it’s also a good idea to set the Slice property on MOLAP partitions where possible, given that the automatic detection of which members appear in which partitions can get confused by overlapping slices. Greg Galloway recently did some experiments on this and showed that if you set the slice on a MOLAP partition the auto-slice information isn’t used.
 
Avoid creating aggregations that are larger than one-third the size of the fact data
I quite frequently find myself designing aggregations which fail the one-third rule, although not by much, and for certain queries and calculations it can be a good idea. But I agree you should only do so as a last resort.