AS2K SP4 unwelcome changes

As I’m sure you all know by now, AS2K SP4 has been released and is available here:
http://www.microsoft.com/sql/downloads/2000/sp4.asp

One very unwelcome change is to the DSN Requested and DSN Used Perfmon counters, which are now practically useless. The values are now displayed in scientific notation (e.g. 1.3112e+012), which means that you can only see what levels your queries are hitting for the first five dimensions in your cube – which makes building aggregations manually (one of my favourite tricks for performance optimisation, detailed here) impossible. Grr! I noticed this a while ago and logged a bug on betaplace, but got no feedback. Double grr!
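To see why scientific notation is so destructive here, remember how these counter values work: each dimension contributes one digit (the level number being hit) to the value. Here’s a quick sketch in Python – my own illustration, using a made-up 13-dimension cube, not anything Microsoft ships:

```python
# My own sketch of why the change hurts -- not Microsoft code. The DSN
# counters encode the level hit in each dimension as one digit, so a
# 13-dimension cube produces a 13-digit value. Once Perfmon displays that
# value in scientific notation with only five significant digits, the
# levels of every dimension after the fifth are lost.

def dsn_value(levels):
    """Encode per-dimension levels as a single integer, one digit each."""
    return int("".join(str(l) for l in levels))

levels = [1, 3, 1, 1, 2, 4, 5, 1, 1, 2, 3, 1, 1]  # hypothetical 13-dim cube
exact = dsn_value(levels)                          # 1311245112311

shown = float(f"{exact:.4e}")                      # what Perfmon now shows
recovered = str(int(shown))

print(exact)      # all 13 dimensions' levels are visible
print(recovered)  # only the first five survive the rounding
```

Run that and you’ll see the levels of dimensions six onwards come back as zeroes – which is exactly the problem.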

Other SQL2005 BI books

First of all, thanks to everyone who replied to my post from last week about ‘MDX Solutions’ 2E – I’ll try to respond to everyone individually (in many cases there are some follow-up questions I’d like to ask) but it might take a while.

As a first-time author I’ve been lucky in that just at the time I’ve needed guidance about the process of writing an IT book my publisher Joseph Wikert (the guy who signed the contract on behalf of Wiley) started an excellent blog on exactly this subject. One of his recent postings suggests that authors research the competition to their book, so I did some research of my own to see what other books on SQL2005 BI were in the pipeline…

First of all, soon after my post about MDX Solutions 2E went up on the newsgroup Mosha came out with his own announcement (imitation is the sincerest form of flattery, eh?) regarding the second edition of ‘Fast Track to MDX’ and a new book on how to model various business problems correctly in AS2005. The former is the obvious competitor to ‘MDX Solutions’, but even then I’m sure he’d agree (see here, for example) that our books have very different aims and that many people will end up buying both. The latter book I am very much looking forward to seeing because I think AS has been crying out for something like it for a long time; indeed, some of the feedback I received for MDX Solutions echoed this.

Moving onto what’s currently listed on Amazon, I’ll start with the book which MDX Solutions 2E shares a co-author with: ‘Professional SQL Server Analysis Services 2005 with MDX’, by Siva Harinath and Stephen Quinn. Also listed is something called ‘Data Warehousing with SQL Server 2005’ which, after a quick Google, I reckon is the same as the book called ‘The Microsoft Data Warehouse Toolkit’ by Joy Mundy and Warren Thornthwaite, mentioned in passing here. I would guess it’s another variation on the ‘Data Warehouse Toolkit’ theme, but this time with a SQL2005 bent; I don’t know how much AS-related content it will have, but being a big fan of this particular series I’ll probably end up buying it. Thirdly there’s Teo Lachev’s book ‘Applied Analysis Services 2005’, more details of which are available on his blog. And finally there’s ‘Data Mining with SQL Server 2005’ by Jamie MacLennan and ZhaoHui Tang – another one I’m looking forward to seeing, if only because the subject is one I’ve always been interested in and never found a good, practical book for (although I did think this white paper was pretty good), at least as far as SQL 2000 went.

Going back to the most recent post on Joseph Wikert’s blog, I can’t help but wonder if the days of the traditional IT book are numbered. I’m not one of those techno-utopians who think that the web will replace the printed word simply because it’s the web – I think a book is much more likely to provide focused, useful information than a blog or an article on a website – but with more and more information becoming available on the web you do wonder whether a) publishers’ margins are going to be trimmed to the degree that publishing IT books becomes unprofitable, and b) the people who traditionally might have written a book will decide that running a website or blogging is a better way of spending their time.


Building cubes from SAP data

Courtesy of Mat Stephen, whose blog I’ve just come across and which contains a lot of interesting MS BI-related content, I’ve just read the following paper on building AS2K cubes from SAP BW. Now I don’t know much about SAP BW or R/3, but I do know that this is the data that people most want to analyse and that SAP’s own BI tools have a history of being a bit rubbish (I’ve heard this from a lot of people, but the best summaries can be found in the OLAP Survey and the OLAP Report – both fine reads if you have the cash). However, I also know that one of the main reasons why SAP’s own attempt to use AS2K to speed up BW’s query performance was never widely used was because AS2K simply couldn’t handle the number of dimensions and levels that are present in a typical BW InfoCube: the only time I heard anyone complaining about AS2K only being able to support 128 dimensions and 256 levels in a cube was in precisely this scenario, even if in my opinion anyone expecting a user to understand a cube this complex is nuts…

Anyway, to get onto my main point, it strikes me that the new attribute-based dimension model of AS2005 is going to make it much more suitable for analysing SAP data. If I didn’t suspect that Microsoft were going to beat everyone to it with Maestro I’d say that using AS2005 and RS2005 to do exactly this job could be the basis for a killer third-party product, although I’m sure there’ll still be a lot of expensive consulting needed even for Maestro implementations. There must be plenty of disaffected BW users out there who are looking for a replacement BI solution.

Also on this subject, the paper above stresses several times that extracting data from R/3 and BW directly from the relational tables in which it’s stored is risky and unsupported, because the tables are subject to change in future versions, and so on. That said, I understand that doing this is much quicker than going through any of the official interfaces. So I thought I’d mention a tool called Saphir, produced by a company I got to know when I worked in Switzerland, which makes exploring and extracting from SAP’s (and other ERP tools’) relational tables much easier. I know of at least one project where it was used to extract data to populate Analysis Services cubes.

Update: no sooner do I write this entry than I read this article – if you follow the link to the SAP site, you’ll see a mention of ‘Business Analytics delivered through Microsoft Excel’ – hmm, I wonder if this has anything to do with Maestro?

XMLA Compression

Some interesting comments from Mark Hill on his blog concerning the problems he’s had with querying over a poor network. I know lots of people have had similar problems, but I think he’s the only person I’ve heard of who has persevered and got a workable solution.

PASS presentation

I’m going to be giving a presentation at the 2005 PASS European Conference in May this year on new cube design features in AS2005. This will cover stuff like the new attribute-based dimension model, the new dimension types (for example many-to-many, reference, fact and role-playing) and measure expressions. Although I’ve had some experience with these features through the course of my beta testing I would be very interested in hearing from anyone else who has used them too, especially people who have tested them in real-world scenarios, so we can compare notes. I’d like to be able to get as many useful tips and tricks into my presentation as possible and avoid just rehashing what’s already in BOL! So please either leave a comment or mail me on cwebb_olap.removethisbit@hotmail.com if you can help me…

Raising your own errors in Yukon

Another thing I’ve noticed in Books Online – the new ERROR function, which allows you to raise your own errors. For example in AdventureWorks, try the following query:

WITH MEMBER MEASURES.TEST AS
IIF(MEASURES.[INTERNET ORDER COUNT]>2000
, MEASURES.[INTERNET ORDER COUNT]
, ERROR("TOO LOW!")
)
SELECT {MEASURES.[INTERNET ORDER COUNT], MEASURES.TEST} ON 0,
[DATE].[CALENDAR TIME].[ALL PERIODS].CHILDREN ON 1
FROM
[DIRECT SALES]

There’s a section in BOL on error handling which isn’t filled in yet, so I guess this is probably only the tip of the iceberg as far as this area of functionality is concerned, but it’s interesting nonetheless.

BI Best Practices Blog

Although this has been around for a while, I thought I’d write a quick post about the BI Best Practices Blog in case some of you haven’t seen it. Unfortunately it doesn’t seem to have been updated since last summer but there are plenty of interesting articles on there (including a few that I wrote when I was at Microsoft – look in the MDX category).

AS2K SP4

SQL Server SP4 is in beta (you can sign up here), and I’ve been looking at the long list of fixes that have gone into Analysis Services here. One thing that really leaps out at me is the number of fixes for dynamic security, one of my pet subjects – if you’ve currently implemented it, you really should take a look. The other thing of interest is the first official documentation of two connection string properties that I’ve found very useful for query performance tuning: Cache Ratio2 and Cache Policy. The writeup of Cache Ratio2 is fairly detailed and gives a good insight into the workings of the engine (when is someone going to write an ‘Inside Analysis Services’ book?); the writeup for Cache Policy doesn’t really explain what it does, unfortunately, and just discusses a fixed bug. I have a draft of a so-far unreleased whitepaper which goes into much more detail about Cache Policy, and which in my opinion ought to get an official release.
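For anyone who wants to experiment, properties like these just get appended to the client’s OLE DB connection string alongside the usual ones – something like the sketch below. The values shown are placeholders for illustration only, not recommendations; check the SP4 documentation for what they actually mean before using them.

```
Provider=MSOLAP.2;Data Source=MyServer;Initial Catalog=FoodMart 2000;Cache Ratio2=0.5;Cache Policy=4
```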

Creating local cubes in AS2005

I spotted the following thread on the AS2005 beta newsgroup, concerning the new way of creating local cubes in AS2005:

http://communities.microsoft.com/newsgroups/previewFrame.asp?ICP=sqlserver2005&sLCID=us&sgroupURL=microsoft.private.sqlserver2005.analysisservices.olap&sMessageID=%253CO18gH%2523NBFHA.2368@cpmsftngsa05.privatenews.microsoft.com%253E

(Apologies for the link, but the web-based UI didn’t give me a better way of linking to the whole thread – hopefully you can read it backwards)

Anyway, in summary, it means you can create local cubes with the same DDL as you use to create server-based cubes – a very welcome improvement in my opinion.

Building Aggregations Manually

I’ll start with two disclaimers:

  • This entry is based on my own understanding of how the AS engine works, based on my own experience and the occasional question I’ve asked the dev team. It might not be totally accurate on the details but hopefully the general principles are sound – I’ve used them successfully on a number of occasions.
  • 99% of the time the Storage Design Wizard and Usage Based Optimisation Wizard will design the right aggregations for you, and you won’t need to build aggregations manually.


So what about the 1% of times when you will need these techniques? They usually occur when you have several dimensions which have large numbers of members at their lower levels, and queries which retrieve data at these levels. Cubes with large parent-child dimensions and dimensions containing entities like customers are the most common real-life manifestations of this scenario. The 1/3 rule, which dictates which aggregations can be built by the Storage Design Wizard and the Usage Based Optimisation Wizard (see the Analysis Services Performance Guide for more details), may have stopped any aggregations which would be useful for your queries from being built, and so query performance suffers as a result. Now the 1/3 rule is there for a good reason – to stop large aggregations being built which increase processing time and the size of the cube, but don’t have much of an impact on query performance – but it is not infallible; equally, when faced with query performance problems very often you are glad of any improvement in response times, however small!
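To make the 1/3 rule concrete, here’s a toy sketch – my own illustration with made-up member counts, and a deliberately crude size estimate (the wizards’ real estimates are more sophisticated than a simple product of member counts, so treat this as a caricature of the idea, not the actual formula):

```python
# Toy model of the 1/3 rule: the wizards won't build an aggregation whose
# estimated size exceeds one third of the fact table. The estimate used
# here (product of member counts at each chosen level) is a crude upper
# bound, not the wizards' actual calculation.
from math import prod

FACT_ROWS = 1_000_000

def allowed(member_counts):
    """Would an aggregation at levels with these member counts be considered?"""
    return prod(member_counts) < FACT_ROWS / 3

print(allowed((1, 1, 1, 1)))           # tiny top-level aggregation: built
print(allowed((12, 40, 50, 10)))       # mid-level aggregation: built
print(allowed((12, 40, 50, 500_000)))  # includes a Customer-style leaf: rejected
```

You can see how any candidate aggregation that includes the leaf level of a big Customer-style dimension blows straight past the threshold – which is exactly the scenario where you end up building aggregations by hand.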


The first step to take when designing aggregations manually is to understand what data the queries you want to run are actually requesting. The best way to do this is to take a representative sample of these queries, clear the Query Log, set it to log every query and then run the queries one by one. You’ll notice that one MDX query doesn’t necessarily equate to one query in the log – in fact, one MDX query can generate multiple subqueries, even thousands, depending on its complexity; you’ll also notice that what is stored in the query log is somewhat cryptic. Rather than explain it here, I suggest you read the following white paper which explains the contents in some detail. The contents of the DataSet column represent the slice of the cube that each query in the log asked for data from; the same values can be viewed in Performance Monitor using the DSN Requested counter. The other useful value to monitor is the slice of the cube that the AS engine queried to get data to service each of these requests, and unfortunately this is not present in the Query Log – you can only see it in PerfMon using the DSN Used counter.


This is probably a good place to stop and give some examples. Imagine you have a cube with 4 dimensions apart from Measures, and these dimensions have respectively 5, 6, 7 and 8 levels each. If you ran a query requesting values at the top level of each of these dimensions, DSN Requested would show it as 1111 (top, top, top, top); similarly, if you ran a query requesting data from the leaf levels of each dimension, DSN Requested would show it as 5678 (leaf, leaf, leaf, leaf). Now if you had no aggregations in your cube and you ran the first of these queries, 1111, in order to get the value returned AS would have to aggregate values from the leaf members of every single dimension and DSN Used would show 5678; the fact that all this aggregation had to happen at runtime would mean that the query might not run very quickly. If, however, you had an aggregation built at the third level of each dimension and you ran the same query, DSN Used would instead show 3333, and because AS only had to aggregate the members on the third level of every dimension the query would run much faster. Following on from this, imagine that you wanted to run a query that took values from the top levels of the first two dimensions and the leaf levels of the last two dimensions, so that DSN Requested would be 1178. Since it is unlikely that any aggregations would have been built at the bottom level of two such deep dimensions (especially if they had large numbers of members on them – for example, if one was a Customer dimension), then DSN Used would have to be 5678 and AS would still have to aggregate lots of values at runtime.
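The logic in this example can be sketched in a few lines of Python. This is my own simplification of what the engine does, not its actual implementation – I’m just modelling the rule that an aggregation is usable for a request only when it sits at or below the requested level in every dimension:

```python
# Simplified model of DSN Requested vs DSN Used -- my own sketch, not the
# actual AS engine. Levels are numbered 1 (top) down to the leaf level.

LEAF = (5, 6, 7, 8)  # leaf levels of the four example dimensions

def dsn_used(requested, aggregations):
    """Return the slice AS reads to answer a request: the coarsest usable
    aggregation, falling back to leaf-level data. An aggregation is usable
    only if it sits at or below the requested level in every dimension."""
    usable = [a for a in aggregations
              if all(agg >= req for agg, req in zip(a, requested))]
    usable.append(LEAF)              # the fact-level data can always answer
    return min(usable, key=sum)      # coarsest = least aggregation at runtime

aggs = [(3, 3, 3, 3)]
print(dsn_used((1, 1, 1, 1), aggs))  # hits the aggregation: (3, 3, 3, 3)
print(dsn_used((1, 1, 7, 8), aggs))  # no usable aggregation: (5, 6, 7, 8)
```

Running it reproduces the story above: the 1111 query is answered from the 3333 aggregation, while the 1178 query has to fall back to the 5678 leaf data.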


Back looking at your Query Log and PerfMon, the next step is to decide whether you need to build aggregations manually. If you’ve been through the Storage Design Wizard and the Usage Based Optimisation Wizard and you’ve set the Aggregation Usage property on all your dimensions appropriately (again, see the AS Performance Guide for more information on this), and you still see that your queries are not hitting aggregations (so that there are big differences between the DSN Requested and the DSN Used values) then you probably need to. On the other hand, if you see DSN Requested and DSN Used values which are the same or almost the same, building more aggregations might not do any good and you might need to look at other ways of improving performance, such as partitioning.


The tool you’ll need in order to build aggregations manually is ‘Partition Manager’, which is available in the SQL 2K Resource Kit and also in an updated form in the BI Accelerator, a free download from the Microsoft website. The UI can be a bit fiddly when you have more than a few dimensions but it beats writing the code in DSO.


The final problem is knowing which aggregations you need to build. Going back to the example above, if you saw the pattern 1178 in DSN Requested you could build an aggregation which matched this exactly, i.e. at the top levels of the first two dimensions and the leaf levels of the last two. This would do the trick, but if your users wanted to drill down on either of the first two dimensions then AS could no longer use it. Therefore it’s a much better idea to build aggregations at the lowest level your users are likely to drill down to, perhaps 3378, and sacrifice a little performance on some higher-level queries for much better performance on queries which go slightly deeper. Overall, though, it’s only through trial and error that you’ll work out which aggregations you need.
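That trial and error can at least be reasoned about. Assuming the simple rule that an aggregation can answer a request only if it is at or below the requested level in every dimension, a quick sketch (my own illustration, with a made-up workload of requested-level patterns) shows why 3378 is the better investment than 1178:

```python
# Sketch of the trade-off described above -- my own illustration with a
# made-up workload. Rule assumed: an aggregation can answer a request
# only if it is at or below the requested level in every dimension.

def covers(aggregation, requested):
    return all(agg >= req for agg, req in zip(aggregation, requested))

# Requested-level patterns: the original 1178 query plus some drill-downs
# on the first two dimensions.
workload = [(1, 1, 7, 8), (2, 1, 7, 8), (3, 3, 7, 8), (1, 4, 7, 8)]

hits_1178 = sum(covers((1, 1, 7, 8), q) for q in workload)
hits_3378 = sum(covers((3, 3, 7, 8), q) for q in workload)

print(hits_1178)  # 1 -- only the exact 1178 query benefits
print(hits_3378)  # 3 -- the drill-downs to level 3 benefit as well
```

The exact-match aggregation helps only the one query it was built for, while the deeper one also services every drill-down that stays at or above level 3 on the first two dimensions – which is the point of building at the lowest level users are likely to reach.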