Reporting Services-generated MDX, Subselects and Formula Caching

Analysis Services can, of course, cache the results of MDX calculations and share that cache with other users most of the time… but when it can’t the result is much unnecessary load on your server as calculations are recomputed again and again, and much longer query times. Unfortunately the details on what caching AS can do when are not properly documented anywhere and I’ve only learned the bits and pieces that I know about this subject from my own experience and from talking to the likes of Mosha; I’ve avoided blogging about it because my knowledge is incomplete and this is a complex topic. Happily the upcoming AS2008 Performance Guide should rectify this lack of proper documentation, and I believe Mosha is also going to address this subject in his pre-conf seminar at PASS.

However, there is one specific scenario that I thought I’d write about because I come across it regularly and it affects anyone using a client that generates MDX with subselects in, such as the Reporting Services query builder. Let’s say you’ve got an expensive calculated member on your cube – for example, add the following (deliberately slow) calculated measure to the Adventure Works cube:

{[Scenario].[Scenario].&[1], [Scenario].[Scenario].&[2]}
[Department].[Departments].[Department Level 02].members
, 10,[Measures].[Amount])

In Reporting Services you might then use the query builder to create a query that looked like this to select just Calendar Year 2004 on rows and this new calculated measure on columns:

NON EMPTY { ([Date].[Calendar Year].[Calendar Year].ALLMEMBERS ) }
{ [Date].[Calendar Year].&[2004] } )
FROM [Adventure Works])

On a cold cache, on my laptop, this query takes 33 seconds; on a warm cache it still takes 33 seconds. This is not good – practically all of the query time is taken up evaluating the calculation, and this calculation is clearly not being cached. The reason why is because the query contains a subselect, and when a query contains a subselect it means that the results of any calculations can only be cached for the lifetime of the query. You can easily rewrite the query as follows to avoid the subselect:

NON EMPTY { [Date].[Calendar Year].&[2004] }
FROM [Adventure Works]

On a cold cache this query still takes 33 seconds, but on a warm cache it comes back in 1 second because the calculation can now be cached and this cache can be reused for subsequent queries. Clearly this is much better.

I wonder how many SSRS implementations are suffering from needlessly poor performance just because of this? Subselects are a useful addition to MDX and make calculating ‘visual totals’ type values very easy, but they should be used with care and not just as a means of selecting individual members.

Azure and Analysis Services in the cloud

Hmm, I’m a bit late on this but it seems at PDC the other day buried in Bob Muglia’s keynote was the announcement that we’ll be getting a cloud-based version of Analysis Services as part of SQL Data Services. Here’s Jamie Thomson’s post on this:

Here’s the official Azure site and white paper:

UPDATE: see Mosha’s comment below – there was no announcement of cloud-based AS. Oh well… but I’ve just read about the Office-in-the-cloud announcement. I wonder if we’ll get any AS connectivity in the cloud-based Excel?

Pentaho 2.0 and Pentaho Aggregate Designer

Via Julian Hyde, I see Pentaho have released version 2.0 of their BI suite. As Julian points out in this blog entry here, one of the major new features in this release is the Pentaho Aggregate Designer, which makes it much easier to design aggregate tables for Mondrian. A quick look at the screenshot on Julian’s blog suggests that Mondrian is continuing to flatter Analysis Services by imitation (you can even see that the aggregations are being built on the Foodmart database!) – not that this is a bad thing, in fact it’s what makes Mondrian so easy to use if you’re used to Analysis Services.

So Mondrian is definitely catching up in terms of functionality… and in terms of performance, well, as Julian says "Mondrian is beholden to the RDBMS for performance" and those RDBMSes are getting ever faster. I suspect very soon more and more people are going to find that Mondrian plus an open source RDBMS (perhaps the column-store-based LucidDB, which I see already already has support for the new aggregate table designer) is going to be ‘good enough’ for many BI projects – after all, most AS implementations I see don’t involve massive amounts of data, and if Mondrian can give good query response times on a fact table of 20-50 million rows then the decision on which platform to use will be more heavily influenced by price. And what with the current financial crisis, price is going to be an ever-more important factor for many customers. I wonder how long it will be before I see MS BI consultancies starting to offer open source BI, given the relative ease with which you can transfer your cube design and MDX skills between the two platforms?

Book review: Applied Microsoft SQL Server 2008 Reporting Services, by Teo Lachev

Here’s the deal: if you liked Teo Lachev’s book "Applied Microsoft Analysis Services 2005", which I did, then you’ll like "Applied Microsoft SQL Server 2008 Reporting Services". Both books share the same format and approach and are exhaustive guides to their subjects.

In case you’ve not seen one of Teo’s books before, though, what should you expect? Well, "Applied Reporting Services 2008" is 750 pages long and covers just about every aspect of Reporting Services 2008 that’s worth covering, from installation to report design to management to security to extensibility in great detail. This probably isn’t a book you’re going to sit down and read from cover to cover, but it’s a great reference guide and Teo’s prose is very clear so reading individual chapters as and when you need to is no chore. The best thing about this book, though, is the fact that on every page you can see Teo’s vast real-world experience showing through. For some reason many books on Reporting Services fall into the trap of being Books Online rewritten, never telling you the stuff you really need to know; this book on the other hand is a goldmine of information on how things really work. Let me take the chapter I can really speak from authority on – the chapter on using Reporting Services with Analysis Services. Every other SSRS book I’ve seen has been written by someone with clearly no practical experience of using SSAS and SSRS together, and parrots the usual line about ‘great integration’ and ‘easy-to-use MDX query designer’ etc. Teo on the other hand has, for the first time anywhere, put together all the tips and tricks I’ve ever seen (plus a few I haven’t) on this topic in one place – he lists the pros and cons of the built-in Analysis Services datasource and the OLEDB datasource and how to work around them, how to handle parent/child hierarchies, using extended properties, the lot.

All in all, then, a very highly recommended book; it’s ideal for both beginners and experienced developers and probably the only Reporting Services 2008 book you’ll ever need. You can read some sample chapters and see video demos on the book website here:

OLAP PivotTable Extensions new release

For some reason I’ve not blogged about this before, but anyway the ever-industrious Greg Galloway has just released a new version of his OLAP PivotTable Extensions:

It’s an Excel addin that gives you useful new functionality in Excel 2007 pivot tables connected to Analysis Services, such as the ability to add private calculations, view the MDX behind the pivot table, and (in the new release) search for members and other things. Definitely worth a look, and useful too if you’ve ever wondered how to work with the Excel pivot table in code.

Calculated members are better than assignments to real members (at least sometimes)

Earlier this year I blogged about how, in my experience, using real members and overwriting them with MDX Script assignments sometimes seemed to perform worse than using calculated members. See here for the full posting:!7B84B0F2C239489A!1598.entry

Recently I came across this problem again but was able to get together a proper repro and open a support case with Microsoft. What was happening was that I had a time utility dimension with three members – let’s call them member A, member B and member C. A was the default member and had no calculation, it just showed the real values from the rest of the cube; B was a straightforward YTD calculation; C was a really nasty, untuned, same period previous year calculation with lots of special logic. I had a query that included A and B but not C and which was performing really badly, and I found that when I commented out the calculation for C the query returned instantly. So I guessed that for some reason the calculation for C was being evaluated when the query ran, even though C wasn’t actually requested in the query and this was confirmed by Microsoft.

Unfortunately I was also told this behaviour was ‘by design’, something to do with prefetching and sonar. Thankfully it doesn’t happen for all queries or all cubes (I tried and failed to repro it on Adventure Works) but in my opinion there should never be a situation where the formula engine evaluates calculations that aren’t needed – precisely because in some cases it will mean the query runs much longer than it ever should. Probably the worst thing about all this is that you have no idea when exactly it’s happening with the tools we’ve got available at the moment. I suspect that many people out there have poorly-performing cubes because of this issue…

One of the workarounds was to use calculated members instead of real members, as I recommended in my original post, but as I also said in that post there are some cases where calculated members are not a good alternative to real members. In my particular case I was able to tune the calculation associated with member C so either the problem didn’t happen or C evaluated so quickly it didn’t matter – I’m not sure which. Anyway, I stand by what I said in my original posting: use calculated members rather than real members overwritten by Script assignments wherever you can!

Metadata, semantic web technologies and, yes, Gemini again

I saw a very interesting article the other day in Intelligent Enterprise by Seth Grimes, about a newly-published report on the semantic web by David Provost. Grimes is rightly sceptical about how close we are to these ideas reaching fruition and notes that many of the companies mentioned in the report are concentrating on ‘semantic data integration’. Frankly, to me the idea of being able to integrate data sourced from different parts of the web is still far-fetched, given the trials you have to go through to integrate data from different parts of the same company. But it did get me thinking: when users get Gemini, where will they get their data from? Yes, they’ll be downloading data on ‘industry trends’ etc from the web in the way we saw in the Gemini demo, but it won’t be that often. In well-run companies most of the time the data will come from four sources:

  • The data warehouse, either from the relational source or Analysis Services.
  • OLTP systems
  • Data that lives in someone’s small, well-maintained, official, IT-department tolerated Excel spreadsheet. The kind of spreadsheet that couldn’t and shouldn’t be promoted to database form because it’s too small, or short-lived, or needs to be maintained by non-technical people, or needs the complete flexibility that Excel gives you. In fact, exactly the kind of thing that Excel is meant to be used for.
  • Data that originally came from the data warehouse but was downloaded into someone’s local Access database, or exported to Excel, or sent to the user in something like a SSRS report; so data that has come to the user second hand.

How, then, can Gemini know (beyond its cleverness with column names and analysis of the data within those columns) what data can be integrated with what? Ideally it would have access to some form of metadata embedded in the source data that would help it make the correct decision all in all cases. Where this metadata comes from is a problem much larger than Gemini of course, and goes down to the fundamental question of how an enterprise can keep track of all of its data assets wherever they’re stored and understand what the data in each data store actually means; Microsoft is regularly criticised for its lack of a metadata tool and I guess MS is working on something in this area. If, in the first and second scenarios above, Gemini could connect to SQL Server or Analysis Services and see a common layer of metadata that would help it out, allowing it to join data from SQL Server with data from Analysis Services for instance. In a way, Analysis Services is already a metadata layer on top of the data warehouse, containing information on how tables need to be joined and how measures should be aggregated; perhaps this side of it will become more important as the MOLAP engine is superceded?

I also think some of the technologies of the semantic web, when applied to the enterprise, could be very useful here; I’ve only really just come across this stuff myself, but as an introduction I found the one-page explanation of RDF on Twine was very good, and the Microformats site was also full of interesting information. One of the companies mentioned in the semantic web report is Cambridge Semantics, whose product Anzo for Excel is aimed at imposing structure on all of those Excel spreadsheets I referred to in the third bullet point above. The demo on the web concentrates on using the tool for sharing data and collaboration, but as far as I can see the key to getting that to work is selecting data in the spreadsheet and linking it back to a common metadata layer. Of course the obvious criticism of a product like this is that you’re again relying on your users to make the effort to mark up their data and do so properly, but if there’s an incentive in the form of easier collaboration and – to put it bluntly – less of that boring cutting and pasting then maybe there’s a chance they could be persuaded to do so. I could imagine MS offering something very similar to this as part of Excel, with all the central management being done through Sharepoint, and extended throughout the Office suite to Access, Word etc. Gemini would then be able to do a better job of understanding what and how data in an Excel spreadsheet, say, could be integrated with data in the data warehouse.

So the bottom-up approach could be combined with the more traditional top-down metadata management approach, and crucially I’d want to see metadata automatically embedded in various the output formats of the server products. For example when you built a SSRS report or created an Analysis Services pivot table in Excel, in both cases I’d want the data by default to retain some record of what it was and where it had come from – the metadata would embedded in the document in the same way as if the user had marked up the document themselves. And this in turn would make it a lot more easily reusable and comprehensible by Gemini and other applications such as Enterprise Search.  Metadata would accompany data as it travelled from document to document, data store to data store, format to format. This would then cover the scenario in the third bullet point, allowing you to integrate data from an SSRS report with data from the data warehouse it originally came from, or data from a user-generated Excel spreadsheet that had been marked up manually.

So much for structured data; ideally we’d want to be able to include unstructured data too. Some of the applications of natural language processing mentioned in the semantic web report looked very interesting, especially OpenCalais (which already has integration with MOSS 2007 I see). If you were looking at sales figures for a particular customer in Gemini, wouldn’t you also want to be able to look for documents and web pages that discussed sales for that customer too? And I’ve often thought that while the super-simple type-a-search-term approach for Search has worked brilliantly over the last few years, there’s a niche for a power-user interface for search too, kind of like a Proclarity Desktop for search where you could drag and drop combinations of terms from the central metadata repository and see what you found; could that be Gemini too? A tool for searching, integrating, aggregating and manipulating all enterprise data, not just numeric data?

Over the last few days I’ve seen the Gemini team (notably Amir) engaging with bloggers like me about our concerns through comments on our postings. I appreciate that, but my position hasn’t changed: I think the technology is cool but there’s too great a risk that it will be misused as it stands at the moment. Relying on oversight from the IT department isn’t enough; if MS had a convincing metadata story extending to all data types and data sources, metadata that Gemini could use, it would go a long way to addressing my concerns.

%d bloggers like this: