Reporting Services-generated MDX, Subselects and Formula Caching

Analysis Services can, of course, cache the results of MDX calculations and share that cache with other users most of the time… but when it can’t the result is much unnecessary load on your server as calculations are recomputed again and again, and much longer query times. Unfortunately the details on what caching AS can do when are not properly documented anywhere and I’ve only learned the bits and pieces that I know about this subject from my own experience and from talking to the likes of Mosha; I’ve avoided blogging about it because my knowledge is incomplete and this is a complex topic. Happily the upcoming AS2008 Performance Guide should rectify this lack of proper documentation, and I believe Mosha is also going to address this subject in his pre-conf seminar at PASS.

However, there is one specific scenario that I thought I’d write about because I come across it regularly and it affects anyone using a client that generates MDX with subselects in, such as the Reporting Services query builder. Let’s say you’ve got an expensive calculated member on your cube – for example, add the following (deliberately slow) calculated measure to the Adventure Works cube:

{[Scenario].[Scenario].&[1], [Scenario].[Scenario].&[2]}
[Department].[Departments].[Department Level 02].members
, 10,[Measures].[Amount])

In Reporting Services you might then use the query builder to create a query that looked like this to select just Calendar Year 2004 on rows and this new calculated measure on columns:

NON EMPTY { ([Date].[Calendar Year].[Calendar Year].ALLMEMBERS ) }
{ [Date].[Calendar Year].&[2004] } )
FROM [Adventure Works])

On a cold cache, on my laptop, this query takes 33 seconds; on a warm cache it still takes 33 seconds. This is not good – practically all of the query time is taken up evaluating the calculation, and this calculation is clearly not being cached. The reason why is because the query contains a subselect, and when a query contains a subselect it means that the results of any calculations can only be cached for the lifetime of the query. You can easily rewrite the query as follows to avoid the subselect:

NON EMPTY { [Date].[Calendar Year].&[2004] }
FROM [Adventure Works]

On a cold cache this query still takes 33 seconds, but on a warm cache it comes back in 1 second because the calculation can now be cached and this cache can be reused for subsequent queries. Clearly this is much better.

I wonder how many SSRS implementations are suffering from needlessly poor performance just because of this? Subselects are a useful addition to MDX and make calculating ‘visual totals’ type values very easy, but they should be used with care and not just as a means of selecting individual members.

Azure and Analysis Services in the cloud

Hmm, I’m a bit late on this but it seems at PDC the other day buried in Bob Muglia’s keynote was the announcement that we’ll be getting a cloud-based version of Analysis Services as part of SQL Data Services. Here’s Jamie Thomson’s post on this:

Here’s the official Azure site and white paper:

UPDATE: see Mosha’s comment below – there was no announcement of cloud-based AS. Oh well… but I’ve just read about the Office-in-the-cloud announcement. I wonder if we’ll get any AS connectivity in the cloud-based Excel?

Pentaho 2.0 and Pentaho Aggregate Designer

Via Julian Hyde, I see Pentaho have released version 2.0 of their BI suite. As Julian points out in this blog entry here, one of the major new features in this release is the Pentaho Aggregate Designer, which makes it much easier to design aggregate tables for Mondrian. A quick look at the screenshot on Julian’s blog suggests that Mondrian is continuing to flatter Analysis Services by imitation (you can even see that the aggregations are being built on the Foodmart database!) – not that this is a bad thing, in fact it’s what makes Mondrian so easy to use if you’re used to Analysis Services.

So Mondrian is definitely catching up in terms of functionality… and in terms of performance, well, as Julian says "Mondrian is beholden to the RDBMS for performance" and those RDBMSes are getting ever faster. I suspect very soon more and more people are going to find that Mondrian plus an open source RDBMS (perhaps the column-store-based LucidDB, which I see already already has support for the new aggregate table designer) is going to be ‘good enough’ for many BI projects – after all, most AS implementations I see don’t involve massive amounts of data, and if Mondrian can give good query response times on a fact table of 20-50 million rows then the decision on which platform to use will be more heavily influenced by price. And what with the current financial crisis, price is going to be an ever-more important factor for many customers. I wonder how long it will be before I see MS BI consultancies starting to offer open source BI, given the relative ease with which you can transfer your cube design and MDX skills between the two platforms?

Book review: Applied Microsoft SQL Server 2008 Reporting Services, by Teo Lachev

Here’s the deal: if you liked Teo Lachev’s book "Applied Microsoft Analysis Services 2005", which I did, then you’ll like "Applied Microsoft SQL Server 2008 Reporting Services". Both books share the same format and approach and are exhaustive guides to their subjects.

In case you’ve not seen one of Teo’s books before, though, what should you expect? Well, "Applied Reporting Services 2008" is 750 pages long and covers just about every aspect of Reporting Services 2008 that’s worth covering, from installation to report design to management to security to extensibility in great detail. This probably isn’t a book you’re going to sit down and read from cover to cover, but it’s a great reference guide and Teo’s prose is very clear so reading individual chapters as and when you need to is no chore. The best thing about this book, though, is the fact that on every page you can see Teo’s vast real-world experience showing through. For some reason many books on Reporting Services fall into the trap of being Books Online rewritten, never telling you the stuff you really need to know; this book on the other hand is a goldmine of information on how things really work. Let me take the chapter I can really speak from authority on – the chapter on using Reporting Services with Analysis Services. Every other SSRS book I’ve seen has been written by someone with clearly no practical experience of using SSAS and SSRS together, and parrots the usual line about ‘great integration’ and ‘easy-to-use MDX query designer’ etc. Teo on the other hand has, for the first time anywhere, put together all the tips and tricks I’ve ever seen (plus a few I haven’t) on this topic in one place – he lists the pros and cons of the built-in Analysis Services datasource and the OLEDB datasource and how to work around them, how to handle parent/child hierarchies, using extended properties, the lot.

All in all, then, a very highly recommended book; it’s ideal for both beginners and experienced developers and probably the only Reporting Services 2008 book you’ll ever need. You can read some sample chapters and see video demos on the book website here:

OLAP PivotTable Extensions new release

For some reason I’ve not blogged about this before, but anyway the ever-industrious Greg Galloway has just released a new version of his OLAP PivotTable Extensions:

It’s an Excel addin that gives you useful new functionality in Excel 2007 pivot tables connected to Analysis Services, such as the ability to add private calculations, view the MDX behind the pivot table, and (in the new release) search for members and other things. Definitely worth a look, and useful too if you’ve ever wondered how to work with the Excel pivot table in code.

Calculated members are better than assignments to real members (at least sometimes)

Earlier this year I blogged about how, in my experience, using real members and overwriting them with MDX Script assignments sometimes seemed to perform worse than using calculated members. See here for the full posting:!7B84B0F2C239489A!1598.entry

Recently I came across this problem again but was able to get together a proper repro and open a support case with Microsoft. What was happening was that I had a time utility dimension with three members – let’s call them member A, member B and member C. A was the default member and had no calculation, it just showed the real values from the rest of the cube; B was a straightforward YTD calculation; C was a really nasty, untuned, same period previous year calculation with lots of special logic. I had a query that included A and B but not C and which was performing really badly, and I found that when I commented out the calculation for C the query returned instantly. So I guessed that for some reason the calculation for C was being evaluated when the query ran, even though C wasn’t actually requested in the query and this was confirmed by Microsoft.

Unfortunately I was also told this behaviour was ‘by design’, something to do with prefetching and sonar. Thankfully it doesn’t happen for all queries or all cubes (I tried and failed to repro it on Adventure Works) but in my opinion there should never be a situation where the formula engine evaluates calculations that aren’t needed – precisely because in some cases it will mean the query runs much longer than it ever should. Probably the worst thing about all this is that you have no idea when exactly it’s happening with the tools we’ve got available at the moment. I suspect that many people out there have poorly-performing cubes because of this issue…

One of the workarounds was to use calculated members instead of real members, as I recommended in my original post, but as I also said in that post there are some cases where calculated members are not a good alternative to real members. In my particular case I was able to tune the calculation associated with member C so either the problem didn’t happen or C evaluated so quickly it didn’t matter – I’m not sure which. Anyway, I stand by what I said in my original posting: use calculated members rather than real members overwritten by Script assignments wherever you can!

Metadata, semantic web technologies and, yes, Gemini again

I saw a very interesting article the other day in Intelligent Enterprise by Seth Grimes, about a newly-published report on the semantic web by David Provost. Grimes is rightly sceptical about how close we are to these ideas reaching fruition and notes that many of the companies mentioned in the report are concentrating on ‘semantic data integration’. Frankly, to me the idea of being able to integrate data sourced from different parts of the web is still far-fetched, given the trials you have to go through to integrate data from different parts of the same company. But it did get me thinking: when users get Gemini, where will they get their data from? Yes, they’ll be downloading data on ‘industry trends’ etc from the web in the way we saw in the Gemini demo, but it won’t be that often. In well-run companies most of the time the data will come from four sources:

  • The data warehouse, either from the relational source or Analysis Services.
  • OLTP systems
  • Data that lives in someone’s small, well-maintained, official, IT-department tolerated Excel spreadsheet. The kind of spreadsheet that couldn’t and shouldn’t be promoted to database form because it’s too small, or short-lived, or needs to be maintained by non-technical people, or needs the complete flexibility that Excel gives you. In fact, exactly the kind of thing that Excel is meant to be used for.
  • Data that originally came from the data warehouse but was downloaded into someone’s local Access database, or exported to Excel, or sent to the user in something like a SSRS report; so data that has come to the user second hand.

How, then, can Gemini know (beyond its cleverness with column names and analysis of the data within those columns) what data can be integrated with what? Ideally it would have access to some form of metadata embedded in the source data that would help it make the correct decision all in all cases. Where this metadata comes from is a problem much larger than Gemini of course, and goes down to the fundamental question of how an enterprise can keep track of all of its data assets wherever they’re stored and understand what the data in each data store actually means; Microsoft is regularly criticised for its lack of a metadata tool and I guess MS is working on something in this area. If, in the first and second scenarios above, Gemini could connect to SQL Server or Analysis Services and see a common layer of metadata that would help it out, allowing it to join data from SQL Server with data from Analysis Services for instance. In a way, Analysis Services is already a metadata layer on top of the data warehouse, containing information on how tables need to be joined and how measures should be aggregated; perhaps this side of it will become more important as the MOLAP engine is superceded?

I also think some of the technologies of the semantic web, when applied to the enterprise, could be very useful here; I’ve only really just come across this stuff myself, but as an introduction I found the one-page explanation of RDF on Twine was very good, and the Microformats site was also full of interesting information. One of the companies mentioned in the semantic web report is Cambridge Semantics, whose product Anzo for Excel is aimed at imposing structure on all of those Excel spreadsheets I referred to in the third bullet point above. The demo on the web concentrates on using the tool for sharing data and collaboration, but as far as I can see the key to getting that to work is selecting data in the spreadsheet and linking it back to a common metadata layer. Of course the obvious criticism of a product like this is that you’re again relying on your users to make the effort to mark up their data and do so properly, but if there’s an incentive in the form of easier collaboration and – to put it bluntly – less of that boring cutting and pasting then maybe there’s a chance they could be persuaded to do so. I could imagine MS offering something very similar to this as part of Excel, with all the central management being done through Sharepoint, and extended throughout the Office suite to Access, Word etc. Gemini would then be able to do a better job of understanding what and how data in an Excel spreadsheet, say, could be integrated with data in the data warehouse.

So the bottom-up approach could be combined with the more traditional top-down metadata management approach, and crucially I’d want to see metadata automatically embedded in various the output formats of the server products. For example when you built a SSRS report or created an Analysis Services pivot table in Excel, in both cases I’d want the data by default to retain some record of what it was and where it had come from – the metadata would embedded in the document in the same way as if the user had marked up the document themselves. And this in turn would make it a lot more easily reusable and comprehensible by Gemini and other applications such as Enterprise Search.  Metadata would accompany data as it travelled from document to document, data store to data store, format to format. This would then cover the scenario in the third bullet point, allowing you to integrate data from an SSRS report with data from the data warehouse it originally came from, or data from a user-generated Excel spreadsheet that had been marked up manually.

So much for structured data; ideally we’d want to be able to include unstructured data too. Some of the applications of natural language processing mentioned in the semantic web report looked very interesting, especially OpenCalais (which already has integration with MOSS 2007 I see). If you were looking at sales figures for a particular customer in Gemini, wouldn’t you also want to be able to look for documents and web pages that discussed sales for that customer too? And I’ve often thought that while the super-simple type-a-search-term approach for Search has worked brilliantly over the last few years, there’s a niche for a power-user interface for search too, kind of like a Proclarity Desktop for search where you could drag and drop combinations of terms from the central metadata repository and see what you found; could that be Gemini too? A tool for searching, integrating, aggregating and manipulating all enterprise data, not just numeric data?

Over the last few days I’ve seen the Gemini team (notably Amir) engaging with bloggers like me about our concerns through comments on our postings. I appreciate that, but my position hasn’t changed: I think the technology is cool but there’s too great a risk that it will be misused as it stands at the moment. Relying on oversight from the IT department isn’t enough; if MS had a convincing metadata story extending to all data types and data sources, metadata that Gemini could use, it would go a long way to addressing my concerns.

Last thoughts on Gemini for the moment

Despite the worldwide financial meltdown, all talk on the MS BI blogosphere has been about Project Gemini this week. Even though no-one knows much about it – and certainly no-one knows everything about it either, as we’ve been told there are more announcements to come – the reaction has been pretty clear: the back-end technology looks cool, but the strategy of putting Gemini in Excel on the desktop is an invitation to bypass the data warehouse and create spreadmart hell whatever MS says about Sharepoint-based management of Gemini models. I’ve already linked to just about everyone who’s expressed this opinion apart from Mick Horne, whose three posts here, here and here are well worth reading as they put the case against Gemini in its currently proposed form very clearly. Choice quote: "Where is the single version of the truth in this architecture? I’ve just spent 4 years of my life trying to convince users to stop using Excel as a data store and here are Microsoft positively encouraging it. Hell will freeze over before this capability is used responsibly in most organisations".

So what would I like to see MS actually do with Gemini? From the comments on Marco’s blog, at the moment it sounds like the priority is to get the Gemini storage mode working for local cubes and Excel rather than put everything in place for the full server version of Analysis Services. I would prefer the emphasis to be reversed and have Gemini storage mode ready as soon as possible on the server side. It’s clearly going to provide a massive performance boost when it does arrive (and let’s not forget it’s at least two years away), and with other COP databases and data warehouse appliances improving every day there’s going to be significantly more competitive pressure on Analysis Services and SQL Server-based data warehouse projects in the future. I’d hate to see current MS BI shops start abandoning their Analysis Services implementations because they can get better query performance and scalability elsewhere.

But whatever us bloggers say I’m sure we’ll get Gemini in the form we’ve already been shown. The reason is that whatever the rights and wrongs of it from a BI consultant’s point of view, the people who use Excel will definitely want this and so there will be an overwhelming commercial case for it. This kind of desktop, DIY BI is in a way similar to illegal drugs: there are always some people that want it, a certain number of them are always going to do it even though they know they shouldn’t, so you’ve got two choices – either legalise it and then hope to control it, as with Gemini, or throw all your efforts into outlawing it. With the first option you run the risk of encouraging what you wanted to discourage and ending up with a worse problem, with the second option you run the risk of people resenting your rules so much they end up being widely ignored. What’s the best option?

UPDATE: one last link for you – Nigel Pendse of the OLAP Report gives an unreservedly positive review here:

More thoughts on Project Gemini

I’ve now had a chance to watch the demos and read all the first-hand accounts of what was announced yesterday (see Marco, Mosha, Tim Kent, Jeremy Kashel, Richard Tkachuk). Here are some unstructured thoughts and questions:

  • As per my comment yesterday about Qlikview, self-service BI is undoubtedly what many end users want – but as BI professionals we know only too well that it can be dangerous; in fact just about every blog entry I’ve read on Gemini has made this point. The question of whether it’s a good idea to let your power users do what Gemini lets them do is likely to cause all kinds of heated religious dispute: I was involved in an argument about this at SQLBits recently, and over on Brent Ozar’s blog (see here and here) you can see the same discussion being had. Although I completely understand the scenario that MS describes in its demos of users needing to work with data that isn’t and will never be in a data warehouse, I lean slightly to the side of those who see self-service BI vendors as selling "snake oil". But being a BI consultant I would, wouldn’t I? All this talk of Gemini representing the ‘constellation of twins’, power users and the IT department working together happily, is something of a fairy tale…
  • In a comment on my blog yesterday, Mosha stated that there was no cube wizard needed for Gemini. But looking at the demo there’s certainly a step needed where you connect to data sources and choose the tables and fields you want to work with, so whether you call it a cube wizard in the strictest sense you need to have some understanding of your data before you can do anything with it. And whatever the demo says, the application you’re using can only take you part of the way, there’s no way a model can be 100% inferred. What happens if fields that mean the same thing have two different names in two different data sources, or if there are two fields which mean different things which have the same name? And, even for many power users, the question of what a table or a join or even a database actually will still need some explanation.
  • While we’re at it – and I know this is a bit of a tangent – expecting power users to understand basic technical concepts is one thing but in many cases (as this excellent blog entry points out) "people have no way of knowing which questions are meaningful ones to ask, and which are meaningless". Not that I’m saying your average BI consultant/IT guy has a better idea either, far from it.
  • I was pleased to see mention of data cleaning functionality in the Gemini addin. Is this coming from Zoomix?
  • Certainly the Gemini pivot table demo was very impressive. Is this what pivot tables will look like in Office.Next? If so, are we going to see Excel finally grow up to being a full-featured AS client tool for power users in the same way Proclarity Desktop was?
  • Moving on, on one hand we’ve got Project Madison, which gives us in SQL Server the ability to query vast amounts of data very quickly. Since this is in SQL Server, I would expect to be able to use AS in ROLAP mode on top. On the other hand we have Project Gemini which will give us a super-fast in-memory storage mode for AS but for slightly smaller data volumes. Where do the two meet? Will we be able to create a HOLAP like solution where your raw data stays in SQL Server/Madison and you can create Gemini-mode aggregations? And can you persist the data in Gemini to disk easily, in case of hardware failure? How long does it take to load data into Gemini?
  • Apart from Qlikview, the other product being mentioned in the same breathe as Gemini is TM1, which is of course primarily used for financial apps. So what will the benefits of Gemini be for PerformancePoint and home-grown AS financial cubes? Not only faster storage engine queries, but also faster calculations (although I know only too well that sometimes you can have poor query performance due to calculations even on a warm storage engine cache, even in AS2008). And will you be able to do writeback on a Gemini partition? Now that would be a major performance benefit.
  • Having said that the need to be able to write MDX will keep people like me in a job, it’s worth noting that it should indeed be possible to make it easy to write many MDX calculations in Excel. Indeed, one of the cool features of the Intelligencia Query MDX generator is precisely this: the ability to turn spreadsheet style formulas into MDX calculations. And yes, Andrew is in the process of getting this functionality patented.
  • I love the idea of Gemini being AS, but I can imagine that some more relationally orientated people would want the ability to query this new data store with SQL. Of course AS actually can be queried with SQL but it’s a very limited subset; it would be great to see tighter integration between AS and the relational engine (along the lines of Oracle’s new cube-based materialised views) so the performance gains that AS gives you can be made available to the relational engine.
  • Which thought in turn leads onto whether Madison style MPP can be applied to the Analysis Services engine itself (as I wondered here), either directly or if AS was more tightly integrated with the relational engine. So many permutations of these technologies are possible…
  • As with PerformancePoint and Excel Services, there seems to be yet another dependency on Sharepoint here for the management of Gemini models. Of course some central repository is necessary and it makes sense to use Sharepoint rather than reinvent the wheel, but as Microsoft Watch points out this cross-dependency helps MS sell more licenses. And as anyone who has tried to sell a MS BI solution will tell you, selling more server products can be a problem – it’s not necessarily the licence cost but the perfectly valid "we don’t use Sharepoint here, we use X and we don’t want to have to support Sharepoint just for this" response that has to be overcome. I think this issue part-explains why I’ve seen so little use of Excel Services with Analysis Services in my work when it seems such a compelling proposition for almost all companies.
  • Lastly, given the current financial crisis, something tells me that when the first CTPs of all this appear next year consultants like me will have plenty of free time to test it out. I know pundits out there are saying that the BI industry will weather any recession because companies will want to compete on information, but I’m sceptical – in my experience most companies don’t make rational decisions in circumstances like these (is that heresy coming from a BI consultant?), they just cut budgets and fire staff without thinking much. And IT consultants, perceived as a cost and of lesser importance to the health of the business than things like, say, the CEO’s bonus, always feels the pain first. Hohum.
%d bloggers like this: