One Cube vs Multiple Cubes

One of the questions discussed in the book that Marco, Alberto and I wrote last year, “Expert Cube Development with SSAS 2008” (available in all good bookshops, folks!), was whether, if you have multiple fact tables, you should create one big cube with multiple measure groups or multiple cubes each with a single measure group. While I still stand by what we wrote then, I recently took part in a debate on this subject in the MSDN Forum with Akshai Mirchandani from the dev team about the pros and cons of each approach, and some interesting new details came to light:

http://social.technet.microsoft.com/Forums/en-US/sqlanalysisservices/thread/681e59bd-93ca-4a91-9f26-8ed96e825553

Here are the main points you need to consider when deciding whether to use the single cube approach or the multiple cube approach. In favour of the multiple cube approach:

  • Having multiple, smaller cubes may result in faster query performance than one large cube in some cases, especially if your fact tables have very different dimensionality. This was true in SSAS 2005, and while it’s less obvious in 2008 it’s apparently still there. This was what I’d previously not been sure about: I’d heard rumours about this, and seen it happen in some cases myself with 2005 – although in other cases when I’d tested this out I’d seen no difference in performance – and I wasn’t sure what the situation was with 2008. You’d need to test the two approaches yourself on your cubes and queries to be sure. Here’s what Akshai says on the matter:

    if you keep adding lots of dimensions to a cube, then the virtual space of the cube grows — it does not add to the storage cost, but it does hurt formula engine performance in some scenarios because the cell coordinates are based on the number of attributes in the cube space. Increasing the number of attributes in the cube space will start costing performance in lots of small ways and result in performance regressions. Adding lots of unrelated measure groups would result in you adding lots of unrelated dimensions to the cube space and cause a performance slowdown — if you had 10 measure groups and they all shared lots of common dimensions, then one cube makes the most sense…

    …As I already explained… it affects the sizes of data structures inside the formula engine that are based on the number of attributes in the cube space. When those data structures get larger, there is an incremental cost that can add up (depending on your calculations and query patterns).

    For example, you see the Query Subcube Verbose events in Profiler — they show you the subcubes that are used for accessing measure groups. There are similar subcubes that are used for calculating formulas and cell coordinates — all those subcubes get wider and wider as you start adding more attributes into the cube. The cost of accessing and indexing those data structures is what we’re talking about here. If adding new measure groups doesn’t require adding new attributes/dimensions, then there is no problem…

    We had measured the difference before 2005 shipped for some real customer cubes and found that there was a noticeable performance improvement to split up into multiple cubes…

  • While it is possible to apply dimension security to the Measures dimension, it is much easier to allow or deny access to a cube with the multiple cube approach than it is to apply security to all the measures in a measure group using the single cube approach.
  • Having multiple, simpler cubes can be much more user friendly than one monster cube with loads of dimensions and measure groups. If you have Enterprise Edition you can of course use Perspectives to counter this, but if you are using Standard Edition then Perspectives aren’t available.
  • Maintenance can be easier and less disruptive with multiple cubes: if you need to make changes to a cube while users are querying it, you might end up invalidating users’ connections and dropping caches. With one cube the chances of this disruption affecting more users increases.
  • It’s easier to scale out with multiple cubes: if you find your server is maxing out, you can simply buy another server and distribute your cubes equally between the two. With a single cube approach you end up having to look at (admittedly not that much) more complex scale-out scenarios like network load balancing.

On the other side, here are the arguments in favour of the single cube approach:

  • If you ever need to work with data from two fact tables in the same query or calculation, or if you think you might ever need to in the future, you should go with the single cube approach. The two options for cross-cube querying, linked measure groups and the LookUpCube MDX function, should be avoided. Linked measure groups are a pain to manage, carry a slight query performance overhead, and can result in the same MDX calculations being duplicated across the original cube and the cube containing the linked measure group (which means maintenance becomes more difficult). The LookUpCube function is probably the worst MDX function to use in a calculation from a performance point of view and should be avoided at all costs. So a single cube is the only feasible option (a quick sketch of the kind of cross-measure-group calculation I mean follows this list).
  • Even if your users tell you they will not ever need to analyse data from two fact tables in the same query, be prepared for them to change their minds. In my experience, SSAS projects have a tendency to grow in complexity over time, and cubes that start out simple in a first release often grow lots of new functionality as time goes on – and the more successful the project, the quicker things get complicated. As soon as your users see what’s possible with SSAS they will start to have new, more ambitious ideas about the kind of analysis they want to do with their data, and it’s very likely that they will realise they do need to do cross-measure-group queries and calculations. If you started out on the multiple cube approach and then this happens you will have no choice but to use linked measure groups, and as I said this can make maintenance difficult; using the single-cube approach from the start means you won’t have this problem.
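
To make this concrete, here’s a minimal sketch of the kind of cross-measure-group calculation I mean, written as a calculated measure on the cube’s MDX Script. It assumes an Adventure Works-style cube where Internet Sales and Reseller Sales are separate measure groups; all the names are just illustrative:

-- Only practical in a single cube (without linked measure groups or LookUpCube):
-- a calculated measure that combines values from two different measure groups
CREATE MEMBER CURRENTCUBE.[Measures].[Internet Sales Share] AS
    IIF([Measures].[Internet Sales Amount] + [Measures].[Reseller Sales Amount] = 0,
        NULL,
        [Measures].[Internet Sales Amount] /
        ([Measures].[Internet Sales Amount] + [Measures].[Reseller Sales Amount])),
    FORMAT_STRING = "Percent";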

My personal preference is to use the single cube approach by default, and then move to multiple cubes if there are pressing reasons to do so, for example if query performance is a problem. This might seem a bit strange given the number of reasons I’ve given for the multiple cube approach, but frankly the need to support cross-measure-group querying and calculations trumps them all. As I said, if you need to do it (and 99% of the time you will), or you even half suspect you might need to do it sometime in the future, you have to go with the single cube approach. That said, I know other people are more inclined to the multiple cube approach than I am and to a certain extent it’s a matter of taste.

The Deployment Wizard, and when “Retain Partitions” doesn’t retain partitions

I learned something the hard way this week about how the Deployment Wizard actually works (thank goodness for daily backups) that I thought I’d share.

I’ve used the Deployment Wizard lots of times to deploy changes to an SSAS database in production. If you have a system where new partitions are generated dynamically, for example in an SSIS package that creates new partitions automatically when new data is loaded into the cube, then these partitions are not going to be present in your project in BIDS and you don’t want them to be overwritten when you deploy the new version of your database. The Deployment Wizard allows you to stop this happening, and also not to overwrite other settings that might be different in dev and prod, such as connection strings and security roles.

However, what I didn’t realise was that in order for the Deployment Wizard to work properly in all cases it needs to be able to see the server you want to deploy to, and in my current customer’s environment you can’t see prod from dev. Working on dev, what I did was to edit the .deploymenttargets input file manually to enter the name of the database in prod I wanted to deploy to, then run the wizard with the /a argument (see here for more information on these settings) to set other options in the input file such as Retain Partitions, then run the wizard again with the /o and /d arguments to generate the XMLA deployment script without needing to connect to the target server (which I couldn’t, of course, see). And then, when I ran the resulting script in prod, I saw all the partitions that weren’t in dev disappear even though I’d selected the Retain Partitions option.
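
For reference, the two runs described above look something like this from the command line (the paths and file names here are made up, and you may need to qualify the executable with its install folder):

rem Answer mode: step through the wizard and save the options you choose (such as
rem Retain Partitions) back into the input files, without deploying anything
Microsoft.AnalysisServices.Deployment.exe "C:\Deploy\MyDatabase.asdatabase" /a

rem Output mode plus /d: generate the XMLA deployment script from the input files
rem alone, without connecting to the target server
Microsoft.AnalysisServices.Deployment.exe "C:\Deploy\MyDatabase.asdatabase" /o:"C:\Deploy\MyDatabase.xmla" /d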

What I had assumed was that by selecting the Retain Partitions option the wizard would somehow generate the XMLA Alter command it outputs so that it would just ignore the partitions in each measure group. However this is not correct: you can’t have an Alter command for a database that doesn’t list all the partitions in all the measure groups in the database. This is why the wizard needs to connect to the target server: it scripts out all the partitions that currently exist on the target and inserts them into the script it generates, instead of just the partitions that are in the .asdatabase file. If, though, you use the /d argument you’re telling the wizard to use only the information that’s in the input files and not to connect to the target server to see what’s there, so it can’t know what partitions are present on the target server and it has to use the partitions from the .asdatabase file instead. So, if you use the /d argument, even if you specify the Retain Partitions option you may see partitions deleted on the target when you run the wizard’s script.

Building a Better Cache-Warmer, Part 2: The Formula Engine Cache

Sorry for the very long delay – at long last, here’s the companion piece to the post I wrote last September about cache-warming. That post dealt with the relatively straightforward topic of warming the Storage Engine cache; this time we’ll be looking at how to warm the Formula Engine cache. As I promised in the first post, the plan is to take the ideas discussed here and use them to create a better cache warmer than the ones I’ve blogged about in the past.

The first thing to note when talking about caching the result of MDX calculations is that it does happen – a surprising number of people think that Analysis Services performs every MDX calculation needed by a query from scratch every time. This does indeed happen in worst-case scenarios – and it’s all too easy to do things which accidentally get you into these worst-case scenarios – but when SSAS can cache the result of calculations it can have a massive impact on query performance.

The second thing to note is that when I say that SSAS can cache the result of calculations, what I mean is that it can cache the values returned by cells in the cube which have been subject to some form of MDX calculation. Analysis Services can’t cache the result of MDX expressions as such; it can only do so when the result of those expressions is surfaced as a value returned by a cell somewhere in the cube. This means that if we create a calculated measure on the cube’s MDX Script, and then include this calculated measure in a query, we can expect that SSAS should be able to cache the values returned by this calculated measure. On the other hand, if we have a complex MDX set expression on the Rows axis of our query, the set that expression returns cannot be cached – it will be re-evaluated every time the query is run.

The third, and possibly most important, thing to remember is that SSAS can only cache the results of a calculation for as long as the calculation actually exists. As you may already know, there are three places you can create a calculated member in SSAS, and these equate to three different ‘lifetimes’. If you create a calculated measure on the MDX Script of the cube it will live until the next time the cube’s MDX Script is dropped and recreated, and this will happen if you explicitly clear the cache or you do any kind of processing. This is called ‘global’ scope. If, however, you create a calculated member in the WITH clause of a query then that calculated member will only exist for the lifetime of that query – so while you will benefit from caching while the query is executing, after the query has returned the calculated member will disappear and so will any related cache. This is called ‘query’ scope. The third scope, or lifetime, for creating a calculated member is ‘session’ scope: you can create calculated members that live for the lifetime of a session using the CREATE MEMBER statement, so values can only be cached for one user and one session; this is used only very rarely though.
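
Here’s a rough sketch of what the three scopes look like in practice; the cube, hierarchy and measure names are all hypothetical:

-- Global scope: defined on the cube's MDX Script, so cached values survive until
-- the Script is dropped and recreated (eg when you process or explicitly clear the cache)
CREATE MEMBER CURRENTCUBE.[Measures].[Sales YTD] AS
    AGGREGATE(YTD([Date].[Calendar].CURRENTMEMBER), [Measures].[Sales Amount]);

-- Session scope: the same statement executed by a client against the cube, rather
-- than on the MDX Script, creates a member (and a cache) that lasts only for that session
CREATE MEMBER [MyCube].[Measures].[Sales YTD Session] AS
    AGGREGATE(YTD([Date].[Calendar].CURRENTMEMBER), [Measures].[Sales Amount]);

-- Query scope: a WITH clause member exists only while the query runs, so any cached
-- values are thrown away as soon as it returns
WITH MEMBER [Measures].[Sales YTD Query] AS
    AGGREGATE(YTD([Date].[Calendar].CURRENTMEMBER), [Measures].[Sales Amount])
SELECT [Measures].[Sales YTD Query] ON 0,
       [Date].[Calendar].[Calendar Year].MEMBERS ON 1
FROM [MyCube];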

The fourth thing to understand is that SSAS can’t cache the results of some calculations, or, if it can cache them for one user, it cannot share the contents of this cache with other users. For example this will happen if the MDX calculation might return a different result every time it runs (eg if it used the Now() function to return the system date and time – which is obviously going to be different each time it’s called) or if it might return different results for different users (eg when two different users run the same query through different security roles, because they might be able to see different parts of the cube the values returned might be different). The way in which a query is constructed can also force ‘query’ scoping for the formula engine cache: the presence of a calculated member in the WITH clause can do this, as can the use of subselects. This also means that FE cache warming is pointless for certain popular client tools like SSRS or Excel because the MDX they generate prevents the use of the FE cache.
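
For example, both of the query shapes below push the whole query down to query-scoped caching, even when the calculated measure itself is defined on the cube (the names are hypothetical again):

-- The mere presence of a WITH clause calculated member forces query scope...
WITH MEMBER [Measures].[Dummy] AS 1
SELECT {[Measures].[Sales YTD], [Measures].[Dummy]} ON 0
FROM [MyCube];

-- ...and so does a subselect, which is the shape of MDX that tools like Excel
-- typically generate
SELECT [Measures].[Sales YTD] ON 0
FROM (SELECT {[Product].[Category].&[1]} ON 0 FROM [MyCube]);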

Finally, the Formula Engine has two different structures it uses to hold cached values. For calculations that execute in bulk mode it can use the same structure that the Storage Engine cache uses, namely the data cache registry. For calculations that execute in cell-by-cell mode, however, it uses a different structure called the flat cache. Calculations that execute in bulk mode in most cases cannot make use of values stored in the flat cache, and calculations that execute in cell-by-cell mode cannot make use of values stored in the data cache registry. Furthermore, the size of the flat cache is restricted to 10% of the TotalMemoryLimit server property; if it grows bigger than that it will be completely emptied.

(Incidentally, if you’re looking for more detail on any of the above points, I suggest you watch the video of my session on “Cache Warming Strategies for SSAS 2008” here).

So clearly there are many potential pitfalls to watch out for when warming the FE cache, and indeed in many cases I’d say that it’s just easier to concentrate on warming the SE cache and tuning your calculations so they execute as fast as possible even on a cold FE cache!

If you do need to warm the FE cache though, you need to be very careful. Unfortunately the CREATE CACHE statement mentioned in the previous post only works for the SE cache, so the only way to warm the FE cache is to use MDX queries. These queries should be hand-written specifically for the purpose – if you simply record queries run in production using Profiler, it’s likely you’ll end up with queries that don’t warm the FE cache because they contain a WITH clause or a subselect – but it’s nonetheless a good idea to use production queries as a starting point and then modify them so they become FE-friendly. A smaller number of larger queries is going to be better than a large number of queries that return small amounts of data, to avoid cache fragmentation in the data cache registry, and it’s a good idea to keep them as simple (and as non-arbitrary-shaped) as possible.
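
So a hand-written warming query might end up looking something like this: a plain SELECT with no WITH clause and no subselect, referencing the cube-level calculated measures you care about at a granularity real user queries are likely to hit (all the names below are placeholders):

SELECT {[Measures].[Sales Amount], [Measures].[Sales YTD]} ON 0,
       NON EMPTY
       [Date].[Calendar].[Month].MEMBERS *
       [Product].[Category].[Category].MEMBERS ON 1
FROM [MyCube];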

Since it’s likely that the calculations you’re most interested in caching are going to be the ones that, despite your best efforts at tuning them, execute in cell-by-cell mode, it’s a good idea to keep an eye on the overall size of the various flat caches in existence. The following DMV, I think, shows the size of all the flat caches across all the databases and for every session on an SSAS instance:

select * from
$system.discover_object_memory_usage
where object_id = 'FormulaCache'

…but I need to do a bit more research to make sure it shows what I think it shows, and to see whether I can observe this 10% limit resulting in the flat cache being emptied.

Lastly, as with the SE cache, you don’t want to overfill it and leave no memory for the natural growth in cache that will occur as a result of normal querying; and you certainly don’t want to go over the LowMemoryLimit or TotalMemoryLimit while you’re warming the cache and end up with cache evictions taking place.

IBAX SSAS Value Pack

Before I start, I need to declare that since Michael Brönnimann (one of the guys at IBAX behind this product) is a friend of mine I can’t really be considered an unbiased reviewer of this product; however, since I think it’s interesting and deserves wider exposure, I did think it was worth blogging about. In fact I’ve more or less given up reviewing products and books on this blog because I know so many people in the SSAS world that it makes impartiality on my part very difficult; it’s just easier to flag up interesting stuff and not try to make any kind of critical assessment.

Anyway, back to the matter in hand: a look at the IBAX SSAS Value Pack. The product’s web page is here:
http://www.ibax.ch/-IBX-/plain.simpleimagetitletextlinklist.en.solutions.products.valuepack/default.aspx
Basically, it’s a template Analysis Services solution that contains ready-made solutions to many cube design and MDX calculation problems, which you can adapt for your own project. The idea is that, rather than building everything yourself from scratch, if you bring your own data into an existing solution you’ll be able to build something that’s very feature-rich very quickly without having to spend too much time worrying about the design and implementation of the difficult stuff. And the template itself certainly is very feature-rich: it has a full set of time series calculations, many other financial and analytical calculations, different types of currency conversion, the ability to show measures at different scales, formatting and colour highlighting, and actions and drillthrough.

The next question is, of course, will you actually get any value from a product like this (assuming you don’t buy it as a service offering)? I mean, if you don’t know how to implement this stuff in the first place are you going to be able to adapt someone else’s code to your own requirements? Well, there’s a lot of documentation and some introductory training included in the package to help get you up to speed; and I think if you’re trying to learn SSAS it always helps to have some worked examples to study; and even if you throw away 80% of what you get, the remaining 20% will still have proved useful. If the alternative is to spend months and months developing something, and then find you’ve made some fundamental mistake in the design early on that you can’t now change, then it’s got to be better.

Arguably Microsoft should be providing something like this to its customers and partners; there have been various service offerings and solution accelerators developed for SSAS over the years but nothing much has ever come of them. The Add Business Intelligence Wizard in BIDS does something similar too, but again that never seemed to be a development priority and it suffered in the earliest releases of SSAS 2005 from generating MDX that didn’t follow best practices or, worse, just didn’t work. Adventure Works is probably the closest comparison, and although it does a good job as a sample database it’s not really a template for an enterprise solution. So it’s up to third parties like IBAX to fill the gap in the market…

2008 R2 Best Practices Analyzer released

The 2008 R2 version of Best Practices Analyzer has just been released – you can download it here:

http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=0fd439d7-4bff-4df7-a52f-9a1be8725591

I’ve not looked at it yet (it was only released a few hours ago) but I’m told it features an updated set of rules for SSAS. Running the BPA over your existing cube is a good way of spotting bad things you might have done by accident or because you didn’t know they were bad.

New Stuff in the TechEd BI Keynote

I wasn’t able to go to the MS BI Conference/TechEd in New Orleans this year, unfortunately, but I’ve just watched Ted Kummert’s keynote from today and it’s got lots of interesting new BI-related stuff in it. You can view it here:
http://www.msteched.com/2010/NorthAmerica/Keynote02

If you don’t have any patience with the normal keynote backslapping, I suggest you forward to around 01:20:00 when Amir Netz comes on stage to do some demos. Here’s what he shows:

  • 01:25:20 Pivotviewer Extensions for Reporting Services. As far as I can see, this is a new bit of SSRS functionality that does something like the following: execute a data-driven subscription to generate multiple SSRS reports, then load them into the new Silverlight control-based version of Live Labs Pivot, so the output of these reports can be analysed. This will be available in the next thirty days, so I guess early July. It makes for a very visually appealing demo but if I’m right about what’s happening here then it seems a bit of a hack. Live Labs Pivot needs to be properly integrated into the BI stack!
  • 01:32:30 the ability to define KPIs in PowerPivot. The funny thing is, ever since KPIs were introduced in SSAS 2005 I have worked with hundreds of companies using SSAS and I have never, ever seen anyone using KPIs in production! But this is slightly different and it’s got a nice visual designer, so I can see how KPIs in PowerPivot would be very useful.
  • 01:34:45 new record view for viewing data in the PowerPivot designer, a better way of working with wide tables and editing the calculations on them. Handy.
  • 01:36:30 importing a PowerPivot solution, from Excel, into BI Development Studio. You get the same PowerPivot designer UI in BIDS as you get in Excel, but now you are able to use source control, develop offline and so on.
  • 01:37:55 lineage and impact analysis for DAX calculations – a diagram showing the dependencies between DAX calculations. Good, but lineage and impact analysis is needed for much more than just DAX calculations – it needs to cover everything from the structure of the relational data warehouse to SSIS, SSAS and SSRS reports.
  • 01:39:30 2 billion rows of data loaded into a server-based instance of SSAS running in in-memory mode, from a project developed in BIDS. And of course, it’s fast!

So only a few hints at how Vertipaq will be used in corporate BI scenarios, but what’s here is encouraging…

Thoughts on Analysis Services in the Cloud

So far this year I’ve indulged myself a few times in a bit of futurology (here and here, for example) regarding directions the Microsoft BI stack might take. The one area I haven’t touched on recently, though, is what Analysis Services in the cloud might look like; I did speculate a bit here but that was a while ago now and before several relevant technologies had been announced. It’s certainly coming, and presumably somebody somewhere is working on it right now in some top-secret bunker in Redmond, so maybe a few public comments on what we the user community would want from it would be helpful…? Anyway, welcome or not, here are some thoughts…

So why would you want or need Analysis Services in the cloud rather than regular Analysis Services? I can think of a few things it should be able to do to justify its existence:

  • It should be cheaper than hosting all the infrastructure in-house and it should be scalable to the Nth degree. OK, so these are the standard reasons dragged out for cloud-based anything, but with Analysis Services there are two obvious times when resource usage peaks – processing and when a big/complex query runs – and equally there are times when the server can be completely quiet; so the idea of being able to make use of near infinite resources when you need them to make the processing/querying super-quick, but only pay for what you use, is very appealing. From this point it follows that when you need to use extra resources, you need a platform that can scale to be able to make use of those resources.
  • As well as being able to work with ‘traditional’ data sources such as your corporate data warehouse, it should be able to work with cloud-based data sources be they relational or non-relational (like Amazon SimpleDB, Google’s recently-announced BigQuery, Azure Table Storage and all the rest), feeds (like OData or GData), linked data (RDF), web-based spreadsheets like Google Docs or the Excel web app, or completely unstructured data from anywhere on the web (maybe something like how Google Squared works). Supporting the integration of data modelled for all of these different types of database would be a challenge but I think it should be possible.
  • It should be available as a data source for anywhere else on the web – your own apps, reports, web-based spreadsheets and so on – as well as desktop apps like Excel. The really important thing, for me at least, would be for it to expose an XMLA interface to allow ad-hoc querying (note that Excel can’t talk directly to an XMLA provider, since it only does OLEDB for OLAP, but it’s possible to bridge the two and Simba already sell an OLEDB provider that does this); grudgingly, I’ll admit a SQL query interface would be useful too. The ability to expose data via an OData feed would be a must as well.
  • I’d also like to see it support some basic ETL functionality, because if it is as scalable and fast as I’d like it to be then it would have an obvious secondary use for large-scale number crunching – aggregating data, doing lookups, sorting, many of the things that you might do today in the SSIS data flow or which you might look at Hadoop to do. Derived columns and lookups could all be done with DAX or MDX calculations; pivoting, sorting and filtering could all be done (and configured very easily with a good client tool) through the right MDX query. I can imagine it acting as a data source for itself: you’d load data into a cube or a table or whatever, create a query on top of it which is made available as a feed, then take the data from that feed and load it into another cube/table, and so on.
  • Following on from the last two points, it’s not enough for it to be able to act as a data source: Microsoft would need to come up with a decent web-based client tool specifically for use with it. And no, vanilla pivot tables on the web wouldn’t cut it, nor would SSRS in the cloud (not that I wouldn’t want that) – you’d need to have the wow factor that something like Live Labs Pivot has as well as serious, power-user functionality like the Proclarity desktop client; it would probably need to be built using Silverlight or HTML5. I still think there’s an opportunity to rethink what a client tool could be here, blur the line between BI client tool, spreadsheet and database and come up with something really new.

I wouldn’t want, and don’t expect to get, a recognisable version of Analysis Services 2008 in the cloud in the way that SQL Azure is recognisable as server-based SQL Server. While I still see an important role for Analysis Services as we have it today in corporate BI scenarios, I don’t think there’s any point transferring it to the cloud with exactly the same functionality. Some things, like dimension security, would still be needed, but others, like cell security, we could probably live without. PowerPivot in the cloud would make more sense as a starting point, so long as it was not just a straight copy of PowerPivot on the desktop: the ability to scale to really, really large data volumes, as in the first bullet above, would be the key feature and the only real reason why customers would want BI in the cloud. And it’s not just the scalability of simple queries either – queries that use complex calculations would need to scale too. You know what, though? When I look at DAX I can’t help but think it was designed with this requirement in mind; I can see how the evaluation of DAX expressions could be easily parallelised.

So all this sounds pretty ambitious, even a bit pie-in-the-sky. The way I see it, though, as far as BI-in-the-cloud goes there’s everything still to play for and if Microsoft doesn’t deliver then someone else will. Could it be Google, or Amazon, or some startup we’ve never heard of? Now that Google BigQuery has a SQL interface, it’s only going to be a matter of time before someone builds a BI app on top of it (I wonder if Mondrian can be made to work with it?) – and it certainly seems to be fast. Microsoft needs to think beyond using BI to defend the Excel/Sharepoint franchise and start thinking about the future! With ten years of BI experience behind it, Microsoft should be in a strong position to move forward into the cloud but it’s only going to succeed if it’s innovative. I understand there will be some new PowerPivot-related product announcements at the BI Conference this week; I’m keeping my fingers crossed.

As always, your comments and ideas are welcome…

What happens when you do a Process Update on a dimension?

Over the last few days I’ve been involved in an interesting thread on the SSAS forum regarding what happens when you do a Process Update on a dimension. It’s a topic that is not widely understood, and indeed I didn’t know all the details myself until today, but it’s nonetheless very important: one of the commonest performance-related problems I see in my consultancy work is partitions that have aggregations designed for them, but where those aggregations aren’t in a processed state because a Process Update has been run on one or more dimensions. Anyway, just now Akshai Mirchandani from the dev team posted a really good overview on that thread of what actually happens when you run a Process Update, so I thought I’d copy it here to ensure it gets a wider audience:

Here is a quick summary of what happens when you do ProcessUpdate:

1. After the dimension has been updated, the server analyzes the changes that occurred to the dimension. In 2005, this analysis was pretty simple and would often incorrectly detect that major changes had occurred that required clearing of indexes and aggregations. In 2008, this code was improved such that it more often would realize that nothing significant has occurred. It’s a fairly small (but useful) optimization — I guess nobody thought it was worth documenting!

2. Based on this analysis, the server will decide whether or not indexes and aggregations need to be cleared. If no (e.g. because records were only added and not deleted/updated), then the partitions won’t be affected.

3. If indexes/aggregations need to be cleared, then the server will check if ProcessAffectedObjects was enabled — if yes, then instead of clearing the indexes/aggregations it will rebuild the indexes/aggregations.

4. The act of clearing the indexes/aggregations also shows up as "partition processing operations" in Profiler — that’s one of the things that has been confusing some of you.

5. When aggregations are cleared, only the flexible aggregations need to be cleared because we’re guaranteed by the rigid relationships that the members cannot have moved and therefore the rollups cannot have changed. However, indexes can still have changed and therefore you may still see the partition processing jobs kick off to clear the indexes.

6. ProcessIndexes and ProcessClearIndexes take care of building both bitmap indexes (aka map) and aggregations — the context is that both aggregations and bitmap indexes are generically considered "indexes".

Really the main takeaway here is that if you ProcessUpdate a dimension, you should strongly consider either doing ProcessAffectedObjects or an explicit ProcessIndexes on the affected partitions so that bitmap indexes and flexible aggregations get rebuilt. The advantage of explicitly doing ProcessIndexes is that you can bring your cube online earlier and have the indexes/aggregations get processed more lazily in the background — a number of customers prefer to do that because their processing windows are too small to wait for the indexes to get processed.
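
To illustrate the pattern Akshai describes, here’s a sketch of an XMLA batch that does a Process Update on a dimension and then an explicit Process Indexes on an affected partition. All the object IDs are made up, and if you prefer the ProcessAffectedObjects route then (if I remember the syntax correctly) it can be set as an attribute on the Batch element instead:

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <!-- Process Update the changed dimension -->
  <Process>
    <Object>
      <DatabaseID>MyDatabase</DatabaseID>
      <DimensionID>Dim Product</DimensionID>
    </Object>
    <Type>ProcessUpdate</Type>
  </Process>
  <!-- ...then rebuild bitmap indexes and flexible aggregations on an affected partition -->
  <Process>
    <Object>
      <DatabaseID>MyDatabase</DatabaseID>
      <CubeID>Sales</CubeID>
      <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
      <PartitionID>Internet Sales 2010</PartitionID>
    </Object>
    <Type>ProcessIndexes</Type>
  </Process>
</Batch>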

Also related to this topic, I thought I’d highlight a great post by Darren Gosbell where he shows how to find out whether your aggregations are processed or not:

http://geekswithblogs.net/darrengosbell/archive/2008/12/02/ssas-are-my-aggregations-processed.aspx

SQL Server Metadata Toolkit 2008

One of the few sessions I was able to see last week at the PASS European Conference (I had to leave very early, unfortunately) was Markus Raatz’s excellent “New Treasures for Microsoft BI Found on Codeplex”. Among other things, he showed off a new project on Codeplex that I hadn’t seen before: SQL Server Metadata Toolkit 2008. It’s an updated version of the SQL Server Metadata Toolkit that was released for 2005 a few years ago, and allows you to trace metadata dependencies between the relational database, SSIS packages and SSAS objects. So if, for example, you change a column name in a dimension table you should be able to work out which SSIS packages and which SSAS dimensions will be broken. You can download it here:
http://sqlmetadata.codeplex.com/

I’ve not had a chance to download it yet, but it looks like it will be very useful.

Analysis Services and Solid State Disks

There’s a lot of discussion going on in the SQL Server relational world about solid state disks and their benefits, but for some reason very little has been said about how useful they could be for Analysis Services. And yet, with their impressive performance for random reads, SSAS and SSDs would seem to be the perfect combination. I think the reason for the silence is that there’s less general understanding of SSAS performance issues out there in the blogosphere, and probably also because there are fewer really large installations of SSAS, not many people have had the chance to test SSDs with it properly.

Anyway, the other week I was lucky enough to spend some time testing SSDs with SSAS for a customer of mine. For reasons of privacy I can’t share any significant details about what I found, and before I go on I need to point out that what I was doing was comparing the performance of a particular SSAS cube on a particular existing hard disk configuration with the same SSAS cube on a particular SSD configuration. So while it’s certainly possible to say that, in lab tests, SSDs can perform extremely well for certain tasks, in the real world you have to ask the following question: what benefit will an SSD give me for my cube, over my existing disk configuration? So in fact even if I could share the details of my tests they wouldn’t be all that useful.

That said, here are a few high-level findings:

  • Backup on the SSD was approximately 50% faster than on their existing, very respectable, hard disk configuration; Restore was 65% faster. These were the clearest examples of the SSD showing massive performance improvements.
  • For various reasons I couldn’t do full tests on processing performance, but Process Index operations were about 10-15% faster, as was Synchronisation.
  • Performance of individual queries was not significantly faster, maybe 1-2%; there were some cases where, with large numbers of concurrent users, I saw a bigger improvement of around 10%. This didn’t surprise me at all because I knew that any query performance issues the customer has are related to the formula engine and not the storage engine, so IO wasn’t much of a problem in the first place.

These tests also led to some interesting behind-the-scenes discussions about how you can actually measure the impact of IO on SSAS storage engine operations, which Teo has already blogged about here and which are well worth reading:
http://prologika.com/CS/blogs/blog/archive/2010/04/13/analysis-services-i-o-time.aspx
http://prologika.com/CS/blogs/blog/archive/2010/04/16/using-xperf-to-test-analysis-services.aspx

For this particular customer, my recommendation was that it wasn’t worth the time, money and effort to move to SSDs, at least from a query performance point of view. In fact, based on several years’ experience of tuning SSAS implementations I find that in most cases IO is not one of the most important factors in determining query performance – it’s more often the case that the problem lies in the formula engine, and even when it doesn’t, an appropriate partitioning strategy and aggregation design can work wonders.

However, let me be clear: I am not making some general statement that SSDs are not useful for SSAS. What I am doing is giving the usual ‘your mileage may vary’ and ‘it depends’ answer. There are definitely going to be SSAS implementations where SSDs will make a massive difference to query performance, and where no other tuning technique can have the same impact. For example, I’ve heard that with distinct count measures (especially when there are a large number of concurrent users) SSDs can lead to very impressive improvements in query performance; I would assume that where there are large many-to-many relationships you would also benefit accordingly. What you need to do before you spend money on SSDs is to understand the causes of any performance issues you currently have and make sure you tune your cube to the best of your abilities.

Has anyone else had any experience with SSAS and SSDs they’d like to share?

UPDATE: the SQLCat team have published this technical note on the subject: http://sqlcat.com/technicalnotes/archive/2010/09/20/analysis-services-distinct-count-optimization-using-solid-state-devices.aspx