Chris Webb's BI Blog: Usage-Based Partitioning

I was reading Dave Wickert’s excellent white paper "Project REAL: Analysis Services Technical Drilldown" the other day (you can get it here), specifically the section on P39 about partitioning. In it he discusses the new functionality in AS2005 which automatically determines which members from your dimensions have data in a given partition, and goes on to talk about the new possibilities this opens up in terms of partitioning strategy. Here’s an excerpt:

The partitions in the Project REAL database seem to violate one of the basic best practices of SQL Server 2000. There is no data slice set for the partitions. In SQL Server 2000, partitions must have the data slice set so that the run-time engine knows which partition to access. This is similar to specifying a hint to a relational query optimizer. In SQL Server 2005, this is no longer necessary. Processing the partition now automatically builds a histogram-like structure in the MOLAP storage. This structure identifies which members from all dimensions are included in the partition. Thus, so long as the storage method is MOLAP, the data slice is an optional (and unused) property. However, the data slice is used with ROLAP storage or when proactive caching involves a ROLAP access phase. In both of these circumstances, the actual fact data is never moved so the system does not have a chance to identify a member. In this case, setting the data slice for the partition remains a necessary and critical step if you expect the system to perform well.

Because the MOLAP structures dynamically determine the data slice, a new type of partitioning technique is possible with SQL Server 2005. The best way to describe this technique is via a simple example.

Suppose a system that you are designing has a product dimension of 1,000 products. Of these, the top 5 products account for 80% of the sales (roughly evenly distributed). The remaining 995 products account for the other 20% of the sales. An analysis of the end-user query patterns show that analysis based on product is a common and effective partitioning scheme. For example, most of the reports include a breakdown by product. Based on this analysis, you create six partitions. You create one partition each for the top 5 products and then one “catchall” partition for the remainder. It is easy to create a catchall partition. In the query binding, add a WHERE clause to the SQL statement as in the following code.

In the top five partitions (1 through 5) use the following code.

SELECT * FROM <fact table>
WHERE SK_Product_ID = <SK_TopNthProduct#>

In the catchall partition use the following code.

      SELECT * FROM <fact table>
      WHERE SK_Product_ID NOT IN (<SK_TopProduct#>,
                                  <SK_2ndTopProduct#>

<SK_3rdTopProduct#>

<SK_4thTopProduct#>

<SK_5thTopProduct#>)

This technique requires a lot of administrative overhead in SQL Server 2000 Analysis Services. In SQL Server 2000, the data slice must identify each and every member in the partition—even if there are thousands and thousands of members. To implement the example, you would need to create the catchall partition data slice with 995 members in it. This is in addition to the administrative challenge of updating that list as new members are added to the dimension. In SQL Server 2005 Analysis Services, the automatic building of the data slice in the partition eliminates the administrative overhead.

This got me thinking… if we’ve got a Usage-Based Optimisation wizard for helping design the right aggregations for a cube, surely it’s possible to do something similar so that we can design partitions on the basis of the queries that users actually run? Here’s an idea on how it might work (nb this would be a strategy to use in addition to partitioning by Time, Store or other ‘obvious’ slices rather than a replacement):

First, get a log of all the queries that users are actually running. Unfortunately the Query Log in AS2005, like AS2000, doesn’t actually record the actual MDX of all the queries run. The only way to do this is to use Profiler; I was a bit worried about whether doing this would have an adverse impact on query performance but when I put the question to MS they indicated it shouldn’t be much (Richard Tkachuk also mentioned, as an aside, that turning off Flight Recorder should result in an increase of a few % in terms of query performance – a tip to remember for production boxes, I think). Once you’ve run your trace you can then export all of the MDX statements from it to a text file very easily.
You’d then need a bit of code to extract the unique names of all the members mentioned explicitly in these queries. It should be a fairly simple task if you get the right regular expression, I think. Note that this ignores queries which use any kind of set expression – my thinking was that individually named members are going to be the most interesting because they’re going to be the ones which slice the queries the most: if users are querying on all the countries in the cube that’s not going to be any use for partitioning, but if they have a particular product in the WHERE clause that is much more useful to know about.
Then you could do some data mining to cluster these members by their propensity to appear in a query together. The idea is that each of the resulting clusters would translate into a partition; those members which didn’t fall nicely into a cluster and those members that didn’t get extracted in step #2 would have their data fall into one of Dave’s ‘catch-all’ partitions. Imagine this scenario: the UK branch of the Adventure Works corporation suddenly finds there is massive demand for bikes after petrol (‘gas’, for you Americans) prices rise massively. As a result, analysts in the UK run lots of queries which are sliced by the Product Category [Bikes] and the Country [UK]. You’d hope that this pattern would emerge in the clustering and result in a single partition containing all the data for ([Bikes], [UK]), so in the future similar queries run much faster.

What does everyone think? There seems to be a lot of activity these days in the comments section of my blog, so I thought I’d invite feedback. Can anyone see a fatal flaw in this approach?

Usage-Based Partitioning

Like this:

Published by Chris Webb

Leave a ReplyCancel reply

Share this:

Like this:

Published by Chris Webb

Leave a ReplyCancel reply

Discover more from Chris Webb's BI Blog