PowerPivot Top N Reports Using Excel Cube Formulas

Top N reports are an extremely common requirement: my customers are always trying to find their top 10 products or sales people or geographies by some measure or other. Luckily this type of report is fairly easy to build in PowerPivot if you’re using a PivotTable; in fact, Rob Collie wrote a good blog post on this subject only last week which is well worth a read. The problem with PivotTables is, however, that they are a pain to format and many people prefer to use Excel cube formulas for their dashboards – and unfortunately dynamic Top N reports are surprisingly difficult to implement with cube formulas. As the discussions here and here show, even when you’re using CubeSet and CubeRankedMember you need to know MDX pretty well and even then it’s a bit messy. Here, instead, is a pure DAX solution to the problem which, while not as simple as I’d like, involves no MDX, no clever use of Excel cube functions, and works when you select more than one item in a slicer.

The first thing you need to do is to create a table with as many rows in it as you need items in your Top N report. In my example I’m going to return the top 10 products in a model built from the Adventure Works database, so here’s my table (called TopNRank):

[Image: the TopNRank table – a single column called TopNRank containing the values 1 to 10]

Here’s my model in Diagram View:

[Image: the model in Diagram View]

And here’s a screenshot of my main worksheet, for reference, with two Slicers on CalendarYear and EnglishProductCategoryName; a PivotTable with a Top 10 filter applied on EnglishProductName (to check the output and for debugging); and below it my Excel formulas, with the ten values from the TopNRank table on rows and two measures called [TopN Product Name] and [TopN Product Sales] on columns, showing the same top 10 values:

[Image: the worksheet, showing the two slicers, the PivotTable and the cube formula report below it]

Step 1 is to create a measure called [Sales] that simply sums up the values in the [SalesAmount] column:

Sales:=SUM(FactInternetSales[SalesAmount])

You can then create a measure, called [Product Rank] here (and shown in the PivotTable above), that returns the rank of each product by [Sales] for the current year and category:

Product Rank:=
IF(
  ISBLANK([Sales])
    , BLANK()
    , RANKX(ALL(DimProduct[EnglishProductName]), [Sales], [Sales], 0, Dense)
)

The basic idea for this approach is that, with the Excel cube formulas, you’re going to use the values from the TopNRank table on rows and then use a measure to return the name of the Nth-ranked product for each row. This measure needs to return the name of the product that has the same rank value as whichever value from the TopNRank table is on rows. For example, in the screenshot above, in cell D21 there is a CubeMember function that returns the value 1 from the TopNRank table; in cell D22 there is a CubeValue function that references the new measure, and this filters the list of all products to return the name of the product where [Product Rank] is 1, which is Road-150 Red, 48 (as you can see from the PivotTable).
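For illustration, here’s roughly what those two formulas look like. Take this as a sketch rather than the exact contents of my workbook: it assumes the default “PowerPivot Data” connection name that PowerPivot creates in Excel 2010, slicers named Slicer_CalendarYear and Slicer_EnglishProductCategoryName, and that the member unique name for the value 1 in the TopNRank table is [TopNRank].[TopNRank].&[1] – check all of these against your own workbook.

D21: =CUBEMEMBER("PowerPivot Data", "[TopNRank].[TopNRank].&[1]")

D22: =CUBEVALUE("PowerPivot Data", D21, "[Measures].[TopN Product Name]", Slicer_CalendarYear, Slicer_EnglishProductCategoryName)

The [TopN Product Sales] column works in exactly the same way, with a CubeValue function that references "[Measures].[TopN Product Sales]" instead.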

There’s a problem with this approach, however: the RANKX function returns tied ranks when two products have the same value for [Sales]. So, in the PivotTable in the screenshot above, there are two products with the rank 2 because they have the same value for the [Sales] measure – and this causes big problems for the approach described in the previous paragraph. Despite what BOL says, you can’t calculate a rank over more than one column, so the only way to get around this is to ensure that tied ranks can never occur. The way I’ve done this is to rank by both [Sales] and the name of the product, using the following measures:

Product Name:=
FIRSTNONBLANK(VALUES(DimProduct[EnglishProductName]), DimProduct[EnglishProductName])

Product Name Rank:=
IF(
  ISBLANK([Sales])
    , BLANK()
    , RANKX(ALL(DimProduct[EnglishProductName]), [Product Name])
)

Combined Rank:=
[Product Rank] + (1/[Product Name Rank])

Untied Product Rank:=
RANKX(ALL(DimProduct[EnglishProductName]), [Combined Rank],,1)
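To see why this works, suppose two products are tied with a [Product Rank] of 2, and that ranking their names gives them [Product Name Rank] values of 5 and 9 (these numbers are purely for illustration). Their [Combined Rank] values are then 2 + 1/5 = 2.2 and 2 + 1/9 = 2.111…, so they are no longer tied; and since 1/[Product Name Rank] is always greater than 0 and never more than 1, a product can never leapfrog a product with a better [Product Rank]. Ranking [Combined Rank] in ascending order – which is what the final 1 parameter in the RANKX call above does – then gives every product a unique [Untied Product Rank].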

With this done, at long last it’s possible to create the measure that returns the name of the Top Nth product as follows:

TopN Product Name:=
IF(
  ISFILTERED('TopNRank'[TopNRank]) && ISBLANK([Sales])=FALSE()
  , FIRSTNONBLANK(
      FILTER(
        VALUES(DimProduct[EnglishProductName])
        , [Untied Product Rank] = VALUES('TopNRank'[TopNRank]))
      , DimProduct[EnglishProductName])
  , BLANK()
)

And here’s the measure that returns the value of [Sales] for each product:

TopN Product Sales:=
IF(
  ISFILTERED('TopNRank'[TopNRank]) && ISBLANK([Sales])=FALSE()
  , CALCULATE(
      SUM(FactInternetSales[SalesAmount])
      , FILTER(
          VALUES(DimProduct[EnglishProductName])
          , DimProduct[EnglishProductName] = [TopN Product Name]))
  , BLANK()
)

I’ve been told by the customer who implemented this approach that performance on larger models, while acceptable, is a bit slow, and that it gets worse the more items you display in your Top N list. This doesn’t surprise me and, to be honest, I’ll need to do some experiments to see if I can improve performance.

You can download my sample workbook (Excel 2010 64 bit, PowerPivot V2.0) from here.

My PowerPivot Post on the Microsoft BI Blog

Just a quick note to mention that a guest post I wrote on PowerPivot and how it can be used with some other new, obscure and/or experimental BI tools in Excel is now live here on the Microsoft BI Blog:

http://blogs.msdn.com/b/microsoft_business_intelligence1/archive/2012/07/05/the-microsoft-self-service-bi-stack-it-s-more-than-just-powerpivot.aspx

This post follows the outline of a session I’ll be presenting at the PASS Summit this year; I’ve presented it a few times already to whip it into shape (including last week at the Leeds UG) and I can promise you it’s even more fun in person!

Storage Engine Cache Aggregation and its Implications for Dimension Design

SSAS Multidimensional does caching in a number of places, but the most important type of caching for most cubes happens in the Storage Engine (SE) – the part of SSAS that reads data from disk and aggregates it up to a given granularity. After the SE has answered a single request for data it stores the resulting subcube in cache so that if the same request is made in the future then it can be answered from there very quickly. In addition, in some cases the SE is also able to answer requests at a higher granularity from data that is currently in its cache, and increasing the likelihood of this happening can have a significant positive impact on the performance of large cubes.

Let’s see an example of how SE caching works using a very simple cube built on the Adventure Works database, with one Sum measure and just one dimension, a Date dimension with the following attribute relationships:

[Image: the attribute relationships on the Date dimension: Date → Month → Quarter → Year]

Running the following query on a cold cache:

select {[Measures].[Sales Amount]} on 0,
[Date].[Year].[Year].members
on 1
from [SECacheDemo]

…and then looking in Profiler shows that the SE has to go to disk to get the data it needs (as shown by the Progress Report Begin/End events):

[Image: Profiler trace showing Progress Report Begin/End events as the SE reads data from disk]

Running the same query immediately afterwards shows the SE can get the data it needs from cache:

[Image: Profiler trace showing the data being returned from the SE cache]

Running the following query, where I’m getting the All Member from the Year hierarchy, shows that the SE is also able to answer this request from cache:

select {[Measures].[Sales Amount]} on 0,
{[Date].[Year].[All]}
on 1
from [SECacheDemo]

[Image: Profiler trace showing the request answered from the SE cache]

This is as you’d expect because, of course, the All Member on the Year hierarchy represents the aggregated total of all the years returned in the first query.

There are several limits on the ability of the SE to derive aggregated totals from data it already has in cache. For a start, the SE cannot aggregate multiple cache entries to derive a single figure. So, for example, if I run the following three queries:

select {[Measures].[Sales Amount]} on 0,
{[Date].[Year].&[2001],[Date].[Year].&[2002]}
on 1
from [SECacheDemo]

select {[Measures].[Sales Amount]} on 0,
{[Date].[Year].&[2003],[Date].[Year].&[2004]}
on 1
from [SECacheDemo]

select {[Measures].[Sales Amount]} on 0,
{[Date].[Year].&[2006]}
on 1
from [SECacheDemo]

…which together return all the years, and then run the earlier query to get the All Member, that query will not be answered from the SE cache. Each of the three queries above creates a separate entry in the SE cache, and the SE cannot combine those entries; this is one of the reasons why, when cache warming, it’s better to use a few very large queries rather than lots of small, filtered queries.
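To illustrate the point, here’s a sketch of the kind of warming query I mean: a single query that crossjoins the Year and Quarter attribute hierarchies creates one large cache entry at the combined Year/Quarter granularity, and (by the aggregation logic described below) the SE can aggregate either attribute away entirely, so this one entry can answer subsequent Year-only or Quarter-only requests:

select {[Measures].[Sales Amount]} on 0,
[Date].[Year].[Year].members
*
[Date].[Quarter].[Quarter].members
on 1
from [SECacheDemo]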

Furthermore (and this is something that surprised me a little when I found out about it recently), despite the presence of attribute relationships, the SE cannot always work out how to derive higher-level values from lower-level cached data. Running the following query on a cold cache:

select {[Measures].[Sales Amount]} on 0,
[Date].[Date].[Date].members
on 1
from [SECacheDemo]

[Image: Profiler trace for the query at the Date granularity]

…populates the SE cache with data at the Date granularity, but the following query to get the values for all years:

select {[Measures].[Sales Amount]} on 0,
[Date].[Year].[Year].members
on 1
from [SECacheDemo]

…is not answered from cache, even though the year values could be derived from the date values already in cache.

[Image: Profiler trace showing the query not being answered from the SE cache]

Why is this happening? Well, the SE does not (at least at the time of writing) decode the attribute relationships when working out which granularities can be aggregated from cache. If you look at the granularities as represented in the Query Subcube Verbose events shown in the previous two screenshots, the granularity for the request at Date is
* 0 0 0
…and the granularity for the request at Year is
0 0 0 *
The four characters in this representation of the granularity stand for the four attributes on the dimension: Date, Month, Quarter and Year. The 0 character shows that a request is not at the granularity of that attribute; any other value shows that it is; and the asterisk character shows the request returns all the values at the specified granularity (this white paper gives more detail on how to interpret these values). So, without knowing anything about attribute relationships, the SE can say that the granularity
0 0 0 0
can be aggregated from
* 0 0 0
but it cannot say that
0 0 0 *
can be aggregated from
* 0 0 0

Luckily, these limitations apply only to the SE cache and not to aggregations: if I were to build an aggregation at the Date granularity, my query at the Year granularity would be able to make use of that aggregation.

Also, the use of natural user hierarchies can work around this limitation. Consider the following user hierarchy built on the dimension:

[Image: a natural user hierarchy called Calendar, with the levels Year, Quarter, Month and Date]

Querying at the Date level of this user hierarchy, like so:

select {[Measures].[Sales Amount]} on 0,
[Date].[Calendar].[Date].members
on 1
from [SECacheDemo]

[Image: Profiler trace for the query at the Date level of the Calendar hierarchy]

…results in a request at the granularity
* * * *
which can then be aggregated up to many more granularities – querying at a level in a natural user hierarchy automatically includes the granularities of all the attributes used for the levels above in the user hierarchy.

Therefore, both the following queries:

select {[Measures].[Sales Amount]} on 0,
[Date].[Calendar].[Year].members
on 1
from [SECacheDemo]

select {[Measures].[Sales Amount]} on 0,
[Date].[Year].[Year].members
on 1
from [SECacheDemo]

…can be answered from the cache created by the query at the [Date].[Calendar].[Date] level.

The conclusion must be, then, that using natural user hierarchies will allow for much better SE cache reuse than using the attribute hierarchies on their own. Also, if you have a natural user hierarchy, it might be a good idea to hide the underlying attribute hierarchies so users and other developers do not reference them in their queries and calculations. You may not notice the performance difference that better SE cache reuse gives you on most cubes, but on very large cubes or cubes that are very SE-intensive (for example, because they are reprocessed frequently) this could make a noticeable difference to your overall query performance.

Thanks to Akshai Mirchandani and Hrvoje Piasevoli for their help in understanding this.
