‘Events in Progress’ Part 3 – Average Age of Unpaid Invoices

Sorry for the relative quiet on here recently, I’ve been caught up with helping to organise SQLBits and other stuff… Anyway, this week I’m working with a company who asked me to tackle what must be a very common business problem: they have a fact table where one row represents an invoice they have generated, and they need to know, at any point in time, the average age of all their unpaid invoices, and to monitor whether this value goes up or down. The point, of course, is that if the average age of outstanding invoices rises over time then customers are getting slower to pay and the company may potentially have a cashflow problem.
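
Before we get to the SSAS solution, it is worth pinning down exactly what needs to be computed. Here is a minimal T-SQL sketch of the calculation as of a single date; the Invoice table and its InvoiceDate and PaidDate columns are hypothetical names used for illustration only, not part of the Adventure Works schema used below:

-- Average age, in days, of all invoices still unpaid as of a given date
-- (Invoice, InvoiceDate and PaidDate are illustrative names only)
DECLARE @AsOfDate datetime = '20040630';

SELECT AVG(DATEDIFF(day, InvoiceDate, @AsOfDate) * 1.0) AS AvgAgeOfUnpaidInvoices
FROM dbo.Invoice
WHERE InvoiceDate <= @AsOfDate
  AND (PaidDate IS NULL OR PaidDate > @AsOfDate);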

The solution is a variation on the ‘events in progress’ model that I blogged about recently but with an interesting twist that I thought was worth describing. First of all, we need to prepare some data so we can reproduce the scenario in Adventure Works – let’s pretend that the FactInternetSales table actually contains invoice data. We’ll need to massage the data a bit for two reasons:

  • In Adventure Works, if you look at the OrderDate, ShipDate and DueDate columns there is always exactly the same number of days between the dates on every row, which doesn’t make for a great demo
  • In the real world some invoices will be unpaid when the fact table is loaded, others will be paid. We need to be able to handle both scenarios properly.

Once again, we need two fact tables. The first of these will contain all of our invoices, and for this I used the following named query in the Adventure Works DSV:

SELECT CustomerKey, OrderDateKey
FROM dbo.FactInternetSales

The second will be a fact table that contains information on invoices that have been paid:

SELECT
    CustomerKey,
    OrderDateKey,
    CASE WHEN SalesTerritoryKey > 5 THEN ShipDateKey ELSE DueDateKey END AS PaidDateKey,
    CASE WHEN SalesTerritoryKey > 5 THEN 7 ELSE 12 END AS DaysToPayment
FROM dbo.FactInternetSales
WHERE PromotionKey < 3

A few things to notice in this second query:

  • I’ve filtered the fact table to remove the small number of rows where PromotionKey is 3 or more – these excluded rows will be our unpaid invoices, which makes the numbers more interesting (see the check query after this list).
  • I’ve created a new date key called PaidDateKey to make the invoices have different payment lengths.
  • Because these are paid invoices, we know how long they took to be paid and we can precalculate this and put it in the fact table in the DaysToPayment column.
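
If you want to sanity-check how many unpaid invoices this setup produces, a quick query like the following against the standard Adventure Works DW schema counts the rows that will appear in the All Invoices measure group but never in the Paid Invoices measure group:

-- Rows excluded from the Paid Invoices named query: our simulated unpaid invoices
SELECT COUNT(*) AS UnpaidInvoiceCount
FROM dbo.FactInternetSales
WHERE PromotionKey >= 3;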

We can now build the cube. It needs two measure groups, one based on each fact table, and we can add the Customer dimension too. Three measures need to be built: one count measure on each measure group and a sum measure on the DaysToPayment column. The Date dimension joins to the OrderDateKey on the All Invoices measure group, and to the PaidDateKey on the Paid Invoices measure group.

Now for the MDX. The first thing we need to do is find the total number of invoices generated to date and the total number of invoices that have been paid to date (and also to make sure that these totals aggregate up properly through the time dimension); we can then subtract the second from the first to find the total number of unpaid invoices:

CREATE MEMBER CURRENTCUBE.MEASURES.[Total Invoices To Date] AS NULL;

SCOPE(MEASURES.[Total Invoices To Date]);

    SCOPE([Date].[Date].[Date].MEMBERS);
        THIS = SUM(NULL:[Date].[Date].CURRENTMEMBER, [Measures].[All Invoices Count]);
    END SCOPE;

    SCOPE([Date].[Date].[All Periods]);
        THIS = (TAIL(EXISTING [Date].[Date].[Date].MEMBERS, 1).ITEM(0).ITEM(0)
                , MEASURES.[Total Invoices To Date]);
    END SCOPE;

END SCOPE;


CREATE MEMBER CURRENTCUBE.MEASURES.[Total Paid Invoices To Date] AS NULL;

SCOPE(MEASURES.[Total Paid Invoices To Date]);

    SCOPE([Date].[Date].[Date].MEMBERS);
        THIS = SUM(NULL:[Date].[Date].CURRENTMEMBER, [Measures].[Paid Invoices Count]);
    END SCOPE;

    SCOPE([Date].[Date].[All Periods]);
        THIS = (TAIL(EXISTING [Date].[Date].[Date].MEMBERS, 1).ITEM(0).ITEM(0)
                , MEASURES.[Total Paid Invoices To Date]);
    END SCOPE;

END SCOPE;

CREATE MEMBER CURRENTCUBE.MEASURES.[Unpaid Invoices Count] AS
MEASURES.[Total Invoices To Date] - MEASURES.[Total Paid Invoices To Date];

This is the easy bit. To work out the average number of days the unpaid invoices have actually been unpaid, we need to find the number of days that each unpaid invoice has been open, sum those values, and divide by the total number of unpaid invoices. To find the sum of the number of days that all unpaid invoices have been open, we need to do the following (a small worked example follows the list):

  • Find the running total, up to yesterday, of the [Unpaid Invoices Count] measure. Since every invoice adds one to this count for each day it remains open, this gives the total number of days that every invoice has ever been open.
  • Then sum the total to date of the Days to Payment measure we created on the Paid Invoices fact table, to find the total number of days that all paid invoices were ever open
  • Then subtract this second figure from the first, which will give us the sum of the number of days all currently unpaid invoices have been open
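
To see why this works, here is a deliberately tiny example with made-up numbers. Suppose invoice A is raised on day 1 and paid on day 4 (so its DaysToPayment is 3, covering days 1 to 3), while invoice B is raised on day 2 and is still unpaid. If today is day 6, the daily values of [Unpaid Invoices Count] for days 1 to 5 are 1, 2, 2, 1 and 1, which sum to 7 invoice-days; subtracting A’s DaysToPayment of 3 leaves 4, which is exactly the number of days B has been open (days 2 to 5). Dividing by the one unpaid invoice gives an average age of 4 days.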

Here’s the code:

CREATE MEMBER CURRENTCUBE.MEASURES.[Sum of Ages of Invoices To Date] AS NULL;

SCOPE(MEASURES.[Sum of Ages of Invoices To Date]);

    SCOPE([Date].[Date].[Date].MEMBERS);
        THIS = SUM(NULL:[Date].[Date].CURRENTMEMBER.PREVMEMBER
                , MEASURES.[Unpaid Invoices Count]);
    END SCOPE;

    SCOPE([Date].[Date].[All Periods]);
        THIS = (TAIL(EXISTING [Date].[Date].[Date].MEMBERS, 1).ITEM(0).ITEM(0)
                , MEASURES.[Sum of Ages of Invoices To Date]);
    END SCOPE;

END SCOPE;

CREATE MEMBER CURRENTCUBE.MEASURES.[Sum of Ages of Paid Invoices To Date] AS NULL;

SCOPE(MEASURES.[Sum of Ages of Paid Invoices To Date]);

    SCOPE([Date].[Date].[Date].MEMBERS);
        THIS = SUM(NULL:[Date].[Date].CURRENTMEMBER, [Measures].[Days To Payment]);
    END SCOPE;

    SCOPE([Date].[Date].[All Periods]);
        THIS = (TAIL(EXISTING [Date].[Date].[Date].MEMBERS, 1).ITEM(0).ITEM(0)
                , MEASURES.[Sum of Ages of Paid Invoices To Date]);
    END SCOPE;

END SCOPE;

CREATE MEMBER CURRENTCUBE.MEASURES.[Difference of Sum of Ages] AS
MEASURES.[Sum of Ages of Invoices To Date] - MEASURES.[Sum of Ages of Paid Invoices To Date];

Finally, we can calculate the average like so:

CREATE MEMBER CURRENTCUBE.MEASURES.[Average Age of Unpaid Invoices] AS
IIF(MEASURES.[Unpaid Invoices Count]=0, NULL,
MEASURES.[Difference of Sum of Ages]/MEASURES.[Unpaid Invoices Count])
, FORMAT_STRING='#,#.00';

It’s pretty complex I’ll admit, but it does perform well. Here’s what the results look like in Excel:

[image]

I only wish I could come up with a way of getting some of my customers to pay this quickly…!

Solving the ‘Events in Progress’ problem in MDX, Part 2 – Role Playing Measure Groups

In my last post I described a simple solution to the ‘events in progress’ problem in SSAS and MDX, which nonetheless had one drawback: it involved loading all the data from the same fact table into two different measure groups in the same cube, which of course potentially doubles the amount of time taken to process the cube and its size on disk. I used the same technique in my recent post on improving the performance of currency conversion, and indeed it’s a technique that I have used successfully with several customers in the past; but it seems rather inelegant, so is there a way we can avoid it and have only one measure group? It doesn’t seem possible at first glance, and I actually opened a Connect item a while ago asking for this functionality (‘role playing measure groups’) to be implemented. I was having a good week last week, though, and at the same time as solving the ‘events in progress’ problem I also came up with a solution to this one…

Here’s what I did, using the ‘events in progress’ problem as an illustration:

First of all, I created and processed a simple cube called ‘EventsInProgressSource’ with a single measure group, a single Count measure, and two dimensions, Order Date and Ship Date, which have regular relationships joining on the OrderDateKey and ShipDateKey columns of the fact table:

[image]

[image]

I then created a second cube, ‘EventsInProgress2’, using the ‘Create Empty Cube’ option in the Cube Wizard.

[image]

I then started the New Linked Object wizard and copied everything from the ‘EventsInProgressSource’ cube to the EventsInProgress2 cube:

[image]

This resulted in a cube with one linked measure group and two linked dimensions:

[image]

Now here’s the fun bit. I then deployed and processed the cube, closed BIDS and went to SQL Server Management Studio. There I scripted the Internet Sales Facts linked measure group in ‘EventsInProgress2’ to an XMLA Create statement, manually updated the XMLA by adding a 1 to the end of the Name and ID of both the measure group itself and the Line Item Count measure, and executed the script against the ‘EventsInProgress2’ cube. This created a second, identical linked measure group – something that, again, BIDS doesn’t let you do. I then reopened BIDS and connected directly to the cube in online mode (I could also have reimported the project back into BIDS), went to the Dimension Usage tab, and deleted the relationship between Ship Date and the first linked measure group and between Order Date and the second linked measure group, leaving the relationships like this:

[image]

I then added another Date dimension and set up referenced relationships (which had to be non-materialised) with each measure group via the Date attributes of the Ship Date and Order Date dimensions:

[image]

With this done we have achieved our goal: we have the same fact table appearing twice in the same cube as two different measure groups with different dimensionality, but we are only processing the data once. The last step to solve the ‘events in progress’ problem is to add what is essentially the same MDX as last time to the cube:

CREATE MEMBER CURRENTCUBE.MEASURES.OrderedToDate AS
    SUM(NULL: [Date].[Calendar].CURRENTMEMBER,
        ([Measures].[Line Item Count]));

CREATE MEMBER CURRENTCUBE.MEASURES.ShippedToPrevDate AS
    SUM(NULL: [Date].[Calendar].CURRENTMEMBER.PREVMEMBER,
        ([Measures].[Line Item Count1]));

CREATE MEMBER CURRENTCUBE.MEASURES.OrderedNotShipped AS
MEASURES.OrderedToDate - MEASURES.ShippedToPrevDate;

And we get the results we want out of the cube:

[image]

Now I haven’t tested this with anything other than the Adventure Works data, and there are some areas where I’d like to do more testing:

  • Non-materialised reference relationships don’t always perform that well. Materialised reference relationships aren’t allowed with linked measure groups though, so possibly using a many-to-many relationship to join the Date and Order Date/Ship Date dimensions might be an alternative.
  • There are various ways of calculating a total-to-date in MDX and I’ve not spent any time working out if the version I’ve shown above is the most efficient.
  • There is going to be an overhead from querying a linked measure group rather than a regular measure group, and also probably an overhead from trying to query the same source measure group twice rather than two different measure groups, but I don’t know how significant it is.

If anyone out there does decide to try this at scale with their own data, please leave a comment and tell me about your experiences.

Solving the ‘Events in Progress’ problem in MDX, Part 1

I wouldn’t admit to this in public (at least to an audience of normal people, rather than the Microsoft BI geeks reading here), but over the last few years I’ve been obsessed with solving the ‘events in progress’ problem in SSAS and MDX. I’ve tackled it successfully in PowerPivot and DAX (see here and here) but I always thought there was no good solution in SSAS, and certainly other people such as Richard Tkachuk who have written about the same issue have come to a similar conclusion. But… last week I found one! So in this blog post I’ll outline the basic approach, and in my next post I’ll detail an even better, if more complex to implement, solution.

The two blog posts linked to above describe the problem in detail, but here’s a quick overview. In the Internet Sales fact table in Adventure Works each row represents a line item on an invoice, and each line item has an Order Date (ie the date the order was received) and a Ship Date (ie the date the order was shipped to the customer). We would like to know, on any given date, how many invoice line items are part of orders that have been received but not yet shipped and so have an Order Date before that date and a Ship Date after that date.
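
In SQL terms this is a trivial count for any single date. The following T-SQL sketch (the @GivenDate value is arbitrary) uses the OrderDate and ShipDate datetime columns on the fact table:

-- Line items ordered on or before the given date but not yet shipped on that date
DECLARE @GivenDate datetime = '20030630';

SELECT COUNT(*) AS OrderedNotShipped
FROM dbo.FactInternetSales
WHERE OrderDate <= @GivenDate
  AND ShipDate >= @GivenDate;

The challenge is getting SSAS to return this value for every date at once, quickly.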

The key to being able to solve this problem in SSAS and MDX is how you model the data – my previous difficulties were down to the fact that I was fixated on the obvious way of modelling it, with one measure group having regular relationships to both an Order Date and a Ship Date dimension. If you take a completely different approach, however, the problem becomes easy, and here’s my worked solution using the Adventure Works data:

First of all I created my cube with one Date dimension and two measure groups, both based on the Internet Sales fact table:

[image]

[image]

Since BIDS won’t let you build two measure groups from the same table in the DSV, I created a named query that duplicated the Internet Sales fact table and used that as the basis of the Ship Dates measure group. Both the Order Dates Count and Ship Dates Count measures have their AggregateFunction property set to Count, and the Date dimension joins to Order Dates on OrderDateKey and to Ship Dates on ShipDateKey.
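
The named query itself is trivial; a sketch would be something like this (you could equally list the columns explicitly):

SELECT *
FROM dbo.FactInternetSales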

I then created the following calculated measures:

CREATE MEMBER CURRENTCUBE.MEASURES.OrderedToDate AS
    SUM(NULL: [Date].[Calendar].CURRENTMEMBER,
        ([Measures].[Order Dates Count]));

CREATE MEMBER CURRENTCUBE.MEASURES.ShippedToPrevDate AS
    SUM(NULL: [Date].[Calendar].CURRENTMEMBER.PREVMEMBER,
        ([Measures].[Ship Dates Count]));

CREATE MEMBER CURRENTCUBE.MEASURES.OrderedNotShipped AS
MEASURES.OrderedToDate - MEASURES.ShippedToPrevDate;

The first, OrderedToDate, gives the total number of line items ordered from the beginning of time up to the current date. The second, ShippedToPrevDate, gives the number of line items shipped from the beginning of time up to the date before the current date. If we subtract the second value from the first, as we do in the OrderedNotShipped measure, we can therefore find the number of line items that have been ordered but not yet shipped on any given date:

[image]

[image]

And it’s pretty fast, too – certainly much faster than any other solution I’ve tried before in SSAS. There’s one obvious catch here though, and that is that you have to load the data from your fact table into your cube twice. What if you have a really large fact table, and doubling your processing time and storage in this way is not an option? Well, stay tuned for part two when I’ll show you a trick to avoid doing this and only use one measure group.

Solving the ‘Events in Progress’ Problem in DAX, Part Two: Aggregating Dates and Multiselect

You may recall that by the end of last week’s post on solving the ‘events in progress’ problem in DAX, I’d managed to get the right values out of PowerPivot at the Date level with good performance. BUT when, for example, I selected a Calendar Year I got no values out, and I hadn’t even thought of how to handle situations such as when Date was on filter and the user had selected three different, non-consecutive dates. Well, that was last week when I was a DAX newbie – this week, well, I’m slightly less of a newbie and I’ve got the solution.

First, let’s restate the problem a bit more clearly. What I want is this: when I have a date on columns or rows, I want my calculated measure to display the number of rows in the FactInternetSales table that have an OrderDate less than or equal to that date, and have a ShipDate that is greater than or equal to that date. This way I see the number of rows in the table where something has been ordered but not yet shipped. When I select a year, I want to see the number of rows where that OrderDate-ShipDate range intersects with the dates in that year. So if I selected the year 2002, I want all the rows where either the OrderDate is in 2002, or the ShipDate is in 2002, or when the OrderDate is before 2002 and the ShipDate is after 2002. Finally if I make a multiselect when Date is on a filter, for example if I multiselect the 1st January 2002, the 5th of January 2002 and the 10th January 2002, I want to count the number of rows where the OrderDate-ShipDate range includes any of those three dates.
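
The intersection logic is perhaps easiest to state in SQL terms first. The following T-SQL sketch (illustrative only, since the whole point here is to do this in DAX) counts the fact rows whose OrderDateKey-to-ShipDateKey range includes at least one of the selected dates, in this case all the dates in the year 2002:

-- Count rows whose order-to-ship date range includes at least one selected date
SELECT COUNT(*) AS RowsInProgress
FROM dbo.FactInternetSales AS f
WHERE EXISTS
(
    SELECT 1
    FROM dbo.DimDate AS d
    WHERE d.DateKey BETWEEN f.OrderDateKey AND f.ShipDateKey
      AND d.CalendarYear = 2002
);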

I spent several hours thinking how I could do this in DAX using just the columns I had and I came up with the following formula, which gave me the results I was expecting… when it ever returned, because it was horrendously slow:

=CALCULATE(
    COUNTROWS('FactInternetSales'),
    FILTER(
        'FactInternetSales',
        COUNTROWS(
            FILTER(
                VALUES('DimDate'[DateKey]),
                ('FactInternetSales'[OrderDateKey] <= 'DimDate'[DateKey])
                && ('FactInternetSales'[ShipDateKey] >= 'DimDate'[DateKey])
            )
        ) > 0
    )
)

It’s using the Calculate function to evaluate the expression CountRows(FactInternetSales); what it’s trying to do in the outermost filter is to filter all of FactInternetSales, and find all the rows where there is at least one Date in the current selection that is between the OrderDateKey and the ShipDateKey. But, as we learned last week, doing a filter on all the columns in a table is very inefficient, and what we actually want to do is to filter on just the columns we’re interested in: OrderDateKey and ShipDateKey. However, what I found was that I really needed all of the distinct combinations of OrderDateKey and ShipDateKey, but I could only get either all of the columns in a table using the table name, or the distinct set of values in a single column using the Values() function. Therefore I couldn’t actually filter on the distinct combinations of OrderDateKey and ShipDateKey where OrderDateKey <= DateKey <= ShipDateKey. This seems like a limitation of DAX from what I can see, though it might turn out that there is a way of doing it – if there is I will of course blog about it.

Anyway, the solution was to create a new column that did contain the distinct combinations of OrderDateKey and ShipDateKey. I went into the PowerPivot designer and created a new column called OrderShip on FactInternetSales with the following expression:

=CONCATENATE('FactInternetSales'[OrderDateKey], 'FactInternetSales'[ShipDateKey])

As you can see, it simply concatenates the OrderDateKey and ShipDateKey values into a 16-character string; for example, an OrderDateKey of 20020101 and a ShipDateKey of 20020108 become the string 2002010120020108:

[image]

I was then able to use that column in a filter as follows, extracting the OrderDateKey and the ShipDateKey from the string using the Left and Right functions, and turning them back into numbers using the Value function:

=CALCULATE(
    COUNTROWS(FactInternetSales),
    FILTER(
        VALUES('FactInternetSales'[OrderShip]),
        COUNTROWS(
            FILTER(
                VALUES('DimDate'[DateKey]),
                ('DimDate'[DateKey] >= VALUE(LEFT('FactInternetSales'[OrderShip], 8)))
                && ('DimDate'[DateKey] <= VALUE(RIGHT('FactInternetSales'[OrderShip], 8)))
            )
        ) > 0
    )
)

The logic is the same as with the previous expression, with the all-important exception that the outer Filter only works on the distinct values of OrderShip. It works and it’s fast, so I’m relatively happy although I would like to find a more elegant solution.

It not only gives the correct values for individual dates:

[image]

but also for Years (and other aggregations of Dates – also notice how the sum of the values for the individual years is greater than the Grand Total, because of course an order can be in progress in two different years and so needs to be counted in each):

[image]

and also for multiselect (where again the value is not the sum of the values of the three individual dates):

[image]

[image]

I think this illustrates one area where DAX is superior to MDX. MDX is notoriously bad at handling multiselect but, as Marius explained to me, in DAX multiselect support is built-in: the Values() function is rather like a CurrentSet function, returning the set of all selected (‘existing’ in MDX terms) values in a column in a particular context.

So, another problem solved… time to think up the next challenge!

Solving the ‘Events in Progress’ Problem in DAX

One problem I’ve spent a lot of time thinking about over the last couple of years is what I call the ‘events in progress’ problem. The problem is this: if you have a fact table where each row represents an event, and there are two date columns representing the start and end dates of the event, on any given date how many events were actually in progress (ie how many rows are there where your given date is between the start and the end date)? It’s trivial to solve in SQL, but in SSAS and MDX there’s no truly satisfying solution. Richard Tkachuk wrote up the best solution for SSAS here, but since it still involves a fair amount of hacking of the data I’ve never felt it was particularly elegant; I’ve always felt like there should be a pure MDX solution, but all of the approaches I’ve tried have been unusably slow. As a result, I was curious to see whether there would be an elegant way of solving this problem in PowerPivot and DAX…

It’s quite easy to use the AdventureWorks DW database to illustrate the problem. The following SQL query returns the number of orders that had been placed but had not yet shipped on any given date:

SELECT DimDate.DateKey,
    COUNT(*) AS OrdersInProgress
FROM DimDate, FactInternetSales
WHERE DimDate.DateKey >= FactInternetSales.OrderDateKey
  AND DimDate.DateKey <= FactInternetSales.ShipDateKey
GROUP BY DimDate.DateKey
ORDER BY DimDate.DateKey ASC

[image]

The first thing to do in PowerPivot is, obviously, to import the DimDate and FactInternetSales tables. However the next step is less than obvious: you must not create a relationship between the two tables. Even though PowerPivot suggests you should, you won’t get this solution to work if you do – and as a result, in my opinion, it would be great if this message could be turned off (vote here if you agree).

[image]

Next we need to create a new calculated measure in the FactInternetSales table, and we need to come up with a DAX expression for it. Even though the logic is fairly simple, I struggled at first because the Calculate statement doesn’t allow you to use the Min or Max functions directly in one of its filter parameters. Then, though, I realised that the same limitation doesn’t apply to the Filter function and I came up with the following, which worked:

=COUNTROWS(
    FILTER(
        FILTER(
            'FactInternetSales',
            'FactInternetSales'[OrderDateKey] <= MIN('DimDate'[DateKey])
        ),
        'FactInternetSales'[ShipDateKey] >= MAX('DimDate'[DateKey])
    )
)

Unfortunately, although it was faster than any of my MDX approaches in SSAS, it was still not really fast enough. So I went back to my friends at Microsoft, and Marius Dumitru (again) came up with the following which is really fast:

=Calculate(
CountRows(FactInternetSales),
Filter(Values(FactInternetSales[OrderDateKey]), Min(DimDate[DateKey])>=[OrderDateKey]),
Filter(Values(FactInternetSales[ShipDateKey]), Max(DimDate[DateKey])<=[ShipDateKey]))

[image]

The next problem is that this only works when something is explicitly selected from DateKey. When you put Calendar Year on rows or columns though, for example, it returns nothing. I’m currently working on getting this fixed, but that will be another blog post!

Anyway, a couple of questions need to be asked at this point. First of all, how and why does this work? Well, Marco has just posted a great explanation of the Calculate statement here that I recommend you read and reread; it covers the theory behind this expression and a lot more.

Secondly, why is this second expression so much faster than the first? Here’s what Marius told me: in the first expression the filtering is done on each individual row in the fact table. In the second expression the filters act on the distinct values from both OrderDateKey and ShipDateKey, and since the number of distinct values is smaller than the overall number of rows this is evaluated very quickly; the results of these two filters can then be used in a further, very efficient request to the PowerPivot/Vertipaq engine. In Marius’ own words, the takeaway is that “it’s always a best practice to avoid filtering an entire table if the filter condition involves only some of the table columns (i.e. when no key/unique columns are used in the condition expression). The way to go is to only filter the columns involved in the condition (using Filter over All, Values or Distinct applied to the desired columns), and use those filter results as parameters to Calculate to carry out the rest of the computation”. Note that this isn’t a matter of block computation vs cell-by-cell because all calculations in DAX are evaluated in block mode.

So there you go, the first DAX optimisation tip on this blog!

Join Conditions in Power Query, Part 2: Events-In-Progress, Performance and Query Folding

In my last post you saw how to join two tables together using conditions other than the built-in inner, outer and anti join conditions. However, as I mentioned, you need to be aware of some of the performance implications of using this technique – and that is the subject I’ll be looking at in this post.

Let’s take the events-in-progress problem, one that I have blogged about many times in the past (see here, here, here and here for example) and see how we can solve it in Power Query. It’s a very common problem that you encounter when you have a fact table where each row represents an event, there are columns containing the start and end dates of each event, and you want to count the number of events that were in progress on any given date.

Using the Adventure Works DW database, my normal way of illustrating the problem is this: taking the DimDate and FactInternetSales tables, for each date show the number of purchases that had been ordered but not shipped on that date. One way of solving this problem (but not the only way, as my previous blog posts show) is to do a cross join between the DimDate and FactInternetSales tables, then filter the result so that you get the rows where the value in the DateKey column from the DimDate table is between the values in the OrderDateKey and the ShipDateKey columns from the FactInternetSales table.
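
Conceptually, then, what we want Power Query to produce is the equivalent of the following T-SQL sketch; as you’ll see later in the post, something very like it is what Power Query itself ends up generating when query folding works:

SELECT d.DateKey, COUNT(*) AS [Count]
FROM dbo.DimDate AS d
CROSS JOIN dbo.FactInternetSales AS f
WHERE d.DateKey >= f.OrderDateKey
  AND d.DateKey <= f.ShipDateKey
GROUP BY d.DateKey
ORDER BY d.DateKey;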

As a first attempt you might come up with a query like this one, which uses a custom column containing a table value that is subsequently expanded to do a cross join (the technique shown in my previous post):

let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "Adventure Works DW"),
    //Get data from the DimDate table
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    //Remove all columns except the DateKey column
    RemovedOtherColumns = Table.SelectColumns(dbo_DimDate,{"DateKey"}),
    //Insert a custom column that contains the whole of FactInternetSales as a table in each row
    InsertedCustom = Table.AddColumn(RemovedOtherColumns, "FactInternetSalesTable",
                        each Source{[Schema="dbo",Item="FactInternetSales"]}[Data]),
    //Expand this new column to show the OrderDateKey and ShipDateKey columns
    #"Expand FactInternetSalesTable" = Table.ExpandTableColumn(InsertedCustom,
                        "FactInternetSalesTable",
                        {"OrderDateKey", "ShipDateKey"},
                        {"FactInternetSalesTable.OrderDateKey",
                        "FactInternetSalesTable.ShipDateKey"}),
    //Filter where DateKey is greater than or equal to OrderDateKey and
    //DateKey is less than or equal to ShipDateKey
    FilteredRows = Table.SelectRows(#"Expand FactInternetSalesTable",
                        each [DateKey] >= [FactInternetSalesTable.OrderDateKey] and
                        [DateKey] <= [FactInternetSalesTable.ShipDateKey]),
    //Find the count of the number of rows grouped by DateKey
    GroupedRows = Table.Group(FilteredRows, {"DateKey"},
                        {{"Count", each Table.RowCount(_), type number}})
in
    GroupedRows

There is, however, a big problem with this query: on my laptop it runs and runs forever – well, maybe not forever but I cancelled it after several minutes. Some tuning is necessary.

I don’t think anyone outside the Power Query dev team has much experience of performance tuning Power Query yet. However there is one golden rule that I do know: allow Power Query to push as much of the work as possible back to the data source. This behaviour is known as “query folding” and it’s something that I’ve blogged about, as have Matt Masson and Darren Gosbell. Looking in SQL Server Profiler for the query above, it is clear that no query folding is taking place: the only activity visible is Power Query reading the data from the DimDate and FactInternetSales tables separately.

After a bit of trial and error I came up with the following alternative:

let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "adventure works dw"),
    //Get data from the DimDate table
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    //Remove all columns except DateKey
    RemovedOtherColumns = Table.SelectColumns(dbo_DimDate,{"DateKey"}),
    //Add a custom column to DimDate containing the value 1
    InsertedCustom = Table.AddColumn(RemovedOtherColumns, "Dummy", each 1),
    //Get data from the FactInternetSales table
    dbo_FactInternetSales = Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
    //Remove all columns except OrderDateKey and ShipDateKey
    RemovedOtherColumns1 = Table.SelectColumns(dbo_FactInternetSales,
                        {"OrderDateKey", "ShipDateKey"}),
    //Add a custom column to FactInternetSales containing the value 1
    InsertedCustom1 = Table.AddColumn(RemovedOtherColumns1, "Dummy", each 1),
    //Join DimDate and FactInternetSales on the two columns that contain 1
    Custom1 = Table.Join(InsertedCustom1, "Dummy", InsertedCustom, "Dummy"),
    //Filter rows where DateKey is between OrderDateKey and ShipDateKey
    FilteredRows = Table.SelectRows(Custom1,
                    each [DateKey] >= [OrderDateKey] and
                    [DateKey] <= [ShipDateKey]),
    //Group by DateKey and find the number of rows for each date
    GroupedRows = Table.Group(FilteredRows, {"DateKey"},
                    {{"Count", each Table.RowCount(_), type number}}),
    //Sort dates in ascending order
    SortedRows = Table.Sort(GroupedRows,{{"DateKey", Order.Ascending}})
in
    SortedRows

This returns the correct result more or less instantly:

[image]

The main difference between this query and the previous one is how I’m getting a cross join between the two tables. This time I’m creating custom columns on DimDate and FactInternetSales that both contain the value 1, and then doing an inner join between these two tables on the new columns – which of course results in the equivalent of a cross join.

In Profiler I can see the following SQL query being generated by Power Query:

select [_].[DateKey],
    [_].[Count]
from
(
    select [rows].[DateKey] as [DateKey],
        count(1) as [Count]
    from
    (
        select [_].[OrderDateKey],
            [_].[ShipDateKey],
            [_].[Dummy],
            [_].[DateKey]
        from
        (
            select [$Outer].[OrderDateKey],
                [$Outer].[ShipDateKey],
                [$Inner].[Dummy],
                [$Inner].[DateKey]
            from
            (
                select [_].[OrderDateKey] as [OrderDateKey],
                    [_].[ShipDateKey] as [ShipDateKey],
                    1 as [Dummy]
                from
                (
                    select [OrderDateKey],
                        [ShipDateKey]
                    from [dbo].[FactInternetSales] as [$Table]
                ) as [_]
            ) as [$Outer]
            inner join
            (
                select [_].[DateKey] as [DateKey],
                    1 as [Dummy]
                from
                (
                    select [DateKey]
                    from [dbo].[DimDate] as [$Table]
                ) as [_]
            ) as [$Inner] on ([$Outer].[Dummy] = [$Inner].[Dummy])
        ) as [_]
        where [_].[DateKey] >= [_].[OrderDateKey] and [_].[DateKey] <= [_].[ShipDateKey]
    ) as [rows]
    group by [DateKey]
) as [_]
order by [_].[DateKey]

Query folding is definitely taking place now!

The last question to ask here is whether the first query was slow because query folding was not taking place, or slow because of the way the query was written. You can test this quite easily by rewriting the second query to prevent query folding taking place using the Table.Buffer() function. For example, in the second query the step to get the data from the DimDate table is:

dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],

To prevent query folding it needs to be altered to:

dbo_DimDate = Table.Buffer(Source{[Schema="dbo",Item="DimDate"]}[Data]),

(The step to get data from the FactInternetSales table needs to be altered in the same way.)

With this change made the query now executes in around a minute. So clearly the new query is more efficient when it is executed inside the Power Query engine itself, without query folding, but the Power Query engine is still nowhere near as fast as SQL Server and query folding gives the best possible performance.

You can download the sample workbook here.

A New Events-In-Progress DAX Pattern

I’ve been working on a very complex SSAS Tabular implementation recently, and as a result I’ve learned a few new DAX tricks. The one that I’m going to blog about today takes me back to my old favourite, the events-in-progress problem. I’ve blogged about it a lot of times, looking at solutions for MDX and DAX (see here and here), and for this project I had to do some performance tuning on a measure that uses a filter very much like this.

Using the Adventure Works Tabular model, the obvious way of finding the number of Orders on the Internet Sales table that are open on any given date (ie where the Date is between the dates given in the Order Date and the Ship Date column) is to write a query something like this:

EVALUATE
ADDCOLUMNS (
    VALUES ( 'Date'[Date] ),
    "OpenOrders",
    CALCULATE (
        COUNTROWS ( 'Internet Sales' ),
        FILTER ( 'Internet Sales', 'Internet Sales'[Ship Date] > 'Date'[Date] ),
        FILTER ( 'Internet Sales', 'Internet Sales'[Order Date] <= 'Date'[Date] )
    )
)
ORDER BY 'Date'[Date]

On my laptop this executes in around 1.9 seconds on a cold cache. However, after a bit of experimentation, I found the following query was substantially faster:

EVALUATE
ADDCOLUMNS (
    VALUES ( 'Date'[Date] ),
    "OpenOrders",
    COUNTROWS (
        FILTER (
            'Internet Sales',
            CONTAINS (
                DATESBETWEEN (
                    'Date'[Date],
                    'Internet Sales'[Order Date],
                    DATEADD ( 'Internet Sales'[Ship Date], -1, DAY )
                ),
                [Date],
                'Date'[Date]
            )
        )
    )
)
ORDER BY 'Date'[Date]

On a cold cache this version executes in just 0.2 seconds on my laptop. What’s different? In the first version of the calculation the FILTER() function is used to find the rows in Internet Sales where the Order Date is less than or equal to the Date on rows, and where the Ship Date is greater than the Date. This is the obvious way of solving the problem. In the new calculation the DATESBETWEEN() function is used to create a table of dates from the Order Date to the day before the Ship Date for each row on Internet Sales, and the CONTAINS() function is used to see if the Date we’re interested in appears in that table.

I’ll be honest and admit that I’m not sure why this version is so much faster, but if (as it seems) this is a generally applicable pattern then I think this is a very interesting discovery.

Thanks to Marco, Alberto and Marius for the discussion around this issue…

UPDATE: Scott Reachard has done some further testing on this technique, and found that the performance is linked to the size of the date ranges. So, the shorter your date ranges, the faster the performance; if you have large date ranges, this may not be the best performing solution. See https://twitter.com/swreachard/status/349881355900952576

UPDATE: Alberto has done a lot more research into this problem, and come up with an even faster solution. See: http://www.sqlbi.com/articles/understanding-dax-query-plans/

Solving the Events-In-Progress Problem in DAX V2.0

This is probably the 5th or 6th post I’ve written on this problem (most deal with MDX, but I did blog about solving it in DAX early last year) but what can I say – it’s an interesting problem! I came across it at work today while working with the 2012 CTP3 version of PowerPivot and found yet another solution to the problem that used some of the new DAX functionality, so I thought I’d crank out one more blog post.

The basic approach is similar to the one I describe here. Using the same Adventure Works data, I can load the DimDate and FactInternetSales tables into PowerPivot V2.0 and I’ll get the following model:

[image]

Note that we have three relationships between the two tables: one active one, which is the relationship from OrderDateKey to DateKey, and two inactive ones from DueDateKey and ShipDateKey. If we want to find the number of orders up to the current date using the Order Date we can simply use the following DAX in a measure definition:

SalesToDate:=CALCULATE(
COUNTROWS(FactInternetSales)
, DATESBETWEEN(DimDate[FullDateAlternateKey], Blank(), LASTDATE(DimDate[FullDateAlternateKey])))

Now, if we want to find the number of orders that have shipped up until yesterday we don’t need any special modelling: we can use the new USERELATIONSHIP function to force a calculation to follow the relationship going from ShipDateKey to DateKey. Therefore, if we want to find the number of orders that have been placed but not shipped, we just need to take the measure above and subtract the value returned by the same measure when it uses this different relationship, with the filter context changed to be the day before the current day:

SalesInProgress:=[SalesToDate]-
IF(ISBLANK(DATEADD(LASTDATE(DimDate[FullDateAlternateKey]), -1, DAY))
, BLANK()
,CALCULATE(
[SalesToDate]
, USERELATIONSHIP(FactInternetSales[ShipDateKey], DimDate[DateKey])
, DATEADD(LASTDATE(DimDate[FullDateAlternateKey]), -1, DAY)))

[image]

Quite an elegant solution, I think.

Analysis Services Connect digest

It’s a fairly popular thing for SQL Server bloggers and MVPs to put together lists of items on Connect (bugs that need fixing, new bits of functionality that should be in a future version, etc) and ask readers to vote on them so they get more visibility. Jamie Thomson did a post for SSIS the other day, for instance:
http://blogs.conchango.com/jamiethomson/archive/2009/07/05/ssis-connect-digest-2009-07-05.aspx

While I’ve done something similar in the past, I thought it would be a good idea to put together a more comprehensive list of Connect items for SSAS in the hope that some of the more important issues will be addressed in the next major release. So, please look down the list below and vote on anything that you think is important!

  1. MDX needs a special division operator that returns null instead of infinity when dividing by zero or null. Why do we always have to trap this in code ourselves? I have never, ever wanted to return infinity from a calculation!
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=448127
  2. BI Development Studio is sometimes ridiculously slow doing Deployment or Saving (when in Online mode). This isn’t a BIDS issue, of course, more a reflection of how long the XMLA it generates takes to run, but still it wastes so much time…
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=447405
  3. It would be cool if, after every time the server crashed, SSAS could make a copy of the current Flight Recorder trace file so we can see what was happening on the server. I know PSS can get this information from the mdmp files that get generated, but sometimes I want to do some detective work of my own before opening a support call:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=442606
  4. Attribute relationships serve a second purpose as member properties, but sometimes you want to display an attribute as a member property of another attribute when there isn’t a direct relationship between them. This leads people to creating either redundant attribute relationships or duplicate attributes in the dimension, both of which are equally bad:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=436023
  5. We have role-playing dimensions, so why not role-playing measure groups? There are a few times when I’ve wanted to reuse the same measure group with dimensions joining onto different columns (eg when working with many-to-many relationships), and not have to create and process multiple different physical measure groups:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=436081
  6. Apart from the Slowly-Changing Dimension support that we already have, it would be useful to be able to create different versions of the same dimension and allow users to choose which version they want to see. This would allow us to expose snapshots of what a dimension looked like at any given point in time:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=436064
  7. Richard Tkachuk’s approach for handling ‘events in progress’ highlights the fact that SSAS isn’t very good at dealing with what is a reasonably common BI problem. Let’s have a new ‘Range’ dimension relationship type to do this out of the box:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=428465
  8. Let calculated members in MDX return sets. This would allow us to optimise calculations where expensive set operations currently have to be duplicated:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=426719
  9. At the moment, the IgnoreUnrelatedDimensions property controls what happens for measures when you query them with dimensions that have no relationship with the measure group, and when you query them with attributes below the granularity of the measure group. Very often, though, I do not want the same behaviour for these two scenarios, so can we have two different properties please? Or even be able to set this on a per-dimension basis?
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=350232
  10. At the moment, when using connection string properties in SQLMS, after you’ve set them the properties get persisted for all future connections without being visible in the connection dialog. Either don’t persist the properties or show us which ones are being set!
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=358956
  11. No more arbitrary-shaped set errors! When writing scoped assignments, I want to be able to scope on whatever area in the cube I want; at the moment, I often find I have to repeat the same assignment several times to get around the requirement that I can’t assign to an arbitrary-shaped set.
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=339861
  12. We can’t currently deny access to an entire dimension, rather than individual members on it, yet this is another common requirement:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=233410
  13. It would be cool to be able to parameterise server-side MDX calculated members and named sets:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=188842
  14. Here’s an old one: we really, really need to be able to rename hierarchies on each instance of a role-playing dimension. Not being able to do so seriously limits the usability of role-playing dimensions.
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=144500
  15. Similarly, I’d like to be able to set individual All Member captions for all of the attributes on a dimension:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=144234
  16. One mistake I make all too often is to change something in the cube, redeploy it, process, and then at the very end of processing have it fail because of an MDX Script error. I know it’s possible to use the ScriptErrorHandling mode to do something similar, but that’s a bit dangerous to set in production. What I’d like is an option for processing in a dev environment, set by default, that allows you to ignore MDX Script errors if any occur.
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=144248
  17. Support for cross-dimensional user hierarchies would be useful for two reasons: first it would help us create (and possibly optimise) complex drillpaths; second, it would help when you have a ROLAP attribute that you want to expose only as the lowest level of a user hierarchy, to encourage people to cut down the scope of their query before they used it. Another approach to solving this second problem would be the ability to have attributes on the same dimension with different storage modes (which Greg Galloway and Harsh suggested on my older post).
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=473684
  18. Dimensions are getting bigger and bigger, and it would be useful to be able to partition them for manageability and performance reasons. For example, doing this would allow you to delete some members from the dimension quickly by deleting the partition; you could also speed up Process Updates by only processing the partitions where dimension members have changed.
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=473687
  19. I find it frustrating that dimension calculations (ie custom member formulas, unary operators etc) can’t be edited from the MDX Script. I want all my MDX calculations in one place!
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=473688
  20. It would be really cool if we could declare parameterised MDX Select statements on the server, and then allow client tools to expose them as pre-canned queries, similar to what can be done with stored procedures in the relational world. This would mean that as developers we could write complex queries that most end users would never be able to create themselves (even if their client tools were capable of building them). Similarly, it would be useful if we could declare custom MDX functions in the MDX Script: for example, you could declare a complex set operation as a new function, then use it in your own calculations and also expose it to users so that they could use it in their own client tools.
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=473694
  21. Fix parent/child hierarchies! They’re so commonly used, but there are so many ‘known issues’… I want:
    1. Better performance, perhaps through the ability to build aggregations within them
    2. To be able to scope calculations on them properly
    3. More flexibility when building them, by creating more than one per dimension and not having to build them from the key attribute
      https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=473698
  22. I’d like calculated members to do everything, and have all of the same properties, as real members. This means I want to be able to:
    1. Control the order they appear on a level, and mix them in with real members
    2. Allow them to have other calculated members as children
    3. Allow them to have member properties
    4. Allow all properties (member properties as well as properties such as Name) to be dynamically generated using an MDX expression
    5. Dynamically generate whole sets of calculated members with a single expression; this would, for example, be useful for doing Pareto analysis or dynamically generating bandings
    6. Expose the MDX definition as a property visible in MDX
    7. Secure calculated members with dimension security
      https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=473693
  23. From Jamie Thomson (from a long time ago) – automating partition generation when new data appears in the cube is a pain – this is something SSAS should be able to do for us:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=127378
  24. One from Thomas Ivarsson here – it would be good to have a flag for the BottomCount function to get it to automatically remove empty tuples from the set we’re passing in:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=472560
  25. From Darren Gosbell – we really need an official, built-in way of making calculations work properly when the current context is a set (for example when doing a multiselect results in a set in the Where clause):
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=267570
  26. Also from Darren – at present, if you rename a database in SQLMS you then can’t deploy to this database from BIDS:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=281595
  27. From Marco Russo – at the moment, the DSV still has problems with handling TinyInt columns, casting them to different types in different scenarios:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=126798
  28. From Marco Russo – fix drillthrough! It’s almost unusably slow at the moment on large cubes and it doesn’t work with semi-additive measures, let alone calculated measures. This is such an important bit of functionality it needs to work properly:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=403083
  29. Following on from this, we need better control over the SQL that Analysis Services generates full stop. I’d like more influence over how SQL is generated for supported relational data sources and also official support (and help) for creating cartridges for new relational data sources.
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=473689
  30. A tool that displayed MDX query plans would make performance tuning queries much easier:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=321161
  31. From David Clem – you can’t currently control the order in which calculated measures are displayed. It would be useful if there was a ‘display order’ property that allowed you to do this:
    https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=294907

Of course, if you’ve got your own issues open on Connect that you’d like to publicise, please leave a comment below…

SSAS Multidimensional Caching-Related Performance Problems With Power BI Tables

Last week I was doing some performance tuning for a customer using Power BI on top of a large SSAS Multidimensional cube via a Live connection. Some of their reports were performing particularly badly and I uncovered a problem with the DAX generated by Power BI for tables with a large number of rows, fields from two or more large hierarchies and totals turned on.

The problem is very easy to reproduce; I’m going to do it using a simplified version of the Adventure Works cube that contains only the Date and Product dimensions. Take a look at the following table from a Power BI report:

[image]

It has the Date attribute from the Date dimension, and the Product attribute from the Product dimension, on rows and two measures on columns. The table has a large number of rows in it (both Date and Product are fairly large hierarchies) and if you look at the DAX generated by Power BI you can see that it only requests the first 501 rows. That’s fine – the real problem is that Power BI also generates a second DAX query to get the two values displayed in the Total line at the bottom of the table. The DAX looks something like this, and is easily identifiable because it uses the Row() function:

EVALUATE
  CALCULATETABLE(
    ROW(
      "Sales_Amount", 'Internet Sales Facts'[Sales Amount],
      "Tax_Amt", 'Internet Sales Facts'[Tax Amt]
    ),
    KEEPFILTERS(
      GENERATE(
        KEEPFILTERS(VALUES('Product'[Product.Key0])),
        CALCULATETABLE(
          FILTER(
            KEEPFILTERS(VALUES('Order Date'[Date.Key0])),
            OR(
              NOT(ISBLANK('Internet Sales Facts'[Sales Amount])),
              NOT(ISBLANK('Internet Sales Facts'[Tax Amt]))
            )
          )
        )
      )
    )
  )

This query has something in it – I don’t know what – that means that it cannot make use of the Analysis Services Storage Engine cache. Every time you run it SSAS will go to disk, read the data that it needs and then aggregate it, which means you’ll get cold-cache performance all the time. On a big cube this can be a big problem. This is very similar to problems I’ve seen with MDX queries on Multidimensional and which I blogged about here; it’s the first time I’ve seen this happen with a DAX query though. I suspect a lot of people using Power BI on SSAS Multidimensional will have this problem without realising it.

This problem does not occur for all tables – as far as I can see it only happens with tables that have a large number of rows and two or more hierarchies in them. The easy way to check whether you have this problem is to refresh your report, run a Profiler trace that includes the Progress Report Begin/End and Query Subcube Verbose events (and any others you find useful), and then refresh the report again by pressing the Refresh button in Power BI Desktop without changing it at all. If, in your trace, you see any of the Progress Report events appear when that second refresh happens, as well as Query Subcube Verbose events with an Event Subclass of Non-cache data, then you know that the Storage Engine cache is not being used.

[image]

Also look at the Duration column in the trace for these events which shows the time in milliseconds that they took to execute. This will tell you how much of an impact this problem is having on your report refresh times.

The easy way to stop this happening is to turn off the totals row in the table:

[image]

Displaying the totals in a separate table also seems to avoid the problem, although of course it doesn’t look as good. Putting only one hierarchy in the table, apart from your measures, also seems to solve the problem. You could also try all the usual methods to improve SSAS performance, such as building aggregations on the cube.

The dev team is aware of this problem but it’s unlikely to be fixed in the short term.