Join Conditions in Power Query, Part 2: Events-In-Progress, Performance and Query Folding

In my last post you saw how to join two tables together using conditions other than the built-in inner, outer and anti join conditions. However, as I mentioned, you need to be aware of some of the performance implications of using this technique – and that is the subject I’ll be looking at in this post.

Let’s take the events-in-progress problem, one that I have blogged about many times in the past (see here, here, here and here for example) and see how we can solve it in Power Query. It’s a very common problem that you encounter when you have a fact table where each row represents an event, there are columns containing the start and end dates of each event, and you want to count the number of events that were in progress on any given date.

Using the Adventure Works DW database, my normal way of illustrating the problem is this: taking the DimDate and FactInternetSales tables, for each date show the number of purchases that had been ordered but not shipped on that date. One way of solving this problem (but not the only way, as my previous blog posts show) is to do a cross join between the DimDate and FactInternetSales tables, then filter the result so that you get the rows where the value in the DateKey column from the DimDate table is between the values in the OrderDateKey and the ShipDateKey columns from the FactInternetSales table.

As a first attempt you might come up with a query like this one, which uses a custom column containing a table value that is subsequently expanded to do a cross join (the technique shown in my previous post):

let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "Adventure Works DW"),
    //Get data from the DimDate table
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    //Remove all columns except the DateKey column
    RemovedOtherColumns = Table.SelectColumns(dbo_DimDate,{"DateKey"}),
    //Insert a custom column that contains the whole of FactInternetSales as a table in each row
    InsertedCustom = Table.AddColumn(RemovedOtherColumns, "FactInternetSalesTable",
                        each Source{[Schema="dbo",Item="FactInternetSales"]}[Data]),
    //Expand this new column to show the OrderDateKey and ShipDateKey columns
    #"Expand FactInternetSalesTable" = Table.ExpandTableColumn(InsertedCustom,
                        "FactInternetSalesTable",
                        {"OrderDateKey", "ShipDateKey"},
                        {"FactInternetSalesTable.OrderDateKey",
                        "FactInternetSalesTable.ShipDateKey"}),
    //Filter where DateKey is greater than or equal to OrderDateKey and
    //DateKey is less than or equal to ShipDateKey
    FilteredRows = Table.SelectRows(#"Expand FactInternetSalesTable",
                        each [DateKey] >= [FactInternetSalesTable.OrderDateKey] and
                        [DateKey] <= [FactInternetSalesTable.ShipDateKey]),
    //Find the count of the number of rows grouped by DateKey
    GroupedRows = Table.Group(FilteredRows, {"DateKey"},
                        {{"Count", each Table.RowCount(_), type number}})
in
    GroupedRows

There is, however, a big problem with this query: on my laptop it runs and runs forever – well, maybe not forever but I cancelled it after several minutes. Some tuning is necessary.

I don’t think anyone outside the Power Query dev team has much experience of performance tuning Power Query yet. However, there is one golden rule that I do know: allow Power Query to push as much of the work as possible back to the data source. This behaviour is known as “query folding” and it’s something that I’ve blogged about, as have Matt Masson and Darren Gosbell. Looking in SQL Server Profiler for the query above, it is clear that no query folding is taking place: the only activity visible is Power Query reading the data from the DimDate and FactInternetSales tables separately.
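For comparison, here’s a minimal sketch of a query that does fold, assuming the same localhost server and database used above. Simple column selection and row filtering steps like these are the kind of transformations Power Query can translate into a single SELECT statement, which you can confirm in Profiler:

let
    Source = Sql.Database("localhost", "Adventure Works DW"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    //Both of these steps can be translated to SQL, so the whole query folds
    SelectedColumns = Table.SelectColumns(dbo_DimDate, {"DateKey", "FullDateAlternateKey"}),
    FilteredRows = Table.SelectRows(SelectedColumns, each [DateKey] >= 20030101)
in
    FilteredRows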

After a bit of trial and error I came up with the following alternative:

let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "adventure works dw"),
    //Get data from the DimDate table
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    //Remove all columns except DateKey
    RemovedOtherColumns = Table.SelectColumns(dbo_DimDate,{"DateKey"}),
    //Add a custom column to DimDate containing the value 1
    InsertedCustom = Table.AddColumn(RemovedOtherColumns, "Dummy", each 1),
    //Get data from the FactInternetSales table
    dbo_FactInternetSales = Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
    //Remove all columns except OrderDateKey and ShipDateKey
    RemovedOtherColumns1 = Table.SelectColumns(dbo_FactInternetSales,
                        {"OrderDateKey", "ShipDateKey"}),
    //Add a custom column to FactInternetSales containing the value 1
    InsertedCustom1 = Table.AddColumn(RemovedOtherColumns1, "Dummy", each 1),
    //Join DimDate and FactInternetSales on the two columns that contain 1
    Custom1 = Table.Join(InsertedCustom1, "Dummy", InsertedCustom, "Dummy"),
    //Filter rows where DateKey is between OrderDateKey and ShipDateKey
    FilteredRows = Table.SelectRows(Custom1,
                    each [DateKey] >= [OrderDateKey] and
                    [DateKey] <= [ShipDateKey]),
    //Group by DateKey and find the number of rows for each date
    GroupedRows = Table.Group(FilteredRows, {"DateKey"},
                    {{"Count", each Table.RowCount(_), type number}}),
    //Sort dates in ascending order
    SortedRows = Table.Sort(GroupedRows,{{"DateKey", Order.Ascending}})
in
    SortedRows

This returns the correct result more or less instantly:

[image: the query results – a Count of events in progress for each DateKey, sorted in ascending date order]

The main difference between this query and the previous one is how I’m getting a cross join between the two tables. This time I’m creating custom columns on DimDate and FactInternetSales that both contain the value 1, and then doing an inner join between these two tables on the new columns – which of course results in the equivalent of a cross join.
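To see the trick in isolation, here’s a self-contained sketch that cross joins two small inline tables using the same dummy-column technique (the table and column names here are mine, purely for illustration):

let
    //Two small inline tables
    Colours = #table({"Colour"}, {{"Red"}, {"Blue"}}),
    Sizes = #table({"Size"}, {{"S"}, {"M"}, {"L"}}),
    //Add a column containing the value 1 to each table
    ColoursWithDummy = Table.AddColumn(Colours, "Dummy", each 1),
    SizesWithDummy = Table.AddColumn(Sizes, "Dummy", each 1),
    //An inner join on the dummy columns is equivalent to a cross join: 2 x 3 = 6 rows
    CrossJoined = Table.Join(ColoursWithDummy, "Dummy", SizesWithDummy, "Dummy"),
    //Remove the helper column from the result
    Result = Table.RemoveColumns(CrossJoined, {"Dummy"})
in
    Result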

In Profiler I can see the following SQL query being generated by Power Query:

select [_].[DateKey],
    [_].[Count]
from
(
    select [rows].[DateKey] as [DateKey],
        count(1) as [Count]
    from
    (
        select [_].[OrderDateKey],
            [_].[ShipDateKey],
            [_].[Dummy],
            [_].[DateKey]
        from
        (
            select [$Outer].[OrderDateKey],
                [$Outer].[ShipDateKey],
                [$Inner].[Dummy],
                [$Inner].[DateKey]
            from
            (
                select [_].[OrderDateKey] as [OrderDateKey],
                    [_].[ShipDateKey] as [ShipDateKey],
                    1 as [Dummy]
                from
                (
                    select [OrderDateKey],
                        [ShipDateKey]
                    from [dbo].[FactInternetSales] as [$Table]
                ) as [_]
            ) as [$Outer]
            inner join
            (
                select [_].[DateKey] as [DateKey],
                    1 as [Dummy]
                from
                (
                    select [DateKey]
                    from [dbo].[DimDate] as [$Table]
                ) as [_]
            ) as [$Inner] on ([$Outer].[Dummy] = [$Inner].[Dummy])
        ) as [_]
        where [_].[DateKey] >= [_].[OrderDateKey] and [_].[DateKey] <= [_].[ShipDateKey]
    ) as [rows]
    group by [DateKey]
) as [_]
order by [_].[DateKey]

Query folding is definitely taking place now!

The last question to ask here is whether the first query was slow because query folding was not taking place, or because of the way the query was written. You can test this quite easily by rewriting the second query to prevent query folding from taking place, using the Table.Buffer() function. For example, in the second query the step that gets the data from the DimDate table is:

dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],

To prevent query folding it needs to be altered to:

dbo_DimDate = Table.Buffer(Source{[Schema="dbo",Item="DimDate"]}[Data]),

(The step to get data from the FactInternetSales table needs to be altered in the same way.)
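Putting both changes together, the opening steps of the buffered version look like this (a sketch of just the altered steps; the rest of the query is unchanged):

    //Table.Buffer loads each table into memory, preventing any query folding
    dbo_DimDate = Table.Buffer(Source{[Schema="dbo",Item="DimDate"]}[Data]),
    ...
    dbo_FactInternetSales = Table.Buffer(Source{[Schema="dbo",Item="FactInternetSales"]}[Data]),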

With these changes made, the query now executes in around a minute. So the new query is clearly more efficient when executed inside the Power Query engine itself, without query folding – but the Power Query engine is still nowhere near as fast as SQL Server, and query folding gives the best possible performance.

You can download the sample workbook here.

12 thoughts on “Join Conditions in Power Query, Part 2: Events-In-Progress, Performance and Query Folding”

  1. Hi Chris,

    This is very good work. I’ve tried various techniques, including repeating rows in each table (Table.Repeat), but the inner join is the cleanest and fastest method. On the TechNet forum you mentioned that your solution is a bit of a hack, but I’d disagree, on the following grounds:

    1) Your query is quite fast. I’ve pasted a native SQL query into Power Query (PQ) to return the same results as your inner join (I used CROSS JOIN in the SQL query), and the performance is no better than your PQ query. This test suggests that even if cross join were supported natively in PQ, it would perform no better than the “hack” you’ve used.
    2) Packaged as a custom function, cross join can be used as if it were a native function, minimizing the need for built-in support. For example:
    Table_CrossJoin =
    (table1 as table, table2 as table, optional table2Name as nullable text) as table =>
    let
        Table1InsertedCustom = Table.AddColumn(table1, "JoinIndex", each 1),
        Table2InsertedCustom = Table.AddColumn(table2, "JoinIndex", each 1),
        Table2Name = if table2Name = null then "_" else table2Name,
        JoinedTables = Table.Join(Table1InsertedCustom, "JoinIndex",
            Table.PrefixColumns(Table2InsertedCustom, Table2Name), Table2Name & ".JoinIndex"),
        RemovedColumns = Table.RemoveColumns(JoinedTables, {"JoinIndex", Table2Name & ".JoinIndex"})
    in
        RemovedColumns

    And your query would look like:

    let
        Source = Sql.Database("localhost", "adventure works dw"),
        dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
        DimDateRetainedColumns = Table.SelectColumns(dbo_DimDate,{"DateKey"}),
        dbo_FactInternetSales = Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
        FactSalesRetainedColumns = Table.SelectColumns(dbo_FactInternetSales,
            {"OrderDateKey", "ShipDateKey"}),
        JoinedTables = Table_CrossJoin(DimDateRetainedColumns, FactSalesRetainedColumns, "FactInternetSales"),
        FilteredRows = Table.SelectRows(JoinedTables,
            each [DateKey] >= [FactInternetSales.OrderDateKey] and
            [DateKey] <= [FactInternetSales.ShipDateKey]),
        GroupedRows = Table.Group(FilteredRows, {"DateKey"},
            {{"Number of Events", each Table.RowCount(_), type number}}),
        SortedRows = Table.Sort(GroupedRows,{{"DateKey", Order.Ascending}})
    in
        SortedRows

    There are probably good reasons that a cross join option was omitted from the PQ join functions. I don’t believe that the target audience for PQ would know how to use this join properly. For example, if you have large tables and you don’t filter the result after the join, you run the risk of very long wait times and out-of-memory errors. I’m not sure there is any other built-in function that could land you in this kind of trouble. In the AdventureWorks sample, the cross join produces 132,332,018 records if left unfiltered – and AdventureWorks is only a small dataset compared to real-world data!
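    As a sanity check on that figure: it’s just the product of the two tables’ row counts (60,398 rows in FactInternetSales times 2,191 rows in DimDate), which you can verify without materialising the join. A minimal sketch, assuming the same localhost source as in the post – both Table.RowCount steps should fold to SELECT COUNT queries:

    let
        Source = Sql.Database("localhost", "adventure works dw"),
        //Count the rows in each table; the unfiltered cross join returns their product
        FactRows = Table.RowCount(Source{[Schema="dbo",Item="FactInternetSales"]}[Data]),
        DateRows = Table.RowCount(Source{[Schema="dbo",Item="DimDate"]}[Data])
    in
        //60398 * 2191 = 132332018 on this data
        FactRows * DateRows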

    1. Chris Webb says:

      I’m glad you found it useful!

  2. Hi Chris, how does joining tables in Power Query compare with MS Access performance-wise? I ramped up to 15GB of RAM usage today doing some of the former, so I am considering switching to the latter despite the brain ache it will give me – I’m not that savvy when it comes to joins, transformations etc., so Power Query was my go-to tool.

    1. Chris Webb says:

      It’s going to depend on whether Power Query can ‘fold’ (ie translate to SQL) what’s happening in the query. Was it Power Query or Access using the memory?

      1. Hi, it was Power Query, just doing connection-only loads from data files. This comes very close to my scenario:
        https://javierguillen.wordpress.com/2014/01/05/denormalizing-tables-in-power-query/
        It looks like having everything in one Power Query query is better than referencing sub-queries, where I had broken out my components in a functional-decomposition style. The most expensive actions were definitely joining tables and pivoting – that’s where the RAM went off the scale, making me reconsider my approach. I’m using Power Query because it is easier for me than thinking through transformations and joins from first principles in Access. I’m going to try the one-query approach as outlined above to see if it helps.

    1. Chris Webb says:

      A few thousand in each case

  3. I’m not sure if this is implicitly established somewhere in the blog, but in any case I would like to share an observation that derives from some small experiments I made with the function. The point concerns the internal workings of the hash function. My conjecture, based on the tests I have done, is that the number the hash function returns is calculated as the remainder of dividing the input value by n, where n is the number of partitions established.
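    Assuming the function in question is Table.Partition, one way to test this conjecture is to pass an identity function as the hash: if partition placement really is the hash value modulo the number of partitions, an integer value v should land in partition number v mod n. A minimal sketch (the table and column names are mine, for illustration):

    let
        //Ten rows containing the integers 0 to 9
        Source = Table.FromColumns({{0..9}}, {"Value"}),
        //Identity "hash": with 3 partitions, the row with value v should land in partition v mod 3
        Partitions = Table.Partition(Source, "Value", 3, each _)
    in
        Partitions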
