The Rank Column Transform In Power Query Online

[This post was originally published on the official Power Query blog, which has now been taken down. I’m republishing all my posts there to this blog to ensure the content remains available.]

Yet another new transformation has been added to Power Query Online: you can now add columns containing ranks to a table. To see how to use it, consider the following table containing sales and profit values for various products:

Let’s say you want to add a new column to this table containing the rank of each product, calculated according to its sales. To do this select the Sales column so that it is highlighted (as shown in the screenshot above) and then go to the Add Column tab on the ribbon in Power Query Online and click on the “Rank column” button:

When you do this the following dialog will appear:

Click OK and the dialog will close, and you will see that a new column has been added to your table containing the ranks for each product by sales and that the table itself has been sorted by the values in the Sales column too:

In most cases this will give you the values you want, but by clicking on the Advanced radio button in the Rank dialog you’ll find more options for controlling how the rank is calculated.

First of all, notice that because Oranges and Pears have the same sales value they both have a rank of 2 and that the next product, Apples, has a rank of 4. This is because the default rank method, Standard competition, has been used. There are there are two other methods of calculating ranks though. Choosing the Dense method in the Advanced tab of the dialog from the “Rank method” dropdown:

…changes the rank value for Apples to 3:

Choosing the Ordinal rank method ignores any tied values like so:

A different way of handling tied values for sales is to consider profit values, and this is also possible on the Advanced tab. Clicking the “Add ranking” button allows you to add another column to sort by, so that when sales is tied then the product with the highest profit will get the highest rank:

Here’s the output:

Notice now that Pears has a rank of 2 and Oranges has a rank of 3 because, even though they have the same sales, Pears has a higher value for Profit.

Behind the scenes this new functionality is powered by a new M function called Table.AddRankColumn which Reza Rad has already blogged about in detail here.

Gateways And Measuring Power Query CPU Usage During Power BI Dataset Refresh

After last week’s post on measuring Power Query CPU usage during dataset refresh, someone asked an obvious question that I should have addressed: does using a gateway change anything? After all, if you’re using a gateway to connect to an on-premises data source then all the Power Query queries transforming the data from that source will be executed on the gateway machine and not in the Power BI Service.

Let’s do a quick test to find out. I couldn’t use the same Power Query query I used in last week’s post (it turns out you can’t force the use of a gateway when there isn’t an external data source) so instead I used another dataset that connects to a large CSV stored in ADLSgen2 storage and does a group by operation – something which is guaranteed to be very expensive in terms of CPU for Power Query.

Here’s what Profiler shows for the refresh operation when no gateway is used:

The refresh took around 30 seconds and used around 44 seconds of CPU time.

Here’s what Profiler shows when the refresh does use a gateway:

The refresh takes a lot longer, around 103 seconds (as you would expect – instead of loading the data from ADLSgen2 storage in the cloud to the Power BI Service, it has to take a round trip via the gateway on my PC) but the important thing is that the CPU time is now very low – 141 milliseconds.

So, as you might expect, the CPU time for refreshes that use an on-premises data gateway is not shown in Profiler traces because, as I said, all the work done by the Power Query engine is done on the gateway machine and not in the Power BI Service. Making refreshes use a gateway, even when you don’t need to, can be a way of taking load off a Power BI Premium capacity if it’s overloaded.

This in turn raises the question of how you measure Power Query CPU usage on a gateway? As far as I know it isn’t possible for individual Power Query queries (I could be wrong though), although the gateway logs do allow you to capture CPU usage for the whole machine. Better gateway monitoring tools are on the way but this seems like a good time to mention my colleague Rui Romano’s open source gateway monitoring solution (article | repo) which makes understanding the gateway logs a lot easier.

Measuring Power Query CPU Usage During Power BI Dataset Refresh

Some time ago I wrote a post about how optimising for CPU Time is almost as important as optimising for Duration in Power BI, especially if you’re working with Power BI Premium Gen2. This is fairly straightforward if you’re optimising DAX queries or optimising Analysis Services engine-related activity for refreshes. But what about Power Query-related activity? You may have a small dataset but if you’re doing a lot of complex transformations in Power Query that could end up using a lot of CPU, even once the CPU smoothing for background activity that happens with Premium Gen2 has happened. How can you measure how expensive your Power Query queries are in terms of CPU? In this post I’ll show you how.

Let’s consider two Power Query queries that return a similar result and which are connected to two different tables in the same Power BI dataset. The first query returns a table with one column and one row, where the only value is a random number returned by the Number.Random M function:

#table(type table [A=number],{{Number.Random()}})

The second query also returns a table with a single value in it:

let
  InitialList = {1 .. 1000000},
  RandomNumbers = List.Transform(
    InitialList,
    each Number.Random()
  ),
  FindMin = List.Min(RandomNumbers),
  Output = #table(
    type table [A = number],
    {{FindMin}}
  )
in
  Output

This second query, however, generates one million random numbers, finds the minimum and returns that value – which of course is a lot slower and more expensive in terms of CPU.

If you run a SQL Server Profiler trace connected to Power BI Desktop and refresh each of the two tables in the dataset separately, the Command End event for the refresh will tell you the duration of the refresh and also the amount of CPU Time used by the Analysis Services engine for the refresh (there will be several Command End events visible in Profiler but only one with any significant activity, so it will be easy to spot the right one). In Desktop, however, the Command End event does not include any CPU used by the Power Query Engine. Here’s what the Command End event for the first Power Query query above looks like in Desktop:

As you would expect the values in both the Duration and CPU Time columns are low. Here is what the Command End event looks like for the second query above:

This time the refresh is much slower (the Duration value is much larger than before) but the CPU Time value is still low, because the Analysis Services engine is still only receiving a table with a single value in it. All the time taken by the refresh is taken in the Power Query engine.

If you publish a dataset containing these queries to a Premium workspace in the Power BI Service, connect Profiler to the XMLA Endpoint for the workspace, and then refresh the two tables again then for the first, fast query you won’t notice much difference:

[Note that in this screenshot I’ve chosen a comparable Command End event to the one I used in Desktop, although for some reason it doesn’t show the duration. The overall refresh duration, which includes some extra work to do a backup, is around 2 seconds]

However, for the second, slower query you can see that the CPU Time for the Command End event is much higher. This is because in the Power BI Service the event’s CPU Time includes all the Power Query-related activity as well as all Analysis Services engine activity:

This is a simple example where there is very little work being done in the Analysis Services engine, which means that pretty much all the CPU Time can be attributed to the Power Query engine. In the real world, when you’re working with large amount of data, it will be harder to understand how much work is being done in the Analysis Services engine and how much is being done in the Power Query engine. This is where Power BI Desktop comes in, I think. In Desktop you know you are only seeing the CPU used by the Analysis Services engine, so I’ll bet that if there is a big difference in the ratio of CPU Time to Duration for your refresh in Power BI Desktop compared to the Power BI Service, it’s highly likely that that difference is due to Power Query engine activity and that’s where you should concentrate your optimisation efforts.

Of course the next question is how can you optimise Power Query queries so they use less CPU? I don’t know, I haven’t done it yet – but when I have something useful to share I’ll blog about it…

Monitoring Power Query Online Memory And CPU Usage

Power Query Online is, as the name suggests, the online version of Power Query – it’s what you use when you’re developing Power BI Dataflows for example. Sometimes when you’re building a complex, slow query in the Query Editor you’ll notice a message in the status bar at the bottom of the page telling you how long the query has been running for and how much memory and CPU it’s using:

The duration and CPU values are straightforward, but what does the memory value actually represent? It turns out it’s the “Commit (Bytes)” value documented here for Query Diagnostics, that’s to say the amount of virtual memory being used by the query. That’s different to the “Working Set (Bytes)” value which is the amount of physical memory used by the query, and which is not visible anywhere. For a more detailed discussion of these values in Power Query in Power BI Desktop see this post. The maximum commit or working set for a query evalation in Power Query Online isn’t officially documented anywhere (and may change) but I can say three things:

The maximum commit is larger than the maximimum working set.
If Power Query Online uses more than the maximum working set then query evaluation will get slow, so if your query uses a lot of memory (say, over 1GB – I suspect you’ll only see this message if it is using a lot of memory…) then you need to do some tuning to reduce it. Probably the best way to do this is to look at the query plan for your dataflow and try to avoid any operations marked as “Full Scan”, as documented here.
If your query uses more than the maximum commit then it may get cancelled and you’ll see an error (note that the maximum time a query evaluation can run for in Power Query Online anyway is 10 minutes, which is documented here).

[Thanks to Jorge Gomez Basanta for this information]

Add ‘Export To Excel’ With Power Query To Your Application

There’s an old joke about “Export to Excel” being the most important feature of any BI tool. In fact, I’d say export to Excel is one of the most important features of any enterprise application of any type. Of course the reason we joke about it is that we know it’s a Bad Thing and the starting point for all kinds of manual, error-prone and time-consuming business processes – but even though we know there are much better ways of achieving whatever it is the user wants to do, they still want to export to Excel.

So wouldn’t be good if you could export to Excel and instead of getting a static copy of the data, you could get a table connected to a Power Query query which in turn connected back to the original data source, so it could be refreshed whenever the user wanted? After all, pretty much everyone nowadays has a version of Excel with Power Query in it (even, with some limitations, Mac users). It’s always been possible to build yourself but technically difficult. Recently, though, I became aware of a JavaScript library developed by the Excel Power Query team called “Connected Workbooks” that makes it extremely easy to do this. You can find out more about it here:

https://www.npmjs.com/package/@microsoft/connected-workbooks

https://github.com/microsoft/connected-workbooks#readme

So if you’re adding export to Excel to your application, or know someone who is, check it out!

I’m Posting On The Power Query Blog Too!

In the future you’re going see me writing blog posts on the official Power Query blog as well as here on my own personal blog, and indeed the first of these posts went live a few hours ago. It’s on a new M function called Table.StopFolding which, as the name suggests, stops query folding taking place:

https://powerquery.microsoft.com/en-us/blog/stop-query-folding-with-table-stopfolding/

I’m doing this a) because I was asked very nicely by the Power Query team if I could help out, and b) because it doesn’t make sense for announcements about new Power BI or Power Query functionality, however obscure, to be made on my own personal blog rather than on an official product blog. This isn’t going to affect the number of posts here though.

[Update May 2024 – the Power Query blog has now been taken down along with all of my posts there. I will republish the five posts I wrote here]

Calculating Percentiles In Power Query Online

[This post was originally published on the Power Query blog which no longer exists. I have republished my posts from there here to make sure the content remains available.]

There’s a new aggregation option available for the Group By transformation in Power Query Online: percentiles. To see how it works, let’s see a simple example.

Consider the following Excel table containing information about invoices from different countries:

If you connect to this table from a Power BI or Power Platform dataflow, you can select the Country column and click the Group by button on the Transform tab in the ribbon:

When you do this the Group By dialog will open with the Country column selected in the Group by dropdown box. This is where you’ll see the new Percentile option in the Operation dropdown box; if you select it another box will appear underneath the Operation dropdown box where you can enter a number between 0 and 100 to indicate which percentile should be returned when the data is aggregated. You will also need to enter a new column name to contain the aggregated data and select a column containing the numeric data to be aggregated. Here’s what the Group By dialog would look like if you wanted to find the value for the 50^th percentile from the Amount column for each country:

Click OK and here’s the output of the transformation:

This functionality uses the List.Percentile M function to do its work.

One last thing to mention is that if you’re using SQL Server or Azure Data Explorer as your data source query folding will take place for this transformation.

Building A Reporting Solution Using Excel Power Query – Where Are We Now?

Seven years ago I gave a presentation at SQLBits called “Building a reporting solution using Power Query”. You can watch the recording here:

https://sqlbits.com/Sessions/Event14/Building_A_Reporting_Solution_Using_Power_Query

In it I showed how you could build a simple reporting solution using just Excel and Power Query, loading data into tables, handling parameterisation, making sure you get the best performance and so on. I think the session holds up pretty well: the functionality I showed hasn’t changed at all, and while in the meantime Power BI has reinvented itself and taken over the world I still think there’s a strong argument for using Excel plus Power Query instead of Power BI for some reporting scenarios (although it may be heresy to say so…).

If you follow the Excel blog you’ll know there have been a number of exciting announcements in the last few months, so I thought it would be interesting to take a look at some of them and consider the impact they have for BI and reporting use cases.

Power Query in Excel for the Mac

One of the priorities for the Excel Power Query team has been to get Power Query working in Excel on the Mac, and in the latest update we now have the Power Query Editor available. Data sources are still limited to files (CSV, Excel, XML, JSON), Excel tables/ranges, SharePoint, OData and SQL Server but they are some of the most popular sources. I’m not a Mac person so this doesn’t excite me much, but this does open up Power Query to a new demographic that has traditionally ignored Microsoft BI; for example, I was leafing through John Foreman’s excellent introductory data science book “Data Smart” recently and all the examples in it are in Excel to reach a mass audience, but… Excel for the Mac.

Power Query in Excel Online

This, on the other hand, is something I do care about: who cares what OS you’re running if you can do everything you need in the browser? Well now you can refresh Power Query in Excel Online, although again only a few data sources are supported at the moment: data in tables/ranges in the current workbook, or anonymous OData feeds. More data sources will be supported in the future and there will also be better integration with Office Scripts, so you’ll be able to refresh queries from Power Automate or via a button without needing VBA; you’ll also be able use the Power Query Editor in the browser too.

Before you get too excited about Power Query in Excel Online, though, remember one important difference between it and a Power BI report or a paginated report. In a Power BI report or a paginated report, when a user views a report, nothing they do – slicing, dicing, filtering etc – affects or is visible to any other users. With Power Query and Excel Online however you’re always working with a single copy of a document, so when one user refreshes a Power Query query and loads data into a workbook that change affects everyone. As a result, the kind of parameterised reports I show in my SQLBits presentation that work well in desktop Excel (because everyone can have their own copy of a workbook) could never work well in the browser, although I suppose Excel Online’s Sheet View feature offers a partial solution. Of course not all reports need this kind of interactivity and this does make collaboration and commenting on a report much easier; and when you’re collaborating on a report the Show Changes feature makes it easy to see who changed what.

More flexibility with Power Query data types

Being the kind of person who stores their data in Power BI I didn’t do much with Power Query data types when they were released; after all, you can create Organisation data types to access Power BI data from Excel and I prefer using Excel cube functions anyway. However if you’re not using Power BI then I can see how Power Query data types could be really useful for building reports that go beyond big, boring tables, making it much easier to create more complex report layouts.

Power Query connector for Power BI dataflows and Dataverse

Lastly, the feature I’m most excited about: the ability to load data from Power BI dataflows and Dataverse into Excel via Power Query. It’s not available yet although I promise it’s coming very soon! The ability to share cleaned and conformed data via dataflows direct to those Excel users who just want a data dump (rather than using Analyze in Excel on a Power BI dataset) will prove to be extremely popular, I think. There are a lot of improvements to dataflows coming soon too (you do remember to check the release notes regularly, don’t you?).

Conclusion

Overall it’s clear that Excel Power Query is getting better and better. It may never be able to keep pace with Power BI (what can?) but all these new features show that, for people who prefer to do everything in Excel, it’s making Excel a much better place to build reports. I feel like I need to update my SQLBits presentation now!

Understanding The “We Couldn’t Fold The Expression To The Data Source” Error In Power BI

If you’re using DirectQuery mode in Power BI you may occasionally run into the following error message:

Couldn’t load the data for this visual

OLE DB or ODBC error: [Expression.Error] We couldn’t fold the expression to the data source. Please try a simpler expression..

What does it mean and how can you fix it?

To understand what’s going on here you must first understand what query folding is. There’s some great documentation here that I strongly recommend you read, but in a nutshell query folding refers to how the Power Query engine inside Power BI can push calculation and transformation logic back to whatever data source you’re using in the form of a query – for example a SQL query if your data source is a relational database. Most of the time when people talk about query folding they are using Import mode but it’s even more important in DirectQuery mode: in DirectQuery mode not only does every transformation you create in the Power Query Editor have to fold, but every DAX query (including all your DAX calculations) generated by the visuals on your report has to be folded into one or more queries against your data source too.

You can do some pretty complex things in the Power Query Editor and in DAX and the error message above is the error you get when Power BI admits defeat and says it can’t translate a DAX query generated by a visual on a report into a query against your data source. The cause is likely to be a combination of several of the of the following:

A complex data model
Complex DAX used in measures or calculated columns
The use of dynamic M parameters
Complex transformations created in the Power Query Editor

Unfortunately it’s hard to be more specific because Power BI can fold different transformations to different data sources and this error almost never occurs in simple scenarios.

How can you avoid it? Again, I can only offer general advice:

Don’t do any transformations in the Power Query Editor if you’re using DirectQuery mode. If you want to use DirectQuery you should always make sure your data is modelled appropriately in whatever data source you’re using before you start designing your dataset in Power BI.
Keep your data model as simple as possible. For example, avoiding bi-directional relationships is a good idea.
Try to implement as much of the logic for your calculations in your data source and reduce the amount of DAX you need to write.
Try to write your DAX in a different way in the hope that Power BI will be able to fold it.

The Power Query XML Functions, Missing Or Null Values And Inconsistent Schema Inference

A few months ago one of my colleagues at Microsoft, David Browne, showed me an interesting Power Query problem with how the Xml.Tables and Xml.Document M functions handle null or missing values. I’m posting the details here because the problem seems fairly common, it causes a lot of confusion and it’s not easy to deal with.

In XML there are two ways to represent a null or missing value:<a></a> or omitting the element completely. Unfortunately the Xml.Tables and Xml.Document M functions handle these inconsistently: they treat the <a></a> form as a table but the other as a scalar.

For example consider the following M query that takes some XML (with no missing values), passes it to Xml.Tables and then expands the columns:

let
    Source = "<?xml version=""1.0"" encoding=""UTF-8""?>
    <Record>
     <Location>
        <id>123</id>
        <code>abc</code>
     </Location>
     <Location>
        <id>123</id>
        <code>hello</code>     
     </Location>
    <Location>
        <id>123</id>
        <code>124</code>     
     </Location>
</Record>",
   t0 = Xml.Tables(Source),
    #"Expanded Table" = Table.ExpandTableColumn(t0, "Table", {"id", "code"}, {"id", "code"})
   
in
   #"Expanded Table"

Here’s the output:

Notice how the code column contains scalar values.

Now compare this with a query where the XML contains a missing value using the <a></a> form but where the query is otherwise identical:

let
    Source = "<?xml version=""1.0"" encoding=""UTF-8""?>
    <Record>
     <Location>
        <id>123</id>
        <code>abc</code>
     </Location>
     <Location>
        <id>123</id>
        <code></code>     
     </Location>
    <Location>
        <id>123</id>
        <code>124</code>     
     </Location>
</Record>",
   t0 = Xml.Tables(Source),
    #"Expanded Table" = Table.ExpandTableColumn(t0, "Table", {"id", "code"}, {"id", "code"})
   
in
   #"Expanded Table"

Here’s the output:

Notice how the code column now contains values of type table because of the presence of a missing value.

Finally, here’s one last example where the missing value is handled by omitting the element:

let
    Source = "<?xml version=""1.0"" encoding=""UTF-8""?>
    <Record>
     <Location>
        <id>123</id>
        <code>abc</code>
     </Location>
     <Location>
        <id>123</id> 
     </Location>
    <Location>
        <id>123</id>
        <code>124</code>     
     </Location>
</Record>",
   t0 = Xml.Tables(Source),
    #"Expanded Table" = Table.ExpandTableColumn(t0, "Table", {"id", "code"}, {"id", "code"})
   
in
   #"Expanded Table"

Here’s the output:

Notice how code now contains scalar values but the value in that column on the second row is null.

You can probably guess the kind of problems this can cause: if your source XML sometimes contains missing values and sometimes doesn’t you can end up with errors if you try to expand a column that sometimes contains table values and sometimes doesn’t. It’s not a bug but sadly this is just one of those things which is now very difficult for the Power Query team to change.

There’s no easy way to work around this unless you can change the way your data source generates its XML, which is unlikely. Probably the best thing you can do is use the Xml.Document function; it has the same problem but it’s slightly easier to deal with given how it returns values in a single column, which means you can use a custom column to trap table values something like this:

 let
  Source
    = "<?xml version=""1.0"" encoding=""UTF-8""?>
    <Record>
     <Location>
        <id>123</id>
        <code>abc</code>
     </Location>
     <Location>
        <id>123</id> 
        <code></code>
     </Location>
    <Location>
        <id>123</id>
        <code>124</code>     
     </Location>
</Record>",
  t0 = Xml.Document(Source),
  Value = t0{0}[Value],
  #"Expanded Value"
    = Table.ExpandTableColumn(
    Value,
    "Value",
    {
      "Name",
      "Namespace",
      "Value",
      "Attributes"
    },
    {
      "Name.1",
      "Namespace.1",
      "Value.1",
      "Attributes.1"
    }
  ),
  #"Removed Other Columns"
    = Table.SelectColumns(
    #"Expanded Value",
    {"Name.1", "Value.1"}
  ),
  #"Added Custom" = Table.AddColumn(
    #"Removed Other Columns",
    "Custom",
    each
      if Value.Is([Value.1], type table) then
        null
      else
        [Value.1]
  )
in
  #"Added Custom"

[Thanks to David Browne for the examples and solutions, and to Curt Hagenlocher for confirming this is the way the Xml.Tables and Xml.Document functions are meant to behave]