Excel · M · Power BI · Power Query

Anti Joins And Query Folding In Power Query In Excel And Power BI

Power Query allows you to merge (“join” in database terms) two tables together in a variety of different ways, including left and right anti joins. Unfortunately, as I found recently, anti joins don’t fold on SQL Server-related data sources, which can result in performance problems. Luckily there is a different way of doing anti joins that does fold.

Say you have two Power Query queries called Fruit1 and Fruit2 that get data from SQL Server tables containing the names of different varieties of fruit:

Now, let’s say you want to get a list of all the fruit varieties that are in Fruit1 but not in Fruit2. The obvious approach is to do a Merge and use the Left Anti option, like so:

Here’s the M code, including an extra step to remove the join column that this creates:

let
    Source = Table.NestedJoin(Fruit1, {"Fruit"}, Fruit2, {"Fruit"}, "Fruit2", JoinKind.LeftAnti),
    #"Removed Other Columns" = Table.SelectColumns(Source,{"Fruit"})
in
    #"Removed Other Columns"

This gives you the correct result:

…but it doesn’t fold, so performance may be bad.

However, if you do a Merge and use the Left Outer option instead:

Then expand the join column (called Fruit2.Fruit here):

And then filter on that column so you only keep the rows where it contains the value null, and then remove that column, you get the same result:

Here’s the M:

let
    Source = Table.NestedJoin(Fruit1, {"Fruit"}, Fruit2, {"Fruit"}, "Fruit2", JoinKind.LeftOuter),
    #"Expanded Fruit2" = Table.ExpandTableColumn(Source, "Fruit2", {"Fruit"}, {"Fruit2.Fruit"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Fruit2", each ([Fruit2.Fruit] = null)),
    #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Fruit"})
in
    #"Removed Other Columns"

This now does fold (meaning performance should be better) and gives you the following SQL:

select [_].[Fruit]
from 
(
    select [$Outer].[Fruit],
        [$Inner].[Fruit2]
    from [dbo].[Fruit1] as [$Outer]
    left outer join 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner] on ([$Outer].[Fruit] = [$Inner].[Fruit2] or [$Outer].[Fruit] is null and [$Inner].[Fruit2] is null)
) as [_]
where [_].[Fruit2] is null
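
The same workaround applies to right anti joins (mentioned at the start of this post) if you swap the roles of the two tables. Here’s a sketch reusing the same Fruit1 and Fruit2 queries – this is my own extension of the example rather than code generated by the UI:

let
    // right anti: keep the rows of Fruit2 that have no match in Fruit1
    Source = Table.NestedJoin(Fruit2, {"Fruit"}, Fruit1, {"Fruit"}, "Fruit1", JoinKind.LeftOuter),
    #"Expanded Fruit1" = Table.ExpandTableColumn(Source, "Fruit1", {"Fruit"}, {"Fruit1.Fruit"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Fruit1", each [Fruit1.Fruit] = null),
    #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows", {"Fruit"})
in
    #"Removed Other Columns"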
Excel · Performance Tuning · Power BI · Power Query

Excel Workbook Layout And The Performance Of Reading Data With Power Query In Power BI

Excel workbooks are one of the slowest data sources you can use with Power Query in Excel or Power BI. Reading small amounts of data from small workbooks is usually fast; reading large amounts of data from large workbooks can be very slow. But what about reading small amounts of data from large Excel workbooks? I did some tests and it turns out that performance can vary a lot depending on where your data is in the workbook and how that workbook is structured.

[Note: in this post I’ll be looking at .xlsx files, rather than other Excel formats like .xls and .xlsb; Excel files stored on a local disk and accessed via the File.Contents M function rather than stored in SharePoint or any other location; data read from Excel tables rather than direct from the worksheet; and Power Query in Power BI. Other scenarios may perform differently.]

Let’s see a simple example to illustrate what I found out. I created a new Excel workbook with one worksheet in and put a small table of data on it:

At this point the workbook’s size was 11KB. I then opened Power BI Desktop and created a Power Query query that read this table of data from the Excel workbook:

let
    Source = Excel.Workbook(File.Contents("C:\MyWorkbook.xlsx"), null, true),
    Sales_Table = Source{[Item="Sales",Kind="Table"]}[Data],
    #"Changed Type" = Table.TransformColumnTypes(Sales_Table,{{"Product", type text}, {"Sales", Int64.Type}})
in
    #"Changed Type"

Then I used this technique to measure how long it took to load the data from Excel into Power BI. Unsurprisingly, it was extremely fast: 63ms.
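
As an aside, if you want a rough way of timing a query from inside M itself, a pattern like the following works. This is a generic sketch of my own, not necessarily the technique linked above, and M’s lazy evaluation and caching mean the numbers are only approximate:

let
    StartTime = DateTime.LocalNow(),
    // the dependency on StartTime forces the start timestamp to be taken first
    Source = if StartTime <> null then Excel.Workbook(File.Contents("C:\MyWorkbook.xlsx"), null, true) else null,
    Sales_Table = Source{[Item = "Sales", Kind = "Table"]}[Data],
    // Table.Buffer forces the whole table to be read before the end timestamp is taken
    RowCount = Table.RowCount(Table.Buffer(Sales_Table)),
    EndTime = if RowCount >= 0 then DateTime.LocalNow() else null,
    ElapsedMs = Duration.TotalSeconds(EndTime - StartTime) * 1000
in
    ElapsedMs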

Then I added a new worksheet to the workbook, copied the same table onto it, added a large block of random numbers underneath it using the following Excel formula, and then pasted the formula’s output back over itself as values:

=RANDARRAY(9999,300)

Doing this meant the size of the workbook grew to 43MB. I then created a new Power Query query in Power BI Desktop, identical to the one above except that it connected to the new table. This time the query took 4918ms – almost 5 seconds.

Interestingly, even after the second worksheet with all the extra data on it had been added, the first query above (reading from the worksheet with no other data on it) was still fast. I also tested refreshing a Power BI dataset that connected to two identical small tables on different worksheets in the same workbook, each worksheet also containing a large amount of other data as in the second scenario above, and the performance of both queries was only slightly slower: it was clear that two Power Query queries can read data from the same Excel workbook in parallel.

So: reading a small amount of data from a table on a worksheet with a large amount of other data on it is very slow.

What can we learn from this? Well, if you can influence the structure and layout of the Excel workbooks you are using as a data source – and that’s a big if, because in most cases you can’t – and you only need to read some of the data from them, you should put the tables of data you are using as a source on separate worksheets and not on the same worksheet as any other large ranges or tables of data.

It turns out that when the Power Query Excel connector reads data from an .xlsx file it can deserialise just some of the data in it rather than the whole thing, but what it can and can’t avoid deserialising depends a lot on the structure of the workbook and how the data is stored within the .xlsx file. If you’re quick you can even see how much data is being read in the refresh dialog in Power BI Desktop:

You can also use Process Monitor, as I describe here, to see how much data is being read from any file used by Power Query.

Performance also depends on which application generated the .xlsx file (it’s not just Excel that creates .xlsx files: other applications can export data to .xlsx format without using Excel) and even on which version of Excel saved it. This is because the same data can be stored in an .xlsx file in different ways, some of which may be more efficient to read than others. I found this blog post by Brendan Long on the .xlsx file format very clear, and it helped me understand how Power Query might go about reading data from an .xlsx file.

[Thanks to Curt Hagenlocher of the Power Query team for answering some of my questions relating to this post]

Errors · Excel · M · Power BI · Power Query

Understanding The “External table is not in the expected format” Error In Power Query In Power BI And Excel

Sometimes when you’re importing data from files using Power Query in either Power BI or Excel you may encounter the following error:

DataFormat.Error: External table is not in the expected format

What causes it? TL;DR: it’s because your query is written to load data from one type of file, probably Excel (I don’t think you can get this error with any other source, but I’m not sure), while the file it actually connects to is of a different type.

Let’s see a simple example. Say you have a folder with two files: one is an Excel file called Date.xlsx and one is a CSV file called Date.csv.

Here’s the M code for a Power Query query that connects to the Excel file and reads the data from a table in it:

let
  Source     = Excel.Workbook(File.Contents("C:\MyFolder\Date.xlsx"), null, true), 
  Date_Table = Source{[Item = "Date", Kind = "Table"]}[Data]
in
  Date_Table

Now, if you change the file path in this query – and only the file path – to point at the CSV file instead like so:

let
  Source     = Excel.Workbook(File.Contents("C:\MyFolder\Date.csv"), null, true), 
  Date_Table = Source{[Item = "Date", Kind = "Table"]}[Data]
in
  Date_Table

…you will get the “external table is not in the expected format” error shown above. This is because your code is using the Excel.Workbook M function, which is used to import data from Excel workbooks, to connect to a file that is a CSV file and not an Excel workbook. The way to fix it is to use the appropriate function, in this case Csv.Document, to access the file like so:

let
  Source = Csv.Document(
    File.Contents("C:\MyFolder\Date.csv"), 
    [Delimiter = ",", Columns = 4, Encoding = 65001, QuoteStyle = QuoteStyle.None]
  ), 
  #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
  #"Promoted Headers"

To be honest, if making this change is beyond your Power Query skills and you’re sure you’re trying to connect to the right file, you’re better off creating a completely new query rather than editing the query you already have.
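
If you regularly have to deal with both Excel and CSV files, another option is to branch on the file extension and call the appropriate function for each type. This is my own sketch rather than anything the Power Query UI will generate for you:

let
    FilePath = "C:\MyFolder\Date.csv",
    // pick the right function for the file type based on its extension
    Source =
        if Text.EndsWith(Text.Lower(FilePath), ".xlsx") then
            Excel.Workbook(File.Contents(FilePath), null, true)
        else
            Csv.Document(File.Contents(FilePath))
in
    Source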

Another common scenario where you might encounter this error is when you’re importing data from all the files in a folder and one of the files isn’t in the correct format. For example, let’s say you have a folder with three Excel files in and you use the Folder data source to import all the data from all three files:

Since all three files are Excel files the Folder option will work:

However, if you take a CSV file and drop it into the folder like so:

Then you’ll get the same error in Power Query:

Apart from deleting the CSV file you have another option to solve this problem: filtering the folder so you only try to get data from the .xlsx files and no other file type. To do this, click on the step called “Source”. When you do this you’ll see that the step returns a table containing all the files in the folder you’re pointing at:

The table returned by this step contains a column called Extension, which holds the file extension of each file. If you filter this table – which will insert a new step at this point in the query, and that’s fine – by clicking on the down arrow in the Extension column, deselecting the (Select All) option and selecting “.xlsx”, so the table only contains .xlsx files, then you can avoid this problem:
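
In M terms the filter that gets generated boils down to a single Table.SelectRows step, something like this sketch (the step names may differ from what the UI generates for you):

let
    Source = Folder.Files("C:\MyFolder"),
    #"Filtered Rows" = Table.SelectRows(Source, each [Extension] = ".xlsx")
in
    #"Filtered Rows"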

If, as in this example, the rogue file happens to be the first file in the folder and you’ve selected that first file to be your “sample” file when setting up the import, then you’ll also need to go to the query called Sample File in the Queries pane and make exactly the same change there (ie click on the Source step and filter out any non-.xlsx files).

Excel · Power BI · Power Query · SQL Server

Cross Database Query Folding For SQL Server Sources In Power Query In Power BI And Excel

A few months ago a new option was added to the Sql.Database and Sql.Databases functions in Power Query in Power BI and Excel which allows Power Query queries that combine data from different SQL Server databases to fold. Here’s a simple example showing how to use it.

On my local PC I have SQL Server installed and the Adventure Works DW 2017 and Contoso Retail DW sample databases:

Both of these databases have date dimension tables called DimDate. Let’s say you want to create a Power Query query that merges these two tables.

Here’s the M code for a Power Query query called DimDate AW to get just the DateKey and CalendarYear columns from the DimDate table of the Adventure Works DW 2017 database:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2017"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(dbo_DimDate,{"DateKey", "CalendarYear"})
in
    #"Removed Other Columns"

Here’s the M code for a Power Query query called DimDate Contoso to get just the Datekey and CalendarYear columns from the DimDate table in the ContosoRetailDW database:

let
    Source = Sql.Database("localhost", "ContosoRetailDW"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(dbo_DimDate,{"Datekey", "CalendarYear"})
in
    #"Removed Other Columns"

Both of these Power Query queries fold. However if you create a third query to merge these two queries (ie do the equivalent of a SQL join between them) on the CalendarYear columns like so:

let
  Source = Table.NestedJoin(
    #"DimDate Contoso",
    {"CalendarYear"},
    #"DimDate AW",
    {"CalendarYear"},
    "DimDate AW",
    JoinKind.LeftOuter
  ),
  #"Expanded DimDate AW" = Table.ExpandTableColumn(
    Source,
    "DimDate AW",
    {"DateKey", "CalendarYear"},
    {"DimDate AW.DateKey", "DimDate AW.CalendarYear"}
  )
in
  #"Expanded DimDate AW"

…this query does not fold, because it combines data from two different SQL Server databases.

However if you edit the Sql.Database function in the Source step of both of the first two queries above to set the new EnableCrossDatabaseFolding option to true, like so:

    Source = Sql.Database(
        "localhost",
        "ContosoRetailDW",
        [EnableCrossDatabaseFolding = true]
    ),

…then the query that merges these two queries does fold:
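
For reference, here’s the complete DimDate Contoso query with the option set – the DimDate AW query needs exactly the same change:

let
    Source = Sql.Database(
        "localhost",
        "ContosoRetailDW",
        [EnableCrossDatabaseFolding = true]
    ),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(dbo_DimDate,{"Datekey", "CalendarYear"})
in
    #"Removed Other Columns"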

Dataflows · Excel · Fabric · Power Query

Power Query Templates In Excel And Fabric

Earlier this year a new feature was added to Power Query in Excel and Dataflows that very few people noticed: the ability to export all the queries in the Power Query Editor to a template file. The reason few people noticed it is that, when the feature was released, the only place you could import a template file was a Power Platform dataflow. The documentation for exporting from Excel and importing to a Power Platform dataflow is very detailed and can be found here:

https://learn.microsoft.com/en-us/power-query/new-dataflow-from-template

With the release of Fabric you can now import templates into Gen2 Dataflows. This means you can export a template from a Gen1 dataflow and import it into a Gen2 dataflow, which is documented here:

https://learn.microsoft.com/en-us/fabric/data-factory/move-dataflow-gen1-to-dataflow-gen2

Joining the dots, this also means you can now export a template from Power Query in Excel and import it to a Gen2 dataflow. This makes it easy to develop on your PC before pushing your work up to Fabric. Here’s a quick walkthrough of how to do this.

In Excel Power Query let’s say you have one or more queries in the Power Query Editor:

If you then go to the File menu in the top left corner of the screen you’ll see the option to export to a template:

If you click this menu item, you’ll see a dialog where you can enter a name and description for the template:

Click Export and you can save the template to a .pqt file.

If you then create a new Gen2 Dataflow in Fabric you’ll see, in the middle of the screen, the option to import from a template:

[Notice also the Export Template option in the top right corner]

If you import the .pqt file created in Excel the queries in it will get created in the dataflow. You may need to re-enter credentials:

Once you’ve done that, the queries run:

Template files are the way the cool kids share their Power Query queries these days and no kid is cooler than my colleague Alex Powers who, I saw on Twitter recently, has created a template file that scrapes all the ideas from the Fabric Ideas site:

[If you do import this template close to the time of writing this post you may run into this bug from the list of known issues for Fabric Data Factory and Dataflows]

Hopefully we’ll start to see more and more useful Power Query queries and functions shared as template files!

Log Analytics · Performance Tuning · Power BI · Power Query · Refresh

Measuring Memory And CPU Usage In Power BI During Dataset Refresh

This post is a follow-up to my recent post on identifying CPU and memory-intensive Power Query queries in Power BI. In that post I pointed out that Profiler and Log Analytics now give you information on the CPU and memory used by an individual Power Query query when importing data into Power BI. What I didn’t notice when I wrote that post is that Profiler and Log Analytics now also tell you about peak memory and CPU usage across all Power Query queries for a single refresh in the Power BI Service, as well as memory usage for the refresh as a whole.

Using the same dataset from my previous post, I ran a Profiler trace on the workspace and captured the Command Begin and Command End events while I refreshed the dataset. Here’s what Profiler shows for the Command End event:

In the upper pane, the Duration column tells you how long the refresh took in milliseconds and the CPUTime column tells you how much CPU was used by both the Analysis Services engine and the Power Query engine during the refresh. This is not new, and I wrote about the CPUTime column last year here.

The lower pane shows the TMSL for the refresh operation – this is the text from the TextData column – and underneath the TMSL there are three new values:

  • PeakMemory shows the maximum amount of memory used during the refresh operation. This includes memory used by the Analysis Services engine and the Power Query engine.
  • MashupCPUTime shows the total amount of CPU used by the Power Query engine for all Power Query queries used to import data into the dataset. This value will always be less than the value shown in the CPUTime column in the upper pane.
  • MashupPeakMemory shows the maximum amount of memory used during the refresh operation by just the Power Query engine across all query evaluations. It’s important to note that this value may not be entirely accurate since the memory usage values, and therefore the peaks, are captured asynchronously so there could be higher peaks that are not detected.

This new information should be extremely useful for troubleshooting refresh failures that are caused by excessive memory consumption.

[Thanks to Akshai Mirchandani for this information]

Log Analytics · M · Performance Tuning · Power BI · Power Query · Refresh

Identifying CPU And Memory-Intensive Power Query Queries During Refresh In The Power BI Service

Last year I wrote a post about a change in the Power BI Service that meant the CPU Time associated with Power Query transformations was added to the total shown for a dataset refresh operation in Profiler traces and Log Analytics:

https://blog.crossjoin.co.uk/2022/07/03/measuring-power-query-cpu-usage-during-power-bi-dataset-refresh/

This was useful, but it didn’t tell you directly how much CPU Time was used by Power Query and it didn’t tell you which tables or partitions in a refresh were using the most CPU. It also didn’t tell you anything about Power Query memory usage. The good news is that recently there has been another change that solves these problems.

Let’s say you have a Power BI dataset that consists of a single table whose source is the following Power Query query:

let
  Source = #table(type table [MyNumber = number], List.Transform({1 .. 1000000}, each {_})), 
  #"Added Custom" = Table.AddColumn(
    Source, 
    "ARandomNumber", 
    each Number.RandomBetween(0, 10000), 
    type number
  ), 
  #"Sorted Rows" = Table.Sort(#"Added Custom", {{"ARandomNumber", Order.Ascending}})
in
  #"Sorted Rows"

This query creates a table with a million rows, adds a column of random numbers and then sorts the table on that column – which is, as you’d expect, a very CPU- and memory-hungry operation.

If you refresh this dataset in the Power BI Service and run a Profiler trace on it, looking at the Command Begin/End and Progress Report Begin/End events, this is what you’ll see:

The final Command End event shows the total duration of the refresh as well as the amount of CPU used by both the Analysis Services engine and Power Query – in this case 24094ms.

If you look at the Progress Report End event associated with the finish of the refresh for the only partition of the only table in the dataset (highlighted in the screenshot above), there’s some extra information:

It shows the amount of CPU Time and the maximum amount of memory used by the Power Query engine while refreshing this partition. In this case the Power Query engine used 19468ms of CPU and reached a peak of 581848KB of memory. I can tell this is going to be really useful for troubleshooting refresh performance issues and out-of-memory errors.

[Thanks to Akshai Mirchandani, Xiaodong Zhang, Ting-Wei Chang and Jast Lu for this information]

Power Query

Why Doesn’t Power Query Know If My Step Will Fold?

Recently, when preparing my session on query folding in Power Query for SQLBits, I wrote the following query to demonstrate the “Might Fold” indicator in Power Query Online:

let
  Source = OData.Feed(
    "https://services.odata.org/TripPinRESTierService", 
    null, 
    [Implementation = "2.0"]
  ), 
  Navigation = Source{[Name = "People", Signature = "table"]}[Data], 
  #"Filtered rows" = Table.SelectRows(Navigation, each [FirstName] = "Angel")
in
  #"Filtered rows"

This query connects to the TripPin OData feed (which is public, so you’ll be able to try this yourself) and filters the People table so it only gets the rows where the FirstName column equals “Angel”. If you paste this query into the Advanced Editor in Power Query Online to create a dataflow, you’ll see that the filter shows the “Might Fold” step indicator:

This tells you that Power Query doesn’t know whether the filter on the FirstName column will fold or not. The Query Plan view shows two alternative plans: one for the case where the filter folds and one for the case where it doesn’t:

The question is, though, why can’t Power Query tell if the query will fold? I didn’t know so I asked Curt Hagenlocher of the Power Query development team, who very kindly explained.

It turns out that OData sources (and also ODBC sources) sometimes misreport their support for query folding. At the time that the query plan above is generated Power Query has already asked the source for its list of supported OData capabilities, but it won’t be until the query is actually run that it will know for sure if the filter can be folded. Similarly, some OData sources have an undeclared maximum URL length limit which means that if Power Query exceeds that it may get an error or even incorrect data back. As a result of this the runtime behaviour of Power Query has several fallbacks: it will try to fold fully, then fold some transforms, then fold nothing. All of which explains why the “Might Fold” step indicator exists.

Data Wrangler · Power Query · Python

Data Wrangler: A Power Query-Like Experience For Python People

If, like me (and Ruth), you spend your life in Power BI but have a lingering feeling that you should get round to learning Python for data analysis sometime then here’s something you should check out: the new Data Wrangler extension for Visual Studio Code. All the details are in the announcement blog post:

https://devblogs.microsoft.com/python/data-wrangler-release/

…and this video is a great introduction to what it does:

Why is it interesting for someone like me? Because it works in a very, very similar way to Power Query – except that instead of generating M code in the background, it generates Python. It doesn’t have as much functionality as Power Query and the UI is a bit more basic, but anyone with Power Query experience will feel immediately at home. I got it up and running very easily and I can see that it will be great for anyone learning Python or who needs a productivity boost.

Excel · M · Power BI · Power Query

Removing Diacritics From Text In Power Query

I faced an interesting challenge recently: I had to use Power Query to remove all non-ASCII characters from some text while maintaining the readability of that text. Let’s take the French phrase “un garçon très âgé à Noël” (which I know doesn’t make much sense but is great for testing) as an example. If you remove everything apart from the basic Latin alphabet characters and spaces using Text.Select, as I blogged here:

Text.Select(
    "un garçon très âgé à Noël",
    List.Combine({{"A".."Z"}, {"a".."z"}, {" "}})
)

[Rick de Groot has a very detailed post on Text.Select here that’s worth reading]

…then the output is “un garon trs g Nol”. As you can see, stripping out the accented characters entirely leads to unreadable text. Instead, what you have to do is find all the letters with diacritics (accents and other glyphs that can be added to characters) and remove just the diacritics. Doing this may be ungrammatical and make the meaning of the text harder to understand, but in most cases the text will still be readable.

The bad news is that there is no straightforward way to do this in Power Query – and indeed no straightforward way to do this at all, because there are no hard-and-fast rules about what a letter with a diacritic should be replaced with: should “ö” become “o” or “oe”, for example? My first thought was to create a big lookup table containing a replacement rule for every character, similar to the approach taken here for solving this problem in Google Sheets. Building a comprehensive lookup table would be a gigantic task though.
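
To make the idea concrete, here’s a minimal sketch of what that lookup-table approach might look like with just a handful of rules (the rules and names below are mine, purely for illustration):

let
    // a tiny illustrative rule set – a comprehensive table would need hundreds of entries
    Replacements = {{"ç", "c"}, {"à", "a"}, {"â", "a"}, {"è", "e"}, {"é", "e"}, {"ë", "e"}},
    ApplyRules = (input as text) as text =>
        List.Accumulate(Replacements, input, (state, rule) => Text.Replace(state, rule{0}, rule{1})),
    Output = ApplyRules("un garçon très âgé à Noël")
in
    Output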

Luckily there is another solution (thanks to Curt Hagenlocher for finding this one for me) that is described in this thread on StackExchange: converting the text to binary using the Greek (ISO) code page and converting it back to text again using the Text.FromBinary and Text.ToBinary M functions. I’m not sure I understand why it works but it seems to work well enough for my purpose. Here’s the M code to do this:

Text.FromBinary(
    Text.ToBinary(
        "un garçon très âgé à Noël",
        28597
    )
)

The output of this is “un garcon tres age a Noel”, which is exactly what we want. Of course it’s not perfect and there are cases where characters can’t be converted. If you take the text “Malus × zumi” (it’s a type of crabapple apparently), then:

Text.FromBinary(
    Text.ToBinary(
        "Malus × zumi",
        28597
    )
)

returns the text “Malus ? zumi” – the “×” has been replaced by “?”. As a result you’ll probably also want to replace any question marks with a space and then remove any excess spaces; I’m going to ignore the fact that doing this might also remove any question marks that were in the original text.

Here’s an M custom function that does all this and also solves my original problem of only returning ASCII characters:

(inputText as text) as text =>
  let
    ReplaceDiacritics = Text.FromBinary(Text.ToBinary(inputText, 28597), TextEncoding.Ascii), 
    RemoveQuestionMarks = Text.Replace(ReplaceDiacritics, "?", " "), 
    RemoveExcessSpaces = Text.Combine(
      List.RemoveItems(Text.Split(Text.Trim(RemoveQuestionMarks), " "), {""}), 
      " "
    )
  in
    RemoveExcessSpaces
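
Assuming you save this function as a query called, say, RemoveDiacritics (the name is my own), calling it looks like this:

// returns "un garcon tres age a Noel"
RemoveDiacritics("un garçon très âgé à Noël")

// returns "Malus zumi"
RemoveDiacritics("Malus × zumi")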