Power BI/Data Books Roundup

It’s time for another short post on the free books that various authors have been kind enough to send me over the last few months. Full disclosure: these aren’t reviews as such, they’re more like free publicity in return for the free books, and I don’t pretend to be unbiased; also, the Amazon UK links contain an affiliate code that gives me a kickback if you buy any of these books.

Deciphering Data Architectures, James Serra

I’ll be honest, I’ve had this book hanging around in my inbox since February and I wasn’t sure what to expect of it, but when I finally got round to reading it I enjoyed it a lot and found it very useful. If you’re looking for clear, concise explanations of all of the jargon and methodologies that are in use in the data industry today then this is the book for you. Do you want to understand the difference between Kimball and Inmon? Get an honest overview of data mesh? Choose between a data lake and a relational data warehouse? It’s all here and more. It’s an opinionated book (which I appreciate) and quite funny in places too. Definitely a book for every junior BI consultant to read and for more senior people to have handy to fill in gaps in their knowledge.

Extending Power BI with Python and R (second edition), Luca Zavarella

I posted about the first edition of this book back in 2021; this new edition has several new chapters about optimising R and Python settings, using Intel’s Math Kernel Library for performance and addressing integration challenges. As before this is all fascinating stuff that no-one else in the Power BI world is talking about. I feel like a future third edition covering what will be possible with Power BI and Python in Fabric in 2-3 years will be really cool.

Data Cleaning with Power BI, Gus Frazer

It’s always nice to see authors focusing on a business problem – in this case data cleaning – rather than a technology. If you’re looking for an introductory book on Power Query this certainly does the job but the real value here is the way it looks at how to clean data for Power BI using all of the functionality in Power BI, not just Power Query, as well as tools like Power Automate. It’s also good at telling you what you should be doing with these tools and why. Extra credit is awarded for including a chapter that covers Azure OpenAI and Copilot in Dataflows Gen2.

New Semi Join, Anti Join And Query Folding Functionality In Power Query

There are a couple of nice new features to do with table joins (or merges as they are known in M) and query folding in Power Query in the April release of Power BI Desktop that I want to highlight.

Anti Joins now fold

First of all, a few months ago I wrote a post about how the built-in anti join functionality didn’t fold in Power Query. The good news is that it now does on SQL Server-related sources, so no more workarounds are needed. For example, if you have two tables in a SQL Server database called Fruit1 and Fruit2 and two Power Query queries that get data from those tables:

…then the following M code:

let
  Source = Table.Join(
    Fruit1,
    {"Fruit1"},
    Fruit2,
    {"Fruit2"},
    JoinKind.LeftAnti
  )
in
  Source

…returns the following table of fruits that are in the Fruit1 table and not in the Fruit2 table:

Of course that’s what the code above returned in previous versions of Power Query too. The difference now is that query folding occurs and the following SQL code is generated:

select [$Outer].[Fruit1],
    cast(null as nvarchar(50)) as [Fruit2]
from 
(
    select [_].[Fruit] as [Fruit1]
    from [dbo].[Fruit1] as [_]
) as [$Outer]
where not exists 
(
    select 1
    from 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner]
    where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null
)

New join kind: semi joins

There are also two brand new join kinds you can use in the Table.Join and Table.NestedJoin functions: JoinKind.LeftSemi and JoinKind.RightSemi. Semi joins allow you to select the rows in one table that have matching values in another table. Using the Fruit1 and Fruit2 tables above, the following M code:

let
  Source = Table.Join(
    Fruit1, 
    {"Fruit1"}, 
    Fruit2, 
    {"Fruit2"}, 
    JoinKind.LeftSemi
  )
in
  Source

…returns all the rows in Fruit1 where there is a matching value in Fruit2:

Here’s the SQL that is generated:

select [$Outer].[Fruit1],
    cast(null as nvarchar(50)) as [Fruit2]
from 
(
    select [_].[Fruit] as [Fruit1]
    from [dbo].[Fruit1] as [_]
) as [$Outer]
where exists 
(
    select 1
    from 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner]
    where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null
)
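
If you want to try the two join kinds without a SQL Server database to hand, here is a minimal sketch that uses small in-memory tables in place of the Fruit1 and Fruit2 queries. The fruit names are made up for illustration, and because the tables are built with #table no query folding takes place; it only shows which rows each join kind keeps.

```powerquery
let
  // Hypothetical in-memory stand-ins for the Fruit1 and Fruit2 queries
  Fruit1 = #table(type table [Fruit1 = text], {{"Apples"}, {"Oranges"}, {"Pears"}}),
  Fruit2 = #table(type table [Fruit2 = text], {{"Apples"}, {"Bananas"}}),
  // Anti join: keeps the rows of Fruit1 that have no match in Fruit2
  AntiJoin = Table.Join(Fruit1, {"Fruit1"}, Fruit2, {"Fruit2"}, JoinKind.LeftAnti),
  // Semi join: keeps the rows of Fruit1 that do have a match in Fruit2
  SemiJoin = Table.Join(Fruit1, {"Fruit1"}, Fruit2, {"Fruit2"}, JoinKind.LeftSemi)
in
  // Return both results in a record so they can be inspected side by side
  [Anti = AntiJoin, Semi = SemiJoin]
```

With this sample data the anti join should keep Oranges and Pears, and the semi join should keep only Apples.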

The ?? operator now folds

The M language’s ?? coalesce operator is used for replacing null values and it now folds on SQL Server-related sources too. For example, the M query in the previous section that did a semi join on Fruit1 and Fruit2 returns a table where all the rows in the Fruit2 column contain null values. The following M query adds a new custom column that returns the text value “Nothing” when the Fruit2 column contains a null:

let
  Source = Table.Join(
    Fruit1, 
    {"Fruit1"}, 
    Fruit2, 
    {"Fruit2"}, 
    JoinKind.LeftSemi
  ), 
  ReplaceNulls = Table.AddColumn(
    Source, 
    "NullReplacement", 
    each [Fruit2] ?? "Nothing"
  )
in
  ReplaceNulls

Here’s the SQL generated for this, where the ?? operator is folded to a CASE expression:

select [_].[Fruit1] as [Fruit1],
    [_].[Fruit2] as [Fruit2],
    case
        when [_].[Fruit2] is null
        then 'Nothing'
        else [_].[Fruit2]
    end as [NullReplacement]
from 
(
    select [$Outer].[Fruit1],
        cast(null as nvarchar(50)) as [Fruit2]
    from 
    (
        select [_].[Fruit] as [Fruit1]
        from [dbo].[Fruit1] as [_]
    ) as [$Outer]
    where exists 
    (
        select 1
        from 
        (
            select [_].[Fruit] as [Fruit2]
            from [dbo].[Fruit2] as [_]
        ) as [$Inner]
        where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null
    )
) as [_]
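
If you haven’t used the ?? operator before, this small sketch shows what it does on its own, away from any data source. The list of values is made up for illustration.

```powerquery
let
  // Hypothetical list containing one null value
  Values = {"Apples", null, "Pears"},
  // ?? returns the left-hand value unless it is null,
  // in which case it returns the right-hand value
  Replaced = List.Transform(Values, each _ ?? "Nothing")
in
  Replaced
```

This returns the list {"Apples", "Nothing", "Pears"}: only the null is replaced.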

[Thanks to Curt Hagenlocher for the information in this post]

Displaying Azure Maps In A Power BI Paginated Report

The built-in mapping functionality in Power BI paginated reports is fairly basic. However the integration of Power Query into Power BI paginated reports gives you an interesting new way of creating maps in paginated reports: you can call the Azure Maps API using Power Query and display the image returned in an Image report item. In this blog post I’ll show you how.

Here’s a quick summary of what I’m going to do:

Call the API from https://data.police.uk/ (specifically the Street-level crimes endpoint) using Power Query to get all the recorded crimes within a one mile radius of a given latitude and longitude in a given month for any location in England, Wales or Northern Ireland
Take this list of crimes and pass them to the Azure Maps API Get Map Static Image endpoint to return an image of a map with the crime locations on it
Display this image in an Image report item in a paginated report

And here’s an example of what the final paginated report will look like:

Step 1: Sign up for the Azure Maps API

In order to call the Azure Maps API you’ll need to go to the Azure Portal and create a resource. The pricing is very reasonable: the first 1000 calls to the endpoint used here are free and after that it’s $4.50 per month for up to 500,000 calls, which should be more than enough for BI purposes.

Step 2: Create Shareable Cloud Connections

To connect to data sources in Power Query in paginated reports you need to create Shareable Cloud Connections in the Power BI portal. You’ll need two connections for this report: one for the Azure Maps API with the URL https://atlas.microsoft.com/map/static/png and one for the Crime API with the URL https://data.police.uk/api/crimes-street/all-crime. Both SCCs should have the authentication method Anonymous and the privacy level Public and have the Skip Test Connection option checked:

Step 3: Create a paginated report and Power Query query to call APIs

After creating a new paginated report in Power BI Report Builder you need to create a dataset (called AzureMap here) to get data from the APIs. This dataset uses Power Query as a source and has one main query (also called AzureMap) and four parameters:

lon and lat, to hold the longitude and latitude of the location to get crime data for, which will also be the centre point of the map
zoom, which is the zoom level of the map
yearmonth, which is the year and month in YYYY-MM format to get crime data for:

Here’s the M code for the query:

let
  CallCrimeAPI = Json.Document(
    Web.Contents(
      "https://data.police.uk/api/crimes-street/all-crime",
      [
        Query = [
          lat  = Text.From(lat),
          lng  = Text.From(lon),
          date = yearmonth
        ]
      ]
    )
  ),
  ToTable = Table.FromList(
    CallCrimeAPI,
    Splitter.SplitByNothing(),
    null,
    null,
    ExtraValues.Error
  ),
  First50 = Table.FirstN(ToTable, 50),
  ExpandColumn1 = Table.ExpandRecordColumn(
    First50,
    "Column1",
    {"location"},
    {"location"}
  ),
  Expandlocation = Table.ExpandRecordColumn(
    ExpandColumn1,
    "location",
    {"latitude", "street", "longitude"},
    {
      "location.latitude",
      "location.street",
      "location.longitude"
    }
  ),
  JustLatLon = Table.SelectColumns(
    Expandlocation,
    {"location.longitude", "location.latitude"}
  ),
  TypeToText = Table.TransformColumnTypes(
    JustLatLon,
    {
      {"location.longitude", type text},
      {"location.latitude", type text}
    }
  ),
  MergedColumns = Table.CombineColumns(
    TypeToText,
    {"location.longitude", "location.latitude"},
    Combiner.CombineTextByDelimiter(
      " ",
      QuoteStyle.None
    ),
    "LongLat"
  ),
  PrefixPipe = Table.TransformColumns(
    MergedColumns,
    {{"LongLat", each "|" & _, type text}}
  ),
  GetString = "|"
    & Text.Combine(PrefixPipe[LongLat]),
  QueryRecord = [
    #"subscription-key"
      = "InsertYourSubscriptionKeyHere",
    #"api-version" = "2022-08-01",
    layer = "basic",
    style = "main",
    #"zoom" = Text.From(zoom),
    center = Text.From(lon) & ", " & Text.From(lat),
    width = "768",
    height = "768"
  ],
  AddPins = try
    Record.AddField(
      QueryRecord,
      "pins",
      "default|sc0.5" & GetString
    )
  otherwise
    QueryRecord,
  CallAzureMapsAPI = Web.Contents(
    "https://atlas.microsoft.com/map/static/png",
    [Query = AddPins]
  ),
  ToText = Binary.ToText(
    CallAzureMapsAPI,
    BinaryEncoding.Base64
  ),
  OutputTable = #table(
    type table [image = text],
    {{ToText}}
  )
in
  OutputTable

You need to put all this code in a single M query to avoid the error “Formula.Firewall: Query ‘Query1’ (step ‘xyz’) references other queries or steps, so it may not directly access a data source. Please rebuild this data combination.” You can find out more about this error by watching my data privacy video here.

A few things to note:

The CallCrimeAPI step calls the Street-level crimes API endpoint (crimes-street/all-crime) to get all the reported crimes within a one mile radius of the given latitude and longitude in the given year and month.
Because of the way I’m sending the crime location data to the Azure Maps API I limited the number of locations to 50, in the First50 step, to avoid hitting errors relating to the maximum length of a URL.
The GetString step returns a pipe delimited list of longitudes and latitudes of crime locations for the Azure Maps API to display as pins on the map. However, some error handling is needed in case there were no reported crimes in the given location or month and that happens in the AddPins step.
The QueryRecord step contains all the parameters to send to the Azure Maps Get Map Static Image endpoint. This docs page has more information on what’s possible with this API – I’m barely scratching the surface in this example.
Authentication to the Azure Maps API is via a subscription key which you’ll need to pass to the subscription-key parameter. You can get the key from the resource created in step 1 in the Azure Portal.
The API returns an image binary which is converted to text and returned in a table with one column and one row in the ToText and OutputTable steps. The code is similar to what I showed in this blog post but luckily I didn’t seem to need to break it up into multiple rows.
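
To make the format of the pins parameter easier to see, here is a minimal sketch that builds the same kind of string from two hard-coded locations instead of calling the crime API. The coordinates are made up for illustration, and the step names only loosely mirror the ones in the query above.

```powerquery
let
  // Hypothetical crime locations, already typed as text
  Locations = #table(
    type table [longitude = text, latitude = text],
    {{"-0.1276", "51.5034"}, {"-0.1419", "51.5014"}}
  ),
  // One "|longitude latitude" entry per location
  LongLat = List.Transform(
    Table.ToRecords(Locations),
    each "|" & [longitude] & " " & [latitude]
  ),
  // The extra leading pipe gives the double pipe that separates
  // the pin style from the list of locations
  GetString = "|" & Text.Combine(LongLat),
  Pins = "default|sc0.5" & GetString
in
  Pins
```

This returns the text default|sc0.5||-0.1276 51.5034|-0.1419 51.5014, which is the shape of value that the AddPins step adds to the query record.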

Step 4: Create Power Query query to return values for Zoom parameter

The Zoom parameter of the Get Map Static Image API endpoint accepts a value between 0 and 20, which represents the zoom level of the displayed map. You need to create a separate dataset and M query to return a table containing those values with the following code:

let
  Source = {0 .. 20}, 
  #"Converted to table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ), 
  #"Changed column type"
    = Table.TransformColumnTypes(
    #"Converted to table", 
    {{"Column1", Int64.Type}}
  ), 
  #"Renamed columns" = Table.RenameColumns(
    #"Changed column type", 
    {{"Column1", "Zoom"}}
  )
in
  #"Renamed columns"

Step 5: Create paginated report parameters

Next you need to create four parameters in the paginated report for the longitude, latitude, zoom level and year month:

To make it easy for end users to select a zoom level, you need to bind the available values for the zoom parameter to the table returned by the dataset from the previous step:

Step 6: Display the map in an Image report item

In the paginated report itself the only interesting thing is the configuration of the Image report item in the centre of the report:

You need to set the image source to “Database”, bind it to the following expression

=First(Fields!image.Value, "AzureMap")

…which gets the text value from the sole row and column in the table returned by the AzureMap dataset created in step 3, and set the MIME type to be “image/png”.

And that’s it! After publishing you can enter any latitude and longitude in England, Wales or Northern Ireland, a year and month, and a zoom level, and get all the reported crimes on a map:

You can download the .rdl file with the paginated report in here (remember to edit the AzureMap query to insert your Azure Maps API key).

Power BI Paginated Reports That Connect To Web Services And Excel

By far the most exciting announcement for me this week was the new release of Power BI Report Builder that has Power Query built in, allowing you to connect to far more data sources in paginated reports than you ever could before. There’s a very detailed blog post and video showing you how this new functionality works here:

https://powerbi.microsoft.com/en-us/blog/get-data-with-power-query-available-in-power-bi-report-builder-preview

The main justification for building this feature was to allow customers to build paginated reports on sources like Snowflake or BigQuery, something which had only been possible before if you used an ODBC connection via a gateway or built a semantic model in between – neither of which is an ideal solution. However, it also opens up a lot of other possibilities.

For example, you can now build paginated reports on web services (with some limitations). I frequently get asked about building regular Power BI reports that get data from web services on demand – something which isn’t possible, as I explained here. To test using paginated reports on a web service I registered for Transport for London’s APIs and built a simple report on top of their Journey Planner API (Transport for London are the organisation that manages public transport in London). This report allows you to enter a journey starting point and ending point anywhere in or around London, calls the API and returns a table with different routes from the start to the destination, along with timings and instructions for each route. Here’s the report showing different routes for a journey from 10 Downing Street in London to Buckingham Palace:

You can also build paginated reports that connect to Excel workbooks that are stored in OneDrive or OneLake, meaning that changes made in the Excel workbook show up in the report as soon as the workbook is saved and closed:

So. Much. Fun. I’ll probably develop a presentation for user groups explaining how I built these reports soon.

And yes, if you need to export data to Excel on a schedule, paginated reports are now an even better choice. You know your users want this.

Overhead Of Getting Relationship Columns In Power BI DirectQuery Mode

Many Power BI connectors for relational databases, such as the SQL Server connector, have an advanced option to control whether relationship columns are returned or not. By default this option is on. Returning these relationship columns adds a small overhead to the time taken to open a connection to a data source and so, for Power BI DirectQuery semantic models, turning this option off can improve report performance slightly.

What are relationship columns? If you connect to the DimDate table in the Adventure Works DW 2017 sample database using Power Query, you’ll see them on the right-hand side of the table. The following M code:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2017"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate

…shows the relationship columns:

Whereas if you explicitly turn off the relationships by deselecting the “Include relationship columns” checkbox:

…you get the following M code with the CreateNavigationProperties property set to false:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2017", [CreateNavigationProperties=false]),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate

…and you don’t see those extra columns.

How much overhead does fetching relationship columns add? It depends on the type of source you’re using, how many relationships are defined and how many tables there are in your model (because the calls to get this information are not made in parallel). It’s also, as far as I know, impossible to measure the overhead from any public telemetry such as a Profiler trace or to deduce it by looking at the calls made on the database side. The overhead only happens when Power BI opens a connection to a data source and the result is cached afterwards, so it will only be encountered occasionally and not for every query that is run against your data source. I can say that the overhead can be quite significant in some cases though and can be made worse by other factors such as a lack of available connections or network/gateway issues. Since I have never seen anyone actually use these relationship columns in a DirectQuery model – they are quite handy in Power Query in general though – you should always turn them off when using DirectQuery mode.

[Thanks to Curt Hagenlocher for the information in this post]

Reading Parquet Metadata In Power Query In Power BI

There’s a new M function in Power Query in Power BI that allows you to read the metadata from a Parquet file: Parquet.Metadata. It’s not documented yet and it’s currently marked as “intended for internal use only” but I’ve been told I can blog about it. Here’s an example of how to use it:

let
  Source = Parquet.Metadata(File.Contents("C:\myfile.snappy.parquet"))
in
  Source

…and here’s an example of the output:

This query shows how to expand the record returned by this function into a table:

let
    m = Parquet.Metadata(File.Contents("C:\myfile.snappy.parquet")),
    schema = List.Accumulate(Table.ToRecords(m[Schema]), [], (x, y) => if y[NumChildren] = null then Record.AddField(x, y[Name], y[LogicalType] ?? y[ConvertedType]) else x),
    expanded1 = Table.ExpandTableColumn(m[RowGroups], "Columns", {"MetaData"}),
    renamed1 = Table.RenameColumns(expanded1, {{"Ordinal", "RowGroup"}, {"TotalCompressedSize", "RowGroupCompressedSize"}, {"TotalByteSize", "RowGroupSize"}}),
    expanded2 = Table.ExpandRecordColumn(renamed1, "MetaData", {"Type", "Encodings", "PathInSchema", "Codec", "NumValues", "TotalUncompressedSize", "TotalCompressedSize", "KeyValueMetadata", "DataPageOffset", "IndexPageOffset", "DictionaryPageOffset", "Statistics", "EncodingStats"}),
    renamed2 = Table.RenameColumns(expanded2, {{"Type", "PhysicalType"}}),
    added1 = Table.AddColumn(renamed2, "Column", each Text.Combine([PathInSchema])),
    added2 = Table.AddColumn(added1, "Cardinality", each [Statistics][DistinctCount]),
    added3 = Table.AddColumn(added2, "NullCount", each [Statistics][NullCount]),
    added4 = Table.AddColumn(added3, "DictionarySize", each [DataPageOffset] - [DictionaryPageOffset]),
    added5 = Table.AddColumn(added4, "LogicalType", each Record.FieldOrDefault(schema, [Column], null)),
    selected = Table.SelectColumns(added5, {"RowGroup", "Column", "Codec", "NumValues", "Cardinality", "NullCount", "TotalCompressedSize", "TotalUncompressedSize", "DictionarySize", "PhysicalType", "LogicalType"})
in
    selected

As you can see this gives you all kinds of useful information about a Parquet file such as the schema, the compression type used, column cardinality and so on.

[Thanks to Curt Hagenlocher for the tip-off and the query above]

Understanding The “Evaluation resulted in a stack overflow” Error In Power Query In Excel And Power BI

If you’re writing your own M code in Power Query in Power BI or Excel you may run into the following error:

Expression.Error: Evaluation resulted in a stack overflow and cannot continue.

If you’re a programmer you probably know what a stack overflow is; if you’re not you might search for the term, find this Wikipedia article and still have no clue what has happened. Either way it may still be difficult to understand why you’re running into it in Power Query. To explain let’s see some examples.

First, a really simple one. You can write recursive functions – functions that call themselves – in Power Query, although as you might suspect that is generally a bad idea because they are difficult to write, difficult to understand, slow and may result in the error above. Consider the following function called MyRecursiveFunction which references a parameter of type number called MaxDepth:

(counter as number) as number=>
if counter>MaxDepth then counter else @MyRecursiveFunction(counter+1)

The function takes a number and calls itself, passing in a value one greater than the number that was passed to it, until the number passed is greater than MaxDepth. So, if the value of MaxDepth is 3 and you call the function and pass it the value of 1, like so:

MyRecursiveFunction(1)

…then you’ll get the value 4 back:

So far so good. But how long can a function in M go on calling itself? As you can imagine, it’s not forever. So, when you hit the point where the function can’t go on calling itself then you get the error above. For example if you try setting MaxDepth to 100000 then you’ll get the stack overflow error above instead of 100001:

As a result it’s almost always a good idea to avoid recursion in Power Query and use functions like List.Transform, List.Generate or List.Accumulate to achieve the same result. A great example of this is shown in the Power Query custom connector documentation in the section on handling APIs that return results broken up into pages with the Table.GenerateByPage code sample.
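
As a sketch of the idea, the recursive counter above could be rewritten with List.Generate so that nothing calls itself. This is only one possible rewrite, and MaxDepth is declared inline here rather than as a parameter to keep the example self-contained.

```powerquery
let
  // Inline stand-in for the MaxDepth parameter
  MaxDepth = 100000,
  // Generate 1, 2, 3... and stop after the first value greater than MaxDepth
  Counters = List.Generate(
    () => 1,
    each _ <= MaxDepth + 1,
    each _ + 1
  ),
  // The last value generated is the first one greater than MaxDepth
  Result = List.Last(Counters)
in
  Result
```

This returns 100001, the value the recursive version was meant to return, without the call stack growing as the counter increases.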

You may still get this error even when you’re not explicitly using recursion though, as a result of lazy evaluation. Consider the following query which uses List.Accumulate to generate a table with a given number of rows:

let
  //define a table with one row and one column called x
  MyTable = #table(type table [x = number], {{1}}), 
  //specify how many rows we want in our output table
  NumberOfTimes = 3, 
  //Use List.Accumulate to create a table with this number of rows
  //By calling Table.Combine
  CombineAllTables = List.Accumulate(
    {1 .. NumberOfTimes}, 
    null, 
    (state, current) => if current = 1 then MyTable else Table.Combine({state, MyTable})
  )
in
  CombineAllTables

Here’s the output, a table with three rows:

But how does it get this result? With NumberOfTimes=3 you can think of this query lazily building up an M expression something like this:

Table.Combine({Table.Combine({MyTable, MyTable}), MyTable})

…which, once List.Accumulate has finished, suddenly all has to be evaluated and turned into a single table. Imagine how much nesting of Table.Combine there would be if NumberOfTimes was a much larger number though! And indeed, it turns out that you can’t make lots and lots of calls to Table.Combine without running into a stack overflow. So if NumberOfTimes=100000 like so:

let

  //define a table with one row and one column called x
  MyTable = #table(type table [x = number], {{1}}), 
  //specify how many rows we want in our output table
  NumberOfTimes = 100000, 
  //Use List.Accumulate to create a table with this number of rows
  //by calling Table.Combine
  CombineAllTables = List.Accumulate(
    {1 .. NumberOfTimes}, 
    null, 
    (state, current) => if current = 1 then MyTable else Table.Combine({state, MyTable})
  )
in
  CombineAllTables

…then, after a minute or so, you get the “Evaluation resulted in a stack overflow and cannot continue” error again.

Rewriting the query so you build up the list of tables first and only call Table.Combine once at the end avoids the problem and is much faster:

let
  //define a table with one row and one column called x
  MyTable = #table(type table [x = number], {{1}}), 
  //specify how many rows we want in our output table
  NumberOfTimes = 100000, 
  //create a table with NumberOfTimes rows
  CombineAllTables = Table.Combine(List.Repeat({MyTable}, NumberOfTimes))
in
  CombineAllTables

It’s also possible to solve the problem by forcing eager evaluation inside List.Accumulate but this is extremely tricky: there’s an example of this on Gil Raviv’s blog here.

Power Query Nested Data Types In Excel

A year ago support for nested data types in Excel was announced on the Excel blog, but the announcement didn’t have much detail about what nested data types are and the docs are quite vague too. I was recently asked how to create a nested data type and while it turns out to be quite easy, I thought it would be good to write a short post showing how to do it.

Let’s say you have a Power Query query that returns details of the different types of fruit that you sell in your shop:

Let’s also say that the last three columns in this table (Grower, Address and Specialty) all relate to the farmer that supplies you with this fruit. Now you could create one Excel data type with all these columns in, but nested data types allow you to create data types inside data types, so in this case you can create a data type specifically for these three columns relating to the farmer and then nest it inside the main data type.

To do this, select these three columns in the Power Query Editor and click the Create Data Type button on the Transform tab in the ribbon:

Give the data type a name, in this case Farmer, in the Create Data Type dialog:

At this point you’ll have a query that returns a table where one of the columns, Farmer, contains a data type:

Finally, you then select all the columns in this table, including the Farmer column, and click the Create Data Type button again to create another new data type, this time called Product:

Here’s what you’ll see in the Power Query Editor once you’ve done this:

And here’s the result in the Power Query Editor:

Once you’ve loaded this query to the worksheet you can explore the nested data type via the card popup:

Or you can access the data in the nested data type using a formula. For example, in the animated gif above the cell A2 contains the data type for Apples, so the formula

=A2.Farmer.Address

…returns the address of the farmer who grows apples.

Alternatively, you can use the Excel FIELDVALUE function to get the same result:

=FIELDVALUE(FIELDVALUE(A2, "Farmer"), "Address")

Incremental Refresh On Delta Tables In Power BI

One of the coolest features in Fabric is Direct Lake mode, which allows you to build Power BI reports directly on top of Delta tables in your data lake without having to wait for a semantic model to refresh. However not everyone is ready for Fabric yet so there’s also a lot of interest in the new DeltaLake.Table M function which allows Power Query (in semantic models or dataflows) to read data from Delta tables. If you currently have a serving layer – for example Synapse Serverless or Databricks SQL Warehouse – in between your existing lakehouse and your import mode Power BI semantic models then this new function could allow you to remove it, to reduce complexity and cut costs. This will only be a good idea, though, if refresh performance isn’t impacted and incremental refresh can be made to work well.

So is it possible to get good performance from DeltaLake.Table with incremental refresh? Query folding isn’t possible using this connector because there’s no database to query: a Delta table is just a folder with some files in. But query folding isn’t necessary for incremental refresh to work well: what’s important is that when Power Query filters a table by the datetime column required for incremental refresh, that query is significantly faster than reading all the data from that table. And, as far as I can see from the testing I’ve done, because of certain performance optimisations within DeltaLake.Table it should be possible to use incremental refresh on a Delta table successfully.

There are three factors that influence the performance of Power Query when querying a Delta table:

The internal structure of the Delta table, in particular whether it is partitioned or not
The implementation of the connector, i.e. the DeltaLake.Table function
The M code you write in the queries used to populate the tables in your semantic model

There’s not much you can do about #2 – performance is, I think, good enough right now although there are a lot of optimisations that will hopefully come in the future – but #1 and #3 are definitely within your control as a developer and making the right choices makes all the difference.

Here’s what I did to test incremental refresh performance. First, I used a Fabric pipeline to load the NYC Taxi sample data into a table in a Lakehouse (for the purposes of this exercise a Fabric Lakehouse will behave the same as ADLSgen2 storage – I used a Lakehouse because it was easier). Then, in Power BI Desktop, I created an import mode semantic model pointing to the NYC taxi data table in the Lakehouse and configured incremental refresh. Here’s the M code for that table:

let
  Source = AzureStorage.DataLake(
    "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/unpartitionednyc/", 
    [HierarchicalNavigation = true]
  ), 
  ToDelta = DeltaLake.Table(Source), 
  #"Filtered Rows" = Table.SelectRows(
    ToDelta, 
    each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
  )
in
  #"Filtered Rows"

Here’s the incremental refresh dialog:

I then published the semantic model and refreshed it via the Enhanced Refresh API from a notebook (Semantic Link makes this so much easier) using an effective date of 8th December 2013 to get a good spread of data. I used Phil Seamark’s new, notebook-based version of his refresh visualisation tool to see how long each partition took during an initial refresh:

The refresh took just over 30 minutes.

Next, using Spark SQL, I created a copy of the NYC taxi data table in my Lakehouse with a new datetime column added that contains the pickup datetime truncated to the day, and I then partitioned the table by that new column (called PickupDate here):

CREATE TABLE PartitionedByDateNYC
USING delta
PARTITIONED BY (PickupDate)
AS
SELECT *, date_trunc("Day", lpepPickupDateTime) as PickupDate
FROM NYCIncrementalRefreshTest.nyctaxi_raw

I created a copy of my semantic model, pointed it to the new table and reconfigured the incremental refresh to filter on the newly-created PickupDate column:

let
  Source = AzureStorage.DataLake(
    "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionedbydatenyc/", 
    [HierarchicalNavigation = true]
  ), 
  ToDelta = DeltaLake.Table(Source), 
  #"Filtered Rows" = Table.SelectRows(
    ToDelta, 
    each [PickupDate] >= RangeStart and [PickupDate] < RangeEnd
  )
in
  #"Filtered Rows"

…and refreshed again. This time the refresh took about 26 seconds.

Half an hour to 26 seconds is a big improvement and it’s because the DeltaLake.Table function is able to perform partition elimination: the partitions in the semantic model align to one or more partitions in the Delta table, so when each partition in the semantic model is refreshed Power Query only needs to read data from the partitions in the Delta table that contain the relevant data. This only happens because the filter in the Power Query query using the RangeStart and RangeEnd parameters is on the same column that is used to partition the Delta table.
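The idea behind partition elimination can be sketched in a few lines of Python. This is a toy model, not how the DeltaLake.Table function is implemented: the dictionary stands in for a Delta table partitioned by PickupDate, and the names delta_partitions and read_range and all of the data are invented for the example.

```python
from datetime import datetime, timedelta

# Hypothetical stand-in for a Delta table partitioned by PickupDate:
# each key is a partition value, each value holds the rows in that partition.
delta_partitions = {
    datetime(2013, 12, 1) + timedelta(days=d): [f"trip-{d}-{i}" for i in range(3)]
    for d in range(10)
}


def read_range(partitions, range_start, range_end):
    """Return the rows in [range_start, range_end) and how many partitions were read."""
    rows = []
    partitions_read = 0
    for pickup_date, partition_rows in partitions.items():
        # The filter is on the partition column, so a partition that falls
        # outside the range can be skipped without reading any of its rows.
        if range_start <= pickup_date < range_end:
            partitions_read += 1
            rows.extend(partition_rows)
    return rows, partitions_read


# One day-level refresh partition: 5th December 2013
rows, partitions_read = read_range(
    delta_partitions, datetime(2013, 12, 5), datetime(2013, 12, 6)
)
print(len(rows), "rows read from", partitions_read, "of", len(delta_partitions), "partitions")
```

If the filter were on a column that is not the partition column, every partition would have to be read and filtered row by row, which is the situation in the first test.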

In my final test I partitioned my Delta table by month, like so:

CREATE TABLE PartitionedNYC
USING delta
PARTITIONED BY (PickupYearMonth)
AS
SELECT *, (100*date_part('YEAR', lpepPickupDateTime)) + date_part('Months', lpepPickupDateTime) as PickupYearMonth
FROM NYCIncrementalRefreshTest.nyctaxi_raw

The challenge here is that:

The new PickupYearMonth column is an integer column, not a datetime column, so it can’t be used for an incremental refresh filter in Power Query
Power BI incremental refresh creates partitions at the year, quarter, month and day granularities, so a filter on the month alone is too coarse to return the right rows for day-level partitions

I solved this problem in my Power Query query by calculating the month from the RangeStart and RangeEnd parameters, filtering the table by the PickupYearMonth column (to get partition elimination), stopping any further folding using the Table.StopFolding function and then finally filtering on the same datetime column I used in my first test:

let
  Source = AzureStorage.DataLake(
    "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionednyc/",
    [HierarchicalNavigation = true]
  ),
  ToDelta = DeltaLake.Table(Source),
  YearMonthRangeStart = (Date.Year(RangeStart) * 100) + Date.Month(RangeStart),
  YearMonthRangeEnd = (Date.Year(RangeEnd) * 100) + Date.Month(RangeEnd),
  FilterByPartition = Table.StopFolding(
    Table.SelectRows(
      ToDelta,
      each [PickupYearMonth] >= YearMonthRangeStart and [PickupYearMonth] <= YearMonthRangeEnd
    )
  ),
  #"Filtered Rows" = Table.SelectRows(
    FilterByPartition,
    each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
  )
in
  #"Filtered Rows"
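To make the two-stage filter easier to follow, here is the same logic as a small Python sketch. The column names come from the query above; the function names year_month_key, make_row and rows_for_partition and the sample rows are made up for illustration.

```python
from datetime import datetime


def year_month_key(value: datetime) -> int:
    """Mirror of (Date.Year(x) * 100) + Date.Month(x) in the M query."""
    return value.year * 100 + value.month


def make_row(pickup: datetime) -> dict:
    """Build a sample row with the derived integer partition column."""
    return {"lpepPickupDatetime": pickup, "PickupYearMonth": year_month_key(pickup)}


def rows_for_partition(rows, range_start, range_end):
    """Coarse filter on the month key first, then the exact datetime filter."""
    key_start = year_month_key(range_start)
    key_end = year_month_key(range_end)
    # Stage 1: inclusive at both ends because range_end can fall inside a month
    coarse = [r for r in rows if key_start <= r["PickupYearMonth"] <= key_end]
    # Stage 2: the usual half-open incremental refresh filter
    return [r for r in coarse if range_start <= r["lpepPickupDatetime"] < range_end]


rows = [
    make_row(datetime(2013, 11, 30, 23, 59)),
    make_row(datetime(2013, 12, 8, 9, 30)),
    make_row(datetime(2013, 12, 9, 0, 0)),
    make_row(datetime(2014, 1, 1, 0, 0)),
]

# A day-level partition for 8th December 2013
selected = rows_for_partition(rows, datetime(2013, 12, 8), datetime(2013, 12, 9))
print(year_month_key(datetime(2013, 12, 8)), len(selected))
```

The first stage only narrows the data down to whole months, which is what allows partition elimination; the second stage is still needed so that each semantic model partition gets exactly its own rows.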

Interestingly this table refreshed even faster: it took only 18 seconds.

This might just be luck, or it could be because the larger partitions resulted in fewer calls back to the storage layer. The AzureStorage.DataLake M function requests data 4MB at a time by default and this could result in more efficient data retrieval for the data volumes used in this test. I didn’t get round to testing if using non-default options on AzureStorage.DataLake improved performance even more (see here for more details on earlier testing I did with them).
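A back-of-the-envelope Python calculation shows why fewer, larger files could mean fewer requests when data is fetched 4MB at a time. The file sizes below are invented, and the model ignores reads of the Delta log, parallelism and compression, so treat it as a way of thinking about the problem rather than a description of what the connector does.

```python
import math

MB = 1024 * 1024
CHUNK_BYTES = 4 * MB  # the default request size mentioned above


def requests_needed(file_sizes_bytes):
    """Every file needs at least one request, plus one per further 4MB chunk."""
    return sum(max(1, math.ceil(size / CHUNK_BYTES)) for size in file_sizes_bytes)


# Hypothetical layouts for one month of data
daily_files = [1 * MB] * 30   # thirty 1MB files, one per daily partition
monthly_file = [30 * MB]      # a single 30MB file for the monthly partition

print(requests_needed(daily_files), requests_needed(monthly_file))
```

In this made-up case the daily layout needs 30 requests and the monthly layout needs 8, because small files cannot fill a 4MB request.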

To sum up, based on these tests it looks like incremental refresh can be used effectively in import mode semantic models with Delta tables and the DeltaLake.Table function so long as you partition your Delta table and configure your Power Query queries to filter on the partition column. I would love to hear what results you get if you test this in the real world so please let me know by leaving a comment.

Getting Different Versions Of Data With Value.Versions In Power Query

Something I mentioned in my recent post on the Fabric blog about the new DeltaLake.Table M function was the fact that you can get different versions of the data held in a Delta table using the Value.Versions M function. In fact, Value.Versions is the way to access different versions of data in any source that has this concept – so long as Power Query has added support for doing so. The bad news is that, at least at the time of writing, apart from the DeltaLake connector there’s only one other source where Value.Versions can be used in this way: the connector for Fabric Lakehouses.

Here’s how you can access the data in a table using the Lakehouse.Contents M function:

let
  Source          = Lakehouse.Contents(), 
  SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data], 
  SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data], 
  SelectTable     = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data]
in
  SelectTable

As with DeltaLake.Table, you can get a table with all the different versions available using Value.Versions:

let
  Source          = Lakehouse.Contents(),
  SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
  SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
  SelectTable     = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
  ShowVersions    = Value.Versions(SelectTable)
in
  ShowVersions

Version 0 is the earliest version; the latest version of the data is the version with the highest number and this version can also be accessed from the row with the version number null. The nested values in the Data column are tables which give you the data for that particular version number. So, for example, if I wanted to get the data for version 2 I could click through on the nested value in the Data column in the row where the Version column contained the value 2. Here’s the M code for this:

let
  Source          = Lakehouse.Contents(),
  SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
  SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
  SelectTable     = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
  ShowVersions    = Value.Versions(SelectTable),
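  // Look up the row whose Version column contains 2, rather than relying on row position
  Data            = ShowVersions{[Version = 2]}[Data]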
in
  Data
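The relationship between the numbered versions and the row with the null version number can be modelled with a short Python sketch. The list below is a hypothetical stand-in for the table returned by Value.Versions: the contents, the number of versions and the position of the null row are all invented, and get_version is a made-up helper.

```python
# Hypothetical stand-in for the table returned by Value.Versions
versions = [
    {"Version": None, "Data": "data v3"},  # null also points at the latest version
    {"Version": 0, "Data": "data v0"},     # version 0 is the earliest
    {"Version": 1, "Data": "data v1"},
    {"Version": 2, "Data": "data v2"},
    {"Version": 3, "Data": "data v3"},     # the highest number is the latest
]


def get_version(table, number):
    """Return the Data value of the single row whose Version equals number."""
    matches = [row["Data"] for row in table if row["Version"] == number]
    if len(matches) != 1:
        raise KeyError(f"Expected exactly one row for version {number!r}")
    return matches[0]


print(get_version(versions, 2))
print(get_version(versions, None) == get_version(versions, 3))
```

Matching on the Version value rather than on the position of the row means the result does not depend on the order in which the rows are returned.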

The Lakehouse connector uses the TDS Endpoint of the Lakehouse to get data by default, as in the first code snippet above, but if you use Value.Versions to get specific versions then this isn’t (as yet) possible so it will use a slower method to get data and performance may suffer.

Last of all, you can get the version number of the data you’re looking at using the Value.VersionIdentity function. If you’re looking at the latest version of the data then Value.VersionIdentity will return null:

let
  Source          = Lakehouse.Contents(),
  SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
  SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
  SelectTable     = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
  GetVersion      = Value.VersionIdentity(SelectTable)
in
  GetVersion

If you are looking at version 2 of the data then Value.VersionIdentity will return 2:

let
  Source           = Lakehouse.Contents(),
  SelectWorkspace  = Source{[workspaceId = "insertworkspaceid"]}[Data],
  SelectLakehouse  = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
  SelectTable      = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
  ShowVersions     = Value.Versions(SelectTable),
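  // Look up the row whose Version column contains 2, rather than relying on row position
  GetVersion2      = ShowVersions{[Version = 2]}[Data],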
  GetVersionNumber = Value.VersionIdentity(GetVersion2)
in
  GetVersionNumber
