Increasing Refresh Parallelism – And Performance – In Power BI Premium

One of the factors that affects dataset refresh performance in Power BI is the number of objects that are refreshed in parallel. At the time of writing there is a default maximum of six objects that can be refreshed in parallel in Power BI Premium but this can be increased by using custom TMSL scripts to run your refresh.

A few months ago I blogged about how partitioning a table in Power BI Premium can speed up refresh performance. The dataset I created for that post contains a single table with nine partitions, each of which is connected to a CSV file stored in ADLS Gen2 storage. Using the technique described by Phil Seamark here I was able to visualise the amount of parallelism when the dataset is refreshed in a Premium Per User workspace:

In this case I started the refresh from the Power BI portal so the default parallelism settings were used. The y axis on this graph shows there were six processing slots available, which means that six objects could be refreshed in parallel – and because there are nine partitions in the only table in the dataset, this in turn meant that some slots had to refresh two partitions. Overall the dataset took 33 seconds to refresh.
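To make the "some slots had to refresh two partitions" point concrete, here is a minimal Python sketch of a toy scheduler that hands each partition to whichever slot frees up first. It is not how the Power BI engine actually schedules work; the function name and the ten-second partition durations are hypothetical, and the model ignores data source throughput and any other overhead.

```python
import heapq


def simulate_refresh(partition_seconds, slots):
    """Return total elapsed time when each partition goes to the earliest free slot."""
    # Each heap entry is the time at which one slot next becomes free.
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    for seconds in partition_seconds:
        start = heapq.heappop(free_at)            # earliest available slot
        heapq.heappush(free_at, start + seconds)  # slot is busy until then
    return max(free_at)


# Hypothetical workload: nine partitions that each take 10 seconds to refresh.
partitions = [10.0] * 9
six_slots = simulate_refresh(partitions, slots=6)
nine_slots = simulate_refresh(partitions, slots=9)
print(six_slots, nine_slots)
```

With six slots, three of the slots have to process a second partition after finishing their first, so the toy refresh needs two rounds; with nine slots everything fits in one round. Real refreshes will not scale this cleanly.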

However, if you connect from SQL Server Management Studio to the dataset via the workspace’s XMLA Endpoint (it’s very similar to how you connect Profiler, something I blogged about here) you can construct a TMSL script to refresh these partitions with more parallelism. You can generate a TMSL script by right-clicking on your table in the Object Explorer pane and selecting Partitions:

…then, in the Partitions dialog, selecting all the partitions and clicking the Process button (in this case ‘process’ means the same thing as ‘refresh’):

…then, on the Process Partition(s) dialog, making sure all the partitions are selected and selecting Process Full from the Mode dropdown:

…and then clicking the Script button and selecting Script Action to New Query Window:

This generates a new TMSL script with a Refresh command that refreshes all the partitions:

This needs one more change to enable more parallelism though: it needs to be wrapped in a TMSL Sequence command that contains the maxParallelism property. Here’s the snippet that goes before the Refresh command (you also need to close the operations array and the braces after the Refresh command):

{
"sequence":
{
"maxParallelism": 9,
"operations": [

Executing this command refreshed all nine partitions in parallel in nine slots:

This refresh took 25 seconds – eight seconds faster than the original refresh with six slots.

As you can see, increasing the number of refresh slots in this way can have a big impact on refresh performance – although, of course, you need to have enough tables or partitions to take advantage of any parallelism and you also need to be sure that your data source can handle increased parallelism. You can try setting maxParallelism to any value up to 30 although no guarantees can be made about how many slots are available at any given time. It’s also worth pointing out that there are scenarios where you may want to set maxParallelism to a value that is lower than the default of six, for example to reduce the load on data sources that can’t handle many parallel queries.
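If you would rather generate the whole wrapped script than edit it by hand, something like the following Python sketch could do it. It builds a TMSL Sequence command containing a single Refresh command and prints it as JSON, ready to paste into a query window. The database, table and partition names are hypothetical placeholders, and the helper function is my own invention rather than anything SSMS provides.

```python
import json


def build_parallel_refresh(database, table, partitions, max_parallelism):
    """Wrap a full refresh of the given partitions in a TMSL sequence command."""
    refresh = {
        "refresh": {
            "type": "full",
            "objects": [
                {"database": database, "table": table, "partition": name}
                for name in partitions
            ],
        }
    }
    return {
        "sequence": {
            "maxParallelism": max_parallelism,
            "operations": [refresh],
        }
    }


# Hypothetical object names: substitute the ones from your own dataset.
partition_names = [f"Partition{i}" for i in range(1, 10)]
script = build_parallel_refresh("MyDataset", "MyTable", partition_names, 9)
tmsl_text = json.dumps(script, indent=2)
print(tmsl_text)
```

Passing a value lower than six for max_parallelism gives you a script for the opposite scenario, where you want to go easier on the data source.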

[Thanks to Akshai Mirchandani for the information in this post]

Power BI/Power Query And Nullable Columns

Recently I’ve been asked by colleagues with various different types of performance problems why Power BI is generating SQL in a particular way, and the answer has been the presence of nullable columns in the underlying database – whether it’s SQL Server, Snowflake or Databricks. Now I’m not a DBA or any kind of database tuning expert so I can’t comment on why a SQL query performs the way it does on any given platform, but what I can do is show you two examples of how the presence of nullable columns changes the way Power BI and Power Query generate SQL.

Consider the following table in a SQL Server database with a single, integer column that does not allow null values:

If you connect to this table in DirectQuery mode, drag the MyNumber field into a card in a Power BI report and select the Distinct Count aggregation type:

…here’s the TSQL that is generated:

SELECT 
COUNT_BIG(DISTINCT [t0].[MyNumber])
 AS [a0]
FROM 
(
(
select [$Table].[MyNumber] as [MyNumber]
from [dbo].[NotNullableColumn] as [$Table]
)
)
 AS [t0] 

Now if you do the same thing with a table that is identical in all respects but where the MyNumber column does allow null values:

…here’s the TSQL that Power BI generates:

SELECT 
(COUNT_BIG(DISTINCT [t1].[MyNumber]) 
+ MAX(CASE WHEN [t1].[MyNumber] IS NULL THEN 1 ELSE 0 END))
 AS [a0]
FROM 
(
(
select [$Table].[MyNumber] as [MyNumber]
from [dbo].[NullableColumn] as [$Table]
)
)
 AS [t1] 

Notice the extra code in the third line of this second query that has been added to handle the possible presence of null values.
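To see what that extra expression does, here is a small sketch that runs the same pattern against an in-memory SQLite database from Python. SQLite stands in for SQL Server purely so the example is self-contained (so COUNT replaces COUNT_BIG), and the sample rows are made up. The point is that COUNT(DISTINCT ...) skips nulls, so the MAX(CASE ...) term adds one if at least one null is present.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NullableColumn (MyNumber INTEGER NULL)")
# Made-up sample data: two distinct numbers, one duplicate and one null.
conn.executemany(
    "INSERT INTO NullableColumn (MyNumber) VALUES (?)",
    [(1,), (2,), (2,), (None,)],
)

# A plain distinct count ignores the null row.
plain_count = conn.execute(
    "SELECT COUNT(DISTINCT MyNumber) FROM NullableColumn"
).fetchone()[0]

# The pattern from the generated query counts null as one extra value.
null_aware_count = conn.execute(
    """
    SELECT COUNT(DISTINCT MyNumber)
         + MAX(CASE WHEN MyNumber IS NULL THEN 1 ELSE 0 END)
    FROM NullableColumn
    """
).fetchone()[0]

print(plain_count, null_aware_count)
conn.close()
```

The second result is one higher than the first because the null is counted as a distinct value in its own right.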

It’s not just when you’re using DirectQuery mode that you can see a difference. Let’s say you’re using Import mode and you take each of these tables and join them to themselves in the Power Query Editor like so:

Here’s the M code for this query:

let
  Source = Sql.Databases("localhost"),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NotNullableColumn = FoldingTest
    {
      [
        Schema = "dbo",
        Item   = "NotNullableColumn"
      ]
    }
    [Data],
  #"Merged Queries" = Table.NestedJoin(
    dbo_NotNullableColumn,
    {"MyNumber"},
    dbo_NotNullableColumn,
    {"MyNumber"},
    "dbo_NotNullableColumn",
    JoinKind.Inner
  ),
  #"Expanded dbo_NotNullableColumn"
    = Table.ExpandTableColumn(
    #"Merged Queries",
    "dbo_NotNullableColumn",
    {"MyNumber"},
    {"dbo_NotNullableColumn.MyNumber"}
  )
in
  #"Expanded dbo_NotNullableColumn"

Joining the table with the not nullable column to itself folds and results in the following TSQL query being generated:

select [$Outer].[MyNumber] as [MyNumber],
    [$Inner].[MyNumber2] as [dbo_NotNullableColumn.MyNumber]
from [dbo].[NotNullableColumn] as [$Outer]
inner join 
(
    select [_].[MyNumber] as [MyNumber2]
    from [dbo].[NotNullableColumn] as [_]
) as [$Inner] on ([$Outer].[MyNumber] = [$Inner].[MyNumber2])

If you do the same thing with the table with the nullable column, here’s the TSQL that is generated:

select [$Outer].[MyNumber] as [MyNumber],
    [$Inner].[MyNumber2] as [dbo_NullableColumn.MyNumber]
from [dbo].[NullableColumn] as [$Outer]
inner join 
(
    select [_].[MyNumber] as [MyNumber2]
    from [dbo].[NullableColumn] as [_]
) as [$Inner] on ([$Outer].[MyNumber] = [$Inner].[MyNumber2] 
or [$Outer].[MyNumber] is null and [$Inner].[MyNumber2] is null)

Once again you can see how the SQL generated for an operation on a nullable column is different to the SQL generated for an operation on a non-nullable column. Whether one SQL query performs significantly better or worse than the other is something you need to test.
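The effect of the extra join condition can be reproduced outside Power Query. The sketch below uses an in-memory SQLite database from Python as a stand-in for SQL Server, with made-up rows, and compares a self-join on plain equality with a self-join that also matches null to null, as in the generated query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NullableColumn (MyNumber INTEGER NULL)")
# Made-up sample data: two numbers and one null.
conn.executemany(
    "INSERT INTO NullableColumn (MyNumber) VALUES (?)",
    [(1,), (2,), (None,)],
)

# In SQL, NULL = NULL is not true, so the null row finds no match.
equality_only = conn.execute(
    """
    SELECT COUNT(*)
    FROM NullableColumn AS o
    INNER JOIN NullableColumn AS i
        ON o.MyNumber = i.MyNumber
    """
).fetchone()[0]

# Adding the IS NULL test on both sides lets the null row join to itself.
null_matching = conn.execute(
    """
    SELECT COUNT(*)
    FROM NullableColumn AS o
    INNER JOIN NullableColumn AS i
        ON (o.MyNumber = i.MyNumber
            OR o.MyNumber IS NULL AND i.MyNumber IS NULL)
    """
).fetchone()[0]

print(equality_only, null_matching)
conn.close()
```

The second join returns one more row than the first, which is the row where null is matched to null.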

The last thing to say is that there is no supported way in Power BI or Power Query to treat a nullable column as if it was not nullable. If you have a nullable column and the extra SQL to handle those nulls results in a performance problem then your only option is to alter the design of your table and make the column not nullable.

What The New Visio Web App And Licensing Announcement Means For Power BI

There was an interesting announcement today regarding Visio:

https://www.microsoft.com/en-us/microsoft-365/blog/2021/06/09/bringing-visio-to-microsoft-365-diagramming-for-everyone/

In summary there will soon be a lightweight, web-based version of Visio available to anyone with a Microsoft 365 Business, Office 365 E1/E3/E5, F3, A1, A3 or A5 subscription. Previously Visio was not part of the main M365 plans and was only available as a separate purchase.

So what? As a Power BI user, why should I care? Well the Visio custom visual for Power BI has been around a long time now and it’s really powerful. Unfortunately it’s very rarely used because Power BI developers don’t usually have Visio licences – but this is exactly what is about to change. With these licensing changes pretty much everyone who uses Power BI will have access to the new lightweight Visio web app. It’s not as sophisticated as desktop Visio but I’ll be honest, I’m no Visio expert and it’s good enough for me and really easy to use. As a result this is going to unlock the power of the Power BI Visio visual for a much, much larger number of people!

To get an idea of what you can do with the Power BI Visio visual, this video is a good place to start:

A Look At Lobe – A Free, Easy-To-Use Tool For Training Machine Learning Models

A few months ago I heard about a new tool from Microsoft called Lobe which makes it easy to train machine learning models. It’s nothing to do with Power BI but I find anything to do with self-service data analytics interesting, and when I finally got round to playing with it today I thought it was so much fun that it deserved a blog post.

You can download it and learn more at https://www.lobe.ai/ and there’s a great ten minute video describing how to use it here:

The most impressive thing about it is not what it does but how it does it: a lot of tools claim to make machine learning easy for non-technical users but Lobe really is easy to use. My AI/ML knowledge is very basic but I got up and running with it extremely quickly.

To test it out I downloaded lots of pictures of English churches and trained a model to detect whether the church had a tower or a spire. After I labelled the pictures appropriately:

…Lobe was able to train the model:

I could test it inside the tool. The model was able to tell whether a church had a tower:

…or a spire:

…very reliably!

If I have one criticism it’s that when you want to use your model things get a lot more technical, at least compared to something like AI Builder for Power Apps and Power Automate, but I guess that’s because it is just a tool for training models. There have been some recent improvements here though (see this blog post) and Lobe does provide a local API for testing purposes that can be consumed in Power BI with some custom M code.

Here’s an example of how to call the local API in Power Query:

let
  Source = Folder.Files("C:\Churches"),
  #"Removed Other Columns"
    = Table.SelectColumns(
    Source,
    {"Content", "Name"}
  ),
  #"Added Custom" = Table.AddColumn(
    #"Removed Other Columns",
    "CallAPI",
    each Text.FromBinary(
      Web.Contents(

        //Insert Lobe Connect URL here                              
        "http://localhost...",
        [
          Content = Json.FromValue(
            [
              image = Binary.ToText(
                [Content],
                BinaryEncoding.Base64
              )
            ]
          ),
          Headers = [
            #"Content-Type"
              = "application/json"
          ]
        ]
      )
    )
  ),
  #"Parsed JSON"
    = Table.TransformColumns(
    #"Added Custom",
    {{"CallAPI", Json.Document}}
  ),
  #"Expanded CallAPI"
    = Table.ExpandRecordColumn(
    #"Parsed JSON",
    "CallAPI",
    {"predictions"},
    {"predictions"}
  ),
  #"Expanded predictions"
    = Table.ExpandListColumn(
    #"Expanded CallAPI",
    "predictions"
  ),
  #"Expanded predictions1"
    = Table.ExpandRecordColumn(
    #"Expanded predictions",
    "predictions",
    {"label", "confidence"},
    {"label", "confidence"}
  ),
  #"Pivoted Column" = Table.Pivot(
    #"Expanded predictions1",
    List.Distinct(
      #"Expanded predictions1"[label]
    ),
    "label",
    "confidence",
    List.Sum
  ),
  #"Changed Type"
    = Table.TransformColumnTypes(
    #"Pivoted Column",
    {
      {"Tower", type number},
      {"Spire", type number}
    }
  ),
  #"Removed Columns"
    = Table.RemoveColumns(
    #"Changed Type",
    {"Content"}
  )
in
  #"Removed Columns"
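For anyone who finds the M hard to follow, the two key steps (building the request body and reshaping the response) can be sketched in Python. This makes no network call. The response layout is inferred from the columns the M query expands (a predictions list with label and confidence fields), and the sample response, its confidence values and the function names are hypothetical.

```python
import base64
import json


def build_request_body(image_bytes):
    """Encode an image as the M query does: base64 text in an 'image' field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image": encoded})


def pivot_predictions(response_text):
    """Turn the 'predictions' list into a {label: confidence} mapping."""
    document = json.loads(response_text)
    return {p["label"]: p["confidence"] for p in document["predictions"]}


# Hypothetical response text, shaped like the one the M query expects.
sample_response = json.dumps(
    {
        "predictions": [
            {"label": "Tower", "confidence": 0.75},
            {"label": "Spire", "confidence": 0.25},
        ]
    }
)

body = build_request_body(b"not a real image")
scores = pivot_predictions(sample_response)
print(scores)
```

Unlike Table.Pivot with List.Sum, the dictionary here keeps only the last confidence seen for a label, which is fine as long as each label appears once per image.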

You can export models to a variety of other places for production use, including Azure Functions and Azure Machine Learning.

Definitely something to keep an eye on, especially because it will soon be able to do object detection and data classification as well as image classification.

Ten Reasons Why I’m Excited About The New Power BI/Excel Integration Features

My favourite Power BI announcement at the Microsoft Business Applications Summit was, without a doubt, that Excel PivotTables connected to Power BI datasets will very soon work in the browser and not just on the desktop. This is something I have wanted for a long time, way before I joined Microsoft, so this is a feature I have a personal interest in. However I also think it’s an incredibly important step forward for Power BI in general and in this post I’ll outline the reasons why.

Before we carry on please make sure you read this post on the Excel blog which has more details on all of the new Power BI/Excel integration features that are being released. Quick summary: if you’re reading this in late May 2021 you probably won’t have all of this functionality available in your tenant yet but it is coming very soon.

So why exactly am I excited?

It makes Excel a third option for building Power BI reports

Up to now, if you wanted to build Power BI reports and share them with other people online you had two choices: regular Power BI reports and paginated reports. Now Excel gives you a third option: you can upload Power BI-connected Excel workbooks to a Power BI workspace, make them available via a Power BI app, and not only will they be fully interactive but the data in them will also update automatically when the data in your dataset updates.

PivotTables are the best way to explore Power BI data

Why do we need Excel as an option for building reports on data stored in Power BI? The first reason is data exploration. Excel PivotTables are a much better way to explore your data than any other method in Power BI, in my opinion. Why try to recreate an Excel PivotTable using a matrix visual in a regular Power BI report when you can give your users the real thing?

Cube functions also now work in the browser – and they make it easy to design financial reports

The Excel cube functions (CubeMember, CubeValue etc) are, I think, the best-kept secret in Excel. While PivotTables are great for exploring data they aren’t always so great when you want to build highly-formatted reports. The Excel cube functions make it easy to bind individual cells in a worksheet to individual values in your dataset and because they’re just like any other Excel function they allow you to use all of Excel’s formatting and charting functionality. This then makes it possible to build certain types of report, such as financial reports, much more easily. If you want to learn more about them check out this video from Peter Myers – it shows how to use them with Analysis Services but they work just the same when connected to a Power BI dataset.

Organisational data types make it easy to access Power BI data

While the Excel cube functions are very powerful they are also somewhat difficult to use and sometimes suffer from performance problems. The new organisational data types in Excel do something very similar and while they don’t yet have all the features you need to build complex reports they are also a lot easier to understand for most business users.

Excel formulas are easier than DAX for many calculations

Everyone knows DAX can be hard sometimes. However, once you’ve got the data you need from your dataset into Excel using a PivotTable, cube functions or organisational data types you can then do your own calculations on that data using regular Excel formulas. This not only allows business users to add their own calculations easily but for BI professionals it could be the case that an Excel formula is easier to write and faster to execute than the equivalent DAX.

Excel can visualise data in ways that Power BI can’t

Excel is a very mature data visualisation tool and it has some types of chart and some formatting options that aren’t (yet) available in Power BI’s native visuals. One example that springs to mind is that you can add error bars to a bar chart in Excel; another is sparklines, although they are coming to Power BI later this year.

Power Pivot reports will also work in the browser

Even if you don’t have a Power BI Pro licence, if you have a commercial version of Excel you’ll have Power Pivot and the Excel Data Model. And guess what, Power Pivot reports also now work in the browser!

Collaborate in real-time with your colleagues in Excel Online

With Excel reports connected to Power BI stored in OneDrive for Business or a SharePoint document library you get great features for collaboration and co-authoring, so you and your colleagues can analyse data together even if you’re not in the same room.

There’s a lot of other cool stuff happening in Excel right now

The Excel team are on a hot streak at the moment: dynamic arrays, LAMBDAs, LET, the beginnings of Power Query on the Mac and lots more cool new stuff has been delivered recently. If you’re only familiar with the Excel features you learned on a course 20 years ago you’re missing out on some really powerful functionality for data analysis.

Everyone knows Excel!

Last of all, it goes without saying that Excel is by far the most popular tool for working with data in the world. Everyone has it, everyone knows it and everyone wants to use it. As Power BI people we all know how difficult it is to persuade our users to abandon their old Excel habits, so why not meet them halfway? Storing data in a Power BI dataset solves many of the problems of using Excel as a reporting tool: no more manual exports, out-of-date data or multiple versions of the truth. Using Excel to build reports on top of a Power BI dataset may be much easier to learn and accept for many business users – at least at first – than learning how to build reports in Power BI Desktop.

Video: Advanced Analytics Features In Power BI

Following on from my last post, another SQLBits session of mine I wanted to highlight was “Advanced Analytics Features In Power BI”. The subject is a bit outside my normal area of expertise but it’s also one that I don’t think gets enough attention: it’s all the features available in Power BI reports that can help you explain why something happened rather than just what happened. Things I talk about include:

  • Adding forecasts to line charts
  • Symmetry shading, ratio lines and clustering on scatter charts
  • The “Explain the increase” and “Find where this distribution is different” features
  • The Key Influencers and Decomposition Tree visuals
  • Custom visuals such as SandDance
  • Natural language querying with Q&A

Video: Performance Tuning Power BI Dataset Refresh

The team at SQLBits have been publishing all the session recordings from their last (online) conference on their YouTube channel. There’s a lot of great content there to check out and this post is to highlight one of my sessions, “Performance tuning Power BI dataset refresh”.

In this session I look at all of the factors that can influence how long it takes to import data into Power BI and what you can do to make it faster. Topics covered include:

  • Choosing a dataset storage mode
  • The importance of good data modelling
  • How the type of data source you use affects how quickly data can load
  • Ways to measure refresh performance, such as using SQL Server Profiler and Power Query Query Diagnostics
  • Power Query options that can influence refresh times such as disabling data previews
  • Query folding in the Power Query engine
  • Vertipaq engine features that affect refresh, such as calculated columns and calculated tables
  • How dataflows can help refresh performance

Power Query And Power BI Connectivity Announcements At The Microsoft Business Applications Summit

There were a lot of exciting announcements at the Microsoft Business Applications Summit this week but if you only watched the keynotes or read the recap on the Power BI blog you will have missed all the Power Query-related news in the “Data Prep in Power BI, Power Platform and Excel using Power Query” session:

https://mymbas.microsoft.com/sessions/1332f59f-a051-4a06-ae50-8f3185501a88

It covers all the new things that have happened in Power Query over the last few months such as Diagram View and, more importantly, talks about what’s going to happen in the next few months. It’s relatively short but for those of you with no time or patience, here’s a summary of the roadmap announcements:

[BTW “Power Query Online” is the browser-based version of Power Query that is used in Power BI dataflows]

My highlights are:

  • The ability to create a dataflow quickly by uploading a file to Power Query Online without needing to use a gateway to connect to a file on-premises, useful for one-time import scenarios.
  • Multi-value M parameter support – useful for dynamic M parameters and other things I can’t talk about yet 😉
  • The things that Miguel talks about regarding “easier design experiences” for Synapse are kept intentionally vague but it’s worth listening carefully to what he says here!
  • Native SQL support for Snowflake, BigQuery and Redshift – this is really useful for anyone who wants to use DirectQuery with these databases because it will allow you to write your own SQL query and use it as the source of a table, rather than having to use a table or a view.
  • AAD based Single Sign-On support for Redshift and BigQuery (similar to what we have today for Snowflake) will also be very important for DirectQuery, because it means that the identity of the user running the report can be passed back to the database.
  • A dataflows connector for Excel Power Query – which means, at last, you’ll be able to get data from a dataflow direct into Excel. This will make a lot of Excel users very happy, I think: a lot of the time all users want is a table of data dumped to Excel and dataflows will be a great way to provide them with that.

Last of all, the session showcases the great new home for all things Power Query – http://www.powerquery.com/ – which has great resources, newly-updated documentation and a blog. Make sure you check it out!

Power BI, Excel Organisation Data Types And Images

Excel Organisation data types were released last year (see here for details), but did you know that you can now use them to bring images as well as text and numbers into Excel? Here’s a super-simple example that shows you how to do this.

Here’s a table called ‘Fruit With Image’ in a dataset that I have published to the Power BI Service:

Notice that the Data Category property on the Image column, which contains the URL of a picture of each type of fruit listed, is set to “Image URL” (for more details on what this does see here). If I use this table in a Power BI report, I see the name of each fruit and a picture:

So far no surprises. I can also set this table up as a Featured Table (for more details see here) so it can be used as the source for an Organisation Data Type in Excel:

The cool thing is that when I type these fruit names into Excel and mark them as the “Fruit With Image” data type (see here for more details), I can then access the Image field and it will show the image that the URL points to inside a cell:

Measuring DirectQuery Performance In Power BI

If you have a slow DirectQuery report in Power BI one of the first questions you need to ask is how long the SQL queries that Power BI generates take to run. This is a more complicated question to answer than you might think, though, and in this post I’ll explain why.

I happen to have access to some of the famous New York taxi data in a Snowflake database, and in there is a table with trip data that has 173 million rows that I have built a Power BI dataset from. The data and the database used are not really important here though – what is important is that it’s DirectQuery and a large-ish amount of data. Here’s a report page with a single table visual on it, showing passenger count aggregated by the hack license field:

It’s slow, but how slow? Here’s what Performance Analyzer shows when I refresh the table:

The DAX query takes 5.4 seconds but the Direct Query time is only 3.3 seconds – and the numbers don’t seem to add up. Here’s what Profiler captures for the same refresh shown in Performance Analyzer:

This shows there’s a gap of 2 seconds between the DirectQuery End event and the Query End event. What if I paste the DAX query into DAX Studio? Here’s what the Server Timings tab shows:

This is a different query execution to the two examples above, both of which show data for the same execution, which explains why the numbers are slightly different here – but again there seems to be an extra second of stuff happening and DAX Studio suggests that it’s in the Formula Engine.

So what is going on? The answer lies in understanding what the DirectQuery End Profiler event actually measures: it’s the amount of time between the Analysis Services engine handing a query over to the Power Query engine and the Analysis Services engine receiving the first row in the resultset back, including the time taken for the Power Query engine to fold the query.

Therefore if it takes a long time to get all the rows in the resultset then that could explain what’s going on here. Unfortunately there’s no way of knowing from Profiler events how long this takes – but there is another way. Going back to Performance Analyzer, if you export the data from it to JSON (by clicking the Export button) and load it into Power Query, you can see more detail about a DirectQuery query execution. Here’s the data from the first execution above:

[There’s a very good paper documenting what’s in the Performance Analyzer JSON file here]

Looking at the record in the metrics column for the Execute Direct Query event you can see the same 3.2 second duration shown above in Profiler. Notice that there are two other metrics here as well: RowsRead, which is the total number of rows returned by the resultset; and DataReadDuration, which is the amount of time to read these rows after the first row has been received plus some other Analysis Services Engine operations such as encoding of column values, joining with unpushed semijoins, projections of aggregations such as Average and saving the resultset to the in-memory cache. In this case the SQL query has returned 43191 rows and this takes 1.95 seconds – which explains the gap between the end of the Execute Direct Query event and the end of the query.
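As a quick sanity check on those numbers, here is the arithmetic as a few lines of Python. The three figures are the ones quoted in this post from Performance Analyzer and its export; the function name is just something I made up for the illustration.

```python
def unexplained_seconds(dax_query, direct_query, data_read):
    """Time left over once the SQL wait and the row-reading time are subtracted."""
    return dax_query - direct_query - data_read


# Figures quoted in the post, all in seconds.
remainder = unexplained_seconds(dax_query=5.4, direct_query=3.3, data_read=1.95)
print(round(remainder, 2))
```

Once the time spent reading rows is added to the time spent waiting for the first row, only a small fraction of a second of the DAX query duration is left unaccounted for.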

One last question: why is this SQL query returning so many rows when the DAX query is only asking for the top 502 rows?

The reason is that, at the time of writing at least, the Analysis Services engine can only push a top(n) operation down to a DirectQuery SQL query in very simple scenarios where there are no measures and no aggregation involved – and in this case we’re summing up values. As a result, if you’re using DirectQuery mode and have a visual like this that can potentially display a large number of rows and includes a measure or aggregated values, you may end up with slow performance.

[Thanks to Jeffrey Wang for providing the information in this post]