Simulating Slow Data Sources In Power BI

As a postscript to my series on Power BI refresh timeouts (see part 1, part 2 and part 3) I thought it would be useful to document how I was able to simulate a slow data source in Power BI without using large data volumes or deliberately complex M code.

It’s relatively easy to create an M query that returns a table of data after a given delay. For example, this query returns a table with one column and one row after one hour and ten minutes:

let
  Source = Function.InvokeAfter(
    () => #table({"A"}, {{1}}), 
    #duration(0, 1, 10, 0)
  )
in
  Source

Some notes:

  • I’m using #table to return the table without having to connect to a data source. More details on how to use #table can be found here.
  • The delay is achieved using the Function.InvokeAfter M function, with the amount of time to wait for specified using #duration

A more interesting problem is how to create an M query that, instead of waiting for a given duration and then returning a table immediately, returns the rows of a table one at a time with a delay between each row. Here’s a query that does that, returning ten rows one second at a time:

let
  NumberOfRows = 10,
  DaysToWait = 0,
  HoursToWait = 0,
  MinutesToWait = 0,
  SecondsToWait = 1,
  Source = #table(
    {"A"},
    List.Transform(
      {1 .. NumberOfRows},
      each Function.InvokeAfter(
        () => {1},
        #duration(
          DaysToWait,
          HoursToWait,
          MinutesToWait,
          SecondsToWait
        )
      )
    )
  )
in
  Source

Last of all, to simulate a slow SQL Server data source – not being much good at TSQL at all – I borrowed some code from this thread on Stack Overflow to create a function that returns a scalar value after a specified number of seconds:

CREATE FUNCTION [dbo].[ForceDelay](@seconds int) returns int as 
BEGIN DECLARE @endTime datetime2(0) = DATEADD(SECOND, @seconds, GETDATE()); 
WHILE (GETDATE() < @endTime ) 
SET @endTime = @endTime;
return 1;
END

I showed how to call this function from Power Query using a native SQL query here.

Troubleshooting Power BI Timeouts, Part 3: The External Command Timeout

In the first post in this series I showed how any Power BI dataset refresh started via the Power BI portal or API is limited to 2 hours in Shared capacity and 5 hours in Premium capacity, and how you could work around that by running a refresh via Premium’s XMLA endpoint feature. In the second post in this series I showed how some M functions allow you to set timeouts. However, even if you initiate a refresh via the XMLA endpoint you may still get a timeout error and in this post I’ll discuss another reason why: the External Command Timeout.

This property is a hangover from Analysis Services (you can see it documented here). It represents the amount of time the Analysis Services engine inside Power BI will wait to get data from a data source. How it behaves exactly depends on the data source: it may limit the amount of time it takes to get the first row of data from the source or it may limit the amount of time it takes to get all the rows of data. In Power BI Premium it is set to five hours, which means that no single partition can take more than about five hours to refresh. In the first post in this series I worked around this by creating a dataset with multiple partitions, each of which took about an hour to refresh, but when trying to refresh a dataset with a single partition that takes more than five hours I got the following error when trying to refresh from SQL Server Management Studio through the XMLA Endpoint:

The error message here is:

Timeout expired. The timeout period elapsed prior to completion of the operation.. The exception was raised by the IDbCommand interface.

In this case I saw the same error in the Error event in Profiler:

…and in other cases, when testing a different source, I got a different error in Profiler in the Progress Report Error event:

The message here is:

Error processing partition ‘<pii>SlowCommand-2114bb81-69d3-4fe4-9d54-6b2661b0c497</pii>’ of table ‘<pii>SlowCommand</pii>’ [Internal Error Code: ‘0xc112001a’].

There’s no way to avoid the External Command Timeout. Instead, what you need to do is either change your partitioning strategy so each partition refreshes in under five hours or tune your data source, M code or gateway (if you’re using one) so that data is returned to Power BI faster.

In Shared capacity I believe the External Command Timeout is set to two hours (again, to match the overall refresh timeout) but it’s much less important there because you can’t create partitions manually (the only way a dataset in Shared can be partitioned is by setting up incremental refresh) and there’s no XMLA Endpoint so there’s no way to work around the two hour overall refresh limit anyway.

[Thanks, as always, to Akshai Mirchandani for a lot of the information in this post]

Troubleshooting Power BI Timeouts, Part 2: Timeouts Specified In Power Query Functions

In the first post in this series I showed how the Power BI Service applies a limit on the total amount of time it takes to refresh a dataset in the Power BI Service, except when you initiate your refresh via an XMLA Endpoint. In this post I’ll look at the various timeouts that can be configured in Power Query functions that are used to access data.

Every time a Power BI Import mode dataset connects to a data source it goes through a Power Query query, and inside the code of that Power Query query will be an M function that connects to a specific type of data source. Most – but not all – of these M functions have the option to set timeouts. The M functions that connect to relational databases (for example Sql.Database which is used to connect to SQL Server-related sources, or Odbc.DataSource which is used to connect to ODBC sources) typically allow you to configure two types of timeout:

  • A connection timeout, which specifies how long to wait when trying to open a connection to the data source
  • A command timeout, which specifies how long the query to get data from the source is allowed to run

Some other functions have other timeouts more appropriate to the data source they access: for example Web.Contents and OData.Feed have a Timeout property for the HTTP requests they make behind the scenes. Other functions (for example Excel.Workbook) have no timeout-related properties that you can set at all. You should check the documentation , either online or using the #shared intrinsic variable in the Power Query Editor, to see what options are available and what the default timeouts are.

Here’s a simple example of how to set a timeout when connecting to SQL Server. First of all, I created a scalar function called ForceDelay in TSQL that returns a value of 1 after a specified number of seconds, so the query:

select [dbo].[ForceDelay](10) as MyColumn

…takes 10 seconds to run.

When you connect to SQL Server in Power BI/Power Query you can paste a SQL query in the connection dialog under Advanced Options; when you do so you’ll also see the option to set the command timeout in minutes:

Here’s the M code generated when you use the SQL query above, set the Command timeout to 1 minute and click OK:

let
  Source = Sql.Database(
    "localhost",
    "AdventureWorksDW2017",
    [
      Query
        = "select [dbo].[ForceDelay](10) as MyColumn",
      CommandTimeout = #duration(
        0,
        0,
        1,
        0
      )
    ]
  )
in
  Source

Notice that the CommandTimeout option has been set on the Sql.Database function, and that the value passed to it is a duration of one minute defined using #duration. Since the SQL query takes 10 seconds to run and the timeout is 1 minute, it runs successfully.

However, if you set the CommandTimeout option to 5 seconds like so:

let
  Source = Sql.Database(
    "localhost", 
    "AdventureWorksDW2017", 
    [
      Query
        = "select [dbo].[ForceDelay](10) as MyColumn", 
      CommandTimeout = #duration(
        0, 
        0, 
        0, 
        5
      )
    ]
  )
in
  Source

…then the query will take longer that the timeout and you’ll see the following error in the Power Query Editor:

The error message is:

DataSource.Error: Microsoft SQL: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

If you hit the timeout when refreshing in the main Power BI window you’ll see the same message in the refresh dialog box:

So, as you can see, if you are working with large data volumes or slow queries you will need to be aware of the default timeouts set in the M functions you’re using and alter them if need be.

Troubleshooting Power BI Timeouts, Part 1: The Two/Five Hour Import Refresh Limit

If you’re working with a large Power BI dataset and/or a slow data source in Import mode it can be very frustrating to run into timeout errors after you have already waited a long time for a refresh to finish. There are a number of different types of timeout that you might run into, and in this series I’ll look at a few of them and discuss some of the ways you can work around them.

In this post I’ll look at one of the most commonly-encountered timeouts: the limit on the maximum length of time an Import mode dataset refresh can take. As documented here these limits are:

  • Two hours for an import-mode dataset in Shared capacity
  • Five hours for an import-mode dataset in a Premium or PPU capacity

If you are using Premium you can connect to your dataset using SQL Server Profiler and run a trace when the refresh starts and you’ll see the timeout set as a property of the TMSL command that runs the refresh (18000 seconds = 5 hours):

Here’s an example of the message you will see in the Refresh History dialog if your dataset refresh takes more than five hours in PPU:

Once again you get more details in a Profiler trace. When the refresh times out you’ll see an Error event with long message that starts with the following text:

Timeout expired. The timeout period elapsed prior to completion of the operation.. The exception was raised by the IDbCommand interface. The command has been canceled..

There is a way to work around this limit if you’re using Premium or PPU: it only applies to refreshes (scheduled or manual) that you start from the Power BI portal or via the REST API. If you start your refresh by running a TMSL script via the XMLA Endpoint (for example via SQL Server Management Studio or from PowerShell) then the limit does not apply. This is because you’re executing your own TMSL script rather than having the Power BI Service execute its own refresh command – with a timeout set – like the one shown in the screenshot above.

For example, here’s the same dataset from the screenshot above but refreshed successfully from SQL Server Management Studio (notice the Type column says “Via XMLA Endpoint”) and with a duration of just over five hours:

There are a couple of blog posts out there showing how you can implement a refresh strategy using the XMLA Endpoint; this post from Marc Lelijveld and Paulien van Eijk is a great example of how to do this using Azure Automation and Azure Data Factory.

You should also consider tuning any refresh that takes a long time and it could be that after tuning it you fall well within the two/five hour limit. There are a lot of things to consider when tuning dataset refresh; I did a conference session on this topic last year (you can watch the recording here) covering issues like data modelling, query folding in Power Query and the use of dataflows but it doesn’t cover everything and I have learned a lot even since then. In my opinion one of the most important things you can do to improve refresh performance for very large Import datasets is to partition your fact tables and increase the amount of parallelism, as I describe here. One customer I worked with was able to reduce their refresh time from four hours down to one hour using this technique:

As I said, though, there are many different types of timeout that you may encounter – so even if you refresh via the XMLA Endpoint it may still time out for another reason.

Increasing Refresh Parallelism -And Performance – In Power BI Premium

One of the factors that affects dataset refresh performance in Power BI is the number of objects that are refreshed in parallel. At the time of writing there is a default maximum of six objects that can be refreshed in parallel in Power BI Premium but this can be increased by using custom TMSL scripts to run your refresh.

A few months ago I blogged about how partitioning a table in Power BI Premium can speed up refresh performance. The dataset I created for that post contains a single table with nine partitions, each of which is connected to a CSV file stored in ADLSgen2 storage. Using the technique described by Phil Seamark here I was able to visualise the amount of parallelism when the dataset is refreshed in a Premium Per User workspace:

In this case I started the refresh from the Power BI portal so the default parallelism settings were used. The y axis on this graph shows there were six processing slots available, which means that six objects could be refreshed in parallel – and because there are nine partitions in the only table in the dataset, this in turn meant that some slots had to refresh two partitions. Overall the dataset took 33 seconds to refresh.

However, if you connect from SQL Server Management Studio to the dataset via the workspace’s XMLA Endpoint (it’s very similar to how you connect Profiler, something I blogged about here) you can construct a TMSL script to refresh these partitions with more parallelism. You can generate a TMSL script by right-clicking on your table in the Object Explorer pane and selecting Partitions:

…then, in the Partitions dialog, selecting all the partitions and clicking the Process button (in this case ‘process’ means the same thing as ‘refresh’):

…then, on the Process Partition(s) dialog, making sure all the partitions are selected, selecting Process Full from the Mode dropdown:

…and then clicking the Script button and selecting Script Action to New Query Window:

This generates a new TMSL script with a Refresh command that refreshes all the partitions:

This needs one more change to enable more parallelism though: it needs to be wrapped in a TMSL Sequence command that contains the maxParallelism property. Here’s the snippet that goes before the refresh (you also need to close the braces after the Refresh command too):

{
"sequence":
{
"maxParallelism": 9,

Executing this command refreshed all nine partitions in parallel in nine slots:

This refresh took 25 seconds – eight seconds faster than the original refresh with six slots.

As you can see, increasing the number of refresh slots in this way can have a big impact on refresh performance – although, of course, you need to have enough tables or partitions to take advantage of any parallelism and you also need to be sure that your data source can handle increased parallelism. You can try setting MaxParallelism to any value up to 30 although no guarantees can be made about how many slots are available at any given time. It’s also worth pointing out that there are scenarios where you may want to set maxParallelism to a value that is lower than the default of six, for example to reduce to load on data sources that can’t handle many parallel queries.

[Thanks to Akshai Mirchandani for the information in this post]

Storage and the NameColumn and KeyColumns Properties

Those of you who have worked with SSAS Multidimensional for a reasonable amount time will, no doubt, be very familiar with the NameColumn and KeyColumns properties of an attribute (if you’re not, see here and here) and how they should be used. You will probably also know that when the KeyColumns property has been set to only one column, then the NameColumn property can be left unset and the key will be used as the name of the attribute’s members.

However, while onsite with a customer recently I noticed something strange. Here’s an illustration: if you create a simple dimension based on the DimCustomer table in Adventure Works and create a single attribute based on the CustomerKey column, this is what you see in the Properties for that attribute:

image

However, if you deploy the database then import it into a new project in SSDT/BIDS, then you see that the NameColumn property has been set:

image

My first thought was that this was a bit dangerous, because it might mean that the imported version of the database would start storing extra strings for the names. But this was incorrect because a look at the data directories for the two versions of the dimensions showed they contained the same files and were using the same amount of storage:

image

I’m very grateful to Akshai Mirchandani of the dev team for confirming that in this situation, it is irrelevant whether you set the NameColumn or not – data duplication will always happen, and the key values will be stored again as strings. The only time it doesn’t happen is when the key and the name of the attribute are both the same column and that column is a string.

This means that if you have a very large attribute that is in danger of exceeding the infamous 4GB limit (although this is of course fixed in SSAS 2012) and which never needs to be visible, you can use the trick that Greg Galloway describes here to reduce the size of the string store. This involves creating a dummy column in your DSV (or underlying view or table) that contains only an empty string and then setting this as the NameColumn of your attribute. For the example above, this is the result:

image

image

For this example, the overall amount of storage used for the dimension has gone down from 1.24MB to 1.04MB, and although you can see the .ahstore file (the hash store) for the Customer Key attribute have grown, the size of the string store, Customer Key.asstore (note: don’t get confused between .asstore and .astore files), has reduced from 362KB to 1KB.

%d bloggers like this: