Best Practices For Power BI On Databricks Webinar

I recently took part in a webinar with Denny Lee, Liping Huang and Marius Panga from Databricks on the subject of best practices for using Power BI on Databricks. You can view the recording on LinkedIn here:

https://www.linkedin.com/video/live/urn:li:ugcPost:7174102151407779841

…or on YouTube here:

My section at the beginning, covering Power BI best practices for Import and DirectQuery, doesn’t contain any new information – if you’ve been following the DirectQuery posts on this blog or read the DirectQuery guidance docs here and here then there won’t be any surprises. What I thought was really useful, though, was hearing the folks from Databricks talk about best practices on the Databricks side, which took up the majority of the webinar. Definitely worth checking out.

Overhead Of Getting Relationship Columns In Power BI DirectQuery Mode

Many Power BI connectors for relational databases, such as the SQL Server connector, have an advanced option to control whether relationship columns are returned or not. By default this option is on. Returning these relationship columns adds a small overhead to the time taken to open a connection to a data source and so, for Power BI DirectQuery semantic models, turning this option off can improve report performance slightly.

What are relationship columns? If you connect to the DimDate table in the Adventure Works DW 2017 sample database using Power Query, you’ll see them on the right-hand side of the table. The following M code:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2017"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate

…shows the relationship columns:

Whereas if you explicitly turn off relationship columns by deselecting the “Include relationship columns” checkbox:

…you get the following M code with the CreateNavigationProperties property set to false:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2017", [CreateNavigationProperties=false]),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate

…and you don’t see those extra columns.

How much overhead does fetching relationship columns add? It depends on the type of source you’re using, how many relationships are defined and how many tables there are in your model (because the calls to get this information are not made in parallel). It’s also, as far as I know, impossible to measure the overhead from any public telemetry such as a Profiler trace, or to deduce it by looking at the calls made on the database side. The overhead only happens when Power BI opens a connection to a data source and the result is cached afterwards, so it will only be encountered occasionally and not for every query that is run against your data source. I can say that the overhead can be quite significant in some cases, though, and it can be made worse by other factors such as a lack of available connections or network/gateway issues. Since I have never seen anyone actually use these relationship columns in a DirectQuery model – they are quite handy in Power Query in general though – you should always turn them off when using DirectQuery mode.

[Thanks to Curt Hagenlocher for the information in this post]

Measuring The Total Time Spent Querying Sources In Power BI DirectQuery Mode

If you’re tuning a DirectQuery semantic model in Power BI, one of the most important things you need to measure is the total amount of time spent querying your data source(s). Now that the queries Power BI generates to get data from your source can be run in parallel, you can’t just sum up the durations of the individual queries to get the end-to-end duration. The good news is that there is a new pair of trace events available in Log Analytics (though not in Profiler at the time of writing) which solves this problem.

The events have the OperationName ProgressReportBegin/ProgressReportEnd and the OperationDetailName ParallelSession. Here’s a simple Log Analytics query that you can use to see how these events work:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(10min)
| where OperationName in
    ("QueryBegin", "QueryEnd",
    "DirectQueryBegin", "DirectQueryEnd",
    "ProgressReportBegin", "ProgressReportEnd")
| project OperationName, OperationDetailName, EventText, TimeGenerated,
    DurationMs, CpuTimeMs, DatasetMode, XmlaRequestId
| order by TimeGenerated asc

Here’s what this query returned for a single DAX query that generated multiple SQL queries (represented by DirectQueryBegin/End event pairs) against a relational data source:

Notice that there is just one ProgressReportBegin/End event pair per DAX query; this is always the case, no matter how many fact tables or data sources are used, at least as far as I know. The ProgressReportBegin event for ParallelSession comes before the DirectQueryBegin/End event pairs and the associated End event comes after the final DirectQueryEnd event. The DurationMs column for the ProgressReportEnd event gives you the total duration in milliseconds of all the DirectQuery events. In this case there were six SQL queries sent to the source (and so six DirectQueryBegin/End event pairs), each of which took between 1 and 2 seconds. However, since all of these queries ran in parallel, the overall duration was still just over 2 seconds.
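As an illustration of how you might use these events, here’s a sketch of a Log Analytics query that returns, for each DAX query, the total time spent querying the source alongside the overall query duration, so you can see what proportion of each query’s duration was spent waiting for the data source. It assumes the same PowerBIDatasetsWorkspace columns used above and that each DAX query has one QueryEnd event and one ProgressReportEnd event for ParallelSession, so treat it as a starting point rather than a definitive solution:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1h)
// keep just the overall query duration and the ParallelSession total
| where OperationName == "QueryEnd"
    or (OperationName == "ProgressReportEnd" and OperationDetailName == "ParallelSession")
| summarize
    TotalSourceQueryMs = sumif(DurationMs, OperationName == "ProgressReportEnd"),
    TotalQueryMs = sumif(DurationMs, OperationName == "QueryEnd")
    by XmlaRequestId
| extend PercentInSource = round(100.0 * TotalSourceQueryMs / TotalQueryMs, 1)
| order by TotalQueryMs desc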

dataCoverageDefinition: A New Optimisation For Hybrid Tables In Power BI

Hybrid tables – tables which contain both Import mode and DirectQuery mode partitions to hold data from different time periods – have been around for a while. They are useful in cases where your historic data doesn’t change but your most recent data changes very frequently and you need to reflect those changes in your reports; you can also have “reverse hybrid tables” where the latest data is in Import mode but your historic data (which may not be queried often but still needs to be available) is in DirectQuery mode. Up to now, though, they had a problem: even when you were querying data that was in the Import mode partition, Power BI still sent a SQL query to the DirectQuery partition, and that could hurt performance. That problem is now solved with the new dataCoverageDefinition property on the DirectQuery partition.

You can find full documentation here:
https://learn.microsoft.com/en-us/analysis-services/tom/table-partitions?view=asallproducts-allversions

What dataCoverageDefinition does is tell Power BI what data is present in the DirectQuery partition so it knows whether to generate a SQL query to get data from your DirectQuery source or not. It takes the form of a simple DAX expression that returns True for rows that are stored in the DirectQuery partition, for example:

RELATED('DimDate'[CalendarYear]) IN {2017,2018,2019}

If you’re working with large fact tables in DirectQuery mode and don’t have a religious objection to some of that data being stored in Import mode, you should check out hybrid tables because they can really improve report performance!

Performance Implications Of Using Calculated Columns In Composite Models On Power BI Semantic Models

I don’t have anything against the use of calculated columns in Power BI semantic models in general but you do need to be careful using them with DirectQuery mode. In particular when you have a DirectQuery connection to another Power BI semantic model – also known as a composite model on a Power BI semantic model – it’s very easy to cause serious performance problems with calculated columns. Let’s see a simple example of why this is.

Let’s say you have an Import mode semantic model called Source Model containing sales data:

Here’s the contents of the Sales table:

And here’s the definition of the Sales Amount measure:

Sales Amount = SUM(Sales[Sales])

Now, let’s say this semantic model gets published to the Service and you open Power BI Desktop, create a Live connection to this model and then click the “Make changes to this model” button to create a composite model. You now have a local composite model in Power BI Desktop that is connected to the Source Model semantic model.

Next, you decide to create a calculated column on the Sales table and, after reading Marco and Alberto’s article on circular dependencies (because you’ll definitely run into circular dependency errors), come up with the following code:

Tax =
CALCULATE (
    [Sales Amount] * 0.1,
    ALLEXCEPT ( Sales, Sales[TransactionNumber] )
)

Great! You can now build a report that looks like this:

This is where the first warning comes in, though. In Import mode semantic models calculated columns are evaluated when the model is refreshed, but in DirectQuery mode they are evaluated at query time if a query needs to use them. Performance Analyzer shows two DAX queries generated for the visual above. The first is the query that the visual runs against the local semantic model in Power BI Desktop, which looks like this:

DEFINE
    VAR __DS0Core =
        SUMMARIZECOLUMNS(
            ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
            "Sales_Amount", 'Sales'[Sales Amount],
            "SumTax", CALCULATE(SUM('Sales'[Tax]))
        )

    VAR __DS0PrimaryWindowed =
        TOPN(502, __DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
    __DS0PrimaryWindowed

ORDER BY
    [IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

There’s nothing much of interest here. However, there’s a second DAX query: the one that the local model generates to get the data it needs from the Source Model semantic model in the Service. It looks like this:

Define
    COLUMN 'Sales'[ASDQ_Tax] = CALCULATE([Sales Amount] * 0.1, ALLEXCEPT(Sales, Sales[TransactionNumber]))

var ASDQ___DS0Core =
    SUMMARIZECOLUMNS(
        ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
        "Sales_Amount", [Sales Amount],
        "SumTax", CALCULATE(SUM('Sales'[ASDQ_Tax]))
    )

var ASDQ___DS0PrimaryWindowed =
    TOPN(502, ASDQ___DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
    ASDQ___DS0PrimaryWindowed

ORDER BY
    [IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

Notice, right at the top of the query, a DEFINE statement that defines the Tax calculated column. Every time this visual is rendered in your Power BI report, the calculated column will be re-evaluated, and that will have a performance overhead. If the calculated column had been created in Source Model (which is Import mode, remember) it would have been evaluated when the model was refreshed, and any queries that used it would probably be faster as a result.
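For comparison, here’s a minimal sketch of what an equivalent column could look like if it were created in the Import mode Source Model itself, assuming the Sales table has one row per TransactionNumber so no CALCULATE/ALLEXCEPT workaround is needed; it would be evaluated once at refresh time rather than every time a report visual is rendered:

// Hypothetical calculated column defined in Source Model (Import mode):
// evaluated when the model is refreshed, not at query time
Tax = Sales[Sales] * 0.1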

There’s something else to watch out for though. Let’s say you define a second calculated column in the local model like so:

Apple Sales =
CALCULATE (
    [Sales Amount],
    FILTER ( ALLNOBLANKROW ( 'Sales'[Product] ), 'Sales'[Product] = "Apples" )
)

You can now create a table that looks like this, which does include the Apple Sales column but does not include the Tax column:

The DAX query sent by the visual to the local model is not worth looking at. However, the DAX query sent by the local model to Source Model does have something surprising in it:

Define
    COLUMN 'Sales'[ASDQ_Apple Sales] = CALCULATE([Sales Amount], FILTER(ALLNOBLANKROW('Sales'[Product]), 'Sales'[Product]="Apples"))

    COLUMN 'Sales'[ASDQ_Tax] = CALCULATE([Sales Amount] * 0.1, ALLEXCEPT(Sales, Sales[TransactionNumber]))

var ASDQ___DS0Core =
    SUMMARIZECOLUMNS(
        ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
        "Sales_Amount", [Sales Amount],
        "SumApple_Sales", CALCULATE(SUM('Sales'[ASDQ_Apple Sales]))
    )

var ASDQ___DS0PrimaryWindowed =
    TOPN(502, ASDQ___DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
    ASDQ___DS0PrimaryWindowed

ORDER BY
    [IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

There’s a DEFINE statement for the Apple Sales column, as you would expect. However, there is also a DEFINE statement for the Tax column, which means it is evaluated too, even though it’s not shown in the visual and it seems as though the Apple Sales column doesn’t reference it – or does it? I’m not going to try to explain what’s going on here (again, this is best left to Marco and Alberto) but the important point is that both calculated columns are now being evaluated at query time, which means the query will be even slower. If you have large data volumes and/or your calculated columns contain some complex DAX, you could end up with some very slow reports. If you’re using Premium then you could also be putting a lot of extra load on your capacity.

Composite models are an important part of Power BI’s self-service story and it’s inevitable that at some point your developers will want to build calculated columns in them. As always, you need to monitor the performance of your reports carefully while you are building them, both in terms of duration and CPU usage, to make sure they are well optimised, and keep the use of calculated columns in composite models to a minimum.

New Limits For The “Maximum Connections Per Data Source” Property In Power BI DirectQuery Mode

One of the most important properties you can set in a Power BI DirectQuery semantic model is the “Maximum connections per data source” property, which controls the number of connections that can be used to run queries against a data source. The good news is that the maximum value that you can set this property to has just been increased in Premium.

This property is important because the number of open connections to a data source acts as a limit on the number of concurrent queries that can be run by the semantic model against the source: each connection can only have one query running on it at any one time. If your reports have a large number of visuals on a page and/or a large number of users run them at the same time, it is very easy for a DirectQuery semantic model to need to send lots of queries to your data source simultaneously. If some of the queries that your semantic model runs against your data source are slow – more than one or two seconds, even – then the number of queries that need to be run at a given time will increase. The same is true if you have increased the Max Parallelism Per Query property to increase the number of parallel queries that can be generated by a single DAX query.

This property is documented in a number of places, including the DirectQuery guidance documentation and in data source-specific best practice documents such as this one for Snowflake. You can set the property in Power BI Desktop in the Current File/DirectQuery section of the Options dialog:

If you are not using Power BI Premium (ie you are using Power BI Shared capacity, also known as Power BI Pro) then the maximum value that you can set this property to is 10. If you are using Power BI Premium then the maximum value was 30 up to now, but that limit has just been increased. The table on this page shows what the new limits per SKU are:

As you can see, for a P1/F64 the maximum limit is now 50 rather than 30 and this limit goes all the way up to 200 for a P4/F512 and higher.

I’ve seen plenty of cases where increasing the value of this property makes Power BI reports run a lot faster. However, this will only be true if your data source is able to handle the number of queries that Power BI is trying to run against it. As I showed in this post, if your data source can’t handle the number of queries you’re trying to run then performance will get worse and not better, so you should try different values to see which one works best.

Power BI DirectQuery Best Practices Video

Here’s a recording of a session I did for the Manchester (UK) Power BI user group recently on best practices for DirectQuery mode in Power BI:

I’ve done this session for a few other groups over the last six months, but this is the latest and best version, I think.

I’ve worked with several customers using DirectQuery mode since I joined the CAT team and learned a lot along the way. Some of this knowledge has been written up by me (eg this post on Snowflake DirectQuery mode best practices), some of it by the people who work with the data sources Power BI runs on top of (see posts by Dany Hoter on the Azure Data Explorer blog, for example; there’s also going to be a lot of new material on DirectQuery for Databricks coming soon, with this post as a start). There’s a lot of detailed information in the docs too, for example here, here and here.

But remember folks: the most important piece of advice around DirectQuery is to not use it unless you’re really, really, really sure you have no other option. It’s possible to make it work well but it takes a lot more tuning and specialist skill than Import mode!

Disabling Filter Pane Aggregates In Power BI

When you add a field to the filter pane in a Power BI report and select the “Basic filtering” filter type, in most cases you’ll see some numbers next to the field values:

These numbers are counts of the rows for each value in the table that the field comes from. The query to get these counts is usually quite fast and inexpensive, but if you’re filtering on a field from a very large table (for example a fact table) and/or using DirectQuery mode, that might not be true. For example, the screenshot above is taken from a DirectQuery dataset and here’s the SQL query that generates the counts shown:

Luckily there’s a way to disable these queries and stop the counts being displayed: set the Discourage Implicit Measures property to true on your dataset. The main purpose of this property is to stop the automatic creation of implicit measures when building Power BI reports; this makes sense when you’re using calculation groups, for example, and when you add a calculation group to your dataset this property is set to true for you. You can also set Discourage Implicit Measures to true manually by connecting to the dataset using Tabular Editor:

Here’s what the filter pane looks like with Discourage Implicit Measures set to true:

Just to be safe, I think it makes sense to set Discourage Implicit Measures to true for all DirectQuery datasets to reduce the overall number of queries run against your data source and reduce the risk of a really expensive query being run. I don’t think seeing the counts adds much value for end users anyway.

[Thanks to Ed Hansberry for pointing out this behaviour and John Vulner for explaining it]

DirectQuery Parallelisation In Power BI – Some Examples

Recently we announced an important new optimisation for DirectQuery datasets: the ability to run some of the queries generated by a single DAX query in parallel. You can read the blog post here:

https://powerbi.microsoft.com/en-za/blog/query-parallelization-helps-to-boost-power-bi-dataset-performance-in-directquery-mode/

A few of us on the Power BI CAT team have tested this out with customers and seen some great results, so I thought I’d write a post illustrating the effect this optimisation can have and explaining when it can and can’t help.

For the purposes of this blog post I built a DirectQuery dataset (the source and the design of the dataset are irrelevant) using the New York Taxi data. I then created a table visual with five measures on columns; the precise definitions of the measures don’t matter much either, except for the fact that they each generate a single Storage Engine request and horizontal fusion cannot combine these requests in any way.

Next, I published the dataset to the Power BI Service, connected to it from Tabular Editor, set the compatibility level property to 1569 and saved the change:

Having done this, the Max Parallelism Per Query property became visible, and I set it to 1:

1 is not the default value for this property but it does give the same behaviour as the default at the time of writing – it ensures that the Storage Engine requests for a single DAX query are always executed one at a time.

Next I connected DAX Studio up to the XMLA Endpoint for the workspace and ran the DAX query generated by the table visual above with Server Timings enabled. The new timeline view in DAX Studio 3.0.6 (see Marco’s video for more details) does a great job of showing when the Storage Engine queries – in this case the SQL queries sent to the data source – are made:

As you can see, in this case it’s clear that the six Storage Engine queries/SQL queries generated by this one DAX query are all executed in sequence. Also notice that the duration of the DAX query was 3 seconds.

I then changed the Max Parallelism Per Query property to 6 and reran the same query from DAX Studio. Here’s what the timeline looked like:

The Storage Engine queries/SQL queries now all run in parallel and the overall duration of the DAX query is just 0.9 seconds!

This is great, but as always there are a few things that need to be highlighted. First, Power BI may not be able to run all the Storage Engine queries in parallel if there are certain types of dependency in your DAX. In the example above there were no dependencies between the measures – none of them referenced any other measures, and they all used fairly simple expressions – but in the real world that might not be the case. To illustrate this I created a new measure with the following definition:

MEASURE WithDependencies =
    IF (
        CALCULATE ( [Distinct Medallions], ALL ( 'GEOGRAPHY'[STATE] ) ) > 0,
        [Trip Count],
        [Monday Trips]
    )

I then used it in a new table visual and captured the DAX query generated:

Here’s what the DAX Studio Server Timings showed for this query with Max Parallelism Per Query set to 6:

This time there is some parallelism, but the first Storage Engine query/SQL query has to complete before the last two can be executed. I wrote this measure specifically to get this behaviour; you may be able to rewrite your own DAX to get better parallelism, but it’s something to be aware of.

One last thing to mention: increasing parallelism here may lead to worse query performance overall if you end up running into bottlenecks elsewhere in your architecture (see here for an example). You may need to increase the “maximum number of connections per data source” property on your dataset to allow more queries to run in parallel, and if that results in additional load on the data source then you may need to scale up or scale out there too. Remember also that the DAX queries for a report page are, and always have been, run in parallel, so the Storage Engine queries for different DAX queries will also run in parallel; you’ll need to watch out for report pages with a large number of visuals on them.

ApproximateDistinctCount DAX Function Now Works On More DirectQuery Sources

Some good news for those of you using DirectQuery mode in Power BI: the ApproximateDistinctCount DAX function, which returns an estimate of the number of distinct values in a column and which can be a lot faster than a true distinct count as returned by the DistinctCount function, is now available to use with BigQuery, Databricks and Snowflake sources. It only worked with Azure SQL DB and Synapse before; Redshift support is coming soon. You can use it in exactly the same way that you would use the DistinctCount function, except that it only works in DirectQuery mode.

For example, I have a Power BI DirectQuery dataset that uses the New York Taxi data in Snowflake as its source. With the following two DAX measures defined on the Trip table:

Approximate Distinct Medallions = APPROXIMATEDISTINCTCOUNT('TRIP'[MEDALLIONID])
Distinct Medallions = DISTINCTCOUNT('TRIP'[MEDALLIONID])

…I can build a table visual that compares the output of the two measures (as you can see, the difference isn’t that big):

…and see that the ApproximateDistinctCount DAX function is translated to the APPROX_COUNT_DISTINCT function in Snowflake SQL:

select { fn convert(count(distinct("MEDALLIONID")), SQL_DOUBLE) } + { fn convert(max("C1"), SQL_DOUBLE) } as "C1",
    approx_count_distinct("MEDALLIONID") as "C2"
from 
(
    select "DATEID",
        "MEDALLIONID",
        "HACKNEYLICENSEID",
        "PICKUPTIMEID",
        "DROPOFFTIMEID",
        "PICKUPGEOGRAPHYID",
        "DROPOFFGEOGRAPHYID",
        "PICKUPLATITUDE",
        "PICKUPLONGITUDE",
        "PICKUPLATLONG",
        "DROPOFFLATITUDE",
        "DROPOFFLONGITUDE",
        "DROPOFFLATLONG",
        "PASSENGERCOUNT",
        "TRIPDURATIONSECONDS",
        "TRIPDISTANCEMILES",
        "PAYMENTTYPE",
        "FAREAMOUNT",
        "SURCHARGEAMOUNT",
        "TAXAMOUNT",
        "TIPAMOUNT",
        "TOLLSAMOUNT",
        "TOTALAMOUNT",
        case
            when "MEDALLIONID" is null
            then CAST(1 as INTEGER)
            else CAST(0 as INTEGER)
        end as "C1"
    from "NYCDATA_DB"."NYCTAXIDATA"."TRIP"
) as "ITBL"

Distinct counts are often the slowest type of measure, but in my experience report consumers are very unwilling to accept seeing “near enough” numbers in their reports rather than numbers that are 100% accurate, even if the approximate distinct counts are much faster. I heard someone suggest using field parameters to allow report consumers to switch between fast approximate distinct counts for exploration and accurate distinct counts when they really need them, and I think this is a great compromise.
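As a sketch of what that could look like with the measures from earlier in this post: creating a field parameter in Power BI Desktop (Modeling > New parameter > Fields) and adding the two distinct count measures to it generates a calculated table along these lines (the table name and captions here are hypothetical, and Desktop also sets some extended properties behind the scenes, so it’s best to let it generate the table for you rather than writing the DAX by hand):

Distinct Count Mode = {
    ("Approximate (fast)", NAMEOF([Approximate Distinct Medallions]), 0),
    ("Accurate (slow)", NAMEOF([Distinct Medallions]), 1)
}

Putting the field parameter in a slicer then lets report consumers choose which version of the distinct count the visuals on the page use.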

Bonus links: if you need to do an approximate distinct count in Import mode, Phil Seamark shows how to do this here; Phil also wrote a great post on building aggregations for distinct counts (read it carefully – this is a really powerful technique!) here.