If you’ve been following some of my recent posts about improving Power BI refresh performance by partitioning tables you will have seen a lot of screenshots that look like the one below:
It’s a visualisation from a report created by my colleague Phil Seamark (as detailed in this blog post) showing how long all the partitions in a dataset take to refresh. If you look at these visualisations you’ll probably ask the same question I did: why does the first partition always start before the others?
It turns out this is because when a table is refreshed, the first thing that has to happen is that a certain amount of data is read so the type of encoding (Value or Hash) used for each column is determined. In most cases tables only contain one partition so it’s not obvious that this is happening, but when a table has more than one partition this happens only for the first partition – which explains why the first partition seems to start before the others. You can’t avoid it happening but you can reduce the impact a little by using encoding hints (see here and here for more details): this process can be skipped for columns that have a Hash encoding hint, or which the engine knows in advance have to use Hash encoding, although it cannot be skipped for columns that have a Value encoding hint. What’s more the Execute SQL event for the first partition will have to complete before the Execute SQL events for all the other partitions can start.
[Thanks to Akshai Mirchandani for the information in this post]
I’ve always thought it would be useful if we could run queries against AAS or Power BI via a simple REST API and I’ve just come across a really nice sample project created by one of my colleagues at Microsoft, David Browne, that allows you to do just that. You can find it here:
This sample is a HTTP proxy for XMLA endpoints, intended for use with Power BI Premium or Azure Analysis Services.
It’s implemented as ASP.NET Core 5 API Project. The main API is /api/Query which allows you to POST a DAX query and receive the results as a JSON result.
The sample is coded to pass-through authentication from the client to the XMLA endpoint. So to call the API either use HTTP BASIC auth over HTTPS, passing credentials with the request. It’s highly advised that this be a Service Principal, rather than an AAD user. To specify a Service Principal use a UserName of the form app:[ClientID]@[TenantID], and pass a Client Secret as the Password.
I’ll leave it to other people to get worked up about things like Power BI Premium gen2 – what really gets me excited is a new Profiler event type. And indeed there is a new one to play with – Job Graph – that you can use to visualise Azure Analysis Services processing jobs (it’s not enabled yet for the Analysis Services engine in Power BI).
More details about it, and how it can be used, are in the samples here:
The data returned by the Job Graph event isn’t intelligible if you look at the text it returns in Profiler. However if you save a .trc file with Job Graph event data to XML you can use the Python scripts in the GitHub repo to generate DGML diagrams that can be viewed in Visual Studio, plus Gantt charts embedded in HTML. Of course to do this you’ll need to have Python installed; you’ll also need to have Visual Studio and its DGML editor installed (see here for details).
Here’s the DGML diagram for the refresh of a very simple AAS database with just one table in it, called People:
The eight nodes in a semi-circle around the top of the diagram represent the eight jobs that process the attribute hierarchies in the table. These eight jobs must be completed before the single partition in the table is processed (this is the node in the centre of the diagram) and that partition must be processed before the table itself is processed (this is the node at the bottom of the diagram).
Since this is the annotated output you can hover over a node in the diagram and see more details about it, including its duration and how long it was blocked for:
Notice how three nodes have dark backgrounds? These nodes represent the critical path. From the sample documentation:
When the job graph executes, there is always a job that finishes last before the engine can commit the change. This job that finishes last is the “critical dependency” for the commit; the entire commit needs to wait for this one job to finish before executing. This last job depends on other jobs, one of which finished after all the others. This is the next critical dependency. Tracing this path of critical dependencies forms the critical path, which helps engineers and customers identify why processing takes so long.
To diagnose slow refresh times, look for the critical path and start at the top. Look at “Blocked duration”, “Waiting duration”, and “Running duration”. If a job has a long blocked duration, it spent a long time waiting on other jobs. If a job has a long waiting duration, it was waiting for an available thread, so increasing maxParallelism could help. If a job has a long running duration, then the model might have to be changed to speed up the job.
Here’s another diagram, generated from a slightly more complex database with four tables, a calculated column and a calculated table:
There’s also a script that takes this data and generates a Gantt chart from it, but at the time of writing it seems to be a bit buggy; I can make it work if I comment out some code but I suspect that means the output isn’t accurate. Hopefully it will get fixed soon, but in the meantime here’s an example of what it produces:
I think this is going to be really useful when I’m troubleshooting processing performance problems. It would also be really cool to have this in DAX Studio (hint, hint)…
In my last post I showed how to connect SQL Server Profiler up to a Power BI Premium dataset but I didn’t give you any examples of why this might be useful. In this post I’ll show you how you can use a Profiler trace to visualise all the queries run by a Power BI report, see when they start to run, see which ones run in parallel with each other and see what the overall time taken to run all the queries is.
Why is this important? When you’re tuning the performance of a Power BI report the first thing to do is to look at the performance of the individual DAX queries run and make them run as fast as possible. However when a Power BI report is rendered any one query is likely to be run at the same time as several other queries run for the same report, and this will have an impact on its performance. How much of an impact there is will depend on how many queries need to be run and the number of back-end v-cores available on your Premium capacity, or the number of QPUs available on your Azure Analysis Services instance if you’re using a Live connection to AAS. The more v-cores/QPUs you have available, the more of the work needed for a query that can be run in parallel; you can see a table listing the number of v-cores for each Premium SKU here, and the number of QPUs for each Azure Analysis Services SKU here. As a result of this if you have reports with a large number of visuals that generate slow DAX queries, scaling up your Power BI Premium capacity or AAS instance may improve overall report performance. Reducing the number of visuals on your report and/or reducing the number of visuals needed to display the same information will also reduce the number of queries that need to be run and therefore improve overall performance.
As I showed last week, SQL Server Profiler can be used to create a trace that logs all the queries run against a Power BI Premium dataset in the same way as it can be used with Azure Analysis Services. Assuming that you have a trace running that uses only the Query End event, this will give you a list of all the queries that are being run along with their start time, end time, duration and a lot of other interesting information. A table with all this data in can still be difficult to interpret though, so I built a Power BI template for a report that visualises all these queries and helps you understand the amount of parallelism that is taking place. You can download the template file here.
To use it, first you need a trace file. Make sure that no-one else is running reports on the Premium capacity you want to test (creating a Power BI Embedded capacity for testing purposes is a good idea) and then, when the trace is running, refresh your report using the technique I described in the “Use the network tab” section of this blog post. This will also allow you to correlate what you see in the trace with the information you see in the DevTools tab in the browser.
Then save the trace file you can created to XML by going to File/Save As/Trace XML File:
Next, open the Power BI template file and when prompted, enter the full path of the trace XML file you just created:
A new Power BI report will then be created. If you want to point the report to a different trace XML file all you need to do is change the value of the TraceXMLFile Power Query parameter.
On the first page you’ll see the name of the trace XML file you connected to plus a bar chart showing each Query End event (with each query identified by a number) on the y axis and the duration of each query on the x axis:
It’s not quite a simple bar chart though. What I’ve done is:
Found the start time of the first query run
Calculated the start time of every other query in the file relative to this first start time (although, unfortunately, Profiler only gives you start times rounded to the nearest second which means you can’t know exactly when a query starts)
Created a stacked bar chart where the first value in the stack is this relative start time and the second value is the duration of the query in seconds
Made the colour of the relative start time transparent, so you only see the blue sections of the bar for the query durations. This gives you a waterfall-like effect and allows you to see which queries are run in parallel. This also makes it easy to see the total amount of time taken to run your queries, from the start of the first query to the end of the last query, which is just as useful to know as the duration of any single query.
There’s also a drillthrough page so you can right-click on a bar and see a table with the DAX query for the query you clicked on, as well as its start time and duration.
It’s a very basic report, I know, and I would be interested to know if you have any ideas about other ways of visualising this data. What’s more, a visual like this raises more questions than I know how to answer… yet. For example, one thing I want to investigate is the effect that query interleaving has on this graph and both perceived and actual report performance. So stay tuned for more blog posts on this subject!
Among all the exciting announcements made at Ignite last week, one you may have missed (even in all the Excel-related announcements here) was that dynamic arrays have finally reached GA. Ever since they were announced I’ve been interested in how they can be used with Excel cube functions, which allow you to get data from Analysis Services, Power BI and the Excel Data Model/Power Pivot into cells on the worksheet, and I’ve blogged about this once already. Even though right now the CubeValue function can’t be used with dynamic arrays – alas – there are still some interesting ways dynamic arrays and cube functions can be used together and in this post I’ll show you another one I’ve found.
Say you have the following set of tables in the Excel Data Model:
And the following measure that counts the number of customers who bought something:
Let’s also say you have on a worksheet the following table of data showing the number of distinct customers broken down by product, created using the CubeMember and CubeValue functions:
Here are the formulas for this table:
So far, all very straightforward. Now let’s say you want to use a dynamic array to filter the rows in this table so you only see the products that have more than a certain number of distinct customers, and you want to see a grand total for this filtered list of customers, like so:
Here are the formulas to achieve this:
To explain what’s going on here:
Cell C2 contains the text “ThisWorkbookDataModel”, the name of the connection to the Excel Data Model. Doing this makes the other Excel formulas here much easier to read!
Cell F2 – which I’ve turned into a named range called FilterThreshold – contains the number of customers which a product must exceed to be shown.
Cell E5 contains the dynamic array Filter function that filters the rows shown in the first two screenshots above and only returns the products that exceed the threshold. The formula is:
Cell E6 does something similar to E5 but returns the number of distinct customers for each product. If all I wanted was the products and the number of distinct customers I wouldn’t have needed two separate formulas, I could have used one, but I need to add the grand total too…
Cell E4 contains the formula:
CUBESET($C$2, E5#, “Total”)
This creates a set from the spill reference E5#, which contains the cells containing CubeMember formulas returned by the Filter function. For some reason CubeSet returns an error if you try to use a dynamic array formula in its second parameter but using a spill reference works ok; this is why I had to use separate formulas in E5 and E6.
Finally, cell F4 contains the formula that returns the total distinct count for all products returned in the filtered list using CubeValue and the set created in cell E4:
CUBEVALUE($C$2, “[Measures].[Distinct Customers]”, E4)
Since this is a distinct count the only way to get the correct value here is using CubeSet, CubeValue and the DAX measure – there would be no way to calculate the correct value using Excel formulas. The same could be said of almost any other DAX measure that wasn’t a simple aggregation.
You can download the sample workbook for this post here.
If you’re building DAX queries using Power BI Report Builder you might notice something that looks like a new DAX function called RSCustomDaxFilter. For example, here’s a simple DAX query built from the Adventure Works Tabular demo database, with one measure and one parameter built Calendar Year:
Note that the “Enable Multi Value Parameters” option has been selected. Here’s what you’ll see if you view the text of the DAX query in the Query Designer:
"Internet Total Sales",
[Internet Total Sales]
But what is RSCustomDaxFilter? If you run your report and see what happens on the server using SQL Server Profiler, here’s the query that actually gets run (in this case I selected the years 2013 and 2014 for the parameter):
VALUES ( 'Date'[Calendar Year] ),
( 'Date'[Calendar Year] = VALUE ( "2013" ) )
|| ( 'Date'[Calendar Year] = VALUE ( "2014" ) )
"Internet Total Sales", [Internet Total Sales]
What has happened is that RSCustomDaxFilter has been replaced with an expression using the DAX Filter() function that implements the filter on the selected years; it’s just a placeholder for a dynamically-generated DAX expression that is substituted in at runtime.
Why is it needed? Handling multi-value parameters is difficult in DAX when you don’t know how many values are going to be passed to the parameters (it’s a subject I’ve blogged about here and here) and some kind of dynamic code generation is a reasonable solution to this problem.
There is one drawback with this approach though – it can generate a DAX query that is too long to be executed. Here’s the error message you’ll see:
The specified query is too complex to be evaluated as a single statement.
I’m not sure what the maximum length of a query is in DAX – I suspect it’s 32768 characters. You’ll hit this limit if you create a parameter on a field with thousands of values in and then select all but a few of these values; from the example above you can imagine how long the resulting DAX query could be:
There’s no single workaround for this but some custom DAX (such as the example here) will be necessary; you’ll just need to ensure that the DAX query is as short as possible.
Back in 2016 I wrote the following blog post about changes to the way Excel 365 generated MDX queries for PivotTables connected to Analysis Services, Power Pivot/the Excel Data Model and Power BI datasets:
I know it sounds boring and not something you need to worry about but trust me, this is important – these changes solved the vast majority of Excel PivotTable performance problems that I encountered when I was a consultant so you should read the above post before continuing.
Unfortunately, earlier this year these changes had to be partially rolled back because in some rare cases the queries generated returned incorrect results; this means that you may find that values for subtotals and grand totals are again being returned even when they aren’t being displayed. The good news is that you should still be able to get the improved performance with a few minor tweaks.
Using the example from the previous post, a PivotTable connected to SSAS MD and the Adventure Works cube (the queries generated for SSAS Tabular, Power Pivot and Power BI may be slightly different and slightly better but the basic problem is the same), if you build the following in Excel:
…even though the subtotals in the PivotTable are not displayed in Excel, the MDX query generated by Excel not only returns them but also returns others that are not needed at all – in fact 36 rows (although the query returns them as columns) of data rather than the 13 rows that are displayed. Here’s a screenshot of the results returned when you run the MDX query in SQL Server Management Studio:
Any time you see a row or column containing an All Member (in this case All Customers or All Products) you know you are looking at a subtotal or grand total.
In this case, to get an efficient query, you need to explicitly turn off subtotals and grand totals for the PivotTable in Excel:
In the PivotTable the only difference you’ll see is that the grand totals are now not displayed:
But the query generated now only returns the values that are actually needed, and as a result will be a lot more efficient and potentially a lot faster. In this case, the query now only returns the 12 rows of data displayed:
Even with subtotals and grand totals turned off there are still some cases where unwanted values will be returned. Take the following PivotTable, where I have used the Calendar hierarchy from the Date dimension on columns and filtered it so I only see the three months in Q1 CY 2012:
The MDX query generated by Excel requests requests four extra columns with subtotals and a grand total that aren’t displayed:
There is a fairly simple workaround though. By changing how the PivotTable is constructed – in this case by not putting a hierarchy on columns but just the Month Of Year attribute, then adding slicers for Calendar Year and Calendar Quarter to control which months are displayed – you can get an efficient query. This version of the PivotTable:
…results in this query being generated, which only returns the required values:
To sum up, then, if you have a PivotTable that seems a bit slow:
Turn off subtotals and grand totals unless you really need to see them
Look at the MDX query being generated by Excel and see if it is still returning unnecessary subtotals and grand totals, and if it is try rebuilding the PivotTable to get the same results in a different way (for example by not drilling down on hierarchies as in the last example) to get a more efficient query
[All the queries in this post were generated by Excel 365 version 1910 build 12130.20238]
In Shabnam Watson’s recent blog post on a bug she found when trying to create a Live connection from Power BI to Analysis Services she mentioned that the AutoSetDefaultInitialCatalog server property could be used to solve her problem. This piqued my interested because I’d seen this property but had no idea what it did exactly or why it was there. Luckily, now I work for Microsoft, it’s even easier for me to find out about things like this from the dev team and Akshai Mirchandani was able to help.
First of all, what does it do? The documentation on this property has just been added here, and this is what it says:
A Boolean property. When set to true, new client connections automatically default to the first catalog (database) the user has permissions to connect to.
When set to false, no initial catalog is specified. Clients must select a default catalog prior to running queries or discover operations against a database on the server. If no default catalog is specified, an error is returned. If Initial Catalog property is specified in the connection string, the default catalog will be applied from this property.
The default value for this property is true.
Let me illustrate what this means. Say you have an instance of Analysis Services (in this case it’s Tabular, but it could be Multidimensional) with two databases on it:
I’ve expanded the Roles tab for each database reasons that will become clear later.
Next, let’s say you run a simple trace on this server looking at the Discover End and Session Initialize events:
…and while this trace is running, you open up SQL Server Management Studio and connect to the SSAS instance. Here’s what you see in Profiler:
Now, just to be clear, all I did was open up SQL Server Management Studio and connect to the instance. I did not open up a DAX query window or anything like that; all that happened was the list of databases on the instance was displayed in the Object Explorer pane.
The interesting thing to notice from the trace above is that when I did that there are five Session Initialize events and even though the Database column in Profiler is blank, you can see from the list of role names in the TextData column that in each case a connection has been made to the Adventure Works Internet Sales database.
This is because when you open a connection to Analysis Services and do not set the Initial Catalog connection string property, what happens is that you will get a connection to the default database on the instance. Which database is the default? It’s just the first database that the user has permission to access on the instance, which is a bit random.
This happens at other times too. Let’s say you right click on the EmptyDB database and process it in SQL Management Studio:
Here’s what I see in Profiler:
In this case there are three connections to the default database, Adventure Works Internet Sales, when the database I am processing is EmptyDB!
Most of the time these unnecessary connections have no impact at all but sometimes they can cause problems such as the ones Shabnam describes in her blog post. For example:
It can cause performance problems, because there is an overhead to opening a connection – for example roles are evaluated when a connection is opened
Monitoring and auditing gets complicated because, as you can see from the traces above, there are a whole lot of connections to the default database taking place that you aren’t expecting
Most importantly, when a connection is opened a read-commit lock is acquired on that database and in a few rare cases this can cause deadlocks and other locking-related issues
This is why the AutoSetDefaultInitialCatalog server property was introduced. With this server property set to False, when you open a connection to SSAS with no Initial Catalog set, then you get a connection with no database set. You can find this server property in SQL Server Management Studio in the Analysis Server properties dialog (which you can find by right-clicking on your instance name, selecting Properties, and going to the General tab) and checking the Advanced (All) Properties box.
With AutoSetDefaultInitialCatalog set to False, here’s what Profiler shows when I rerun my original test of connecting to SQL Server Management Studio:
Note that there are now no Session Initialize events now.
Here’s what opening up a new MDX query window in SQL Management Studio shows with AutoSetDefaultInitialCatalog set to False if you don’t explicitly set a database when you connect:
Note the empty database dropdown box on the toolbar and the “Error loading metadata: No cubes were found” error message shown in the Metadata pane.
So why didn’t the dev team set AutoSetDefaultInitialCatalog to False by default on new instances? The problem with doing this is that it is a potential breaking change that could cause errors in some client tools. I’m not aware of any specific cases where this might happen but if you did decide to change AutoSetDefaultInitialCatalog to False on your instance you would need to test thoroughly to make sure it didn’t break anything. My feeling is, though, it is probably a good idea to AutoSetDefaultInitialCatalog to False on production servers and do the appropriate testing just in case those unnecessary connections are causing problems.
Users of on-premises Analysis Services will know that most of the useful server properties can be set in SQL Server Management Studio, some (such as MaxIntermediateRowsetSize) can only be set by editing the msmdsrv.ini file. How do you set these properties in Azure Analysis Services though, when there is no msmdsrv.ini file to edit?
The solution is to use an XMLA script to make the change. The easy way to do this is to open up the server properties dialog in SQL Management Studio by right clicking on your instance name in the Object Explorer and selecting Properties:
Then, in the server properties dialog, change any server property but do not click ok. Instead, click on the Script button and then select Script Action to New Query Window:
This will create a new XMLA query window in SSMS (the connection dialog for this window will be open too, which will freeze the server properties dialog, so you’ll need to either connect or dismiss the dialog to close the server properties dialog) with the XMLA script to make the server properties change you made. The actual change won’t take place, though, unless you execute the script – so don’t do that.
Instead, change the name of the server property in the script to the one you actually want to set and enter the value you want to set it to:
Note that you can’t just enter the name of the server property in most cases because server properties can be grouped into sections, so you’ll need to enter the section names too. For example for the MaxIntermediateRowsetSize property you’ll need to enter DAX\DQ\MaxIntermediateRowsetSize.
After that, all you need to do is hit the Execute button and the change will be made.
An interesting – if obscure – fact I learned recently is that a small number of DAX functions such as Median() do not work on tables with more than 2 billion rows in Analysis Services Tabular, Azure AS and Power BI.
It’s quite easy to reproduce in Power BI. The following M expression returns a table with two billion and four rows:
#"Converted to Table" =
#"Changed Type" =
#"Converted to Table",
It takes some time to load this table – around twenty minutes – but because there are only four distinct values in the table the resulting .pbix file is only 31KB thanks to the way Power BI compresses data.
If you load this table into your dataset, call it VeryBigTable and create the following measure:
Median Test = MEDIAN(VeryBigTable[Column1])
…and use the measure in a visual, you’ll see the following error:
The current query cannot be evaluated for the ‘VeryBigTable (42)’ table, because the table contains more than two billion rows.
What’s more, the error will always occur even if you apply a filter to the table that returns less than two billion rows. The same problem occurs with some other functions, such as Percentile(), but it’s worth pointing out that the vast majority of DAX functions work as normal with tables with more than two billion rows – for example, in the pbix file used here the Sum() and CountRows() functions not only work fine but return instantly.
Luckily, in the case of the Median() function, there is an easy workaround because you can calculate a median in other ways such as the one described on the DAX Patterns site here. The code is a lot more verbose but it works on a 2 billion+ row table.