A few weeks ago an important new feature for managing connections to data sources in the Power BI Service was released: Shareable Cloud Connections. You can read the blog post announcing them here. I won’t describe their functionality because the post already does that perfectly well; I want to focus on one thing in particular that is important for anyone using Power BI with Snowflake (and, I believe, BigQuery and probably several other non-Microsoft sources): Shareable Cloud Connections allow you to have multiple connections to the same data source in the Power BI Service, each using different credentials.
Some of you are going to read that last sentence and get very excited. Many of you will probably be surprised that Power BI didn’t already support this. To understand what’s going on here you first have to understand what Power BI considers a “data source”. The answer can be found on this page of the Power Query SDK docs:
The M engine identifies a data source using a combination of its Kind and Path […]
The Path value is derived from the required parameters of your data source function. Optional parameters aren’t factored into the data source path identifier.
In the case of the Snowflake connector, the “Kind” is Snowflake and the “Path” is determined by the two required parameters of the connector, namely the Server and the Warehouse:
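As a rough illustration, here’s a minimal M sketch of a call to the connector’s Snowflake.Databases function (the server and warehouse names below are made up): only the two required parameters contribute to the data source path, so any two connections that use the same Server and Warehouse are treated as the same data source.

let
    //Kind = "Snowflake"; the Path is built from the two required parameters below
    Source = Snowflake.Databases(
        "myaccount.snowflakecomputing.com", //Server - part of the data source path
        "MY_WAREHOUSE"                      //Warehouse - part of the data source path
    )
in
    Source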
Before Shareable Cloud Connections, unless you used a gateway, you could only use one connection with one set of credentials for each data source used in the Power BI Service. For Snowflake this meant you could only use one set of credentials for all datasets that connected to the same Server and Warehouse, which led to a variety of problems, like this one, where different credentials were needed for different Snowflake databases, or this one, where one user would publish a dataset and enter credentials that worked for them, then a second user would publish another dataset, enter different credentials for the same Server/Warehouse combination and break refresh for the first dataset. With most other popular connectors these issues were rarer because their Paths are more specific and better aligned with how you’d want to use different credentials.
As I said, Shareable Cloud Connections solve all this by allowing the creation of multiple named connections to the same source, each of which can use different credentials. As a result I strongly recommend that everyone using Snowflake with Power BI creates new Shareable Cloud Connections and uses them in the Power BI Service.
While I was at the Data Scotland conference in Edinburgh on Friday (great event by the way) I stopped by the Tabular Editor stand and got the nice people there to give me a demo of their new tool, DAX Optimizer. It’s currently in private beta but if you’re curious to learn more, Nikola Ilic has already blogged about it in detail here.
Rather than blog about the tool itself – there’s no point repeating Nikola’s post – I thought it would be good to answer a question someone asked me later that day about Tabular Editor, and one I’m definitely going to be asked about DAX Optimizer, namely:
This looks great, but it’s expensive and it’s hard for me to get sign-off to use third-party tools like this. Why doesn’t Microsoft give me something like this for free?
Before I carry on, let me make a few things clear:
I work for Microsoft but these are my personal opinions.
I have known many of the people involved in Tabular Editor and DAX Optimizer, including Marco and Alberto, for many years and have had business relationships with them in the past before working for Microsoft.
I don’t endorse any non-Microsoft Power BI-related commercial tools here on my blog but I do use many of them and mention them regularly, leaving readers to draw their own conclusions. This post is not an endorsement of Tabular Editor or DAX Optimizer.
With that out of the way let me address some of the different aspects of this question.
There’s a post on the Power BI blog from 2021 here co-written by Marco Russo and Amir Netz which covers Microsoft’s official position on community and third party Power BI development tools and which is still relevant. There’s also a companion article by Marco here that’s worth reading. In summary Microsoft’s long-term goal is to provide great tools for all Power BI developers, including enterprise developers, but in the meantime our priority is to build a solid platform that other people can build these tools on. I know many of you won’t believe me but here at Microsoft we have finite development resources and we need to make difficult decisions about what we invest in all the time. We can’t build every feature that everyone wants immediately and everyone wants different features.
As a result there will always be space for free and commercial third-party tools to innovate in the Power BI ecosystem. In the same way Tabular Editor serves the enterprise tools market, the vendors in the Power BI custom visuals marketplace extend Power BI with custom visuals. There are literally hundreds of other examples I could give in different areas such as planning and budgeting and admin and governance. Why doesn’t Microsoft buy some or all of these tools? We do buy tools vendors sometimes, but I feel these tools and companies tend to fare better outside Microsoft where they can compete with each other and move quickly, and when there’s a vibrant partner ecosystem around a product then the customer is better off too.
DAX Optimizer is slightly different to Tabular Editor and these other tools though. While it is very sophisticated, the tool itself is not the whole point; it’s like a much, much more sophisticated version of Tabular Editor’s Best Practices Analyzer feature, a feature which is available in both the free and paid versions of Tabular Editor. The real value lies in the IP inside DAX Optimizer: these aren’t just any rules, these are Marco and Alberto’s rules for optimising DAX. Anyone could build the tool, but only Marco and Alberto could write these particular rules. I guess that’s why the Tabular Editor team had these stickers on their stand on Friday:
Doesn’t Microsoft have people who are this good at DAX who could write the same rules? We do have people who know more about DAX than Marco and Alberto (namely the people who create it, for example Jeffrey Wang) and we do have people who are extremely good at performance tuning DAX (for example my colleagues Michael Kovalsky or Phil Seamark). Indeed, back in 2021 Michael Kovalsky published a free set of rules here which you can use with Best Practices Analyzer in Tabular Editor and which represent the Power BI CAT team’s best practices recommendations on DAX and modelling, so you can argue that Microsoft already does offer a free solution to the problem that DAX Optimizer is trying to solve.
Marco and Alberto are Marco and Alberto though. They have a very strong brand. Consultancy is a famously hard business to scale and this is a very clever way for them to scale the business of DAX performance tuning. If you want their help in whatever form then you’ll need to pay for it. Couldn’t Microsoft just hire Marco and Alberto? I doubt they’d say yes if we asked, and in any case the situation is the same as with buying the tools I mentioned above: I think they add more value to the Power BI ecosystem outside Microsoft than they ever could inside it.
I’ve been lucky enough to get an invitation code to test DAX Optimizer and will be doing so this week, but I deliberately wrote this post before giving it a try. It’s important for me to stay up-to-date with everything happening in the world of Power BI because the customers I work with ask for my opinion. I wish the team behind it well in the same way I wish anyone who tries to build a business on top of Power BI well; the more successful they are, the more successful Power BI and Fabric are.
Sometimes when you’re importing data from files using Power Query in either Power BI or Excel you may encounter the following error:
DataFormat.Error: External table is not in the expected format
What causes it? TL;DR it’s because you’re trying to load data from one type of file, probably Excel (I don’t think you can get this error with any other source but I’m not sure), and actually connecting to a different type of file.
Let’s see a simple example. Say you have a folder with two files: one is an Excel file called Date.xlsx and one is a CSV file called Date.csv.
Here’s the M code for a Power Query query that connects to the Excel file and reads the data from a table in it:
let
Source = Excel.Workbook(File.Contents("C:\MyFolder\Date.xlsx"), null, true),
Date_Table = Source{[Item = "Date", Kind = "Table"]}[Data]
in
Date_Table
Now, if you change the file path in this query – and only the file path – to point at the CSV file instead like so:
let
Source = Excel.Workbook(File.Contents("C:\MyFolder\Date.csv"), null, true),
Date_Table = Source{[Item = "Date", Kind = "Table"]}[Data]
in
Date_Table
…you will get the “external table is not in the expected format” error shown above. This is because your code is using the Excel.Workbook M function, which is used to import data from Excel workbooks, to connect to a file that is a CSV file and not an Excel workbook. The way to fix it is to use the appropriate function, in this case Csv.Document, to access the file like so:
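For example, the fixed query might look something like this – a sketch only, since the delimiter, encoding and whether you need to promote headers all depend on your file:

let
    Source = Csv.Document(
        File.Contents("C:\MyFolder\Date.csv"),
        [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    //promote the first row to column headers if the CSV file has a header row
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    #"Promoted Headers"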
To be honest, if making this change is beyond your Power Query skills and you’re sure you’re trying to connect to the right file, you’re better off creating a completely new query rather than editing the query you already have.
Another common scenario where you might encounter this error is when you’re importing data from all the files in a folder and one of the files isn’t in the correct format. For example, let’s say you have a folder with three Excel files in it and you use the Folder data source to import all the data from all three files:
Since all three files are Excel files the Folder option will work:
However, if you take a CSV file and drop it into the folder like so:
Then you’ll get the same error in Power Query:
Apart from deleting the CSV file, in this case you have another option to solve the problem: filtering the folder so you only try to get data from the .xlsx files and no other file type. To do this, click on the step that is called “Source”. When you do this you’ll see that the step returns a table containing all the files in the folder you’re pointing at:
You’ll see that the table in this step contains a column called Extension which contains the file extension for each file. If you filter this table – which will insert a new step at this point in the query, which is ok – by clicking on the down arrow in the Extension column, deselecting the (Select All) option and selecting “.xlsx” so the table only contains .xlsx files, then you can avoid this problem:
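The filter step that the UI generates should look something like this (the step name may differ, and you could wrap Extension in Text.Lower if some files have upper-case extensions):

//new step inserted after the Source step
#"Filtered Rows" = Table.SelectRows(Source, each ([Extension] = ".xlsx"))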
If, as in this example, the rogue file happens to be the first file in the folder and you’ve selected that first file to be your “sample” file when setting up the import, then you’ll also need to go to the query called Sample File in the Queries pane and make exactly the same change there (ie click on the Source step and filter to remove any non .xlsx files).
A common requirement from Power BI customers in highly-regulated industries is the need to log users out of Power BI if they have been inactive for a certain amount of time. If your Power BI reports contain extremely sensitive data you don’t want someone to open a report, leave their desk for lunch, forget to lock their PC and let everyone in the office see what’s on their screen, for obvious reasons. This has actually been possible for some time now with Power BI and is now supported for Fabric, so I thought I’d write a blog post to raise awareness.
The feature that makes this possible is Microsoft 365’s Idle Session Timeout, which you can read about here:
To turn it on, a Microsoft 365 admin has to go to the M365 admin centre, then Org Settings/Security & Privacy, and select Idle Session Timeout. There you can set the amount of time to wait before users are logged out:
Once that is set, anyone who has Power BI open in their browser but doesn’t interact with it will see the following message after the specified period of time:
Your session is about to expire
Your organization’s policy enforces automatic sign out after a period of inactivity on Microsoft 365 web applications.
Do you want to stay signed in?
There are a few things to point out about how this works (read this for the full details):
You can’t turn it on for just Power BI, you have to turn it on for all supported Microsoft 365 web apps. This includes Outlook and the other Office web apps
You can’t turn it on for specific users – it has to be for the whole organisation
Users won’t get signed out if they get single sign-on into the web app from the device-joined account, or select “Stay signed in” when they log in (an option that can be hidden), or if they’re on a managed device and using a supported browser like Edge or Chrome
You’ll need to be on friendly terms with your M365 admin if you want to use this, clearly, but if you need this functionality it makes sense to enforce activity-based timeout rules for more apps than just Power BI.
One question I frequently see asked on Power BI forums is whether it’s possible to run Power BI Desktop on a Mac or indeed on anything other than a Windows PC. There are already a lot of detailed blog posts and videos out there on this subject, such as this one from Guy In A Cube: the answer is no, you can’t run Power BI Desktop natively on a Mac or any other OS apart from Windows and there are no plans to port it over, so you need to either install Windows somehow (for example with Boot Camp) or use tools like Parallels or Turbo.Net to run Power BI Desktop. You can also spin up a Windows VM, for example in Azure, and run Power BI Desktop on that; Power BI Desktop is also now fully supported on Azure Virtual Desktop, although not on other virtual environments like Citrix.
Turning the question around, however, leads you to some aspects of the question that haven’t been fully explored. Instead of asking “Can I run Power BI Desktop on my Mac?”, you can instead ask “Can I do all of my Power BI development using only a browser?”. At Microsoft our long-term goal is to make all Power BI development web-based, but how close are we to that goal?
The first point to make is that it has always been possible to build Power BI reports (as opposed to datasets) in the browser without needing Power BI Desktop. You can now even build basic paginated reports in the browser too. Historically I’ve never been a fan of encouraging users to do this because developing in Power BI Desktop gives you the chance to roll back to a previous version of the report if you need to – assuming you have saved those previous versions of your .pbix file. What’s more, if two or more people try to edit the same report at the same time then the last person to save wins and overwrites the other person’s changes, which can be dangerous. Fabric’s Git integration, which does work for Power BI reports, has changed my attitude somewhat though. As Rui Romano discusses here you can now safely make changes to reports in the Power BI Service, save them to source control and then roll back if you need to; this assumes your users are comfortable using Git, however, and it doesn’t solve the simultaneous development problem.
What about dataset development? Web editing for datasets has been in preview for a few months now and is getting better and better, although there are still several limitations and the focus up to now has been on modelling; connecting to data sources is on the public roadmap though. As a result Power BI Desktop is still needed for dataset development, at least for now.
Do datamarts change anything? Or Direct Lake mode in Fabric? Datamarts do solve the problem of being able to connect to and load data using just your browser and are available (if not GA yet) today. If you’re only using datamarts to avoid the need for a Windows PC to develop on, though, you’re paying a price: for a start you’ll either be loading the data twice if you want to use Import mode for your dataset (once to load data into the datamart, once to load the same data into the dataset) or taking the query performance hit of using DirectQuery mode. There are also some other limitations to watch out for. Fabric Direct Lake mode datasets, for me, offer all the benefits of Datamarts without so many of the limitations – Direct Lake mode means you only load the data once and still get near-Import mode performance, for example – and will be the obvious choice when Fabric GAs and once features like OneSecurity are available. With Fabric it will be possible for most Power BI developers to do all their work using only a browser, although for more complex projects (and to be clear this is only a small minority of projects) it will still be necessary to use other tools such as Tabular Editor, DAX Studio, SQL Server Management Studio and SQL Server Profiler which can only run on a Windows PC. I can imagine some of this more advanced developer functionality coming to the browser too in time, though.
In summary, while Power BI Desktop and therefore Windows is still needed for Power BI development today, the day when you can do most and maybe all of your development in the browser is in sight. All you Mac owners need to be patient just a little while longer!
A few months ago a new option was added to the Sql.Database and Sql.Databases functions in Power Query in Power BI and Excel which allows Power Query queries that combine data from different SQL Server databases to fold. Here’s a simple example showing how to use it.
On my local PC I have SQL Server installed and the Adventure Works DW 2017 and Contoso Retail DW sample databases:
Both of these databases have date dimension tables called DimDate. Let’s say you want to create a Power Query query that merges these two tables.
Here’s the M code for a Power Query query called DimDate AW to get just the DateKey and CalendarYear columns from the DimDate table of the Adventure Works DW 2017 database:
let
Source = Sql.Database("localhost", "AdventureWorksDW2017"),
dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
#"Removed Other Columns" = Table.SelectColumns(dbo_DimDate,{"DateKey", "CalendarYear"})
in
#"Removed Other Columns"
Here’s the M code for a Power Query query called DimDate Contoso to get just the Datekey and CalendarYear columns from the DimDate table in the ContosoRetailDW database:
let
Source = Sql.Database("localhost", "ContosoRetailDW"),
dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
#"Removed Other Columns" = Table.SelectColumns(dbo_DimDate,{"Datekey", "CalendarYear"})
in
#"Removed Other Columns"
Both of these Power Query queries fold. However if you create a third query to merge these two queries (ie do the equivalent of a SQL join between them) on the CalendarYear columns like so:
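Here’s a sketch of what that third query might look like – I’ve assumed an inner join and used Table.NestedJoin, which is what the Merge Queries feature in the Power Query Editor generates:

let
    Source = Table.NestedJoin(
        #"DimDate AW", {"CalendarYear"},
        #"DimDate Contoso", {"CalendarYear"},
        "DimDate Contoso", JoinKind.Inner
    ),
    //expand a column from the Contoso table after the join
    #"Expanded DimDate Contoso" = Table.ExpandTableColumn(
        Source, "DimDate Contoso", {"Datekey"}, {"Contoso Datekey"}
    )
in
    #"Expanded DimDate Contoso"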
…this query does not fold, because it combines data from two different SQL Server databases.
However if you edit the Sql.Database function in the Source step of both of the first two queries above to set the new EnableCrossDatabaseFolding option to true, like so:
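Here’s a sketch of the two amended Source steps, where only the third parameter (the options record) is new:

//in the DimDate AW query
Source = Sql.Database("localhost", "AdventureWorksDW2017", [EnableCrossDatabaseFolding = true])

//in the DimDate Contoso query
Source = Sql.Database("localhost", "ContosoRetailDW", [EnableCrossDatabaseFolding = true])

…then, with the option set in both queries, the merge query does fold.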
In the second post in this series I discussed a KQL query that can be used to analyse Power BI refresh throughput at the partition level. However, if you remember back to the first post in this series, it’s actually possible to get much more detailed information on throughput by looking at the ProgressReportCurrent event, which fires once for every 10000 rows read during partition refresh.
Analysing the ProgressReportCurrent event data takes yet another mammoth KQL query. Here’s what such a query needs to do:
It filters the Log Analytics data down to get events from the last day and just the ProgressReportCurrent events, as well as the ProgressReportBegin/End events which are fired before and after the ProgressReportCurrent events.
It then splits the data into groups of rows (‘partitions’ in KQL, but of course not the partitions that are being refreshed) by a combination of XmlaRequestId (ie the refresh operation) and XmlaObjectPath (ie the partition that is being refreshed)
For each group of rows it will then:
Find the ProgressReportBegin event and from this get the time when data started to be read from the source
Get all subsequent ProgressReportCurrent events and calculate the amount of time elapsed since the previous event (which might be the ProgressReportBegin event or a previous ProgressReportCurrent event) and the number of rows read
When the ProgressReportEnd event is encountered, calculate the amount of time elapsed since the previous ProgressReportCurrent event and the number of rows (which will be less than 10000) read since then
Filter out the ProgressReportBegin events because we don’t need them any more
Finally, add columns that split out the table name and partition name, and calculate the number of rows read per second for each row by dividing the number of rows read for each event by the amount of time elapsed since the previous event
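Here’s a simplified sketch of a query along these lines: it uses sort and prev() to compare each event with the one before it for the same refresh and partition, rather than the partition operator, so treat the details as an approximation:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
//ProgressReportCurrent events plus the ReadData ProgressReportBegin/End events that bracket them
| where OperationDetailName == "ReadData"
| where OperationName in ("ProgressReportBegin", "ProgressReportCurrent", "ProgressReportEnd")
| project TimeGenerated, OperationName, XmlaRequestId, XmlaObjectPath, ProgressCounter
//sort so that prev() returns the previous event for the same refresh and partition
| sort by XmlaRequestId asc, XmlaObjectPath asc, TimeGenerated asc
| extend SameGroup = prev(XmlaRequestId) == XmlaRequestId and prev(XmlaObjectPath) == XmlaObjectPath
| extend SecondsSincePreviousEvent = iif(SameGroup, (TimeGenerated - prev(TimeGenerated)) / 1s, real(null)),
    RowsReadSincePreviousEvent = iif(SameGroup, ProgressCounter - coalesce(prev(ProgressCounter), 0), ProgressCounter)
//the Begin events are only needed to supply a start time for the first ProgressReportCurrent event
| where OperationName != "ProgressReportBegin"
| extend Table = tostring(split(XmlaObjectPath, ".", 2)[0]), Partition = tostring(split(XmlaObjectPath, ".", 3)[0]),
    RowsReadPerSecond = RowsReadSincePreviousEvent / SecondsSincePreviousEvent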
What can this query tell us about throughput?
First of all, something interesting but not necessarily useful. At least for the data source I’m using for my tests, when I plot a column chart with the number of rows read on the x axis and the amount of time elapsed since the last event on the y axis (ie the amount of time it takes to read 10000 rows, for all but the last column), I noticed that every 200000 rows something happens to slow down the read:
I have no idea what this is, whether it’s a quirk of this particular source or connector, but it’s a great example of the kind of patterns that become obvious when you visualise data rather than look at a table of numbers.
Plotting time on the x axis of a line chart and the cumulative total of rows read on the y axis gives you something more useful. Here’s the chart for one of the refreshes mentioned in my last post where four partitions of the same table are refreshed in parallel:
In this case throughput is fine up until the end of the refresh, at which point something happens to the February, March and April partitions – but not the January partition – that slows them down for about 30 seconds, after which throughput goes back to what it was before. Here’s the same chart zoomed in a bit:
Here’s the same problem shown in the first graph above, where the number of rows read is on the x axis, showing how for example with the April partition there’s a sudden spike where it takes 14 seconds to read 10000 rows rather than around 0.3 seconds:
What is this, and why isn’t the January partition affected? Maybe it was a network issue or caused by something happening in the source database? Looking at another refresh that also refreshes the same four partitions in parallel, it doesn’t seem like the same thing happens – although if you look closely at the middle of the refresh there might be a less pronounced flattening off:
Again, the point of all this is not the mysterious blips I’ve found in my data but the fact that if you take the same query and look at your refreshes, you may find something different, something more significant and something you can explain and do something about.
In the first post in this series I described the events in Log Analytics that can be used to understand throughput – the speed at which Power BI can read data from your data source – during refresh. While the individual events are easy to understand when you look at a simple example, they don’t make it easy to analyse the data in the real world, so here’s a KQL query that takes all the data from all these events and gives you one row per partition per refresh:
//Headline stats for partition refresh with one row for each partition and refresh
//Get all the data needed for this query and buffer it in memory
let RowsForStats =
materialize(
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == "ProgressReportEnd"
| where OperationDetailName == "ExecuteSql" or OperationDetailName == "ReadData"
or (OperationDetailName == "TabularRefresh" and (EventText contains "partition"))
);
//Get just the events for the initial SQL execution phase
let ExecuteSql =
RowsForStats
| where OperationDetailName == "ExecuteSql"
| project XmlaRequestId, XmlaObjectPath,
ExecuteSqlStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ),
ExecuteSqlEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ),
ExecuteSqlDurationMs = DurationMs, ExecuteSqlCpuTimeMs = CpuTimeMs;
//Get just the events for the data read and calculate rows read per second
let ReadData =
RowsForStats
| where OperationDetailName == "ReadData"
| project XmlaRequestId, XmlaObjectPath,
ReadDataStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ),
ReadDataEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ),
ReadDataDurationMs = DurationMs, ReadDataCpuTime = CpuTimeMs,
TotalRowsRead = ProgressCounter, RowsPerSecond = ProgressCounter /(toreal(DurationMs)/1000);
//Get the events for the overall partition refresh
let TabularRefresh =
RowsForStats
| where OperationDetailName == "TabularRefresh"
| parse EventText with * '[MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: ' MashupPeakMemoryKB:long ' KB]'
| project XmlaRequestId, XmlaObjectPath,
TabularRefreshStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ),
TabularRefreshEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ),
TabularRefreshDurationMs = DurationMs, TabularRefreshCpuTime = CpuTimeMs,
MashupCPUTimeMs, MashupPeakMemoryKB;
//Do an inner join on the three tables so there is one row per partition per refresh
ExecuteSql
| join kind=inner ReadData on XmlaRequestId, XmlaObjectPath
| join kind=inner TabularRefresh on XmlaRequestId, XmlaObjectPath
| project-away XmlaRequestId1, XmlaRequestId2, XmlaObjectPath1, XmlaObjectPath2
| extend Table = tostring(split(XmlaObjectPath,".", 2)[0]), Partition = tostring(split(XmlaObjectPath,".", 3)[0])
| project-reorder XmlaRequestId, Table, Partition
| order by XmlaRequestId, ExecuteSqlStartTime desc
It’s a bit of a monster query but what it does is quite simple:
First it gets all the events relating to partition refresh in the past 1 day (which of course you can change) and materialises the results.
Then it filters this materialised result to get three tables:
All the ExecuteSql events, which tell you how long the data source took to start returning data and how much CPU time was used.
All the ReadData events, which tell you how long Power BI took to read all the rows from the source after the data started to be returned, how much CPU time was used, and how many rows were read. Dividing duration by rows read lets you calculate the number of rows read per second during this phase.
All the TabularRefresh events, which give you overall data on how long the partition refresh took, how much CPU time was used, plus information on Power Query peak memory usage and CPU usage.
What can this tell us about refresh throughput though? Let’s use it to answer some questions we might have about throughput.
What is the impact of parallelism on throughput? I created a dataset on top of the NYC taxi data Trip table with a single table, and in that table created four partitions containing data for January, February, March and April 2013, each of which contained 13-15 million rows. I won’t mention the type of data source I used because I think it distracts from what I want to talk about here, which is the methodology rather than the performance characteristics of a particular source.
I then ran two refreshes of these four partitions: one which refreshed them all in parallel and one which refreshed them sequentially, using custom TMSL refresh commands and the maxParallelism property as described here. I did a refresh of type dataOnly, rather than a full refresh, in the hope that it would reduce the number of things happening in the Vertipaq engine during refresh that might skew my results. Next, I used the query above as the source for a table in Power BI (for details on how to use Log Analytics as a source for Power BI see this post; I found it more convenient to import data rather than use DirectQuery mode though) to visualise the results.
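For reference, this kind of refresh can be run via the XMLA Endpoint with TMSL along these lines – a sketch with placeholder dataset and partition names, wrapping the refresh in a Sequence command so maxParallelism can be set (1 for the sequential test, 4 for the parallel one):

{
  "sequence": {
    "maxParallelism": 4,
    "operations": [
      {
        "refresh": {
          "type": "dataOnly",
          "objects": [
            { "database": "NYCTaxiDataset", "table": "Trip", "partition": "January" },
            { "database": "NYCTaxiDataset", "table": "Trip", "partition": "February" },
            { "database": "NYCTaxiDataset", "table": "Trip", "partition": "March" },
            { "database": "NYCTaxiDataset", "table": "Trip", "partition": "April" }
          ]
        }
      }
    ]
  }
}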
Comparing the amount of time taken for the SQL query used to start to return data (the ExecuteSqlDurationMs column from the query above) for the four partitions for the two refreshes showed the following:
The times for the four partitions vary a lot for the sequential refresh but are very similar for the parallel refresh; the January partition, which was refreshed first, is slower in both cases. The behaviour I described here regarding the first partition refreshed in a batch could be relevant.
Moving on to the Read Data phase, looking at the number of rows read per second (the RowsPerSecond column from the query above) shows a similar pattern:
There’s a lot more variation in the sequential refresh. Also, as you would expect, the number of rows read per second is much higher when partitions are refreshed sequentially compared to when they are refreshed in parallel.
Looking at the third main metric, the overall amount of time taken to refresh each partition (the TabularRefreshDurationMs column from the query above) again shows no surprises:
Each individual partition refreshes a lot faster in the sequential refresh – almost twice as fast – compared to the parallel refresh. Since four partitions are being refreshed in parallel during the second refresh, though, this means that any loss of throughput for an individual partition as a result of refreshing in parallel is more than compensated for by the parallelism, making the parallel refresh faster overall. This can be shown by plotting the TabularRefreshStartTime and TabularRefreshEndTime columns from the query above on a timeline chart (in this case the Craydec Timelines custom visual) for each refresh and each partition:
On the left of the timeline you can see the first refresh where the partitions are refreshed sequentially, and how the overall duration is just over 20 minutes; on the right you can see the second refresh where the partitions are refreshed in parallel, which takes just under 10 minutes. Remember also that this is just looking at the partition refresh times, not the overall time taken for the refresh operation for all partitions, and it’s only a refresh of type dataOnly rather than a full refresh.
So does this mean more parallelism is better? That’s not what I’ve been trying to say here: more parallelism is better for overall throughput in this test but if you keep on increasing the amount of parallelism you’re likely to reach a point where it makes throughput and performance worse. The message is that you need to test to see what the optimal level of parallelism – or any other factor you can control – is for achieving maximum throughput during refresh.
These tests only show throughput at the level of the ReadData event for a single partition, but as mentioned in my previous post there is even more detailed data available with the ProgressReportCurrent event. In my next post I’ll take a closer look at that data.
[Thanks to Akshai Mirchandani for providing some of the information in this post, and hat-tip to my colleague Phil Seamark who has already done some amazing work in this area]
If you’re tuning refresh for a Power BI Import mode dataset one of the areas you’ll be most interested in is throughput, that is to say how quickly Power BI can read data from the data source. It can be affected by a number of different factors: how quickly the data source can return data; network latency; the efficiency of the connector you’re using; any transformations in Power Query; the number of columns in the data and their data types; the number of other objects in the same Power BI dataset being refreshed in parallel; and so on. How do you know if any or all of these factors are a problem for you? It’s a subject that has always interested me and now that Log Analytics for Power BI datasets is GA we have a powerful tool to analyse the data, so I thought I’d do some testing and write up my findings in a series of blog posts.
The first thing to understand is what events there are in Log Analytics that provide data on throughput. The events to look at are ProgressReportBegin, ProgressReportCurrent and ProgressReportEnd (found in the OperationName column), specifically those with OperationDetailName ExecuteSql, ReadData and TabularRefresh. Consider the following KQL query that looks at this data for a single refresh and for a single partition:
let RefreshId = "e5edc0de-f223-4c78-8e2d-01f24b13ccdc";
let PartitionObjectPath = "28fc7514-d202-4969-922a-ec86f98a7ea2.Model.TIME.TIME-ffca8cb8-1570-4f62-8f04-993c1d1d17cb";
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(3d)
| where XmlaRequestId == RefreshId
| where XmlaObjectPath == PartitionObjectPath
| where OperationDetailName == "ExecuteSql" or OperationDetailName == "ReadData" or OperationDetailName == "TabularRefresh"
| project XmlaObjectPath, Table = split(XmlaObjectPath,".", 2)[0], Partition = split(XmlaObjectPath,".", 3)[0],
TimeGenerated, OperationName, OperationDetailName, EventText, DurationMs, CpuTimeMs, ProgressCounter
| order by XmlaObjectPath, TimeGenerated asc;
Some notes on what this query does:
All the data comes from the PowerBIDatasetsWorkspace table in Log Analytics
I’ve put an arbitrary filter on the query to only look for data in the last three days, which I know contains the data for the specific refresh I’m interested in
An individual refresh operation can be identified by the value in the XmlaRequestId column and I’m filtering by that
An individual partition being refreshed can be identified by the value in the XmlaObjectPath column and I’m filtering by that too
The value in the XmlaObjectPath column can be parsed to obtain both the table name and partition name
The query filters the events down to those mentioned above
Here are some of the columns from the output of this query, showing data for the refresh of a single table called TIME with a single partition:
Some more notes on what we can see here:
The TimeGenerated column gives the time of the event. Although the time format shown here only shows seconds, it actually contains the time of the event going down to the millisecond level – unlike in a Profiler trace, where time values are rounded to the nearest second and are therefore a lot less useful.
The first event returned by this query is a ProgressReportBegin event of type TabularRefresh which marks the beginning of the partition refresh.
As I blogged here, after that are a pair of ProgressReportBegin/End events of type ExecuteSql. The value in the DurationMs column tells you how long it takes for the data source (which includes time taken by the actual query generated against the data source and any Power Query transformations) to start to return data – which was, in this case, 6016 milliseconds or 6 seconds.
Next there is a ProgressReportBegin event which indicates the beginning of data being read from the source.
After that there are a series of ProgressReportCurrent events which mark the reading of chunks of 10000 rows from the source. The ProgressCounter column shows the cumulative number of rows read.
Next there is a ProgressReportEnd event that marks the end of the data read. The ProgressCounter shows the total number of rows read (which must be less than 10000 rows greater than the value in the ProgressCounter column for the previous ProgressReportCurrent event); the DurationMs column shows the total time taken to read the data. In this case 86430 rows of data were read in 1.4 seconds.
Finally there is a ProgressReportEnd event of type TabularRefresh, the counterpart of the ProgressReportBegin event returned first by this query. It not only shows, in the DurationMs column, the total amount of time taken and the CPU time used to refresh the partition (which includes other operations that have nothing to do with throughput, such as compressing data and building dictionaries), but, as I blogged here, it also includes data on the CPU and peak memory used by Power Query during the refresh. In this case the total refresh time was 7.6 seconds, so not much more than the time taken to read the data.
Clearly there’s a lot of useful, interesting data here, but how can we analyse it or visualise it? Stay tuned for my next post…
If you’re excited about Direct Lake mode in Fabric you’re probably going to want to test it with some of your own data, and in particular look at DAX query performance. Before you do so, though, there are a few things to know about performance testing with Direct Lake datasets that are slightly different from what you might be used to with Import mode or DirectQuery datasets.
Dataset “hotness”
In my last post I talked about how, in Direct Lake datasets, Power BI can page individual column segments, dictionaries and join indexes into memory on demand when a DAX query is run and how those artefacts may get paged out later on. It therefore follows that there are four possible states or levels of “hotness” that a dataset can be in when a DAX query is run and that each of these states will have different performance characteristics:
1. The column segments, dictionaries and join indexes needed to answer a query are not held in memory and need to be paged in before the query can run.
2. Some of the column segments, dictionaries and join indexes needed to answer a query are not held in memory and need to be paged in, while some of them are already in memory.
3. All of the column segments, dictionaries and join indexes needed by the query are already held in memory.
4. All of the column segments, dictionaries and join indexes needed by the query are already held in memory and, as a result of previous query activity, Vertipaq engine caches useful for the query are also already populated.
State (1) is the “coldest” state and will give the worst possible query performance while state (4) is the “hottest” state and will give the best possible query performance.
When you’re testing the performance of a DAX query on a Direct Lake dataset you should test it on a dataset that is in state (1), state (3) and state (4) so you get a good idea of how much time is taken to page data into memory and how much of a performance improvement Vertipaq engine caching brings.
Ensuring everything is paged out
To test query performance in state (1) you need a way to ensure that all column segments, dictionaries and join indexes are paged out of memory. At the time of writing this post you can ensure this simply by refreshing the dataset. This will change sometime in the next few months though, because paging everything out of memory when you refresh is not ideal behaviour in the real world, so there will be another way to ensure everything is paged out. I’ll update this post when that change happens. You can check that all column segments and dictionaries have been paged out by running the two DMV queries mentioned in my previous post: DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS and DISCOVER_STORAGE_TABLE_COLUMNS.
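For reference, those DMVs can be queried through the XMLA Endpoint (for example from DAX Studio), one statement at a time:

SELECT * FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS

SELECT * FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS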
Why you should use DAX Studio for performance testing
I also recommend using DAX Studio to run queries when performance testing, for a number of reasons. First of all it makes it easy to clear the Vertipaq engine cache before a query is run with the “Clear Cache” and “Clear on Run” (which automatically clears the cache before each query) buttons. This not only runs a Clear Cache command to clear the Vertipaq cache, but also runs a (hidden) DAX query that does not query any data from the dataset but which does trigger the creation of all the measures on the dataset. In most cases this is very fast, but if you have thousands of measures it could take a few seconds (similar to what I show here for report-level measures). If you are not using DAX Studio you can achieve the same result by running a query like:
EVALUATE {1}
DAX Studio also lets you run the same query multiple times using its “Run benchmark” feature (although this only lets you test for states (3) and (4) at the time of writing) and its “Server Timings” feature is invaluable for understanding what’s going on inside the Vertipaq engine when a query runs.
Also make sure you are running the very latest version of DAX Studio (which is 3.0.8 at the time of writing) to make sure it works properly with Fabric.
Performance testing methodology
So, putting this all together, in order to run a single performance test on a Direct Lake dataset to capture performance for states (1), (3) and (4):
Preparation
Create a custom Power BI dataset in Fabric (not everything here works with a default dataset at the time of writing)
Open your report in Power BI Desktop connected to your published Direct Lake dataset
Capture the DAX queries generated by the visuals you want to test using Performance Analyzer by clicking Copy Query (see here for how to do this)
Install the very latest version of DAX Studio and open it
Connect DAX Studio to your workspace’s XMLA Endpoint (see here for how to do this)
Paste your DAX query into DAX Studio
Turn on DAX Studio’s Server Timings option
Ensure the “Clear on Run” option in the ribbon is turned off
To test performance for state (1):
Refresh your dataset to ensure everything is paged out of memory
Click the “Clear Cache” button on the ribbon in DAX Studio
Run the DAX query
Save the output of the Server Timings pane in DAX Studio by clicking the Export button
To test performance for state (3), immediately after the previous steps:
Click the “Clear Cache” button on the ribbon in DAX Studio
Run the DAX query
Save the output of the Server Timings pane in DAX Studio
To test the performance for state (4), immediately after the previous steps:
Run the DAX query again without clicking the “Clear Cache” button
Save the output of the Server Timings pane
Fallback to DirectQuery
Even with a Direct Lake dataset there is no guarantee that your query will be answered in Direct Lake mode: in as-yet not fully documented scenarios (but basically if your data volumes are too large for the capacity you’re using) Power BI will switch to using DirectQuery mode against the Lakehouse’s SQL Endpoint. One of the objectives of your performance testing should be to make sure that this happens as infrequently as possible because DirectQuery mode will perform noticeably worse than Direct Lake mode.
You may notice some DirectQuery events even when the query itself only uses Direct Lake; these are used to get metadata or security information from the Lakehouse and can be ignored. Here’s an example of this:
Load testing
Testing for a single user is important, but don’t forget about testing performance with multiple concurrent users. As I discuss here, realistic load tests are the only way to get a good idea of how your report will actually perform in production and you can load test a Direct Lake dataset in exactly the same way as an Import mode or DirectQuery dataset.
Direct Lake is still in preview!
The last point to make is that, like the rest of Fabric, Direct Lake is still in preview. This not only means that functionality is missing, it also means that performance is not yet as good as it will be. So, by all means test Direct Lake and tell us how fast (or not) it is, but be aware that your test results will be out of date very quickly as the engine evolves.
[Thanks to Krystian Sakowski for the information in this post]