To be honest I’m slightly ashamed of this fact because, as I say in the post, the solution I describe is a bit of a hack – but at the same time, the post is popular because a lot of people have the problem of needing to add new data to the data that’s already there in their Power BI dataset and there’s no obvious way of doing that. As I also say in that post, the best solution is to stage the data in a relational database or some other store outside Power BI so you have a copy of all the data if you ever need to do a full refresh of your Power BI dataset.
Why revisit this subject? Well, with Fabric it’s now much easier for you as a Power BI developer to build that place to store a full copy of your data outside your Power BI dataset and solve this problem properly. For a start, you now have a choice of where to store your data: either in a Lakehouse or a Warehouse, depending on whether you feel more comfortable using Spark and notebooks or relational databases and SQL to manage your data. What’s more, with Dataflows Gen2, when you load data to a destination you now have the option to append new data to existing data as well as to replace it:
If you need more complex logic to make sure you only load new records and not ones that you’ve loaded before, there’s a published pattern for that.
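If you do need that kind of logic, here's a very rough sketch of the idea in Power Query M. Everything in it is hypothetical: #"Existing Sales" is assumed to be a query that reads the destination table in the Lakehouse, #"Source Sales" a query that reads the source system, and OrderDate the column used to detect new rows; the query's destination would then be set to use the Append option:
let
//Find the most recent date already loaded to the destination table
LastLoadedDate = List.Max(#"Existing Sales"[OrderDate]),
//Keep only the source rows that arrived after that date
NewRows = Table.SelectRows(#"Source Sales", each [OrderDate] > LastLoadedDate)
in
NewRows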
“But I’m a Power BI developer, not a Fabric developer!” I hear you cry. Perhaps the most important point to make about Fabric is that Power BI is Fabric. If you have Power BI today, you will have Fabric soon if you don’t have the preview already – they are the same thing. One way of thinking about Fabric is that it’s just Power BI with a lot more stuff in it: databases, notebooks, Spark and pipelines as well as reports, datasets and dataflows. There are new skills to learn but solving this problem with the full range of Fabric workloads is a lot less complex than the pure Power BI approach I originally described.
“But won’t this be expensive? Won’t it need a capacity?” you say. It’s true that to do all this you will need to buy a Fabric capacity. But Fabric capacities start at a much cheaper price than Power BI Premium capacities: an F2 capacity costs $0.36 USD per hour or $262.80 USD per month, and OneLake storage costs $0.023 per GB per month (for more details see this blog post and the docs), so the entry point is a lot more affordable than Power BI Premium.
So, with Fabric, there’s no need for complex and hacky workarounds to solve this problem. Just spin up a Fabric capacity, create a Warehouse or Lakehouse to store your data, use Dataflows Gen2 to append new data to any existing data, then build your Power BI dataset on that.
A few weeks ago an important new feature for managing connections to data sources in the Power BI Service was released: Shareable Cloud Connections. You can read the blog post announcing them here. I won’t describe their functionality because the post already does that perfectly well; I want to focus on one thing in particular that is important for anyone using Power BI with Snowflake (and, I believe, BigQuery and probably several other non-Microsoft sources): Shareable Cloud Connections allow you to have multiple connections to the same data source in the Power BI Service, each using different credentials.
Some of you are going to read that last sentence and get very excited. Many of you will probably be surprised that Power BI didn’t already support this. To understand what’s going on here you first have to understand what Power BI considers a “data source”. The answer can be found on this page of the Power Query SDK docs:
The M engine identifies a data source using a combination of its Kind and Path […]
The Path value is derived from the required parameters of your data source function. Optional parameters aren’t factored into the data source path identifier.
In the case of the Snowflake connector, the “Kind” of the connector is Snowflake and the “Path” is determined by the two required parameters of the Snowflake connector, namely the Server and the Warehouse:
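As a rough illustration (the account URL, warehouse name and role shown here are all made up, and the Role option is just one example of an optional parameter), both of the following expressions resolve to the same data source path, because only the two required parameters are used to build it:
let
SameSource1 = Snowflake.Databases("myaccount.snowflakecomputing.com", "MYWAREHOUSE"),
//Optional parameters don't change the path, so this is still the same data source
SameSource2 = Snowflake.Databases("myaccount.snowflakecomputing.com", "MYWAREHOUSE", [Role = "ANALYST"])
in
SameSource2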
Before Shareable Cloud Connections, unless you used a gateway, you could only use one connection with one set of credentials for each data source used in the Power BI Service. For Snowflake this meant you could only use one set of credentials for all datasets that connected to the same Server and Warehouse, which led to a variety of problems: problems like this one, where different credentials were needed for different Snowflake databases, or like this one, where one user would publish a dataset and enter credentials that worked for them, and then a second user would publish another dataset, enter different credentials for the same Server/Warehouse combination and break refresh for the first dataset. With most other popular connectors these issues were rarer because their Paths are more specific and better aligned to how you’d want to use different credentials.
As I said, Shareable Cloud Connections solve all this by allowing the creation of multiple named connections to the same source, each of which can use different credentials. As a result I strongly recommend that everyone using Snowflake with Power BI creates new Shareable Cloud Connections and uses them in the Power BI Service.
While I was at the Data Scotland conference in Edinburgh on Friday (great event by the way) I stopped by the Tabular Editor stand and got the nice people there to give me a demo of their new tool, DAX Optimizer. It’s currently in private beta but if you’re curious to learn more, Nikola Ilic has already blogged about it in detail here.
Rather than blog about the tool itself – there’s no point repeating Nikola’s post – I thought it would be good to answer a question someone asked me later that day about Tabular Editor and which I’m definitely going to be asked about DAX Optimizer, namely:
This looks great, but it’s expensive and it’s hard for me to get sign-off to use third-party tools like this. Why doesn’t Microsoft give me something like this for free?
Before I carry on, let me make a few things clear:
I work for Microsoft but these are my personal opinions.
I have known many of the people involved in Tabular Editor and DAX Optimizer, including Marco and Alberto, for many years and have had business relationships with them in the past before working for Microsoft.
I don’t endorse any non-Microsoft Power BI-related commercial tools here on my blog but I do use many of them and mention them regularly, leaving readers to draw their own conclusions. This post is not an endorsement of Tabular Editor or DAX Optimizer.
With that out of the way let me address some of the different aspects of this question.
There’s a post on the Power BI blog from 2021 here, co-written by Marco Russo and Amir Netz, which covers Microsoft’s official position on community and third-party Power BI development tools and which is still relevant. There’s also a companion article by Marco here that’s worth reading. In summary, Microsoft’s long-term goal is to provide great tools for all Power BI developers, including enterprise developers, but in the meantime our priority is to build a solid platform that other people can build these tools on. I know many of you won’t believe me, but here at Microsoft we have finite development resources and we need to make difficult decisions about what we invest in all the time. We can’t build every feature that everyone wants immediately, and everyone wants different features.
As a result there will always be space for free and commercial third-party tools to innovate in the Power BI ecosystem. In the same way Tabular Editor serves the enterprise tools market, the vendors in the Power BI custom visuals marketplace extend Power BI with custom visuals. There are literally hundreds of other examples I could give in different areas such as planning and budgeting and admin and governance. Why doesn’t Microsoft buy some or all of these tools? We do buy tools vendors sometimes, but I feel these tools and companies tend to fare better outside Microsoft where they can compete with each other and move quickly, and when there’s a vibrant partner ecosystem around a product then the customer is better off too.
DAX Optimizer is slightly different to Tabular Editor and these other tools, though. While it is very sophisticated, the tool itself is not the whole point; it’s like a much, much more sophisticated version of Tabular Editor’s Best Practices Analyzer, a feature available in both the free and paid versions of Tabular Editor. The real value lies in the IP inside DAX Optimizer: these aren’t just any rules, these are Marco and Alberto’s rules for optimising DAX. Anyone could build the tool, but only Marco and Alberto could write these particular rules. I guess that’s why the Tabular Editor team had these stickers on their stand on Friday:
Doesn’t Microsoft have people who are this good at DAX and who could write the same rules? We do have people who know more about DAX than Marco and Alberto (namely the people who create it, for example Jeffrey Wang) and we do have people who are extremely good at performance tuning DAX (for example my colleagues Michael Kovalsky and Phil Seamark). Indeed, back in 2021 Michael Kovalsky published a free set of rules here which you can use with Best Practices Analyzer in Tabular Editor and which represent the Power BI CAT team’s best practice recommendations on DAX and modelling, so you could argue that Microsoft already offers a free solution to the problem that DAX Optimizer is trying to solve.
Marco and Alberto are Marco and Alberto though. They have a very strong brand. Consultancy is a famously hard business to scale and this is a very clever way for them to scale the business of DAX performance tuning. If you want their help in whatever form then you’ll need to pay for it. Couldn’t Microsoft just hire Marco and Alberto? I doubt they’d say yes if we asked, and in any case the situation is the same as with buying the tools I mentioned above: I think they add more value to the Power BI ecosystem outside Microsoft than they ever could inside it.
I’ve been lucky enough to get an invitation code to test DAX Optimizer and will be doing so this week, but I deliberately wrote this post before giving it a try. It’s important for me to stay up-to-date with everything happening in the world of Power BI because the customers I work with ask for my opinion. I wish the team behind it well in the same way I wish anyone who tries to build a business on top of Power BI well; the more successful they are, the more successful Power BI and Fabric are.
If you read this post that was published on the Fabric blog back in July, you’ll know that each Power Query query in a Fabric Gen2 dataflow has a property that determines whether its output is staged or not – where “staged” means that the output is written to the (soon-to-be hidden) Lakehouse linked to the dataflow, regardless of whether you have set a destination for the query output to be written to. Turning this on or off can have a big impact on your refresh times, making them a lot faster or a lot slower. You can find this property by right-clicking on the query name in the Queries pane:
At the moment this property is on by default for every query although this may change in the future. But should you turn it on for the queries in your Gen2 dataflows? It depends, and you should test to see what gives you the best performance.
Let’s see a simple example. I uploaded a CSV file with about a million rows from my favourite data source, the Land Registry price paid data, to the Files section of a Lakehouse, then created a query that did a group by on one of the columns to find the number of property transactions in each county in England and Wales. The query was set to load its output to a table in a Warehouse.
Here’s the diagram view for this query:
I then made sure that staging was turned off for this query:
This means that the Power Query engine itself performed the group by as it read the data from the file.
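For reference, the group by in a query like this boils down to a single Table.Group step, something like the following sketch (the step and column names are assumptions about the price paid data):
#"Grouped Rows" = Table.Group(#"Raw Data", {"County"}, {{"Transactions", each Table.RowCount(_), Int64.Type}})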
Looking at the refresh history for this dataflow:
…showed that the query took between 18 and 24 seconds to run. Clicking on an individual refresh to see the details:
…showed a single activity to load the output to the Warehouse. Clicking on this activity to see more details:
…showed how long it took – 15 seconds – plus how many rows were loaded to the destination Warehouse and how much data.
I then created a second dataflow to see the effect of staging. It’s important to understand that copying the previous dataflow and enabling staging on its only query does not do what I wanted here. Instead I had to create two queries: one with staging enabled and no destination set (called PP here), which stages all the raw data from the CSV file, and a second one (called Counties here) with staging disabled and its destination set to the Warehouse used in the previous dataflow, which references the first query and does the group by.
Here’s the diagram view for these two queries:
Note the blue outline on the PP query which indicates that it’s staged and the grey outline on the Counties query that indicates that it is not staged.
Looking at the Refresh History for this dataflow showed that it took around 40 seconds to run on average:
Looking at the first level of detail for the last refresh showed the extra activity for staging the data:
Clicking on the details for this staging activity for the PP table showed that it took 17 seconds to load all the raw data:
The activity to write the data to the Warehouse took about the same as with the first dataflow:
In summary, the first dataflow clearly performs better than the second dataflow. In this case, therefore, it looks like the overhead of staging the data made the performance worse.
Don’t take this simple example as proof of a general rule: every dataflow will be different and there are a lot of performance optimisations planned for Dataflows Gen2 over the next few months, so you should test the impact of staging for yourself. I can imagine that for other data sources (a Lakehouse source is likely to perform very well, even for files) and other transformations staging will have a positive impact. On the other hand, if you’re struggling with Dataflows Gen2 performance, especially at the time of writing this post, turning off staging could lead to a performance improvement.
Sometimes when you’re importing data from files using Power Query in either Power BI or Excel you may encounter the following error:
DataFormat.Error: External table is not in the expected format
What causes it? TL;DR: it’s because you’re trying to load data from one type of file, probably Excel (I don’t think you can get this error with any other source, but I’m not sure), and actually connecting to a different type of file.
Let’s see a simple example. Say you have a folder with two files: one is an Excel file called Date.xlsx and one is a CSV file called Date.csv.
Here’s the M code for a Power Query query that connects to the Excel file and reads the data from a table in it:
let
Source = Excel.Workbook(File.Contents("C:\MyFolder\Date.xlsx"), null, true),
Date_Table = Source{[Item = "Date", Kind = "Table"]}[Data]
in
Date_Table
Now, if you change the file path in this query – and only the file path – to point at the CSV file instead like so:
let
Source = Excel.Workbook(File.Contents("C:\MyFolder\Date.csv"), null, true),
Date_Table = Source{[Item = "Date", Kind = "Table"]}[Data]
in
Date_Table
…you will get the “external table is not in the expected format” error shown above. This is because your code is using the Excel.Workbook M function, which is used to import data from Excel workbooks, to connect to a file that is a CSV file and not an Excel workbook. The way to fix it is to use the appropriate function, in this case Csv.Document, to access the file like so:
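The exact options depend on the file, so treat the delimiter, encoding and header-promotion steps below as assumptions about this particular CSV:
let
Source = Csv.Document(File.Contents("C:\MyFolder\Date.csv"), [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
#"Promoted Headers"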
To be honest, if making this change is beyond your Power Query skills and you’re sure you’re trying to connect to the right file, you’re better off creating a completely new query rather than editing the query you already have.
Another common scenario where you might encounter this error is when you’re importing data from all the files in a folder and one of the files isn’t in the correct format. For example, let’s say you have a folder with three Excel files in it and you use the Folder data source to import all the data from all three files:
Since all three files are Excel files, the Folder option will work:
However, if you take a CSV file and drop it into the folder like so:
Then you’ll get the same error in Power Query:
Apart from deleting the CSV file you have another option to solve this problem in this case: filtering the folder so you only try to get data from the .xlsx files and no other file type. To do this, click on the step that is called “Source”. When you do this you’ll see that the step returns a table containing all the files in the folder you’re pointing at:
You’ll see that the table in this step contains a column called Extension which contains the file extension for each file. If you filter this table by clicking on the down arrow in the Extension column, deselecting the (Select All) option and selecting “.xlsx” so the table only contains .xlsx files (this will insert a new step at this point in the query, which is fine), then you can avoid this problem:
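In M terms, the folder source and the new filter step end up looking something like this sketch (the folder path is hypothetical):
let
Source = Folder.Files("C:\MyFolder"),
//Keep only the Excel files in the folder
#"Filtered Rows" = Table.SelectRows(Source, each [Extension] = ".xlsx")
in
#"Filtered Rows"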
If, as in this example, the rogue file happens to be the first file in the folder and you’ve selected that first file to be your “sample” file when setting up the import, then you’ll also need to go to the query called Sample File in the Queries pane and make exactly the same change there (ie click on the Source step and filter to remove any non-.xlsx files).
A common requirement from Power BI customers in highly-regulated industries is the need to log users out of Power BI if they have been inactive for a certain amount of time. If your Power BI reports contain extremely sensitive data you don’t want someone to open a report, leave their desk for lunch, forget to lock their PC and let everyone in the office see what’s on their screen, for obvious reasons. This has actually been possible for some time now with Power BI and is now supported for Fabric, so I thought I’d write a blog post to raise awareness.
The feature that makes this possible is Microsoft 365’s Idle Session Timeout, which you can read about here:
To turn it on, a Microsoft 365 admin has to go to the M365 admin centre, then Org Settings/Security & Privacy, and select Idle Session Timeout. There you can set the amount of time to wait before users are logged out:
Once that is set, anyone who has Power BI open in their browser but doesn’t interact with it will see the following message after the specified period of time:
Your session is about to expire
Your organization’s policy enforces automatic sign out after a period of inactivity on Microsoft 365 web applications.
Do you want to stay signed in?
There are a few things to point out about how this works (read this for the full details):
You can’t turn it on for just Power BI, you have to turn it on for all supported Microsoft 365 web apps. This includes Outlook and the other Office web apps
You can’t turn it on for specific users – it has to be for the whole organisation
Users won’t get signed out if they get single sign-on into the web app from the device-joined account, or select “Stay signed in” when they log in (an option that can be hidden), or if they’re on a managed device and using a supported browser like Edge or Chrome
You’ll need to be on friendly terms with your M365 admin if you want to use this, clearly, but if you need this functionality it makes sense to enforce activity-based timeout rules for more apps than just Power BI.
One frequently asked question I see on Power BI forums is whether it’s possible to run Power BI Desktop on a Mac, or indeed on anything other than a Windows PC. There are already a lot of detailed blog posts and videos out there on this subject, such as this one from Guy In A Cube: the answer is no, you can’t run Power BI Desktop natively on a Mac or any other OS apart from Windows, and there are no plans to port it over, so you need to either install Windows somehow (for example with Boot Camp) or use tools like Parallels or Turbo.Net to run Power BI Desktop. You can also spin up a Windows VM, for example in Azure, and run Power BI Desktop on that; Power BI Desktop is now fully supported on Azure Virtual Desktop too, although not on other virtual environments like Citrix.
Turning the question around, however, leads you to some aspects of the question that haven’t been fully explored. Instead of asking “Can I run Power BI Desktop on my Mac?”, you can instead ask “Can I do all of my Power BI development using only a browser?”. At Microsoft our long-term goal is to make all Power BI development web-based, but how close are we to that goal?
The first point to make is that it has always been possible to build Power BI reports (as opposed to datasets) in the browser without needing Power BI Desktop. You can now build basic paginated reports in the browser too. Historically I’ve never been a fan of encouraging users to do this because developing in Power BI Desktop gives you the chance to roll back to a previous version of the report if you need to – assuming you have saved those previous versions of your .pbix file. What’s more, if two or more people try to edit the same report at the same time then the last person to save wins and overwrites the other person’s changes, which can be dangerous. Fabric’s Git integration, which does work for Power BI reports, has changed my attitude somewhat though. As Rui Romano discusses here, you can now safely make changes to reports in the Power BI Service, save them to source control and then roll back if you need to; this assumes your users are comfortable using Git, however, and it doesn’t solve the simultaneous development problem.
What about dataset development? Web editing for datasets has been in preview for a few months now and is getting better and better, although there are still several limitations and the focus up to now has been on modelling; connecting to data sources is on the public roadmap though. As a result Power BI Desktop is still needed for dataset development, at least for now.
Do datamarts change anything? Or Direct Lake mode in Fabric? Datamarts do solve the problem of being able to connect to and load data using just your browser and are available (if not GA yet) today. If you’re only using datamarts to avoid the need for a Windows PC to develop on, though, you’re paying a price: for a start, you’ll either be loading the data twice if you want to use Import mode for your dataset (once to load data into the datamart, once to load the same data into the dataset) or taking the query performance hit of using DirectQuery mode. There are also some other limitations to watch out for. Fabric Direct Lake mode datasets, for me, offer all the benefits of datamarts without so many of the limitations – Direct Lake mode means you only load the data once and still get near-Import mode performance, for example – and will be the obvious choice when Fabric GAs and once features like OneSecurity are available. With Fabric it will be possible for most Power BI developers to do all their work using only a browser, although for more complex projects (and to be clear, this is only a small minority of projects) it will still be necessary to use other tools such as Tabular Editor, DAX Studio, SQL Server Management Studio and SQL Server Profiler, which can only run on a Windows PC. I can imagine some of this more advanced developer functionality coming to the browser too in time, though.
In summary while Power BI Desktop and therefore Windows is still needed for Power BI development today, the day when you can do most and maybe all of your development in the browser is in sight. All you Mac owners need to be patient just a little while longer!
A few months ago a new option was added to the Sql.Database and Sql.Databases functions in Power Query in Power BI and Excel which allows Power Query queries that combine data from different SQL Server databases to fold. Here’s a simple example showing how to use it.
On my local PC I have SQL Server installed and the Adventure Works DW 2017 and Contoso Retail DW sample databases:
Both of these databases have date dimension tables called DimDate. Let’s say you want to create a Power Query query that merges these two tables.
Here’s the M code for a Power Query query called DimDate AW to get just the DateKey and CalendarYear columns from the DimDate table of the Adventure Works DW 2017 database:
let
Source = Sql.Database("localhost", "AdventureWorksDW2017"),
dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
#"Removed Other Columns" = Table.SelectColumns(dbo_DimDate,{"DateKey", "CalendarYear"})
in
#"Removed Other Columns"
Here’s the M code for a Power Query query called DimDate Contoso to get just the Datekey and CalendarYear columns from the DimDate table in the ContosoRetailDW database:
let
Source = Sql.Database("localhost", "ContosoRetailDW"),
dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
#"Removed Other Columns" = Table.SelectColumns(dbo_DimDate,{"Datekey", "CalendarYear"})
in
#"Removed Other Columns"
Both of these Power Query queries fold. However if you create a third query to merge these two queries (ie do the equivalent of a SQL join between them) on the CalendarYear columns like so:
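For reference, here’s a sketch of what that third query might look like – I’ve used an inner join here, but the code the Merge dialog generates for you could differ slightly:
let
Source = Table.NestedJoin(#"DimDate AW", {"CalendarYear"}, #"DimDate Contoso", {"CalendarYear"}, "DimDate Contoso", JoinKind.Inner),
#"Expanded DimDate Contoso" = Table.ExpandTableColumn(Source, "DimDate Contoso", {"Datekey"}, {"DimDate Contoso.Datekey"})
in
#"Expanded DimDate Contoso"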
…this query does not fold, because it combines data from two different SQL Server databases.
However if you edit the Sql.Database function in the Source step of both of the first two queries above to set the new EnableCrossDatabaseFolding option to true, like so:
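For example, here’s a sketch of the first query with the option added to the optional options record of Sql.Database:
let
Source = Sql.Database("localhost", "AdventureWorksDW2017", [EnableCrossDatabaseFolding = true]),
dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
#"Removed Other Columns" = Table.SelectColumns(dbo_DimDate,{"DateKey", "CalendarYear"})
in
#"Removed Other Columns"
With the same change made to the DimDate Contoso query, the merge query above is then able to fold.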
In the second post in this series I discussed a KQL query that can be used to analyse Power BI refresh throughput at the partition level. However, if you remember back to the first post in this series, it’s actually possible to get much more detailed information on throughput by looking at the ProgressReportCurrent event, which fires once for every 10000 rows read during partition refresh.
Here’s yet another mammoth KQL query that you can use to analyse the ProgressReportCurrent event data:
It filters the Log Analytics data down to get events from the last day and just the ProgressReportCurrent events, as well as the ProgressReportBegin/End events which are fired before and after the ProgressReportCurrent events.
It then splits the data into groups of rows (‘partitions’ in KQL, but of course not the partitions that are being refreshed) by a combination of XmlaRequestId (ie the refresh operation) and XmlaObjectPath (ie the partition that is being refreshed)
For each group of rows it will then:
Find the ProgressReportBegin event and from this get the time when data started to be read from the source
Get all subsequent ProgressReportCurrent events and calculate the amount of time elapsed since the previous event (which might be the ProgressReportBegin event or a previous ProgressReportCurrent event) and the number of rows read
When the ProgressReportEnd event is encountered, calculate the amount of time elapsed since the previous ProgressReportCurrent event and the number of rows (which will be less than 10000) read since then
Filter out the ProgressReportBegin events because we don’t need them any more
Finally, add columns that split out the table name and the partition name, plus a column that calculates the number of rows read per second for each row by dividing the number of rows read for each event by the amount of time elapsed since the previous event
What can this query tell us about throughput?
First of all, something interesting but not necessarily useful. At least for the data source I’m using for my tests, when I plotted a column chart with the number of rows read on the x axis and the amount of time elapsed since the last event on the y axis (ie the amount of time it took to read 10000 rows, for all but the last column), I noticed that every 200000 rows something happens to slow down the read:
I have no idea what this is – maybe it’s a quirk of this particular source or connector – but it’s a great example of the kind of pattern that becomes obvious when you visualise data rather than look at a table of numbers.
Plotting time on the x axis of a line chart and the cumulative total of rows read on the y axis gives you something more useful. Here’s the chart for one of the refreshes mentioned in my last post where four partitions of the same table are refreshed in parallel:
In this case throughput is fine up until the end of the refresh, at which point something happens that slows down the February, March and April partitions (but not the January partition) for about 30 seconds, after which throughput goes back to what it was before. Here’s the same chart zoomed in a bit:
Here’s the same problem shown in the first graph above, where the number of rows read is on the x axis, showing how for example with the April partition there’s a sudden spike where it takes 14 seconds to read 10000 rows rather than around 0.3 seconds:
What is this, and why isn’t the January partition affected? Maybe it was a network issue or caused by something happening in the source database? Looking at another refresh that also refreshes the same four partitions in parallel, it doesn’t seem like the same thing happens – although if you look closely at the middle of the refresh there might be a less pronounced flattening off:
Again, the point of all this is not the mysterious blips I’ve found in my data but the fact that if you take the same query and look at your refreshes, you may find something different, something more significant and something you can explain and do something about.
In the first post in this series I described the events in Log Analytics that can be used to understand throughput – the speed at which Power BI can read data from your data source when importing it into your dataset – during refresh. While the individual events are easy to understand when you look at a simple example, they don’t make it easy to analyse the data in the real world, so here’s a KQL query that takes all the data from all these events and gives you one row per partition per refresh:
//Headline stats for partition refresh with one row for each partition and refresh
//Get all the data needed for this query and buffer it in memory
let RowsForStats =
materialize(
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == "ProgressReportEnd"
| where OperationDetailName == "ExecuteSql" or OperationDetailName == "ReadData"
or (OperationDetailName == "TabularRefresh" and (EventText contains "partition"))
);
//Get just the events for the initial SQL execution phase
let ExecuteSql =
RowsForStats
| where OperationDetailName == "ExecuteSql"
| project XmlaRequestId, XmlaObjectPath,
ExecuteSqlStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ),
ExecuteSqlEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ),
ExecuteSqlDurationMs = DurationMs, ExecuteSqlCpuTimeMs = CpuTimeMs;
//Get just the events for the data read and calculate rows read per second
let ReadData =
RowsForStats
| where OperationDetailName == "ReadData"
| project XmlaRequestId, XmlaObjectPath,
ReadDataStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ),
ReadDataEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ),
ReadDataDurationMs = DurationMs, ReadDataCpuTime = CpuTimeMs,
TotalRowsRead = ProgressCounter, RowsPerSecond = ProgressCounter /(toreal(DurationMs)/1000);
//Get the events for the overall partition refresh
let TabularRefresh =
RowsForStats
| where OperationDetailName == "TabularRefresh"
| parse EventText with * '[MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: ' MashupPeakMemoryKB:long ' KB]'
| project XmlaRequestId, XmlaObjectPath,
TabularRefreshStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ),
TabularRefreshEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ),
TabularRefreshDurationMs = DurationMs, TabularRefreshCpuTime = CpuTimeMs,
MashupCPUTimeMs, MashupPeakMemoryKB;
//Do an inner join on the three tables so there is one row per partition per refresh
ExecuteSql
| join kind=inner ReadData on XmlaRequestId, XmlaObjectPath
| join kind=inner TabularRefresh on XmlaRequestId, XmlaObjectPath
| project-away XmlaRequestId1, XmlaRequestId2, XmlaObjectPath1, XmlaObjectPath2
| extend Table = tostring(split(XmlaObjectPath,".", 2)[0]), Partition = tostring(split(XmlaObjectPath,".", 3)[0])
| project-reorder XmlaRequestId, Table, Partition
| order by XmlaRequestId, ExecuteSqlStartTime desc
It’s a bit of a monster query but what it does is quite simple:
First it gets all the events relating to partition refresh in the past 1 day (which of course you can change) and materialises the results.
Then it filters this materialised result and gets three sets of tables:
All the ExecuteSql events, which tell you how long the data source took to start returning data and how much CPU time was used.
All the ReadData events, which tell you how long Power BI took to read all the rows from the source after the data started to be returned, how much CPU time was used, and how many rows were read. Dividing the number of rows read by the duration lets you calculate the number of rows read per second during this phase.
All the TabularRefresh events, which give you overall data on how long the partition refresh took, how much CPU time was used, plus information on Power Query peak memory usage and CPU usage.
What can this tell us about refresh throughput though? Let’s use it to answer some questions we might have about throughput.
What is the impact of parallelism on throughput? I created a dataset on top of the NYC taxi data Trip table with a single table, and in that table created four partitions containing data for January, February, March and April 2013, each of which contained 13-15 million rows. I won’t mention the type of data source I used because I think it distracts from what I want to talk about here, which is the methodology rather than the performance characteristics of a particular source.
I then ran two refreshes of these four partitions: one which refreshed them all in parallel and one which refreshed them sequentially, using custom TMSL refresh commands and the maxParallelism property as described here. I did a refresh of type dataOnly, rather than a full refresh, in the hope that it would reduce the number of things happening in the Vertipaq engine during refresh that might skew my results. Next, I used the query above as the source for a table in Power BI (for details on how to use Log Analytics as a source for Power BI see this post; I found it more convenient to import the data rather than use DirectQuery mode though) to visualise the results.
Comparing the amount of time taken for the SQL query used to start to return data (the ExecuteSqlDurationMs column from the query above) for the four partitions for the two refreshes showed the following:
The times for the four partitions vary a lot for the sequential refresh but are very similar for the parallel refresh; the January partition, which was refreshed first, is slower in both cases. The behaviour I described here regarding the first partition refreshed in a batch could be relevant.
Moving on to the Read Data phase, looking at the number of rows read per second (the RowsPerSecond column from the query above) shows a similar pattern:
There’s a lot more variation in the sequential refresh. Also, as you would expect, the number of rows read per second is much higher when partitions are refreshed sequentially compared to when they are refreshed in parallel.
Looking at the third main metric, the overall amount of time taken to refresh each partition (the TabularRefreshDurationMs column from the query above) again shows no surprises:
Each individual partition refreshes a lot faster in the sequential refresh – almost twice as fast – compared to the parallel refresh. Since four partitions are being refreshed in parallel during the second refresh, though, any loss of throughput for an individual partition as a result of refreshing in parallel is more than compensated for by the parallelism, making the parallel refresh faster overall. This can be shown by plotting the TabularRefreshStartTime and TabularRefreshEndTime columns from the query above on a timeline chart (in this case the Craydec Timelines custom visual) for each refresh and each partition:
On the left of the timeline you can see the first refresh where the partitions are refreshed sequentially, and how the overall duration is just over 20 minutes; on the right you can see the second refresh where the partitions are refreshed in parallel, which takes just under 10 minutes. Remember also that this is just looking at the partition refresh times, not the overall time taken for the refresh operation for all partitions, and it’s only a refresh of type dataOnly rather than a full refresh.
So does this mean more parallelism is better? That’s not what I’ve been trying to say here: more parallelism is better for overall throughput in this test but if you keep on increasing the amount of parallelism you’re likely to reach a point where it makes throughput and performance worse. The message is that you need to test to see what the optimal level of parallelism – or any other factor you can control – is for achieving maximum throughput during refresh.
These tests only show throughput at the level of the ReadData event for a single partition, but as mentioned in my previous post there is even more detailed data available with the ProgressReportCurrent event. In my next post I’ll take a closer look at that data.
[Thanks to Akshai Mirchandani for providing some of the information in this post, and hat-tip to my colleague Phil Seamark who has already done some amazing work in this area]