Can You Develop For Power BI Using Only Your Browser?

One question I see asked frequently on Power BI forums is whether it’s possible to run Power BI Desktop on a Mac or indeed anything other than a Windows PC. There are already a lot of detailed blog posts and videos out there on this subject, such as this one from Guy In A Cube. The answer is no: you can’t run Power BI Desktop natively on a Mac or any OS other than Windows and there are no plans to port it, so you need to either install Windows somehow (for example with Boot Camp) or use tools like Parallels or Turbo.Net to run Power BI Desktop. You can also spin up a Windows VM, for example in Azure, and run Power BI Desktop on that; Power BI Desktop is now fully supported on Azure Virtual Desktop too, although not on other virtual environments like Citrix.

Turning the question around, however, leads to some aspects that haven’t been fully explored. Instead of asking “Can I run Power BI Desktop on my Mac?”, you can instead ask “Can I do all of my Power BI development using only a browser?”. At Microsoft our long-term goal is to make all Power BI development web-based, but how close are we to that goal?

The first point to make is that it has always been possible to build Power BI reports (as opposed to datasets) in the browser without needing Power BI Desktop. You can now build basic paginated reports in the browser too. Historically I’ve never been a fan of encouraging users to do this, because developing in Power BI Desktop gives you the chance to roll back to a previous version of the report if you need to – assuming you have saved those previous versions of your .pbix file. What’s more, if two or more people try to edit the same report at the same time then the last person to save wins and overwrites the other person’s changes, which can be dangerous. Fabric’s Git integration, which does work for Power BI reports, has changed my attitude somewhat though. As Rui Romano discusses here, you can now safely make changes to reports in the Power BI Service, save them to source control and then roll back if you need to; this assumes your users are comfortable using Git, however, and it doesn’t solve the simultaneous development problem.

What about dataset development? Web editing for datasets has been in preview for a few months now and is getting better and better, although there are still several limitations and the focus up to now has been on modelling; connecting to data sources is on the public roadmap though. As a result Power BI Desktop is still needed for dataset development, at least for now.

Do datamarts change anything? Or Direct Lake mode in Fabric? Datamarts do solve the problem of being able to connect to and load data using just your browser and are available (if not GA yet) today. If you’re only using datamarts to avoid the need for a Windows PC to develop on, though, you’re paying a price: for a start you’ll either be loading the data twice if you want to use Import mode for your dataset (once to load data into the datamart, once to load the same data into the dataset) or taking the query performance hit of using DirectQuery mode. There are also some other limitations to watch out for. Fabric Direct Lake mode datasets, for me, offer all the benefits of datamarts without many of the limitations – Direct Lake mode means you only load the data once and still get near-Import mode performance, for example – and will be the obvious choice when Fabric GAs and once features like OneSecurity are available. With Fabric it will be possible for most Power BI developers to do all their work using only a browser, although for more complex projects (and to be clear this is only a small minority of projects) it will still be necessary to use other tools such as Tabular Editor, DAX Studio, SQL Server Management Studio and SQL Server Profiler, which can only run on a Windows PC. I can imagine some of this more advanced developer functionality coming to the browser in time, though.

In summary, while Power BI Desktop, and therefore Windows, is still needed for Power BI development today, the day when you can do most, and maybe all, of your development in the browser is in sight. All you Mac owners just need to be patient a little while longer!

Performance Testing Power BI Direct Lake Mode Datasets In Fabric

If you’re excited about Direct Lake mode in Fabric you’re probably going to want to test it with some of your own data, and in particular look at DAX query performance. Before you do so, though, there are a few things to know about performance testing with Direct Lake datasets that are slightly different from what you might be used to with Import mode or DirectQuery datasets.

Dataset “hotness”

In my last post I talked about how, in Direct Lake datasets, Power BI can page individual column segments, dictionaries and join indexes into memory on demand when a DAX query is run and how those artefacts may get paged out later on. It therefore follows that there are four possible states or levels of “hotness” that a dataset can be in when a DAX query is run and that each of these states will have different performance characteristics:

  1. None of the column segments, dictionaries and join indexes needed to answer a query are held in memory; they all need to be paged in before the query can run.
  2. Some of the column segments, dictionaries and join indexes needed to answer a query are not held in memory and need to be paged in, while some of them are already in memory.
  3. All of the column segments, dictionaries and join indexes needed by the query are already held in memory.
  4. All of the column segments, dictionaries and join indexes needed by the query are already held in memory and, as a result of previous query activity, Vertipaq engine caches useful for the query are also already populated.

State (1) is the “coldest” state and will give the worst possible query performance while state (4) is the “hottest” state and will give the best possible query performance.

When you’re testing the performance of a DAX query on a Direct Lake dataset you should test it on a dataset that is in state (1), state (3) and state (4) so you get a good idea of how much time is taken to page data into memory and how much of a performance improvement Vertipaq engine caching brings.

Ensuring everything is paged out

To test query performance in state (1) you need a way to ensure that all column segments, dictionaries and join indexes are paged out of memory. At the time of writing you can do this simply by refreshing the dataset. This will change sometime in the next few months though, because paging everything out of memory on refresh is not ideal behaviour in the real world, so there will be another way to ensure everything is paged out; I’ll update this post when that change happens. You can check that all column segments and dictionaries have been paged out by running the two DMV queries mentioned in my previous post: DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS and DISCOVER_STORAGE_TABLE_COLUMNS.
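For example, running a query like the following against your workspace’s XMLA Endpoint (a minimal sketch based on the DMV queries shown in my previous post – ‘MyDirectLakeDataset’ is a placeholder for your own dataset’s name) should return ISRESIDENT = false on every row if everything has been paged out:

Select
COLUMN_ID, SEGMENT_NUMBER, ISRESIDENT, TEMPERATURE, LAST_ACCESSED
from SYSTEMRESTRICTSCHEMA
($System.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS,
[DATABASE_NAME] = 'MyDirectLakeDataset')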

Why you should use DAX Studio for performance testing

I also recommend using DAX Studio to run queries when performance testing, for a number of reasons. First of all it makes it easy to clear the Vertipaq engine cache before a query is run with the “Clear Cache” and “Clear on Run” (which automatically clears the cache before each query) buttons. Clearing the cache this way not only runs a Clear Cache command against the engine, it also runs a (hidden) DAX query that does not query any data from the dataset but which does trigger the creation of all the measures on the dataset. In most cases this is very fast, but if you have thousands of measures it could take a few seconds (similar to what I show here for report-level measures). If you are not using DAX Studio you can achieve the same result by running a query like:

EVALUATE {1}

DAX Studio also lets you run the same query multiple times using its “Run benchmark” feature (although this only lets you test for states (3) and (4) at the time of writing) and its “Server Timings” feature is invaluable for understanding what’s going on inside the Vertipaq engine when a query runs.

Also make sure you are running the very latest version of DAX Studio (3.0.8 at the time of writing) so that it works properly with Fabric.

Performance testing methodology

So, putting this all together, here’s how to run a single performance test on a Direct Lake dataset and capture performance for states (1), (3) and (4):

  1. Preparation
    • Create a custom Power BI dataset in Fabric (not everything here works with a default dataset at the time of writing)
    • Open your report in Power BI Desktop connected to your published Direct Lake dataset
    • Capture the DAX queries generated by the visuals you want to test using Performance Analyzer by clicking Copy Query (see here for how to do this)
    • Install the very latest version of DAX Studio and open it
    • Connect DAX Studio to your workspace’s XMLA Endpoint (see here for how to do this)
    • Paste your DAX query into DAX Studio
    • Turn on DAX Studio’s Server Timings option
    • Ensure the “Clear on Run” option in the ribbon is turned off
  2. To test performance for state (1):
    • Refresh your dataset to ensure everything is paged out of memory
    • Click the “Clear Cache” button on the ribbon in DAX Studio
    • Run the DAX query
    • Save the output of the Server Timings pane in DAX Studio by clicking the Export button
  3. To test performance for state (3), immediately after the previous steps:
    • Click the “Clear Cache” button on the ribbon in DAX Studio
    • Run the DAX query
    • Save the output of the Server Timings pane in DAX Studio
  4. To test the performance for state (4), immediately after the previous steps:
    • Run the DAX query again without clicking the “Clear Cache” button
    • Save the output of the Server Timings pane

Fallback to DirectQuery

Even with a Direct Lake dataset there is no guarantee that your query will be answered in Direct Lake mode: in scenarios that are not yet fully documented (but basically when your data volumes are too large for the capacity you’re using) Power BI will fall back to DirectQuery mode against the Lakehouse’s SQL Endpoint. One of the objectives of your performance testing should be to make sure this happens as infrequently as possible, because DirectQuery mode will perform noticeably worse than Direct Lake mode.

You may notice some DirectQuery events even when the query itself only uses Direct Lake; these are used to get metadata or security information from the Lakehouse and can be ignored. Here’s an example of this:

Load testing

Testing for a single user is important, but don’t forget about testing performance with multiple concurrent users. As I discuss here, realistic load tests are the only way to get a good idea of how your report will actually perform in production and you can load test a Direct Lake dataset in exactly the same way as an Import mode or DirectQuery dataset.

Direct Lake is still in preview!

The last point to make is that, like the rest of Fabric, Direct Lake is still in preview. This not only means that functionality is missing, it also means that performance is not yet as good as it will be. So, by all means test Direct Lake and tell us how fast (or not) it is, but be aware that your test results will be out of date very quickly as the engine evolves.

[Thanks to Krystian Sakowski for the information in this post]

On-Demand Loading Of Direct Lake Power BI Datasets In Fabric

For any Power BI person, Direct Lake mode is the killer feature of Fabric. Import mode report performance (or near enough) directly on data from the lake, with none of the waiting around for data to refresh! It seems too good to be true. How can it be possible?

The full answer, going into detail about how data from Delta tables is transcoded to Power BI’s in-memory format, is too long for one blog post. But in part it is possible thanks to something that existed before Fabric but which didn’t get much attention: on-demand loading. There’s a blog post from December 2021 about how it works for large models in Power BI Premium here:

https://powerbi.microsoft.com/en-us/blog/announcing-on-demand-loading-capabilities-for-large-models-in-power-bi/

TL;DR: instead of loading all the data from the tables in your Lakehouse into memory up front, on-demand loading means only the data needed to answer a query is loaded into memory when that query runs, which naturally makes everything a lot faster. What’s more, you can see what gets loaded into memory using DMVs. Here’s a simple example…

Let’s say you have a Lakehouse with a table in it called Sales that contains three columns: Country, Product and Sales.

I populated this table using a Gen2 dataflow with the following M expression:

let
  Source = #table(
    type table [Country = text, Product = text, Sales = number],
    {
      {"UK", "Apples", 1}, {"France", "Apples", 2},
      {"UK", "Oranges", 5}, {"Germany", "Pears", 10}
    }
  )
in
  Source

Let’s also say you have a custom dataset built on this Lakehouse (you can’t use the default dataset, at least not yet, for what comes next) containing this table, called Sales Custom Dataset.

At this point you can open up DAX Studio, connect to your workspace’s XMLA Endpoint, and query this dataset. The DMV mentioned in the blog post above, DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS, tells you, amongst other things, whether a column segment in a table can be paged in and out, whether it is currently resident in memory, its “temperature” (which represents how frequently the column segment is accessed) and when it was last accessed. A column segment is a structure that holds compressed data for a column in the Vertipaq engine inside Power BI.

Running the following query:

Select
COLUMN_ID, SEGMENT_NUMBER, ISPAGEABLE, 
ISRESIDENT, TEMPERATURE, LAST_ACCESSED
from SYSTEMRESTRICTSCHEMA 
($System.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS, 
[DATABASE_NAME] = 'Sales Custom Dataset')

Immediately after creating the dataset or refreshing it (again, what refresh means for a Direct Lake dataset is something for another post but it has the side effect of paging everything out of memory) will give the following result:

Note that for this simple dataset there is only one row returned per column; for datasets with more data, or depending on how the data was loaded into the lake, there will be more rows because the columns will have more segments. Each row represents a column segment, not a column. Also note that, apart from the RowNumber column (which is hidden and always present), every column segment is pageable, is not resident in memory, and has no value for Temperature or Last Accessed.

Now let’s say that you build a report on this custom dataset with a single table visual using just the Country and Sales columns, like so:

Rerunning the DMV query above now returns the following:

You can now see that the Country and Sales columns are resident in memory, have a fairly “hot” temperature, and you can see the date and time they were last accessed.

Doing nothing else apart from waiting about five minutes and then rerunning the same DMV query returns the following:

You can see that Country and Sales are still in memory but their temperature has reduced.

Adding the Product column to the table visual like so:

…and then rerunning the DMV query now gives the following results:

As you might expect, the only column segment for Product has now been paged into memory but it still has a lower temperature than the column segments for Country and Sales.

As well as looking at column segments, it’s also possible to do the same thing for column dictionaries by running the following DMV query:

Select
COLUMN_ID, DICTIONARY_SIZE, DICTIONARY_ISPAGEABLE,
DICTIONARY_ISRESIDENT, DICTIONARY_TEMPERATURE, DICTIONARY_LAST_ACCESSED
from SYSTEMRESTRICTSCHEMA 
($System.DISCOVER_STORAGE_TABLE_COLUMNS, 
[DATABASE_NAME] = 'Sales Custom Dataset')

Here’s an example of the output:

Finally, after leaving the dataset for two days and not running any other queries or opening any reports, rerunning both DMVs shows that everything has been paged out of memory again:

OK, so this is all very interesting for nerds like me who like to see how things work behind the scenes. As I said, though, it’s something that has been happening for Import mode datasets that use the large dataset storage format for a while now; if you’re thinking that it would be cool to build a report using these DMVs and use the Temperature column to see which columns are the most frequently used, Gilbert Quevauvilliers has already done that here.

There is one big difference between Import mode and Direct Lake mode to point out though. When you do a full refresh on an Import mode dataset the whole dataset has to fit in memory, so you are at the mercy of the limits on the amount of memory a dataset can use in either Shared capacity or the Premium capacity you are using. These same limits exist for Direct Lake datasets, but since refresh for a Direct Lake dataset is not the same thing as refresh for an Import mode dataset, at no point during a refresh does the whole of the dataset need to be in memory. This means the memory limits only apply to how much of the dataset can be paged into memory at any one time, so in some cases you’ll be able to work with much larger datasets than are possible in Import mode, so long as your queries only need to read a small part of the data at a time. I say “in some cases” because it’s complicated: there are various other, as yet undocumented, rules about whether a Direct Lake dataset can be queried as such or whether queries fall back to DirectQuery mode, and some of these rules do relate to the data volumes involved. The point is that, as Power BI developers, the way we think about dataset memory usage will have to change with Direct Lake mode.
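If you want a rough idea of how much of a Direct Lake dataset is currently paged into memory, one option (a sketch, reusing the DMV from earlier in this post) is to look at the USED_SIZE column alongside ISRESIDENT and add up the sizes of the resident segments yourself – the restricted SQL dialect used for DMV queries doesn’t support aggregation:

Select
COLUMN_ID, SEGMENT_NUMBER, ISRESIDENT, USED_SIZE
from SYSTEMRESTRICTSCHEMA
($System.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS,
[DATABASE_NAME] = 'Sales Custom Dataset')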

Power Query Templates In Excel And Fabric

Earlier this year a new feature was added to Power Query in Excel and dataflows that very few people noticed: the ability to export all the queries in the Power Query Editor to a template file. The reason few people noticed it is that, when the feature was released, the only place you could import a template file was a Power Platform dataflow. The documentation for exporting from Excel and importing to a Power Platform dataflow is very detailed and can be found here:

https://learn.microsoft.com/en-us/power-query/new-dataflow-from-template

With the release of Fabric you can now import templates into Gen2 Dataflows. This means you can export a template from a Gen1 dataflow and import it into a Gen2 dataflow, which is documented here:

https://learn.microsoft.com/en-us/fabric/data-factory/move-dataflow-gen1-to-dataflow-gen2

Joining the dots, this also means you can now export a template from Power Query in Excel and import it to a Gen2 dataflow. This makes it easy to develop on your PC before pushing your work up to Fabric. Here’s a quick walkthrough of how to do this.

In Excel, let’s say you have one or more queries in the Power Query Editor:
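For illustration, a query as simple as this would do (a hypothetical example – any queries you already have will work just as well):

let
  Source = #table(
    type table [Product = text, Sales = number],
    {{"Apples", 10}, {"Oranges", 20}}
  )
in
  Source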

If you then go to the File menu in the top left corner of the screen you’ll see the option to export to a template:

If you click this menu item, you’ll see a dialog where you can enter a name and description for the template:

Click Export and you can save the template to a .pqt file.

If you then create a new Gen2 Dataflow in Fabric you’ll see the option to import from a template in the middle of the screen:

[Notice also the Export Template option in the top right corner]

If you import the .pqt file created in Excel the queries in it will get created in the dataflow. You may need to re-enter credentials:

Once you’ve done that, the queries run:

Template files are the way the cool kids share their Power Query queries these days and no kid is cooler than my colleague Alex Powers who, I saw on Twitter recently, has created a template file that scrapes all the ideas from the Fabric Ideas site:

[If you do import this template close to the time of writing this post you may run into this bug from the list of known issues for Fabric Data Factory and Dataflows]

Hopefully we’ll start to see more and more useful Power Query queries and functions shared as template files!

Monitoring Power BI Dataset Refresh Memory And CPU Usage With Log Analytics

Maybe the fourth- or fifth-most exciting Power BI-related announcement last month (admittedly it was an exciting month) was that Log Analytics for Power BI datasets is now GA and you can now link multiple Power BI workspaces to a single Log Analytics workspace. This, for me, means that enabling Log Analytics has gone from being useful to essential for anyone interested in monitoring Analysis Services engine activity in an enterprise Power BI/Fabric deployment. It also works with Direct Lake datasets!

Anyway, following on from my recent posts on measuring memory and CPU usage during dataset refresh, both for the refresh as a whole and for individual partitions, here are some KQL queries that allow you to extract that information from Log Analytics.

Here’s the KQL query to get the memory and CPU usage for a refresh (see this post for details):

PowerBIDatasetsWorkspace
| where OperationName == "CommandEnd"
| parse-where EventText with * '[PeakMemory: ' PeakMemoryKB:long ' KB, MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: '  MashupPeakMemoryKB:long ' KB]'
| extend DatasetName=ArtifactName
| project TimeGenerated, CpuTimeMs, DurationMs, MashupCPUTimeMs, MashupPeakMemoryKB, PeakMemoryKB, XmlaRequestId, DatasetName, PowerBIWorkspaceName

And here’s the KQL query to get the memory and CPU usage for individual partitions/Power Query queries (see this post for details):

PowerBIDatasetsWorkspace
| where OperationName == "ProgressReportEnd"
| parse-where EventText with * '[MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: '  MashupPeakMemoryKB:long ' KB]'
| extend Table=split(XmlaObjectPath,'.').[2], Partition=split(XmlaObjectPath,'.').[3]
| extend DatasetName=ArtifactName
| project TimeGenerated, Table, Partition, CpuTimeMs, DurationMs, MashupCPUTimeMs, MashupPeakMemoryKB, XmlaRequestId, DatasetName, PowerBIWorkspaceName

You could use these queries in a Power BI DirectQuery dataset to report on all your refreshes, similar to what I did in this series of posts last year. The XmlaRequestId column contains the unique identifier for the refresh operation, so you could build a one-to-many relationship between the table with the refresh-level data and the table with the partition-level data. Maybe I’ll get round to building that report sometime…
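If you want to explore that relationship directly in Log Analytics, you could also join the two result sets on XmlaRequestId in KQL. Here’s a minimal sketch (not from the original queries – aliases like RefreshPeakMemoryKB are just illustrative names):

// Refresh-level events (one row per refresh)
let RefreshLevel =
    PowerBIDatasetsWorkspace
    | where OperationName == "CommandEnd"
    | parse-where EventText with * '[PeakMemory: ' PeakMemoryKB:long ' KB, MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: ' MashupPeakMemoryKB:long ' KB]'
    | project XmlaRequestId, DatasetName = ArtifactName, RefreshPeakMemoryKB = PeakMemoryKB, RefreshCpuTimeMs = CpuTimeMs;
// Partition-level events (one row per partition/Power Query query)
let PartitionLevel =
    PowerBIDatasetsWorkspace
    | where OperationName == "ProgressReportEnd"
    | parse-where EventText with * '[MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: ' MashupPeakMemoryKB:long ' KB]'
    | project XmlaRequestId, Table = tostring(split(XmlaObjectPath, '.')[2]), Partition = tostring(split(XmlaObjectPath, '.')[3]), MashupPeakMemoryKB;
RefreshLevel
| join kind=inner (PartitionLevel) on XmlaRequestId
| project DatasetName, Table, Partition, MashupPeakMemoryKB, RefreshPeakMemoryKB, RefreshCpuTimeMs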

[Thanks to Jast Lu, who wrote the original version of these queries]