Power BI Semantic Model Memory Errors, Part 1: Model Size

You probably know that there is a limit on the amount of memory a semantic model in Power BI can use. This is true of all types of semantic model – Import, Direct Lake and DirectQuery – although it’s not something you usually need to worry about for DirectQuery mode. The limit depends on whether you’re using Shared (aka Pro) or a Premium/Fabric capacity and, if you’re using a capacity, how large that capacity is. In Shared/Pro the maximum amount of memory that a semantic model can use is 1GB; if you are using a capacity then the amount of memory available for models in each SKU is documented in the table here, in the Max Memory column:

What counts as “memory usage” though? More importantly, how can you breach this limit and what do all of the different memory-related error messages that you might see mean? In this series I will try to answer these questions, and in this post I will look at one particular error you see when your model needs to use more memory than it is allowed to.

First of all it’s important to understand that the amount of memory used by a semantic model is not the same as the amount of data “in” the model. The diagram below shows how model memory usage can be broken down. The data in the columns and tables of your model, along with supporting objects like relationships (represented by the blue box in the diagram below) makes up just one part of the overall model memory usage. In addition, more memory is needed to store data associated with row-level security, user sessions, caches and so on (represented by the orange box in the diagram below).

Both Import mode and Direct Lake models can page data in and out of memory as required, so the whole model may not be in memory at any given time. However, in order for a query to run, the data it needs must be in memory and cannot be paged out until the query has finished with it. Therefore, of all the memory consumed by a semantic model at any given time, some is “evictable” because it isn’t in use, while some is “non-evictable” because it is being used. Evictable memory may be paged out for a variety of reasons, for example because the model is nearing its memory limit.

One further factor to take into account is the memory used by queries that are running on the model (the purple boxes in the diagram above). While each query has a limit on the amount of memory it can use – I mentioned the Query Memory Limit in this post but I will revisit it later on in this series – the total amount of memory used by queries also contributes to the overall memory use of a semantic model. If you have a large number of concurrent queries running, even if no single query uses much memory, this can contribute a lot to the overall memory usage of your model.

In summary then, the total amount of memory used by a semantic model is made up of three groups:

  1. The data in the tables in your model (the blue box above)
  2. Supporting data for RLS security roles, sessions and caches (the orange box above)
  3. Data used by queries (the purple boxes above)

When the sum of these three groups exceeds the total amount of memory allowed for your model, and no data can be evicted from memory to reduce this sum, then you’ll get an error.

To illustrate this I created a new F2 capacity, which has a 3GB limit on the amount of memory used by a semantic model, loaded a table (called SourceData) with 3.5 million rows of random numbers stored as text into a Lakehouse, then created a new custom Direct Lake semantic model on it. I set the Direct Lake Behavior property on the model to “Direct Lake only” to prevent fallback to DirectQuery mode.
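
If you want to create similar test data yourself, something like the following PySpark code in a Fabric notebook (with the Lakehouse attached as its default) does the job – this is a minimal sketch, and the number of columns and the column names here are just for illustration:

from pyspark.sql import functions as F

# generate 3.5 million rows of random numbers and store them as text
df = (
    spark.range(0, 3500000)
    .withColumn("RandomNumber1", (F.rand() * 1000000).cast("string"))
    .withColumn("RandomNumber2", (F.rand() * 1000000).cast("string"))
    .withColumn("RandomNumber3", (F.rand() * 1000000).cast("string"))
    .drop("id")
)

# save as a Delta table called SourceData in the attached Lakehouse
df.write.format("delta").mode("overwrite").saveAsTable("SourceData")

High-cardinality random values stored as text compress very badly, which is why a table with a relatively modest number of rows can produce such a large model.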

After creating the model I used DAX Studio’s Model Metrics feature with the “Read statistics from data” option turned off to find the amount of data stored in memory (ie the blue box value).

Unsurprisingly, at this stage, the size of the model was very small: only 8KB.

I then turned the “Read statistics from data” option on, knowing that this would force data to be paged into memory. This showed the total potential size of the model to be 4.25GB:

I was initially confused by this because 4.25GB is already well over the 3GB limit, but it was pointed out to me that what is probably happening is that DAX Studio runs a number of DMV queries to get the data needed to calculate this value, and as these queries run different parts of the model are paged in and out of memory. DAX Studio was certainly very slow to calculate the Model Metrics when I did this, which fits with the paging in/out theory.

Finally, I ran a simple DAX query to get the top 10 rows from the SourceData table:

EVALUATE TOPN(10, SourceData)

This query ran for about ten seconds and then failed with the following error message:

Resource Governing: We cannot complete the requested operation because there isn’t enough memory (consumed memory 4620 MB, memory limit 3072 MB). Either reduce the size of your dataset, such as by limiting the amount of in-memory data, or host the dataset on a Fabric or Premium capacity with a sufficient memory size. See https://go.microsoft.com/fwlink/?linkid=2159753 to learn more.

[The error code associated with this message is 0xC13E0006 or -1052901370]

This is the error that you get when your model needs to use more memory than it is allowed to use for the capacity SKU it is running on. The query references every column from the only table in the model, which means the whole table – which is the whole model – would have to be paged into memory for the query to run, but the whole model requires more memory than is available on an F2 capacity.

If you aren’t getting this exact error message then something slightly different might be happening. In future posts in this series I will look at some of these other errors including the query memory limit and the command memory limit.

[Thanks to Marius Dumitru for the information in this post]

How Much Does Copilot Cost In Microsoft Fabric?

There’s a lot of excitement about Copilot in Power BI and in Fabric as a whole. The number one question I’m asked about Copilot by customers is “How much does it cost?” and indeed there have been two posts on the Fabric blog here and here attempting to answer this question. The point I want to make in this post, though, is that it’s the wrong question to ask.

Why? Well, to start off with, Copilot isn’t something you buy separately. Every time you use Copilot in Fabric it uses compute from a capacity, either a P SKU or an F SKU, just the same as if you opened a report or refreshed a semantic model. You buy the capacity and that gives you a pool of compute that you can use however you want, and using Copilot is just one of the things you can use that compute for. No-one has ever asked me how much it costs in dollars to open a report in Power BI or refresh a semantic model, so why ask this question about Copilot?

Of course I understand why Copilot feels like a special case: customers know a lot of users will want to play with it and they also know how greedy AI is for resources. Which brings me back to the point of this post: the question you need to ask about Copilot is “How much of my Power BI/Fabric capacity’s compute will be used by Copilot if I turn it on?”. Answering this question in terms of percentages is useful because if you consistently go over 100% usage on a Fabric capacity then you will either need to buy more capacity or experience throttling. And if Copilot does end up using a lot of compute, and you don’t want to buy more capacity or deal with throttling, then maybe you do want to limit its usage to a subset of your users or even turn it off completely?

To a certain extent the question about percentage usage can be answered with the Premium Capacity Metrics app. For example, I opened up a gen2 dataflow on a Premium capacity that returns the following data:

I expanded the Copilot pane and typed the following prompt:

Filter the table so I only have products that were produced in the country France

And here’s what I got back – a filtered table showing the row with the Producer Country France:

So what impact did this have on my capacity? Since I know this was the only time I used Copilot the day I wrote this post it was easy to find the relevant line in the “Background operations for timerange” table on the TimePoint Detail page of the Premium Capacity Metrics app:

There are two important things to note:

  • For the 30-second window I selected, the operation above used 0.02% of my capacity
  • Copilot operations are classed as “background operations” so their cost is smoothed out over 24 hours. Therefore the operation above used 0.02% of my capacity for 24 hours from a few minutes after the point I hit enter and the operation ran; in this particular case I ran my test just before 16:20 on Friday March 15th and the operation used 0.02% of my capacity from 16:20 on Friday the 15th until 16:20 on Saturday March 16th.

How can you extrapolate tests like this to understand the likely load on your capacity in the real world? With great difficulty. Almost all the Fabric/Power BI Copilot experiences available today are targeted at people developing rather than just consuming, so that naturally limits the opportunities people have to use Copilot. Different prompts will come with different costs (as the blog posts mentioned above explain), a single user will want to use Copilot more than once in a day and you’ll have more than one user wanting to use Copilot. What’s more, going forward there will be more and more opportunities to use Copilot in different scenarios and as Copilot gets better and better your users will want to use it more. The compute usage of different Fabric Copilots may change in the future too. So your mileage will vary a lot.

Is it possible to make a rough estimate? Let’s say you have 20 developers (which I think is reasonable for a P1/F64 – how many actively developed solutions are you likely to have on any given capacity?) writing 20 prompts per day. If each prompt uses 0.02% of your capacity then 20 × 20 × 0.02% = a maximum of 8% of your capacity used by Copilot for the whole day. That’s not inconsiderable and I’m sure someone will leave a comment saying I’m underestimating what usage will be.
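
To make that back-of-the-envelope calculation explicit – and easy to adjust for your own assumptions – here it is as a trivial Python snippet; all three input numbers are assumptions, not measurements:

developers = 20            # assumed number of active developers on the capacity
prompts_per_day = 20       # assumed number of Copilot prompts per developer per day
percent_per_prompt = 0.02  # % of capacity used per prompt, from the test above

daily_usage_percent = developers * prompts_per_day * percent_per_prompt
print(f"Estimated Copilot usage: {daily_usage_percent}% of the capacity per day")
# prints: Estimated Copilot usage: 8.0% of the capacity per day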

Which brings me to my last point: should you even see Copilot as being different from anything else you can do in Fabric that consumes compute? Or as an optional extra or additional cost? After all, dataflows consume compute, you can enable or disable dataflows in your tenant in the same way you can enable or disable Copilot, but very few customers disable dataflows because they see the benefit of using them. Turning off dataflows would reduce the load on your capacity but it would also stop your users from being so productive, and why would you do that? If we at Microsoft deliver on the promise of Copilot (and believe me, we’re working hard on this) then the productivity gains it brings to your developers should offset the cost of any extra capacity you need to buy – if indeed you need to buy any extra capacity.

So, to sum up, if you enable Copilot in Fabric you will see additional load on your capacities and you may need to buy more capacity as a result – but the benefits will be worth it. Predicting that additional load is difficult but it’s no different from predicting how your overall Fabric capacity usage will grow over time, as more and more reports, semantic models, notebooks, warehouses and so on get created and used. Rather than doing lots of complex calculations based on vague assumptions to try to predict that load, my advice is that you should use the Capacity Metrics app to monitor your actual usage and buy that extra capacity when you see you’re going to need it.

Visualising Power BI Import Mode Refresh Job Graphs

A few years ago a new pair of Profiler events was added for Power BI Import mode datasets (and indeed AAS models): the Job Graph events. I blogged about them here but they never got used by anyone because it was extremely difficult to extract useful data from them – you had to run a Profiler trace, save the trace file, run a Python script to generate a .dgml file, then open that file in Visual Studio – which was a shame because they contain a lot of really interesting, useful information. The good news is that with the release of Semantic Link in Fabric and the ability to run Profiler traces from a Fabric notebook it’s now much easier to access Job Graph data and in this blog post I’ll show you how.

Quick recap: what are the Job Graph events and why are they useful? Let’s say you have a Power BI Import mode semantic model and you want to optimise refresh performance. When you refresh a semantic model, that refresh is made up of multiple jobs which themselves are made up of multiple jobs: refreshing a semantic model involves refreshing all the tables in that model, refreshing a table involves refreshing all the partitions in that table, refreshing a partition involves loading the data and building attribute hierarchies, and so on. Some of these jobs can happen in parallel but in some cases there are dependencies between jobs, so one job can only start when another has completed. The Job Graph events give you information on these refresh jobs and the dependencies between them so you can work out which jobs you need to optimise. In order to capture information from them you need to run a trace while the semantic model is being refreshed; the data from some of these Job Graph events can be reconstituted into a Directed Graph Markup Language (DGML) file, which is an XML-based format, and once you’ve got that you can either visualise the DGML file using a suitable viewer or extract the data from it and analyse it further.

[Before I carry on I have to acknowledge that I’m extremely new at Python and a lot of the code in this post is adapted from the code in my colleague Phil Seamark’s excellent recent post on visualising Power BI refresh information with Semantic Link. Any feedback on ways to optimise the code is gratefully received.]

Here’s some Python code that you can use in a Fabric notebook to run a refresh and generate a DGML file. Each code snippet can be used in a separate code cell or combined into a single cell.

First of all you need to install Semantic Link:

%pip install semantic-link

Next you need to define the events you want in your trace, which in this case are just the Job Graph events:

import sempy.fabric as fabric
import pandas as pd
import time
import warnings

base_cols = ["EventClass", "EventSubclass", "TextData", "IntegerData"]

# define events to trace and their corresponding columns

event_schema = {
    "JobGraph": base_cols
}

warnings.filterwarnings("ignore")

You then need to start a trace using this definition, refresh the semantic model, stop the trace and filter the events captured so you only have those with the EventSubclass GraphFinished, remove the event which contains the metadata (which has a value of 0 in the IntegerData column) and then finally sort the rows in ascending order by the values in the IntegerData column:

WorkspaceName = "Insert workspace name here"
SemanticModelName = "Insert semantic model name here"

with fabric.create_trace_connection(SemanticModelName, WorkspaceName) as trace_connection:
    # create trace on server with specified events
    with trace_connection.create_trace(event_schema, "Simple Refresh Trace") as trace:

        trace.start()

        # run the refresh
        request_status_id = fabric.refresh_dataset(SemanticModelName, WorkspaceName, refresh_type="full")
        print("Progress:", end="")

        while True:
            status = fabric.get_refresh_execution_details(SemanticModelName, request_status_id, WorkspaceName).status
            if status == "Completed":
                break

            print("░", end="")
            time.sleep(2)

        print(": refresh complete")
        # allow ending events to collect
        time.sleep(5)

        # stop Trace and collect logs
        final_trace_logs = trace.stop()



# only return GraphFinished events
final_trace_logs = final_trace_logs[final_trace_logs['Event Subclass'].isin(["GraphFinished"])]
# ignore metadata row
final_trace_logs = final_trace_logs[final_trace_logs['Integer Data'].ne(0)]
# sort in ascending order by Integer Data column
final_trace_logs = final_trace_logs.sort_values(by=['Integer Data'], ascending=True)

Finally, you need to take all the text from the Text Data column of the remaining events, concatenate it to get the contents of the DGML file, and then save that file to the Files section of the Lakehouse attached to your notebook:

# concatenate all text in TextData column
out = ''.join(final_trace_logs['Text Data'])
# change background colour of critical path nodes so it's easier to see in VS Code
out = out.replace("#263238", "#eba0a7")

# write dgml file
dgmlfile = open("/lakehouse/default/Files/" + request_status_id + ".dgml", 'x')
print (out, file=dgmlfile)
dgmlfile.close()

#dispose of trace connection
trace_connection.disconnect_and_dispose()
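
If you’d like to explore the job data programmatically as well as visually, remember that DGML is just XML. Here’s a rough sketch of how you could list the jobs and their dependencies from the file written above using Python’s standard library – I’m assuming the standard DGML schema of Node and Link elements here, and the attributes you’ll see on each node will vary:

import xml.etree.ElementTree as ET

# the standard namespace used by DGML files
ns = {"dgml": "http://schemas.microsoft.com/vs/2009/dgml"}

tree = ET.parse("/lakehouse/default/Files/" + request_status_id + ".dgml")
root = tree.getroot()

# print each job (node) and all of its attributes
for node in root.findall("dgml:Nodes/dgml:Node", ns):
    print(node.attrib)

# print the dependencies (links) between jobs
for link in root.findall("dgml:Links/dgml:Link", ns):
    print(link.get("Source"), "->", link.get("Target"))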

I found a nice Visual Studio Code extension called DGMLViewer which makes viewing DGML files easy. Rather than manually downloading the file, you can use OneLake Explorer, which syncs files in OneLake with your PC in a very similar way to OneDrive; this makes working with these DGML files in VS Code very straightforward because you can simply open the local copy once it syncs.

Here’s what one of these DGML files, generated from the refresh of a very basic semantic model, looks like when viewed in DGML Viewer:

If you have Visual Studio you can also use it to view DGML files (you need to install the DGML Editor first); I found a VS extension called DgmlPowerTools 2022 which adds some advanced features. Here’s what a DGML file for a refresh looks like when visualised in Visual Studio 2022:

OK, so this looks cool but it also looks very complicated. What does it all mean? How can you interpret all this information and use it to optimise a refresh? That’s something for a future blog post!

[In my next post I look at how you can interpret this data and understand the concepts of blocking and waiting, and in the post after that show how you can extract the data in this DGML file to a table using a Dataflow]

Getting Different Versions Of Data With Value.Versions In Power Query

Something I mentioned in my recent post on the Fabric blog about the new DeltaLake.Table M function was the fact that you can get different versions of the data held in a Delta table using the Value.Versions M function. In fact, Value.Versions is the way to access different versions of data in any source that has this concept – so long as Power Query has added support for doing so. The bad news is that, at least at the time of writing, apart from the DeltaLake connector there’s only one other source where Value.Versions can be used in this way: the connector for Fabric Lakehouses.

Here’s how you can access the data in a table using the Lakehouse.Contents M function:

let
    Source = Lakehouse.Contents(),
    SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
    SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
    SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data]
in
    SelectTable

As with DeltaLake.Table, you can get a table with all the different versions available using Value.Versions:

let
    Source = Lakehouse.Contents(),
    SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
    SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
    SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
    ShowVersions = Value.Versions(SelectTable)
in
    ShowVersions

Version 0 is the earliest version; the latest version of the data is the version with the highest number and this version can also be accessed from the row with the version number null. The nested values in the Data column are tables which give you the data for that particular version number. So, for example, if I wanted to get the data for version 2 I could click through on the nested value in the Data column in the row where the Version column contained the value 2. Here’s the M code for this:

let
    Source = Lakehouse.Contents(),
    SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
    SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
    SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
    ShowVersions = Value.Versions(SelectTable),
    Data = ShowVersions{2}[Data]
in
    Data

The Lakehouse connector uses the TDS Endpoint of the Lakehouse to get data by default, as in the first code snippet above, but if you use Value.Versions to get specific versions then this isn’t (as yet) possible so it will use a slower method to get data and performance may suffer.

Last of all, you can get the version number of the data you’re looking at using the Value.VersionIdentity function. If you’re looking at the latest version of the data then Value.VersionIdentity will return null:

let
    Source = Lakehouse.Contents(),
    SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
    SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
    SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
    GetVersion = Value.VersionIdentity(SelectTable)
in
    GetVersion

If you are looking at version 2 of the data then Value.VersionIdentity will return 2:

let
    Source = Lakehouse.Contents(),
    SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
    SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
    SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
    ShowVersions = Value.Versions(SelectTable),
    GetVersion2 = ShowVersions{2}[Data],
    GetVersionNumber = Value.VersionIdentity(GetVersion2)
in
    GetVersionNumber

Thoughts On Power BI Datasets Being Renamed To Semantic Models

Last week it was announced that Power BI datasets have been renamed: they are now semantic models. You can read the announcement blog post here and see the change in the Fabric/Power BI UI already.

The name change proved to be surprisingly uncontroversial. Of course it’s very disruptive – trust me, I know, I have around 500 blog posts that I need to do a search-and-replace on at some point – so I have a lot of sympathy for people with books or training courses that need updating or who are getting calls from confused end users who are wondering where their datasets have gone. But there was a general consensus that the change was the right thing to do:

When Marco approves of a change the whole Fabric team breathes a sigh of relief. The term “dataset” is too generic and too confusing for new developers; “semantic model” is a lot more specific and descriptive. Kurt Buhler has just written a very detailed post on what semantic models are. What else is there to say?

A name is often not just a name, it’s a statement of intent. While I don’t want you to read too much into the name change (Christian Wade does a good job of explaining how and why the name “semantic model” was chosen at the start of this Ignite session) and it’s always a mistake to think that we at Microsoft have some elaborate secret master plan for our products’ future development, people are nevertheless asking what the name “semantic model” signifies:

…and when someone as senior as Amir Netz asks me to do something, it’s probably a good idea for me to oblige 😉:

Power BI as a semantic layer is certainly one of my favourite topics: I wrote a very popular post on it last year. Even if it isn’t immediately apparent, Power BI is a semantic layer, a semantic layer made up of one or more semantic models. A lot of things (not just names) have changed in the world of Microsoft BI since I wrote that post which, in my opinion, only strengthen my arguments.

However you define the term “semantic layer”, reusability of data and business logic is a key feature. We all know that Bad Things happen to companies like the one discussed here on Reddit which create one semantic model per report: source systems are overloaded by the number of refreshes, the burden of maintenance becomes overwhelming and there are multiple versions of the truth. Creating the minimum number of semantic models necessary and using them as the source for your reports has always been a best practice in Power BI and the new name will, I hope, prompt developers to think about doing this more.

Would Power BI be better if it forced all developers to build their semantic layer upfront? No, I don’t think so. I believe a good BI tool gives you the flexibility to use it however you like so long as it can be used in the right way if you want – where “right” will mean different things for different organisations. If Power BI was more prescriptive and made you do the “right” thing up front then I doubt the company discussed on Reddit in the link above would be more successful; instead it would add so many barriers to getting started that they probably wouldn’t be using Power BI in the first place, they would be using Excel or some other tool in an equally inefficient way. What’s more, if Power BI chose one “right” way of doing things it might exclude other “right” ways of doing things, which would alienate the adherents of those other ways and be commercially damaging.

Fabric provides several new opportunities for reuse, with shortcuts and Direct Lake mode as the most obvious examples. Think about the number of Import mode semantic models you have in your organisation: each one will have a Date dimension table for sure, and there will certainly be a lot of dimension tables and probably a few fact tables duplicated across them. How much time and CPU is spent refreshing each of these tables? How many different versions of these tables are there, each one refreshed at different times? In Fabric you can maintain a single physical copy of your shared dimension tables and fact tables in Delta format in a Lakehouse, load data into them once, then reuse them in as many semantic models as you want via shortcuts. With Direct Lake mode no further refresh is needed, so each semantic model reuses the same copy of each dimension table and fact table and shows exactly the same data, saving time and compute and making them all consistent with each other. You can even now sync the tables in your Import mode semantic models to OneLake, making this pattern easier to adopt for existing Power BI users.

Another cause of data duplication in the past has been the different toolsets used by BI professionals and data scientists. Data is modelled and loaded for Power BI reports and business logic coded in DAX by the BI professionals, while in parallel data scientists have taken their own copies of the raw data, modelled it differently and implemented business logic in their own way in languages like Python. As Sandeep Pawar points out here, Semantic Link in Fabric now allows data scientists to query semantic models in SQL or in code, again promoting reuse and consistency.

Finally, looking ahead, I think the new Power BI Desktop Developer mode, Git integration and Tabular Model Definition Language (TMDL) will provide new ways of sharing and reusing business logic such as measure definitions between multiple semantic models. Not all the features necessary to do this are in Power BI/Fabric yet but when they do appear I’m sure we’ll see the community coming up with new patterns (perhaps successors to Michael Kovalsky’s Master Model technique) and external tools to support them.

In conclusion, as Power BI evolves into a part of something bigger with Fabric, then the new features I’ve mentioned here make it an even more mature semantic layer. Changing the name of datasets to semantic models is a way of highlighting this.

Analyse Power BI Data In Excel With Python

In the Power BI/Fabric community everyone is excited about the recent release of Semantic Link: the ability to analyse Power BI data easily using Python in Fabric notebooks. Sandeep Pawar has an excellent blog post here explaining what this is and why it’s so cool. Meanwhile in the Excel community, everyone is excited about the new integration of Python into Excel. But can you analyse Power BI data in Excel using Python? Yes you can – so as my teenage daughter would say, it’s time for a crossover episode.

Let’s see a simple example. The main problem to solve is how to get data from Power BI into Excel in a format that Python in Excel can consume easily, ie a table rather than a PivotTable. The easiest way to do this is using the new Excel Connected Table feature, described here, which creates a live connection back to Power BI so when the data in the dataset changes the data in Excel is updated too. I have a Power BI dataset published to the Service that contains data from the UK government’s Land Registry Price Paid data which has details of all the property transactions in England and Wales; I found this in Excel and clicked Insert Table:

I then defined a query that found the number of property transactions and average price paid broken down by county:

This gave me a table, connected to the dataset in the Power BI Service using a DAX query, in my worksheet:

I renamed this table to “Sales”:

Unfortunately you can’t change the ugly column names without rewriting the DAX query behind the table, which makes life more difficult later on.

Then, on a new worksheet, I created a Python code cell using the PY function and entered the following Python code:

df=xl("Sales[#All]", headers=True)
s = plt.scatter(df.iloc[:,1], df.iloc[:,2], marker=11)
s.axes.xaxis.set_label_text("Count Of Sales")
s.axes.yaxis.set_label_text("Average Price Paid")

[I’ve only just started learning Python so please excuse any newbie bad practices/mistakes here! The main point is to visualise the data returned from Power BI]

This gave me a scatterplot with each county as a marker, the count of sales measure on the x axis and the average price paid measure on the y axis:

A few comments:

  • The xl function allows you to reference Excel cells, ranges and tables in your code; the reference to Sales[#All] gets the whole of the Sales table, including headers; adding headers=True means the table headers are recognised as such
  • Dealing with those ugly column names in Python is such a pain that I copped out and referenced the columns by position
  • After entering the code and committing it, you also need to tell Excel to treat the output as an Excel Value rather than a Python object to see the scatterplot; you also need to resize the cell
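
On the second point above: if you’d rather not reference the columns by position, a simple workaround (a sketch, not the code I used) is to rename the dataframe columns immediately after reading the table and then refer to them by their new names:

df = xl("Sales[#All]", headers=True)

# replace the ugly DAX-generated headers with friendlier names
df.columns = ["County", "CountOfSales", "AveragePricePaid"]

s = plt.scatter(df.CountOfSales, df.AveragePricePaid, marker=11)
s.axes.xaxis.set_label_text("Count Of Sales")
s.axes.yaxis.set_label_text("Average Price Paid")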

A second way of getting data into Excel from Power BI is to export the data from a published Power BI report. If you’re going to do that, you should export as a connected table so again the data stays connected to the source Power BI dataset.

There’s also a third, slightly different way of getting data from Power BI into Excel that is possible if you have Premium and which is a bit more complex but also more flexible: you can use Power Query, but maybe not in the way you would expect. The xl function can reference the output of a Power Query query even if that query is not loaded to a worksheet or the Excel Data Model – which I think is a nice touch and important if you’re working with larger data volumes.

To get data from Power BI into Excel using Power Query you need to use Power Query’s Analysis Services connector to connect to your workspace’s XMLA Endpoint. Go to the Data tab in Excel, click the Get Data button then From Database/From SQL Server Analysis Services Database (Import):

On the connection dialog the XMLA Endpoint goes into the Server box, the name of the dataset goes into the Database box and you can paste a DAX query into the MDX or DAX query box:

There are several benefits to using this approach:

  • You can use your own DAX query rather than have one generated for you
  • You can easily edit the DAX query after you have created the Power Query query
  • You can rename the query as well as all those ugly column names, making them easier to work with in Python – I named my query SalesByCounty and renamed my columns to County, CountOfSales and AveragePricePaid

I then closed the Power Query Editor without loading the output of the query anywhere.

You can read more about how to use Power Query queries in Python in Excel here.

Finally, here’s the modified version of the Python code to create the scatterplot shown above:

df=xl("SalesByCounty")
s = plt.scatter(df.CountOfSales, df.AveragePricePaid, marker=11)
s.axes.xaxis.set_label_text("Count Of Sales")
s.axes.yaxis.set_label_text("Average Price Paid")

Note how, in the first line, I can reference the Power Query query by name in the xl function and how, in the second line, renaming the columns in Power Query makes writing the Python code much easier.

Is this actually going to be useful to anyone? Well if Python in Excel is going to be used, it will be used by data analysts who love both Excel and Python – and who are also likely to use Power BI too. As Sandeep argues in the blog post about Semantic Link referenced above there are several reasons why these data analysts should use a Power BI dataset as a data source for their work rather than going back to the raw data: for example they can be sure they are using exactly the same data that is being used in their reports and they can use measures defined in the dataset rather than have to recreate the same calculations in their own code. While Semantic Link in Fabric is much more powerful than anything you can do in Excel with Power BI data, it’s only available in Fabric notebooks and this needs a Fabric or Premium capacity; this technique is available to anyone who has Python in Excel and works with Power BI Pro as well as Premium. So yes, in conclusion, I think there are some practical uses for this.

What Does It Mean To Refresh A Direct Lake Power BI Dataset In Fabric?

If you’ve heard about the new Direct Lake mode for Power BI datasets in Fabric you’ll know that it gives you the query performance of Import mode (well, almost) without the need to actually import any data. Direct Lake datasets can be refreshed though – in fact, they refresh automatically by default – and if you look at the dataset’s Refresh History you’ll see there’s a Direct Lake section which sometimes shows errors:

Also, if you look at a custom dataset’s Settings page (although not yet for a default dataset) you’ll see some properties to control refresh too:

So what does it actually mean to refresh a Direct Lake dataset if it doesn’t involve loading data into the dataset?

The section on Direct Lake refresh in the Fabric docs has the following information:

Invoking a refresh for a Direct Lake dataset is a low cost operation where the dataset analyzes the metadata of the latest version of the Delta Lake table and is updated to reference the latest files in the OneLake.

Let’s see what this means using a simple example. I built a Dataflow Gen2 that loads a single row of data into a table in a Fabric Lakehouse with two columns: one called Sales that always contains the value 1 and one called LoadDate that contains the date and time the dataflow ran:

I ran the dataflow once to load a row of data into a table called MyTable in a Lakehouse:

I then built a custom dataset (because I want to change those refresh options mentioned above) consisting of just this table, and finally a report showing the contents of the table in the dataset:

I then connected SQL Server Profiler to the custom dataset via the XMLA Endpoint and started a trace to capture the Command Begin/End and Progress Report Begin/End events, and refreshed the dataflow (and only the dataflow) to load another row of data into the Lakehouse. Soon after the dataflow refresh finished, the Profiler trace showed a dataset refresh started automatically:

Refreshing the report showed the second row that had just been loaded:

This shows that, with the default settings, a Direct Lake dataset is automatically refreshed when data is loaded into a Lakehouse.

I then went to the Settings pane for the dataset and turned off the “Keep your Direct Lake data up to date” property:

I then ran the dataflow again and this time the Profiler trace showed that no automatic refresh took place; the new row was not shown in the report either. Manually refreshing the dataset from the workspace did result in the new row appearing in the report:

Next, I used a Notebook to delete all the rows from the table:
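
Something like the following Spark SQL, run in a notebook with the Lakehouse attached as its default, is all that’s needed (a minimal sketch rather than the exact code I ran):

# delete every row from the Delta table in the attached Lakehouse
spark.sql("DELETE FROM MyTable")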

At this point the report still showed the three rows displayed in the previous screenshot. Finally, I refreshed the dataset one more time and all the data disappeared from the report:

It’s important to stress that the reason that the refresh is needed to show the latest data in the table is not because the data is being loaded into the dataset. It’s because, as the docs say, refresh tells the dataset to look at the latest version of the data in the table – which leads on to the whole topic of time travel in Delta tables in Fabric. Dennes Torres has a nice blog post on this subject here which is a great place to start.

Why would you ever want to refresh a Direct Lake dataset manually? Again, the docs have the answer:

You may want to disable [refresh] if, for example, you need to allow completion of data preparation jobs before exposing any new data to consumers of the dataset. 

So, let’s say you need to load some new data to your table and also delete or update some data that’s already there and this needs to be done as several separate jobs. It’s very likely that you don’t want your Power BI reports to show any of the new data until all of these jobs have been completed, so to do this you will want to disable automatic refresh and do a manual dataset refresh as the last step of your ETL.
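
If that ETL is orchestrated from a Fabric notebook, Semantic Link makes the final manual refresh straightforward – here’s a minimal sketch, assuming you fill in your own workspace and dataset names:

import sempy.fabric as fabric

# ...run all of your data preparation jobs first...

# then, as the final step of the ETL, manually refresh the Direct Lake dataset
# so that reports only see the new version of the Delta tables once everything has finished
fabric.refresh_dataset("Insert dataset name here", "Insert workspace name here")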

Keep Your Existing Power BI Data And Add New Data To It Using Fabric

One of the most popular posts on my blog in the last few years has been this one:

To be honest I’m slightly ashamed of this fact because, as I say in the post, the solution I describe is a bit of a hack – but at the same time, the post is popular because a lot of people have the problem of needing to add new data to the data that’s already there in their Power BI dataset and there’s no obvious way of doing that. As I also say in that post, the best solution is to stage the data in a relational database or some other store outside Power BI so you have a copy of all the data if you ever need to do a full refresh of your Power BI dataset.

Why revisit this subject? Well, with Fabric it’s now much easier for you as a Power BI developer to build that place to store a full copy of your data outside your Power BI dataset and solve this problem properly. For a start, you now have a choice of where to store your data: either in a Lakehouse or a Warehouse, depending on whether you feel comfortable with using Spark and notebooks or relational databases and SQL to manage your data. What’s more, with Dataflows gen2, when you load data to a destination you now have the option to append new data to existing data as well as to replace it:

If you need more complex logic to make sure you only load new records and not ones that you’ve loaded before, there’s a published pattern for that.

“But I’m a Power BI developer, not a Fabric developer!” I hear you cry. Perhaps the most important point to make about Fabric is that Power BI is Fabric. If you have Power BI today, you will have Fabric soon if you don’t have the preview already – they are the same thing. One way of thinking about Fabric is that it’s just Power BI with a lot more stuff in it: databases, notebooks, Spark and pipelines as well as reports, datasets and dataflows. There are new skills to learn but solving this problem with the full range of Fabric workloads is a lot less complex than the pure Power BI approach I originally described.

“But won’t this be expensive? Won’t it need a capacity?” you say. It’s true that to do all this you will need to buy a Fabric capacity. But Fabric capacities start at a much cheaper price than Power BI Premium capacities: an F2 capacity costs $0.36USD per hour or $262.80USD per month and OneLake storage costs $0.023 per GB per month (for more details see this blog post and the docs), so Fabric capacities are a lot more affordable than Power BI Premium capacities.

So, with Fabric, there’s no need for complex and hacky workarounds to solve this problem. Just spin up a Fabric capacity, create a Warehouse or Lakehouse to store your data, use Dataflows Gen2 to append new data to any existing data, then build your Power BI dataset on that.

Fabric Dataflows Gen2: To Stage Or Not To Stage?

If you read this post that was published on the Fabric blog back in July, you’ll know that each Power Query query in a Fabric Gen2 dataflow has a property that determines whether its output is staged or not – where “staged” means that the output is written to the (soon-to-be hidden) Lakehouse linked to the dataflow, regardless of whether you have set a destination for the query output to be written to. Turning this on or off can have a big impact on your refresh times, making them a lot faster or a lot slower. You can find this property by right-clicking on the query name in the Queries pane:

At the moment this property is on by default for every query although this may change in the future. But should you turn it on for the queries in your Gen2 dataflows? It depends, and you should test to see what gives you the best performance.

Let’s see a simple example. I uploaded a CSV file from my favourite data source, the Land Registry price paid data, with about a million rows in it to the files section of a Lakehouse, then created a query that did a group by on one of the columns to find the number of property transactions by each county in England and Wales. The query was set to load its output to a table in a Warehouse.

Here’s the diagram view for this query:

I then made sure that staging was turned off for this query:

This means that the Power Query engine did the group by itself as it read the data from the file.

Looking at the refresh history for this dataflow:

…showed that the query took between 18-24 seconds to run. Clicking on an individual refresh to see the details:

…showed a single activity to load the output to the Warehouse. Clicking on this activity to see more details:

…shows how long it took – 15 seconds – plus how many rows were loaded to the destination Warehouse and how much data.

I then created a second dataflow to see the effect of staging. It’s important to understand that copying the previous dataflow and enabling staging on the only query in it does not do what I wanted here. Instead I had to create two queries: one with staging enabled and no destination set (called PP here), which stages all the raw data from the CSV file, and a second one with staging disabled (called Counties here), which references the first, does the group by and has its destination set to the Warehouse I used in the previous dataflow.

Here’s the diagram view for these two queries:

Note the blue outline on the PP query which indicates that it’s staged and the grey outline on the Counties query that indicates that it is not staged.

Looking at the Refresh History for this dataflow showed that it took around 40 seconds to run on average:

Looking at the first level of detail for the last refresh showed the extra activity for staging the data:

Clicking on the details for this staging activity for the PP table showed that it took 17 seconds to load all the raw data:

The activity to write the data to the Warehouse took about the same as with the first dataflow:

In summary, the first dataflow clearly performs better than the second dataflow. In this case, therefore, it looks like the overhead of staging the data made the performance worse.

Don’t take this simple example as proof of a general rule: every dataflow will be different and there are a lot of performance optimisations planned for Dataflows Gen2 over the next few months, so you should test the impact of staging for yourself. I can imagine that for some data sources (a Lakehouse source is likely to perform very well, even for files) and some transformations staging will have a positive impact. On the other hand, if you’re struggling with Dataflows Gen2 performance, especially at the time of writing this post, turning off staging could lead to a performance improvement.

Can You Develop For Power BI Using Only Your Browser?

One frequently asked question I see asked on Power BI forums is whether it’s possible to run Power BI Desktop on a Mac or indeed anything other than a Windows PC. There are already a lot of detailed blog posts and videos out there on this subject, such as this one from Guy In A Cube: the answer is no, you can’t run Power BI Desktop natively on a Mac or any other OS apart from Windows and there are no plans to port it over, so you need to either install Windows somehow (for example with Boot Camp) or use tools like Parallels or Turbo.Net to run Power BI Desktop. You can also spin up a Windows VM, for example in Azure, and run Power BI Desktop on that; Power BI Desktop is also now fully supported on Azure Virtual Desktop, although not on other virtual environments like Citrix.

Turning the question around, however, leads you to some aspects of the question that haven’t been fully explored. Instead of asking “Can I run Power BI Desktop on my Mac?”, you can instead ask “Can I do all of my Power BI development using only a browser?”. At Microsoft our long-term goal is to make all Power BI development web-based, but how close are we to that goal?

The first point to make is that it has always been possible to build Power BI reports (as opposed to datasets) in the browser without needing Power BI Desktop. You can now build basic paginated reports in the browser too. Historically I’ve never been a fan of encouraging users to do this because developing in Power BI Desktop gives you the chance to roll back to a previous version of the report if you need to – assuming you have saved those previous versions of your .pbix file. What’s more, if two or more people try to edit the same report at the same time then the last person to save wins and overwrites the other person’s changes, which can be dangerous. Fabric’s Git integration, which does work for Power BI reports, has changed my attitude somewhat though. As Rui Romano discusses here you can now safely make changes to reports in the Power BI Service, save them to source control and then roll back if you need to; this assumes your users are comfortable using Git, however, and it doesn’t solve the simultaneous development problem.

What about dataset development? Web editing for datasets has been in preview for a few months now and is getting better and better, although there are still several limitations and the focus up to now has been on modelling; connecting to data sources is on the public roadmap though. As a result Power BI Desktop is still needed for dataset development, at least for now.

Do datamarts change anything? Or Direct Lake mode in Fabric? Datamarts do solve the problem of being able to connect to and load data using just your browser and are available (if not GA yet) today. If you’re only using datamarts to avoid the need for a Windows PC to develop on, though, you’re paying a price: for a start, you’ll either be loading the data twice if you want to use Import mode for your dataset (once to load data into the datamart, once to load the same data into the dataset) or taking the query performance hit of using DirectQuery mode. There are also some other limitations to watch out for. Fabric Direct Lake mode datasets, for me, offer all the benefits of datamarts without so many of the limitations – Direct Lake mode means you only load the data once and still get near-Import mode performance, for example – and will be the obvious choice when Fabric GAs and once features like OneSecurity are available. With Fabric it will be possible for most Power BI developers to do all their work using only a browser, although for more complex projects (and to be clear this is only a small minority of projects) it will still be necessary to use other tools such as Tabular Editor, DAX Studio, SQL Server Management Studio and SQL Server Profiler, which can only run on a Windows PC. I can imagine some of this more advanced developer functionality coming to the browser too in time, though.

In summary while Power BI Desktop and therefore Windows is still needed for Power BI development today, the day when you can do most and maybe all of your development in the browser is in sight. All you Mac owners need to be patient just a little while longer!