Microsoft Fabric, Power BI, Analysis Services, DAX, M, MDX, Power Query, Power Pivot and Excel
Author: Chris Webb
My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
Power Query allows you to merge (“join” in database terms) two tables together in a variety of different ways, including left and right anti joins. Unfortunately, as I found recently, anti joins don’t fold on SQL Server-related data sources, which can result in performance problems. Luckily there is a different way of doing anti joins that does fold.
Say you have two Power Query queries called Fruit1 and Fruit2 that get data from SQL Server tables containing the names of different varieties of fruit:
Now, let’s say you want to get a list of all the fruit varieties that are in Fruit1 and not in Fruit2. The obvious thing to do is to do a Merge and use the Left Anti option like so:
Here’s the M code, including an extra step to remove the join column that this creates:
let
Source = Table.NestedJoin(Fruit1, {"Fruit"}, Fruit2, {"Fruit"}, "Fruit2", JoinKind.LeftAnti),
#"Removed Other Columns" = Table.SelectColumns(Source,{"Fruit"})
in
#"Removed Other Columns"
This gives you the correct result:
…but it doesn’t fold, so performance may be bad.
However, if you do a Merge and use the Left Outer option instead:
Then expand the join column (called Fruit2.Fruit here):
And if you then filter on that column so you only keep the rows where it contains the value null, and then remove that column, you get the same result:
Here’s the M:
let
Source = Table.NestedJoin(Fruit1, {"Fruit"}, Fruit2, {"Fruit"}, "Fruit2", JoinKind.LeftOuter),
#"Expanded Fruit2" = Table.ExpandTableColumn(Source, "Fruit2", {"Fruit"}, {"Fruit2.Fruit"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded Fruit2", each ([Fruit2.Fruit] = null)),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Fruit"})
in
#"Removed Other Columns"
This now does fold (meaning performance should be better) and gives you the following SQL:
select [_].[Fruit]
from
(
select [$Outer].[Fruit],
[$Inner].[Fruit2]
from [dbo].[Fruit1] as [$Outer]
left outer join
(
select [_].[Fruit] as [Fruit2]
from [dbo].[Fruit2] as [_]
) as [$Inner] on ([$Outer].[Fruit] = [$Inner].[Fruit2] or [$Outer].[Fruit] is null and [$Inner].[Fruit2] is null)
) as [_]
where [_].[Fruit2] is null
A lot of DAX performance problems relate to the use of strict and eager evaluation with the IF or SWITCH functions. It’s an incredibly complex, “it depends”, black-box type of topic and naturally Marco and Alberto have a great article on it here. Rather than describe any general rules – which would be pretty much impossible – in this blog post I want to show a specific scenario that illustrates how the use of DAX variables can influence whether strict or eager evaluation takes place.
Consider a Power BI semantic model that has the following three tables:
TableA and TableB are fact tables with numeric measure columns; Choice is a disconnected table containing text values that is intended for use in a slicer, so a report user can select an item in that slicer and that in turn influences what values a measure returns:
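For illustration, the disconnected Choice table could be created as a DAX calculated table like this (a sketch that assumes it contains just the two values “TableA” and “TableB”; how the table is actually loaded doesn’t matter for what follows):
Choice =
DATATABLE ( "Choice", STRING, { { "TableA" }, { "TableB" } } )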
Here’s the definition of one such measure:
With Variables =
VAR a =
    SUM ( TableA[A] )
VAR b =
    SUM ( TableB[B] )
RETURN
    IF ( SELECTEDVALUE ( Choice[Choice] ) = "TableA", a, b )
And here’s my test report that includes a slicer and a card visual displaying the output of this measure:
Let’s look at the DAX query generated by the card visual containing the measure when “TableA” is selected in the slicer:
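The query itself looks something like this simplified sketch (the real query generated by the visual uses auto-generated aliases and a slightly different shape, but the TREATAS filter over the Choice table is the important part):
DEFINE
    VAR __DS0FilterTable = TREATAS ( { "TableA" }, 'Choice'[Choice] )
EVALUATE
    SUMMARIZECOLUMNS ( __DS0FilterTable, "With Variables", [With Variables] )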
…and in particular what DAX Studio’s Server Timings feature shows us:
Even though only TableA is selected in the slicer and the query only returns the sum of the values in the A column of TableA, we can see that the Storage Engine is also querying TableB and getting the sum of the B column. It’s a great example of eager evaluation: both branches of the IF are being evaluated. Is this a bad thing? For this particular report it may be if the Storage Engine query for TableB is expensive.
How can you force strict evaluation to take place? You can force eager evaluation using the IF.EAGER function, but there is no equivalent function to force strict evaluation. However, you may be able to rewrite the measure so that strict evaluation takes place.
The key factor in this case is the use of variables in the measure definition. If you rewrite the measure to not use variables, like so:
No Variables =
IF (
SELECTEDVALUE ( Choice[Choice] ) = "TableA",
SUM ( TableA[A] ),
SUM ( TableB[B] )
)
…then the query for the card visual behaves differently:
DEFINE VAR __DS0FilterTable = TREATAS({"TableA"}, 'Choice'[Choice])
Notice that there are now only two Storage Engine queries and that TableB is no longer being queried, which will probably result in better performance. This is strict evaluation.
Why does the use of variables result in eager evaluation here? It’s complicated. I need to stress that DAX uses lazy evaluation for variables, which means that variables are not evaluated if they are not used – in the first measure above the IF is deliberately evaluating both branches. There are also other optimisations that may kick in and result in strict evaluation even when variables are used in IF/SWITCH. Indeed, the fact that the two branches in this example sum columns from different fact tables is extremely important: if they had used the same fact table then the behaviour would have been different and strict evaluation would have been used. In summary, though, if you are using variables and you want to try to force strict evaluation with IF/SWITCH, it’s worth rewriting your code to remove those variables to see if it makes a difference.
I also need to stress that these two measures will perform better or worse depending on how and where they are used. Consider these two table visuals that use the values from the Choice table on rows along with the two measures above:
Running the DAX query for the table on the left, which uses the [No Variables] measure and therefore strict evaluation, gives the following in DAX Studio’s Server Timings:
Note that there are two Storage Engine queries for TableB, which is not a good thing.
On the other hand, the table on the right, which uses the [With Variables] measure, gives the following:
Note that there is only one Storage Engine query for TableB now, so in this case the tables have turned and the [With Variables] measure is likely to perform better.
When you use variables in the branches of IF/SWITCH, as in the [With Variables] measure, the variables are evaluated in the filter context of the place where they are defined; if you don’t use variables in this way, as in the [No Variables] measure, the two SUM expressions used in the two branches are evaluated in the context of their branch, which adds a hidden filter context corresponding to the branch’s condition. This has two consequences which may hurt query performance:
During evaluation, the hidden filter may be materialised into a large in-memory table
Fusion cannot happen across common subexpressions in different branches because they have different contexts
In contrast, the use of variables means the expressions used in them are evaluated in a context without the hidden filter corresponding to the branch, which can encourage fusion and discourage the materialisation of large in-memory tables.
[Thanks to Marius Dumitru and Jeffrey Wang for much of the information in this post]
Last week it was announced that Power BI datasets have been renamed: they are now semantic models. You can read the announcement blog post here and see the change in the Fabric/Power BI UI already.
The name change proved to be surprisingly uncontroversial. Of course it’s very disruptive – trust me, I know, I have around 500 blog posts that I need to do a search-and-replace on at some point – so I have a lot of sympathy for people with books or training courses that need updating or who are getting calls from confused end users who are wondering where their datasets have gone. But there was a general consensus that the change was the right thing to do:
When Marco approves of a change the whole Fabric team breathes a sigh of relief. The term “dataset” is too generic and too confusing for new developers; “semantic model” is a lot more specific and descriptive. Kurt Buhler has just written a very detailed post on what semantic models are. What else is there to say?
A name is often not just a name, it’s a statement of intent. While I don’t want you to read too much into the name change (Christian Wade does a good job of explaining how and why the name “semantic model” was chosen at the start of this Ignite session) and it’s always a mistake to think that we at Microsoft have some elaborate secret master plan for our products’ future development, people are nevertheless asking what the name “semantic model” signifies:
…and when someone as senior as Amir Netz asks me to do something, it’s probably a good idea for me to oblige 😉:
Power BI as a semantic layer is certainly one of my favourite topics: I wrote a very popular post on it last year. Even if it isn’t immediately apparent, Power BI is a semantic layer, a semantic layer made up of one or more semantic models. A lot of things (not just names) have changed in the world of Microsoft BI since I wrote that post which, in my opinion, only strengthen my arguments.
However you define the term “semantic layer”, reusability of data and business logic is a key feature. We all know that Bad Things happen to companies like the one discussed here on Reddit which create one semantic model per report: source systems are overloaded by the number of refreshes, the burden of maintenance becomes overwhelming and there are multiple versions of the truth. Creating the minimum number of semantic models necessary and using them as the source for your reports has always been a best practice in Power BI and the new name will, I hope, prompt developers to think about doing this more.
Would Power BI be better if it forced all developers to build their semantic layer upfront? No, I don’t think so. I believe a good BI tool gives you the flexibility to use it however you like, so long as it can be used in the right way if you want – where “right” will mean different things for different organisations. If Power BI was more prescriptive and made you do the “right” thing up front, I doubt the company discussed on Reddit in the link above would be more successful; instead it would add so many barriers to getting started that they probably wouldn’t be using Power BI in the first place – they would be using Excel or some other tool in an equally inefficient way. What’s more, if Power BI chose one “right” way of doing things it might exclude other “right” ways of doing things, which would alienate the adherents of those other ways and be commercially damaging.
Fabric provides several new opportunities for reuse, with shortcuts and Direct Lake mode as the most obvious examples. Think about the number of Import mode semantic models you have in your organisation: each one will have a Date dimension table for sure, and there will certainly be a lot of dimension tables and probably a few fact tables duplicated across them. How much time and CPU is spent refreshing each of these tables? How many different versions of these tables are there, each one refreshed at different times? In Fabric you can maintain a single physical copy of your shared dimension tables and fact tables in Delta format in a Lakehouse, load data into them once, then reuse them in as many semantic models as you want via shortcuts. With Direct Lake mode no further refresh is needed, so each semantic model reuses the same copy of each dimension table and fact table and shows exactly the same data, saving time and compute and making them all consistent with each other. You can even now sync the tables in your Import mode semantic models to OneLake, making this pattern easier to adopt for existing Power BI users.
Another cause of data duplication in the past has been the different toolsets used by BI professionals and data scientists. Data is modelled and loaded for Power BI reports and business logic coded in DAX by the BI professionals, while in parallel data scientists have taken their own copies of the raw data, modelled it differently and implemented business logic in their own way in languages like Python. As Sandeep Pawar points out here, Semantic Link in Fabric now allows data scientists to query semantic models in SQL or in code, again promoting reuse and consistency.
Finally, looking ahead, I think the new Power BI Desktop Developer mode, Git integration and Tabular Model Definition Language (TMDL) will provide new ways of sharing and reusing business logic such as measure definitions between multiple semantic models. Not all the features necessary to do this are in Power BI/Fabric yet but when they do appear I’m sure we’ll see the community coming up with new patterns (perhaps successors to Michael Kovalsky’s Master Model technique) and external tools to support them.
In conclusion, as Power BI evolves into a part of something bigger with Fabric, then the new features I’ve mentioned here make it an even more mature semantic layer. Changing the name of datasets to semantic models is a way of highlighting this.
Excel workbooks are one of the slowest data sources you can use with Power Query in Excel or Power BI. Reading small amounts of data from small workbooks is usually fast; reading large amounts of data from large workbooks can be very slow. But what about reading small amounts of data from large Excel workbooks? I did some tests and it turns out that performance can vary a lot depending on where your data is in the workbook and how that workbook is structured.
[Note: in this post I’ll be looking at .xlsx files, rather than other Excel formats like .xls and .xlsb; Excel files stored on a local disk and accessed via the File.Contents M function rather than stored in SharePoint or any other location; data read from Excel tables rather than direct from the worksheet; and Power Query in Power BI. Other scenarios may perform differently.]
Let’s see a simple example to illustrate what I found out. I created a new Excel workbook with one worksheet in and put a small table of data on it:
At this point the workbook’s size was 11KB. I then opened Power BI Desktop and created a Power Query query that read this table of data from the Excel workbook:
let
Source = Excel.Workbook(File.Contents("C:\MyWorkbook.xlsx"), null, true),
Sales_Table = Source{[Item="Sales",Kind="Table"]}[Data],
#"Changed Type" = Table.TransformColumnTypes(Sales_Table,{{"Product", type text}, {"Sales", Int64.Type}})
in
#"Changed Type"
Then I used this technique to measure how long it took to load the data from Excel into Power BI. Unsurprisingly, it was extremely fast: 63ms.
Then I added a new worksheet to the workbook, copied the same table onto it, added a large amount of random numbers underneath using the following Excel formula, and then copied and pasted the values returned by the formula over the output of the formula:
=RANDARRAY(9999,300)
Doing this meant the size of the workbook grew to 43MB. I then created a new Power Query query in Power BI Desktop, identical to the one above except that it connected to the new table. This time the query took 4918ms – almost 5 seconds.
Interestingly, even after the second worksheet with all the extra data on it was added, the first query above (on the worksheet with no other data on it) was still fast. I also tested refreshing a Power BI dataset that connected to two identical small tables on different worksheets in the same workbook, both worksheets containing large amounts of other data as in the second scenario above, and the performance of both queries was only slightly slower: it was clear that two Power Query queries can read data from the same Excel workbook in parallel.
So: reading a small amount of data from a table on a worksheet with a large amount of other data on it is very slow.
What can we learn from this? Well, if you can influence the structure and layout of the Excel workbooks you are using as a data source – and that’s a big if, because in most cases you can’t – and you only need to read some of the data from them, you should put the tables of data you are using as a source on separate worksheets and not on the same worksheet as any other large ranges or tables of data.
It turns out that when the Power Query Excel connector reads data from an .xlsx file it can deserialise just some of the data in it rather than the whole thing, but what it can and can’t avoid deserialising depends a lot on how the workbook is structured and how the data is stored within the .xlsx file. If you’re quick you can even see how much data is being read in Power BI Desktop in the refresh dialog:
You can also use Process Monitor, as I describe here, to see how much data is being read from any file used by Power Query.
Performance also depends on which application generated the .xlsx file (it’s not just Excel that creates .xlsx files, because other applications export data to .xlsx format without using Excel) or even which version of Excel saved the .xlsx file. This is because the same data can be stored in an .xlsx file in different ways, some of which may be more efficient to read than others. I found this blog post by Brendan Long on the .xlsx file format was very clear and it helped me understand how Power Query might go about reading data from an .xlsx file.
[Thanks to Curt Hagenlocher of the Power Query team for answering some of my questions relating to this post]
Recently I was involved in an interesting discussion on X (formerly Twitter) about how partitioning a table in a Power BI semantic model can affect compression and therefore its size and query performance. You can read the thread here. It got me thinking: is there a way to get more detail on how well compression, and in particular run-length encoding (RLE from hereon), is working for a column when you’re using Import mode or Direct Lake? After a bit of research I found out there is, so let’s see some examples that illustrate what I learned.
First of all, consider the following M query that returns a table with one numeric column called MyNumbers:
let
Source = {1..DistinctValues},
Repeat = List.Transform(Source, each List.Repeat({_}, TotalRows/DistinctValues)),
Combine = List.Combine(Repeat),
#"Converted to Table" = Table.FromList(Combine, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "MyNumbers"}}),
#"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"MyNumbers", Int64.Type}})
in
#"Changed Type"
It references two M parameters: TotalRows, which determines the total number of rows in the table, and DistinctValues, which determines the number of distinct numeric values in the MyNumbers column. With TotalRows=9 and DistinctValues=3, it returns the following output:
Note that in this case it returns 3 rows with the value 1 repeated, 3 rows with the value 2 repeated and 3 rows with the value 3 repeated: three sets of repeated values in all. It’s fair to assume that repeated sets of values like this are a good candidate for RLE.
I created a semantic model in Import mode containing only this table and published it to the Power BI Service. Initially TotalRows was set to 1,000,000 and DistinctValues was set to 100 – so the table consisted of just 100 sets of 10,000 repeated values. I chose 1,000,000 rows because that’s the size of a segment in the Power BI Service with the “small semantic model format” setting and any compression that takes place always takes place within a segment.
When the Analysis Services engine inside Power BI compresses data it looks for sequences of repeated values to see if RLE can be used. Where it finds such sequences they result in “pure” RLE runs; where it doesn’t, the values end up in “impure” RLE runs and are stored using bitpack compression. Pure runs are generally a good thing, impure runs generally a bad thing. You can see how many pure and impure runs there are using the TMSCHEMA_COLUMN_STORAGES DMV, for example with the following DMV query:
select
[Name], Statistics_DistinctStates, Statistics_RowCount,
Statistics_RLERuns, Statistics_OthersRLERuns
from $SYSTEM.TMSCHEMA_COLUMN_STORAGES
Running this query in DAX Studio on my published semantic model returned the following table:
[You can ignore all the rows except the one for the MyNumbers column in this table]
The Statistics_RLERuns column shows the number of pure RLE runs; the Statistics_OthersRLERuns column shows the number of impure RLE runs. In this case you can see, for the MyNumbers column, there were 100 pure RLE runs and no impure runs, so as expected RLE is working well.
Here’s what Vertipaq Analyzer showed for this table:
Unsurprisingly the size of the MyNumbers column is very small.
Then I changed DistinctValues to 100,000 (keeping TotalRows at 1,000,000), giving me 100,000 sets of 10 values, and refreshed the dataset. Here’s what the DMV query on TMSCHEMA_COLUMN_STORAGES returned:
And here’s what Vertipaq Analyzer showed:
As you can see, the column was a lot larger than before; there were no pure RLE runs and one impure RLE run. In this case the large number of distinct values in the column prevented RLE from taking place and this had a negative impact on the size of the column.
These are two extreme cases. What about a scenario that’s somewhere in between? I modified my M query as follows:
let
RepeatedNumbers =
let
Source = {1..DistinctValues},
Repeat = List.Transform(Source, each List.Repeat({_}, ((TotalRows/2)/DistinctValues))),
Combine = List.Combine(Repeat),
#"Converted to Table" = Table.FromList(Combine, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "MyNumbers"}}),
#"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"MyNumbers", Int64.Type}})
in
#"Changed Type",
RandomNumbers =
let
Source = {1..TotalRows/2},
#"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Added Custom" = Table.AddColumn(#"Converted to Table", "MyNumbers", each Number.Round(Number.RandomBetween(TotalRows+1, TotalRows*2))),
#"Removed Other Columns" = Table.SelectColumns(#"Added Custom",{"MyNumbers"}),
#"Changed Type" = Table.TransformColumnTypes(#"Removed Other Columns",{{"MyNumbers", Int64.Type}})
in
#"Changed Type",
Output = Table.Combine({RepeatedNumbers, RandomNumbers})
in
Output
What this version of the code does is return a table where the first 50% of the rows are repeated numbers and the second 50% are random numbers. With TotalRows set to 12 and DistinctValues set to 2 it produces the following output:
With this version published to the Power BI Service I set TotalRows to 1,000,000 again and set DistinctValues to 2000, resulting in a table with 2000 sets of 250 repeating values followed by 500,000 random values. Here’s what the DMV query against TMSCHEMA_COLUMN_STORAGES returned:
As you can see there are now 2000 pure runs (I assume for the first 50% of rows with repeated values) and 1 impure run (I assume for the second 50% of rows with random values).
Here’s the output of Vertipaq Analyzer:
The column is now almost as large as in the second scenario above.
You can get a bit more detail about what’s happening in the impure runs with the DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS DMV. Running the following query against the latest version of the table:
select
column_ID, partition_name, segment_number,
records_count, bits_count, used_size
from $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS
…returns the following values:
To get a rough idea of the number of rows in the impure runs you can use the following formula:
(used_size * 8)/bits_count
In this case, for the MyNumbers column, (1349720 * 8)/21 = 514,179, which makes sense since my code returns 500,000 rows of random numbers. The records_count column in this query returns the total number of rows in the segment, so the higher the number of rows in impure runs relative to the total, the worse the compression you’re getting.
What practical use is this information? Probably not as much as you might think, interesting as it is. It can tell you how well RLE is working for a column, but it doesn’t tell you much about how to optimise it, whether it is possible to optimise it, or whether optimising it is a good idea – that’s a subject for another blog post.
[Thanks to Marius Dumitru and Akshai Mirchandani for the information in this post]
If you’re a fan of using Log Analytics for monitoring Power BI activity, then you may have noticed there’s some new data in the ApplicationContext column of the PowerBIDatasetsWorkspace table. Up until recently the ApplicationContext column only contained IDs that identify the report and the visual that generated a DAX query (something I blogged about here); it now contains additional information on the type of Power BI report that generated the query and an ID for the user session.
Here’s an example of the payload with the new data in bold:
[
{
"ReportId": "2beeb311-56c8-471a-83a3-6d7523d40dc7",
"VisualId": "477b5dc44249fe897411",
"HostProperties": {
"ConsumptionMethod": "Power BI Web App",
"UserSession": "a3d941bd-c374-4e0e-b911-5086310cb345"
}
}
]
Here’s an example KQL query that returns the new ConsumptionMethod and UserSession data from HostProperties:
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == 'QueryEnd'
| where ApplicationContext!=""
| extend hp = parse_json(ApplicationContext)
| extend ConsumptionMethod = hp.Sources[0].HostProperties.ConsumptionMethod,
UserSession = hp.Sources[0].HostProperties.UserSession
| project TimeGenerated, EventText, ApplicationName, ConsumptionMethod, UserSession
| order by TimeGenerated desc
ConsumptionMethod is less useful than it might first appear: at the time of writing it only returns data for Power BI reports (and not other types of report such as Paginated reports), although it will allow you to differentiate between different methods of viewing a Power BI report such as viewing via a browser or viewing via Teams. It should be used in combination with the ApplicationName column to get a fuller picture of the way reports are being consumed.
UserSession is something that I need to explore in more detail. Grouping user activity into sessions is something I blogged about here, but this is the user session ID used internally and therefore a lot more reliable. I don’t know the rules which govern how activity is grouped into sessions though, so I will only blog about this when I find out more.
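That said, if you want to start experimenting, here’s a sketch of the kind of KQL you could use to group query activity by UserSession, building on the query above (the aggregations shown are just examples):
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == 'QueryEnd'
| where ApplicationContext != ""
| extend UserSession = tostring(parse_json(ApplicationContext).Sources[0].HostProperties.UserSession)
| where isnotempty(UserSession)
| summarize Queries = count(), SessionStart = min(TimeGenerated), SessionEnd = max(TimeGenerated) by UserSession
| order by SessionStart desc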
[Thanks to my colleague Andreea Sandu for this information]
Back in 2021 I wrote a post showing how you can link a DAX query generated by a Power BI report in Log Analytics to a visual in a Power BI report. In that post I showed how you can get the unique IDs for each visual in a report using Performance Analyzer, an approach which works but has drawbacks. Then, in 2022, I wrote another post showing how you can get the same IDs using the Power BI Embedded Analytics Playground, which again works but isn’t perfect. After that, this August, my colleague Rui Romano pinged me on Teams to point out that the new Power BI Desktop Developer Mode and the .pbip file format provide the best and easiest way to get these IDs.
If you save a Power BI report in .pbip format, then look in the report folder and open the report.json file (where the report definition is stored) with a tool like Visual Studio Code, you can find the IDs for visuals quite easily. The sections array lists all the pages in the report; the visualContainers array within that lists all the visuals on a page; and the visual’s ID can be found in the config property as its name:
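Here’s a heavily simplified sketch of that structure – almost all properties are omitted, the visual ID is just an example, and note that the config value is itself a JSON-encoded string containing the name:
{
  "sections": [
    {
      "visualContainers": [
        {
          "config": "{\"name\": \"477b5dc44249fe897411\", ...}"
        }
      ]
    }
  ]
}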
As a result, if you’re looking in Log Analytics and you need to understand which visual a query was generated by, you can just copy the ID from the ApplicationContext column of a QueryEnd event and search for that ID in the report.json file. Even when you’ve found that ID, though, it can be hard to know which visual it relates to. This is where a tool I blogged about a few weeks ago, PBI Inspector, comes in handy: as part of its output it draws wireframes of each report page showing each visual, its type and its ID. Here’s the report page from the example above, with two visuals on it, and the associated PBI Inspector wireframe:
Yet another way in which Power BI Developer Mode makes our lives better!
In the Power BI/Fabric community everyone is excited about the recent release of Semantic Link: the ability to analyse Power BI data easily using Python in Fabric notebooks. Sandeep Pawar has an excellent blog post here explaining what this is and why it’s so cool. Meanwhile in the Excel community, everyone is excited about the new integration of Python into Excel. But can you analyse Power BI data in Excel using Python? Yes you can – so as my teenage daughter would say, it’s time for a crossover episode.
Let’s see a simple example. The main problem to solve is how to get data from Power BI into Excel in a format that Python in Excel can consume easily, i.e. a table rather than a PivotTable. The easiest way to do this is using the new Excel Connected Table feature, described here, which creates a live connection back to Power BI so that when the data in the dataset changes the data in Excel is updated too. I have a Power BI dataset published to the Service that contains the UK government’s Land Registry Price Paid data, which has details of all the property transactions in England and Wales; I found this in Excel and clicked Insert Table:
I then defined a query that found the number of property transactions and average price paid broken down by county:
This gave me a table, connected to the dataset in the Power BI Service using a DAX query, in my worksheet:
I renamed this table to “Sales”:
Unfortunately you can’t change the ugly column names without rewriting the DAX query behind the table, which makes life more difficult later on.
Then, on a new worksheet, I created a Python code cell using the PY function and entered the following Python code:
df=xl("Sales[#All]", headers=True)
s = plt.scatter(df.iloc[:,1], df.iloc[:,2], marker=11)
s.axes.xaxis.set_label_text("Count Of Sales")
s.axes.yaxis.set_label_text("Average Price Paid")
[I’ve only just started learning Python so please excuse any newbie bad practices/mistakes here! The main point is to visualise the data returned from Power BI]
This gave me a scatterplot with each county as a marker, the count of sales measure on the x axis and the average price paid measure on the y axis:
A few comments:
The xl function allows you to reference Excel cells, ranges and tables in your code; the reference to Sales[#All] gets the whole of the Sales table, including headers; adding headers=True means the table headers are recognised as such
Dealing with those ugly column names in Python is such a pain that I copped out and referenced the columns by position
After entering the code and committing it, you need to tell Excel to treat the output as an Excel Value rather than a Python object to see the scatterplot; you may also need to resize the cell
A second way of getting data into Excel from Power BI is to export the data from a published Power BI report. If you’re going to do that, you should export as a connected table so again the data stays connected to the source Power BI dataset.
There’s also a third, slightly different way of getting data from Power BI into Excel that is possible if you have Premium, and which is a bit more complex but also more flexible: you can use Power Query, but maybe not in the way you would expect. The xl function can reference the output of a Power Query query even if that query is not loaded to a worksheet or the Excel Data Model – which I think is a nice touch and important if you’re working with larger data volumes.
To get data from Power BI into Excel using Power Query you need to use Power Query’s Analysis Services connector to connect to your workspace’s XMLA Endpoint. Go to the Data tab in Excel, click the Get Data button then From Database/From SQL Server Analysis Services Database (Import):
On the connection dialog the XMLA Endpoint goes into the Server box, the name of the dataset goes into the Database box and you can paste a DAX query into the MDX or DAX query box:
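The query I pasted in was something along these lines (a sketch – the table, column and measure names here are illustrative and will depend on how your dataset is modelled, and the result still comes back with the ugly bracketed column names mentioned below):
EVALUATE
SUMMARIZECOLUMNS (
    'Price Paid'[County],
    "Count Of Sales", [Count Of Sales],
    "Average Price Paid", [Average Price Paid]
)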
There are several benefits to using this approach:
You can use your own DAX query rather than have one generated for you
You can easily edit the DAX query after you have created the Power Query query
You can rename the query as well as all those ugly column names, making them easier to work with in Python – I named my query SalesByCounty and renamed my columns to County, CountOfSales and AveragePricePaid
I then closed the Power Query Editor without loading the output of the query anywhere.
You can read more about how to use Power Query queries in Python in Excel here.
Finally, here’s the modified version of the Python code to create the scatterplot shown above:
df=xl("SalesByCounty")
s = plt.scatter(df.CountOfSales, df.AveragePricePaid, marker=11)
s.axes.xaxis.set_label_text("Count Of Sales")
s.axes.yaxis.set_label_text("Average Price Paid")
Note how, in the first line, I can reference the Power Query query by name in the xl function and how, in the second line, renaming the columns in Power Query makes writing the Python code much easier.
Is this actually going to be useful to anyone? Well if Python in Excel is going to be used, it will be used by data analysts who love both Excel and Python – and who are also likely to use Power BI too. As Sandeep argues in the blog post about Semantic Link referenced above there are several reasons why these data analysts should use a Power BI dataset as a data source for their work rather than going back to the raw data: for example they can be sure they are using exactly the same data that is being used in their reports and they can use measures defined in the dataset rather than have to recreate the same calculations in their own code. While Semantic Link in Fabric is much more powerful than anything you can do in Excel with Power BI data, it’s only available in Fabric notebooks and this needs a Fabric or Premium capacity; this technique is available to anyone who has Python in Excel and works with Power BI Pro as well as Premium. So yes, in conclusion, I think there are some practical uses for this.
If you’ve heard about the new Direct Lake mode for Power BI datasets in Fabric you’ll know that it gives you the query performance of Import mode (well, almost) without the need to actually import any data. Direct Lake datasets can be refreshed though – in fact, they refresh automatically by default – and if you look at the dataset’s Refresh History you’ll see there’s a Direct Lake section which sometimes shows errors:
Also, if you look at a custom dataset’s Settings page (although not yet for a default dataset) you’ll see some properties to control refresh too:
So what does it actually mean to refresh a Direct Lake dataset if it doesn’t involve loading data into the dataset?
Invoking a refresh for a Direct Lake dataset is a low cost operation where the dataset analyzes the metadata of the latest version of the Delta Lake table and is updated to reference the latest files in the OneLake.
Let’s see what this means using a simple example. I built a Dataflow Gen2 that loads a single row of data into a table in a Fabric Lakehouse with two columns: one called Sales that always contains the value 1 and one called LoadDate that contains the date and time the dataflow ran:
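The M behind a dataflow like this can be as simple as the following sketch (not necessarily exactly what I used):
let
    // one row: a constant Sales value plus the date and time the dataflow runs
    Source = #table(
        type table [Sales = Int64.Type, LoadDate = datetime],
        {{1, DateTime.LocalNow()}}
    )
in
    Source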
I ran the dataflow once to load a row of data into a table called MyTable in a Lakehouse:
I then built a custom dataset (because I want to change those refresh options mentioned above) consisting of just this table, and finally a report showing the contents of the table in the dataset:
I then connected SQL Server Profiler to the custom dataset via the XMLA Endpoint and started a trace to capture the Command Begin/End and Progress Report Begin/End events, and refreshed the dataflow (and only the dataflow) to load another row of data into the Lakehouse. Soon after the dataflow refresh finished, the Profiler trace showed a dataset refresh started automatically:
Refreshing the report showed the second row that had just been loaded:
This shows that, with the default settings, a Direct Lake dataset is automatically refreshed when data is loaded into a Lakehouse.
I then went to the Settings pane for the dataset and turned off the “Keep your Direct Lake data up to date” property:
I then ran the dataflow again and this time the Profiler trace showed that no automatic refresh took place; the new row was not shown in the report either. Manually refreshing the dataset from the workspace did result in the new row appearing in the report:
Next, I used a Notebook to delete all the rows from the table:
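Deleting all the rows from a Delta table like this can be done with a single Spark SQL statement in a notebook cell, for example (a sketch, not necessarily the exact code I ran):
spark.sql("DELETE FROM MyTable")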
At this point the report still showed the three rows displayed in the previous screenshot. Finally, I refreshed the dataset one more time and all the data disappeared from the report:
It’s important to stress that the reason that the refresh is needed to show the latest data in the table is not because the data is being loaded into the dataset. It’s because, as the docs say, refresh tells the dataset to look at the latest version of the data in the table – which leads on to the whole topic of time travel in Delta tables in Fabric. Dennes Torres has a nice blog post on this subject here which is a great place to start.
Why would you ever want to refresh a Direct Lake dataset manually? Again, the docs have the answer:
You may want to disable [refresh] if, for example, you need to allow completion of data preparation jobs before exposing any new data to consumers of the dataset.
So, let’s say you need to load some new data to your table and also delete or update some data that’s already there and this needs to be done as several separate jobs. It’s very likely that you don’t want your Power BI reports to show any of the new data until all of these jobs have been completed, so to do this you will want to disable automatic refresh and do a manual dataset refresh as the last step of your ETL.
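If you want to script that final refresh, one option is to run a TMSL refresh command through the XMLA Endpoint (the enhanced refresh REST API is another); the dataset name below is hypothetical:
{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "MyDirectLakeDataset"
      }
    ]
  }
}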
This week, one of my colleagues at Microsoft, Nat Van Gulck, showed me a cool new open-source tool he’s been working on to make VisOps for Power BI much easier: PBI Inspector. What is VisOps? I’ll admit I didn’t really know either, so being lazy I asked Nat to write a few paragraphs describing the project and why it will be useful:
Great progress has been made over the years with software development CI/CD tooling and processes (aka DevOps), just not so much with BI report visualisations and charts where we’ve come to accept only manual checks before publishing to production. PBI Inspector is a rules-based visual layer testing tool for Power BI which aims to fill this tooling gap. It runs on either the report author’s desktop or as part of a CI/CD pipeline. The latter follows naturally from the recent Power BI Desktop Developer mode announcement which marks a step change in providing Pro developers with much improved source control capabilities. PBI Inspector reads report files in the PBIP format (currently in Preview) although it also accepts PBIX files. Test results can be presented in several formats including HTML, JSON and Azure DevOps logging commands.
PBI Inspector’s rules combine Greg Dennis’s JsonLogic .NET implementation, which allows for the definition of expressive rule logic, with the querying abilities of JsonPath and JsonPointer libraries to select nodes from the Power BI report’s JSON layout definition for testing purposes. As an illustrative example, here’s a rule that tests if charts are wider than tall in each report page and returns an array with the names of visuals that fail the test:
Here’s an example result wireframe depiction of a report page (provided as part of the HTML output) highlighting two visuals that failed the test because they are taller than wide:
I think this is a great example of the kind of community innovation that Power BI Desktop Developer Mode allows (see also the recent announcement of PBI Explorer). A lot of organisations that use Power BI don’t, and will never, care about this kind of thing – but those who do have been very vocal about Power BI’s previous limitations in the area of DevOps and DataOps. Thanks to the work of people like Mathias Thierbach (of pbi-tools fame), John Kerski and Nat we can see how quickly Power BI is catching up.