Best Practices For Power BI On Databricks Webinar

I recently took part in a webinar with Denny Lee, Liping Huang and Marius Panga from Databricks on the subject of best practices for using Power BI on Databricks. You can view the recording on LinkedIn here:

https://www.linkedin.com/video/live/urn:li:ugcPost:7174102151407779841

…or on YouTube here:

My section at the beginning covering Power BI best practices for Import and DirectQuery doesn’t contain any new information – if you’ve been following the DirectQuery posts on this blog or read the DirectQuery guidance docs here and here then there won’t be any surprises. What I thought was really useful, though, was hearing the folks from Databricks talk about best practices on the Databricks side and this took up the majority of the webinar. Definitely worth checking out.

How Much Does Copilot Cost In Microsoft Fabric?

There’s a lot of excitement about Copilot in Power BI and in Fabric as a whole. The number one question I’m asked about Copilot by customers is “How much does it cost?” and indeed there have been two posts on the Fabric blog here and here attempting to answer this question. The point I want to make in this post, though, is that it’s the wrong question to ask.

Why? Well, to start off with, Copilot isn't something you buy separately. Every time you use Copilot in Fabric it uses compute from a capacity, either a P SKU or an F SKU, just the same as if you opened a report or refreshed a semantic model. You buy the capacity and that gives you a pool of compute that you can use however you want, and using Copilot is just one of the things you can use that compute for. No-one ever asked me how much it cost in dollars to open a report in Power BI or refresh a semantic model, so why ask this question about Copilot?

Of course I understand why Copilot feels like a special case: customers know a lot of users will want to play with it and they also know how greedy AI is for resources. Which brings me back to the point of this post: the question you need to ask about Copilot is “How much of my Power BI/Fabric capacity’s compute will be used by Copilot if I turn it on?”. Answering this question in terms of percentages is useful because if you consistently go over 100% usage on a Fabric capacity then you will either need to buy more capacity or experience throttling. And if Copilot does end up using a lot of compute, and you don’t want to buy more capacity or deal with throttling, then maybe you do want to limit its usage to a subset of your users or even turn it off completely?

To a certain extent the question about percentage usage can be answered with the Premium Capacity Metrics app. For example, I opened up a gen2 dataflow on a Premium capacity that returns the following data:

I expanded the Copilot pane and typed the following prompt:

Filter the table so I only have products that were produced in the country France

And here’s what I got back – a filtered table showing the row with the Producer Country France:

So what impact did this have on my capacity? Since I know this was the only time I used Copilot the day I wrote this post it was easy to find the relevant line in the “Background operations for timerange” table on the TimePoint Detail page of the Premium Capacity Metrics app:

There are two important things to note:

  • For the 30-second window I selected the operation above used 0.02% of my capacity
  • Copilot operations are classed as “background operations” so their cost is smoothed out over 24 hours. Therefore the operation above used 0.02% of my capacity for 24 hours from a few minutes after the point I hit enter and the operation ran; in this particular case I ran my test just before 16:20 on Friday March 15th and the operation used 0.02% of my capacity from 16:20 on Friday the 15th until 16:20 on Saturday March 16th.

How can you extrapolate tests like this to understand the likely load on your capacity in the real world? With great difficulty. Almost all the Fabric/Power BI Copilot experiences available today are targeted at people developing rather than just consuming, so that naturally limits the opportunities people have to use Copilot. Different prompts will come with different costs (as the blog posts mentioned above explain), a single user will want to use Copilot more than once in a day and you’ll have more than one user wanting to use Copilot. What’s more, going forward there will be more and more opportunities to use Copilot in different scenarios and as Copilot gets better and better your users will want to use it more. The compute usage of different Fabric Copilots may change in the future too. So your mileage will vary a lot.

Is it possible to make a rough estimate? Let's say you have 20 developers (which I think is reasonable for a P1/F64 – how many actively developed solutions are you likely to have on any given capacity?) writing 20 prompts per day. If each prompt uses 0.02% of your capacity then 20 * 20 * 0.02% gives a maximum of 8% of your capacity used by Copilot for the whole day. That's not inconsiderable and I'm sure someone will leave a comment saying I'm underestimating what usage will be.
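That back-of-the-envelope arithmetic is simple enough to sketch in code. This is only an illustration: the 0.02% per-prompt figure comes from my single test above and, as discussed, your mileage will vary a lot.

```python
# Back-of-the-envelope estimate of how much of a capacity's compute Copilot
# might use in a day. The 0.02% per-prompt figure comes from a single test
# and will vary by prompt, workload and future changes to Copilot itself.
def copilot_daily_usage_pct(developers: int, prompts_per_dev: int,
                            pct_per_prompt: float) -> float:
    """Estimated % of a capacity's daily compute used by Copilot."""
    return developers * prompts_per_dev * pct_per_prompt

estimate = copilot_daily_usage_pct(developers=20, prompts_per_dev=20,
                                   pct_per_prompt=0.02)
print(f"Estimated daily Copilot usage: {estimate:.0f}%")  # 8%
```

Plug in your own numbers for developers and prompts per day; the point is that the result scales linearly with all three inputs, so any error in your assumptions passes straight through to the estimate.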

Which brings me to my last point: should you even see Copilot as being different from anything else you can do in Fabric that consumes compute? Or as an optional extra or additional cost? After all, dataflows consume compute, you can enable or disable dataflows in your tenant in the same way you can enable or disable Copilot, but very few customers disable dataflows because they see the benefit of using them. Turning off dataflows would reduce the load on your capacity but it would also stop your users from being so productive, and why would you do that? If we at Microsoft deliver on the promise of Copilot (and believe me, we’re working hard on this) then the productivity gains it brings to your developers should offset the cost of any extra capacity you need to buy – if indeed you need to buy any extra capacity.

So, to sum up, if you enable Copilot in Fabric you will see additional load on your capacities and you may need to buy more capacity as a result – but the benefits will be worth it. Predicting that additional load is difficult but it's no different from predicting how your overall Fabric capacity usage will grow over time, as more and more reports, semantic models, notebooks, warehouses and so on get created and used. Rather than doing lots of complex calculations based on vague assumptions to try to predict that load, my advice is that you should use the Capacity Metrics app to monitor your actual usage and buy that extra capacity when you see you're going to need it.

Overhead Of Getting Relationship Columns In Power BI DirectQuery Mode

Many Power BI connectors for relational databases, such as the SQL Server connector, have an advanced option to control whether relationship columns are returned or not. By default this option is on. Returning these relationship columns adds a small overhead to the time taken to open a connection to a data source and so, for Power BI DirectQuery semantic models, turning this option off can improve report performance slightly.

What are relationship columns? If you connect to the DimDate table in the Adventure Works DW 2017 sample database using Power Query, you'll see them on the right-hand side of the table. The following M code:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2017"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate

…shows the relationship columns:

Whereas if you explicitly turn off the relationships by deselecting the "Include relationship columns" checkbox:

…you get the following M code with the CreateNavigationProperties property set to false:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2017", [CreateNavigationProperties=false]),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate

…and you don’t see those extra columns.

How much overhead does fetching relationship columns add? It depends on the type of source you’re using, how many relationships are defined and how many tables there are in your model (because the calls to get this information are not made in parallel). It’s also, as far as I know, impossible to measure the overhead from any public telemetry such as a Profiler trace or to deduce it by looking at the calls made on the database side. The overhead only happens when Power BI opens a connection to a data source and the result is cached afterwards, so it will only be encountered occasionally and not for every query that is run against your data source. I can say that the overhead can be quite significant in some cases though and can be made worse by other factors such as a lack of available connections or network/gateway issues. Since I have never seen anyone actually use these relationship columns in a DirectQuery model – they are quite handy in Power Query in general though – you should always turn them off when using DirectQuery mode.

[Thanks to Curt Hagenlocher for the information in this post]

Measuring The Total Time Spent Querying Sources In Power BI DirectQuery Mode

If you're tuning a DirectQuery semantic model in Power BI one of the most important things you need to measure is the total amount of time spent querying your data source(s). Now that the queries Power BI generates to get data from your source can be run in parallel it means you can't just sum up the durations of the individual queries to get the end-to-end duration. The good news is that there is a new trace event available in Log Analytics (though not in Profiler at the time of writing) which solves this problem.

The events have the OperationName ProgressReportBegin/ProgressReportEnd and the OperationDetailName ParallelSession. Here’s a simple Log Analytics query that you can use to see how this event works:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(10min)
| where OperationName in
("QueryBegin", "QueryEnd",
"DirectQueryBegin", "DirectQueryEnd",
"ProgressReportBegin", "ProgressReportEnd")
| project OperationName, OperationDetailName, EventText, TimeGenerated,
DurationMs, CpuTimeMs, DatasetMode, XmlaRequestId
| order by TimeGenerated asc

Here’s what this query returned for a single DAX query that generated multiple SQL queries (represented by DirectQueryBegin/End event pairs) against a relational data source:

Notice that there is just one ProgressReportBegin/End event pair per query; this is always the case, no matter how many fact tables or data sources are used, at least as far as I know. The ProgressReportBegin event for ParallelSession comes before the DirectQueryBegin/End event pairs and the associated End event comes after the final DirectQueryEnd event. The DurationMs column for the ProgressReportEnd event gives you the total duration in milliseconds of all the DirectQuery events. In this case there were six SQL queries sent to the source (and so six DirectQueryBegin/End event pairs), each of which took between 1 and 2 seconds. However, since all of these queries ran in parallel, the overall duration was still just over 2 seconds.
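The difference between summing individual query durations and measuring the end-to-end window is easy to illustrate outside Power BI. Here's a small sketch with made-up timings (not real trace data) showing why measuring the whole ParallelSession window is the right approach:

```python
# Hypothetical (start, end) timings in seconds for six SQL queries that ran
# in parallel, as you might reconstruct from DirectQueryBegin/End event
# pairs. These values are made up purely for illustration.
queries = [(0.0, 1.5), (0.1, 1.9), (0.2, 1.8), (0.1, 2.1), (0.3, 1.7), (0.2, 2.2)]

# Summing the individual durations overstates the cost when queries overlap...
sum_of_durations = sum(end - start for start, end in queries)

# ...whereas the wall-clock time - what the ParallelSession
# ProgressReportEnd event's DurationMs reports - is the span from the first
# query starting to the last one finishing.
wall_clock = max(end for _, end in queries) - min(start for start, _ in queries)

print(f"Sum of durations: {sum_of_durations:.1f}s")  # 10.3s
print(f"Wall-clock time:  {wall_clock:.1f}s")        # 2.2s
```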

dataCoverageDefinition: A New Optimisation For Hybrid Tables In Power BI

Hybrid tables – tables which contain both Import mode and DirectQuery mode partitions to hold data from different time periods – have been around for a while. They are useful in cases where your historic data doesn’t change but your most recent data changes very frequently and you need to reflect those changes in your reports; you can also have “reverse hybrid tables” where the latest data is in Import mode but your historic data (which may not be queried often but still needs to be available) is in DirectQuery mode. Up to now they had a problem though: even when you were querying data that was in the Import mode partition, Power BI still sent a SQL query to the DirectQuery partition and that could hurt performance. That problem is now solved with the new dataCoverageDefinition property on the DirectQuery partition.

You can find full documentation here:
https://learn.microsoft.com/en-us/analysis-services/tom/table-partitions?view=asallproducts-allversions

What dataCoverageDefinition does is tell Power BI what data is present in the DirectQuery partition so it knows whether to generate a SQL query to get data from your DirectQuery source or not. It takes the form of a simple DAX expression that returns True for rows that are stored in the DirectQuery partition, for example:

RELATED('DimDate'[CalendarYear]) IN {2017,2018,2019}
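To get a feel for what this predicate achieves, here's a toy sketch of the decision it enables. This is just an illustration of the idea – skip the SQL query when none of the requested data lives in the DirectQuery partition – and not the engine's actual implementation:

```python
# Toy illustration of the idea behind dataCoverageDefinition: if none of the
# years a query asks for fall inside the DirectQuery partition's coverage,
# no SQL query needs to be sent to the source at all. This is NOT the real
# engine logic, just a sketch of the decision it makes.
DIRECTQUERY_YEARS = {2017, 2018, 2019}  # matches the DAX predicate above

def needs_directquery(years_requested: set) -> bool:
    """Return True if any requested year lives in the DirectQuery partition."""
    return bool(years_requested & DIRECTQUERY_YEARS)

print(needs_directquery({2020, 2021}))  # False: Import partitions only
print(needs_directquery({2019, 2020}))  # True: must query the source
```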

If you’re working with large fact tables in DirectQuery mode and don’t have a religious objection to some of that data being stored in Import mode, you should check out hybrid tables because they can really improve report performance!

Query Caching In Power BI Premium

The more you monitor something the more likely it is that you’ll see something strange you can’t explain. This is true of several customers I’ve spoken to recently who saw DAX queries run in the middle of the night by people who were definitely not online at the time and who were worried about the resulting load on their Premium capacities. What is the cause – hackers? Ghosts? In a lot of cases the culprit is actually Power BI’s query caching feature.

Query caching is a feature that is available in Premium and Fabric capacities for Import mode semantic models. It allows Power BI to pre-cache the data needed by some report visuals and dashboard tiles so the reports and dashboards open faster. The documentation here is pretty good but, as always, a few examples are useful to help understand how it works and there are a couple of extra things to point out that are not on the docs page I just linked to.

Let’s start with a simple pbix file called FruitSalesModel.pbix containing a single blank report page and one table in the Import mode semantic model:

The different fruit names will make it easy to link different visuals to different DAX queries. There’s also one measure called Sales Amount that sums the values in the Sales column.

If you publish this pbix to an otherwise empty workspace hosted on a Premium capacity, you'll see that a semantic model and a blank report are created from it.

Query caching can be enabled on the semantic model by going to the model’s Settings page, expanding the Query Caching section and selecting “On”:

Enabling query caching now means that the Power BI Service may execute DAX queries to populate the cache immediately after a semantic model refresh takes place. If a query has been cached then, when an end user opens a report, Power BI can use the cached resultset for visuals which means that the report will open faster.

You can see how long all these queries take to run by looking at the Refresh History dialog and the details for an individual refresh (as mentioned in this blog post). The Query Cache line tells you when this query activity took place and how long it took:

At this point in our example, however, no queries will be executed because the only report connected to this semantic model is completely blank.

Which queries get cached? This is documented but I always prefer to run my own tests to see what happens; you can see the DAX queries being run after a refresh by running a Profiler trace on the semantic model.

Now edit the currently-blank report in the workspace so that it has two pages: add a single card to the first page that shows the value of the Sales Amount measure for the product Apples, and a single card to the second page that shows the value of the Sales Amount measure for the product Grapes. Make sure the first page is the page that is displayed when a user opens the report, close the report without going back to Reading View, and then refresh the semantic model. A Profiler trace shows… no queries being executed after the refresh. This is because only visuals that have been viewed by an end user in Reading View (not Editing mode) in the browser have their queries cached.

If you then open the report in Reading View and view both pages of the report and then refresh the semantic model again, you’ll see one query appear in the trace:

This is the query for the card on the first page of the report, for Sales Amount for Apples. The query for the card on the second page is not cached – only the queries for the visuals on the page that the report opens on are cached.

It’s also interesting to note that the ApplicationContext column in the trace has a value for the DatasetId but not ReportId or VisualId for these queries. This makes sense because these queries are not executed as a result of a report rendering.

If you then edit the report so it opens on the second page (which shows Sales Amount for Grapes) and not the first page, then refresh the semantic model again, the Profiler trace shows queries for both cards on both pages. In fact, even if you delete the first page from the report and refresh, both queries are still run:

The query for Sales Amount for Grapes is run because it’s now the query on the page of the report that opens first. The query for Sales Amount for Apples is still run because, once a query has been added to the list of queries to be cached, it will stay there until you either delete the semantic model or you turn the Query Caching setting on the semantic model off and on again.

It’s not just the visuals on the opening page of a report that are cached. If you pin any visual, from any page of a report, to a dashboard and then view that dashboard, the query for that visual will be cached. Also, if you create a personal bookmark and make that your default view when you open a report, then the queries for that personal bookmark will also be cached. Finally, if an end user opens a report, changes a filter or slicer on the opening page of a report, then closes the report and reopens it, the report will show the filter or slicer selection that was made when the report was last closed (this behaviour, known as “persistent filters”, is explained here). Queries for each user’s persistent filters on the opening page of a report will also be cached.

The behaviour above applies to all reports connected to a semantic model.

As a result, if your end users like creating dashboards or personal bookmarks, or if you have a lot of filters or slicers on the opening page of your report, or if you have lots of reports connected to a single semantic model, you can end up with a lot of queries being generated after the semantic model has refreshed when query caching is turned on. The good news for Premium capacity admins is that the queries run to populate the cache are classed as background operations and not interactive operations (as is the case for queries generated when a report renders). The CU cost of background operations is smoothed over 24 hours which means that you won’t get big spikes in usage on your capacity when a refresh takes place and the cache is populated, although if a lot of expensive queries are run this is definitely something you still need to monitor. Here’s a screenshot from the Capacity Metrics App showing DAX queries appearing as background operations:

In conclusion, query caching is a very powerful feature. Turning it on can make a big difference to the performance of your reports but you do need to understand how caching works to get the full benefit of it. You also need to make sure you watch out for it when monitoring a Premium capacity to make sure that it doesn’t increase your background utilisation too much.

Reading Parquet Metadata In Power Query In Power BI

There's a new M function in Power Query in Power BI that allows you to read the metadata of a Parquet file: Parquet.Metadata. It's not documented yet and it's currently marked as "intended for internal use only" but I've been told I can blog about it. Here's an example of how to use it:

let
    Source = Parquet.Metadata(File.Contents("C:\myfile.snappy.parquet"))
in
    Source

…and here’s an example of the output:

This query shows how to expand the record returned by this function into a table:

let
    m = Parquet.Metadata(File.Contents("C:\myfile.snappy.parquet")),
    schema = List.Accumulate(
        Table.ToRecords(m[Schema]),
        [],
        (x, y) =>
            if y[NumChildren] = null
            then Record.AddField(x, y[Name], y[LogicalType] ?? y[ConvertedType])
            else x
    ),
    expanded1 = Table.ExpandTableColumn(m[RowGroups], "Columns", {"MetaData"}),
    renamed1 = Table.RenameColumns(
        expanded1,
        {{"Ordinal", "RowGroup"}, {"TotalCompressedSize", "RowGroupCompressedSize"}, {"TotalByteSize", "RowGroupSize"}}
    ),
    expanded2 = Table.ExpandRecordColumn(
        renamed1,
        "MetaData",
        {"Type", "Encodings", "PathInSchema", "Codec", "NumValues", "TotalUncompressedSize", "TotalCompressedSize", "KeyValueMetadata", "DataPageOffset", "IndexPageOffset", "DictionaryPageOffset", "Statistics", "EncodingStats"}
    ),
    renamed2 = Table.RenameColumns(expanded2, {{"Type", "PhysicalType"}}),
    added1 = Table.AddColumn(renamed2, "Column", each Text.Combine([PathInSchema])),
    added2 = Table.AddColumn(added1, "Cardinality", each [Statistics][DistinctCount]),
    added3 = Table.AddColumn(added2, "NullCount", each [Statistics][NullCount]),
    added4 = Table.AddColumn(added3, "DictionarySize", each [DataPageOffset] - [DictionaryPageOffset]),
    added5 = Table.AddColumn(added4, "LogicalType", each Record.FieldOrDefault(schema, [Column], null)),
    selected = Table.SelectColumns(
        added5,
        {"RowGroup", "Column", "Codec", "NumValues", "Cardinality", "NullCount", "TotalCompressedSize", "TotalUncompressedSize", "DictionarySize", "PhysicalType", "LogicalType"}
    )
in
    selected

As you can see this gives you all kinds of useful information about a Parquet file such as the schema, the compression type used, column cardinality and so on.
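Incidentally, the reason all this information is available is that every Parquet file carries a Thrift-encoded metadata footer at the end of the file. As a minimal illustration of that layout (this is about the Parquet file format itself, not how Parquet.Metadata is implemented), here's how you can locate the footer using nothing but the Python standard library; the fake file written below is only there to make the example self-contained:

```python
import struct

def parquet_footer_length(path: str) -> int:
    """Return the length in bytes of a Parquet file's metadata footer.

    Per the Parquet format spec, a file ends with the Thrift-encoded footer,
    then a 4-byte little-endian footer length, then the magic bytes PAR1.
    """
    with open(path, "rb") as f:
        f.seek(-8, 2)  # the last 8 bytes: footer length + magic
        footer_len, magic = struct.unpack("<I4s", f.read(8))
    if magic != b"PAR1":
        raise ValueError("Not a Parquet file: trailing magic bytes missing")
    return footer_len

# Demonstrate with a fake file that follows the same trailing layout (real
# Parquet files also start with PAR1 and hold real Thrift metadata):
with open("fake.parquet", "wb") as f:
    f.write(b"PAR1" + b"\x00" * 16 + b"M" * 20 + struct.pack("<I", 20) + b"PAR1")

print(parquet_footer_length("fake.parquet"))  # 20
```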

[Thanks to Curt Hagenlocher for the tip-off and the query above]

Performance Implications Of Using Calculated Columns In Composite Models On Power BI Semantic Models

I don’t have anything against the use of calculated columns in Power BI semantic models in general but you do need to be careful using them with DirectQuery mode. In particular when you have a DirectQuery connection to another Power BI semantic model – also known as a composite model on a Power BI semantic model – it’s very easy to cause serious performance problems with calculated columns. Let’s see a simple example of why this is.

Let’s say you have an Import mode semantic model called Source Model containing sales data:

Here’s the contents of the Sales table:

And here’s the definition of the Sales Amount measure:

Sales Amount = SUM(Sales[Sales])

Now, let’s say this semantic model gets published to the Service and you open Power BI Desktop, create a Live connection to this model and then click the “Make changes to this model” button to create a composite model. You now have a local composite model in Power BI Desktop that is connected to the Source Model semantic model.

Next, you decide to create a calculated column on the Sales table and after reading Marco and Alberto’s article on circular dependencies (because you’ll definitely run into circular dependency errors) come up with the following code:

Tax =
CALCULATE (
[Sales Amount] * 0.1,
ALLEXCEPT ( Sales, Sales[TransactionNumber] )
)

Great! You can now build a report that looks like this:

This is where the first warning comes in, though. In Import mode semantic models calculated columns are evaluated when the model is refreshed, but in DirectQuery mode they are evaluated at query time if the query needs to use them. Performance Analyzer shows two DAX queries generated for the visual above. The first is the query that the visual runs against the local semantic model in Power BI Desktop, which looks like this:

DEFINE
VAR __DS0Core =
SUMMARIZECOLUMNS(
ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
"Sales_Amount", 'Sales'[Sales Amount],
"SumTax", CALCULATE(SUM('Sales'[Tax]))
)

VAR __DS0PrimaryWindowed =
TOPN(502, __DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
__DS0PrimaryWindowed
ORDER BY
[IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

There’s nothing much interesting here. However, there’s a second DAX query generated: the one generated by the local model to get the data it needs from the Source Model semantic model in the Service. This query looks like this:

Define
COLUMN 'Sales'[ASDQ_Tax] = CALCULATE([Sales Amount] * 0.1, ALLEXCEPT(Sales, Sales[TransactionNumber]))

var ASDQ___DS0Core = SUMMARIZECOLUMNS( ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
"Sales_Amount", [Sales Amount],
"SumTax", CALCULATE(SUM('Sales'[ASDQ_Tax]))
)
var ASDQ___DS0PrimaryWindowed = TOPN(502, ASDQ___DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
ASDQ___DS0PrimaryWindowed
ORDER BY
[IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

Notice, right at the top of the query, a DEFINE statement that defines the Tax calculated column. Every time this visual is rendered in your Power BI report the calculated column will be re-evaluated and that will have a performance overhead. If the calculated column had been created in Source Model (which is Import mode remember) it would have been evaluated when the model was refreshed and any queries that used it would probably be faster as a result.

There’s something else to watch out for though. Let’s say you define a second calculated column in the local model like so:

Apple Sales =
CALCULATE (
[Sales Amount],
FILTER ( ALLNOBLANKROW ( 'Sales'[Product] ), 'Sales'[Product] = "Apples" )
)

You can now create a table that looks like this, which does include the Apple Sales column but does not include the Tax column:

The DAX query sent by the visual to the local model is not worth looking at. However, the DAX query sent by the local model to Source Model does have something surprising in it:

Define
COLUMN 'Sales'[ASDQ_Apple Sales] = CALCULATE([Sales Amount], FILTER(ALLNOBLANKROW('Sales'[Product]), 'Sales'[Product]="Apples"))

COLUMN 'Sales'[ASDQ_Tax] = CALCULATE([Sales Amount] * 0.1, ALLEXCEPT(Sales, Sales[TransactionNumber]))

var ASDQ___DS0Core = SUMMARIZECOLUMNS( ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
"Sales_Amount", [Sales Amount],
"SumApple_Sales", CALCULATE(SUM('Sales'[ASDQ_Apple Sales]))
)

var ASDQ___DS0PrimaryWindowed = TOPN(502, ASDQ___DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
ASDQ___DS0PrimaryWindowed
ORDER BY
[IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

There’s a DEFINE statement for the Apple Sales column as you would expect. However there is also a DEFINE statement for the Tax column, which means it is evaluated too even though it’s not shown in the visual and it seems as though the Apple Sales column doesn’t reference it – or does it? I’m not going to try to explain what’s going on here (again, this is best left to Marco and Alberto) but the important point is that both calculated columns are now being evaluated at query time which means the query will be even slower. If you have large data volumes and/or your calculated columns contain some complex DAX you could end up with some very slow reports. If you’re using Premium then you could also be putting a lot of extra load on your capacity.

Composite models are an important part of Power BI's self-service story and it's inevitable that at some point your developers will want to build calculated columns in them. As always you need to monitor the performance of your reports carefully, both in terms of duration and CPU usage, while you are building them to make sure they are well optimised, and keep the use of calculated columns in composite models to a minimum.

Learning Power BI And Fabric By Attending User Groups And Conferences

One of the most popular questions on the forums I follow is “How can I learn Power BI and Fabric?”. There are a ton of great, free online resources out there – blogs, videos and so on – but the people answering this question often do not mention user groups and conferences. I think that’s a mistake because, at least for me, attending (and indeed speaking at) online and in-person events is one of the most effective ways of learning.

Why is this? It’s not about the content, I think: a lot of what is presented at these meetings is freely available elsewhere already. Indeed, a lot of user groups and conferences record their sessions now and make them available for free online afterwards (eg SQLBits). What’s different is that an event, especially an in-person event, demands your attention more because it takes place at a specific place and time. You can’t pause an event because you’ve been distracted by an important work email or because you’ve seen a cat video that you just have to share, which means you’re more likely to concentrate on it than on a blog post or a video. That’s even more true for an in-person event where, if you’ve made the effort to travel to a venue and you’re sitting in an audience, there’s added social pressure to pay attention to what the speaker is saying. And trust me, I need all the help I can get avoiding distractions these days.

The other benefit of attending an event is the ability to chat with, and ask questions to, the presenter, other attendees and sponsors. Every online event I’ve been to has had a very active chat, and of course with in-person events there’s the ability to queue up and ask the speaker questions or ask for a selfie. I know a lot of people who have found new jobs as a result of attending conferences and user groups. In-person events very often have social activities which make it easy to meet people as well; again, SQLBits is a great example of this with its legendary party and other activities such as runs and the pub quiz.

User groups are almost always free and in many cases there is free pizza too. Most conferences are paid although some have a free day (in the UK SQLBits is free on Saturdays and Data Relay is an annual series of free one-day events). Even when you have to pay I think they represent very good value for money compared to other things you could spend your training budget on.

If I’ve convinced you to attend an event, the next question is: how do I find one near me? Meetup is the best way to find user groups and searching for “Power BI” or “Fabric” and your location there will give you a list of upcoming events you could attend. In the UK there are very active in-person user groups in London, Manchester (I was there last week and they had over 100 attendees), Birmingham, Leeds and Newcastle. There are also several UK groups that are wholly or partly online, such as Oxford, Devon & Cornwall and the Fabric UK User Group.

I don’t think there’s a single list of all the Power BI or Fabric conferences anywhere but here are a few coming up soon that I plan to attend or know are good:

  • The Microsoft Fabric Community Conference in Las Vegas, USA, March 26-28 2024. There will be a lot of Microsoft folks presenting and a lot of big announcements there!
  • Fabric February in Oslo, Norway, February 7-9 2024 – although I think it’s just sold out.
  • Power BI Cruise from Stockholm to Helsinki, June 3-5 2024. I really need to do this one year.
  • Power BI Next Step in Aarhus, Denmark, September 12-13 2024. Those Scandinavians do love the Microsoft data platform.
  • The Global Excel Summit in London, UK, February 7-9. Despite the name there’s a lot of Power BI content here too.
  • SQLDay in Wroclaw, Poland, 13-15 May 2024.
  • DATA:Scotland in Glasgow, UK, 13 September 2024.
  • SQLKonferenz in Hanau, Germany, 30 September-2 October.
  • SQLBits in Farnborough, UK, 19-23 March 2024. The best Microsoft data-related conference in the world as far as I’m concerned – I never miss it – and by far the largest in Europe too.

Apologies if I’ve missed your favourite event; there are too many out there to list or even remember…

So what are you waiting for? If you’re serious about learning Power BI and Fabric get yourself along to one of these events!

Understanding The “Evaluation resulted in a stack overflow” Error In Power Query In Excel And Power BI

If you’re writing your own M code in Power Query in Power BI or Excel you may run into the following error:

Expression.Error: Evaluation resulted in a stack overflow and cannot continue.

If you’re a programmer you probably know what a stack overflow is; if you’re not you might search for the term, find this Wikipedia article and still have no clue what has happened. Either way it may still be difficult to understand why you’re running into it in Power Query. To explain let’s see some examples.

First, a really simple one. You can write recursive functions – functions that call themselves – in Power Query, although as you might suspect that is generally a bad idea because they are difficult to write, difficult to understand, slow and may result in the error above. Consider the following function called MyRecursiveFunction which references a parameter of type number called MaxDepth:

(counter as number) as number =>
    if counter > MaxDepth then counter else @MyRecursiveFunction(counter + 1)

The @ before MyRecursiveFunction is M’s inclusive-reference operator, which is what allows a function to refer to itself. The function takes a number and calls itself, passing in a value one greater than the number that was passed to it, until that number is greater than MaxDepth. So, if the value of MaxDepth is 3 and you call the function passing in the value 1, like so:

MyRecursiveFunction(1)

…then you’ll get the value 4 back.

So far so good. But how long can a function in M go on calling itself? Not forever, as you might imagine: each nested call consumes space on the stack, and when that space runs out you get the error above. For example, if you set MaxDepth to 100000 then instead of getting the value 100001 back you’ll get the stack overflow error.

As a result it’s almost always a good idea to avoid recursion in Power Query and to use functions like List.Transform, List.Generate or List.Accumulate to achieve the same result instead. A great example of this is shown in the Power Query custom connector documentation, in the section on handling APIs that return results broken up into pages, with the Table.GenerateByPage code sample.
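For example, here’s a sketch (my own illustration, not from the official docs) of how the recursive counter function above could be rewritten without recursion using List.Generate, still referencing the same MaxDepth parameter:

(counter as number) as number =>
let
    //generate the sequence of counter values that the recursive version
    //would have passed to itself, stopping once MaxDepth is exceeded
    Sequence = List.Generate(() => counter, each _ <= MaxDepth, each _ + 1),
    //the final value is the starting counter plus the number of steps taken
    Result = counter + List.Count(Sequence)
in
    Result

Calling this with the value 1 when MaxDepth is 3 returns 4, the same as the recursive version, but it keeps working for very large values of MaxDepth because no nested function calls build up on the stack.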

You may still get this error even when you’re not explicitly using recursion though, as a result of lazy evaluation. Consider the following query which uses List.Accumulate to generate a table with a given number of rows:

let
    //define a table with one row and one column called x
    MyTable = #table(type table [x = number], {{1}}),
    //specify how many rows we want in our output table
    NumberOfTimes = 3,
    //use List.Accumulate to create a table with this number of rows
    //by calling Table.Combine
    CombineAllTables = List.Accumulate(
        {1 .. NumberOfTimes},
        null,
        (state, current) => if current = 1 then MyTable else Table.Combine({state, MyTable})
    )
in
    CombineAllTables

Here’s the output, a table with three rows.

But how does it get this result? With NumberOfTimes=3 you can think of this query lazily building up an M expression something like this:

Table.Combine({Table.Combine({MyTable, MyTable}), MyTable})

…which, once List.Accumulate has finished, all has to be evaluated at once and turned into a single table. Imagine how deep the nesting of Table.Combine would be if NumberOfTimes was a much larger number, though! And indeed it turns out that you can’t nest calls to Table.Combine very deeply without running into a stack overflow. So if NumberOfTimes=100000, like so:

let
    //define a table with one row and one column called x
    MyTable = #table(type table [x = number], {{1}}),
    //specify how many rows we want in our output table
    NumberOfTimes = 100000,
    //use List.Accumulate to create a table with this number of rows
    //by calling Table.Combine
    CombineAllTables = List.Accumulate(
        {1 .. NumberOfTimes},
        null,
        (state, current) => if current = 1 then MyTable else Table.Combine({state, MyTable})
    )
in
    CombineAllTables

…then, after a minute or so, you get the “Evaluation resulted in a stack overflow and cannot continue” error again.

Rewriting the query so you build up the list of tables first and only call Table.Combine once at the end avoids the problem and is much faster:

let
    //define a table with one row and one column called x
    MyTable = #table(type table [x = number], {{1}}),
    //specify how many rows we want in our output table
    NumberOfTimes = 100000,
    //build the full list of tables first, then call Table.Combine just once
    CombineAllTables = Table.Combine(List.Repeat({MyTable}, NumberOfTimes))
in
    CombineAllTables

It’s also possible to solve the problem by forcing eager evaluation inside List.Accumulate, but this is extremely tricky: there’s an example of this on Gil Raviv’s blog here.
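For reference, one way of forcing eager evaluation is to wrap each intermediate result in Table.Buffer so that it is evaluated immediately rather than left as a nested expression. The following is a sketch based on my understanding of that technique, not a tested, production-ready solution:

let
    MyTable = #table(type table [x = number], {{1}}),
    NumberOfTimes = 100000,
    CombineAllTables = List.Accumulate(
        {1 .. NumberOfTimes},
        null,
        (state, current) =>
            if current = 1
            then MyTable
            //Table.Buffer forces the combined table to be evaluated now,
            //so deeply nested Table.Combine expressions don't build up
            else Table.Buffer(Table.Combine({state, MyTable}))
    )
in
    CombineAllTables

Even when this works, though, it is likely to be much slower than simply building the list of tables up front and calling Table.Combine once.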