By far the most exciting announcement for me this week was the new release of Power BI Report Builder that has Power Query built in, allowing you to connect to far more data sources in paginated reports than you ever could before. There’s a very detailed blog post and video showing you how this new functionality works here:
The main justification for building this feature was to allow customers to build paginated reports on sources like Snowflake or BigQuery, something which had only been possible before if you used an ODBC connection via a gateway or built a semantic model in between – neither of which is an ideal solution. However, it also opens up a lot of other possibilities.
For example, you can now build paginated reports on web services (with some limitations). I frequently get asked about building regular Power BI reports that get data from web services on demand – something which isn’t possible, as I explained here. To test using paginated reports on a web service I registered for Transport for London’s APIs (Transport for London is the organisation that manages public transport in London) and built a simple report on top of their Journey Planner API. The report allows you to enter a journey starting point and ending point anywhere in or around London, then calls the API and returns a table with the different routes from the start to the destination, along with timings and instructions for each route. Here’s the report showing different routes for a journey from 10 Downing Street in London to Buckingham Palace:
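In case you’re wondering what the query behind a report like this might look like, here’s a minimal M sketch rather than the exact code from my report: the endpoint path, the app_key query parameter and the journeys field are my recollection of how the TfL Unified API works, so check them against the TfL documentation, and in the real report the start and end points come from report parameters.

let
    // Hypothetical journey endpoints – in the real report these come from report parameters
    StartPoint = "10 Downing Street, London",
    EndPoint = "Buckingham Palace, London",
    // Call the Journey Planner endpoint; the app_key value is a placeholder for your own API key
    Source = Json.Document(
        Web.Contents(
            "https://api.tfl.gov.uk/",
            [
                RelativePath = "Journey/JourneyResults/"
                    & Uri.EscapeDataString(StartPoint)
                    & "/to/"
                    & Uri.EscapeDataString(EndPoint),
                Query = [app_key = "YOUR_APP_KEY"]
            ]
        )
    ),
    // Each item in the journeys list is one suggested route
    Journeys = Source[journeys],
    // Put each route on its own row; expand the Journey column to get timings and instructions
    JourneysTable = Table.FromList(Journeys, Splitter.SplitByNothing(), {"Journey"})
in
    JourneysTable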
You can also build paginated reports that connect to Excel workbooks that are stored in OneDrive or OneLake, meaning that changes made in the Excel workbook show up in the report as soon as the workbook is saved and closed:
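Again, just as an illustration of the kind of M involved, here’s a minimal sketch of a query that reads a worksheet from a workbook stored in OneDrive for Business – the URL, worksheet name and layout are all hypothetical, and you’d authenticate with an organizational account:

let
    // Hypothetical OneDrive for Business URL – replace with the full path to your own workbook
    WorkbookUrl = "https://contoso-my.sharepoint.com/personal/chris_contoso_com/Documents/SalesData.xlsx",
    // Read the file over the web and treat it as an Excel workbook
    Source = Excel.Workbook(Web.Contents(WorkbookUrl), null, true),
    // Get the contents of a worksheet called Sales (also hypothetical)
    SalesSheet = Source{[Item = "Sales", Kind = "Sheet"]}[Data],
    // Use the first row of the worksheet as the column headers
    PromotedHeaders = Table.PromoteHeaders(SalesSheet, [PromoteAllScalars = true])
in
    PromotedHeaders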
So. Much. Fun. I’ll probably develop a presentation for user groups explaining how I built these reports soon.
And yes, if you need to export data to Excel on a schedule, paginated reports are now an even better choice. You know your users want this.
I recently took part in a webinar with Denny Lee, Liping Huang and Marius Panga from Databricks on the subject of best practices for using Power BI on Databricks. You can view the recording on LinkedIn here:
My section at the beginning covering Power BI best practices for Import and DirectQuery doesn’t contain any new information – if you’ve been following the DirectQuery posts on this blog or read the DirectQuery guidance docs here and here then there won’t be any surprises. What I thought was really useful, though, was hearing the folks from Databricks talk about best practices on the Databricks side, which took up the majority of the webinar. Definitely worth checking out.
There’s a lot of excitement about Copilot in Power BI and in Fabric as a whole. The number one question I’m asked about Copilot by customers is “How much does it cost?” and indeed there have been two posts on the Fabric blog here and here attempting to answer this question. The point I want to make in this post, though, is that it’s the wrong question to ask.
Why? Well, to start off with, Copilot isn’t something you buy separately. Every time you use Copilot in Fabric it uses compute from a capacity, either a P SKU or an F SKU, just the same as if you opened a report or refreshed a semantic model. You buy the capacity and that gives you a pool of compute that you can use however you want, and using Copilot is just one of the things you can use that compute for. No-one has ever asked me how much it costs in dollars to open a report in Power BI or refresh a semantic model, so why ask this question about Copilot?
Of course I understand why Copilot feels like a special case: customers know a lot of users will want to play with it and they also know how greedy AI is for resources. Which brings me back to the point of this post: the question you need to ask about Copilot is “How much of my Power BI/Fabric capacity’s compute will be used by Copilot if I turn it on?”. Answering this question in terms of percentages is useful because if you consistently go over 100% usage on a Fabric capacity then you will either need to buy more capacity or experience throttling. And if Copilot does end up using a lot of compute, and you don’t want to buy more capacity or deal with throttling, then maybe you do want to limit its usage to a subset of your users or even turn it off completely?
To a certain extent the question about percentage usage can be answered with the Premium Capacity Metrics app. For example, I opened up a gen2 dataflow on a Premium capacity that returns the following data:
I expanded the Copilot pane and typed the following prompt:
Filter the table so I only have products that were produced in the country France
And here’s what I got back – a filtered table showing the row with the Producer Country France:
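Under the covers Copilot responds by adding a step to the dataflow’s M query. I haven’t reproduced the exact code it generated here, but it will be something along the lines of the filter step in this sketch, where the inline table is just a stand-in for the data shown above:

let
    // A tiny stand-in table for the data shown above – the real dataflow gets its data from another source
    Source = #table(
        type table [Product = text, #"Producer Country" = text],
        {
            {"Product A", "France"},
            {"Product B", "Italy"},
            {"Product C", "France"}
        }
    ),
    // Roughly the kind of step Copilot adds in response to the prompt – the M it actually generates may differ
    #"Filtered rows" = Table.SelectRows(Source, each [Producer Country] = "France")
in
    #"Filtered rows"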
So what impact did this have on my capacity? Since I know this was the only time I used Copilot on the day I wrote this post, it was easy to find the relevant line in the “Background operations for timerange” table on the TimePoint Detail page of the Premium Capacity Metrics app:
There are two important things to note:
For the 30-second window I selected, the operation above used 0.02% of my capacity
Copilot operations are classed as “background operations” so their cost is smoothed out over 24 hours. Therefore the operation above used 0.02% of my capacity for 24 hours, starting a few minutes after I hit enter and the operation ran; in this particular case I ran my test just before 16:20 on Friday March 15th and the operation used 0.02% of my capacity from 16:20 on Friday the 15th until 16:20 on Saturday March 16th.
How can you extrapolate tests like this to understand the likely load on your capacity in the real world? With great difficulty. Almost all the Fabric/Power BI Copilot experiences available today are targeted at people developing rather than just consuming, so that naturally limits the opportunities people have to use Copilot. Different prompts will come with different costs (as the blog posts mentioned above explain), a single user will want to use Copilot more than once in a day and you’ll have more than one user wanting to use Copilot. What’s more, going forward there will be more and more opportunities to use Copilot in different scenarios and as Copilot gets better and better your users will want to use it more. The compute usage of different Fabric Copilots may change in the future too. So your mileage will vary a lot.
Is it possible to make a rough estimate? Let’s say you have 20 developers (which I think is reasonable for a P1/F64 – how many actively developed solutions are you likely to have on any given capacity?) writing 20 prompts per day. If each prompt uses 0.02% of your capacity then 20 developers * 20 prompts * 0.02% = a maximum of 8% of your capacity used by Copilot over the whole day. That’s not inconsiderable and I’m sure someone will leave a comment saying I’m underestimating what usage will be.
Which brings me to my last point: should you even see Copilot as being different from anything else you can do in Fabric that consumes compute? Or as an optional extra or additional cost? After all, dataflows consume compute and you can enable or disable dataflows in your tenant in the same way you can enable or disable Copilot, yet very few customers disable dataflows because they see the benefit of using them. Turning off dataflows would reduce the load on your capacity but it would also stop your users from being so productive, and why would you do that? If we at Microsoft deliver on the promise of Copilot (and believe me, we’re working hard on this) then the productivity gains it brings to your developers should offset the cost of any extra capacity you need to buy – if indeed you need to buy any extra capacity.
So, to sum up, if you enable Copilot in Fabric you will see additional load on your capacities and you may need to buy more capacity as a result – but the benefits will be worth it. Predicting that additional load is difficult but it’s no different from predicting how your overall Fabric capacity usage will grow over time, as more and more reports, semantic models, notebooks, warehouses and so on get created and used. Rather than doing lots of complex calculations based on vague assumptions to try to predict that load, my advice is that you should use the Capacity Metrics app to monitor your actual usage and buy that extra capacity when you see you’re going to need it.
Many Power BI connectors for relational databases, such as the SQL Server connector, have an advanced option to control whether relationship columns are returned or not. By default this option is on. Returning these relationship columns adds a small overhead to the time taken to open a connection to a data source and so, for Power BI DirectQuery semantic models, turning this option off can improve report performance slightly.
What are relationship columns? If you connect to the DimDate table in the Adventure Works DW 2017 sample database using Power Query, you’ll see them on the right-hand side of the table. The following M code:
let
    Source = Sql.Database("localhost", "AdventureWorksDW2017"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate
…shows the relationship columns:
Whereas if you explicitly turn off the relationship columns by deselecting the “Include relationship columns” checkbox:
…you get the following M code with the CreateNavigationProperties option set to false:
let
    Source = Sql.Database("localhost", "AdventureWorksDW2017", [CreateNavigationProperties=false]),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate
…and you don’t see those extra columns.
How much overhead does fetching relationship columns add? It depends on the type of source you’re using, how many relationships are defined and how many tables there are in your model (because the calls to get this information are not made in parallel). It’s also, as far as I know, impossible to measure the overhead from any public telemetry such as a Profiler trace or to deduce it by looking at the calls made on the database side. The overhead only happens when Power BI opens a connection to a data source and the result is cached afterwards, so it will only be encountered occasionally and not for every query that is run against your data source. I can say that the overhead can be quite significant in some cases though and can be made worse by other factors such as a lack of available connections or network/gateway issues. Since I have never seen anyone actually use these relationship columns in a DirectQuery model – they are quite handy in Power Query in general though – you should always turn them off when using DirectQuery mode.
[Thanks to Curt Hagenlocher for the information in this post]
If you’re tuning a DirectQuery semantic model in Power BI one of the most important things you need to measure is the total amount of time spent querying your data source(s). Now that the queries Power BI generates to get data from your source can be run in parallel, you can’t just sum up the durations of the individual queries to get the end-to-end duration. The good news is that there are new trace events available in Log Analytics (though not in Profiler at the time of writing) which solve this problem.
The events have the OperationName ProgressReportBegin/ProgressReportEnd and the OperationDetailName ParallelSession. Here’s a simple Log Analytics query that you can use to see how these events work:
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(10m)
| where OperationName in ("QueryBegin", "QueryEnd", "DirectQueryBegin", "DirectQueryEnd", "ProgressReportBegin", "ProgressReportEnd")
| project OperationName, OperationDetailName, EventText, TimeGenerated, DurationMs, CpuTimeMs, DatasetMode, XmlaRequestId
| order by TimeGenerated asc
Here’s what this query returned for a single DAX query that generated multiple SQL queries (represented by DirectQueryBegin/End event pairs) against a relational data source:
Notice that there is just one ProgressReportBegin/End event pair per query; this is always the case, no matter how many fact tables or data sources are used, at least as far as I know. The ProgressReportBegin event for ParallelSession comes before the DirectQueryBegin/End event pairs and the associated End event comes after the final DirectQueryEnd event. The DurationMs column for the ProgressReportEnd event gives you the total duration in milliseconds of all the DirectQuery events. In this case there were six SQL queries sent to the source (and so six DirectQueryBegin/End event pairs), each of which took between one and two seconds. However, since all of these queries ran in parallel, the overall duration was still just over 2 seconds.