In summary there will soon be a lightweight, web-based version of Visio available to anyone with a Microsoft 365 Business, Office 365 E1/E3/E5, F3, A1, A3 or A5 subscription. Previously Visio was not part of the main M365 plans and was only available as a separate purchase.
So what? As a Power BI user, why should I care? Well the Visio custom visual for Power BI has been around a long time now and it’s really powerful. Unfortunately it’s very rarely used because Power BI developers don’t usually have Visio licences – but this is exactly what is about to change. With these licensing changes pretty much everyone who uses Power BI will have access to the new lightweight Visio web app. It’s not as sophisticated as desktop Visio but I’ll be honest, I’m no Visio expert and it’s good enough for me and really easy to use. As a result this is going to unlock the power of the Power BI Visio visual for a much, much larger number of people!
To get an idea of what you can do with the Power BI Visio visual, this video is a good place to start:
A few months ago I heard about a new tool from Microsoft called Lobe which makes it easy to train machine learning models. It’s nothing to do with Power BI but I find anything to do with self-service data analytics interesting, and when I finally got round to playing with it today I thought it was so much fun that it deserved a blog post.
You can download it and learn more at https://www.lobe.ai/ and there’s a great ten minute video describing how to use it here:
The most impressive thing about it is not what it does but how it does it: a lot of tools claim to make machine learning easy for non-technical users but Lobe really is easy to use. My AI/ML knowledge is very basic but I got up and running with it extremely quickly.
To test it out I downloaded lots of pictures of English churches and trained a model to detect whether the church had a tower or a spire. After I labelled the pictures appropriately:
…Lobe was able to train the model:
I could test it inside the tool. The model was able to tell whether a church had a tower:
…or a spire:
If I have one criticism it’s that when you want to use your model things get a lot more technical, at least compared to something like AI Builder for Power Apps and Power Automate, but I guess that’s because it is just a tool for training models. There have been some recent improvements here though (see this blog post) and Lobe does provide a local API for testing purposes that can be consumed in Power BI with some custom M code.
Here’s an example of how to call the local API in Power Query:
My favourite Power BI announcement at the Microsoft Business Applications Summit was, without a doubt, that Excel PivotTables connected to Power BI datasets will very soon work in the browser and not just on the desktop. This is something I have wanted for a long time, way before I joined Microsoft, so this is a feature I have a personal interest in. However I also think it’s an incredibly important step forward for Power BI in general and in this post I’ll outline the reasons why.
Before we carry on please make sure you read this post on the Excel blog which has more details on all of the new Power BI/Excel integration features that are being released. Quick summary: if you’re reading this in late May 2021 you probably won’t have all of this functionality available in your tenant yet but it is coming very soon.
So why exactly am I excited?
It makes Excel athird option for building Power BI reports
Up to now, if you wanted to build Power BI reports and share them with other people online you had two choices: regular Power BI reports and paginated reports. Now Excel gives you a third option: you can upload Power BI-connected Excel workbooks to a Power BI workspace, make them available via a Power BI app, and not only will they be fully interactive but the data in them will also update automatically when the data in your dataset updates.
PivotTables are the best way to explore Power BI data
Why do we need Excel as an option for building reports on data stored in Power BI? The first reason is data exploration. Excel PivotTables are a much better way to explore your data than any other method in Power BI, in my opinion. Why try to recreate an Excel PivotTable using a matrix visual in a regular Power BI report when you can give your users the real thing?
Cube functions also now work in the browser – and they make it easy to design financial reports
The Excel cube functions (CubeMember, CubeValue etc) are, I think, the best-kept feature in Excel. While PivotTables are great for exploring data they aren’t always so great when you want to build highly-formatted reports. The Excel cube functions make it easy to bind individual cells in a worksheet to individual values in your dataset and because they’re just like any other Excel function they allow you to use all of Excel’s formatting and charting functionality. This then makes it possible to build certain types of report, such as financial reports, much more easily. If you want to learn more about them check out this video from Peter Myers – it shows how to use them with Analysis Services but they work just the same when connected to a Power BI dataset.
Organisational data types make it easy to access Power BI data
While the Excel cube functions are very powerful they are also somewhat difficult to use and sometimes suffer from performance problems. The new organisational data types in Excel do something very similar and while they don’t yet have all the features you need to build complex reports they are also a lot easier to understand for most business users.
Excel formulas are easier than DAX for a many calculations
Everyone knows DAX can be hard sometimes. However, once you’ve got the data you need from your dataset into Excel using a PivotTable, cube functions or organisational data types you can then do your own calculations on that data using regular Excel formulas. This not only allows business users to add their own calculations easily but for BI professionals it could be the case that an Excel formula is easier to write and faster to execute than the equivalent DAX.
Excel can visualise data in ways that Power BI can’t
Excel is a very mature data visualisation tool and it has some types of chart and some formatting options that aren’t (yet) available in Power BI’s native visuals. One example that springs to mind is that you can add error bars to a bar chart in Excel; another is sparklines, although they are coming to Power BI later this year.
Power Pivot reports will also work in the browser
Even if you don’t have a Power BI pro licence, if you have a commercial version of Excel you’ll have Power Pivot and the Excel Data Model. And guess what, Power Pivot reports also now work in the browser!
Collaborate in real-time with your colleagues in Excel Online
With Excel reports connected to Power BI stored in OneDrive for Business or a SharePoint document library you get great features for collaboration and co-authoring, so you and your colleagues can analyse data together even if you’re not in the same room.
There’s a lot of other cool stuff happening in Excel right now
The Excel team are on a hot streak at the moment: dynamic arrays, LAMBDAs, LET, the beginnings of Power Query on the Mac and lots more cool new stuff has been delivered recently. If you’re only familiar with the Excel features you learned on a course 20 years ago you’re missing out on some really powerful functionality for data analysis.
Everyone knows Excel!
Last of all, it goes without saying that Excel is by far the most popular tool for working with data in the world. Everyone has it, everyone knows it and everyone wants to use it. As Power BI people we all know how difficult it is to persuade our users to abandon their old Excel habits, so why not meet them halfway? Storing data in a Power BI dataset solves many of the problems of using Excel as a reporting tool: no more manual exports, old-of-date data or multiple versions of the truth. Using Excel to build reports on top of a Power BI dataset may be much easier to learn and accept for many business users – at least at first – than learning how to build reports in Power BI Desktop.
Following on from my last post, another SQLBits session of mine I wanted to highlight was “Advanced Analytics Features In Power BI”. The subject is a bit outside my normal area of expertise but it’s also one that I don’t think gets enough attention: it’s all the features available in Power BI reports that can help you explain why something happened rather than just what happened. Things I talk about include:
Adding forecasts to line charts
Symmetry shading, ratio lines and clustering on scatter charts
The “Explain the increase” and “Find where this distribution is different” features
The Key Influencers and Decomposition Tree visuals
The team at SQLBits have been publishing all the session recordings from their last (online) conference on their YouTube channel. There’s a lot of great content there to check out and this post is to highlight one of my sessions, “Performance tuning Power BI dataset refresh”.
In this session I look at all of the factors that can influence how long it takes to import data into Power BI and what you can do to make it faster. Topics covered include:
Choosing a dataset storage mode
The importance of good data modelling
How the type of data source you use effects how quickly data can load
Ways to measure refresh performance, such as using SQL Server Profiler and Power Query Query Diagnostics
Power Query options that can influence refresh times such as disabling data previews
Query folding in the Power Query engine
Vertipaq engine features that affect refresh, such as calculated columns and calculated tables
It covers all the new things that have happened in Power Query over the last few months such as Diagram View and, more importantly, talks about what’s going to happen in the next few months. It’s relatively short but for those of you with no time or patience, here’s a summary of the roadmap announcements:
[BTW “Power Query Online” is the browser-based version of Power Query that is used in Power BI dataflows]
My highlights are:
The ability to create a dataflow quickly by uploading a file to Power Query Online without needing to use a gateway to connect to a file on-premises, useful for one-time import scenarios.
Multi-value M parameter support – useful for dynamic M parameters and other things I can’t talk about yet 😉
The things that Miguel talks about regarding “easier design experiences” for Synapse are kept intentionally vague but it’s worth listening to carefully to what he says here!
Native SQL support for Snowflake, BigQuery and Redshift – this is really useful for anyone who wants to use DirectQuery with these databases because it will allow you to write your own SQL query and use it as the source of a table, rather than having to use a table or a view.
AAD based Single Sign-On support for Redshift and BigQuery (similar to what we have today for Snowflake) will also be very important for DirectQuery, because it means that the identity of the user running the report can be passed back to the database.
A dataflows connector for Excel Power Query – which means, at last, you’ll be able to get data from a dataflow direct into Excel. This will make a lot of Excel users very happy, I think: a lot of the time all users want is a table of data dumped to Excel and dataflows will be a great way to do provide them with that.
Last of all, the session showcases the great new home for all things Power Query – http://www.powerquery.com/ – which has great resources, newly-updated documentation and a blog. Make sure you check it out!
Excel Organisation data types were released last year (see here for details), but did you know that you can now use them to bring images as well as text and numbers into Excel? Here’s a super-simple example that shows you how to do this.
Here’s a table called ‘Fruit With Image’ in a dataset that I have published to the Power BI Service:
Notice that the Data Category property on the Image column, which contains the URL of a picture of each type of fruit listed, to “Image URL” (for more details on what this does see here). If I use this table in a Power BI report, I see the name of each fruit and a picture:
So far no surprises. I can also set this table up as a Featured Table (for more details see here) so it can be used as the source for an Organisation Data Type in Excel:
The cool thing is that when I type these fruit names into Excel and mark them as the “Fruit With Image” data type (see here for more details), I can then access the Image field and it will show the image that the URL points to inside a cell:
If you have a slow DirectQuery report in Power BI one of the first questions you need to ask is how long the SQL queries that Power BI generates take to run. This is a more complicated question to answer than you might think, though, and in this post I’ll explain why.
I happen to have access to some of the famous New York taxi data in a Snowflake database, and in there is a table with trip data that has 173 million rows that I have a built a Power BI dataset from. The data and the database used are not really important here though – what is important is that it’s DirectQuery and a large-ish amount of data. Here’s a report page with a single table visual on it, showing passenger count aggregated by the hack license field:
It’s slow, but how slow? Here’s what Performance Analyzer shows when I refresh the table:
The DAX query takes 5.4 seconds but the Direct Query time is only 3.3 seconds – and the numbers don’t seem to add up. Here’s what Profiler captures for the same refresh shown in Performance Analyzer:
This shows there’s a gap of 2 seconds between the DirectQuery End event and the Query End event. What if I paste the DAX query into DAX Studio? Here’s what the Server Timings tab shows:
This is a different query execution to the two examples above, both of which show data for the same execution, which explains why the numbers are slightly different here – but again there seems to be an extra second of stuff happening and DAX Studio suggests that it’s in the Formula Engine.
So what is going on? The answer lies in understanding what the DirectQuery End Profiler event actually measures: it’s the amount of time between the Analysis Services engine handing a query over to the Power Query engine and the Analysis Services engine receiving the first row in the resultset back, including the time taken for the Power Query engine to fold the query.
Therefore if it takes a long time to get all the rows in the resultset then that could explain what’s going on here. Unfortunately there’s no way of knowing from Profiler events how long this takes – but there is another way. Going back to Performance Analyzer, if you export the data from it to JSON (by clicking the Export button) and load it into Power Query, you can see more detail about a DirectQuery query execution. Here’s the data from the first execution above:
[There’s a very good paper documenting what’s in the Performance Analyzer JSON file here]
Looking at the record in the metrics column for the Execute Direct Query event you can see the same 3.2 second duration shown above in Profiler. Notice that there are two other metrics here as well: RowsRead, which is the total number of rows returned by the resultset; and DataReadDuration, which is the amount of time to read these rows after the first row has been received plus some other Analysis Services Engine operations such as encoding of column values, joining with unpushed semijoins, projections of aggregations such as Average and saving the resultset to the in-memory cache. In this case the SQL query has returned 43191 rows and this takes 1.95 seconds – which explains the gap between the end of the Execute Direct Query event and the end of the query.
One last question: why this SQL query is returning so many rows when the DAX query is only asking for the top 502 rows?
The reason is that, at the time of writing at least, the Analysis Services engine can only push a top(n) operation down to a DirectQuery SQL query in very simple scenarios where there are no measures and no aggregation involved – and in this case we’re summing up values. As a result, if you’re using DirectQuery mode and have a visual like this that can potentially display a large number of rows and includes a measure or aggregated values, you may end up with slow performance.
[Thanks to Jeffrey Wang for providing the information in this post]
One of the most common questions I get asked is “How can I use Power BI in DirectQuery mode on top of a REST API?”. This seems like a reasonable thing to do but almost everyone who tries it will fail, and in this post I will explain why.
To answer this question we first of all have to review the two main ways of working with data in Power BI: Import mode and DirectQuery mode. In Import mode data is cached in Power BI’s own internal database and all the DAX queries that are generated by your reports are answered from there. In DirectQuery mode no data is stored inside Power BI; instead, when a report is run and DAX queries are fired off against your dataset, Power BI in turn generates queries against the data source to get the data needed. Most of the data sources that can be used with DirectQuery mode in Power BI are relational databases and so that means Power BI will generate SQL queries to get data from them, but Power BI can also generate queries in other languages too.
Let’s look at a simple example of how DirectQuery works. Consider the following table called FruitSales in a SQL Server database:
If I build a DirectQuery dataset containing only this table I can create the following Power BI report with a matrix and a slicer:
The SQL query generated by Power BI to get the data for the matrix is this:
TOP (1000001) [t0].[Colour],SUM(
CAST([t0].[Sales] as BIGINT)
select [$Table].[Fruit] as [Fruit],
[$Table].[Colour] as [Colour],
[$Table].[Sales] as [Sales]
from [dbo].[FruitSales] as [$Table]
([t0].[Colour] IN (N'Red',N'Green'))
GROUP BY [t0].[Colour]
A few things are immediately obvious from this: the query has a WHERE clause to filter the data down to the two colours selected in the slicer and there’s a GROUP BY to get the sum of Sales by Colour. You can see how the results of the SQL query are going to match what is displayed in the report. In more more complex cases – especially when non-trivial DAX calculations are used – a single visual may generate multiple, more sophisticated SQL queries and some calculation work may also be done inside Power BI.
This example illustrates the kind of of operations that Power BI needs to push down to the data source in DirectQuery mode. The question is, then, does your REST API support filtering and aggregating data in the way Power BI wants? If not, then you won’t be able to build a DirectQuery dataset on top of it.
“But wait”, I hear some of you say, “why can’t Power BI get a table of data returned by my REST API and do the filtering and aggregation itself?”. I agree it would be lovely if it could but it can’t. And if it did, that would be something more like Import mode and not DirectQuery. “Can’t I just somehow refresh an Import mode dataset on demand then?”, you may then say, “I know it’s possible to refresh a dataset using Power Automate and we can now trigger a Power Automate flow from a report using the new Power Automate visual. What’s more, with Power BI Premium there’s no limit on the number of refreshes you can trigger via the API that Power Automate uses”. That’s certainly true, but there are a few problems with this approach:
Refreshing an Import mode dataset can be relatively slow and expensive in terms of resources, and if you have hundreds of users refreshing a dataset every time they view a report you may end up with the dataset being permanently refreshing which again impacts performance.
Commercial APIs often have limits on the number of times you can call them within a given time period and it’s likely you’d hit these rate limits if you let your users refresh a dataset any time they wanted.
In Import mode there’s no easy way to take a selection made by an end user and pass it back to the API as a parameter (if you don’t need to pass parameters back to the API why not just create a regular Import dataset and refresh it on a schedule?). Dynamic M parameters only work in DirectQuery mode. I suppose you could capture a selection and, using the Power Automate visual again, pass it to a Flow that used a Power Automate custom connector to call the Power BI REST API and change an M parameter’s value in your dataset but that would be very tricky to set up. However…
A dataset is something that is shared between all the users of your report. What happens if User A changes a parameter, refreshes the dataset, starts interacting with the report and then User B also starts viewing the report and changes the parameter to something different and refreshes again? In that case User A would see the numbers for User B’s selection in the report and it would be extremely confusing.
Let’s say your API does support all the rich operations Power BI needs in DirectQuery mode though. You can certainly build a Power BI custom connector that supports DirectQuery mode yourself. One way of doing this is to build your own ODBC provider on top of your API and then wrap it in a simple Power BI custom connector – which, as you can probably guess, is no easy task. There’s a sample of how to create a custom connector that supports DirectQuery on an ODBC provider here. Alternatively you can try not going down the ODBC route and putting all the logic in the custom connector but there’s no documentation on how to do this and it would be extremely difficult to do, so you’d need a lot of support from someone at Microsoft who is much better at coding in M than I am.
The good news is that some third-party vendors, for example CData and Progress (there may be others too) sell Power BI custom connectors that allow you to create DirectQuery datasets on top of REST APIs. These connectors are thin wrappers for ODBC providers and work by caching the data returned from the API within the provider and providing a SQL query interface on top of that. I haven’t tested these products so I can’t comment on how well they work or what their performance is like. You’d need to pay extra to use them, of course, and since they are custom connectors you’d need to have an on-premises data gateway to use them.
To conclude, the purpose of this post is not to explain why building a report on top of a REST API is a bad idea but to explain why it’s hard to do this in Power BI. This is, partly, the result of the way Power BI is designed: as a model-based tool rather than a report-based tool (Marco Russo wrote a great post on this subject) it needs access to all the data for a report via a richer interface than most APIs can provide. Indeed, my advice to anyone wanting to build a DirectQuery dataset on an API is that it is better to bypass the API and go to the data source that the API uses (which is often a relational database) instead, if you can, or to extract all the data that the API can provide and stage it locally. It might go against someone in your organisation’s grand plan for a service-oriented architecture but you’ll be a lot more successful with your reporting.
Over the years I have written a lot about Power BI/Power Query performance but it has always been in the context of loading data direct into datasets, not dataflows. A lot of cool things have been happening in dataflows recently, though, and now that Premium Per User has made Premium features to a much wider audience I thought it would be interesting to look at an example of how PPU can help dataflow performance and specifically how and when the Enhanced Compute Engine can make dataflow refresh faster.
The queries called First and Second are identical and just load all the data from the (same) source CSV file; they also have their load disabled. The query called Merge does an inner join between these two queries on the column called A:
The Merge query has its load enabled so it’s the only output of the dataflow; after it has joined the data it expands the nested columns returned and sets the data types on all the output columns.
Refreshing this dataflow in shared capacity took on average 150 seconds.
I then moved the workspace to Premium Per User capacity and without making any changes, I refreshed again.
Refreshing the same dataflow in PPU took on average 73 seconds.
So the first finding is that moving the dataflow to PPU more than halved the refresh time, which is a pretty good result.
However, at this point the Enhanced Compute Engine is not being used – so, to enable it, I enabled loading for the First and Second queries which in turn made the Merge query a Computed Table (what used to be a Computed Entity before the recent terminology changes, indicated by the lightning bolt icon):
For a full explanation of when the Enhanced Compute Engine can and can’t be used see Matthew Roche’s blog post here; basically it loads data into a SQL-based cache which Computed Tables can then leverage which means that data access is faster and the Power Query engine can push transformations back to it via query folding. The only other change I made was to set data types on the columns in the output of First and Second.
Refreshing this new version of the dataflow in PPU took on average 90 seconds
So performance was worse – but why? Enabling loading on First and Second means that more work is done at refresh time because their output needs to be ingested twice (once into ADLSgen2 and once into the SQL cache used by the Enhanced Compute Engine) before the Enhanced Compute Engine can access it. In this case the extra work needed to load First and Second outweighs the performance gains from using the Enhanced Compute Engine. The new metrics available from the dataflow’s Refresh History provide some insight into this (I strongly recommend you read the docs on these metrics here); here’s some of the data from one of the refresh history CSV files loaded into Excel:
In this particular case the overall refresh time of the dataflow was 88 seconds. First and Second refreshed in parallel – First taking 48 seconds and Second taking 51 seconds – and once they had both finished, Merge could refresh and only took 36 seconds to join the output of First and Second. So in this case Merge is indeed faster (36 seconds compared to 73 seconds before) as a result of using the Enhanced Compute Engine but that improvement isn’t enough to cancel out the additional time needed to load the data returned by First and Second into it.
What about a scenario where the Enhanced Compute Engine does make a positive difference? Take a look at the following dataflow, a slight variation on the dataflow above:
There are now three new tables: Ouput Table 1, Output Table 2 and Output Table 3. Each of these tables gets the maximum value from a different column in the table returned by Merge. Note that there are no Computed Tables in this dataflow so the Enhanced Compute Engine is not used, and that First, Second and Merge have load disabled.
Refreshing this dataflow on PPU took on average 95 seconds
Here are the refresh metrics for one of the refreshes:
As you can see, the three tables were refreshed in parallel and took between 84 and 93 seconds. It’s important to remember that for each of these tables the source data was loaded and the Merge query evaluated independently, which explains why they each take so long. The fact that Merge is evaluated three times when this dataflow refreshes is counter-intuitive but really important here – for more details see this post on how queries are evaluated in Power Query.
Now consider this version of the same dataflow where First, Second and Merge have their load enabled, making Merge, Output Table 1, Output Table 2 and Output Table 3 all Computed Tables.
Refreshing this dataflow on PPU took on average 88 seconds
Not a massive improvement, but an improvement. Now look at how different the refresh metrics are:
In this case Output Table 1, Output Table 2 and Output Table 3 only take 1 second to evaluate, but that’s because they are working from data cached in the Enhanced Compute Engine – the table returned by Merge – and the transformations in them fold. The Merge table also uses data cached in the Enhanced Compute Engine: the tables returned by First and Second. What’s more, because Merge is a Computed Table it is only evaluated once in this dataflow. Loading the data for First and Second takes 52 seconds and 50 seconds respectively and Merge takes 35 seconds. In this case the hit of loading the data into the Enhanced Compute Engine is worth taking.
In conclusion, there are two things that these tests have shown:
Moving your dataflow to PPU can make a big difference to refresh performance.
The Enhanced Compute Engine can make dataflow refresh faster but not in all cases: you need to understand how it works, and in some cases the overhead of loading the data into it outweighs the performance advantages it gives you for any transformations later on. Use the information in Refresh History to work out what’s happening for your dataflow.