In my last post I mentioned the Power Query engine’s persistent cache, which in some scenarios caches the data read from a data source when a query is refreshed. Another important nugget of information that Ehren von Lehe of the Power Query dev team mentioned in a post on the Power Query MSDN forum recently is the fact that if you use File.Contents to get data from a file then the persistent cache is not used, but if you use Web.Contents to get data from the same file then the persistent cache is used. I guess the thinking here is that there is no point creating an on-disk cache containing the contents of a file that is already on disk.
Using Process Monitor (see here and here for more information on how to do this) to view how much data is read from disk when a query is run it is possible to see this in action. Here’s the graph showing how much data is read from a 150MB csv file when a slow query is refreshed. This particular slow query results in five reads to the csv file – it’s more or less the same scenario from my second Process Monitor blog post here, with the query itself described in a lot of detail here. Here’s a graph of the data captured by Process Monitor showing time in seconds on the X axis and amount of data read in MB on the Y axis:
Since this query uses File.Contents to get the data from the csv file, the persistent cache is not used; as you can see, the graph shows clearly that the full contents of the csv file are read five times.
The same query altered to use Web.Contents shows just two full reads:
I have no idea why the file is read twice rather than once, but it’s definitely different behaviour to the version that uses File.Contents.
As far as I can see it is possible to replace File.Contents with Web.Contents in every case. So, if you have the following expression:
Which one is faster though? Just because a query reads data from disk more often does not necessarily mean that it will be slower. In the above scenario, with the csv file stored on my local hard drive, the Web.Contents version of the query refreshes in 18 seconds while the File.Contents version refreshes in 14 seconds. Replacing the csv file with an Excel file that contains the same data (remember that Excel files are a lot slower than csv files to read data from, as I showed here) results in the version of the query that uses File.Contents running in 205 seconds, while the version that uses Web.Contents running in 297 seconds. So it looks like, in most cases, File.Contents is the right choice when reading data from a file (as you would hope).
However, when using the same csv file stored on a network file share, the Web.Contents version takes 23 seconds while the File.Contents version takes 25 seconds. So maybe if you are dealing with files that are stored remotely over a slow connection it might be worth replacing File.Contents with Web.Contents to see if you get any performance benefits. There may be other situations where Web.Contents is the faster choice too. If you test this and see a difference, let me know by leaving a comment!
UPDATE: please also read Curt Hagenlocher’s comment below – Web.Contents may be changed in the future so it only works with http/https
Some time ago a customer of mine (thank you, Robert Lochner) showed me a very interesting scenario where a set of Power Query queries in Power BI Desktop refreshed a lot faster with the “Enable Parallel Loading Of Tables” option turned off. This seemed a bit strange, but I have recently been reading lots of posts by Ehren von Lehe of the Power Query dev team on caching and query evaluation on the Power Query MSDN forum and two posts in particular, here and here, offered an explanation of what was going on. In this post I will walk through a very simple scenario that illustrates the information in that post and sheds some light on the internals of the Power Query engine.
Let’s say you have very simple web service built in Flow that, when it receives a GET request, waits 5 seconds and then returns the text “Hello”:
Here’s the M code for a Power Query query called SourceQuery that calls this web service and returns a table with one row and one column containing the text “Hello”:
let
Source = Web.Contents("https://entermyurlhere.com"),
ToLines = Lines.FromBinary(Source, null, null, 65001),
ToTable = Table.FromColumns({ToLines})
in
ToTable
Let’s also say you have four other identical queries called R1, R2, R3 and R4 that reference this query and do nothing but return the same table. Here’s the M code for these queries:
let
Source = SourceQuery
in
Source
Now, here’s the big question! If the output of SourceQuery is not loaded into Power BI but the output of R1, R2, R3 and R4 is loaded, as follows:
How many times would you expect the web service to be called when you refresh your dataset? One or four?
With the setting of “Enable Parallel Loading Of Tables” turned on, which is the default, the “Enable Data Preview To Download In The Background” setting turned off and Data Privacy checks turned off, Fiddler (see here for how to use it) shows the web service is called four times. SQL Server Profiler (see here for how to use it with Power BI) shows that the four queries are run in parallel and each query takes just over 5 seconds.
This seems wrong, because your natural inclination is to think that the Power Query engine evaluates SourceQuery first, and after that has happened the data passes from there to R1, R2, R3 and R4. In fact what happens is the reverse: R1, R2, R3 and R4 are evaluated in parallel and because each of them reference SourceQuery, then each of them causes SourceQuery to be evaluated independently. Yet another post by Ehren on the forum here explains this in detail.
The Power Query engine does do some caching in a ‘persistent cache’ that can be shared between queries and which stores the data requested from some types of data source on disk – it will store the data returned by Web.Contents for example, but not File.Contents – and only in some situations. The key statements about the persistent cache from this thread relevant to this particular problem are:
Power Query’s persistent cache (which stores data on disk during a particular refresh) is updated via a background thread. And separate evaluations (i.e. separate Microsoft.Mashup.Container.*.exe processes) running at the same time are not coordinated; when evaluation A is accessing the persistent cache (including updating it), this doesn’t block evaluation B from accessing the cache. This means that even when using a shared persistent cache, PQ can potentially end up requesting the same data twice. It depends on the timing of the various requests.
and
…you can reduce the number of evaluations happening at any given time (and thus the likelihood of timing-related cache misses, or additional unwanted requests) by doing the following:
Disabling background analysis (which pre-calculates the PQ Editor previews behind the scenes)
Disabling the Data Privacy Firewall (which does its own evaluations during refresh that could potentially cause duplicate requests, and also only consume a subset of the data in certain cases)
In PBIDesktop, you can also reduce the likelihood of timing-related cache misses by disabling parallel loading. (In Excel, the loading of multiple queries is always sequential)
As a result, with the “Enable Parallel Loading Of Tables” setting turned off, so the queries above are evaluated in series, Fiddler shows the web service is called only once. What’s more, Profiler shows that only one of the queries out of R1, R2, R3 and R4 takes just over five seconds while the other three, executed after it, are almost instant.
This must be because, when the queries are executed in series, the first query refreshes and data gets loaded into Power Query’s persistent cache. After that, even though the three subsequent queries also cause SourceQuery to be evaluated, each time SourceQuery is evaluated it can reuse the data stored in the persistent cache.
In contrast, with the four queries refreshed in parallel, each evaluation of SourceQuery takes place before the persistent cache has been populated and therefore the web service is called four times. This is slower, but in this case not much slower, than the scenario with the four queries executed in series – each query evaluation takes just over five seconds but remember that they are being executed in parallel so the overall duration is not much more than five seconds. In other cases the difference between parallel and sequential query execution could be a lot larger.
Now for the bad news: the “Enable Parallel Loading Of Tables” setting only works in Power BI Desktop, and there is no equivalent setting in the Power BI service. When a dataset is refreshed in the service, as far as I can tell the queries in the dataset are always evaluated in parallel. In Excel, as Ehren says, all queries are executed sequentially.
Is it possible to make sure the web service is called only once with all the queries executed in parallel? Yes, but not in a completely reliable way. The first thing to say here is that Table.Buffer and related buffering functions are not, as far as I understand it, useful here: they buffer data in memory within a single chain of execution, and in this case we have four separate chains of execution for R1, R2, R3 and R4. Instead what I have found that works is inserting a delay that gives the persistent cache time to be populated. If you keep the SourceQuery and R1 queries the same, and then alter the M code for R2, R3 and R4 so they wait for ten seconds (using Function.InvokeAfter) before returning as follows:z
let
Source = Function.InvokeAfter(
()=>SourceQuery,
#duration(0,0,0,10)
)
in
Source
In this particular case this results in the slowest refresh times and it only works because I can be sure that the call to the web service takes five seconds.
There’s one last thing to say here: wouldn’t it be nice if we could ensure that the web service was only called once, and all subsequent queries got their data from the a persisted copy of the data? As Matthew Roche points out here, this is exactly what dataflows allow you to do. Taking SourceQuery and turning it into an entity in a dataflow would result in a single call to the web service when the entity is refreshed, the data from the web service being persisted in the dataflow. You would then just need to have the queries R1, R2, R3 and R4 in your dataset, change them so they get their data from the entity in the dataflow, and as a result they would use the persisted copy of the data in the dataflow and would not call the web service. Another reason to use dataflows!
[I hope all the information in this post is correct – it is based on Ehren’s forum posts and my own observations alone. If anyone from the Power Query dev team is reading this and spots an error, please let me know]
I generally try to avoid writing book reviews here, but the fact that there are so few books available on Power Query and M means that I’m making an exception for “Collect, Combine and Transform Data using Power Query in Excel and Power BI” by Gil Raviv.
The first thing to say about this book is that it takes the approach of teaching through exercises and worked examples, rather than by explaining abstract concepts. If this suits your style of learning (and I know that it does for a lot of people) then you’re in luck; if you’re looking for a book that will explain what all the different join types for Merge operations do, for example, then you’ll be disappointed. This isn’t a criticism though – I don’t think it’s possible to write a book that will satisfy everyone – and Gil has done a good job of covering a lot of common data preparation scenarios. One important exception to this is the chapter on M which provides a very clear introduction to the language and the way it works. I suspect a lot of people will want to buy the book for this chapter alone.
The second thing to say about this book is that while it covers both Power Query in Excel and Power BI, in my opinion it’s aimed slightly more at Excel users. Again, this is not a criticism: although advanced Excel users and Power BI report designers have to solve many of the same problems, they also have some very different concerns too. What’s more, if you can assume your readers have good Excel skills and can explain Power Query concepts in Excel terms then you’ll serve that particular set of readers well, and probably do a better job for them than if you assume they are completely new to the area of data transformation and preparation have no existing skills in this area.
All in all it’s a good book that I can recommend to anyone who wants to learn Power Query and M, and also for intermediate users who want to deepen their knowledge. I still think there’s a need for a book completely devoted to M and covering topics such as custom connectors and dataflows; hopefully someone writes one soon!
Full disclosure: I received a free copy of the book from the author. I’m also the author of a Power Query book myself, but to be honest it’s several years old now and a bit out-of-date, so it’s hard to recommend it any more.
You can buy a copy of this book from Amazon UK here.
Yet more evidence that Power Query is taking over the world: there’s a new Power Query data source for SQL Server Integration Services in the latest version of SSDT. Here’s the blog post announcing it (the actual announcement is halfway through):
It’s a preview and as such, limited in scope. The main restriction, as the blog post says, is:
At present, to facilitate quick/frequent feedback-gathering and improvement cycles, it can only be used with SQL Server Data Tools (SSDT) and SSIS IR.
There’s also no user interface for generating queries: right now you can only paste in an M query or use an M query stored in an SSIS variable. I don’t think that’s much of a problem though – if you don’t want to write your own M code, you can always generate a query in Excel Power Query or Power BI Desktop and cut and paste from there. You don’t even have to open the Advanced Editor to do this, as I show here.
Beyond the normal excitement about cool new stuff, what are the benefits of having Power Query integrated into SSIS? I’m not an SSIS person so I’m probably not the right person to comment on the benefits for SSIS developers, but I would imagine that it will make it possible to connect to a wider range of data sources and also make it easier to work with certain others, such as Excel workbooks and web services. I’d be interested in hearing your thoughts on this, so please leave a comment.
From the point of view of the Excel, Power BI or SSAS developer, though, it’s clear this will make it much easier and quicker to take user-developed Power BI and Excel Power Query solutions and turn them into a managed, IT-department owned solutions – and this will be incredibly useful.
Working out how long a Power Query M query takes to run – the first step to troubleshooting refresh performance problems in Power BI or Excel – is something I have blogged about several times (see here, here and here for example). However, there are problems with all of these methods such as the need to alter the M code of your query. Luckily there is another method that I have found, using a combination of DAX Studio and SQL Server Profiler.
How can you get hold of these tools? You can download DAX Studio for free here:
Any serious Power BI developer should have it installed already, but if you have an older version you may need to update it to get the features I talk about in this post.
SQL Server Profiler is part of SQL Server Management Studio, and you can download it for free from here:
You can generate SQL Server Profiler trace files for diagnostic purposes from Power BI Desktop, as Kasper shows here, but these traces don’t show any information about refreshes as far as I can see. You can also hook Profiler up to Power BI Desktop direct, as Sam Lester shows here, but that’s a bit fiddly to do. Luckily DAX Studio now has a preview feature which makes connecting Profiler to Power BI Desktop much easier.
First of all, open up your report in Power BI Desktop, then open up DAX Studio and connect it to the instance of Power BI Desktop you just opened; if you want to use this technique with Excel Power Query queries then you will need to launch DAX Studio from inside Excel. Then, in DAX Studio, open the Options dialog and enable the “Show External Tools” preview feature:
Once you have done this, on the Advanced tab on DAX Studio’s ribbon you’ll see an option to launch SQL Profiler and connect it to the Power BI report or Excel file that DAX Studio is currently connected to:
Clicking the button opens up Profiler with a trace running. You’ll see that a lot of trace events are selected – probably too many to make sense of – so I suggest that you stop the trace, click the Properties button on the toolbar and then, on the Events Selection tab of the Trace Properties dialog select just the following four trace events:
Command Begin
Command End
Progress Report Begin
Progress Report End
Restart the trace and then refresh your dataset in Power BI Desktop. You’ll still see a lot of events generated in Profiler, but look for the following:
A Command Begin event followed by a lot of Progress Report Begin/End events, which is the command that marks the start of the refresh.
After that there will be a series of Progress Report Begin events that have an EventSubclass of “25 – ExecuteSQL” – one for each of the tables that are being loaded into your dataset and therefore one for each of the Power Query M queries associated with those tables. Don’t be confused by the name of the EventSubclass, these events will appear whatever data source you are using: it looks like the Vertipaq engine requests data from the Power Query engine using a basic form of SQL. For example, here’s a screenshot showing the three events I get when refreshing a dataset with three tables in it called “Property Type”, “Date” and “Price Paid 2017”:
Finally, below that there will be a series of Progress Report End events (one for each Progress Report Begin event) with the EventSubclass of “25 – ExecuteSQL”, and the Duration column here will tell you how long it took to get data from the Power Query M query associated with the table. For example, this screenshot shows that it took 5460ms (about 5.5 seconds) to get the data from the “Price Paid 2017” Power Query query:
And there you have it, exact timings for each of the Power Query M queries associated with each of the tables in your dataset. Remember that the time taken by each Power Query M query will include the time taken by any other queries that it references, and it does not seem to be possible to find out the amount of time taken by any individual referenced query in Profiler.
There is a lot more interesting information that can be found in this way: for example, dataset refresh performance is not just related to the performance of the Power Query M queries that are used to load data; time is also needed to build all of the structures inside the dataset by the Vertipaq engine once the data has been returned, and Profiler gives you a lot of information on these operations too. Discussion of that will have to wait until a future blog post though…
UPDATE April 2019:
Having used this technique quite a lot in the last few months, I’ve seen that the Progress Report End/25 – Execute SQL event doesn’t always give an accurate value for the amount of time taken to execute an M query – or rather it does, but things are more complicated than I initially understood. You may also need to look at the trace events that come immediately afterwards: Progress Report Begin/17 – Read Data and Progress Report Begin/17 – End Data. Akshai Mirchandani of the SSAS team gave me a great explanation of the significance of these events:
The SQL query traces indicate the begin/end of query execution against the data source – theoretically, once this is done, rows are ready to be pulled from the source.
That’s when you will see the “ReadData” begin event. Rows start getting pulled at that point, until all rows are read (and encoded/compressed) – then you see the end event.
It is not always predictable how each provider/source behaves. Some of them might defer the expensive query execution until the first “Read” operation, but typically the SQL execution step tries to ensure that the command is valid, prepare the command and (at the very least) start the execution of the command.
Sometimes, the command execution might compute all the results and buffer them too.
In effect, its up to the provider/source exactly how this actually behaves…
It looks like the Power Query engine can behave in both of the ways Akshai describes depending on what you’re doing in the query.
A while ago I found myself wondering – as I often do about this kind of thing – about what the undocumented third parameter of the Excel.Workbook() M function (called delayTypes) actually does. I found a forums post from Guy Hunkin of the Excel Power Query team in 2017 here, which gives the following summary:
Originally, Excel.Workbook used to read the entire workbook data to accurately assign types to the columns. This was EXTREMELY slow on large workbooks.
Having this in mind, we added this flag to delay the behavior. When set to “true”, we don’t infer any column types. Instead, the UI uses its normal inference algorithm to generate a Table.TransformColumnTypes step with the inferred types and if it inferred wrongly, the user can update the transformation explicitly.
I also posted a question on the Power Query MSDN forum and got some more details from Colin Banfield, one of the many M experts who hang out there, as well as doing some investigations of my own. This post summarises what I found out.
First of all, a simple demonstration of what it does. Consider the following Excel table:
Connecting to this table in the December 2018 release of Power BI Desktop generates the following M query:
[sourcecode language=”text” padlinenumbers=”true” highlight=”8″]
let
Source =
Excel.Workbook(
File.Contents(
"C:\DelayTypesTest.xlsx"
),
null,
true
),
SourceData_Table =
Source{[Item="SourceData",Kind="Table"]}[Data],
#"Changed Type" =
Table.TransformColumnTypes(
SourceData_Table,
{
{"TextColumn", type text},
{"IntegerColumn", Int64.Type},
{"DecimalColumn", type number},
{"DateColumn", type date}
}
)
in
#"Changed Type"
[/sourcecode]
Notice that the delayTypes option is now automatically set to true and that there is a “Changed Type” step. The output of this query, as seen in the Power Query Editor, is what you would expect:
Compare this with the output of the following query which has delayTypes set to false and has no “Changed Type” step:
[sourcecode language=”text”]
let
Source =
Excel.Workbook(
File.Contents(
"C:\DelayTypesTest.xlsx"
),
null,
false
),
SourceData_Table =
Source{[Item="SourceData",Kind="Table"]}[Data]
in
SourceData_Table
[/sourcecode]
There are two things to notice:
With delayTypes set to false no “Changed Type” step is necessary, the data types on the columns are set by Excel.Workbook(). With delayTypes set to true all the columns returned by Excel.Workbook() have their data type set to Any and a subsequent “Changed Type” step is necessary to set data types to something useful.
With delayTypes set to true the IntegerColumn and the DecimalColumn columns have different data types assigned to them by the “ChangedType” step: IntegerColumn becomes Int64.Type whereas DecimalColumn becomes type number. With delayTypes set to false both columns are set to number. So it looks like Excel.Workbook with delayTypes set to false will only convert columns to primitive data types.
The second question is: if it is, primarily, a performance optimisation how much of an impact does it have? I built four queries that read all the data from an 80MB xlsx file to test the performance of all of the variations of delayTypes true/false and having a “Changed Type” step present or not (just in case there was an overhead to having a “Changed Type” step). Here are the results – the main timings are in minutes and seconds, and in brackets is the length of the “Evaluating” phase of query execution:
delayTypes=false
delayTypes=true
no “Changed Type” step
2:08
(1:21)
1:06
(0:00)
“Changed Type” present
2:19
(1:30)
1:08
(0:00)
I wouldn’t read too much into these exact timings because in my experience timings for the same query can fluctuate quite a lot, but one thing is clear: setting delayTypes to true results in much better performance. Indeed with delayTypes set to false nearly half the time was spent in the “Evaluating” phase whereas with delayTypes set to true there was no “Evaluating” phase at all. As a result I can’t think of a reason not to set the delayTypes option to true.
It’s worth noting that delayTypes is set to false by default if you don’t specify this option, so be aware of this if you are writing your own M code.
The latest version of the “release notes” document that details the roadmap for Dynamics 365 and the whole Power Platform – which includes Power BI – has just been released. You can view it online here:
There are lots of very exciting announcements for Power BI. Here are a few of the features that are promised for later this year that caught my eye, that haven’t already been announced elsewhere (although some of them were discussed in public sessions at the last PASS Summit):
Expression-based formatting in Power BI reports – the ability to use a DAX expression to control properties such as colour and line style within a visual. One of those features that advanced Power BI users will love, and which will allow all kinds of interesting customisations.
Performance analyzer for reports – a tool in Power BI Desktop to tell you how long each visual on a page takes to refresh, making it easier to understand where your performance bottlenecks are. I wish we could have something similar for Power Query queries…
Shared and certified datasets – this is going to be incredibly useful! As the docs say “With shared datasets in Power BI, we are allowing a single dataset to be used by multiple reports, across workspaces”. I can’t wait. Lots of other cool, related features too here.
Large-scale compute engine for dataflows – this is somewhat cryptic. The docs say “We are enhancing the dataflow calculation engine in Power BI Premium to enable high-performing large-scale data transformations (TBs of data).”. Are these enhancements to the Power Query engine, or is it being replaced with something else entirely (eg to make it more like Azure Data Factory?). I don’t know.
More new Q&A features: auto-suggestions and did you mean. I’m glad Microsoft is continuing to invest in Q&A, I think it’s going to be one of those features that suddenly goes from being a demo feature that few people actually use to something that will be very powerful indeed. In fact, I would argue that it’s already a lot more powerful than most people realise.
Power BI Template Apps – the ability to create multiple apps in different workspaces from a single template. This was something that used to be possible with content packs (remember them?) so I guess anyone still using them should be able to finally migrate away. Useful for end-user customisation of apps.
I have already blogged in great detail many times about Power BI/Power Query data privacy settings (see this series for example) but there’s always something new to learn. Recently I was asked a question by Ian Eckert about how Power BI handles data privacy for cloud or web-based data sources after a dataset has been published, and it prompted yet more revelations…
Consider the following M query:
[sourcecode language=’text’ padlinenumbers=’true’]
let
Source = Xml.Tables(
Web.Contents(“https://blog.crossjoin.co.uk/feed/”)
),
channel = Source{0}[channel],
language = channel{0}[language],
out = Json.Document(
Web.Contents(
“https://data.gov.uk/api”,
[
RelativePath=”3/action/package_search”,
Query=[q=language]
]
)
),
result = out[result],
results = result[results],
#”Converted to Table” = Table.FromList(
results,
Splitter.SplitByNothing(),
null,
null,
ExtraValues.Error
),
Column1 = #”Converted to Table”{0}[Column1],
#”Converted to Table1″ = Record.ToTable(Column1)
in
#”Converted to Table1″
[/sourcecode]
It doesn’t do anything particularly interesting, but it does take data from one web-based data source (the RSS feed for this blog) and sends it to another (the UK government’s open data metadata search web service). As a result, in Power BI Desktop, if you set the data privacy settings for both data sources to Public then the query runs, but if you set the data privacy settings for both data sources to Private:
…As expected, you get the following error:
Formula.Firewall: Query ‘Test’ (step ‘Converted to Table1’) is accessing data sources that have privacy levels which cannot be used together. Please rebuild this data combination.
Now the strange thing is that, when you publish the dataset that contains this M query, refresh always works. Why? What’s more, other datasets that do something similar will always fail when refreshed.
It turns out that when you publish a dataset that uses cloud or web-based data sources like the two used here, the Power BI service does not use the data privacy settings you have set in Power BI Desktop but instead it automatically assigns data privacy levels as follows:
Data sources, like the ones used here, that use Anonymous authentication are automatically given the privacy level Public
All other data sources are given the privacy level Private.
Interestingly, Power BI dataflows also count as cloud-based data sources and because they do not use Anonymous authentication they default to Private too, so if you are combining data from a dataflow with another data source in your dataset then you need to be careful of this.
What’s more there is at the time of writing no way to change these data privacy levels in the Power BI web-based portal. Hopefully this will change soon.
There are some workarounds though!
First of all, you can force refresh to take place through a gateway. This might sound strange because in theory, if you’re only using cloud or web-based data sources, a gateway should not be necessary. However there are already similar scenarios where a gateway is needed, for example if you are scraping data from a web page you need to use a gateway, and if you are combining data from a cloud-based data source with an on-premises data source you also need to use a gateway. If you add your cloud/web-based data sources as data sources in your gateway (unfortunately it does not seem to be possible to add a dataflow as a data source in a gateway, though) you can set their data privacy levels in the Advanced Settings section in the Manage Gateways screen:
You will also need to set the “Use a data gateway” option to On in the Settings dialog for your dataset after it has been published:
The other workaround is to copy the M code for your query and paste it into a new blank M query in an entity in a dataflow, as Matthew Roche shows here. While it does not seem to be possible to set data privacy levels for individual data sources when creating an entity, it is possible to turn off data privacy checks for an entity completely. If you create a query that sends data from one data source to another (regardless, as far as I can see, of the authentication mechanism used), you will see the following message in the Power Query Online query editor:
The evaluation was cancelled because combining data from multiple sources may reveal data from one source to another. Click Continue if the possibility of revealing data is okay.
If you click Continue, data privacy checks are turned off and the query runs; you can also click the Options button on the ribbon and check the “Allow combining data from multiple sources” option:
If one of your data sources is already itself a dataflow you may need to do some editing of the M query to make things work, but as Matthew Roche shows here it is possible to have an entity in a dataflow refer to another entity without using a computed entity (which is a Premium-only feature).
[Thanks to Arthi Ramasubramanian Iyer from Microsoft for providing background information for this post]
If you’re learning Power BI and you want to deepen your understanding of its data transformation and loading functionality, the Power Query engine and the M language, I’m doing one-day pre-conference seminars on this subject at three conferences over the next few months:
SQLKonferenz in Darmstadt, Germany, 19-21 February 2019 (see here for more details)
SQLBits in Manchester, UK, 27 February-2 March 2019 (see here for more details)
You’ll learn about all the functionality in the Power Query Editor window plus concepts such as merging, appending, parameters, functions, data privacy and M. I hope to see you at one of them!
A few weeks ago I received a great tip from Simon Nuss about a Power Query feature relating to ODBC data sources that I hadn’t seen before (although it’s in Power BI Desktop and Excel 365, so it can’t be that new) and which is potentially important for the performance of data refresh and of the Power Query Editor’s preview pane. It’s only visible if you click “Advanced options” when creating your ODBC data source:
It’s the “Supported row reduction clauses” option:
What does it do? Well, first of all let’s see what happens if you don’t set it. Let’s say you have an ODBC data source that uses the Microsoft ODBC Driver for SQL Server to connect to the Adventure Works DW database. If you create a new query that uses this data source to get the contents of the DimDate table it works pretty much as you’d expect; here’s the M code:
let
Source =
Odbc.DataSource(
“dsn=Test SQL”,
[HierarchicalNavigation=true]
),
#”Adventure Works DW_Database” =
Source{
[Name=”Adventure Works DW”,Kind=”Database”]
}[Data],
dbo_Schema =
#”Adventure Works DW_Database”{
[Name=”dbo”,Kind=”Schema”]
}[Data],
DimDate_Table =
dbo_Schema{[Name=”DimDate”,Kind=”Table”]}[Data]
in
DimDate_Table
However, if you look in Profiler when the contents of the Preview window is refreshed, you’ll see that the Power Query engine is running a query that gets the entire contents of the table:
Obviously, if you are connecting to a big table this could make the Power Query Editor slow. What’s more, as this blog post shows, the Power Query Editor only actually wants to display 1000 rows and the fact that the query is requesting all the rows from the table suggests that even basic query folding is not happening on this ODBC data source.
Now let’s see what happens if you do select something from the “Supported row reduction clauses” dropdown. As the screenshot above shows, there are four values in the dropdown that relate to the form of SQL supported by your ODBC source to only get the first N rows from a table. If you don’t know which one to choose you can click the Detect button and it will select one for you. In this case I selected TOP:
Having done this, when you connect to the same DimDate table, you’ll see that now the Power Query Editor is only requesting the top 1000 rows from the table:
Also, if you look at the M code for the query, the Odbc.DataSource function has the as-yet undocumented SqlCapabilities option set in its second parameter,
Actually it’s not quite undocumented because it is mentioned here as something that is only available if you’re using Odbc.DataSource in a custom connector; obviously things have changed.
Setting this option can make a big difference to the responsiveness of the Power Query Editor when you’re working with large tables and/or slow data sources – I saw this myself this week when working with one of my customers. It’s also going to be important if you use the Keep Rows/Keep Top Rows button on the ribbon or the Table.FirstN() M function; I suspect there are other, more obscure, reasons why it might speed refresh up even if you aren’t filtering the rows of the table but I’m not 100% sure.
One last thing to mention is that there is another undocumented function which is what, I think, the Detect button in the screenshots above uses to find out how to set the SqlCapabilities option. It’s called Odbc.InferOptions and it can be used like so:
Odbc.InferOptions(“dsn=Test SQL”)
This function returns a record containing a field called LimitClauseKind that tells you the type of limit clause that is supported:
Here’s a list of the possible LimitClauseKind values I found using #shared: