Comparing The Performance Of Reading Data From Files With File.Contents And Web.Contents In Power Query And Power BI

In my last post I mentioned the Power Query engine’s persistent cache, which in some scenarios caches the data read from a data source when a query is refreshed. Another important nugget of information that Ehren von Lehe of the Power Query dev team mentioned in a post on the Power Query MSDN forum recently is the fact that if you use File.Contents to get data from a file then the persistent cache is not used, but if you use Web.Contents to get data from the same file then the persistent cache is used. I guess the thinking here is that there is no point creating an on-disk cache containing the contents of a file that is already on disk.

Using Process Monitor (see here and here for more information on how to do this) to view how much data is read from disk when a query is run it is possible to see this in action. Here’s the graph showing how much data is read from a 150MB csv file when a slow query is refreshed. This particular slow query results in five reads to the csv file – it’s more or less the same scenario from my second Process Monitor blog post here, with the query itself described in a lot of detail here. Here’s a graph of the data captured by Process Monitor showing time in seconds on the X axis and amount of data read in MB on the Y axis:

image

Since this query uses File.Contents to get the data from the csv file, the persistent cache is not used; as you can see, the graph shows clearly that the full contents of the csv file are read five times.

The same query altered to use Web.Contents shows just two full reads:

image

I have no idea why the file is read twice rather than once, but it’s definitely different behaviour to the version that uses File.Contents.

As far as I can see it is possible to replace File.Contents with Web.Contents in every case. So, if you have the following expression:

File.Contents("c:\users\myuser\Desktop\file.txt")

You can just replace it with:

Web.Contents("c:\users\myuser\Desktop\file.txt")

Which one is faster though? Just because a query reads data from disk more often does not necessarily mean that it will be slower. In the above scenario, with the csv file stored on my local hard drive, the Web.Contents version of the query refreshes in 18 seconds while the File.Contents version refreshes in 14 seconds. Replacing the csv file with an Excel file that contains the same data (remember that Excel files are a lot slower than csv files to read data from, as I showed here) results in the version of the query that uses File.Contents running in 205 seconds, while the version that uses Web.Contents running in 297 seconds. So it looks like, in most cases, File.Contents is the right choice when reading data from a file (as you would hope).

However, when using the same csv file stored on a network file share, the Web.Contents version takes 23 seconds while the File.Contents version takes 25 seconds. So maybe if you are dealing with files that are stored remotely over a slow connection it might be worth replacing File.Contents with Web.Contents to see if you get any performance benefits. There may be other situations where Web.Contents is the faster choice too. If you test this and see a difference, let me know by leaving a comment!

UPDATE: please also read Curt Hagenlocher’s comment below – Web.Contents may be changed in the future so it only works with http/https

5 responses

  1. The “multiple reads” problem can have significant performance and cost implications and affects various data source types.

    It is frustrating that Power BI issues the same query multiple times, especially for something like Google Big Query, where queries can be processing a very large amount of data and each query is charged. Often it seems the Gateway will run a query twice and desktop seems to sometimes run it more than that.

    My guess is that the first execution is to determine data types, etc, so I can understand it, but it nonetheless feels like an unresolved issue. I would much prefer a setting (on a per source basis, e.g. as a parameter in Sql.Database() that specifies a cache time) that allows me to control whether the query should be executed multiple times or whether Power BI sould simply cache the results on disk and re-use them for subsequent executions.

  2. Pingback: Power BI: Comparing Web.Contents and File.Contents Performance – Curated SQL

  3. Unfortunately, whoever implemented Web.Contents (which predates me) seems to have thought it would be a good idea to support file: and ftp: URLs, and the ability to send a regular Windows file path to Web.Contents is a side-effect of that. Once we eventually finish the “module versioning” story, we may define Web.Contents in V2 of the Web module to work with only http:/https:.

    Note that Web.Contents with file: and ftp: URLs will not work for Power BI Service refresh. I don’t remember any more why that is.

  4. @Chris – I once shared with the PQ team how to connect to an Excel / Access File stored on One Drive Personal (something that is not supported even today) – They promptly disabled it.

    I found another way – but have learnt my lesson now.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: