Parquet File Performance In Power BI/Power Query

There has been a lot of excitement around the newly-added support for reading from Parquet files in Power BI. However, I have to admit that when I first started testing it I was disappointed not to see any big improvements in performance when reading data from Parquet compared to reading data from CSV (for example, see here). So, is Power Query able to take advantage of Parquet’s columnar storage when reading data?

The answer is yes, but you may need to make some changes to your Power Query queries to ensure you get the best possible performance. Using the same data that I have been using in my recent series of posts on importing data from ADLSgen2, I took a single 10.1MB Parquet file and downloaded it to my PC. Here’s what the data looked like:

I then created a query to count the number of rows in the table stored in this Parquet file where the TransDate column was 1/1/2015:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Filtered Rows" = Table.SelectRows(
    Source,
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

Here’s the output:

I then used SQL Server Profiler to find out how long this query took to execute (as detailed here): on average it took 3 seconds.

Here’s what I saw in Power BI Desktop while loading the data just before refresh finished:

As you can see, Power Query is scanning all the data in the file.

I then added an extra step to the query to remove all columns except the TransDate column:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Removed Other Columns" = Table.SelectColumns(
    Source,
    {"TransDate"}
  ),
  #"Filtered Rows" = Table.SelectRows(
    #"Removed Other Columns",
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

This version of the query only took an average of 0.7 seconds to run – a substantial improvement. This time the maximum amount of data read by Power Query was only 2.44MB:

As you can see, in this case removing unnecessary columns improved the performance of reading data from Parquet files a lot. This is not always true though – I tested a Group By transformation and in that case the Power Query engine was clever enough to only read the required columns, and manually removing columns made no difference to performance.
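A Group By query of this kind might look something like the sketch below. It assumes the same local file and TransDate column as the queries above; the "Grouped Rows" step name and the "Count" column name are illustrative choices, and this is not necessarily the exact query used in the test.

```powerquery
let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  // Group on TransDate and count the rows for each date.
  // Note there is no explicit Table.SelectColumns step here.
  #"Grouped Rows" = Table.Group(
    Source,
    {"TransDate"},
    {{"Count", each Table.RowCount(_), Int64.Type}}
  )
in
  #"Grouped Rows"
```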

This demonstrates that Power Query is able to take advantage of Parquet’s columnar storage to only read data from certain columns. However, this is the only performance optimisation available to Power Query on Parquet – it doesn’t do predicate pushdown or anything like that. What’s more, when reading data from the ADLSgen2 connector, the nature of Parquet storage stops Power Query from making parallel requests for data (I guess the same behaviour that is controlled by the ConcurrentRequests option) which puts it at a disadvantage compared to reading data from CSV files.

I think a lot more testing is needed to understand how to get the best performance when reading data from Parquet, so look out for more posts on this subject in the future…

[Thanks once again to Eric Gorelik from the Power Query development team for providing the information about how the Parquet connector works, and to Ben Watt and Gerhard Brueckl for asking the questions in the first place]

Bonus fact: in case you’re wondering, the following compression types are supported by the Parquet connector: GZip, Snappy, Brotli, LZ4, and ZStd.

Improving The Performance Of Importing Data From ADLSgen2 In Power BI By Partitioning Tables

So far in this series of posts looking at the performance of importing data from files stored in ADLSgen2 into Power BI I have looked at trying to tune refresh by changing various options in the Power Query engine or by changing the file format used. However there is one very important optimisation in the Analysis Services engine in Power BI that can make a significant difference to refresh performance: partitioning.

Using the same set of nine CSV files stored in ADLSgen2 storage described in the first post in this series I created a baseline dataset using the standard “From Folder” source to load all the data into a single table in Power BI:

I published the dataset to a PPU workspace in the Power BI Service and I ran a Profiler trace while refreshing it. In line with my previous tests it took 64 seconds to do a full refresh. However, this time I did something extra: I ran a Profiler trace and captured the Job Graph data for the refresh command. This is something I have blogged about before, but recently my colleague Phil Seamark has created a great Power BI file for visualising this data much more easily – you can read about it here:

Here’s what Phil’s report showed about the refresh:

Power BI is able to read almost 125,000 rows per second but there isn’t any parallelism here – and that’s what partitioning a table can offer.

It isn’t currently possible to create partitions in a table using Power BI Desktop, so instead I created a new .pbix file with a single Power Query query that loaded the data from just one CSV file into a single table in the dataset. Every table in Power BI has a single partition; I then used Tabular Editor to duplicate the existing partition eight times and changed the M query bound to each partition to point to a different CSV file:
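As a rough illustration, the M query bound to one partition might look like the sketch below. The storage account, container, folder and file names are all hypothetical placeholders, and the Csv.Document options are assumptions that would need to match the real files; each of the other partitions would use the same query with a different file name.

```powerquery
let
  // Hypothetical account, container and folder names
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/MyFolder"
  ),
  // Each partition's query picks out a different file by name
  GetFile = Source
    {
      [
        #"Folder Path"
          = "https://xyz.dfs.core.windows.net/MyContainer/MyFolder/",
        Name = "File1.csv"
      ]
    }
    [Content],
  // Assumed CSV options; adjust to suit the actual files
  #"Imported CSV" = Csv.Document(
    GetFile,
    [
      Delimiter  = ",",
      Encoding   = 65001,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  )
in
  #"Promoted Headers"
```

Every partition of a table has to return the same set of columns, which is why duplicating one working query and changing only the file name is a convenient way to build the partitions.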

I then refreshed it in the Service while again running a Profiler trace. Here’s what Phil’s report showed for this refresh:

As you can see, up to six partitions can be refreshed in parallel on a PPU capacity (the amount of parallelism varies depending on the size of the capacity) and this makes a big difference to refresh performance: this dataset refreshed in 40 seconds, 24 seconds faster than the version with a single partition. While the number of rows read per second for any single partition is lower than before, overall the number of rows read per second was much higher, at almost 200,000 rows per second.
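The throughput figures can be roughly reproduced with some simple arithmetic. This sketch assumes the source data contains about 8 million rows in total (the size quoted for this set of CSV files elsewhere in this series); the variable and field names are illustrative only.

```powerquery
let
  // Assumption: roughly 8 million rows across the nine CSV files
  TotalRows = 8000000,
  SinglePartitionSeconds = 64,
  NinePartitionSeconds = 40,
  Result = [
    // Overall rows read per second in each case
    SinglePartitionRowsPerSecond = TotalRows / SinglePartitionSeconds,
    NinePartitionRowsPerSecond = TotalRows / NinePartitionSeconds,
    // Difference between the two refresh times
    SecondsSaved = SinglePartitionSeconds - NinePartitionSeconds
  ]
in
  Result
```

With those inputs the record contains 125,000 and 200,000 rows per second and a saving of 24 seconds, which is in line with the approximate figures quoted above.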

Conclusion: partitioning tables in Power BI can lead to a significant improvement in refresh times when loading data from ADLSgen2 (and indeed any data source that can support a reasonable number of parallel queries).

Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless, Part 3: Parquet Files

Since I started this long and rambling series of posts on importing data from ADLSgen2 into Power BI a lot of people have asked me the same question: will using Parquet files instead of CSV files perform better? In this post you’ll find out.

To test the performance of Parquet files I took the data that I have been using in this series and loaded it from the original CSV files into Parquet files using Azure Data Factory. I then repeated some of the tests I ran in the first two posts in this series – here and here. The three tests were:

  • Loading all the data from the files
  • Filtering the data to the date 1/1/2015
  • Doing a Group By on the date column and counting the rows for each date

I ran these tests twice:

  • Connecting direct to the files in ADLSgen2 (using the AzureStorage.DataLake M function) from Power BI
  • Creating a view in Azure Synapse Analytics Serverless on top of the Parquet files and importing the data to Power BI from that using Power BI’s Synapse connector
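To give an idea of what the direct-connection version of the filter test involves, here is a minimal sketch. The storage account, container and folder names are hypothetical, the date column is assumed to be called TransDate, and the combining logic is a hand-written simplification: it is not the code that Power Query generates automatically when combining files.

```powerquery
let
  // Hypothetical storage account, container and folder names
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParquetFolder"
  ),
  // Keep only the Parquet files in the folder
  #"Parquet Files" = Table.SelectRows(
    Source,
    each [Extension] = ".parquet"
  ),
  // Turn the binary content of each file into a table
  #"Parsed Files" = List.Transform(
    #"Parquet Files"[Content],
    each Parquet.Document(_)
  ),
  // Append the tables from all the files together
  #"Combined Data" = Table.Combine(#"Parsed Files"),
  // Second test: filter to a single date
  // (TransDate is an assumed column name)
  #"Filtered Rows" = Table.SelectRows(
    #"Combined Data",
    each [TransDate] = #date(2015, 1, 1)
  )
in
  #"Filtered Rows"
```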

Here’s a table with all the average refresh times for each test:

Test                     Connecting to ADLSgen2 direct    Connecting via Synapse Serverless
Loading all data         72 seconds                       91 seconds
Filtering to 1/1/2015    29 seconds                       7 seconds
Group by on date         34 seconds                       7 seconds

Some points to note:

  • The performance of importing all the data by connecting direct to the files in ADLSgen2 was slightly slower here for Parquet files (72 seconds) than in my first blog post with CSV files (65 seconds)
  • The performance of the two subsequent tests, filtering by date and grouping by date, was only slightly worse when connecting direct to the Parquet files in ADLSgen2 than when connecting to CSV files. Filtering by date took 29 seconds for the Parquet files and 27 seconds for the CSV files; grouping by date took 34 seconds for the Parquet files and 28 seconds for the CSV files.
  • Importing all the data from Parquet files via Synapse Serverless performed a lot worse than connecting direct to ADLSgen2; in fact it was the slowest method for loading all the data tested so far. Loading all the data via Synapse Serverless from Parquet files took 91 seconds whereas it only took 72 seconds via Synapse Serverless from CSV files.
  • The two transformation tests, filtering by date and grouping by date, were a lot faster via Synapse Serverless than when connecting direct to ADLSgen2. What’s more, Synapse Serverless on Parquet was substantially faster than Synapse Serverless on CSV: filtering by date via Serverless on Parquet took 7 seconds compared to 15 seconds via Serverless on CSV, and grouping by date via Serverless on Parquet also took 7 seconds compared to 15 seconds via Serverless on CSV.
  • There is another variable here that I’m not considering: what if the number and size of files used affects performance? As other people have found, it certainly affects Synapse Serverless performance; it may also affect Power Query performance. However, I don’t have the time or expertise to test this properly so I’m going to declare it out of scope and concentrate on comparing Synapse Serverless performance with the performance of connecting to the same files direct.

So, based on these results it seems fair to draw the following conclusions:

Conclusion #1: if you’re importing all the data from files in ADLSgen2 then connecting direct is faster than going via Synapse Serverless

Conclusion #2: if you’re connecting direct to files in ADLSgen2 and importing all the data from them then CSV files are faster than Parquet files

Conclusion #3: if you’re transforming data then connecting to Parquet files via Synapse Serverless is a lot faster than any other method

[UPDATE 29th March 2021: The tests in this post were run using the method of combining data from multiple files that Power Query automatically generates. In this post I show an optimised version of the code that greatly improves the performance of combining data from multiple Parquet files]

Parquet Files In Power BI/Power Query And The “Streamed Binary Values” Error

If you’re using the new Parquet connector in Power BI there’s a chance you will run into the following error:

Parameter.Error: Parquet.Document cannot be used with streamed binary values.
Details:
[Binary]

This isn’t a bug or anything that can be fixed, so it’s important to understand why it occurs and what you can do about it.

One easy way to reproduce this problem is by trying to access a reasonably large (larger than a few MB) Parquet file stored in SharePoint, something like this:

let
  Source = SharePoint.Files(
    "https://microsoft-my.sharepoint.com/personal/abc",
    [ApiVersion = 15]
  ),
  GetFile = Source
    {
      [
        Name = "myfile.parquet",
        #"Folder Path"
          = "https://microsoft-my.sharepoint.com/personal/abc/Documents/"
      ]
    }
    [Content],
  #"Imported Parquet" = Parquet.Document(
    GetFile
  )
in
  #"Imported Parquet"

The problem is that reading data from Parquet files requires random file access, and this is something that isn’t possible in Power Query for certain data sources like SharePoint and Google Cloud Storage. This problem will never occur with locally-stored files or files stored in ADLSgen2.

There is one possible workaround but it comes with some serious limitations: buffer the Parquet file in memory using the Binary.Buffer() M function. Here’s an example of how the above query can be rewritten to do this:

let
  Source = SharePoint.Files(
    "https://microsoft-my.sharepoint.com/personal/abc",
    [ApiVersion = 15]
  ),
  GetFile = Source
    {
      [
        Name = "myfile.parquet",
        #"Folder Path"
          = "https://microsoft-my.sharepoint.com/personal/abc/Documents/"
      ]
    }
    [Content],
  #"Imported Parquet" = Parquet.Document(
    Binary.Buffer(GetFile)
  )
in
  #"Imported Parquet"

The problem with buffering files in memory like this is that it’s only feasible for fairly small files because of the limits on the amount of memory Power Query can use (see here for more information): you’re likely to get really bad performance or errors if you try to buffer files that are too large, and Parquet files are often fairly large. The best way of solving this problem is to switch to using a data source like ADLSgen2 where this problem will not happen.

[Thanks to Eric Gorelik for the information in this post]

Measuring The Performance Of AzureStorage.DataLake() Using Power Query Query Diagnostics

In my last post I showed how changing the various options on the AzureStorage.DataLake() M function didn’t have much impact on dataset refresh performance in Power BI. I’ll admit I was slightly surprised by this, but it got me wondering why this was – and so I decided to do some tests to find out.

The answer can be found using Power Query’s query diagnostics functionality. Although you can’t use it to find out what happens when a dataset refresh takes place in the Power BI Service, you can use it to view requests to web services for refreshes in Power BI Desktop as I showed in this post. The Detailed diagnostic log query shows each request Power Query makes to get data from the ADLSgen2 API. The URLs show the names of the files being accessed, and you can also see how long each request takes, the start and end time of each request and the amount of data read (the Content Length value in the response), amongst other things:

I wrote a Power Query query to extract all this useful information and put it in a more useful format, which can then be shown in Power BI. It’s fairly rough-and-ready but I turned it into an M function and posted the code here if you’d like to try it yourself – I haven’t done any serious testing on it though.

Here’s the data I captured for a refresh in Power BI Desktop that started at 10:55:42am yesterday and ended at 10:57:33am, taking 111 seconds overall. I was using the default options for AzureStorage.DataLake() and this table only shows data for the GET requests to the ADLSgen2 API that returned data:

The main thing to notice here is that the total duration of all the requests was just 5.25 seconds – less than 5% of the overall refresh time – which explains why changing the options in AzureStorage.DataLake() didn’t make much difference to dataset refresh performance. Maybe if the files were larger, or there were more of them, changing the options would make a more noticeable impact. Of course there’s a lot more happening inside both the Power Query engine and the Analysis Services engine here beyond calling the web service to get the raw data. I also ran a Profiler trace while this refresh was running (see here for how to do this) and from the point of view of the Analysis Services engine it took 104 seconds to read the data from Power Query: the ExecuteSQL Profiler event took 4.5 seconds and the ReadData event took 99.5 seconds.
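The proportions quoted above can be checked with a quick calculation. This is just a sketch of the arithmetic using the timings given in the text; the variable and field names are illustrative.

```powerquery
let
  // Timings taken from the text above, in seconds
  RefreshSeconds = 111,
  RequestSeconds = 5.25,
  ExecuteSqlSeconds = 4.5,
  ReadDataSeconds = 99.5,
  Result = [
    // Share of the overall refresh spent on ADLSgen2 requests
    RequestShare = RequestSeconds / RefreshSeconds,
    // Total time seen from the Analysis Services engine
    AnalysisServicesSeconds = ExecuteSqlSeconds + ReadDataSeconds
  ]
in
  Result
```

RequestShare works out at roughly 0.047, so a little under 5%, and the two Profiler events add up to the 104 seconds mentioned above.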

Conclusion: getting raw data from ADLSgen2 only represents a small part of the time taken to refresh a dataset that uses ADLSgen2 as a source, so any attempts to tune this may not have much impact on overall refresh times.

Testing The Performance Impact Of AzureStorage.DataLake() Options On Power BI Refresh Performance

Continuing my series on tuning the performance of importing data from ADLSgen2 into Power BI, in this post I’m going to look at the performance impact of setting some of the various options in the second parameter of the AzureStorage.DataLake() M function. In the last post in this series I showed how setting the HierarchicalNavigation option can improve refresh performance, but what about BlockSize, RequestSize or ConcurrentRequests?

Here’s what the documentation says about these options:

  • BlockSize : The number of bytes to read before waiting on the data consumer. The default value is 4 MB.
  • RequestSize : The number of bytes to try to read in a single HTTP request to the server. The default value is 4 MB.
  • ConcurrentRequests : The ConcurrentRequests option supports faster download of data by specifying the number of requests to be made in parallel, at the cost of memory utilization. The memory required is (ConcurrentRequest * RequestSize). The default value is 16.

Using the same 8 million row set of CSV files I have used in my previous posts and the same queries generated by the From Folder source (see this post for more details – note that in this post I am not using Synapse Serverless, just loading direct from the files), I tested various options. Here’s an example of how these options can be set:

AzureStorage.DataLake(
  "https://xyz.dfs.core.windows.net/myfolder",
  [ConcurrentRequests = 1]
)
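Several options can be set at once in the same record. The sketch below uses the combination of values from the last test in the results table; the EightMB and Options names are illustrative, and the memory figures in the comments simply apply the formula quoted from the documentation.

```powerquery
let
  // 8 MB expressed in bytes (8388608)
  EightMB = 8 * 1024 * 1024,
  Options = [
    ConcurrentRequests = 32,
    BlockSize = EightMB,
    RequestSize = EightMB
  ],
  // Memory required = ConcurrentRequests * RequestSize
  //   defaults:      16 * 4 MB = 64 MB
  //   these values:  32 * 8 MB = 256 MB
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/myfolder",
    Options
  )
in
  Source
```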

Here are the average dataset refresh times measured in the Power BI Service using Profiler:

Option                                                           Average Refresh Time (seconds)
None set – defaults used                                         67
ConcurrentRequests=1                                             70
ConcurrentRequests=32                                            67
BlockSize=1                                                      70
BlockSize=8388608 (8MB)                                          68
RequestSize=1                                                    Error (see below)
RequestSize=8388608 (8MB)                                        68
ConcurrentRequests=32, BlockSize=8388608, RequestSize=8388608    67

From these results it looks like it’s possible to make performance slightly worse in some cases but none of the configurations tested made performance better than the default settings.

There are two somewhat interesting things to note. First, this is pretty much what the developers told me to expect when I asked about these options a while ago. However I was told that there may be some scenarios where reducing the value of ConcurrentRequests can be useful to reduce the memory overhead of a Power Query query – I guess to avoid paging on the Desktop (as discussed here) or memory errors in the Power BI Service.

Second, when I set RequestSize=1 (which means that each HTTP request was only allowed to return 1 byte of data, which is a pretty strange thing to want to do) I got the following error:

Expression.Error: The evaluation reached the allowed cache entry size limit. Try increasing the allowed cache size.

This reminds me I need to do some research into how the Power Query cache works in Power BI Desktop and write that up as a post…

Overall, no major revelations here, but sometimes it’s good to know what doesn’t make any difference as much as what does.

Update 3/3/2021: read this post to see the results of some testing I did which shows why changing these options didn’t have much impact on refresh performance.

Webcast: Accessing Web Services With Power BI And Power Query

Earlier this week I gave a webcast on accessing web services with Power BI, Power Query and M on Reza Rad’s YouTube channel. You can watch it here:

It’s an introduction to the subject: I cover the basics of using Web.Contents but don’t go into all the obscure details of what each of the options for it do (most of which I have blogged about anyway). I hope you find it useful!

Query Folding On SQL Queries In Power Query Using Value.NativeQuery() and EnableFolding=true

Here’s something that will Blow Your Mind if you’re a Power Query/M fan. Did you know that there’s a way you can get query folding to work if you’re using a native SQL query on SQL Server or Postgres as your data source?

There’s a new option on the Value.NativeQuery() M function that allows you to do this: you need to set EnableFolding=true in the fourth parameter (the options record). It’s documented here for the Postgres connector but it also works for the SQL Server connector. Here’s an example using the SQL Server AdventureWorksDW2017 sample database:

let
  Source = Sql.Databases("localhost"),
  AdventureWorksDW2017 = Source
    {[Name = "AdventureWorksDW2017"]}
    [Data],
  RunSQL = Value.NativeQuery(
    AdventureWorksDW2017,
    "SELECT EnglishDayNameOfWeek FROM DimDate",
    null,
    [EnableFolding = true]
  ),
  #"Filtered Rows" = Table.SelectRows(
    RunSQL,
    each (
      [EnglishDayNameOfWeek] = "Friday"
    )
  )
in
  #"Filtered Rows"

Notice that my data source is a SQL query that gets all rows for the EnglishDayNameOfWeek column from the DimDate table, and that I’m only filtering down to the day name Friday afterwards, in the #"Filtered Rows" step, using the Table.SelectRows() function. Normally the #"Filtered Rows" step wouldn’t fold because I’ve used a native SQL query as my source, but in this case it does because I’ve set EnableFolding=true in Value.NativeQuery.

Here’s the SQL query generated by this M query:

select [_].[EnglishDayNameOfWeek]
from 
(
    SELECT EnglishDayNameOfWeek FROM DimDate
) as [_]
where [_].[EnglishDayNameOfWeek] = 'Friday'

Of course this doesn’t mean that everything can be folded now, but it’s nice to see that some folding on native SQL queries is now possible.

As I said this only works for SQL Server and Postgres at the time of writing and there is one other limitation: folding won’t happen if you’re passing parameters back to your SQL query in the way I describe here.
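For context, one common way of passing parameters to a native SQL query on SQL Server is to supply a record as the third parameter of Value.NativeQuery(). The sketch below assumes that this is the technique referred to above; the @DayName parameter name is an illustrative choice. According to the limitation just described, transformations applied after a query like this would not be expected to fold.

```powerquery
let
  Source = Sql.Databases("localhost"),
  AdventureWorksDW2017 = Source
    {[Name = "AdventureWorksDW2017"]}
    [Data],
  // @DayName in the SQL text is filled in from the DayName
  // field of the record passed as the third parameter
  RunSQL = Value.NativeQuery(
    AdventureWorksDW2017,
    "SELECT EnglishDayNameOfWeek FROM DimDate WHERE EnglishDayNameOfWeek = @DayName",
    [DayName = "Friday"]
  )
in
  RunSQL
```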

[Thanks to Curt Hagenlocher for the information]

Implementing Data (As Well As Metadata) Translations In Power BI

Power BI Premium has supported metadata translations – translations for table, column and measure names etc – for a while now. Kasper has a great blog showing how to use this feature here; Tabular Editor makes it very easy to edit metadata translations too. However (unlike SSAS Multidimensional) Power BI doesn’t have native support for data translations, that’s to say translating the data inside your tables and not just the names of objects. There are some blog posts out there that describe ways of tackling this problem but none are really satisfying or reliable: techniques involving row-level security, for example, can force you to use relationships in a way that will impact performance; the undocumented UserCulture() DAX function isn’t ready and shouldn’t be used in production yet. In this blog post I’m going to describe a new approach to solving the problem of data translations that relies on the DirectQuery on Live connections functionality released in preview in December. It is far from perfect but I think it’s the best way of solving this problem available at the moment.

Describing the problem

The best way to understand the problem and what this new approach offers is to look at the end result. Here’s a Power BI report where everything is in English:

And here is exactly the same report where the same information is shown translated into German (apologies for the actual translations…):

There are several things to point out in this German version of the report:

  1. The text in the title at the top and in the sidebar is now in German
  2. The column headers for the matrix are now in German (this is what has been possible for a while with metadata translations)
  3. The dates are now formatted using the default for the German locale (again, as a result of metadata translations)
  4. The day names are now shown in German. This is the key thing here – the data in a column, as well as the metadata, has been translated
  5. The decimal values in the Umsatz column are now formatted using the German locale so that a comma is used as the decimal separator and a full stop (ie a period for my American readers) is used as a thousands separator (again, as a result of metadata translations)
  6. The text indicating the total row is now in German (this is the normal behaviour of Power BI for users with a German browser locale)

So how do you get both data and metadata translations working in Power BI? Here’s a super-simple example showing how…

Step 1: The source dataset

First, let’s take a look at the dataset that contains all the data for both these reports. It contains two tables.

The Sales table looks like this:

Note that the Day Name column contains the names of the days of the week, and that there are two other columns containing the names of the days of the week translated to German and French. This is the table that holds the data shown in the matrix in the reports above.

There is also a table called Text that contains the text shown in the title and sidebar in the reports above. Again, there is a column containing the English text and two other columns containing German and French translations of that text:

There are no relationships between these tables:

Finally, this dataset also needs to be published to a workspace in the Power BI Service.

Step 2: Building the English version of the report

Building the original English version of the report is also quite straightforward. The matrix just contains data from the Sales table:

The only interesting thing is how the text in the title and sidebar is handled. In both cases I have used a card visual and dragged the Text column from the Text table into it, then filtered the data on the Visual column of the Text table so the appropriate text is shown:

In this case the sidebar shows the data from the Text column (aggregated to get the First value) where the Visual field contains the value “Textbox”.

Step 3: Creating the German translation dataset

This is where things get interesting. The next thing to do is to open a new .pbix file in Power BI Desktop, create a Live connection to the dataset created in step 1, and then hit the “Make changes to this model” button to create a new local dataset. This is where the new DirectQuery on Live connections functionality comes in; you should read the documentation on this feature before you go any further. The important thing to remember is that when you do this you are not duplicating any data or logic that is in the original dataset but you can make your own modifications to it.

There are two things that have to be done in this dataset. First, in Power BI Desktop, some renaming is necessary:

  1. The “Day Name” column on the Sales table has been renamed “English Day Name”
  2. The “German Day Name” column on the Sales table has been renamed “Day Name”
  3. The “Text” column on the Text table has been renamed “English Text”
  4. The “German Text” column on the Text table has been renamed “Text”

This results in the following columns in the local dataset:

Second, a translation object needs to be added to the dataset for the German (de-DE) locale for the metadata translations. I used Tabular Editor (instructions here) because it was the quickest and easiest way to add metadata translations:

Note how the name of the Sales table has been translated to Umsatz, and how the names of the Date, Day Name (note: this is the column that has just been renamed as Day Name, which points to the German Day Name column in the original dataset) and Sales columns have been translated to Datum, Tagesname and Umsatz respectively.

This local dataset also needs to be published to the Power BI Service to proceed.

Step 4: Building the German version of the report

The last thing to do is to go back to the original English version of the report built in step 2, open it in Power BI Desktop, then point it to the new German local dataset created in the previous step. You can do this by going to the Home tab in the ribbon, clicking on the Transform data button and then selecting Data source settings:

…and then selecting the German local dataset like so:

At this point you’ll see that the report is in a semi-translated state: the data has been translated but the metadata has not been.

Don’t panic though! Metadata translations can only be viewed in the browser or by changing the language settings inside Power BI Desktop so this is to be expected.

Notice that the middle column in the matrix points to a column called 'Sales'[Day Name]. In the source dataset this contains the English day names; in the German translation dataset we have switched the names of the columns so 'Sales'[Day Name] now contains the German day names. This is the key to solving the problem: all you need to do is ensure that each of the translation datasets you create exposes the set of table, column and measure names that your report expects; you just need to rename columns appropriately in each translation dataset so that they point to the columns in the source dataset that contain the correct translated names.

You should then save the .pbix file and publish again. You’ll either need to change the name of the report or publish it to a separate workspace; I recommend the latter, because it means you can tell all your German-speaking users to go to one workspace for their reports and your English-speaking users to go to another for their reports.

And that’s it – someone with a German browser locale viewing the version of the report connected to the German translation dataset will see this:

Summary

Here’s a diagram showing everything that has been built so far:

English-language users (who will have an English-language browser locale) use the original report that points to the source dataset. German-language users (who will have a German-language browser locale) use the German version of the report, which in turn connects to the German translation dataset; this gives them the German data and metadata translations.

The important things to remember are:

  • Even though you have multiple datasets there is no duplication of data or logic because of the way the new DirectQuery on Live connections functionality works.
  • Even though you have multiple copies of the same report for different languages, the report design in each case is identical and only the dataset that each report points to is different.

As a result the effort needed to maintain multiple translated copies of the same report is kept to a minimum.

Not too far in the future there will be new and improved functionality in Power BI that makes solving this problem even easier, and at that point I’ll write a follow-up blog post.

Optimise The Performance Of Reading Data From ADLSgen2 In Power BI With The HierarchicalNavigation Option

Last year Marco Russo wrote a very useful blog post pointing out the performance problems you can run into when connecting to data stored in ADLSgen2 from Power BI when there are a large number of files elsewhere in the container. You can read that post here:

https://www.sqlbi.com/blog/marco/2020/05/29/optimizing-access-to-azure-data-lake-storage-adls-gen-2-in-power-query/

Marco’s advice – which is 100% correct – is that you should either pass the full path to the folder that you want to connect to in the initial call to AzureStorage.DataLake() or, if you’re connecting to a single file, pass the path to the file itself. This avoids the performance overhead of reading metadata from files you’re not interested in reading from, which can be quite considerable.
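As a sketch of the second option, connecting direct to a single file might look something like this. The storage account, container, folder and file names are placeholders, the CSV options are assumptions, and the code relies on AzureStorage.DataLake() returning a table with one row when it is given the path to a file.

```powerquery
let
  // Pass the full path to the file itself, not its folder
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/aSales.csv"
  ),
  // Assumption: the table returned has a single row for the file
  GetFile = Source{0}[Content],
  // Assumed CSV options; adjust to suit the actual file
  #"Imported CSV" = Csv.Document(
    GetFile,
    [
      Delimiter  = ",",
      Encoding   = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  )
in
  #"Promoted Headers"
```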

There are some scenarios where this advice doesn’t work, though, and there is another way to avoid this overhead and make the performance of reading data much faster – and this is by using the HierarchicalNavigation option of the AzureStorage.DataLake() function. I blogged about what this option does some time ago but didn’t realise at the time the performance benefits of using it:

https://blog.crossjoin.co.uk/2019/09/29/hierarchical-navigation-adlsgen2-power-bi/

Consider the following scenario. Let’s say you want to connect to a CSV file in a folder which also contains a subfolder that contains many (in this example 20,000) other files that you’re not interested in:

[I’m only going to connect to a single file here to keep the example simple; I know I could just connect direct to the file rather than the folder and avoid the performance overhead that way]

Here’s the M code generated by the Power Query Editor using the default options to get the contents of the aSales.csv file:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder"
  ),
  Navigate = Source
    {
      [
        #"Folder Path"
          = "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/",
        Name = "aSales.csv"
      ]
    }
    [Content],
  #"Imported CSV" = Csv.Document(
    Navigate,
    [
      Delimiter  = ",",
      Columns    = 2,
      Encoding   = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"Product", type text},
      {"Sales", Int64.Type}
    }
  )
in
  #"Changed Type"

In Power BI Desktop refreshing the table that this M query returns (even with the Allow Data Preview To Download In The Background option turned off) takes 23 seconds. I measured refresh time using a stopwatch, starting with the time that I clicked the refresh button and ending when the refresh dialog disappeared; this is a lot longer than the refresh time that you might see using the Profiler technique I blogged about here, but as a developer this is the refresh time that you’ll care about.

The problem here is the Source step which returns a list of all the files in the ParentFolder folder and the ManySmallFiles subfolder.

Now, here’s an M query that returns the same data but where the HierarchicalNavigation=true option is set:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder",
    [HierarchicalNavigation = true]
  ),
  Navigation = Source
    {
      [
        #"Folder Path"
          = "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/",
        Name = "aSales.csv"
      ]
    }
    [Content],
  #"Imported CSV" = Csv.Document(
    Navigation,
    [
      Delimiter  = ",",
      Columns    = 2,
      Encoding   = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"Product", type text},
      {"Sales", Int64.Type}
    }
  )
in
  #"Changed Type"

This takes just 3 seconds to refresh in Power BI Desktop – a really big improvement.

Conclusion: always use the HierarchicalNavigation=true option in AzureStorage.DataLake() when connecting to data in ADLSgen2 storage from Power BI to get the best refresh performance and the best developer experience in Power BI Desktop.