Optimise The Performance Of Reading Data From ADLSgen2 In Power BI With The HierarchicalNavigation Option

Last year Marco Russo wrote a very useful blog post pointing out the performance problems you can run into in Power BI when connecting to data stored in ADLSgen2 if there are a large number of files elsewhere in the container. You can read that post here:

https://www.sqlbi.com/blog/marco/2020/05/29/optimizing-access-to-azure-data-lake-storage-adls-gen-2-in-power-query/

Marco’s advice – which is 100% correct – is that you should either pass the full path of the folder you want to connect to in the initial call to AzureStorage.DataLake() or, if you’re connecting to a single file, pass the path to the file itself. This avoids the performance overhead of reading metadata from files you’re not interested in reading from, which can be quite considerable.
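For example, connecting straight to a single file looks something like this – a minimal sketch where the URL is a placeholder and the Csv.Document options are illustrative:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/aSales.csv"
  ),
  // The returned table contains a single row, for the file itself
  FileContent = Source{0}[Content],
  #"Imported CSV" = Csv.Document(
    FileContent,
    [Delimiter = ",", QuoteStyle = QuoteStyle.None]
  )
in
  #"Imported CSV"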

There are some scenarios where this advice doesn’t work, though, and there is another way to avoid this overhead and make reading data much faster: the HierarchicalNavigation option of the AzureStorage.DataLake() function. I blogged about what this option does some time ago, but at the time I didn’t realise the performance benefits of using it:

https://blog.crossjoin.co.uk/2019/09/29/hierarchical-navigation-adlsgen2-power-bi/

Consider the following scenario. Let’s say you want to connect to a CSV file in a folder that also contains a subfolder with many (in this example 20,000) other files that you’re not interested in.

[I’m only going to connect to a single file here to keep the example simple; I know I could just connect direct to the file rather than the folder and avoid the performance overhead that way]

Here’s the M code generated by the Power Query Editor using the default options to get the contents of the aSales.csv file:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder"
  ),
  Navigate = Source{[
    #"Folder Path" = "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/",
    Name = "aSales.csv"
  ]}[Content],
  #"Imported CSV" = Csv.Document(
    Navigate,
    [
      Delimiter = ",",
      Columns = 2,
      Encoding = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"Product", type text},
      {"Sales", Int64.Type}
    }
  )
in
  #"Changed Type"

In Power BI Desktop, refreshing the table that this M query returns takes 23 seconds, even with the Allow Data Preview To Download In The Background option turned off. I measured refresh time with a stopwatch, starting when I clicked the refresh button and ending when the refresh dialog disappeared. This is a lot longer than the refresh time you might see using the Profiler technique I blogged about here, but as a developer it’s the refresh time you’ll care about.

The problem here is the Source step, which returns a list of all the files in both the ParentFolder folder and the ManySmallFiles subfolder.

Now, here’s an M query that returns the same data but where the HierarchicalNavigation=true option is set:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder",
    [HierarchicalNavigation = true]
  ),
  Navigation = Source{[
    #"Folder Path" = "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/",
    Name = "aSales.csv"
  ]}[Content],
  #"Imported CSV" = Csv.Document(
    Navigation,
    [
      Delimiter = ",",
      Columns = 2,
      Encoding = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"Product", type text},
      {"Sales", Int64.Type}
    }
  )
in
  #"Changed Type"

This takes just 3 seconds to refresh in Power BI Desktop – a really big improvement.

Conclusion: always use the HierarchicalNavigation=true option in AzureStorage.DataLake() when connecting to data in ADLSgen2 storage from Power BI to get the best refresh performance and the best developer experience in Power BI Desktop.

Testing The Performance Of Importing Data From ADLSgen2 Common Data Model Folders In Power BI

Following on from my last two posts comparing the performance of importing data from ADLSgen2 into Power BI using the ADLSgen2 connector and going via Synapse Serverless (see here and here), in this post I’m going to look at a third option for connecting to CSV files stored in ADLSgen2: connecting via a Common Data Model folder. There are two ways to connect to a CDM folder in Power BI: you can attach it as a dataflow in the Power BI Service, or you can use the CDM Folder View option in the ADLSgen2 connector.

First of all, let’s look at connecting via a dataflow. Just to be clear, I’m not talking about creating a new entity in a dataflow and using the Power Query Editor to connect to the data. What I’m talking about is the option you see when you create a dataflow to attach a Common Data Model folder, as described here.

This is something I blogged about back in 2019; if you have a folder of CSV files it’s pretty easy to add the model.json file that allows you to attach this folder as a dataflow. I created a new model.json file and added it to the same folder that contains the CSV files I’ve been using for my tests in this series of blog posts.

Here’s a simplified sketch of what the contents of my model.json file looked like – the entity, attribute and file names below are placeholders rather than the real ones:
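{
  "name": "TestData",
  "version": "1.0",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "Sales",
      "attributes": [
        { "name": "TransDate", "dataType": "dateTime" },
        { "name": "GuestId", "dataType": "string" }
      ],
      "partitions": [
        {
          "name": "Partition1",
          "location": "https://xyz.dfs.core.windows.net/MyContainer/Folder/File1.csv",
          "fileFormatSettings": {
            "$type": "CsvFormatSettings",
            "columnHeaders": true,
            "delimiter": ","
          }
        }
      ]
    }
  ]
}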

Something to notice here is that I created one CDM partition for each CSV file in the folder; only the first partition is shown above. Also, I wasn’t able to expose the names of the CSV source files as a column in the way I did for the ADLSgen2 and Synapse Serverless connectors, which means I couldn’t compare some of the refresh timings from my previous two posts with the refresh timings here and had to rerun a few of my earlier tests.

How did it perform? I attached this CDM folder as a dataflow, connected a new dataset to it and ran some of the same tests from my previous two blog posts. Importing all the data with no transformations (as in the first post in this series) into a single dataset took on average 70 seconds in my PPU workspace – slower than the ADLSgen2 connector, which took 56 seconds to import the same data minus the filename column. Adding a step in the Power Query Editor of my dataset to group by the TransDate column and add a column with the count of rows per date (as in the second post in this series) resulted in an average refresh time of 29 seconds in my PPU workspace, again slightly slower than the ADLSgen2 connector.

Conclusion #1: Importing data from a dataflow connected to a CDM folder is slower than importing data using the ADLSgen2 connector with the default File System View option.

What about the Enhanced Compute Engine for dataflows? Won’t it help here? Not in the scenarios I’m testing, where the dataflow just exposes the data in the CSV files as-is and any Power Query transformations are done in the dataset. Matthew Roche’s blog post here and the documentation explain when the Enhanced Compute Engine can help performance; if I created a computed entity to do the group by in my second test above, for example, then that would benefit from it. However, in this series I want to keep a narrow focus on testing the performance of loading data from ADLSgen2 direct to Power BI without staging it anywhere.

The second way to import data from a CDM folder is to use the CDM Folder View option (which, at the time of writing, is in beta) in the ADLSgen2 connector.

I expected the performance of this method to be the same as the dataflow method, but interestingly it performed better when loading all the data with no transformations: on average it took 60 seconds to refresh the dataset. This was still a bit slower than the 56 seconds the ADLSgen2 connector took to return the same data minus the filename column using the default File System View option. I then ran the group by test on the TransDate column, which resulted in an average dataset refresh time of 27 seconds – exactly the same as the ADLSgen2 connector with the default File System View option.

Conclusion #2: Importing data from a Common Data Model folder via the ADLSgen2 connector’s CDM Folder View option may perform very slightly slower than, or about the same as, the default File System View option.

So no performance surprises again, which is a good thing. Personally, I think exposing your data via a CDM folder is much more user-friendly than giving people access to a folder full of files – it’s a shame it isn’t done more often.

Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless, Part 2: Transformations

In my last post I showed that importing all the data from a folder of CSV files stored in ADLSgen2, without doing any transformations, performed about the same whether you used Power Query’s native ADLSgen2 connector or went via Azure Synapse Serverless. After publishing that post, several people made the same point: there is likely to be a big difference if you do some transformations while importing.

So, using the same data I used in my last post, I did some more testing.

First of all, I added an extra step to the original queries to filter on the TransDate column so that only the rows for 1/1/2015 were returned. Once the datasets were published to the Power BI Service I refreshed them and timed how long the refresh took. The dataset using the ADLSgen2 connector took on average 27 seconds to refresh; the dataset connected to Azure Synapse Serverless took on average 15 seconds.
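For reference, the filter step looked something like the following sketch – the previous step name #"Changed Type" is a placeholder for whatever the last step of your query is, and it assumes the TransDate column has already been typed as a date:

#"Filtered Rows" = Table.SelectRows(
  #"Changed Type",
  // Keep only the rows for 1st January 2015
  each [TransDate] = #date(2015, 1, 1)
)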

Next, I removed the filter step and replaced it with a group by operation, grouping by TransDate and adding a column that counts the number of rows per date. The dataset using the ADLSgen2 connector took on average 28 seconds to refresh; the dataset using Azure Synapse Serverless took on average 15 seconds.
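Again as a sketch, with the previous step name as a placeholder, the group by step looked something like this:

#"Grouped Rows" = Table.Group(
  #"Changed Type",
  {"TransDate"},
  // Add a column containing the number of rows for each date
  {{"Count", each Table.RowCount(_), Int64.Type}}
)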

I chose both of these transformations because I guessed they would both fold nicely back to Synapse Serverless, and the test results suggest that I was right. What about transformations where query folding won’t happen with Synapse Serverless?

The final test I did was to remove the group by step and then add the following transformations: Capitalize Each Word (which is almost always guaranteed to stop query folding in Power Query) on the GuestId column, then splitting the resulting column into two separate columns at character position 5. The dataset using the ADLSgen2 connector took on average 99 seconds to refresh; the dataset using Synapse Serverless took on average 137 seconds. I have no idea why Synapse Serverless was so much slower here, but it’s a very interesting result.
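Here’s a sketch of those two steps, again with the previous step name as a placeholder:

#"Capitalized Each Word" = Table.TransformColumns(
  #"Changed Type",
  // Text.Proper capitalises each word and prevents query folding
  {{"GuestId", Text.Proper, type text}}
),
#"Split Column by Position" = Table.SplitColumn(
  #"Capitalized Each Word",
  "GuestId",
  // Split the text at character position 5
  Splitter.SplitTextByPositions({0, 5}),
  {"GuestId.1", "GuestId.2"}
)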

A lot more testing is needed on different transformations and different data volumes, but nevertheless I think it’s fair to say the following: if you are doing transformations while importing data into Power BI and you know query folding can take place, then using Synapse Serverless as a source may perform a lot better than the native ADLSgen2 connector; if no query folding is taking place, then Synapse Serverless may perform a lot worse. Given that some steps in a Power Query query may fold while others may not, and given that it’s often the most expensive transformations (like filters and group bys) that will fold to Synapse Serverless, more often than not Synapse Serverless will give you better performance while importing.

Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless

It’s becoming increasingly common to want to import data from files stored in a data lake into Power BI. What’s the best way of doing this though? There are a bewildering number of options with different performance and cost characteristics and I don’t think anyone knows which one to choose. As a result I have decided to do some testing of my own and publish the results here in a series of posts.

Today’s question: is it better to connect direct to files stored in ADLSgen2 using Power BI’s native ADLSgen2 connector or use Azure Synapse Analytics Serverless to connect instead? Specifically, I’m only interested in testing the scenario where you’re reading all the data from these files and not doing any transformations (which is a subject for another post).

To test this I uploaded nine CSV files containing almost 8 million rows of random data to an ADLSgen2 container.

First of all I tested loading the data from just the first of these files, NewBasketDataGenerator.csv, into a single Power BI table. In both cases – using the ADLSgen2 connector and using Synapse Serverless via a view – the refresh took on average 14 seconds.

Conclusion #1: importing data from a single CSV file into a single Power BI table performs about the same whether you use the ADLSgen2 connector or go via Synapse Serverless.

Next, I tested loading all of the sample data from all of the files into a single table in Power BI. Using the native Power BI ADLSgen2 connector, the Power Query Editor created the set of queries you’d expect when combining files from multiple sources:
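The generated queries use a parameter, a sample file and a custom function; a simplified sketch of the same pattern – with a placeholder URL, and assuming comma-delimited files with header rows – looks something like this (the generated version also adds a Source.Name column with each file’s name):

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/Folder"
  ),
  // Parse each file's binary content as CSV and promote its header row
  ParsedFiles = Table.AddColumn(
    Source,
    "Data",
    each Table.PromoteHeaders(Csv.Document([Content], [Delimiter = ","]))
  ),
  // Stack the per-file tables into a single table
  Combined = Table.Combine(ParsedFiles[Data])
in
  Combined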

The output table contained the columns from the CSV files plus the Source.Name column, which holds the name of the source file for each row.

Using a Power BI PPU workspace in the same Azure region as the ADLSgen2 container it took an average of 65 seconds to load in the Power BI Service.

I then created a view in an Azure Synapse Serverless workspace on the same files (see here for details) and connected to it from a new Power BI dataset via the Synapse connector. Refreshing this dataset in the same PPU workspace in Power BI took an average of 72 seconds.
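Under the covers this connection goes through the Synapse workspace’s serverless SQL endpoint; a minimal sketch of the M involved, with placeholder endpoint, database and view names, looks something like this:

let
  Source = Sql.Database(
    "myworkspace-ondemand.sql.azuresynapse.net",
    "SalesDb"
  ),
  // Navigate to the view created over the CSV files
  AllSales = Source{[Schema = "dbo", Item = "AllSales"]}[Data]
in
  AllSales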

Conclusion #2: importing data from multiple files in ADLSgen2 into a single table in Power BI is slightly faster using Power BI’s native ADLSgen2 connector than using Azure Synapse Serverless.

…which, to be honest, seems obvious – why would putting an extra layer in the architecture make things faster?

Next, I tested loading the same nine files into nine separate tables in a Power BI dataset and again compared the performance of the two connectors. This time the dataset using the native ADLSgen2 connector took on average 45 seconds and the Azure Synapse Serverless approach took 40 seconds on average.

Conclusion #3: importing data from multiple files in ADLSgen2 into multiple tables in Power BI may be slightly faster using Azure Synapse Serverless than using the native ADLSgen2 connector.

Why is this? I’m not completely sure, but it could be something to do with Synapse itself or (more likely) Power BI’s Synapse connector. In any case, I’m not sure the difference in performance is significant enough to justify the use of Synapse in this case, at least on performance grounds, even if it is ridiculously cheap.

Not a particularly interesting conclusion in this case I admit. But what about file format: is Parquet faster than CSV for example? What about all those options in the Power BI ADLSgen2 connector? What if I do some transformations? Stay tuned…

Read part 2 of this series here
