Importing Data From ADLSgen2 Into Power BI: Summary Of Findings

Over the last few months I’ve written a series of posts looking at different aspects of one question: what is the best way to import data from ADLSgen2 storage into a Power BI dataset? For example, is Parquet really better than CSV? Should you use Azure Synapse Serverless? In this post I’m going to summarise my findings and offer some recommendations – although, as always, I need to stress that these are the conclusions I can draw from my test results and not the absolute, incontrovertible “Microsoft-says” truth so please do your own testing too.

Partitioning makes import faster

Whatever other choices you make about importing data into Power BI, creating partitioned tables in your dataset is the key to getting the best refresh performance (see here for more details, and here for more thoughts/findings on the subject). However, creating partitioned tables manually adds a lot of complexity since you need to use external tools like Tabular Editor to do so; it also adds cost since you need to have Premium or PPU to get access to XMLA Endpoints for Tabular Editor to connect to. The bigger the Premium capacity SKU you use, the more partitions you’ll be able to refresh in parallel and the faster your refresh will be.

Since incremental refresh also creates partitions in the background, and incremental refresh does not require Premium or PPU, you may want to consider using it instead of creating partitions manually. However, it's nowhere near as flexible, and if you're connecting direct to ADLSgen2 you'll have to use the technique that Miguel Escobar describes here to ensure that query folding takes place.
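To give a rough idea of the kind of approach involved (this is a hedged sketch and not Miguel's exact technique – the URL is hypothetical, and I'm assuming the connector exposes a Date modified column whose values are a reasonable proxy for the dates in the data), the trick is to filter the list of files using the RangeStart and RangeEnd parameters that incremental refresh requires, so only the files needed for each partition are actually downloaded:

let
  // Sketch only: RangeStart and RangeEnd are the datetime parameters
  // required by incremental refresh; check that the data type of the
  // Date modified column matches them in your own connector output
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/myfolder"
  ),
  // Filter on file-level metadata so that only the files needed for
  // the partition currently being refreshed get downloaded
  FilteredFiles = Table.SelectRows(
    Source,
    each [Date modified] >= RangeStart
      and [Date modified] < RangeEnd
  ),
  // ...then combine and transform the remaining files as usual
  Combined = Table.Combine(
    Table.AddColumn(
      FilteredFiles,
      "Data",
      each Csv.Document([Content])
    )[Data]
  )
in
  Combined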

Do you need to filter?

The most important question you need to ask in this whole process is this:

  1. Are you loading only some of the data from one or more files (for example by filtering on the values in one or more columns), or
  2. Are you loading all the data (ie all the rows) from one or more files?

Scenario #1 is, I guess, a classic data lake scenario where you have multiple Power BI datasets created by different users, each of which is importing a different subset of the data from the files in the lake. Scenario #2, on the other hand, will be the case when you’re creating a set of files in ADLSgen2 that contain exactly the data you want to load into a single Power BI dataset for a specific project.

If your answer is scenario #1 and you're filtering data before you load, then you'll get the best import performance if you store your data in Parquet files and query it through Azure Synapse Serverless (see here and here). Although Power BI can take advantage of the Parquet format to a certain extent and will give you better performance if you are only importing some of the columns from a file (see here), it doesn't do predicate pushdown. There is an additional cost associated with using Synapse Serverless, of course, but it's very reasonably priced (see here for more details on how the pricing works).
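To illustrate what that looks like in practice (a hedged sketch – the Serverless SQL endpoint, database and view names are all hypothetical), you connect to Synapse Serverless with the regular SQL Server connector and let your Power Query filters fold back to it, so Serverless does the predicate pushdown against the Parquet files:

let
  // Hypothetical Synapse Serverless SQL endpoint and database
  Source = Sql.Database(
    "myworkspace-ondemand.sql.azuresynapse.net",
    "SalesDB"
  ),
  // A view defined in Serverless over the Parquet files
  Sales = Source{[Schema = "dbo", Item = "Sales"]}[Data],
  // Because this filter folds, Serverless only reads the row groups
  // and columns it needs from the Parquet files
  #"Filtered Rows" = Table.SelectRows(
    Sales,
    each [TransDate] = #date(2015, 1, 1)
  )
in
  #"Filtered Rows"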

If your answer is scenario #2 and you’re not filtering data, then you’ll get better performance (and lower costs) by connecting to your files in ADLSgen2 direct from Power BI. Using Azure Synapse Serverless isn’t necessarily a bad option but it does come with an overhead.

Combining data from multiple Parquet files can be optimised

If you are connecting direct to files in ADLSgen2 (and aren’t using Azure Synapse Serverless) and aren’t creating one partition per file then you’ll be combining data from multiple files in your Power Query M code. The code that Power Query generates automatically when you do this performs faster for CSV files than Parquet files (see here) but as I show here, with some simple changes you can create a much faster query to combine data from multiple Parquet files – although this technique does not work with CSV files.

Always use HierarchicalNavigation=true with AzureStorage.DataLake()

If you are not using Azure Synapse Serverless and are reading the data direct from ADLSgen2 using the AzureStorage.DataLake() M function, then you should always set the HierarchicalNavigation=true option. As I showed in this post, you can get some significant performance benefits from using this option. There are other options that you can set on AzureStorage.DataLake() but I couldn't find any benefits from using them (see here) – probably because requesting data from ADLSgen2 is relatively fast and the act of loading the data returned into a table in your dataset is much slower (see here).

Attaching Common Data Model folders as Dataflows can be a user-friendly option

While connecting to tables in Azure Synapse Serverless is reasonably straightforward, connecting direct to files and folders in ADLSgen2 can be quite intimidating for inexperienced Power BI users. As a result exposing ADLSgen2 data stored in Common Data Model format by attaching it as a Dataflow may be worth considering. There’s a performance difference between doing this and connecting direct to multiple CSV files (see here) but it’s certainly a lot more user-friendly. It’s also worth noting that support for the newer CDM manifest format in Dataflows has just been announced, although Dataflows don’t support CDM folders with data stored in Parquet format yet.

Optimising The Performance Of Combining Data From Multiple Parquet Files In Power Query/Power BI

In all the testing I’ve done recently with importing data from Parquet files into Power BI I noticed something strange: loading data from a folder containing multiple Parquet files seemed a lot slower than I would expect, based on the time taken to load data from a single file. So I wondered – is there something that can be optimised? It turns out there is and in this blog post I’ll show you what I did.

If you import data from a folder containing Parquet files – whether it’s a local folder or a folder in ADLSgen2 storage – you’ll see a series of queries created for you in the Power Query Editor window that looks like this:

The query called Query1 shown in the screenshot iterates over all the files in the folder you’ve chosen and calls a function that reads the data from each Parquet file. It returns a table that contains a column with the name of the original source file (which isn’t all that interesting for Parquet files) and all the columns from the Parquet files you’re combining.

Using the Parquet files from my series of posts on importing data from ADLSgen2 as a source, here’s the M code Power Query generates for this query, which I have modified to remove the column with the source file name:

let
  Source = Folder.Files("C:\MyFolder"),
  #"Filtered Hidden Files1"
    = Table.SelectRows(
    Source,
    each [Attributes]?[Hidden]? <> true
  ),
  #"Invoke Custom Function1"
    = Table.AddColumn(
    #"Filtered Hidden Files1",
    "Transform File (3)",
    each #"Transform File (3)"(
      [Content]
    )
  ),
  #"Renamed Columns1"
    = Table.RenameColumns(
    #"Invoke Custom Function1",
    {"Name", "Source.Name"}
  ),
  #"Removed Other Columns1"
    = Table.SelectColumns(
    #"Renamed Columns1",
    {"Transform File (3)"}
  ),
  #"Expanded Table Column1"
    = Table.ExpandTableColumn(
    #"Removed Other Columns1",
    "Transform File (3)",
    Table.ColumnNames(
      #"Transform File (3)"(
        #"Sample File (3)"
      )
    )
  ),
  #"Changed Type"
    = Table.TransformColumnTypes(
    #"Expanded Table Column1",
    {
      {"TransDate", type date},
      {"GuestId", type text},
      {"ProductId", type text},
      {"NetAmount", type number}
    }
  )
in
  #"Changed Type"

Here’s the output:

On my PC this query took an average of 102 seconds to refresh.

Apart from this query being slower than I expected, I also noticed that there is a “Changed Type” step at the end – which I thought was unnecessary because unlike CSV files, Parquet has typed columns. If you connect to a single Parquet file in Power Query then it recognises the column types, so why not here? Well, it’s because of the way it’s combining files by expanding table columns, and there is a way to work around this that I blogged about here:
https://blog.crossjoin.co.uk/2017/09/25/setting-data-types-on-nested-tables-in-m/

Setting a type on the table column before expanding it did indeed improve performance, but this led me to another optimisation.
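Here's a simplified sketch of that technique applied to this scenario. Note that this isn't the exact query I tested: it calls Parquet.Document directly rather than the generated Transform File function, and it assumes the same four columns as the query above.

let
  Source = Folder.Files("C:\MyFolder"),
  // Read each Parquet file into a nested table column
  AddData = Table.AddColumn(
    Source,
    "Data",
    each Parquet.Document([Content])
  ),
  // Declare the type of the nested table column before expanding it,
  // so the expanded columns keep their types and no final
  // Changed Type step is needed
  TypedData = Table.TransformColumnTypes(
    AddData,
    {
      {
        "Data",
        type table [
          TransDate = date,
          GuestId = text,
          ProductId = text,
          NetAmount = number
        ]
      }
    }
  ),
  #"Expanded Data" = Table.ExpandTableColumn(
    Table.SelectColumns(TypedData, {"Data"}),
    "Data",
    {"TransDate", "GuestId", "ProductId", "NetAmount"}
  )
in
  #"Expanded Data"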

I know that using the Table.Combine M function can perform differently to the Table.ExpandTableColumn function used in the original version of the query (although it does not always perform better). Therefore I made the following change to the query above: using Table.Combine to return a single table with all the data in (note that setting a type on the table column is not necessary for this optimisation). Here’s the new version:

let
  Source = Folder.Files("C:\Myfolder"),
  #"Filtered Hidden Files1"
    = Table.SelectRows(
    Source,
    each [Attributes]?[Hidden]? <> true
  ),
  #"Invoke Custom Function1"
    = Table.AddColumn(
    #"Filtered Hidden Files1",
    "Transform File",
    each #"Transform File"([Content])
  ),
  #"Renamed Columns1"
    = Table.RenameColumns(
    #"Invoke Custom Function1",
    {"Name", "Source.Name"}
  ),
  #"Removed Other Columns1"
    = Table.SelectColumns(
    #"Renamed Columns1",
    {"Source.Name", "Transform File"}
  ),
  Combine = Table.Combine(
    #"Removed Other Columns1"[
      Transform File
    ]
  )
in
  Combine

This version of the query took, on average, 43 seconds to refresh – a massive improvement.

If you’ve been following my series on ADLSgen2 refresh you may remember that I blogged about importing from a folder of Parquet files there too: in this post I noted that it took on average 72 seconds to load the same data from an ADLSgen2 folder in the Power BI Service using the original code; that was with the Source File column included, and removing that column made no difference to performance. This new version of the query took on average 49 seconds.

The conclusion is obvious: if you need to load data from a folder of Parquet files then you should use this new approach because the performance benefits are substantial. I know what you’re thinking: does this technique work for other file types, such as CSV? Unfortunately the answer is no, because those file types don’t have typed columns in the way Parquet does.

Parquet File Performance In Power BI/Power Query

There has been a lot of excitement around the newly-added support for reading from Parquet files in Power BI. However I have to admit that I was disappointed not to see any big improvements in performance when reading data from Parquet compared to reading data from CSV (for example, see here) when I first started testing it. So, is Power Query able to take advantage of Parquet’s columnar storage when reading data?

The answer is yes, but you may need to make some changes to your Power Query queries to ensure you get the best possible performance. Using the same data that I have been using in my recent series of posts on importing data from ADLSgen2, I took a single 10.1MB Parquet file and downloaded it to my PC. Here’s what the data looked like:

I then created a query to count the number of rows in the table stored in this Parquet file where the TransDate column was 1/1/2015:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Filtered Rows" = Table.SelectRows(
    Source,
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

Here’s the output:

I then used SQL Server Profiler to find out how long this query took to execute (as detailed here): on average it took 3 seconds.

Here’s what I saw in Power BI Desktop while loading the data just before refresh finished:

As you can see, Power Query is scanning all the data in the file.

I then added an extra step to the query to remove all columns except the TransDate column:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Removed Other Columns"
    = Table.SelectColumns(
    Source,
    {"TransDate"}
  ),
  #"Filtered Rows" = Table.SelectRows(
    #"Removed Other Columns",
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

This version of the query only took an average of 0.7 seconds to run – a substantial improvement. This time the maximum amount of data read by Power Query was only 2.44MB:

As you can see, in this case removing unnecessary columns improved the performance of reading data from Parquet files a lot. This is not always true though – I tested a Group By transformation and in that case the Power Query engine was clever enough to only read the required columns, and manually removing columns made no difference to performance.
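For reference, the Group By test I'm describing looked something like the following sketch (not necessarily the exact query I ran): even though no columns are removed explicitly, only the TransDate column is needed for the aggregation, and the engine works that out for itself.

let
  Source = Parquet.Document(
    File.Contents("C:\myfile.snappy.parquet")
  ),
  // No columns are removed explicitly, but the engine only reads the
  // TransDate column from the Parquet file for this aggregation
  #"Grouped Rows" = Table.Group(
    Source,
    {"TransDate"},
    {{"Count", each Table.RowCount(_), Int64.Type}}
  )
in
  #"Grouped Rows"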

This demonstrates that Power Query is able to take advantage of Parquet’s columnar storage to only read data from certain columns. However, this is the only performance optimisation available to Power Query on Parquet – it doesn’t do predicate pushdown or anything like that. What’s more, when reading data via the ADLSgen2 connector, the nature of Parquet storage stops Power Query from making parallel requests for data (I guess the same behaviour that is controlled by the ConcurrentRequests option), which puts it at a disadvantage compared to reading data from CSV files.

I think a lot more testing is needed to understand how to get the best performance when reading data from Parquet, so look out for more posts on this subject in the future…

[Thanks once again to Eric Gorelik from the Power Query development team for providing the information about how the Parquet connector works, and to Ben Watt and Gerhard Brueckl for asking the questions in the first place]

Bonus fact: in case you’re wondering, the following compression types are supported by the Parquet connector: GZip, Snappy, Brotli, LZ4, and ZStd.

Parquet Files In Power BI/Power Query And The “Streamed Binary Values” Error

If you’re using the new Parquet connector in Power BI there’s a chance you will run into the following error:

Parameter.Error: Parquet.Document cannot be used with streamed binary values.
Details:
[Binary]

This isn’t a bug or anything that can be fixed, so it’s important to understand why it occurs and what you can do about it.

One easy way to reproduce this problem is by trying to access a reasonably large (larger than a few MB) Parquet file stored in SharePoint, something like this:

let
  Source = SharePoint.Files(
    "https://microsoft-my.sharepoint.com/personal/abc",
    [ApiVersion = 15]
  ),
  GetFile = Source
    {
      [
        Name = "myfile.parquet",
        #"Folder Path"
          = "https://microsoft-my.sharepoint.com/personal/abc/Documents/"
      ]
    }
    [Content],
  #"Imported Parquet" = Parquet.Document(
    GetFile
  )
in
  #"Imported Parquet"

The problem is that reading data from Parquet files requires random file access, and this is something that isn’t possible in Power Query for certain data sources like SharePoint and Google Cloud Storage. This problem will never occur with locally-stored files or files stored in ADLSgen2.

There is one possible workaround but it comes with some serious limitations: buffer the Parquet file in memory using the Binary.Buffer() M function. Here’s an example of how the above query can be rewritten to do this:

let
  Source = SharePoint.Files(
    "https://microsoft-my.sharepoint.com/personal/abc",
    [ApiVersion = 15]
  ),
  GetFile = Source
    {
      [
        Name = "myfile.parquet",
        #"Folder Path"
          = "https://microsoft-my.sharepoint.com/personal/abc/Documents/"
      ]
    }
    [Content],
  #"Imported Parquet" = Parquet.Document(
    Binary.Buffer(GetFile)
  )
in
  #"Imported Parquet"

The problem with buffering files in memory like this is that it’s only feasible for fairly small files because of the limits on the amount of memory Power Query can use (see here for more information): you’re likely to get really bad performance or errors if you try to buffer files that are too large, and Parquet files are often fairly large. The best way of solving this problem is to switch to using a data source like ADLSgen2 where this problem will not happen.

[Thanks to Eric Gorelik for the information in this post]

Measuring The Performance Of AzureStorage.DataLake() Using Power Query Query Diagnostics

In my last post I showed how changing the various options on the AzureStorage.DataLake() M function didn’t have much impact on dataset refresh performance in Power BI. I’ll admit I was slightly surprised by this, but it got me wondering why this was – and so I decided to do some tests to find out.

The answer can be found using Power Query’s query diagnostics functionality. Although you can’t use it to find out what happens when a dataset refresh takes place in the Power BI Service, you can use it to view requests to web services for refreshes in Power BI Desktop, as I showed in this post. The Detailed diagnostic log query shows each request Power Query makes to get data from the ADLSgen2 API: the URLs show the names of the files being accessed, and you can also see how long each request takes, its start and end times, and the amount of data read (the Content Length value in the response), amongst other things:

I wrote a Power Query query to extract all this information and put it in a more usable format, which can then be shown in Power BI. It’s fairly rough-and-ready, but I turned it into an M function and posted the code here if you’d like to try it yourself – I haven’t done any serious testing on it though.
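That function isn't reproduced here, but to give a flavour of the kind of thing it does, here's a minimal, hypothetical sketch. It assumes the Detailed diagnostics output has been loaded as a query called DetailedLog and that it contains Start Time, End Time and Data Source Query columns – check the column names in your own diagnostics output, as they may differ.

let
  // DetailedLog is a placeholder for the Detailed query diagnostics
  // output loaded as a separate query; the column names below are
  // assumptions and may vary between Power Query versions
  Source = DetailedLog,
  // Keep only the rows that represent HTTP GET requests
  GetRequests = Table.SelectRows(
    Source,
    each [Data Source Query] <> null
      and Text.StartsWith([Data Source Query], "GET")
  ),
  // Work out how long each request took from its start and end times
  AddDuration = Table.AddColumn(
    GetRequests,
    "Duration (s)",
    each Duration.TotalSeconds([End Time] - [Start Time]),
    type number
  )
in
  AddDuration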

Here’s the data I captured for a refresh in Power BI Desktop that started at 10:55:42am yesterday and ended at 10:57:33am which took 111 seconds overall. I was using the default options for AzureStorage.DataLake() and this table only shows data for the GET requests to the ADLSgen2 API that returned data:

The main thing to notice here is that the total duration of all the requests was just 5.25 seconds – less than 5% of the overall refresh time – which explains why changing the options in AzureStorage.DataLake() didn’t make much difference to dataset refresh performance. Maybe if the files were larger, or there were more of them, changing the options would make a more noticeable impact. Of course there’s a lot more happening inside both the Power Query engine and the Analysis Services engine here beyond calling the web service to get the raw data. I also ran a Profiler trace while this refresh was running (see here for how to do this) and from the point of view of the Analysis Services engine it took 104 seconds to read the data from Power Query: the ExecuteSQL Profiler event took 4.5 seconds and the ReadData event took 99.5 seconds.

Conclusion: getting raw data from ADLSgen2 only represents a small part of the time taken to refresh a dataset that uses ADLSgen2 as a source, so any attempts to tune this may not have much impact on overall refresh times.

Testing The Performance Impact Of AzureStorage.DataLake() Options On Power BI Refresh Performance

Continuing my series on tuning the performance of importing data from ADLSgen2 into Power BI, in this post I’m going to look at the performance impact of setting some of the various options in the second parameter of the AzureStorage.DataLake() M function. In the last post in this series I showed how setting the HierarchicalNavigation option can improve refresh performance, but what about BlockSize, RequestSize or ConcurrentRequests?

Here’s what the documentation says about these options:

  • BlockSize : The number of bytes to read before waiting on the data consumer. The default value is 4 MB.
  • RequestSize : The number of bytes to try to read in a single HTTP request to the server. The default value is 4 MB.
  • ConcurrentRequests : The ConcurrentRequests option supports faster download of data by specifying the number of requests to be made in parallel, at the cost of memory utilization. The memory required is (ConcurrentRequest * RequestSize). The default value is 16.

Using the same 8-million-row set of CSV files I have used in my previous posts, and the same queries generated by the From Folder source (see this post for more details – note that in this post I am not using Synapse Serverless, just loading direct from the files), I tested various options. Here’s an example of how these options can be set:

AzureStorage.DataLake(
  "https://xyz.dfs.core.windows.net/myfolder",
  [ConcurrentRequests = 1]
)

Here are the average dataset refresh times measured in the Power BI Service using Profiler:

Option                                                           Average Refresh Time (seconds)
None set – defaults used                                         67
ConcurrentRequests=1                                             70
ConcurrentRequests=32                                            67
BlockSize=1                                                      70
BlockSize=8388608 (8MB)                                          68
RequestSize=1                                                    Error (see below)
RequestSize=8388608 (8MB)                                        68
ConcurrentRequests=32, BlockSize=8388608, RequestSize=8388608    67

From these results it looks like it’s possible to make performance slightly worse in some cases but none of the configurations tested made performance better than the default settings.

There are two somewhat interesting things to note. First, this is pretty much what the developers told me to expect when I asked about these options a while ago. However I was told that there may be some scenarios where reducing the value of ConcurrentRequests can be useful to reduce the memory overhead of a Power Query query – I guess to avoid paging on the Desktop (as discussed here) or memory errors in the Power BI Service.

Second, when I set RequestSize=1 (which means that each HTTP request was only allowed to return 1 byte of data, which is a pretty strange thing to want to do) I got the following error:

Expression.Error: The evaluation reached the allowed cache entry size limit. Try increasing the allowed cache size.

This reminds me I need to do some research into how the Power Query cache works in Power BI Desktop and write that up as a post…

Overall, no major revelations here, but sometimes it’s good to know what doesn’t make any difference as much as what does.

Update 3/3/2021: read this post to see the results of some testing I did which shows why changing these options didn’t have much impact on refresh performance.

Webcast: Accessing Web Services With Power BI And Power Query

Earlier this week I gave a webcast on accessing web services with Power BI, Power Query and M on Reza Rad’s YouTube channel. You can watch it here:

It’s an introduction to the subject: I cover the basics of using Web.Contents but don’t go into all the obscure details of what each of the options for it do (most of which I have blogged about anyway). I hope you find it useful!

Query Folding On SQL Queries In Power Query Using Value.NativeQuery() and EnableFolding=true

Here’s something that will Blow Your Mind if you’re a Power Query/M fan. Did you know that there’s a way you can get query folding to work if you’re using a native SQL query on SQL Server or Postgres as your data source?

There’s a new option on the Value.NativeQuery() M function that allows you to do this: you need to set EnableFolding=true in the options record passed to its fourth parameter. It’s documented here for the Postgres connector but it works for the SQL Server connector too. Here’s an example using the SQL Server AdventureWorksDW2017 sample database:

let
  Source = Sql.Databases("localhost"),
  AdventureWorksDW2017 = Source
    {[Name = "AdventureWorksDW2017"]}
    [Data],
  RunSQL = Value.NativeQuery(
    AdventureWorksDW2017,
    "SELECT EnglishDayNameOfWeek FROM DimDate",
    null,
    [EnableFolding = true]
  ),
  #"Filtered Rows" = Table.SelectRows(
    RunSQL,
    each (
      [EnglishDayNameOfWeek] = "Friday"
    )
  )
in
  #"Filtered Rows"

Notice that my data source is a SQL query that gets all rows for the EnglishDayNameOfWeek column from the DimDate table, and that I’m filtering down to the day name Friday in the #”Filtered Rows” step with the Table.SelectRows() function. Normally the #”Filtered Rows” step wouldn’t fold because I’ve used a native SQL query as my source, but in this case it does because I’ve set EnableFolding=true in Value.NativeQuery.

Here’s the SQL query generated by this M query:

select [_].[EnglishDayNameOfWeek]
from 
(
    SELECT EnglishDayNameOfWeek FROM DimDate
) as [_]
where [_].[EnglishDayNameOfWeek] = 'Friday'

Of course this doesn’t mean that everything can be folded now, but it’s nice to see that some folding on native SQL queries is now possible.

As I said, this only works for SQL Server and Postgres at the time of writing, and there is one other limitation: folding won’t happen if you’re passing parameters back to your SQL query in the way I describe here.
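To be clear about what that means, the pattern that won't fold is the one where parameter values are passed as a record in the third argument of Value.NativeQuery(), something like the following sketch (same sample database as above; the column and parameter names are just for illustration):

let
  Source = Sql.Databases("localhost"),
  AdventureWorksDW2017 = Source
    {[Name = "AdventureWorksDW2017"]}
    [Data],
  // Passing parameter values as a record in the third argument is
  // the pattern that stops later steps from folding
  RunSQL = Value.NativeQuery(
    AdventureWorksDW2017,
    "SELECT EnglishDayNameOfWeek FROM DimDate WHERE DayNumberOfWeek = @DayNumber",
    [DayNumber = 6]
  )
in
  RunSQL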

[Thanks to Curt Hagenlocher for the information]

Optimise The Performance Of Reading Data From ADLSgen2 In Power BI With The HierarchicalNavigation Option

Last year Marco Russo wrote a very useful blog post pointing out the performance problems you can run into when connecting to data stored in ADLSgen2 from Power BI when there are a large number of files elsewhere in the container. You can read that post here:

https://www.sqlbi.com/blog/marco/2020/05/29/optimizing-access-to-azure-data-lake-storage-adls-gen-2-in-power-query/

Marco’s advice – which is 100% correct – is that you should either pass the full path of the folder you want to connect to in the initial call to AzureStorage.DataLake() or, if you’re connecting to a single file, pass the path to the file itself. This avoids the performance overhead of reading metadata from files you’re not interested in, which can be quite considerable.
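In M terms, that means passing the full path straight into AzureStorage.DataLake() rather than connecting to the container root and navigating down – something like this hedged sketch (using a hypothetical URL in the same format as the examples later in this post):

let
  // Point the connector directly at the folder (or file) you need,
  // not at the container root or a parent folder
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/aSales.csv"
  ),
  // With a single-file path the table returned should contain just
  // that one file (an assumption of this sketch)
  FileContent = Source{0}[Content],
  #"Imported CSV" = Csv.Document(
    FileContent,
    [Delimiter = ",", Encoding = 1252, QuoteStyle = QuoteStyle.None]
  )
in
  #"Imported CSV"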

There are some scenarios where this advice doesn’t work, though, and there is another way to avoid this overhead and make reading data much faster: the HierarchicalNavigation option of the AzureStorage.DataLake() function. I blogged about what this option does some time ago but didn’t realise at the time what a performance benefit it could bring:

https://blog.crossjoin.co.uk/2019/09/29/hierarchical-navigation-adlsgen2-power-bi/

Consider the following scenario. Let’s say you want to connect to a CSV file in a folder which also contains a subfolder that contains many (in this example 20,000) other files that you’re not interested in:

[I’m only going to connect to a single file here to keep the example simple; I know I could just connect direct to the file rather than the folder and avoid the performance overhead that way]

Here’s the M code generated by the Power Query Editor using the default options to get the contents of the aSales.csv file:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder"
  ),
  Navigate = Source
    {
      [
        #"Folder Path"
          = "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/",
        Name = "aSales.csv"
      ]
    }
    [Content],
  #"Imported CSV" = Csv.Document(
    Navigate,
    [
      Delimiter  = ",",
      Columns    = 2,
      Encoding   = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers"
    = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  ),
  #"Changed Type"
    = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"Product", type text},
      {"Sales", Int64.Type}
    }
  )
in
  #"Changed Type"

In Power BI Desktop refreshing the table that this M query returns (even with the Allow Data Preview To Download In The Background option turned off) takes 23 seconds. I measured refresh time using a stopwatch, starting with the time that I clicked the refresh button and ending when the refresh dialog disappeared; this is a lot longer than the refresh time that you might see using the Profiler technique I blogged about here, but as a developer this is the refresh time that you’ll care about.

The problem here is the Source step which returns a list of all the files in the ParentFolder folder and the ManySmallFiles subfolder.

Now, here’s an M query that returns the same data but where the HierarchicalNavigation=true option is set:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder",
    [HierarchicalNavigation = true]
  ),
  Navigation = Source
    {
      [
        #"Folder Path"
          = "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/",
        Name = "aSales.csv"
      ]
    }
    [Content],
  #"Imported CSV" = Csv.Document(
    Navigation,
    [
      Delimiter  = ",",
      Columns    = 2,
      Encoding   = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers"
    = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  ),
  #"Changed Type"
    = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"Product", type text},
      {"Sales", Int64.Type}
    }
  )
in
  #"Changed Type"

This takes just 3 seconds to refresh in Power BI Desktop – a really big improvement.

Conclusion: always use the HierarchicalNavigation=true option in AzureStorage.DataLake() when connecting to data in ADLSgen2 storage from Power BI to get the best refresh performance and the best developer experience in Power BI Desktop.

Testing The Performance Of Importing Data From ADLSgen2 Common Data Model Folders In Power BI

Following on from my last two posts comparing the performance of importing data from ADLSgen2 into Power BI using the ADLSgen2 connector and going via Synapse Serverless (see here and here), in this post I’m going to look at a third option for connecting to CSV files stored in ADLSgen2: connecting via a Common Data Model folder. There are two ways to connect to a CDM folder in Power BI: you can attach it as a dataflow in the Power BI Service, or you can use the CDM Folder View option in the ADLSgen2 connector.

First of all, let’s look at connecting via a dataflow. Just to be clear, I’m not talking about creating a new entity in a dataflow and using the Power Query Editor to connect to the data. What I’m talking about is the option you see when you create a dataflow to attach a Common Data Model folder as described here:

This is something I blogged about back in 2019; if you have a folder of CSV files it’s pretty easy to add the model.json file that allows you to attach this folder as a dataflow. I created a new model.json file and added it to the same folder that contains the CSV files I’ve been using for my tests in this series of blog posts.

Here’s what the contents of my model.json file looked like:

Something to notice here is that I created one CDM partition for each CSV file in the folder; only the first CDM partition is visible in the screenshot. Also, I wasn’t able to expose the names of the CSV source files as a column in the way I did for the ADLSgen2 connector and Synapse Serverless connector, which means I couldn’t compare some of the refresh timings from my previous two posts with the refresh timings here and had to rerun a few of my earlier tests.
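The screenshot isn't reproduced here, but to give an idea of the structure, a minimal model.json for this kind of scenario looks roughly like the sketch below. The entity name, attribute data types and partition location are hypothetical, only one partition is shown, and you should check the CDM folders documentation for the full format.

{
  "name": "SalesCdmFolder",
  "version": "1.0",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "Sales",
      "attributes": [
        { "name": "TransDate", "dataType": "dateTime" },
        { "name": "GuestId", "dataType": "string" },
        { "name": "ProductId", "dataType": "string" },
        { "name": "NetAmount", "dataType": "double" }
      ],
      "partitions": [
        {
          "name": "Sales Part 1",
          "location": "https://xyz.dfs.core.windows.net/MyContainer/csv/sales1.csv"
        }
      ]
    }
  ]
}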

How did it perform? I attached this CDM folder as a dataflow, connected a new dataset to it and ran some of the same tests I ran in my previous two blog posts. Importing all the data with no transformations (as I did in the first post in this series) into a single dataset took on average 70 seconds in my PPU workspace, slower than the ADLSgen2 connector, which took 56 seconds to import the same data minus the filename column. Adding a step in the Power Query Editor in my dataset to group by the TransDate column and add a column with the count of days (as I did in the second post in this series) took on average 29 seconds to refresh in my PPU workspace, which is again slightly slower than the ADLSgen2 connector.

Conclusion #1: Importing data from a dataflow connected to a CDM folder is slower than importing data using the ADLSgen2 connector with the default File System View option.

What about the Enhanced Compute Engine for dataflows? Won’t it help here? Not in the scenarios I’m testing, where the dataflow just exposes the data in the CSV files as-is and any Power Query transformations are done in the dataset. Matthew Roche’s blog post here and the documentation explain when the Enhanced Compute Engine can help performance; if I created a computed entity to do the group by in my second test above then that would benefit from it, for example. However, in this series I want to keep a narrow focus on testing the performance of loading data from ADLSgen2 direct to Power BI without staging it anywhere.

The second way to import data from a CDM folder is to use the CDM Folder View option (which, at the time of writing is in beta) in the ADLSgen2 connector:

I expected the performance of this method to be the same as the dataflow method, but interestingly it performed better when loading all the data with no transformations: on average it took 60 seconds to refresh the dataset. This was still a bit slower than the 56 seconds the ADLSgen2 connector took using the default File System View option to return the same data minus the filename column. I then ran the test to create a group by on the TransDate column and that resulted in an average dataset refresh time of 27 seconds, which is exactly the same as the ADLSgen2 connector with the default File System View option.

Conclusion #2: Importing data from a Common Data Model folder via the ADLSgen2 connector’s CDM Folder View option may perform very slightly worse than, or about the same as, the default File System View option.

So no performance surprises again, which is a good thing. Personally, I think exposing your data via a CDM folder is much more user-friendly than giving people access to a folder full of files – it’s a shame it isn’t done more often.
