Chris Webb's BI Blog: Importing Data From ADLSgen2 Into Power BI: Summary Of Findings

Over the last few months I’ve written a series of posts looking at different aspects of one question: what is the best way to import data from ADLSgen2 storage into a Power BI dataset? For example, is Parquet really better than CSV? Should you use Azure Synapse Serverless? In this post I’m going to summarise my findings and offer some recommendations – although, as always, I need to stress that these are the conclusions I can draw from my test results and not the absolute, incontrovertible “Microsoft-says” truth so please do your own testing too.

Partitioning makes import faster

Whatever other choices you make about importing data into Power BI, creating partitioned tables in your dataset is the key to getting the best refresh performance (see here for more details, and here for more thoughts/findings on the subject). However, creating partitioned tables manually adds a lot of complexity since you need to use external tools like Tabular Editor to do so; it also adds cost since you need to have Premium or PPU to get access to XMLA Endpoints for Tabular Editor to connect to. The bigger the Premium capacity SKU you use, the more partitions you’ll be able to refresh in parallel and the faster your refresh will be.

Since incremental refresh also creates partitions in the background, and incremental refresh does not require Premium or PPU, you may want to consider using it instead of creating partitions manually but it’s nowhere near as flexible and if you’re connecting direct to ADLSgen2 then you’d have to use the technique that Miguel Escobar describes here to ensure that query folding takes place.

Do you need to filter?

The most important question you need to ask in this whole process is this:

Are you loading only some of the data from one or more files (for example by filtering on the values in one or more columns), or
Are you loading all the data (ie all the rows) from one or more files?

Scenario #1 is, I guess, a classic data lake scenario where you have multiple Power BI datasets created by different users, each of which is importing a different subset of the data from the files in the lake. Scenario #2, on the other hand, will be the case when you’re creating a set of files in ADLSgen2 that contain exactly the data you want to load into a single Power BI dataset for a specific project.

If your answer is scenario #1 and you’re filtering data before you load, then you’ll get the best import performance if store your data in Parquet files and query it through Azure Synapse Serverless (see here and here). Although Power BI can take advantage of Parquet format to a certain extent and will give you better performance if you are only importing some of the columns from a file (see here) it doesn’t do predicate pushdown. There is an additional cost associated with using Synapse Serverless, of course, but it’s very reasonably priced (see here for more details on how the pricing works).

If your answer is scenario #2 and you’re not filtering data, then you’ll get better performance (and lower costs) by connecting to your files in ADLSgen2 direct from Power BI. Using Azure Synapse Serverless isn’t necessarily a bad option but it does come with an overhead.

Combining data from multiple Parquet files can be optimised

If you are connecting direct to files in ADLSgen2 (and aren’t using Azure Synapse Serverless) and aren’t creating one partition per file then you’ll be combining data from multiple files in your Power Query M code. The code that Power Query generates automatically when you do this performs faster for CSV files than Parquet files (see here) but as I show here, with some simple changes you can create a much faster query to combine data from multiple Parquet files – although this technique does not work with CSV files.

Always use HierarchicalNavigation=true with AzureStorage.DataLake()

If you are not use Azure Synapse Serverless and reading the data direct from ADLSgen2 using the AzureStorage.DataLake() M function then you should always set the HierarchicalNavigation=true option. As I showed in this post you can get some significant performance benefits from using this option. There are other options that you can set on AzureStorage.DataLake() but I couldn’t find any benefits from using them (see here) – probably because requesting data from ADLSgen2 is relatively fast and the act of loading the data returned into a table in your dataset is much slower (see here).

Attaching Common Data Model folders as Dataflows can be a user-friendly option

While connecting to tables in Azure Synapse Serverless is reasonably straightforward, connecting direct to files and folders in ADLSgen2 can be quite intimidating for inexperienced Power BI users. As a result exposing ADLSgen2 data stored in Common Data Model format by attaching it as a Dataflow may be worth considering. There’s a performance difference between doing this and connecting direct to multiple CSV files (see here) but it’s certainly a lot more user-friendly. It’s also worth noting that support for the newer CDM manifest format in Dataflows has just been announced, although Dataflows don’t support CDM folders with data stored in Parquet format yet.

8 thoughts on “Importing Data From ADLSgen2 Into Power BI: Summary Of Findings”

Pingback: Importing Data from ADLS Gen2 into Power BI – Curated SQL
milhouse77bs says:

April 30, 2021 at 7:20 pm

Chris, perfect timing. Trying to bring in ADLSg2 data into Power BI, and this helps a lot!

Loading...

Michal Dvorak (Nolock) says:

June 8, 2021 at 7:55 am

Hello Chris,
thank you for the excellent article.
Is there also a way how to improve the performance using Storage Account Query Acceleration? https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-query-acceleration-how-to?tabs=azure-powershell%2Cpowershell#retrieve-data-by-using-a-filter

Loading...

1. Chris Webb says:
  
  June 8, 2021 at 8:40 am
  
  I wondered about that too, but to be honest I don’t know. It would probably require writing a custom connector, so it’s just easier to use Synapse Serverless I think.
  
  Loading...
  
  1. Michal Dvorak (Nolock) says:
    
    June 8, 2021 at 10:28 am
    
    Thank you.
    
    Loading...
dattrieuds says:

March 17, 2022 at 5:18 pm

Hi Chris,
Thanks for a great post (as always).
I am connecting from PowerBI (tenant located in Region A) to an ADLS Gen2 (located in Region B). It works normally when I am using PBI Desktop.
But when I publish it to PBI Service and adjust the credentials to set up the schedule refresh. It says the credential is invalid. Do you know why? It’s something related to the fact that Power BI Service and ADLS Gen2 are in 2 different regions? And how to fix that?
Thanks!

Loading...

Rene S says:

July 26, 2022 at 7:58 am

Hey Chris,
thanks for the great article. Anyways, I have a question: we are using PBI embedded and I try to load data from the storage account into a partitioned table. The table has something like 60 partitions (two month, one partition per day) and I process the partitions in blocks of 15-16 partitions. However, there seems to be no difference at all between scaling PBI to A4 and scaling it to A6. I would have expected the performance to be at least a little better. Am I missing something here?

Loading...

1. Chris Webb says:
  
  July 26, 2022 at 8:53 am
  
  Did you try increasing the amount of parallelism? See https://blog.crossjoin.co.uk/2021/06/27/increasing-refresh-parallelism-and-performance-in-power-bi-premium/ On Premium or Embedded gen2 I wouldn’t expect moving up to a higher SKU to improve performance; all that does is increase the limits on the amount of CPU you can use.
  
  Loading...