Since I started this long and rambling series of posts on importing data from ADLSgen2 into Power BI a lot of people have asked me the same question: will using Parquet files instead of CSV files perform better? In this post you’ll find out.
To test the performance of Parquet files I took the data that I have been using in this series and loaded it from the original CSV files into Parquet files using Azure Data Factory. I then repeated some of the tests I ran in the first two posts in this series – here and here. The three tests were:
- Loading all the data from the files
- Filtering the data to the date 1/1/2015
- Doing a Group By on the date column and counting the rows for each date
I ran these tests twice:
- Connecting direct to the files in ADLSgen2 (using the AzureStorage.DataLake M function) from Power BI
- Creating a view in Azure Synapse Analytics Serverless on top of the Parquet files and importing the data to Power BI from that using Power BI’s Synapse connector
Here’s a table with all the average refresh times for each test:
|Connecting to ADLSgen2 direct||Connecting via Synapse Serverless|
|Loading all data||72 seconds||91 seconds|
|Filtering to 1/1/2015||29 seconds||7 seconds|
|Group by on date||34 seconds||7 seconds|
Some points to note:
- The performance of importing all the data by connecting direct to the files in ADLSgen2 was the slightly slower here for Parquet files (72 seconds) than in my first blog post with CSV files (65 seconds)
- The performance of the two subsequent tests for filtering by date and grouping by date were only slightly worse when connecting direct to the Parquet files in ADLSgen2 as when connecting to CSV files. Filtering by date took 29 seconds for the Parquet files and 27 seconds for the CSV files; grouping by date took 34 seconds for the Parquet files and 28 seconds for the CSV files.
- Importing all the data from Parquet files via Synapse Serverless performed a lot worse than connecting direct to ADLSgen2; in fact it was the slowest method for loading all the data tested so far. Loading all the data via Synapse Serverless from Parquet files took 91 seconds whereas it only took 72 seconds via Synapse Serverless from CSV files.
- The two transformation tests, filtering by date and grouping by date, were a lot faster than connecting direct to ADLSgen2. What’s more, Synapse Serverless on Parquet was substantially faster than Synapse Serverless on CSV: filtering by date via Serverless on Parquet took 7 seconds compared to 15 seconds via Serverless on CSV, and grouping by date via Serverless on Parquet also took 7 seconds compared to 15 seconds via Serverless on CSV.
- There is another variable here that I’m not considering: what if the number and size of files used affects performance? As other people have found, it certainly affects Synapse Serverless performance; it may also affect Power Query performance too. However I don’t have the time or expertise to test this properly so I’m going to declare it out of scope and concentrate on comparing Synapse Serverless performance with the performance of connecting to the same files direct.
So, based on these results it seems fair to draw the following conclusions:
Conclusion #1: if you’re importing all the data from files in ADLSgen2 then connecting direct is faster than going via Synapse Serverless
Conclusion #2: if you’re connecting direct to files in ADLSgen2 and importing all the data from them then CSV files are faster than Parquet files
Conclusion #3: if you’re transforming data then connecting to Parquet files via Synapse Serverless is a lot faster than any other method
[UPDATE 29th March 2021: The tests in this post were run using the method of combining data from multiple files that Power Query automatically generates. In this post I show an optimised version of the code that greatly improves the performance of combining data from multiple Parquet files]