Comparing The Performance Of CSV And Excel Data Sources In Power Query

My posts from two weeks ago (see here and here) on using Process Monitor to troubleshoot the performance of Power Query queries made me wonder about another question: how does the performance of reading data from CSV files compare to the performance of reading data from Excel files? I think most experienced Power Query users in either Power BI or Excel know that Excel data sources perform pretty badly but I had never done any proper tests. I’m not going to pretend that this post is a definitive answer to this question (and once again, I would be interested to hear your experiences) but hopefully it will be thought-provoking.

To start off, I took the 153.6MB CSV file used in my last few posts and built a simple query that applied a filter on one text column, then removed all but three columns. The query ran in 9 seconds.
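For reference, the query looked something like the M below – the file path and column names are just placeholders rather than the real ones:

let
    Source = Csv.Document(
        File.Contents("C:\Data\MyLargeFile.csv"),
        [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    //filter on one text column...
    FilteredRows = Table.SelectRows(PromotedHeaders, each [SomeTextColumn] = "SomeValue"),
    //...then keep only three columns
    RemovedOtherColumns = Table.SelectColumns(FilteredRows, {"SomeTextColumn", "Column2", "Column3"})
in
    RemovedOtherColumns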

Using the technique from my last post I was able to draw the following graph from a Process Monitor log file, showing how Power Query reads data from the file:

[Image: Process Monitor graph of the amount of data read from the CSV file over time]

Nothing very surprising there. Then I opened the same CSV file in Excel and saved the data as an xlsx file with one worksheet and no tables or named ranges; the resulting file was only 80.6MB. Finally I created a duplicate of the first query and pointed it to the Excel file instead. The resulting query ran in 59 seconds – around 6 times slower!
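The only change was the Source step, which now has to navigate into the workbook; again this is only a sketch with a placeholder file path, sheet name and column names:

let
    Source = Excel.Workbook(File.Contents("C:\Data\MyLargeFile.xlsx"), null, true),
    //navigate to the single worksheet in the file
    Sheet1 = Source{[Item = "Sheet1", Kind = "Sheet"]}[Data],
    PromotedHeaders = Table.PromoteHeaders(Sheet1, [PromoteAllScalars = true]),
    FilteredRows = Table.SelectRows(PromotedHeaders, each [SomeTextColumn] = "SomeValue"),
    RemovedOtherColumns = Table.SelectColumns(FilteredRows, {"SomeTextColumn", "Column2", "Column3"})
in
    RemovedOtherColumns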

Here's a comparison of the performance of this query with that of the first query:

[Image: graph comparing the amount of data read over time from the Excel file (black line) and the CSV file (green line)]

The black line in the graph above is the amount of data read (actually the offset values showing where in the file the data is read from, which is the same thing as a running total when Power Query is reading all the data) from the Excel file; the green line is the amount of data read from the CSV file (the same data shown in the first graph above). A few things to mention:

  • Running Process Monitor while this second query was refreshing had a noticeable impact on its performance – in fact it was almost 20 seconds slower
  • The initial values of around 80 million bytes seem to show data being read from the end of the Excel file. Maybe this is Power Query reading some file metadata? Anyway, it seems as though it takes 5 seconds before it starts to read the data needed by the query.
  • There’s a plateau between the 10 and 20 second mark where not much is happening; this didn’t happen consistently and may have been connected to the fact that Process Monitor was running

In any case you don’t need to study this too hard to understand that the performance of reading data from an xlsx format Excel file is terrible compared to the performance of reading data from a CSV. So, if you have a choice between these two formats, for example when downloading data, it seems fair to say that you should always choose CSV over xlsx.

28 thoughts on “Comparing The Performance Of CSV And Excel Data Sources In Power Query”

  1. Hi Chris

    I see exactly the same but haven’t tested it as properly as you 🙂

    In order to convert the Excel files to CSV I use the https://cloudconvert.com API, and save them to Azure Blob storage. It actually works quite well.

    Let me know if you want a copy of the flow

    BR
    Erik

    1. Hi Erik

      If you would also be able to send me a copy of the Flow I would really appreciate it.

      Cheers

      Redmond

  2. Thanks for these three posts. They really help a lot when PQ users are facing the choice between Excel and CSV.

    I recently experienced a LONG, really LONG refresh time for getting data from (seven) Excel files in a folder. Guess how long? 10+ minutes…

    I realised (after reading your posts) that the potential problem is the number of worksheets in each Excel file, and probably the number of Excel files in the folder. There are four worksheets in each file but I need only two. And there are 7 files in the folder (one file per quarter).

    What I did was to combine the 7 files into three (with more data in each file) and keep only the two worksheets I need in those three files. With this simple modification, the refresh took around 2 minutes. Although that's still relatively long, it's a huge improvement compared to 10+ minutes.

    I believe the performance may get better if the files are converted into CSV… will try that out later.

    Thanks for pointing me in the right direction. 🙂

  3. Dear Chris, I thought that Excel files were better than CSV files, until I read your article. Great observation.

    I had the opportunity to attend the webinar in which you explained the M language code of Power Query with so much simplicity. May I request that you share the recording link?

  4. Try with xlsb – it's worse than xlsx. But this is a known issue as per Microsoft, because the xlsb file is read using an Access driver. Strange, no?

  5. Hi Chris
    The fact that Power BI works better with CSV sources than Excel sources was reported a long time ago (see https://www.excelando.co.il/en/analyzing-power-query-performance-source-large-files/#blog ).

    But it is the first time I can see a measure about it that can be easily replicated.

    And there is another point worth mentioning that I figured out thanks to your last posts:
    “Unioning (appending) CSVs is not a cause of slowness” is not true.
    The location of the CSV files can be very important: when the CSV sources sit in a remote location (server, NAS), each access to even the tiniest file may take some time. For 1 file it is not noticeable, but for many files it makes a big difference.
    In some queries I used to use one CSV file with currency rates for each day. The CSV files were very small (5KB) but 5 years of files means 1,826 files to access at every refresh!!!
    Now I have made yearly CSV files – only 5 CSV files to access. The same query now takes a few seconds while it used to take more than a minute.
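    In case it is useful to anyone, appending all the CSVs in a folder only needs a few lines of M – something like the sketch below, with a made-up folder path. Every file still has to be opened individually, which is why fewer, larger files make such a difference:

    let
        Source = Folder.Files("\\MyServer\CurrencyRates"),
        //keep only the csv files in the folder
        CsvFiles = Table.SelectRows(Source, each Text.Lower([Extension]) = ".csv"),
        //parse each file and promote its headers
        Parsed = Table.AddColumn(CsvFiles, "Data", each Table.PromoteHeaders(Csv.Document([Content]))),
        //append everything into one table
        Combined = Table.Combine(Parsed[Data])
    in
        Combined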

  6. Interesting explanation. To enrich it a bit more, I'm leaving a YouTube video where the same thing is done.

  7. Using CSV files, how do you overcome dates coming through as text? I recently had one source file where I was not able, in either Power Query or Excel, to convert the date to serial number/date formatting.

  8. I think, as it is written at datawiz.co.il, that Power BI will replace Excel entirely, especially with the use of AI, Big Data and intermediate solutions.

  9. Hey Chris! Just wondering, how does the speed of CSV files compare to accessing a table in the current workbook (i.e. Excel.CurrentWorkbook())?

  10. Hello Chris, I ran into performance trouble when importing XLSX files from a SharePoint folder, exactly as you have analyzed here. Not much trouble in the beginning, but when projects get larger you have to wait 30 minutes to load data this way. So, motivated by your hints, I've spent some time shifting some benchmark files from XLSX to CSV import. Performance seems to be much better indeed, but now nasty format and regional-setting errors show up that the Power BI XLSX data connector never complained about.

    Problem 1: line breaks inside double quotes split rows.
    Solution: Transform Sample File >> Applied Steps >> Gear >> Comma-Separated Values >> Line breaks >> Ignore quoted line breaks

    Problem 2: my regional settings are German but the delimiters are set to US comma and semicolon for some good reasons. Decimal numbers are not imported correctly >> error (automatic type conversion) or text string (everything you can save manually).
    Solution:
    1st trial: File >> Options & Settings >> Options >> Current File >> Regional Settings >> Locale for Import >> English (United States). Then all dates are misinterpreted.
    2nd trial: type conversion to text >> Add custom column >> =Number.FromText([Xyz], "en-US") >> correct column result.
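    Written out in M, the two workarounds might look roughly like this – the file path and column names are invented, and I believe QuoteStyle.Csv is what the "Ignore quoted line breaks" option corresponds to:

    let
        Source = Csv.Document(
            File.Contents("C:\Data\SomeUsFormattedFile.csv"),
            //QuoteStyle.Csv keeps line breaks inside double quotes as part of the field
            [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]
        ),
        PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
        //convert US-formatted numbers and dates with an explicit en-US culture,
        //independent of the file-level locale setting
        ChangedTypes = Table.TransformColumnTypes(
            PromotedHeaders,
            {{"Amount", type number}, {"OrderDate", type date}},
            "en-US"
        )
    in
        ChangedTypes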

    But hey, this CSV cleansing stuff done by manual operations is cumbersome and by no means robust. So I am afraid I'll move on and keep waiting for my XLSX imports – or do you have some panacea for an M code novice?

    1. Hi,

      I've faced similar challenges with CSV. Your intuition that the answer lies in M code is correct. Rather than importing multiple files and splitting and combining them in the usual way, the answer is to import the files as a one-column table and then split that column by delimiter as an additional step. It may be necessary to create interim steps to clean the data in that column, but the beauty of PQ is that you only create the flow once. The UI options for the column split function provide some handy controls, but more can be done directly with the M code if necessary.
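      Roughly, the pattern I mean looks like this (the file path, delimiter and column names are just examples):

      let
          //read the file as plain lines of text, giving a one-column table
          Source = File.Contents("C:\Data\Messy.csv"),
          RawLines = Table.FromColumns({Lines.FromBinary(Source, null, null, 65001)}, {"Raw"}),
          //any interim cleaning of the raw text would go in here, then
          //split the single column by the delimiter as a separate step
          SplitByDelimiter = Table.SplitColumn(
              RawLines,
              "Raw",
              Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),
              {"Column1", "Column2", "Column3"}
          ),
          PromotedHeaders = Table.PromoteHeaders(SplitByDelimiter, [PromoteAllScalars = true])
      in
          PromotedHeaders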

  11. Nice article. This is because of the structure of CSV and XLSX files. CSV files are simply text files containing only data, with no metadata. XLSX files, however, are really ZIP archives of XML files. So PQ first has to unzip the contents somewhere in the temp folder, then go through the entire XML tree to find the data among metadata like the formatting, formulas etc. of the cells. For a simple two-dimensional table, CSV is the best!
