A while ago I found myself wondering – as I often do about this kind of thing – what the undocumented third parameter of the Excel.Workbook() M function (called delayTypes) actually does. I found a 2017 forum post from Guy Hunkin of the Excel Power Query team here, which gives the following summary:
Originally, Excel.Workbook used to read the entire workbook data to accurately assign types to the columns. This was EXTREMELY slow on large workbooks.
Having this in mind, we added this flag to delay the behavior. When set to “true”, we don’t infer any column types. Instead, the UI uses its normal inference algorithm to generate a Table.TransformColumnTypes step with the inferred types and if it inferred wrongly, the user can update the transformation explicitly.
I also posted a question on the Power Query MSDN forum and got some more details from Colin Banfield, one of the many M experts who hang out there, as well as doing some investigations of my own. This post summarises what I found out.
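For reference, the function’s signature looks roughly like the following – the first two parameters are documented, while delayTypes, the third, is not:

[sourcecode language="text"]
// Approximate signature of Excel.Workbook – delayTypes is the
// undocumented third parameter discussed in this post
Excel.Workbook(
    workbook as binary,
    optional useHeaders as any,
    optional delayTypes as nullable logical
) as table
[/sourcecode]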
First of all, a simple demonstration of what it does. Consider the following Excel table:
Connecting to this table in the December 2018 release of Power BI Desktop generates the following M query:
[sourcecode language="text" padlinenumbers="true" highlight="8"]
let
    Source =
        Excel.Workbook(
            File.Contents(
                "C:\DelayTypesTest.xlsx"
            ),
            null,
            true
        ),
    SourceData_Table =
        Source{[Item="SourceData",Kind="Table"]}[Data],
    #"Changed Type" =
        Table.TransformColumnTypes(
            SourceData_Table,
            {
                {"TextColumn", type text},
                {"IntegerColumn", Int64.Type},
                {"DecimalColumn", type number},
                {"DateColumn", type date}
            }
        )
in
    #"Changed Type"
[/sourcecode]
Notice that the delayTypes option is now automatically set to true and that there is a “Changed Type” step. The output of this query, as seen in the Power Query Editor, is what you would expect:
Compare this with the output of the following query which has delayTypes set to false and has no “Changed Type” step:
[sourcecode language="text"]
let
    Source =
        Excel.Workbook(
            File.Contents(
                "C:\DelayTypesTest.xlsx"
            ),
            null,
            false
        ),
    SourceData_Table =
        Source{[Item="SourceData",Kind="Table"]}[Data]
in
    SourceData_Table
[/sourcecode]
There are two things to notice:
- With delayTypes set to false no “Changed Type” step is necessary: the data types of the columns are set by Excel.Workbook() itself. With delayTypes set to true all the columns returned by Excel.Workbook() have their data type set to Any, and a subsequent “Changed Type” step is necessary to set the data types to something useful (a quick way to verify this is shown in the sketch after this list).
- With delayTypes set to true, the IntegerColumn and DecimalColumn columns have different data types assigned to them by the “Changed Type” step: IntegerColumn becomes Int64.Type whereas DecimalColumn becomes type number. With delayTypes set to false both columns are set to type number, so it looks like Excel.Workbook() with delayTypes set to false only converts columns to primitive data types.
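A quick way to verify these type assignments is Table.Schema(), which returns one row per column of a table with the column’s type shown in the TypeName field. Here’s a minimal sketch, reusing the query above:

[sourcecode language="text"]
let
    Source =
        Excel.Workbook(
            File.Contents("C:\DelayTypesTest.xlsx"),
            null,
            true
        ),
    SourceData_Table =
        Source{[Item="SourceData",Kind="Table"]}[Data],
    // one row per column: with delayTypes set to true the TypeName
    // field shows Any.Type for every column, while with delayTypes
    // set to false it shows the types assigned by Excel.Workbook()
    Schema = Table.Schema(SourceData_Table)
in
    Schema
[/sourcecode]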
The next question is: if this is primarily a performance optimisation, how much of an impact does it have? I built four queries that read all the data from an 80MB xlsx file, testing every combination of delayTypes true/false and a “Changed Type” step being present or not (just in case having a “Changed Type” step carried an overhead of its own). Here are the results – the main timings are in minutes and seconds, and the figures in brackets are the length of the “Evaluating” phase of query execution:
|  | delayTypes=false | delayTypes=true |
| --- | --- | --- |
| no “Changed Type” step | 2:08 (1:21) | 1:06 (0:00) |
| “Changed Type” step present | 2:19 (1:30) | 1:08 (0:00) |
I wouldn’t read too much into these exact timings because in my experience timings for the same query can fluctuate quite a lot, but one thing is clear: setting delayTypes to true results in much better performance. Indeed with delayTypes set to false nearly half the time was spent in the “Evaluating” phase whereas with delayTypes set to true there was no “Evaluating” phase at all. As a result I can’t think of a reason not to set the delayTypes option to true.
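If you want to reproduce a test like this, here’s a minimal sketch of one of the four variants – the file path and table name here are placeholders, not the ones from my actual test:

[sourcecode language="text"]
let
    // hypothetical large source file and table name, for illustration only
    Source =
        Excel.Workbook(
            File.Contents("C:\BigTestFile.xlsx"),
            null,
            true
        ),
    Data_Table =
        Source{[Item="Data",Kind="Table"]}[Data]
    // the other three variants flip delayTypes to false and/or add
    // a Table.TransformColumnTypes step here
in
    Data_Table
[/sourcecode]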
It’s worth noting that delayTypes is set to false by default if you don’t specify this option, so be aware of this if you are writing your own M code.
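In other words, assuming the behaviour described above, the two expressions in the following sketch should be equivalent – so if you want the faster behaviour you have to pass true explicitly:

[sourcecode language="text"]
let
    // omitting the third parameter...
    DefaultBehaviour =
        Excel.Workbook(File.Contents("C:\DelayTypesTest.xlsx"), null),
    // ...should behave the same as passing false explicitly
    ExplicitFalse =
        Excel.Workbook(File.Contents("C:\DelayTypesTest.xlsx"), null, false)
in
    DefaultBehaviour
[/sourcecode]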
Ohhh man! I bet “Automatically Detect Column Types” will make it default. I’m going to check right now – this could be a game changer!
Yes, I have noticed the time lag when using the false setting for delayTypes. I didn’t think twice about switching to true once I realised it was much faster that way. But thank you for giving me an opportunity to revisit and understand what that setting really means. I suspect that some of the updates they have rolled out to enhance and improve the original data access code may have improved the speed of Excel.Workbook, which was initially identified as a time-consuming process. I guess the development team needs to regroup and reassess all the existing data access functions to change the default values of their parameters.
Note that delayTypes works only for the .xlsx (OpenXML) format and is ignored for the old .xls format. So you need to convert files to OpenXML before using Excel.Workbook() to prevent data loss due to incorrect data type detection.
May I ask a question about an issue I have? I realised that column A of your workbook is blank and that Power Query still recognises it, most likely because you referenced a table. In my case there is no table, just a sheet, and column A is blank in the standard file. At first Power Query recognises the blank column A, but if I update the sheet somewhere – for example by removing columns E and F, which contain data, while column A is still blank – Power Query no longer recognises the blank column A and the query breaks. In my scenario, weekly reports arrive in a folder and Power Query consolidates them into Power Pivot and/or Power BI. As far as I can see it looks like a bug or something similar. You can see my post at https://community.powerbi.com/t5/Power-Query/Power-Query-removes-first-blank-column-of-the-table/m-p/1956462#M57693. Do you have any idea about this? 🙏