Making Sure All Columns Appear When You Combine Data From Multiple Files In Power BI/Power Query M

Here’s a really common problem that occurs when combining data from multiple files, or indeed any type of data source, in Power BI/Power Query/Excel Get&Transform. Imagine you have a folder with two Excel files in it, and each Excel file contains a table called SalesTable:

When you use the “From Folder” data source to combine all the data from all the Excel files in this folder, you get a table like this:

…and you’re happy. Then, at some later date a third file is added to the folder that has an extra column in its SalesTable table called Comments:

You refresh your query, though, and you don’t see the Comments column anywhere:

Why not? If you look at the query that has been generated and go back to the “Removed Other Columns1” step, you’ll see a table with a column containing table values:

…and you’ll also see that the next step in the query, “Expanded Table Column1”, uses the Table.ExpandTableColumn function – the M function that gets called if you click the Expand/Aggregate button in the column header highlighted in the previous screenshot – to flatten these nested tables out. And the problem is that Table.ExpandTableColumn needs to know in advance the names of the columns you want to expand.
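For illustration, the generated step looks something like this – the step, column and field names here are based on the screenshots and may differ in your own query:

[sourcecode language="text"]
#"Expanded Table Column1" =
    Table.ExpandTableColumn(
        #"Removed Other Columns1",
        //the column containing the nested table values
        "Data",
        //the hard-coded list of columns to expand - Comments isn't in it
        {"Month", "Product", "Sales"},
        {"Month", "Product", "Sales"}
    )
[/sourcecode]

Any column that isn’t in that hard-coded list, like Comments, is silently dropped.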

Now this is an extremely common, and powerful, Power Query/M pattern. Apart from the “From Folder” functionality for automatically combining data from multiple files, it’s something I find myself building manually all the time: write a function, for example one that makes a single call to a web service; create a table containing one row for each call to the web service that I want to make; use the Invoke Custom Function button to call my function for each row; and then expand to get all the data from all the function calls. And the more I use this pattern, the more I run into situations where I don’t see columns I’m expecting to see, because I’ve done an Expand in an earlier step that has a hard-coded list of column names in it (it’s a very similar problem to the one that Ken Puls blogged about here). It’s a pain to have to keep changing this list, and the real problem comes when you don’t actually know in advance what the names of the columns to expand are.
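Here’s a minimal sketch of that pattern – the web service URL, function name and column names are all invented for illustration:

[sourcecode language="text"]
let
    //Hypothetical function that makes a single web service call
    //and returns the response as a one-row table
    GetData = (id as text) as table =>
        Table.FromRecords({
            Json.Document(
                Web.Contents("https://example.com/api/items/" & id))}),
    //One row per call to make
    Ids = #table({"Id"}, {{"1"}, {"2"}}),
    //What the Invoke Custom Function button generates
    Invoked = Table.AddColumn(Ids, "Result", each GetData([Id])),
    //The expand step with its hard-coded column list - the problem
    Expanded = Table.ExpandTableColumn(Invoked, "Result", {"Price"})
in
    Expanded
[/sourcecode]

If a later call returns an extra field, the hard-coded {"Price"} list means you’ll never see it.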

One solution would be to do something similar to what I show in this post: iterate through all the tables in the table column, find a distinct list of column names, and then use this list with Table.ExpandTableColumn. However, there is an easier way to handle this: use Table.Combine instead of Table.ExpandTableColumn. The great thing about Table.Combine is that it will always return all of the columns from all of the tables it’s combining.
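You can see this behaviour with two small inline tables (a minimal illustration, not the post’s actual data):

[sourcecode language="text"]
let
    T1 = #table({"Product", "Sales"}, {{"Apples", 1}}),
    //T2 has an extra Comments column
    T2 = #table({"Product", "Sales", "Comments"}, {{"Pears", 2, "New!"}}),
    //The result has all three columns; T1's row gets null for Comments
    Result = Table.Combine({T1, T2})
in
    Result
[/sourcecode]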

Here’s a function that shows how it can be used:

[sourcecode language="text" padlinenumbers="true"]
(TableColumn as list, optional SourceNameColumn as list) =>
let
    //If no list of source names is passed in, use the tables as they are
    AddIDs =
        if SourceNameColumn = null
        then
            TableColumn
        else
            let
                //Pair each table with its source name
                ZipNames =
                    List.Zip({TableColumn, SourceNameColumn}),
                //Function that adds a Source column to a table,
                //containing the table's source name
                AddColumnFunction =
                    (ListIn as list) =>
                        Table.AddColumn(ListIn{0}, "Source", each ListIn{1}),
                //Call the function for each table/name pair
                AddColumns =
                    List.Transform(ZipNames, each AddColumnFunction(_))
            in
                AddColumns,
    //Table.Combine returns all columns from all of the tables
    Combine = Table.Combine(AddIDs)
in
    Combine
[/sourcecode]

This function takes a list of tables and, optionally, a list of text values containing a name for each table (this optional parameter accounts for the majority of the code – without it, all you would need is the Combine step). If you paste this code into a new query called, say, CombineTables, you can either call it by adding some M code to an existing query or, more easily, call it directly from the UI. In the latter case, when you click on the function query in the Query Editor window you’ll see this:

Assuming you already have a query like the one shown above that contains a column with table values in it and another column containing the original Excel file names, you need to click the Choose Column button for the TableColumn parameter and select the column that contains the table values in the dialog that appears:

…and then do the same thing for the SourceNameColumn parameter:

…and then click the Invoke button back in the Query Editor, and you’ll get a table containing all of the data from the SalesTable table in each workbook, including the Comments column from the third Excel workbook:

With no hard-coded column names you’ll now always get all of the data from all of the columns in the tables you’re trying to combine.
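And if you’d rather call the function from M than from the UI, the call might look something like this (the step and column names here are assumptions based on the query shown above):

[sourcecode language="text"]
//Pass the column of nested tables and the column of file names as lists
#"Combined Data" =
    CombineTables(
        #"Removed Other Columns1"[Data],
        #"Removed Other Columns1"[Name])
[/sourcecode]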

9 thoughts on “Making Sure All Columns Appear When You Combine Data From Multiple Files In Power BI/Power Query M”

  1. Hey Chris,
    Thanks for the interesting blog. I have a related question, please. Suppose that I have two tables T1 and T2, with columns named A1 to A10 in table T1 and B1 to B9 in table T2.
    What is the M script I need to write to get a new query containing the appended data from both tables, without duplicates, where the data comes only from the (A1, A2) and (B1, B2) fields respectively?
    In other words, I need a query with two fields, A1 + B1 and A2 + B2, with distinct values from both tables.

    1. Chris Webb says:

      I’m not sure I understand what you want to do here exactly, but if you know the column names in advance can’t you just rename the columns from T2 and do an Append?
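
      For example, something along these lines (a sketch using the column names from your question):

      Renamed = Table.RenameColumns(T2, {{"B1", "A1"}, {"B2", "A2"}}),
      Appended = Table.Combine({Table.SelectColumns(T1, {"A1", "A2"}),
          Table.SelectColumns(Renamed, {"A1", "A2"})}),
      Result = Table.Distinct(Appended)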

  2. Hey! This is really nice indeed. However, from my experience, which is mainly with Power Query for Excel, I find it is a pain to create UDFs since the code will always run very slowly, especially if you have a big data model (with more than 30 or 40 queries) and with big data volumes. I think it’s not possible, but is there any way one could implement a similar technique without the UDFs? Cheers! 🙂

    1. Chris Webb says:

      Functions are not necessarily bad for performance – you could rewrite this to avoid using them fairly easily, but I’m not sure it would make much difference to performance.
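
      For example, here’s a sketch of the same logic inlined in a single query, with no separate function query (the Source step and its Data and Name columns are assumptions):

      AddedSource = Table.AddColumn(Source, "WithSource",
          (outer) => Table.AddColumn(outer[Data], "Source", (inner) => outer[Name])),
      Combined = Table.Combine(AddedSource[WithSource])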

  3. Hi Chris,

    Nice function. It solves a lot of issues on this end. There is still the issue of what to do with the new column (“Comments” in this case) once it’s added to your data set. For example, what type of data should it be? (Notice it’s “Any” in the example.) If it’s been added, there must be a purpose for it, so there are still some ETL steps that must be done with the new column(s). So, my point being, one would still need to go back and modify the query to address the added column(s). At least they are not caught by a column name change at the 11th hour, because someone changed a column name or added one to the daily email distribution of the data 4 days ago without telling anyone! (Actually happened to us…)

    What happens if someone changes the name of an already existing column, like changing “Employee ID” to “EMPID”? Using the custom function, I suppose you would end up with two columns, one for Employee ID and one for EMPID, which would need to be combined later in the ETL steps.
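
    I suppose something like this would merge them back together afterwards (a rough sketch, with assumed step names):

    Coalesced = Table.AddColumn(Combined, "Employee ID Merged",
        each if [Employee ID] <> null then [Employee ID] else [EMPID]),
    Cleaned = Table.RemoveColumns(Coalesced, {"Employee ID", "EMPID"})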

    At least you don’t lose the information, which is a big help.

    Thanks!

    1. Hi Carl,

      I just came across this post and it’s a total game-changer.

      However, I had the same questions as you, as I had a lot of data stored in the FileName and also wanted to extract the ModifiedDate of the specific files.

      I added a new column by joining the ModifiedDate and FileName (with “_” as the delimiter) and used this as the source from which I invoked the function provided by Chris.

      Once all the data was combined I just split the column on the “_” delimiter back into the ModifiedDate and FileName columns.
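
      In M, the steps looked roughly like this (a sketch – the Date modified and Name columns come from the From Folder query, and the other names are assumptions):

      Merged = Table.AddColumn(Source, "SourceKey",
          each DateTime.ToText([Date modified]) & "_" & [Name]),
      Combined = CombineTables(Merged[Data], Merged[SourceKey]),
      SplitBack = Table.SplitColumn(Combined, "Source",
          Splitter.SplitTextByEachDelimiter({"_"}),
          {"ModifiedDate", "FileName"})

      (Splitter.SplitTextByEachDelimiter with a single delimiter splits only at the first “_”, which matters if the file names themselves contain underscores.)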

  4. For anyone here in the comments like I am, wondering why new columns are still missing: it’s because the default “Transform File” step that PBI constructs uses a sample file to generate a hard-coded count of columns:
    Source = Csv.Document(Parameter1, [Delimiter=",", Columns=8, Encoding=1252, QuoteStyle=QuoteStyle.None]),

    This hard-coded number means that a new file with additional columns (or any existing file with more columns than your sample file) will be cut off.

    You can change the number to a higher column count, and you’ll get blank columns from the files that didn’t have as many. If anyone has an alternative to make this more dynamic, I’d welcome the advice!
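
    One possible fix is to remove the hard-coded Columns option altogether, so that Csv.Document infers the column count from each file (behaviour worth verifying on your own data, as inference can vary by version):

    Source = Csv.Document(Parameter1, [Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.None]),

    The resulting tables can then be combined with Table.Combine as described in the post, so files with different numbers of columns still line up.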
