Working With Compression In Power Query And Power BI Desktop

If you’re reading this post there’s one important question you’ll probably want to ask: is it possible to extract data from a zip file in Power Query/Power BI? The answer is, unfortunately, no (at least at the time of writing). As this answer from Tristan on the dev team explains, because there are so many flavours of zip file out there it’s an extremely difficult problem to solve – so it hasn’t been attempted yet. That said, there are two other mildly interesting things to learn about compression in Power Query/Power BI Desktop that I thought were worth blogging about…

The first is that Power Query/Power BI can work with gzip files. For example, given a gzip file that contains a single csv file, here’s an example M query showing how the Binary.Decompress() function can be used to extract the csv file from the gzip file and then treat the contents of the csv file as a table:

let
    Source = Binary.Decompress(
     File.Contents(
      "C:\Myfolder\CompressedData.csv.gz"),
       Compression.GZip),
    #"Imported CSV" = Csv.Document(Source,
     [Delimiter=",",Encoding=1252]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV"),
    #"Changed Type" = Table.TransformColumnTypes(
     #"Promoted Headers",{
      {"Month", type text}, {" Sales", Int64.Type}
      })
in
    #"Changed Type"
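If you want to sanity-check a .csv.gz file outside Power Query, the same round trip – decompress the gzip bytes, parse the result as comma-delimited text, promote the first row to headers – can be sketched in a few lines of Python (the sample data here is just an illustration; note that Power Query's Encoding=1252 corresponds to Python's cp1252 codec):

```python
import csv
import gzip
import io

# Build a small gzip-compressed CSV in memory, standing in for CompressedData.csv.gz
csv_text = "Month, Sales\nJan,100\nFeb,200\n"
compressed = gzip.compress(csv_text.encode("cp1252"))

# The equivalent of Binary.Decompress(..., Compression.GZip) followed by Csv.Document:
# decompress the bytes, then parse them as CSV with a comma delimiter
decompressed = gzip.decompress(compressed).decode("cp1252")
rows = list(csv.reader(io.StringIO(decompressed)))

# The first row becomes the column names, like Table.PromoteHeaders
header, data = rows[0], rows[1:]
print(header)
print(data)
```

Note the leading space in " Sales": both Csv.Document and Python's csv module preserve whitespace in header cells exactly as it appears in the file, which is why the M query above references the column as " Sales".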

The other is that you’ll see Binary.Decompress() used when you import an Excel workbook that contains a linked table into Power BI Desktop. For example, consider an Excel workbook that contains the following table:

[image: the source table in Excel]

If this table is imported into the Excel Data Model as a linked table, and you then save the workbook and try to import it into Power BI using File/Import/Excel Workbook Contents:

[image: the File/Import/Excel Workbook Contents menu option]

… you’ll see this message:

[image: the first import message]

Click Start and you’ll get another message:

[image: the second message, offering the Copy Data option]

If you choose the Copy Data option, the data from the Excel table will be copied into Power BI. But where is it stored exactly? A look in the Query Editor at the query that returns the data shows that it’s embedded in the M code itself:

let
    Source = Table.FromRows(
     Json.Document(
      Binary.Decompress(
       Binary.FromText(
        "i45WMlTSUTJSitWJVjIGskzALFMgy0wpNhYA", 
        BinaryEncoding.Base64), 
       Compression.Deflate)), 
      {"A","B"}),
    #"Changed Type" = Table.TransformColumnTypes(
     Source,{{"A", Int64.Type}, {"B", Int64.Type}})
in
    #"Changed Type"

That big chunk of text in the middle of the Source step is the data from the Excel table stored as a compressed JSON document, and again Binary.Decompress() is used to extract this data.
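As a rough illustration of what that Source step does, here is the same chain of operations – Base64 decode, Deflate decompress, JSON parse – in Python. Compression.Deflate corresponds to a raw Deflate stream with no zlib header, which Python's zlib handles when given a window-bits value of -15:

```python
import base64
import json
import zlib

# The Base64 text embedded in the generated M query above
encoded = "i45WMlTSUTJSitWJVjIGskzALFMgy0wpNhYA"

# Binary.FromText(..., BinaryEncoding.Base64): decode the Base64 text to bytes
raw = base64.b64decode(encoded)

# Binary.Decompress(..., Compression.Deflate): raw Deflate, no zlib header
decompressed = zlib.decompress(raw, -15)

# Json.Document: the payload is a JSON array of rows, which Table.FromRows
# then turns into a two-column table named A and B
rows = json.loads(decompressed)
print(rows)
```

The JSON stores the cell values, and the #"Changed Type" step in the generated query is what converts them to Int64 afterwards.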

21 thoughts on “Working With Compression In Power Query And Power BI Desktop”

  1. Thanks for sharing Chris. The other file format I get asked about often is PDF. Do you know of any plans? I think there is a third party app that can do this, but native PQ support would be great. I realise that data structure is an issue here too, but even if it just used the same type of “look for table” algorithm used on a web page, it would be great.

  2. That’s funny – not supporting ZIP because of the number of flavours. Still, every Office file (xlsx, docx, etc) IS a zip, and so are the “compressed folders” inside Windows. So one wonders: why not at least support the ZIP standard implemented in Win/Office..?

  3. It is a shame that Excel/PowerPivot Datamodel does not support this gzip method, I just tried it out. Unless I am doing something wrong. It would be super handy for me.

  4. I’ve had success today using Power Query to unzip a .zip file. In my case a compressed xml file.
    This is the code that I used:

    let
    Source = File.Contents("sheet1.zip"),

    MyBinaryFormat = BinaryFormat.Record([MiscHeader=BinaryFormat.Binary(18),
    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
    UnCompressedFileSize=BinaryFormat.Binary(4),
    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
    TheRest=BinaryFormat.Binary()]),

    MyCompressedFileSize = MyBinaryFormat(Source)[FileSize]+1,
    MyFileNameLen = MyBinaryFormat(Source)[FileNameLen],

    MyBinaryFormat2 = BinaryFormat.Record([Header=BinaryFormat.Binary(30), Filename=BinaryFormat.Binary(MyFileNameLen), Data=BinaryFormat.Binary(MyCompressedFileSize), TheRest=BinaryFormat.Binary()]),

    GetDataToDecompress = MyBinaryFormat2(Source)[Data],
    DecompressData = Binary.Decompress(GetDataToDecompress, Compression.Deflate),
    #"Imported XML" = Xml.Tables(DecompressData)
    in
    #"Imported XML"

    Basically, I parse the file once to grab some metadata that is needed (filesize and filename length).

    Then I parse the file to extract the compressed data and discard anything after the compressed data.

    I’m not sure how broadly this would work – obviously deflated files only. It could be fairly easily adapted to extract files from .zip files with multiple files.

  5. Improved to deal with “extras”

    let
    Source = File.Contents("C:\Users\User\Dropbox\Apps\Pythonista Sync99\parsing word\Hyperlinks.zip"),

    MyBinaryFormat = BinaryFormat.Record([MiscHeader=BinaryFormat.Binary(18),
    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
    UnCompressedFileSize=BinaryFormat.Binary(4),
    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
    ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
    TheRest=BinaryFormat.Binary()]),

    MyCompressedFileSize = MyBinaryFormat(Source)[FileSize]+1,
    MyFileNameLen = MyBinaryFormat(Source)[FileNameLen],
    MyExtrasLen = MyBinaryFormat(Source)[ExtrasLen],

    MyBinaryFormat2 = BinaryFormat.Record([Header=BinaryFormat.Binary(30), Filename=BinaryFormat.Binary(MyFileNameLen), Extras=BinaryFormat.Binary(MyExtrasLen), Data=BinaryFormat.Binary(MyCompressedFileSize), TheRest=BinaryFormat.Binary()]),

    GetDataToDecompress = MyBinaryFormat2(Source)[Data],
    DecompressData = Binary.Decompress(GetDataToDecompress, Compression.Deflate),
    #"Imported XML" = Xml.Tables(DecompressData)
    in
    #"Imported XML"

    • Nice work! I would be interested to know the amount of variation there is out there in zip file formats, and the likelihood that this code will work with any given zip file.

  6. Thanks Chris. I’ve been experimenting with this a bit more today. One obvious limitation is that it is only decompressing the first file in the zip file and not paying any attention to the Central Directory File Headers that are located at the end of a zip file. Apparently it is “incorrect” to ignore it as theoretically some files that you encounter in the zip file may have been “deleted” or “updated”. I’m willing to take that chance 🙂

    I’m currently working on some M code to parse/decompress all the files in a zip file; proving a little harder than I expected but I’m making headway.

    Re how widely it will work. I’m not sure, I might run a few tests tomorrow and see what works and doesn’t work. It has worked with every file I’ve looked at so far.

  7. Can I use this method to unzip both .zip & .zipx? Sorry, I’m not familiar with the coding, and I’m facing a very routine job to unzip 20+ zip files with the same file names.
    Trying to use PQ to unzip them all, so I don’t have to rename the files one by one.

  8. I was struggling with some .zip files from AWS that wouldn’t work with KenR or Mark White’s solutions, but when I switched the compression to .gzip and used this solution here, Chris, it worked like a charm. Thank you!
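The layout the unzipping comments above rely on is the ZIP "local file header": a 30-byte fixed header whose first 18 bytes are the signature, version, flags, compression method, modification time/date and CRC-32, followed by the compressed size, uncompressed size, file name length and extra field length, then the name, the extra field and the Deflate-compressed data itself. A Python sketch of the same parsing approach, checked against a one-file zip built in memory (it reads exactly comp_size bytes, without the +1 adjustment used in the comment's M code):

```python
import io
import struct
import zipfile
import zlib

# Build a one-file zip in memory so the parsing below has something to read
payload = b"<root>" + b"<item/>" * 50 + b"</root>"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("sheet1.xml", payload)
source = buffer.getvalue()

# Local file header: signature(4) version(2) flags(2) method(2) time(2) date(2)
# crc32(4) = 18 bytes, then the four fields the BinaryFormat.Record code reads
comp_size, uncomp_size, name_len, extra_len = struct.unpack_from("<IIHH", source, 18)

# Skip the 30-byte header, the file name and the extra field, then take
# comp_size bytes of raw Deflate data -- the same idea as the M code's
# second BinaryFormat.Record pass
data_start = 30 + name_len + extra_len
compressed = source[data_start:data_start + comp_size]
decompressed = zlib.decompress(compressed, -15)  # raw Deflate, no zlib header
assert decompressed == payload
```

As comment 6 notes, this reads only the first local file entry and ignores the central directory at the end of the archive, which is the authoritative list of the files a zip actually contains.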
