Working With Compression In Power Query And Power BI Desktop

If you’re reading this post there’s one important question you’ll probably want to ask: is it possible to extract data from a zip file in Power Query/Power BI? The answer is, unfortunately, no (at least at the time of writing). As this answer from Tristan on the dev team explains, because there are so many flavours of zip file out there it’s an extremely difficult problem to solve – so it hasn’t been attempted yet. That said, there are two other mildly interesting things to learn about compression in Power Query/Power BI Desktop that I thought were worth blogging about…

The first is that Power Query/Power BI can work with gzip files. For example, given a gzip file that contains a single csv file, here’s an example M query showing how the Binary.Decompress() function can be used to extract the csv file from the gzip file and treat its contents as a table:

[sourcecode language="text" padlinenumbers="true"]
let
    Source = Binary.Decompress(
        File.Contents(
            "C:\Myfolder\CompressedData.csv.gz"),
        Compression.GZip),
    #"Imported CSV" = Csv.Document(Source,
        [Delimiter=",", Encoding=1252]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV"),
    #"Changed Type" = Table.TransformColumnTypes(
        #"Promoted Headers",{
        {"Month", type text}, {" Sales", Int64.Type}
        })
in
    #"Changed Type"
[/sourcecode]

The other is that you’ll see Binary.Decompress() used when you import an Excel workbook that contains a linked table into Power BI Desktop. For example, consider an Excel workbook that contains the following table:

If this table is imported into the Excel Data Model as a linked table, and you then save the workbook and try to import it into Power BI using File/Import/Excel Workbook Contents:


… you’ll see this message:

Click Start and you’ll get another message:

If you choose the Copy Data option, the data from the Excel table will be copied into Power BI. But where is it stored exactly? A look in the Query Editor at the query that returns the data shows that it’s embedded in the M code itself:

[sourcecode language="text"]
let
    Source = Table.FromRows(
        Json.Document(
            Binary.Decompress(
                Binary.FromText(
                    "i45WMlTSUTJSitWJVjIGskzALFMgy0wpNhYA",
                    BinaryEncoding.Base64),
                Compression.Deflate)),
        {"A","B"}),
    #"Changed Type" = Table.TransformColumnTypes(
        Source,{{"A", Int64.Type}, {"B", Int64.Type}})
in
    #"Changed Type"
[/sourcecode]

That big chunk of text in the middle of the Source step is the data from the Excel table stored as a compressed JSON document, and again Binary.Decompress() is used to extract this data.
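
Incidentally, you can generate this kind of compressed, Base64-encoded text yourself by going in the opposite direction with Table.ToRows(), Json.FromValue(), Binary.Compress() and Binary.ToText(). Here’s a rough sketch – the table name and values are just made-up examples:

[sourcecode language="text"]
let
    // a made-up two-column table to embed
    MyTable = #table({"A", "B"}, {{1, 2}, {3, 4}}),
    // serialise the rows as JSON, compress with Deflate, then Base64-encode
    EncodedText = Binary.ToText(
        Binary.Compress(
            Json.FromValue(Table.ToRows(MyTable)),
            Compression.Deflate),
        BinaryEncoding.Base64)
in
    EncodedText
[/sourcecode]

Feeding the resulting string back through the Binary.FromText/Binary.Decompress/Json.Document/Table.FromRows pattern shown above should reconstruct the original table.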

25 thoughts on “Working With Compression In Power Query And Power BI Desktop”

    1. I’ve been making heavy use of the gzip compression you posted here, Chris, but it only works for single files. What if I have a table of compressed files (all with the same name) that I want to iterate through and decompress before appending all of their data together? Not sure how to write the for loop to decompress and return each file and its content in a Table…

      1. Chris Webb says:

        What you need to do here is to create an M function that takes the path to a zip file as a parameter and returns the unzipped contents. You can then call this function using the Invoke Custom Function button (see https://blog.crossjoin.co.uk/2016/10/01/using-the-invoke-custom-function-button-in-power-bi/) on a table with one row for each file.
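
        As a very rough sketch, a function along these lines should work for gzipped csv files – the parameter name, delimiter and encoding are just placeholders that you would need to change for your own data:

        (FilePath as text) as table =>
        let
            Source = Binary.Decompress(File.Contents(FilePath), Compression.GZip),
            // treat the decompressed binary as a csv file
            ImportedCsv = Csv.Document(Source, [Delimiter=",", Encoding=1252]),
            PromotedHeaders = Table.PromoteHeaders(ImportedCsv)
        in
            PromotedHeaders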

  1. Matt Allington says:

    Thanks for sharing Chris. The other file format I get asked about often is PDF. Do you know of any plans? I think there is a third party app that can do this, but native PQ support would be great. I realise that data structure is an issue here too, but even if it just used the same type of “look for table” algorithm used on a web page, it would be great.

  2. Chris Webb says:

    No, I don’t know of any plans I’m afraid.

  3. That’s funny – not supporting ZIP because of the number of flavours. Still, every Office file (xlsx, docx, etc) IS a zip, and so are the “compressed folders” inside Windows. So one wonders: why not at least support the ZIP standard implemented in Win/Office?

  4. It is a shame that the Excel/Power Pivot Data Model does not support this gzip method – I just tried it out (unless I am doing something wrong). It would be super handy for me.

    1. Chris Webb says:

      This should work in Power Query in Excel (it’s M code though, so it won’t work inside Power Pivot).

  5. I’ve had success today using Power Query to unzip a .zip file. In my case a compressed xml file.
    This is the code that I used:

    let
    Source = File.Contents("sheet1.zip"),

    MyBinaryFormat = BinaryFormat.Record([MiscHeader=BinaryFormat.Binary(18),
    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
    UnCompressedFileSize=BinaryFormat.Binary(4),
    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
    TheRest=BinaryFormat.Binary()]),

    MyCompressedFileSize = MyBinaryFormat(Source)[FileSize]+1,
    MyFileNameLen = MyBinaryFormat(Source)[FileNameLen],

    MyBinaryFormat2 = BinaryFormat.Record([Header=BinaryFormat.Binary(30), Filename=BinaryFormat.Binary(MyFileNameLen), Data=BinaryFormat.Binary(MyCompressedFileSize), TheRest=BinaryFormat.Binary()]),

    GetDataToDecompress = MyBinaryFormat2(Source)[Data],
    DecompressData = Binary.Decompress(GetDataToDecompress, Compression.Deflate),
    #"Imported XML" = Xml.Tables(DecompressData)
    in
    #"Imported XML"

    Basically, I parse the file once to grab some metadata that is needed (filesize and filename length).

    Then I parse the file again to extract the compressed data, discarding anything that comes after it.

    I’m not sure how broadly this would work – obviously deflated files only. It could be fairly easily adapted to extract files from .zip files with multiple files.

  6. Improved to deal with “extras”

    let
    Source = File.Contents("C:\Users\User\Dropbox\Apps\Pythonista Sync99\parsing word\Hyperlinks.zip"),

    MyBinaryFormat = BinaryFormat.Record([MiscHeader=BinaryFormat.Binary(18),
    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
    UnCompressedFileSize=BinaryFormat.Binary(4),
    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
    ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
    TheRest=BinaryFormat.Binary()]),

    MyCompressedFileSize = MyBinaryFormat(Source)[FileSize]+1,
    MyFileNameLen = MyBinaryFormat(Source)[FileNameLen],
    MyExtrasLen = MyBinaryFormat(Source)[ExtrasLen],

    MyBinaryFormat2 = BinaryFormat.Record([Header=BinaryFormat.Binary(30), Filename=BinaryFormat.Binary(MyFileNameLen), Extras=BinaryFormat.Binary(MyExtrasLen), Data=BinaryFormat.Binary(MyCompressedFileSize), TheRest=BinaryFormat.Binary()]),

    GetDataToDecompress = MyBinaryFormat2(Source)[Data],
    DecompressData = Binary.Decompress(GetDataToDecompress, Compression.Deflate),
    #"Imported XML" = Xml.Tables(DecompressData)
    in
    #"Imported XML"

    1. Chris Webb says:

      Nice work! I would be interested to know the amount of variation there is out there in zip file formats, and the likelihood that this code will work with any given zip file.

      1. Hi Chris,
        I am able to connect to zipped Excel files but not CSV – I simply get 0-byte files. Any advice?

        Thank you so much

  7. Thanks Chris. I’ve been experimenting with this a bit more today. One obvious limitation is that it is only decompressing the first file in the zip file and not paying any attention to the Central Directory File Headers that are located at the end of a zip file. Apparently it is “incorrect” to ignore it as theoretically some files that you encounter in the zip file may have been “deleted” or “updated”. I’m willing to take that chance 🙂

    I’m currently working on some M code to parse/decompress all the files in a zip file; proving a little harder than I expected but I’m making headway.

    Re how widely it will work. I’m not sure, I might run a few tests tomorrow and see what works and doesn’t work. It has worked with every file I’ve looked at so far.

    1. Chris Webb says:

      Nice work!

      1. Hi Michael,

        Better to use Mark’s solution lower down. I think there are also some other alternative approaches. My code was just a proof of concept.

        Regards

        Ken

  8. Can I use this method to unzip both .zip & .zipx? Sorry, I’m not familiar with the coding… and I’m facing a very routine job of unzipping 20+ zip files with the same file names.
    Trying to use PQ to unzip them all, so I don’t have to rename the files one by one.

  9. I was struggling with some .zip files from AWS that wouldn’t work with KenR’s or Mark White’s solutions, but when I switched the compression to gzip and used this solution here, Chris, it worked like a charm. Thank you!

  10. Thanks Chris!
    Having zero coding skills, and following the sample code you provided, this is how I managed to do it (so other people like me may also set up Power Query to read compressed (gzip) files in a folder):

    In Excel Power Query, I had already set up a data source from a folder, where I had multiple CSV files.
    I consolidated all these files using Power Query’s built-in wizard: “New Source > File > Folder”
    This automatically creates a set of “helper queries” (a folder with a sample file, a parameter, a function and a table).

    So I made a backup, and all I had to do was
    1) Gzip the files in the folder individually (I used 7zip to do this). Each csv becomes a gz file.

    2) Open the Advanced Editor on the “Transform Sample File” query, and change the code from:

    let
    Source = Csv.Document(Parameter1,[Delimiter=";", Columns=23, Encoding=1252, QuoteStyle=QuoteStyle.None])
    in
    Source

    To this, based upon Chris’s sample code:

    let
    Source = Binary.Decompress(Parameter1, Compression.GZip),
    #"Imported CSV" = Csv.Document(Source,[Delimiter=";", Columns=23, Encoding=1252, QuoteStyle=QuoteStyle.None])
    in
    #"Imported CSV"

    This automatically updated the Transform File Function from:
    let
    Source = (Parameter1 as binary) => let
    Source = Csv.Document(Parameter1,[Delimiter=";", Columns=23, Encoding=1252, QuoteStyle=QuoteStyle.None])
    in
    Source
    in
    Source

    To:
    let
    Source = (Parameter1 as binary) => let
    Source = Binary.Decompress(Parameter1, Compression.GZip),
    #"Imported CSV" = Csv.Document(Source,[Delimiter=";", Columns=23, Encoding=1252, QuoteStyle=QuoteStyle.None])
    in
    #"Imported CSV"
    in
    Source

    Note: use the Delimiter, Columns and Encoding options that are appropriate for your own data.

    This helped me save 90% disk space when storing my csv files.

    I hope it helps.
    Thanks.
