Working With Compression In Power Query And Power BI Desktop

If you’re reading this post there’s one important question you’ll probably want to ask: is it possible to extract data from a zip file in Power Query/Power BI? The answer is, unfortunately, no (at least at the time of writing). As this answer from Tristan on the dev team explains, because there are so many flavours of zip file out there it’s an extremely difficult problem to solve – so it hasn’t been attempted yet. That said, there are two other mildly interesting things to learn about compression in Power Query/Power BI Desktop that I thought were worth blogging about…

The first is that Power Query/Power BI can work with gzip files. For example, given a gzip file that contains a single csv file, here’s an example M query showing how the Binary.Decompress() function can be used to extract the csv file from the gzip file and then treat the contents of the csv file as a table:

let
    Source = Binary.Decompress(
     File.Contents(
      "C:\Myfolder\CompressedData.csv.gz"),
       Compression.GZip),
    #"Imported CSV" = Csv.Document(Source,
     [Delimiter=",",Encoding=1252]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV"),
    #"Changed Type" = Table.TransformColumnTypes(
     #"Promoted Headers",{
      {"Month", type text}, {" Sales", Int64.Type}
      })
in
    #"Changed Type"
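If you want to sanity-check a .csv.gz file outside Power Query, the same round trip – decompress the gzip bytes, parse the result as comma-delimited text, promote the first row to headers – can be sketched in a few lines of Python (the sample data here is just an illustration; note that Power Query's Encoding=1252 corresponds to Python's cp1252 codec):

```python
import csv
import gzip
import io

# Build a small gzip-compressed CSV in memory, standing in for CompressedData.csv.gz
csv_text = "Month, Sales\nJan,100\nFeb,200\n"
compressed = gzip.compress(csv_text.encode("cp1252"))

# The equivalent of Binary.Decompress(..., Compression.GZip) followed by Csv.Document:
# decompress the bytes, then parse them as CSV with a comma delimiter
decompressed = gzip.decompress(compressed).decode("cp1252")
rows = list(csv.reader(io.StringIO(decompressed)))

# The first row becomes the column names, like Table.PromoteHeaders
header, data = rows[0], rows[1:]
print(header)
print(data)
```

Note the leading space in " Sales": both Csv.Document and Python's csv module preserve whitespace in header cells exactly as it appears in the file, which is why the M query above references the column as " Sales".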

The other is that you’ll see Binary.Decompress() used when you import an Excel workbook that contains a linked table into Power BI Desktop. For example, consider an Excel workbook that contains the following table:

[image: the source table in Excel]

If this table is imported into the Excel Data Model as a linked table, and you then save the workbook and try to import it into Power BI using File/Import/Excel Workbook Contents:

[image: the File/Import/Excel Workbook Contents menu option]

… you’ll see this message:

[image: the first import message]

Click Start and you’ll get another message:

[image: the second message, offering the Copy Data option]

If you choose the Copy Data option, the data from the Excel table will be copied into Power BI. But where is it stored exactly? A look in the Query Editor at the query that returns the data shows that it’s embedded in the M code itself:

let
    Source = Table.FromRows(
     Json.Document(
      Binary.Decompress(
       Binary.FromText(
        "i45WMlTSUTJSitWJVjIGskzALFMgy0wpNhYA", 
        BinaryEncoding.Base64), 
       Compression.Deflate)), 
      {"A","B"}),
    #"Changed Type" = Table.TransformColumnTypes(
     Source,{{"A", Int64.Type}, {"B", Int64.Type}})
in
    #"Changed Type"

That big chunk of text in the middle of the Source step is the data from the Excel table stored as a compressed JSON document, and again Binary.Decompress() is used to extract this data.
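As a rough illustration of what that Source step does, here is the same chain of operations – Base64 decode, Deflate decompress, JSON parse – in Python. Compression.Deflate corresponds to a raw Deflate stream with no zlib header, which Python's zlib handles when given a window-bits value of -15:

```python
import base64
import json
import zlib

# The Base64 text embedded in the generated M query above
encoded = "i45WMlTSUTJSitWJVjIGskzALFMgy0wpNhYA"

# Binary.FromText(..., BinaryEncoding.Base64): decode the Base64 text to bytes
raw = base64.b64decode(encoded)

# Binary.Decompress(..., Compression.Deflate): raw Deflate, no zlib header
decompressed = zlib.decompress(raw, -15)

# Json.Document: the payload is a JSON array of rows, which Table.FromRows
# then turns into a two-column table named A and B
rows = json.loads(decompressed)
print(rows)
```

The JSON stores the cell values, and the #"Changed Type" step in the generated query is what converts them to Int64 afterwards.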

21 thoughts on “Working With Compression In Power Query And Power BI Desktop”

  1. Thanks for sharing Chris. The other file format I get asked about often is PDF. Do you know of any plans? I think there is a third party app that can do this, but native PQ support would be great. I realise that data structure is an issue here too, but even if it just used the same type of “look for table” algorithm used on a web page, it would be great.

  2. That’s funny – not supporting ZIP because of the number of flavours. Still, every Office file (xlsx, docx, etc) IS a zip, and so are the “compressed folders” inside Windows. So one wonders: why not at least support the ZIP standard implemented in Win/Office..?

  3. It is a shame that Excel/PowerPivot Datamodel does not support this gzip method, I just tried it out. Unless I am doing something wrong. It would be super handy for me.

  4. I’ve had success today using Power Query to unzip a .zip file. In my case a compressed xml file.
    This is the code that I used:

    let
    Source = File.Contents("sheet1.zip"),

    MyBinaryFormat = BinaryFormat.Record([MiscHeader=BinaryFormat.Binary(18),
    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
    UnCompressedFileSize=BinaryFormat.Binary(4),
    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
    TheRest=BinaryFormat.Binary()]),

    MyCompressedFileSize = MyBinaryFormat(Source)[FileSize]+1,
    MyFileNameLen = MyBinaryFormat(Source)[FileNameLen],

    MyBinaryFormat2 = BinaryFormat.Record([Header=BinaryFormat.Binary(30), Filename=BinaryFormat.Binary(MyFileNameLen), Data=BinaryFormat.Binary(MyCompressedFileSize), TheRest=BinaryFormat.Binary()]),

    GetDataToDecompress = MyBinaryFormat2(Source)[Data],
    DecompressData = Binary.Decompress(GetDataToDecompress, Compression.Deflate),
    #"Imported XML" = Xml.Tables(DecompressData)
    in
    #"Imported XML"

    Basically, I parse the file once to grab some metadata that is needed (filesize and filename length).

    Then I parse the file to extract the compressed data and discard anything after the compressed data.

    I’m not sure how broadly this would work – obviously deflated files only. It could be fairly easily adapted to extract files from .zip files with multiple files.

  5. Improved to deal with “extras”

    let
    Source = File.Contents("C:\Users\User\Dropbox\Apps\Pythonista Sync99\parsing word\Hyperlinks.zip"),

    MyBinaryFormat = BinaryFormat.Record([MiscHeader=BinaryFormat.Binary(18),
    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
    UnCompressedFileSize=BinaryFormat.Binary(4),
    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
    ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
    TheRest=BinaryFormat.Binary()]),

    MyCompressedFileSize = MyBinaryFormat(Source)[FileSize]+1,
    MyFileNameLen = MyBinaryFormat(Source)[FileNameLen],
    MyExtrasLen = MyBinaryFormat(Source)[ExtrasLen],

    MyBinaryFormat2 = BinaryFormat.Record([Header=BinaryFormat.Binary(30), Filename=BinaryFormat.Binary(MyFileNameLen), Extras=BinaryFormat.Binary(MyExtrasLen), Data=BinaryFormat.Binary(MyCompressedFileSize), TheRest=BinaryFormat.Binary()]),

    GetDataToDecompress = MyBinaryFormat2(Source)[Data],
    DecompressData = Binary.Decompress(GetDataToDecompress, Compression.Deflate),
    #"Imported XML" = Xml.Tables(DecompressData)
    in
    #"Imported XML"

    • Nice work! I would be interested to know the amount of variation there is out there in zip file formats, and the likelihood that this code will work with any given zip file.

  6. Thanks Chris. I’ve been experimenting with this a bit more today. One obvious limitation is that it is only decompressing the first file in the zip file and not paying any attention to the Central Directory File Headers that are located at the end of a zip file. Apparently it is “incorrect” to ignore it as theoretically some files that you encounter in the zip file may have been “deleted” or “updated”. I’m willing to take that chance 🙂

    I’m currently working on some M code to parse/decompress all the files in a zip file; proving a little harder than I expected but I’m making headway.

    Re how widely it will work. I’m not sure, I might run a few tests tomorrow and see what works and doesn’t work. It has worked with every file I’ve looked at so far.

  7. Can I use this method to unzip both .zip & .zipx? Sorry, I’m not familiar with the coding, and I’m facing a very routine job to unzip 20+ zip files with the same file names.
    Trying to use PQ to unzip them all, so I don’t have to rename the files one by one.

  8. I was struggling with some .zip files from AWS that wouldn’t work with KenR or Mark White’s solutions, but when I switched the compression to .gzip and used this solution here, Chris, it worked like a charm. Thank you!
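The layout the unzipping comments above rely on is the ZIP "local file header": a 30-byte fixed header whose first 18 bytes are the signature, version, flags, compression method, modification time/date and CRC-32, followed by the compressed size, uncompressed size, file name length and extra field length, then the name, the extra field and the Deflate-compressed data itself. A Python sketch of the same parsing approach, checked against a one-file zip built in memory (it reads exactly comp_size bytes, without the +1 adjustment used in the comment's M code):

```python
import io
import struct
import zipfile
import zlib

# Build a one-file zip in memory so the parsing below has something to read
payload = b"<root>" + b"<item/>" * 50 + b"</root>"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("sheet1.xml", payload)
source = buffer.getvalue()

# Local file header: signature(4) version(2) flags(2) method(2) time(2) date(2)
# crc32(4) = 18 bytes, then the four fields the BinaryFormat.Record code reads
comp_size, uncomp_size, name_len, extra_len = struct.unpack_from("<IIHH", source, 18)

# Skip the 30-byte header, the file name and the extra field, then take
# comp_size bytes of raw Deflate data -- the same idea as the M code's
# second BinaryFormat.Record pass
data_start = 30 + name_len + extra_len
compressed = source[data_start:data_start + comp_size]
decompressed = zlib.decompress(compressed, -15)  # raw Deflate, no zlib header
assert decompressed == payload
```

As comment 6 notes, this reads only the first local file entry and ignores the central directory at the end of the archive, which is the authoritative list of the files a zip actually contains.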
