Parquet Files In Power BI/Power Query And The “Streamed Binary Values” Error

If you’re using the new Parquet connector in Power BI there’s a chance you will run into the following error:

Parameter.Error: Parquet.Document cannot be used with streamed binary values.
Details:
[Binary]

This isn’t a bug or anything that can be fixed, so it’s important to understand why it occurs and what you can do about it.

One easy way to reproduce this problem is by trying to access a reasonably large (larger than a few MB) Parquet file stored in SharePoint, something like this:

let
  Source = SharePoint.Files(
    "https://microsoft-my.sharepoint.com/personal/abc",
    [ApiVersion = 15]
  ),
  GetFile = Source
    {
      [
        Name = "myfile.parquet",
        #"Folder Path"
          = "https://microsoft-my.sharepoint.com/personal/abc/Documents/"
      ]
    }
    [Content],
  #"Imported Parquet" = Parquet.Document(
    GetFile
  )
in
  #"Imported Parquet"

The problem is that reading data from Parquet files requires random file access, and this is something that isn’t possible in Power Query for certain data sources like SharePoint and Google Cloud Storage. This problem will never occur with locally-stored files or files stored in ADLSgen2.

There is one possible workaround but it comes with some serious limitations: buffer the Parquet file in memory using the Binary.Buffer() M function. Here’s an example of how the above query can be rewritten to do this:

let
  Source = SharePoint.Files(
    "https://microsoft-my.sharepoint.com/personal/abc",
    [ApiVersion = 15]
  ),
  GetFile = Source
    {
      [
        Name = "myfile.parquet",
        #"Folder Path"
          = "https://microsoft-my.sharepoint.com/personal/abc/Documents/"
      ]
    }
    [Content],
  #"Imported Parquet" = Parquet.Document(
    Binary.Buffer(GetFile)
  )
in
  #"Imported Parquet"

The problem with buffering files in memory like this is that it’s only feasible for fairly small files because of the limits on the amount of memory Power Query can use (see here for more information): you’re likely to get really bad performance or errors if you try to buffer files that are too large, and Parquet files are often fairly large. The best way of solving this problem is to switch to using a data source like ADLSgen2 where this problem will not happen.

[Thanks to Eric Gorelik for the information in this post]

12 responses

    • All I know is that it’s down to file random access. There’s a big difference between having a file system where you can access the contents of a file easily and a cloud service that can only give you a binary.

  1. Pingback: Power BI: Parquet Files and Streamed Binary Values Error – Curated SQL

  2. Actually cloud/http streams like Google Cloud Storage or AWS S3 do support random access if the client can make use of the Range header – see e.g. for GCP, https://cloud.google.com/storage/docs/xml-api/reference-headers#range
    The challenge here seems that Web.Content and Parquet.Document do not allow the use of Range on-the-fly.
    Ideally, Parquet.Document() should accept an url too and read the bytes as necessary though the Web/http interface.

  3. Thanks for posting this, it’s a problem we’re having using Parquet on AWS S3 and connecting to Power BI via an API Gateway. Our current workaround is to save as csv instead which is not ideal. Do you know if Microsoft is going to address this at some stage for sources other than ADLSgen2?

  4. Thanks for this article. It comes to the top of google results. Can you point to the exact logic/process which is the source of the error message? Is it the mashup engine?

    What the nature of the limitation in the context of an on-prem data gateway (OPDG)? When the mashup engine is running on the OPDG, the memory is mine, the local disk is mine. In this context, how do we avoid this error message? Will we be able to give the OPDG more memory, to make Binary.Buffer work reliably in that context?

    • The error is raised by Power Query but it’s not really a limitation of Power Query: it’s to do with how Power Query accesses the Parquet file. If the Parquet file is on a local file system you won’t have this problem, but if Power Query needs to access the Parquet file via an API then you probably will because most APIs don’t allow the kind of random file access that is necessary to read data from Parquet (ADLSgen2 being an exception). The workaround with Binary.Buffer is simply avoiding the API by downloading the entire file into local memory and accessing it from there, but then you will run into the limit on container size (see https://blog.crossjoin.co.uk/2019/04/21/power-bi-dataflow-container-size/); on a gateway the container size is calculated relative to the amount of memory on the machine it’s running on (see https://blog.crossjoin.co.uk/2022/02/13/speed-up-power-bi-refresh-by-increasing-the-amount-of-memory-on-your-on-premises-data-gateway-machine/) so your best bet is to increase the amount of RAM on the gateway PC.

      • Thanks Chris. Will play with these suggestions. I noticed I was able to consume massive amounts of memory for the sake of Json.Document and Binary.Decompress … so it seemed unfair to find this somewhat arbitrary restriction on Parquet. (I only need about a 10th of the memory for Parquet as what I was using for JSON … and I wasn’t allowed to have it …).

        Not only will I need to tinker with the gateway to come up with this extra RAM, but I will also need to tinker with my workstation where I run Power BI Desktop. I really think Microsoft may need to revisit “Parquet.Document”, and possibly give it a parameter that loads data into memory (implicit Binary.Buffer?). Perhaps it also needs an argument that specifies how much memory it is allowed to work with.

        I noticed the container is using the nuget for ParquetSharp. I have more experience with ParquetNet. But I plan to dive into ParquetSharp as well some day and see if it ever throws those “Streamed Binary Values” exceptions. I doubt it. I’m guessing this error originates within the Power Query side (a self-inflicted error based on how the amount of memory that seems acceptable for Parquet). What bothers me the most is the double standards. There are limitations in Parquet.Document that don’t affect the more commonly used Json data. I do understand the explanation about “random” access. Although the answer to that is the row groups, which are partitions in a parquet file that provide a minimal amount of data that is accessible if a client program wants to read the data in increments.

      • I would be surprised if there was a way for Json.Document or any other function to use more memory than Binary.Buffer. Where is your Parquet file stored?

      • The Parquet data is retrieved via HTTP REST.

        BTW, The Binary.Buffer seems to be doing the trick.
        I hadn’t actually tried it when I first started my comments. Step 1. is to complain. Step 2. is to test out your proposed workaround. 😉

        Thanks again.

      • Can I suggest changing the introduction (“This isn’t … anything that can be fixed)?

        Should just say “to make this work the same as Json.Document, you just need to nest Binary.Buffer() within Parquet.Document()”.

        As it is now, the reader is immediately discouraged that they won’t be able to get Parquet working. This won’t make the average person very happy (esp. after spending a day converting their entire Web API to generate parquet rather than json).

  5. Also, why is Parquet.Document being throttled on memory when I can use a *lot* more memory for other OPDG operations (Json.Document or Binary.Decompress)? I never run into errors with those, and the memory usage is significantly greater – by a factor of 10 or 100.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: