Query Folding On SQL Queries In Power Query Using Value.NativeQuery() and EnableFolding=true

Here’s something that will Blow Your Mind if you’re a Power Query/M fan. Did you know that there’s a way you can get query folding to work if you’re using a native SQL query on SQL Server or Postgres as your data source?

There’s a new option on the Value.NativeQuery() M function that allows you to do this: you need to set EnableFolding=true in the options record passed to its fourth parameter. It’s documented here for the Postgres connector but it works for the SQL Server connector too. Here’s an example using the SQL Server AdventureWorksDW2017 sample database:

let
  Source = Sql.Databases("localhost"),
  AdventureWorksDW2017 = Source
    {[Name = "AdventureWorksDW2017"]}
    [Data],
  RunSQL = Value.NativeQuery(
    AdventureWorksDW2017,
    "SELECT EnglishDayNameOfWeek FROM DimDate",
    null,
    [EnableFolding = true]
  ),
  #"Filtered Rows" = Table.SelectRows(
    RunSQL,
    each (
      [EnglishDayNameOfWeek] = "Friday"
    )
  )
in
  #"Filtered Rows"

Notice that my data source is a SQL query that gets all rows for the EnglishDayNameOfWeek column from the DimDate table, and that I then filter down to the day name Friday in the #"Filtered Rows" step using the Table.SelectRows() function. Normally the #"Filtered Rows" step wouldn’t fold because I’ve used a native SQL query as my source, but in this case it does because I’ve set EnableFolding=true in Value.NativeQuery().

Here’s the SQL query generated by this M query:

select [_].[EnglishDayNameOfWeek]
from 
(
    SELECT EnglishDayNameOfWeek FROM DimDate
) as [_]
where [_].[EnglishDayNameOfWeek] = 'Friday'

Of course this doesn’t mean that everything can be folded now, but it’s nice to see that some folding on native SQL queries is now possible.
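The same pattern works with the Postgres connector too. Here’s a rough sketch of what that might look like (the server, database, table and column names here are hypothetical):

let
  Source = PostgreSQL.Database(
    "localhost",
    "adventureworksdw"
  ),
  RunSQL = Value.NativeQuery(
    Source,
    "SELECT englishdaynameofweek FROM dimdate",
    null,
    [EnableFolding = true]
  ),
  #"Filtered Rows" = Table.SelectRows(
    RunSQL,
    each [englishdaynameofweek] = "Friday"
  )
in
  #"Filtered Rows"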

As I said this only works for SQL Server and Postgres at the time of writing and there is one other limitation: folding won’t happen if you’re passing parameters back to your SQL query in the way I describe here.
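For example, a query like the following sketch, which passes a value to a @DayNumber parameter via the third parameter of Value.NativeQuery() (the parameter name and value here are just for illustration), returns data successfully but won’t allow any subsequent steps to fold:

let
  Source = Sql.Databases("localhost"),
  AdventureWorksDW2017 = Source
    {[Name = "AdventureWorksDW2017"]}
    [Data],
  RunSQL = Value.NativeQuery(
    AdventureWorksDW2017,
    "SELECT EnglishDayNameOfWeek FROM DimDate WHERE DayNumberOfWeek = @DayNumber",
    // the record in the third parameter supplies the parameter value
    [DayNumber = 6]
  )
in
  RunSQL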

[Thanks to Curt Hagenlocher for the information]

Implementing Data (As Well As Metadata) Translations In Power BI

Power BI Premium has supported metadata translations – translations for table, column and measure names etc – for a while now. Kasper has a great blog showing how to use this feature here; Tabular Editor makes it very easy to edit metadata translations too. However (unlike SSAS Multidimensional) Power BI doesn’t have native support for data translations, that is to say, translating the data inside your tables and not just the names of objects. There are some blog posts out there that describe ways of tackling this problem but none are really satisfying or reliable: techniques involving row-level security, for example, can force you to use relationships in a way that will impact performance, and the undocumented UserCulture() DAX function isn’t ready and shouldn’t be used in production yet. In this blog post I’m going to describe a new approach to solving the problem of data translations that relies on the DirectQuery on Live connections functionality released in preview in December. It is far from perfect but I think it’s the best way of solving this problem available at the moment.

Describing the problem

The best way to understand the problem and what this new approach offers is to look at the end result. Here’s a Power BI report where everything is in English:

And here is exactly the same report where the same information is shown translated into German (apologies for the actual translations…):

There are several things to point out in this German version of the report:

  1. The text in the title at the top and in the sidebar is now in German
  2. The column headers for the matrix are now in German (this is what has been possible for a while with metadata translations)
  3. The dates are now formatted using the default for the German locale (again, as a result of metadata translations)
  4. The day names are now shown in German. This is the key thing here – the data in a column, as well as the metadata, has been translated
  5. The decimal values in the Umsatz column are now formatted using the German locale, so a comma is used as the decimal separator and a full stop (ie a period, for my American readers) is used as the thousands separator (again, as a result of metadata translations)
  6. The text indicating the total row is now in German (this is the normal behaviour of Power BI for users with a German browser locale)

So how do you get both data and metadata translations working in Power BI? Here’s a super-simple example showing how…

Step 1: The source dataset

First, let’s take a look at the dataset that contains all the data for both these reports. It contains two tables.

The Sales table looks like this:

Note that the Day Name column contains the names of the days of the week, and that there are two other columns containing the same day names translated into German and French. This is the table that holds the data shown in the matrix in the reports above.

There is also a table called Text that contains the text shown in the title and sidebar in the reports above. Again, there is a column containing the English text and two other columns containing German and French translations of that text:

There are no relationships between these tables:

Finally, this dataset also needs to be published to a workspace in the Power BI Service.
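If you want to recreate something similar yourself, here’s a rough M sketch of the two tables built with #table. The column names match the ones described above, but the rows are purely illustrative sample data; in a real dataset each table would be its own query:

// the "Sales" table
let
  Sales = #table(
    type table [
      Date = date,
      #"Day Name" = text,
      #"German Day Name" = text,
      #"French Day Name" = text,
      Sales = number
    ],
    {
      {#date(2021, 1, 1), "Friday", "Freitag", "vendredi", 10},
      {#date(2021, 1, 2), "Saturday", "Samstag", "samedi", 15}
    }
  )
in
  Sales

// the "Text" table
let
  TextTable = #table(
    type table [
      Visual = text,
      Text = text,
      #"German Text" = text,
      #"French Text" = text
    ],
    {
      {"Title", "Sales Report", "Umsatzbericht", "Rapport sur les ventes"},
      {"Textbox", "Here is some text", "Hier ist ein Text", "Voici du texte"}
    }
  )
in
  TextTable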

Step 2: Building the English version of the report

Building the original English version of the report is also quite straightforward. The matrix just contains data from the Sales table:

The only interesting thing is how the text in the title and sidebar is handled. In both cases I have used a card visual and dragged the Text column from the Text table into it, then filtered the data on the Visual column of the Text table so the appropriate text is shown:

In this case the sidebar shows the data from the Text column (aggregated to get the First value) where the Visual field contains the value “Textbox”.

Step 3: Creating the German translation dataset

This is where things get interesting. The next thing to do is to open a new .pbix file in Power BI Desktop, create a Live connection to the dataset created in step 1, and then hit the “Make changes to this model” button to create a new local dataset. This is where the new DirectQuery on Live connections functionality comes in; you should read the documentation on this feature before you go any further. The important thing to remember is that when you do this you are not duplicating any data or logic that is in the original dataset but you can make your own modifications to it.

There are two things that have to be done in this dataset. First, in Power BI Desktop, some renaming is necessary:

  1. The “Day Name” column on the Sales table has been renamed “English Day Name”
  2. The “German Day Name” column on the Sales table has been renamed “Day Name”
  3. The “Text” column on the Text table has been renamed “English Text”
  4. The “German Text” column on the Text table has been renamed “Text”

This results in the following columns in the local dataset:

Second, a translation object for the German (de-DE) locale needs to be added to the dataset to hold the metadata translations. I used Tabular Editor (instructions here) because it was the quickest and easiest way to add metadata translations:

Note how the name of the Sales table has been translated to Umsatz, and how the names of the Date, Day Name and Sales columns have been translated to Datum, Tagesname and Umsatz respectively (remember, Day Name here is the column that has just been renamed, the one that points to the German Day Name column in the original dataset).

This local dataset also needs to be published to the Power BI Service to proceed.

Step 4: building the German version of the report

The last thing to do is to go back to the original English version of the report built in step 2, open it in Power BI Desktop, then point it to the new German local dataset created in the previous step. You can do this by going to the Home tab in the ribbon, clicking on the Transform data button and then selecting Data source settings:

…and then selecting the German local dataset like so:

At this point you’ll see that the report is in a semi-translated state: the data has been translated but the metadata has not been.

Don’t panic though! Metadata translations can only be viewed in the browser or by changing the language settings inside Power BI Desktop so this is to be expected.

Notice that the middle column in the matrix points to a column called 'Sales'[Day Name]. In the source dataset this contains the English day names; in the German translation dataset we have switched the names of the columns so that 'Sales'[Day Name] now contains the German day names. This is the key to solving the problem: each translation dataset you create must expose the set of table, column and measure names that your report expects, which means renaming columns in each translation dataset so that they point to the columns in the source dataset that contain the correct translated values.

You should then save the .pbix file and publish again. You’ll either need to change the name of the report or publish it to a separate workspace; I recommend the latter, because it means you can tell all your German-speaking users to go to one workspace for their reports and all your English-speaking users to go to another.

And that’s it – someone with a German browser locale viewing the version of the report connected to the German translation dataset will see this:

Summary

Here’s a diagram showing everything that has been built so far:

English-language users (who will have an English-language browser locale) use the original report that points to the source dataset. German-language users (who will have a German-language browser locale) use the German version of the report, which in turn connects to the German translation dataset; this gives them the German data and metadata translations.

The important things to remember are:

  • Even though you have multiple datasets there is no duplication of data or logic because of the way the new DirectQuery on Live connections functionality works.
  • Even though you have multiple copies of the same report for different languages, the report design in each case is identical; only the dataset that each report points to is different.

As a result the effort needed to maintain multiple translated copies of the same report is kept to a minimum.

Not too far in the future there will be new and improved functionality in Power BI that makes solving this problem even easier, and at that point I’ll write a follow-up blog post.

Optimise The Performance Of Reading Data From ADLSgen2 In Power BI With The HierarchicalNavigation Option

Last year Marco Russo wrote a very useful blog post pointing out the performance problems you can run into when connecting to data stored in ADLSgen2 from Power BI when there are a large number of files elsewhere in the container. You can read that post here:

https://www.sqlbi.com/blog/marco/2020/05/29/optimizing-access-to-azure-data-lake-storage-adls-gen-2-in-power-query/

Marco’s advice – which is 100% correct – is that you should either pass the full path to the folder that you want to connect to in the initial call to AzureStorage.DataLake() or, if you’re connecting to a single file, pass the path to the file itself. This avoids the performance overhead of reading metadata from files you’re not interested in, which can be quite considerable.
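For example, if you only need one file you can pass the path to that file itself in the first parameter of AzureStorage.DataLake() and take its Content directly. Here’s a rough sketch using the same hypothetical storage account, container and file names as the example later in this post:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/aSales.csv"
  ),
  // when the URL points at a single file, the table returned
  // contains just that one file
  FileContent = Source{0}[Content],
  #"Imported CSV" = Csv.Document(
    FileContent,
    [
      Delimiter  = ",",
      Columns    = 2,
      Encoding   = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  )
in
  #"Promoted Headers"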

There are some scenarios where this advice doesn’t work, though, and in those cases there is another way to avoid this overhead and make reading data much faster: using the HierarchicalNavigation option of the AzureStorage.DataLake() function. I blogged about what this option does some time ago but didn’t realise at the time the performance benefits of using it:

https://blog.crossjoin.co.uk/2019/09/29/hierarchical-navigation-adlsgen2-power-bi/

Consider the following scenario. Let’s say you want to connect to a CSV file in a folder which also contains a subfolder that contains many (in this example 20,000) other files that you’re not interested in:

[I’m only going to connect to a single file here to keep the example simple; I know I could just connect direct to the file rather than the folder and avoid the performance overhead that way]

Here’s the M code generated by the Power Query Editor using the default options to get the contents of the aSales.csv file:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder"
  ),
  Navigate = Source
    {
      [
        #"Folder Path"
          = "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/",
        Name = "aSales.csv"
      ]
    }
    [Content],
  #"Imported CSV" = Csv.Document(
    Navigate,
    [
      Delimiter  = ",",
      Columns    = 2,
      Encoding   = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers"
    = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  ),
  #"Changed Type"
    = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"Product", type text},
      {"Sales", Int64.Type}
    }
  )
in
  #"Changed Type"

In Power BI Desktop, refreshing the table that this M query returns (even with the Allow Data Preview To Download In The Background option turned off) takes 23 seconds. I measured refresh time with a stopwatch, starting when I clicked the refresh button and ending when the refresh dialog disappeared; this is a lot longer than the refresh time you might see using the Profiler technique I blogged about here, but as a developer this is the refresh time that you’ll care about.

The problem here is the Source step, which returns a list of all the files in both the ParentFolder folder and the ManySmallFiles subfolder: metadata has to be retrieved for all 20,000 files in the subfolder even though the query only needs one of them. Setting HierarchicalNavigation=true avoids this, because subfolders are then returned as nested tables whose contents are only read if you navigate into them.

Now, here’s an M query that returns the same data but where the HierarchicalNavigation=true option is set:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder",
    [HierarchicalNavigation = true]
  ),
  Navigation = Source
    {
      [
        #"Folder Path"
          = "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/",
        Name = "aSales.csv"
      ]
    }
    [Content],
  #"Imported CSV" = Csv.Document(
    Navigation,
    [
      Delimiter  = ",",
      Columns    = 2,
      Encoding   = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers"
    = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  ),
  #"Changed Type"
    = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"Product", type text},
      {"Sales", Int64.Type}
    }
  )
in
  #"Changed Type"

This takes just 3 seconds to refresh in Power BI Desktop – a really big improvement.

Conclusion: always use the HierarchicalNavigation=true option in AzureStorage.DataLake() when connecting to data in ADLSgen2 storage from Power BI to get the best refresh performance and the best developer experience in Power BI Desktop.

Testing The Performance Of Importing Data From ADLSgen2 Common Data Model Folders In Power BI

Following on from my last two posts comparing the performance of importing data from ADLSgen2 into Power BI using the ADLSgen2 connector and going via Synapse Serverless (see here and here), in this post I’m going to look at a third option for connecting to CSV files stored in ADLSgen2: connecting via a Common Data Model folder. There are two ways to connect to a CDM folder in Power BI: you can attach it as a dataflow in the Power BI Service, or you can use the CDM Folder View option in the ADLSgen2 connector.

First of all, let’s look at connecting via a dataflow. Just to be clear, I’m not talking about creating a new entity in a dataflow and using the Power Query Editor to connect to the data. What I’m talking about is the option you see when you create a dataflow to attach a Common Data Model folder as described here:

This is something I blogged about back in 2019; if you have a folder of CSV files it’s pretty easy to add the model.json file that allows you to attach this folder as a dataflow. I created a new model.json file and added it to the same folder that contains the CSV files I’ve been using for my tests in this series of blog posts.

Here’s what the contents of my model.json file looked like:

Something to notice here is that I created one CDM partition for each CSV file in the folder; only the first CDM partition is visible in the screenshot. Also, I wasn’t able to expose the names of the CSV source files as a column in the way I did for the ADLSgen2 connector and Synapse Serverless connector, which means I couldn’t compare some of the refresh timings from my previous two posts with the refresh timings here and had to rerun a few of my earlier tests.

How did it perform? I attached this CDM folder as a dataflow, connected a new dataset to it and ran some of the same tests I ran in my previous two blog posts. Importing all the data with no transformations (as I did in the first post in this series) into a single dataset took on average 70 seconds in my PPU workspace, slower than the ADLSgen2 connector, which took 56 seconds to import the same data minus the filename column. Adding a step in the Power Query Editor in my dataset to group by the TransDate column and add a column with the count of days (as I did in the second post in this series) took on average 29 seconds to refresh in my PPU workspace, which is again slightly slower than the ADLSgen2 connector; there’s a sketch of this kind of query below.
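For reference, the group-by step in these tests was a simple Table.Group over the entity exposed by the dataflow. Here’s a rough sketch of what such a query might look like; the workspace and dataflow IDs and the entity name are placeholders, and the aggregation shown is just a row count per date:

let
  Source = PowerBI.Dataflows(null),
  Workspace = Source{[workspaceId = "insert workspace id here"]}[Data],
  Dataflow = Workspace{[dataflowId = "insert dataflow id here"]}[Data],
  Transactions = Dataflow{[entity = "Transactions"]}[Data],
  // group by transaction date and count the rows for each date
  GroupedRows = Table.Group(
    Transactions,
    {"TransDate"},
    {{"Count", each Table.RowCount(_), Int64.Type}}
  )
in
  GroupedRows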

Conclusion #1: Importing data from a dataflow connected to a CDM folder is slower than importing data using the ADLSgen2 connector with the default File System View option.

What about the Enhanced Compute Engine for dataflows? Won’t it help here? Not in the scenarios I’m testing, where the dataflow just exposes the data in the CSV files as-is and any Power Query transformations are done in the dataset. Matthew Roche’s blog post here and the documentation explain when the Enhanced Compute Engine can help performance; if I had created a computed entity to do the group by in my second test above, for example, then that would benefit from it. However, in this series I want to keep a narrow focus on testing the performance of loading data from ADLSgen2 direct into Power BI without staging it anywhere.

The second way to import data from a CDM folder is to use the CDM Folder View option (which, at the time of writing, is in beta) in the ADLSgen2 connector:

I expected the performance of this method to be the same as the dataflow method, but interestingly it performed better when loading all the data with no transformations: on average it took 60 seconds to refresh the dataset. This was still a bit slower than the 56 seconds the ADLSgen2 connector took, using the default File System View option, to return the same data minus the filename column. I then ran the test to group by the TransDate column, and that resulted in an average dataset refresh time of 27 seconds, which is exactly the same as the ADLSgen2 connector with the default File System View option.

Conclusion #2: Importing data from a Common Data Model folder via the ADLSgen2 connector’s CDM Folder View option may perform slightly slower than, or about the same as, the default File System View option.

So no performance surprises again, which is a good thing. Personally, I think exposing your data via a CDM folder is much more user-friendly than giving people access to a folder full of files – it’s a shame it isn’t done more often.
