Video: Advanced Analytics Features In Power BI

Following on from my last post, another SQLBits session of mine I wanted to highlight was “Advanced Analytics Features In Power BI”. The subject is a bit outside my normal area of expertise but it’s also one that I don’t think gets enough attention: it’s all the features available in Power BI reports that can help you explain why something happened rather than just what happened. Things I talk about include:

  • Adding forecasts to line charts
  • Symmetry shading, ratio lines and clustering on scatter charts
  • The “Explain the increase” and “Find where this distribution is different” features
  • The Key Influencers and Decomposition Tree visuals
  • Custom visuals such as SandDance
  • Natural language querying with Q&A

Video: Performance Tuning Power BI Dataset Refresh

The team at SQLBits have been publishing all the session recordings from their last (online) conference on their YouTube channel. There’s a lot of great content there to check out and this post is to highlight one of my sessions, “Performance tuning Power BI dataset refresh”.

In this session I look at all of the factors that can influence how long it takes to import data into Power BI and what you can do to make it faster. Topics covered include:

  • Choosing a dataset storage mode
  • The importance of good data modelling
  • How the type of data source you use affects how quickly data can load
  • Ways to measure refresh performance, such as using SQL Server Profiler and Power Query Query Diagnostics
  • Power Query options that can influence refresh times such as disabling data previews
  • Query folding in the Power Query engine
  • Vertipaq engine features that affect refresh, such as calculated columns and calculated tables
  • How dataflows can help refresh performance

Power Query And Power BI Connectivity Announcements At The Microsoft Business Applications Summit

There were a lot of exciting announcements at the Microsoft Business Applications Summit this week but if you only watched the keynotes or read the recap on the Power BI blog you will have missed all the Power Query-related news in the “Data Prep in Power BI, Power Platform and Excel using Power Query” session:

https://mymbas.microsoft.com/sessions/1332f59f-a051-4a06-ae50-8f3185501a88

It covers all the new things that have happened in Power Query over the last few months such as Diagram View and, more importantly, talks about what’s going to happen in the next few months. It’s relatively short but for those of you with no time or patience, here’s a summary of the roadmap announcements:

[BTW “Power Query Online” is the browser-based version of Power Query that is used in Power BI dataflows]

My highlights are:

  • The ability to create a dataflow quickly by uploading a file to Power Query Online without needing to use a gateway to connect to a file on-premises, useful for one-time import scenarios.
  • Multi-value M parameter support – useful for dynamic M parameters and other things I can’t talk about yet 😉
  • The things that Miguel talks about regarding “easier design experiences” for Synapse are kept intentionally vague, but it’s worth listening carefully to what he says here!
  • Native SQL support for Snowflake, BigQuery and Redshift – this is really useful for anyone who wants to use DirectQuery with these databases because it will allow you to write your own SQL query and use it as the source of a table, rather than having to use a table or a view (see the sketch after this list).
  • AAD based Single Sign-On support for Redshift and BigQuery (similar to what we have today for Snowflake) will also be very important for DirectQuery, because it means that the identity of the user running the report can be passed back to the database.
  • A dataflows connector for Excel Power Query – which means, at last, you’ll be able to get data from a dataflow direct into Excel. This will make a lot of Excel users very happy, I think: a lot of the time all users want is a table of data dumped to Excel, and dataflows will be a great way to provide them with that.
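Regarding the native SQL support mentioned above, here’s a rough sketch of the kind of M this enables, based on the Value.NativeQuery pattern that Power Query already supports for other connectors. The account, warehouse, database and query here are all made up, and the final syntax may well differ when the feature ships:

let
  // Hypothetical Snowflake account and warehouse
  Source = Snowflake.Databases(
    "myaccount.snowflakecomputing.com",
    "MYWAREHOUSE"
  ),
  // Use a hand-written SQL query as the source of the table;
  // EnableFolding allows later transformations to fold back to Snowflake
  Result = Value.NativeQuery(
    Source{[Name = "MYDATABASE"]}[Data],
    "SELECT CUSTOMER, SUM(SALES) AS TOTALSALES FROM SALESTABLE GROUP BY CUSTOMER",
    null,
    [EnableFolding = true]
  )
in
  Result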

Last of all, the session showcases the great new home for all things Power Query – http://www.powerquery.com/ – which has lots of resources, newly-updated documentation and a blog. Make sure you check it out!

Power BI, Excel Organisation Data Types And Images

Excel Organisation data types were released last year (see here for details), but did you know that you can now use them to bring images as well as text and numbers into Excel? Here’s a super-simple example that shows you how to do this.

Here’s a table called ‘Fruit With Image’ in a dataset that I have published to the Power BI Service:

Notice that the Data Category property on the Image column, which contains the URL of a picture of each type of fruit listed, is set to “Image URL” (for more details on what this does see here). If I use this table in a Power BI report, I see the name of each fruit and a picture:

So far no surprises. I can also set this table up as a Featured Table (for more details see here) so it can be used as the source for an Organisation Data Type in Excel:

The cool thing is that when I type these fruit names into Excel and mark them as the “Fruit With Image” data type (see here for more details), I can then access the Image field and it will show the image that the URL points to inside a cell:

Measuring DirectQuery Performance In Power BI

If you have a slow DirectQuery report in Power BI one of the first questions you need to ask is how long the SQL queries that Power BI generates take to run. This is a more complicated question to answer than you might think, though, and in this post I’ll explain why.

I happen to have access to some of the famous New York taxi data in a Snowflake database, and in there is a table with trip data that has 173 million rows that I have built a Power BI dataset from. The data and the database used are not really important here though – what is important is that it’s DirectQuery and a large-ish amount of data. Here’s a report page with a single table visual on it, showing passenger count aggregated by the hack license field:

It’s slow, but how slow? Here’s what Performance Analyzer shows when I refresh the table:

The DAX query takes 5.4 seconds but the Direct Query time is only 3.3 seconds – and the numbers don’t seem to add up. Here’s what Profiler captures for the same refresh shown in Performance Analyzer:

This shows there’s a gap of 2 seconds between the DirectQuery End event and the Query End event. What if I paste the DAX query into DAX Studio? Here’s what the Server Timings tab shows:

This is a different query execution from the two examples above (which both show data for the same execution), which explains why the numbers are slightly different here – but again there seems to be an extra second of work happening, and DAX Studio suggests that it’s in the Formula Engine.

So what is going on? The answer lies in understanding what the DirectQuery End Profiler event actually measures: it’s the amount of time between the Analysis Services engine handing a query over to the Power Query engine and the Analysis Services engine receiving the first row in the resultset back, including the time taken for the Power Query engine to fold the query.

Therefore if it takes a long time to get all the rows in the resultset then that could explain what’s going on here. Unfortunately there’s no way of knowing from Profiler events how long this takes – but there is another way. Going back to Performance Analyzer, if you export the data from it to JSON (by clicking the Export button) and load it into Power Query, you can see more detail about a DirectQuery query execution. Here’s the data from the first execution above:

[There’s a very good paper documenting what’s in the Performance Analyzer JSON file here]

Looking at the record in the metrics column for the Execute Direct Query event you can see the same 3.2 second duration shown above in Profiler. Notice that there are two other metrics here as well: RowsRead, which is the total number of rows returned in the resultset; and DataReadDuration, which is the time taken to read these rows after the first row has been received, plus the time for some other Analysis Services engine operations such as encoding of column values, joining with unpushed semijoins, projecting aggregations such as Average and saving the resultset to the in-memory cache. In this case the SQL query returned 43191 rows and this took 1.95 seconds – which explains the gap between the end of the Execute Direct Query event and the end of the query.
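If you want to explore this data yourself, here’s a minimal sketch of the kind of M query you could use on the exported file. The file path is hypothetical, and the structure assumed here (an events list where each event has a name and a metrics record) is the one described in the paper linked above, so check it against your own export:

let
  // Hypothetical path to the file exported from Performance Analyzer
  Source = Json.Document(File.Contents("C:\PowerBIPerformanceData.json")),
  // Assumes the export holds a list of event records in an "events" field
  Events = Table.FromRecords(Source[events], null, MissingField.UseNull),
  // Keep just the Execute Direct Query events
  DirectQueryEvents = Table.SelectRows(
    Events,
    each [name] <> null and Text.Contains([name], "Direct Query")
  ),
  // Expand the metrics discussed in this post; adjust the field names
  // to whatever your own export actually contains
  ExpandedMetrics = Table.ExpandRecordColumn(
    DirectQueryEvents,
    "metrics",
    {"RowsRead", "DataReadDuration"}
  )
in
  ExpandedMetrics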

One last question: why is this SQL query returning so many rows when the DAX query is only asking for the top 502 rows?

The reason is that, at the time of writing at least, the Analysis Services engine can only push a top(n) operation down to a DirectQuery SQL query in very simple scenarios where there are no measures and no aggregation involved – and in this case we’re summing up values. As a result, if you’re using DirectQuery mode and have a visual like this that can potentially display a large number of rows and includes a measure or aggregated values, you may end up with slow performance.

[Thanks to Jeffrey Wang for providing the information in this post]

Can I Build A Power BI DirectQuery Dataset On Top Of A REST API?

One of the most common questions I get asked is “How can I use Power BI in DirectQuery mode on top of a REST API?”. This seems like a reasonable thing to do but almost everyone who tries it will fail, and in this post I will explain why.

To answer this question we first of all have to review the two main ways of working with data in Power BI: Import mode and DirectQuery mode. In Import mode data is cached in Power BI’s own internal database and all the DAX queries that are generated by your reports are answered from there. In DirectQuery mode no data is stored inside Power BI; instead, when a report is run and DAX queries are fired off against your dataset, Power BI in turn generates queries against the data source to get the data needed. Most of the data sources that can be used with DirectQuery mode in Power BI are relational databases, which means Power BI will generate SQL queries to get data from them, but Power BI can generate queries in other languages too.

Let’s look at a simple example of how DirectQuery works. Consider the following table called FruitSales in a SQL Server database:

If I build a DirectQuery dataset containing only this table I can create the following Power BI report with a matrix and a slicer:

The SQL query generated by Power BI to get the data for the matrix is this:

SELECT 
TOP (1000001) [t0].[Colour],SUM(
CAST([t0].[Sales] as BIGINT)
)
 AS [a0]
FROM 
(
(
select [$Table].[Fruit] as [Fruit],
    [$Table].[Colour] as [Colour],
    [$Table].[Sales] as [Sales]
from [dbo].[FruitSales] as [$Table]
)
)
 AS [t0]
WHERE 
(
([t0].[Colour] IN (N'Red',N'Green'))
)

GROUP BY [t0].[Colour] 

A few things are immediately obvious from this: the query has a WHERE clause to filter the data down to the two colours selected in the slicer and there’s a GROUP BY to get the sum of Sales by Colour. You can see how the results of the SQL query are going to match what is displayed in the report. In more complex cases – especially when non-trivial DAX calculations are used – a single visual may generate multiple, more sophisticated SQL queries and some calculation work may also be done inside Power BI.

This example illustrates the kind of operations that Power BI needs to push down to the data source in DirectQuery mode. The question is, then, does your REST API support filtering and aggregating data in the way Power BI wants? If not, then you won’t be able to build a DirectQuery dataset on top of it.

“But wait”, I hear some of you say, “why can’t Power BI get a table of data returned by my REST API and do the filtering and aggregation itself?”. I agree it would be lovely if it could but it can’t. And if it did, that would be something more like Import mode and not DirectQuery. “Can’t I just somehow refresh an Import mode dataset on demand then?”, you may then say, “I know it’s possible to refresh a dataset using Power Automate and we can now trigger a Power Automate flow from a report using the new Power Automate visual. What’s more, with Power BI Premium there’s no limit on the number of refreshes you can trigger via the API that Power Automate uses”. That’s certainly true, but there are a few problems with this approach:

  1. Refreshing an Import mode dataset can be relatively slow and expensive in terms of resources, and if you have hundreds of users refreshing a dataset every time they view a report you may end up with the dataset permanently refreshing, which again impacts performance.
  2. Commercial APIs often have limits on the number of times you can call them within a given time period and it’s likely you’d hit these rate limits if you let your users refresh a dataset any time they wanted.
  3. In Import mode there’s no easy way to take a selection made by an end user and pass it back to the API as a parameter (if you don’t need to pass parameters back to the API why not just create a regular Import dataset and refresh it on a schedule?). Dynamic M parameters only work in DirectQuery mode. I suppose you could capture a selection and, using the Power Automate visual again, pass it to a Flow that used a Power Automate custom connector to call the Power BI REST API and change an M parameter’s value in your dataset but that would be very tricky to set up. However…
  4. A dataset is something that is shared between all the users of your report. What happens if User A changes a parameter, refreshes the dataset, starts interacting with the report and then User B also starts viewing the report and changes the parameter to something different and refreshes again? In that case User A would see the numbers for User B’s selection in the report and it would be extremely confusing.

Let’s say your API does support all the rich operations Power BI needs in DirectQuery mode though. You can certainly build a Power BI custom connector that supports DirectQuery mode yourself. One way of doing this is to build your own ODBC provider on top of your API and then wrap it in a simple Power BI custom connector – which, as you can probably guess, is no easy task. There’s a sample of how to create a custom connector that supports DirectQuery on an ODBC provider here. Alternatively you can try not going down the ODBC route and putting all the logic in the custom connector but there’s no documentation on how to do this and it would be extremely difficult to do, so you’d need a lot of support from someone at Microsoft who is much better at coding in M than I am.

The good news is that some third-party vendors, for example CData and Progress (there may be others too) sell Power BI custom connectors that allow you to create DirectQuery datasets on top of REST APIs. These connectors are thin wrappers for ODBC providers and work by caching the data returned from the API within the provider and providing a SQL query interface on top of that. I haven’t tested these products so I can’t comment on how well they work or what their performance is like. You’d need to pay extra to use them, of course, and since they are custom connectors you’d need to have an on-premises data gateway to use them.

To conclude, the purpose of this post is not to explain why building a report on top of a REST API is a bad idea but to explain why it’s hard to do this in Power BI. This is, partly, the result of the way Power BI is designed: as a model-based tool rather than a report-based tool (Marco Russo wrote a great post on this subject) it needs access to all the data for a report via a richer interface than most APIs can provide. Indeed, my advice to anyone wanting to build a DirectQuery dataset on an API is that it is better to bypass the API and go to the data source that the API uses (which is often a relational database) instead, if you can, or to extract all the data that the API can provide and stage it locally. It might go against someone in your organisation’s grand plan for a service-oriented architecture but you’ll be a lot more successful with your reporting.

Power BI Dataflow Performance, Premium Per User And The Enhanced Compute Engine

Over the years I have written a lot about Power BI/Power Query performance but it has always been in the context of loading data direct into datasets, not dataflows. A lot of cool things have been happening in dataflows recently, though, and now that Premium Per User has made Premium features available to a much wider audience I thought it would be interesting to look at an example of how PPU can help dataflow performance and specifically how and when the Enhanced Compute Engine can make dataflow refresh faster.

Using the same CSV file that I used in my posts from last year on optimising the performance of merges in Power Query, a file with one million rows and seven numeric columns named A, B, C, D, E, F and G, I created the following dataflow in a shared capacity (ie non-Premium) workspace:

The queries called First and Second are identical and just load all the data from the (same) source CSV file; they also have their load disabled. The query called Merge does an inner join between these two queries on the column called A:

The Merge query has its load enabled so it’s the only output of the dataflow; after it has joined the data it expands the nested columns returned and sets the data types on all the output columns.
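For reference, here’s roughly what the M for the Merge query looks like – a sketch rather than the exact code, with the inner join on column A, the nested table column expanded and the data types set at the end:

let
  // Inner join First and Second on column A
  Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
  // Expand the nested table column returned by the join
  Expanded = Table.ExpandTableColumn(
    Source,
    "Second",
    {"A", "B", "C", "D", "E", "F", "G"},
    {"Second.A", "Second.B", "Second.C", "Second.D", "Second.E", "Second.F", "Second.G"}
  ),
  // Set data types on all the output columns (they are all numeric)
  ChangedType = Table.TransformColumnTypes(
    Expanded,
    List.Transform(Table.ColumnNames(Expanded), each {_, type number})
  )
in
  ChangedType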

Refreshing this dataflow in shared capacity took on average 150 seconds.

I then moved the workspace to Premium Per User capacity and without making any changes, I refreshed again.

Refreshing the same dataflow in PPU took on average 73 seconds.

So the first finding is that moving the dataflow to PPU more than halved the refresh time, which is a pretty good result.

However, at this point the Enhanced Compute Engine is not being used – so, to bring it into play, I enabled loading for the First and Second queries, which in turn made the Merge query a Computed Table (what used to be called a Computed Entity before the recent terminology changes, indicated by the lightning bolt icon):

For a full explanation of when the Enhanced Compute Engine can and can’t be used see Matthew Roche’s blog post here; basically it loads data into a SQL-based cache which Computed Tables can then leverage, meaning that data access is faster and the Power Query engine can push transformations back to it via query folding. The only other change I made was to set data types on the columns in the output of First and Second.

Refreshing this new version of the dataflow in PPU took on average 90 seconds.

So performance was worse – but why? Enabling loading on First and Second means that more work is done at refresh time because their output needs to be ingested twice (once into ADLSgen2 and once into the SQL cache used by the Enhanced Compute Engine) before the Enhanced Compute Engine can access it. In this case the extra work needed to load First and Second outweighs the performance gains from using the Enhanced Compute Engine. The new metrics available from the dataflow’s Refresh History provide some insight into this (I strongly recommend you read the docs on these metrics here); here’s some of the data from one of the refresh history CSV files loaded into Excel:

In this particular case the overall refresh time of the dataflow was 88 seconds. First and Second refreshed in parallel – First taking 48 seconds and Second taking 51 seconds – and once they had both finished, Merge could refresh and only took 36 seconds to join the output of First and Second. So in this case Merge is indeed faster (36 seconds compared to 73 seconds before) as a result of using the Enhanced Compute Engine but that improvement isn’t enough to cancel out the additional time needed to load the data returned by First and Second into it.

What about a scenario where the Enhanced Compute Engine does make a positive difference? Take a look at the following dataflow, a slight variation on the dataflow above:

There are now three new tables: Output Table 1, Output Table 2 and Output Table 3. Each of these tables gets the maximum value from a different column in the table returned by Merge. Note that there are no Computed Tables in this dataflow so the Enhanced Compute Engine is not used, and that First, Second and Merge have load disabled.
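Each of these output tables is just a small query referencing Merge. Here’s a sketch of what Output Table 1 might look like, assuming it takes the maximum of column B (the column choice and output column name are hypothetical):

let
  // Reference the Merge query and take the maximum of one of its columns
  Source = Merge,
  MaxOfB = List.Max(Source[B]),
  // Return the result as a single-row, single-column table
  Output = #table(type table [MaxB = number], {{MaxOfB}})
in
  Output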

Refreshing this dataflow on PPU took on average 95 seconds.

Here are the refresh metrics for one of the refreshes:

As you can see, the three tables were refreshed in parallel and took between 84 and 93 seconds. It’s important to remember that for each of these tables the source data was loaded and the Merge query evaluated independently, which explains why they each take so long. The fact that Merge is evaluated three times when this dataflow refreshes is counter-intuitive but really important here – for more details see this post on how queries are evaluated in Power Query.

Now consider this version of the same dataflow where First, Second and Merge have their load enabled, making Merge, Output Table 1, Output Table 2 and Output Table 3 all Computed Tables.

Refreshing this dataflow on PPU took on average 88 seconds.

Not a massive improvement, but an improvement. Now look at how different the refresh metrics are:

In this case Output Table 1, Output Table 2 and Output Table 3 only take 1 second to evaluate, but that’s because they are working from data cached in the Enhanced Compute Engine – the table returned by Merge – and the transformations in them fold. The Merge table also uses data cached in the Enhanced Compute Engine: the tables returned by First and Second. What’s more, because Merge is a Computed Table it is only evaluated once in this dataflow. Loading the data for First and Second takes 52 seconds and 50 seconds respectively and Merge takes 35 seconds. In this case the hit of loading the data into the Enhanced Compute Engine is worth taking.

In conclusion, there are two things that these tests have shown:

  • Moving your dataflow to PPU can make a big difference to refresh performance.
  • The Enhanced Compute Engine can make dataflow refresh faster but not in all cases: you need to understand how it works, and in some cases the overhead of loading the data into it outweighs the performance advantages it gives you for any transformations later on. Use the information in Refresh History to work out what’s happening for your dataflow.

Importing Data From ADLSgen2 Into Power BI: Summary Of Findings

Over the last few months I’ve written a series of posts looking at different aspects of one question: what is the best way to import data from ADLSgen2 storage into a Power BI dataset? For example, is Parquet really better than CSV? Should you use Azure Synapse Serverless? In this post I’m going to summarise my findings and offer some recommendations – although, as always, I need to stress that these are the conclusions I can draw from my test results and not the absolute, incontrovertible “Microsoft-says” truth so please do your own testing too.

Partitioning makes import faster

Whatever other choices you make about importing data into Power BI, creating partitioned tables in your dataset is the key to getting the best refresh performance (see here for more details, and here for more thoughts/findings on the subject). However, creating partitioned tables manually adds a lot of complexity since you need to use external tools like Tabular Editor to do so; it also adds cost since you need to have Premium or PPU to get access to XMLA Endpoints for Tabular Editor to connect to. The bigger the Premium capacity SKU you use, the more partitions you’ll be able to refresh in parallel and the faster your refresh will be.

Since incremental refresh also creates partitions in the background, and incremental refresh does not require Premium or PPU, you may want to consider using it instead of creating partitions manually but it’s nowhere near as flexible and if you’re connecting direct to ADLSgen2 then you’d have to use the technique that Miguel Escobar describes here to ensure that query folding takes place.

Do you need to filter?

The most important question you need to ask in this whole process is this:

  1. Are you loading only some of the data from one or more files (for example by filtering on the values in one or more columns), or
  2. Are you loading all the data (ie all the rows) from one or more files?

Scenario #1 is, I guess, a classic data lake scenario where you have multiple Power BI datasets created by different users, each of which is importing a different subset of the data from the files in the lake. Scenario #2, on the other hand, will be the case when you’re creating a set of files in ADLSgen2 that contain exactly the data you want to load into a single Power BI dataset for a specific project.

If your answer is scenario #1 and you’re filtering data before you load, then you’ll get the best import performance if you store your data in Parquet files and query it through Azure Synapse Serverless (see here and here). Although Power BI can take advantage of the Parquet format to a certain extent and will give you better performance if you are only importing some of the columns from a file (see here), it doesn’t do predicate pushdown. There is an additional cost associated with using Synapse Serverless, of course, but it’s very reasonably priced (see here for more details on how the pricing works).
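As a very rough illustration of scenario #1, here’s a hypothetical M sketch that connects to a Synapse Serverless SQL endpoint and filters a view defined over the Parquet files. Because the Serverless endpoint looks like a SQL Server data source to Power Query the filter can fold, so the predicate is evaluated by Serverless rather than by Power BI after the data has been loaded; the server, database, view and column names are all made up:

let
  // Hypothetical Serverless SQL endpoint and database
  Source = Sql.Database(
    "myworkspace-ondemand.sql.azuresynapse.net",
    "SalesLake"
  ),
  // A view created in Serverless over the Parquet files
  DailySales = Source{[Schema = "dbo", Item = "DailySales"]}[Data],
  // This filter folds back to Serverless, so only the matching rows are imported
  Filtered = Table.SelectRows(DailySales, each [TransDate] >= #date(2021, 1, 1))
in
  Filtered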

If your answer is scenario #2 and you’re not filtering data, then you’ll get better performance (and lower costs) by connecting to your files in ADLSgen2 direct from Power BI. Using Azure Synapse Serverless isn’t necessarily a bad option but it does come with an overhead.

Combining data from multiple Parquet files can be optimised

If you are connecting direct to files in ADLSgen2 (and aren’t using Azure Synapse Serverless) and aren’t creating one partition per file then you’ll be combining data from multiple files in your Power Query M code. The code that Power Query generates automatically when you do this performs faster for CSV files than Parquet files (see here) but as I show here, with some simple changes you can create a much faster query to combine data from multiple Parquet files – although this technique does not work with CSV files.

Always use HierarchicalNavigation=true with AzureStorage.DataLake()

If you are not using Azure Synapse Serverless and are reading the data direct from ADLSgen2 using the AzureStorage.DataLake() M function then you should always set the HierarchicalNavigation=true option. As I showed in this post you can get some significant performance benefits from using this option. There are other options that you can set on AzureStorage.DataLake() but I couldn’t find any benefits from using them (see here) – probably because requesting data from ADLSgen2 is relatively fast and the act of loading the data returned into a table in your dataset is much slower (see here).
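Here’s what that looks like in practice – a minimal sketch with a made-up storage account and container:

let
  // Hypothetical ADLSgen2 account and container URL
  Source = AzureStorage.DataLake(
    "https://myaccount.dfs.core.windows.net/mycontainer",
    [HierarchicalNavigation = true]
  )
in
  Source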

Attaching Common Data Model folders as Dataflows can be a user-friendly option

While connecting to tables in Azure Synapse Serverless is reasonably straightforward, connecting direct to files and folders in ADLSgen2 can be quite intimidating for inexperienced Power BI users. As a result exposing ADLSgen2 data stored in Common Data Model format by attaching it as a Dataflow may be worth considering. There’s a performance difference between doing this and connecting direct to multiple CSV files (see here) but it’s certainly a lot more user-friendly. It’s also worth noting that support for the newer CDM manifest format in Dataflows has just been announced, although Dataflows don’t support CDM folders with data stored in Parquet format yet.

Partitioned Tables, Power BI And Parquet Files In ADLSgen2

Earlier in this series on importing data from ADLSgen2 into Power BI I showed how partitioning a table in your dataset can improve refresh performance. In that post I used CSV files in ADLSgen2 as my source and created one partition per CSV file, but after my recent discovery that importing data from multiple Parquet files can be tuned to be a lot faster than importing data from CSV files, I decided to try creating partitions linked to Parquet files instead.

As a reminder, here’s what Phil Seamark’s Power BI refresh report showed for the version of my dataset with one partition per CSV file:

As you can see, refresh in the Power BI Service took 40 seconds and throughput was just under 200,000 rows per second.

Here’s the same report for a partitioned dataset based on five Parquet files (with one partition per Parquet file) containing exactly the same data:

Using Parquet files as a source, refresh took only 33 seconds and throughput was almost 250,000 rows per second.

But what does this actually tell me? It shows that it is possible to improve on the previous 40-second refresh time quite substantially. It certainly looks like reading data from individual Parquet files is faster than reading data from individual CSV files too; however I’m reluctant to draw too many other conclusions. This is because:

  • I only had five Parquet files in this case, compared to nine CSV files. I suspect that some extra parallelism might be possible if I had six Parquet files instead of just five, because there are six refresh slots available in the PPU workspace I’m using.
  • This then leads onto the question of whether the number and size of the files affects refresh performance in the same way for CSV files as for Parquet files (any difference could be something to do with the format itself or the Power BI connectors used to read the data from the files).
  • This also leads onto the question of partitioning strategies – after all, you probably also want to only refresh partitions where data has changed, which will also have a big impact on refresh times – and the distribution of data over partitions. That’s out of scope for this series, but if you have a copy of Marco Russo and Alberto Ferrari’s book “Tabular Modeling in Microsoft SQL Server Analysis Services” then check out chapter 11 and the section in chapter 12 on “Segmentation and partitioning”: what’s good for Analysis Services Tabular is mostly good for Power BI, although bear in mind that Power BI uses segment sizes of 1 million rows.

Optimising The Performance Of Combining Data From Multiple Parquet Files In Power Query/Power BI

In all the testing I’ve done recently with importing data from Parquet files into Power BI I noticed something strange: loading data from a folder containing multiple Parquet files seemed a lot slower than I would expect, based on the time taken to load data from a single file. So I wondered – is there something that can be optimised? It turns out there is and in this blog post I’ll show you what I did.

If you import data from a folder containing Parquet files – whether it’s a local folder or a folder in ADLSgen2 storage – you’ll see a series of queries created for you in the Power Query Editor window that looks like this:

The query called Query1 shown in the screenshot iterates over all the files in the folder you’ve chosen and calls a function that reads the data from each Parquet file. It returns a table that contains a column with the name of the original source file in (which isn’t all that interesting for Parquet files) and all the columns from the Parquet files you’re combining.

Using the Parquet files from my series of posts on importing data from ADLSgen2 as a source, here’s the M code Power Query generates for this query, which I have modified to remove the column with the source file name in:

let
  Source = Folder.Files("C:\MyFolder"),
  #"Filtered Hidden Files1" = Table.SelectRows(
    Source,
    each [Attributes]?[Hidden]? <> true
  ),
  #"Invoke Custom Function1" = Table.AddColumn(
    #"Filtered Hidden Files1",
    "Transform File (3)",
    each #"Transform File (3)"([Content])
  ),
  #"Renamed Columns1" = Table.RenameColumns(
    #"Invoke Custom Function1",
    {"Name", "Source.Name"}
  ),
  #"Removed Other Columns1" = Table.SelectColumns(
    #"Renamed Columns1",
    {"Transform File (3)"}
  ),
  #"Expanded Table Column1" = Table.ExpandTableColumn(
    #"Removed Other Columns1",
    "Transform File (3)",
    Table.ColumnNames(#"Transform File (3)"(#"Sample File (3)"))
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Table Column1",
    {
      {"TransDate", type date},
      {"GuestId", type text},
      {"ProductId", type text},
      {"NetAmount", type number}
    }
  )
in
  #"Changed Type"

Here’s the output:

On my PC this query took an average of 102 seconds to refresh.

Apart from this query being slower than I expected, I also noticed that there is a “Changed Type” step at the end – which I thought was unnecessary because unlike CSV files, Parquet has typed columns. If you connect to a single Parquet file in Power Query then it recognises the column types, so why not here? Well, it’s because of the way it’s combining files by expanding table columns, and there is a way to work around this that I blogged about here:
https://blog.crossjoin.co.uk/2017/09/25/setting-data-types-on-nested-tables-in-m/

Setting a type on the table column before expanding it did indeed improve performance, but this led me to another optimisation.

I know that using the Table.Combine M function can perform differently to the Table.ExpandTableColumn function used in the original version of the query (although it does not always perform better). Therefore I made the following change to the query above: using Table.Combine to return a single table with all the data in (note that setting a type on the table column is not necessary for this optimisation). Here’s the new version:

let
  Source = Folder.Files("C:\Myfolder"),
  #"Filtered Hidden Files1" = Table.SelectRows(
    Source,
    each [Attributes]?[Hidden]? <> true
  ),
  #"Invoke Custom Function1" = Table.AddColumn(
    #"Filtered Hidden Files1",
    "Transform File",
    each #"Transform File"([Content])
  ),
  #"Renamed Columns1" = Table.RenameColumns(
    #"Invoke Custom Function1",
    {"Name", "Source.Name"}
  ),
  #"Removed Other Columns1" = Table.SelectColumns(
    #"Renamed Columns1",
    {"Source.Name", "Transform File"}
  ),
  Combine = Table.Combine(#"Removed Other Columns1"[Transform File])
in
  Combine

This version of the query took, on average, 43 seconds to refresh – a massive improvement.

If you’ve been following my series on ADLSgen2 refresh you may remember that I blogged about importing from a folder of Parquet files there too: in this post I noted that it took on average 72 seconds to load the same data from an ADLSgen2 folder in the Power BI Service using the original code; that was with the Source File column in and removing that column made no different to performance. This new version of the query took on average 49 seconds.

The conclusion is obvious: if you need to load data from a folder of Parquet files then you should use this new approach because the performance benefits are substantial. I know what you’re thinking: does this technique work for other file types apart from Parquet, like CSV? Unfortunately the answer is no, because these file types don’t have typed columns in the way Parquet does.
