One of the most important properties you can set in a Power BI DirectQuery semantic model is the “Maximum connections per data source” property, which controls the number of connections that can be used to run queries against a data source. The good news is that the maximum value that you can set this property to has just been increased in Premium.
This property is important because the number of open connections to a data source acts as a limit on the number of concurrent queries that the semantic model can run against the source: each connection can only have one query running on it at any one time. If you have Power BI reports with a large number of visuals on a page and/or a large number of users running reports at the same time, then it is very easy for a DirectQuery semantic model to need to send lots of queries to your data source simultaneously. If some of the queries that your semantic model runs against your data source are slow – taking more than a second or two, even – then the number of queries that need to run at any given time will increase, because each slow query holds its connection open for longer. The same is true if you have increased the Max Parallelism Per Query property to increase the number of parallel queries that can be generated by a single DAX query.
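As a back-of-the-envelope illustration (not a model of Power BI's actual scheduler), you can think of queries executing in "waves" of at most the connection limit; the function and numbers below are purely hypothetical:

```python
import math

def refresh_waves(num_queries: int, max_connections: int, avg_query_secs: float) -> float:
    """Rough lower bound on total time when queries queue on a fixed connection pool.

    Each connection runs one query at a time, so queries execute in 'waves'
    of at most max_connections. All figures are illustrative assumptions,
    not measured Power BI behaviour.
    """
    waves = math.ceil(num_queries / max_connections)
    return waves * avg_query_secs

# 40 queries from a busy report page, 10 connections, 2s per query:
# at least 4 waves, so at least 8 seconds
print(refresh_waves(40, 10, 2.0))   # 8.0
# raising the limit to 30 halves the number of waves
print(refresh_waves(40, 30, 2.0))   # 4.0
```

This is why raising the limit helps only up to the point where the source itself becomes the bottleneck.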
This property is documented in a number of places, including the DirectQuery guidance documentation and in data source-specific best practice documents such as this one for Snowflake. You can set the property in Power BI Desktop in the Current File/DirectQuery section of the Options dialog:
If you are not using Power BI Premium (ie you are using Power BI Shared capacity, also known as Power BI Pro) then the maximum value that you can set this property to is 10. If you are using Power BI Premium then the maximum value up to today was 30 but now that limit has been increased. The table on this page shows what the new limits per SKU are:
As you can see, for a P1/F64 the maximum limit is now 50 rather than 30 and this limit goes all the way up to 200 for a P4/F512 and higher.
I’ve seen plenty of cases where increasing the value of this property makes Power BI reports run a lot faster. However, this will only be true if your data source is able to handle the number of queries that Power BI is trying to run against it. As I showed in this post, if your data source can’t handle the number of queries you’re trying to run then performance will get worse and not better, so you should try different values to see which one works best.
In my last post but one I showed how you could create a DGML file that contains detailed information on what happens during a Power BI Import mode refresh – the “job graph” – and visualise that in Visual Studio or VS Code, to help with performance tuning. In my last post, I explained the concepts of blocking and waiting which are key to understanding this data. In this post I’ll share and describe a Dataflow template that extracts the data contained in the DGML file to a table for other types of analysis – which turns out to be quite simple, because DGML is an XML-based format which is easy to work with in Power Query and Dataflows.
You can download the .pqt template file here (for more information on how to use templates see here). To use it, create a new Dataflow Gen2 and import the template file. You’ll see the following queries:
Next, change the values of the three parameters to contain the IDs of the refresh whose DGML file you want to load (the notebook in this post that creates the DGML file uses the refresh ID as the file’s name) and the workspace and lakehouse where that file is stored. Finally, set output destinations on either or both of the RefreshJobs and RefreshJobsPivoted queries. The RefreshJobs query returns a table with one row per refresh job:
This gives you all the data in the DGML file that was visible using a DGML viewer but in table form. For each job you get the CreatedAt, RunnableAt, StartedAt and FinishedAt datetimes, the Blocked, Wait and Running durations, the Slot ID, the CPU used and whether the job is on the critical path.
The RefreshJobsPivoted query gives you exactly the same data but instead of giving you one row per job, you get three: one row for the blocked, waiting and running phases of each job, with the start and end times for each phase and the duration. This may be a more convenient format for visualisation and analysis:
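To see what this pivoted shape looks like, here's a small pandas sketch using made-up timestamps for three hypothetical jobs; the real query does the same reshaping, but on the data extracted from the DGML file:

```python
import pandas as pd

# Hypothetical RefreshJobs-style data: one row per job, four timestamps.
jobs = pd.DataFrame({
    "Job": ["X", "Y", "XY"],
    "CreatedAt":  pd.to_datetime(["10:00:00", "10:00:00", "10:00:00"]),
    "RunnableAt": pd.to_datetime(["10:00:00", "10:00:00", "10:01:00"]),
    "StartedAt":  pd.to_datetime(["10:00:00", "10:01:00", "10:01:10"]),
    "FinishedAt": pd.to_datetime(["10:01:00", "10:01:10", "10:01:15"]),
})

# Reshape to one row per (job, phase) with start, end and duration,
# mirroring what the RefreshJobsPivoted query produces.
phases = pd.concat([
    jobs.assign(Phase="Blocked", Start=jobs["CreatedAt"],  End=jobs["RunnableAt"]),
    jobs.assign(Phase="Waiting", Start=jobs["RunnableAt"], End=jobs["StartedAt"]),
    jobs.assign(Phase="Running", Start=jobs["StartedAt"],  End=jobs["FinishedAt"]),
])[["Job", "Phase", "Start", "End"]]
phases["DurationSecs"] = (phases["End"] - phases["Start"]).dt.total_seconds()
print(phases)
```

Three rows per job, one per phase, which is much easier to feed into a Gantt-style visual.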
The Links query gives you the dependencies between the jobs:
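Because DGML is just XML, the same nodes-and-links extraction the Dataflow performs can also be sketched in a few lines of Python; the sample document below is hand-written and far simpler than a real job graph file, whose node attributes will differ:

```python
import xml.etree.ElementTree as ET

# Minimal hand-written DGML standing in for a real job graph file.
dgml = """<DirectedGraph xmlns="http://schemas.microsoft.com/vs/2009/dgml">
  <Nodes>
    <Node Id="1" Label="Process Partition X[X]" />
    <Node Id="2" Label="Process Partition XYUnion[XYUnion]" />
  </Nodes>
  <Links>
    <Link Source="2" Target="1" />
  </Links>
</DirectedGraph>"""

# DGML elements live in a namespace, so register it for the XPath queries.
ns = {"d": "http://schemas.microsoft.com/vs/2009/dgml"}
root = ET.fromstring(dgml)
nodes = [n.attrib for n in root.findall("d:Nodes/d:Node", ns)]
links = [(l.get("Source"), l.get("Target")) for l in root.findall("d:Links/d:Link", ns)]
print(nodes)
print(links)  # [('2', '1')]
```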
Having loaded all this data into a Lakehouse you can now build a report on it. As I said, I’m sure there’s a way of visualising all of this data in a way that shows all the durations and dependencies, but even a basic table report on the RefreshJobs table like this is really useful because it shows you which jobs were blocked, which ones had to wait, and which ones took a long time to run:
In this example (the Dependencies example from my previous post, but where the refresh has maxParallelism set to 1, so there is blocking as well as waiting) you can see that refreshing table X took 60 seconds, that the refresh for table Y had to wait for 60 seconds before it could start and took 10 seconds, and that refresh for table XY was blocked for 70 seconds before it could start. So, if you wanted to make this refresh run faster, you would want to understand why table X was so slow and also look at increasing the amount of parallelism so Y did not have to wait.
Following on from my previous post showing how you can visualise the job graph for a Power BI Import mode semantic model refresh, in this post I will look at how you can interpret what the job graph tells you – specifically, by explaining the concepts of blocking and waiting. As always, simple examples are the best way of doing this.
Blocking
Blocking occurs when one job can only start when one or more other jobs have completed because there is a dependency between them. Consider the following semantic model:
X and Y are tables that contain a single numeric column. X takes 1 minute to load while Y takes 10 seconds to load (I forced this delay using a technique I blogged about here). XYUnion is a DAX calculated table that unions all the rows from X and Y with the following definition:
XYUnion = UNION(X,Y)
As you can imagine, the job that refreshes XYUnion can only start once the jobs that refresh both X and Y have finished; XYUnion will be blocked until both X and Y have refreshed. Here’s what the job graph for the refresh of this semantic model looks like:
At the centre is the job “Process Partition XYUnion[XYUnion]” which refreshes the calculated table XYUnion. The arrows going from this job to the jobs “Process Partition X[X]” and “Process Partition Y[Y]”, which refresh tables X and Y, show that this job depends on those two other jobs.
Hovering over the “Process Partition XYUnion[XYUnion]” job shows the following popup:
There are four datetime values here: CreatedAt, RunnableAt, StartedAt and FinishedAt. There are also three durations:
Block is the elapsed time between CreatedAt and RunnableAt, and is the amount of time elapsed before all the jobs this job depends on were completed. Anything other than a zero here means that blocking has occurred.
Wait is the elapsed time between RunnableAt and StartedAt. A job becomes runnable when all the jobs it depends on have completed, but even then it may still not be able to start because Power BI only allows a certain number of jobs to refresh in parallel (see this post for more details and how to control the amount of parallelism). Waiting is described in the example below.
Run is the elapsed time between StartedAt and FinishedAt and is the amount of time the job itself took to run.
In this case you can see that the value for Block for XYUnion is 1 minute: X and Y have no preceding jobs so they kick off at the same time, X takes 1 minute to run and Y takes 10 seconds, so it is 1 minute before XYUnion can run. The popups for X and Y show 1 minute and 10 seconds respectively for Run, as you would expect:
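These durations are just simple timestamp arithmetic; the sketch below reproduces them with made-up timestamps (XYUnion's own run time is not shown in the text, so 5 seconds is assumed):

```python
from datetime import datetime

def job_durations(created_at, runnable_at, started_at, finished_at):
    """Derive the three durations from the four job-graph timestamps."""
    return {
        "Block": (runnable_at - created_at).total_seconds(),
        "Wait":  (started_at - runnable_at).total_seconds(),
        "Run":   (finished_at - started_at).total_seconds(),
    }

# XYUnion in the example: created at 10:00:00, runnable only at 10:01:00
# (blocked on X), starts immediately, then runs for an assumed 5 seconds.
d = job_durations(
    datetime(2024, 1, 1, 10, 0, 0),
    datetime(2024, 1, 1, 10, 1, 0),
    datetime(2024, 1, 1, 10, 1, 0),
    datetime(2024, 1, 1, 10, 1, 5),
)
print(d)  # {'Block': 60.0, 'Wait': 0.0, 'Run': 5.0}
```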
One last thing to mention: in the full job graph diagram above you’ll see that certain nodes are highlighted in red. That’s because they are on the critical path, which is documented here; it’s the chain of jobs that dictates the overall length of the refresh, so if you want to make your refresh faster then you need to tune the jobs on the critical path. In this case the critical path goes through X to XYUnion: if you wanted the whole semantic model to refresh faster you would need to tune the refresh for either X or XYUnion; tuning the refresh for Y would make no difference to the overall semantic model refresh time.
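The critical path itself can be sketched as a longest-path calculation over the dependency graph. The job names and durations below come from the example above, but the code is an illustration of the idea rather than what Power BI actually runs:

```python
import functools

# Hypothetical job graph: each job's run time in seconds and its dependencies.
run_secs = {"X": 60, "Y": 10, "XYUnion": 5}
depends_on = {"X": [], "Y": [], "XYUnion": ["X", "Y"]}

@functools.lru_cache(maxsize=None)
def finish_time(job: str) -> float:
    """Earliest finish assuming unlimited slots: a job can only start
    when its slowest dependency has finished."""
    start = max((finish_time(d) for d in depends_on[job]), default=0.0)
    return start + run_secs[job]

def critical_path() -> list:
    """Walk back from the last job to finish, always following the
    slowest dependency, to recover the chain that dictates total duration."""
    path, job = [], max(run_secs, key=finish_time)
    while True:
        path.append(job)
        if not depends_on[job]:
            return list(reversed(path))
        job = max(depends_on[job], key=finish_time)

print(critical_path())  # ['X', 'XYUnion']
```

As in the job graph, Y never appears on the critical path: speeding it up would not shorten the refresh.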
Waiting
As I already mentioned, there is a limit on the number of jobs within a refresh that can run in parallel. The maximum number of jobs is represented by the number of “slots” – one slot can only run one job at a time – and in the screenshots of the popups above you can see each job has a slot number. If there are more jobs that could be run than there are slots available at a given time then some jobs have to wait for a slot.
Here’s another example: a semantic model with three tables, A, B and C, which each take 30 seconds to refresh.
There are no dependencies between the tables so in theory each of these three tables could refresh in parallel. However if you refresh with maxParallelism set to two then only two of the three can refresh at any one time. Here’s the job graph for a refresh that does that:
As you can see the critical path goes through the refresh jobs for table A, and hovering over the “Process Partition A[A]” job shows the following:
While this job was not blocked at all, because there are no jobs it depends on, it had to wait 30 seconds for a slot to become available; it eventually ran in slot 0. Hovering over the nodes for “Process Partition B[B]” and “Process Partition C[C]” shows that they ran in slots 0 and 1 respectively and that neither of them was blocked or had to wait.
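A simple greedy scheduler sketch shows the same effect. Note that which job ends up waiting depends on the order the jobs are queued (something the job graph doesn't fully expose), so the queue order below is chosen to match the screenshot, with B and C starting first:

```python
import heapq

def schedule(jobs: dict, slots: int) -> dict:
    """Toy slot scheduler: independent jobs, FIFO queue, fixed slot count.
    Returns how long each job waited before a slot was free."""
    free_at = [0.0] * slots              # when each slot next becomes free
    heapq.heapify(free_at)
    waits = {}
    for name, run_secs in jobs.items():
        start = heapq.heappop(free_at)   # earliest free slot
        waits[name] = start              # all jobs are runnable at t=0
        heapq.heappush(free_at, start + run_secs)
    return waits

# Three 30-second jobs, maxParallelism = 2: the third queued job waits 30s.
print(schedule({"B": 30, "C": 30, "A": 30}, slots=2))
```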
The job graph isn’t always the best way of visualising this type of parallelism; Phil Seamark’s original Power BI report for visualising refreshes has a page which shows the slots and the jobs in them but I think there’s probably a better way of visualising all of this data that shows slots as well as dependencies. Maybe a Deneb visual is the answer? If anyone has ideas I’d be interested to hear them! In any case, the first step to doing this is to extract all of this data from the .DGML file into a table and that’s what I’ll demonstrate how to do in the next post in this series.
A few years ago a new pair of Profiler events was added for Power BI Import mode datasets (and indeed AAS models): the Job Graph events. I blogged about them here but they never got used by anyone because it was extremely difficult to extract useful data from them – you had to run a Profiler trace, save the trace file, run a Python script to generate a .dgml file, then open that file in Visual Studio – which was a shame because they contain a lot of really interesting, useful information. The good news is that with the release of Semantic Link in Fabric and the ability to run Profiler traces from a Fabric notebook it’s now much easier to access Job Graph data and in this blog post I’ll show you how.
Quick recap: what are the Job Graph events and why are they useful? Let’s say you have a Power BI Import mode semantic model and you want to optimise refresh performance. When you refresh a semantic model, that refresh is made up of multiple jobs which themselves are made up of multiple jobs: refreshing a semantic model involves refreshing all the tables in that model, refreshing a table involves refreshing all the partitions in that table, refreshing a partition involves loading the data and building attribute hierarchies, and so on. Some of these jobs can happen in parallel but in some cases there are dependencies between jobs, so one job can only start when another has completed. The Job Graph events give you information on these refresh jobs and the dependencies between them so you can work out which jobs you need to optimise. In order to capture information from them you need to run a trace while the semantic model is being refreshed; the data from some of these Job Graph events can be reconstituted into a Directed Graph Markup Language (DGML) file, which is an XML-based format, and once you’ve got that you can either visualise the DGML file using a suitable viewer or extract the data from it and analyse it further.
[Before I carry on I have to acknowledge that I’m extremely new at Python and a lot of the code in this post is adapted from the code in my colleague Phil Seamark’s excellent recent post on visualising Power BI refresh information with Semantic Link. Any feedback on ways to optimise the code is gratefully received.]
Here’s some Python code that you can use in a Fabric notebook to run a refresh and generate a DGML file. Each code snippet can be used in a separate code cell or combined into a single cell.
First of all you need to install Semantic Link:
%pip install semantic-link
Next you need to define the events you want in your trace, which in this case are just the Job Graph events:
import sempy.fabric as fabric
import pandas as pd
import time
import warnings

# define events to trace and their corresponding columns
# (the column list is assumed here - adjust it to the trace columns you want to capture)
base_cols = ["EventClass", "EventSubclass", "TextData", "IntegerData"]

event_schema = {"JobGraph": base_cols}

warnings.filterwarnings("ignore")
You then need to start a trace using this definition, refresh the semantic model, stop the trace and filter the events captured so you only have those with the EventSubclass GraphFinished, remove the event which contains the metadata (which has a value of 0 in the IntegerData column) and then finally sort the rows in ascending order by the values in the IntegerData column:
WorkspaceName = "Insert workspace name here"
SemanticModelName = "Insert semantic model name here"

with fabric.create_trace_connection(SemanticModelName, WorkspaceName) as trace_connection:
    # create trace on server with specified events
    with trace_connection.create_trace(event_schema, "Simple Refresh Trace") as trace:
        trace.start()

        # run the refresh
        request_status_id = fabric.refresh_dataset(SemanticModelName, WorkspaceName, refresh_type="full")
        print("Progress:", end="")

        while True:
            status = fabric.get_refresh_execution_details(SemanticModelName, request_status_id, WorkspaceName).status
            if status == "Completed":
                break
            print("░", end="")
            time.sleep(2)

        print(": refresh complete")
        # allow ending events to collect
        time.sleep(5)

        # stop trace and collect logs
        final_trace_logs = trace.stop()

# only return GraphFinished events
final_trace_logs = final_trace_logs[final_trace_logs['Event Subclass'].isin(["GraphFinished"])]
# ignore metadata row
final_trace_logs = final_trace_logs[final_trace_logs['Integer Data'].ne(0)]
# sort in ascending order by Integer Data column
final_trace_logs = final_trace_logs.sort_values(by=['Integer Data'], ascending=True)
Finally, you need to take all the text from the EventText column of the remaining events and concatenate it to get the contents of the DGML file and then save that file to the Files section of the Lakehouse attached to your notebook:
# concatenate all text in TextData column
out = ''.join(final_trace_logs['Text Data'])
# change background colour of critical path nodes so it's easier to see in VS Code
out = out.replace("#263238", "#eba0a7")

# save the DGML file to the Files section of the Lakehouse attached to the notebook,
# using the refresh ID as the file name (path assumed)
with open(f"/lakehouse/default/Files/{request_status_id}.dgml", "w") as f:
    f.write(out)

# dispose of trace connection
trace_connection.disconnect_and_dispose()
I found a nice Visual Studio Code extension called DGMLViewer which makes viewing DGML files easy. Rather than manually downloading the file, OneLake Explorer makes it easy to sync files in OneLake with your PC in a very similar way to OneDrive, which makes working with these DGML files in VS Code very straightforward because you can simply open the local copy when it syncs.
Here’s what one of these DGML files, generated from the refresh of a very basic semantic model, looks like when viewed in DGML Viewer:
If you have Visual Studio you can also use it to view DGML files (you need to install the DGML Editor first); I found a VS extension called DgmlPowerTools 2022 which adds some advanced features. Here’s what a DGML file for a refresh looks like when visualised in Visual Studio 2022:
OK, so this looks cool but it also looks very complicated. What does it all mean? How can you interpret all this information and use it to optimise a refresh? That’s something for a future blog post!
[In my next post I look at how you can interpret this data and understand the concepts of blocking and waiting, and in the post after that show how you can extract the data in this DGML file to a table using a Dataflow]
Every year, on the anniversary of the first-ever post on this blog, I write a post reflecting on what has happened to me professionally in the past year and this year the subject is obvious: Fabric. Of course I knew about, and indeed had been playing with, Fabric for a long time before even the private previews but it was only with the public preview and then GA that I felt the real impact on my job. I now work on the Fabric Customer Advisory Team at Microsoft rather than the Power BI Customer Advisory Team, for example, and a lot more than my job title has changed. Instead of helping customers with, and collecting feedback on, just Power BI I now have to know about the whole of Fabric – which is a challenge. But how much can anyone know about the whole of Fabric? Is it all too much for one person?
This is already a problem with Power BI of course. Back when Power BI first appeared I felt reasonably confident that I knew a fair amount about everything to do with it. Soon, though, I gave up on trying to know much about data visualisation and now when I see the amazing things people do with Deneb I’m full of admiration but I accept this is something I’ll never have time to learn. I think I’m better than average at DAX but I’ll never be anywhere near as good as Marco and Alberto. Even with Power Query and M my knowledge of creating custom connectors is limited. And that’s all ok, I know what I need to know to do my job and luckily I’m not the kind of person who feels overwhelmed by the constant flow of updates although I know a lot of people do (see this great article by Kurt Buhler on ways of dealing with this). If I don’t know everything about Power BI today then I can say confidently that I’ll never know everything about Fabric in the future. No-one can. You need a team. The question, then, is how job roles will evolve within the Fabric community. Who will end up doing what, and how will we organise Fabric development?
There are two ways you can approach Fabric development, and the two ways remind me of Isaiah Berlin’s famous essay on “The Hedgehog and the Fox” (who says arts degrees are useless?). Unless you’re a fan of Tolstoy you don’t need to read it all; the essay turns on a line from the Greek poet Archilochus which says: “The fox knows many things, but the hedgehog knows one big thing” and explores how thinkers and writers, indeed all human beings, can be classified as either ‘foxes’ or ‘hedgehogs’. I think this classification can be useful when thinking about Fabric development too.
I’m definitely a hedgehog: I like to go deep rather than wide. Things were easier in the old days when the Microsoft BI stack was made of discrete products (Analysis Services, Integration Services, Reporting Services, SQL Server) and you could specialise in just one or two – the dividing lines were clear and pre-drawn and everyone working in BI had to be a hedgehog. It’s tempting to take the same approach with Fabric and encourage people to specialise in individual workloads, to become Fabric Spark or Fabric Warehouse professionals for example. It’s unsurprising that people from an IT background are natural hedgehogs but I also think this risks repeating the mistakes of the past: going back to the old days again, when I was an Analysis Services consultant I regularly saw teams who were focused on building their data warehouse in SQL Server but who never gave a second thought to what was meant to happen downstream, so the cubes, reports and spreadsheets suffered accordingly and so the solution as a whole was seen as a failure by the business. While it might seem as though a team of hedgehogs will give you a team of experts in all the areas you need, it can often result in a team of people focused on building specific parts of a solution rather than a functioning whole.
The alternative is to be a fox and I think most Power BI users who come from the business or are data analysts are foxes: they have to know a bit about everything in order to build an end-to-end solution because they work on their own and have to get stuff done. Self-service BI is inherently foxy. While a single Fabric fox can build a whole solution, though, and is more likely to deliver value to the business, the quality of the constituent parts may be variable depending on how well the developer knows each workload. A team of foxes won’t solve that problem either: two people who half-understand DAX aren’t equivalent to one person who is a DAX expert. Too much self-service development delivers short-term business value which is cancelled out in the long-term under the weight of the technical debt incurred.
I don’t think you choose to be a fox or a hedgehog; your personality, learning style and job circumstances push you in one direction or the other. In some organisations there will only be foxes or only be hedgehogs with all the attendant consequences. However, the fact that Fabric brings together everything you need for BI development in a single, tightly-integrated platform, and the fact that its functionality and price mean it’s suitable for both self-service and enterprise BI development, means that there’s an opportunity to combine hedgehogs and foxes in a new, more effective way.
The magic formula for Fabric must be similar to what the most successful Power BI organisations are already doing today: the hedgehogs in IT build the foundations, provide the core data, model it, and then the foxes in the business take these components and turn them into solutions. This will require a big change in the way the hedgehogs work and think: no more BI projects, no more building end-to-end. It should be rare for a hedgehog to ever build a report for the business. Instead the aim is to build a platform. Equally the foxes will have to change the way they work too: no more direct access to data sources, no more building end-to-end. It should be rare for a fox to ever bring external data into Fabric. Instead the aim is to build on the platform that has been provided by the hedgehogs. If the hedgehogs have done their job properly then the foxes will be happy – although that’s a big “if”. Finding the right organisational structures to bring hedgehogs and foxes together will be a challenge; it will be even more difficult to overcome the mutual distrust that has always existed between IT and the business.
So, to go back to the original question, who will do what in Fabric? On the hedgehog side there will need to be a platform architect and various specialists: in data ingestion, Spark and/or in Fabric Warehouse, in real-time analytics, in designing core semantic models and DAX. On the fox side the required skills will be basic data transformation using Dataflows, basic semantic modelling/DAX using OneLake, and advanced report design and data visualisation. This seems to me like a division of labour that results in each team member having an area of Fabric they have a chance of learning and mastering and one that will deliver the best outcomes.
A year ago support for nested data types in Excel was announced on the Excel blog, but the announcement didn’t have much detail about what nested data types are and the docs are quite vague too. I was recently asked how to create a nested data type and while it turns out to be quite easy, I thought it would be good to write a short post showing how to do it.
Let’s say you have a Power Query query that returns details of the different types of fruit that you sell in your shop:
Let’s also say that the last three columns in this table (Grower, Address and Specialty) all relate to the farmer that supplies you with this fruit. Now you could create one Excel data type with all these columns in, but nested data types allow you to create data types inside data types, so in this case you can create a data type specifically for these three columns relating to the farmer and then nest it inside the main data type.
To do this, select these three columns in the Power Query Editor and click the Create Data Type button on the Transform tab in the ribbon:
Give the data type a name, in this case Farmer, in the Create Data Type dialog:
At this point you’ll have a query that returns a table where one of the columns, Farmer, contains a data type:
Finally, select all the columns in this table, including the Farmer column, and click the Create Data Type button again to create another new data type, this time called Product:
Here’s what you’ll see in the Power Query Editor once you’ve done this:
And here’s the result in the Power Query Editor:
Once you’ve loaded this query to the worksheet you can explore the nested data type via the card popup:
Or you can access the data in the nested data type using a formula. For example, in the animated gif above, cell A2 contains the data type for Apples, so the formula
=A2.Farmer.Address
…returns the address of the farmer who grows apples.
Alternatively, you can use the Excel FieldValue function to get the same result:
I don’t know how, but somehow I missed the announcement of the preview of Project Sophia at Ignite (it’s not built by the Fabric product group and I’ve been distracted by all the cool stuff we’re working on at the moment, so…). If you’re a fan of Power BI and data in general you are definitely going to want to check it out though! It’s not BI, it’s an “AI-Powered business research canvas” although that description doesn’t really do justice to it; even if you read the announcement blog post you probably won’t get what it does. You need to watch this demo video first, watch this Ignite session if you have more time, read the docs and try it for yourself here if you’re based in North America.
Here’s a quick summary of how it works. You upload some data to work with (you can use it with sample data too), ask it a question, it generates some visual and text-based answers, and then suggests next steps or allows you to click on some data and ask a further question. Each new step in the journey creates a new page of information; you can easily go back and see previous pages.
I fed a subset of the UK Land Registry price paid data into it and while it was far from perfect I was still extremely impressed with the results:
Before you ask the obvious question: at the time of writing it only connects to relatively small Excel, CSV and PDF files. And please don’t ask me any follow up questions about this because I won’t be able to answer them 🙂
One of the coolest features in Fabric is Direct Lake mode, which allows you to build Power BI reports directly on top of Delta tables in your data lake without having to wait for a semantic model to refresh. However not everyone is ready for Fabric yet so there’s also a lot of interest in the new DeltaLake.Table M function which allows Power Query (in semantic models or dataflows) to read data from Delta tables. If you currently have a serving layer – for example Synapse Serverless or Databricks SQL Warehouse – in between your existing lakehouse and your import mode Power BI semantic models then this new function could allow you to remove it, to reduce complexity and cut costs. This will only be a good idea, though, if refresh performance isn’t impacted and incremental refresh can be made to work well.
So is it possible to get good performance from DeltaLake.Table with incremental refresh? Query folding isn’t possible using this connector because there’s no database to query: a Delta table is just a folder with some files in. But query folding isn’t necessary for incremental refresh to work well: what’s important is that when Power Query filters a table by the datetime column required for incremental refresh, that query is significantly faster than reading all the data from that table. And, as far as I can see from the testing I’ve done, because of certain performance optimisations within DeltaLake.Table it should be possible to use incremental refresh on a Delta table successfully.
There are three factors that influence the performance of Power Query when querying a Delta table:
The internal structure of the Delta table, in particular whether it is partitioned or not
The implementation of the connector, ie the DeltaLake.Table function
The M code you write in the queries used to populate the tables in your semantic model
There’s not much you can do about #2 – performance is, I think, good enough right now although there are a lot of optimisations that will hopefully come in the future – but #1 and #3 are definitely within your control as a developer and making the right choices makes all the difference.
Here’s what I did to test incremental refresh performance. First, I used a Fabric pipeline to load the NYC Taxi sample data into a table in a Lakehouse (for the purposes of this exercise a Fabric Lakehouse will behave the same as ADLSgen2 storage – I used a Lakehouse because it was easier). Then, in Power BI Desktop, I created an import mode semantic model pointing to the NYC taxi data table in the Lakehouse and configured incremental refresh. Here’s the M code for that table:
let
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/unpartitionednyc/",
        [HierarchicalNavigation = true]
    ),
    ToDelta = DeltaLake.Table(Source),
    #"Filtered Rows" = Table.SelectRows(
        ToDelta,
        each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
    )
in
    #"Filtered Rows"
Here’s the incremental refresh dialog:
I then published the semantic model and refreshed it via the Enhanced Refresh API from a notebook (Semantic Link makes this so much easier) using an effective date of 8th December 2013 to get a good spread of data. I used Phil Seamark’s new, notebook-based version of his refresh visualisation tool to see how long each partition took during an initial refresh:
The refresh took just over 30 minutes.
Next, using Spark SQL, I created a copy of the NYC taxi data table in my Lakehouse with a new datetime column added that removes everything apart from the date, and I then partitioned the table by that new column (called PickupDate here):
CREATE TABLE PartitionedByDateNYC
USING delta
PARTITIONED BY (PickupDate)
AS
SELECT *, date_trunc("Day", lpepPickupDateTime) as PickupDate
FROM NYCIncrementalRefreshTest.nyctaxi_raw
I created a copy of my semantic model, pointed it to the new table and reconfigured the incremental refresh to filter on the newly-created PickupDate column:
let
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionedbydatenyc/",
        [HierarchicalNavigation = true]
    ),
    ToDelta = DeltaLake.Table(Source),
    #"Filtered Rows" = Table.SelectRows(
        ToDelta,
        each [PickupDate] >= RangeStart and [PickupDate] < RangeEnd
    )
in
    #"Filtered Rows"
…and refreshed again. This time the refresh took about 26 seconds.
Half an hour to 26 seconds is a big improvement and it’s because the DeltaLake.Table function is able to perform partition elimination: the partitions in the semantic model align to one or more partitions in the Delta table, so when each partition in the semantic model is refreshed Power Query only needs to read data from the partitions in the Delta table that contain the relevant data. This only happens because the filter in the Power Query query using the RangeStart and RangeEnd parameters is on the same column that is used to partition the Delta table.
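The principle can be sketched with a toy partition-elimination function. The folder names below follow the Hive-style partition layout Delta uses on disk, but they are illustrative assumptions, not paths read from my Lakehouse:

```python
from datetime import date

# Hypothetical Delta partition folders, keyed by their PickupDate value.
partitions = {
    date(2013, 12, 6): "Tables/partitionedbydatenyc/PickupDate=2013-12-06/",
    date(2013, 12, 7): "Tables/partitionedbydatenyc/PickupDate=2013-12-07/",
    date(2013, 12, 8): "Tables/partitionedbydatenyc/PickupDate=2013-12-08/",
}

def partitions_to_read(range_start: date, range_end: date) -> list:
    """Partition elimination: only folders whose partition value falls in
    [RangeStart, RangeEnd) need to be scanned for this refresh partition."""
    return [path for value, path in sorted(partitions.items())
            if range_start <= value < range_end]

# A daily refresh partition for 7 Dec 2013 touches exactly one folder
# instead of scanning the whole table.
print(partitions_to_read(date(2013, 12, 7), date(2013, 12, 8)))
```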
In my final test I partitioned my Delta table by month, like so:
CREATE TABLE PartitionedNYC
USING delta
PARTITIONED BY (PickupYearMonth)
AS
SELECT *, (100*date_part('YEAR', lpepPickupDateTime)) + date_part('Months', lpepPickupDateTime) as PickupYearMonth
FROM NYCIncrementalRefreshTest.nyctaxi_raw
The challenge here is that:
The new PickupYearMonth column is an integer column, not a datetime column, so it can’t be used for an incremental refresh filter in Power Query
Power BI incremental refresh creates partitions at the year, quarter, month and date granularities, so filtering by month can’t be used for date partitions
I solved this problem in my Power Query query by calculating the month from the RangeStart and RangeEnd parameters, filtering the table by the PickupYearMonth column (to get partition elimination), stopping any further folding using the Table.StopFolding function and then finally filtering on the same datetime column I used in my first test:
let Source = AzureStorage.DataLake( "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionednyc/", [HierarchicalNavigation = true] ), ToDelta = DeltaLake.Table(Source), YearMonthRangeStart = (Date.Year(RangeStart) * 100) + Date.Month(RangeStart), YearMonthRangeEnd = (Date.Year(RangeEnd) * 100) + Date.Month(RangeEnd), FilterByPartition = Table.StopFolding( Table.SelectRows( ToDelta, each [PickupYearMonth] >= YearMonthRangeStart and [PickupYearMonth] <= YearMonthRangeEnd ) ), #"Filtered Rows" = Table.SelectRows( FilterByPartition, each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd ) in #"Filtered Rows"
Interestingly this table refreshed even faster: it took only 18 seconds.
This might just be luck, or it could be because the larger partitions resulted in fewer calls back to the storage layer. The AzureStorage.DataLake M function requests data 4MB at a time by default and this could result in more efficient data retrieval for the data volumes used in this test. I didn’t get round to testing if using non-default options on AzureStorage.DataLake improved performance even more (see here for more details on earlier testing I did with them).
To sum up, based on these tests it looks like incremental refresh can be used effectively in import mode semantic models with Delta tables and the DeltaLake.Table function so long as you partition your Delta table and configure your Power Query queries to filter on the partition column. I would love to hear what results you get if you test this in the real world so please let me know by leaving a comment.
Something I mentioned in my recent post about the new DeltaLake.Tables M function on the Fabric blog recently was the fact that you can get different versions of the data held in a Delta table using the Value.Versions M function. In fact, the Value.Versions is the way to access different versions of data in any source that has this concept – so long as Power Query has added support for doing so. The bad news is that, at least at the the time of writing, apart from the DeltaLake connector there’s only one other source where Value.Versions can be used in this way: the connector for Fabric Lakehouses.
Here’s how you can access the data in a table using the Lakehouse.Contents M function:
Version 0 is the earliest version; the latest version of the data is the version with the highest number and this version can also be accessed from the row with the version number null. The nested values in the Data column are tables which give you the data for that particular version number. So, for example, if I wanted to get the data for version 2 I could click through on the nested value in the Data column in the row where the Version column contained the value 2. Here’s the M code for this:
let Source = Lakehouse.Contents(), SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data], SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data], SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data], ShowVersions = Value.Versions(SelectTable), Data = ShowVersions{2}[Data] in Data
The Lakehouse connector uses the TDS Endpoint of the Lakehouse to get data by default, as in the first code snippet above, but if you use Value.Versions to get specific versions then this isn’t (as yet) possible so it will use a slower method to get data and performance may suffer.
Last of all, you can get the version number of the data you’re looking at using the Value.VersionIdentity function. If you’re looking at the latest version of the data then Value.VersionIdentity will return null:
Power Query allows you to merge (“join” in database terms) two tables together in a variety of different ways, including left and right anti joins. Unfortunately, as I found recently, anti joins don’t fold on SQL Server-related data sources, which can result in performance problems. Luckily there is a different way of doing anti joins that does fold.
Say you have two Power Query queries called Fruit1 and Fruit2 that get data from SQL Server tables containing the names of different varieties of fruit:
Now, let’s say you want to get a list of all the fruit varieties that are in Fruit1 and not in Fruit2. The obvious thing to do is to do a Merge and use the Left Anti option like so:
Here’s the M code, including an extra step to remove the join column that this creates:
let
Source = Table.NestedJoin(Fruit1, {"Fruit"}, Fruit2, {"Fruit"}, "Fruit2", JoinKind.LeftAnti),
#"Removed Other Columns" = Table.SelectColumns(Source,{"Fruit"})
in
#"Removed Other Columns"
This gives you the correct result:
…but it doesn’t fold, so performance may be bad.
However, if you do a Merge and use the Left Outer option instead:
Then expand the join column (called Fruit2.Fruit here):
And then filter on that column so you only keep the rows where it contains the value null, and then remove that column, you get the same result:
Here’s the M:
let
Source = Table.NestedJoin(Fruit1, {"Fruit"}, Fruit2, {"Fruit"}, "Fruit2", JoinKind.LeftOuter),
#"Expanded Fruit2" = Table.ExpandTableColumn(Source, "Fruit2", {"Fruit"}, {"Fruit2.Fruit"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded Fruit2", each ([Fruit2.Fruit] = null)),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Fruit"})
in
#"Removed Other Columns"
This now does fold (meaning performance should be better) and gives you the following SQL:
select [_].[Fruit]
from
(
select [$Outer].[Fruit],
[$Inner].[Fruit2]
from [dbo].[Fruit1] as [$Outer]
left outer join
(
select [_].[Fruit] as [Fruit2]
from [dbo].[Fruit2] as [_]
) as [$Inner] on ([$Outer].[Fruit] = [$Inner].[Fruit2] or [$Outer].[Fruit] is null and [$Inner].[Fruit2] is null)
) as [_]
where [_].[Fruit2] is null