How Much Data Can You Load Into Power BI?

Power BI can handle large data volumes, but just how much data can you load into Power BI? Can Power BI handle big data? How big is “big” anyway? These are questions you may have when you’re starting out on a new Power BI project and the answers can be hard to find. Indeed the answer in most cases is “it depends”, which isn’t very helpful. In this post I will try to explain the various limits on Power BI dataset size and how you can know if you’re likely to hit them.

Power BI storage modes

First of all, what does it even mean to “load data into” Power BI? There are two fundamental ways that Power BI can work with data: Import mode and DirectQuery mode. Different tables in your Power BI dataset can have different storage modes; there’s also an option called Dual mode where a table can switch between Import mode and DirectQuery mode depending on the circumstances.

In DirectQuery mode no data is stored in your Power BI dataset: when your report renders and queries are sent to the dataset, the dataset sends queries back to the data source to get data on demand. This means the only limits on the amount of data you can work with are the limits set by your data source.

In Import mode Power BI stores a copy of your data inside the dataset in its own internal database engine, known as either the Vertipaq engine or the Analysis Services engine. Import mode is the default storage mode and for good reason – it will almost always give you the best performance for your reports and it allows you to use all Power BI functionality. The “how much data?” question is only relevant to Import mode because when you use it Power BI imposes various limits on the amount of data that can be stored inside a dataset.

How big is your dataset?

In order to answer the question of “how much data can you load into Power BI?” you need to know how to measure the size of your dataset. There are various ways of doing this but the best is to install a free, community-developed tool called DAX Studio: its Model Metrics feature shows you the total size of your dataset on the Summary tab:

This figure is good enough for most purposes. It’s not quite that simple though, and this video by Marco Russo and Patrick LeBlanc has a great explanation of all the different ways of measuring dataset size and what the figures really represent.

Can you predict how large your dataset is if you know how much data you have in your data source?

You have to import all your data into your dataset to find out how large your dataset is. But can you work out how large your dataset will be before you do this? The answer is no. You can be pretty sure that the size of your dataset will be smaller than the size your data is in your data source but it could only be a bit smaller or it could be a lot smaller, maybe even just 10-20% of the original size. This is because Power BI compresses your data very efficiently when you import it. What’s more, how you model your data can have a big impact on how well the compression works and making a few changes can result in a much smaller dataset as, for example, this post by Nikola Ilic shows.

How much data can you load into Power BI Desktop?

The only practical limit on the amount of data you can load into Power BI Desktop is the amount of memory you have on your PC; you’ll need at least 16GB of RAM, ideally 32GB, to get the best experience. However Power BI Desktop is just a development tool – you’ll need to publish to the Power BI Service for other people to view your report and that’s where the limits take effect. What’s more there’s a 10GB limit on the size of a dataset you can publish to the Power BI Service from Power BI Desktop, although you can have datasets much larger than that in Power BI Premium.

To be honest you should never be working with anything like 10GB of data in Power BI Desktop anyway: the file size will be huge, saving will be slow and you’ll spend a long time waiting for data to import while you’re developing. What you should do is work with a small subset of your data in Desktop and only load the full volume after you’ve published. You can do this in a few different ways, for example using incremental refresh or by using M parameters to limit the amount of data you load in Desktop and then changing the parameter after publishing, as I showed here.
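For example (a sketch only – the server, table and parameter names here are all hypothetical), you could create an M parameter called RowLimit and use it with Table.FirstN so that Power BI Desktop only ever loads a sample, then change the parameter’s value in the dataset’s settings after publishing:

```m
let
    // RowLimit is a hypothetical M parameter: set it to something small
    // (say 100000) while developing in Desktop, then change it to a very
    // large value in the dataset's settings after publishing
    Source = Sql.Database("myserver", "mydb"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    Sample = Table.FirstN(FactSales, RowLimit)
in
    Sample
```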

How much data can you load in the Power BI Service if you’re using Shared capacity (aka “Power BI Pro”)?

If you are not using Power BI Premium or Premium Per User, when you publish to the Power BI Service you are using Shared capacity (often called “Power BI Pro” by Power BI users because you only need a Power BI Pro licence to use it). The maximum size of a dataset in Shared capacity is 1GB; if you hit that limit you’ll get the “Unable to save the changes” error I blogged about here. There is one exception though: as mentioned here, if you use the rather obscure option in the Power BI Service to upload an Excel workbook with data in the Excel Data Model/Power Pivot, then the resulting dataset is limited to 250MB.

How much data can you load in the Power BI Service if you’re using Power BI Premium or Premium Per User (PPU)?

The default maximum dataset size in a Power BI Premium capacity or PPU is 10GB, but if you turn on the Large Dataset storage format you can have datasets larger than that and the maximum size depends on the amount of memory available in your Premium capacity. The “Max memory per dataset” column in the table here shows the maximum amount of memory available in each Premium or Power BI Embedded capacity SKU to an individual dataset; the maximum amount of memory available per dataset in PPU is 100GB. However the maximum amount of memory available to a dataset is not the maximum size of a dataset: apart from the memory used to store the data for a dataset, more memory will be needed when the dataset is queried or refreshed. If you do a full refresh of a dataset it may require almost double the amount of memory needed to store the dataset itself; incremental refresh may require less memory. You can use the Premium Capacity Metrics App to see how your dataset’s memory usage changes over time.
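As a rough worked example (the dataset size here is invented, and the doubling rule is only an approximation from the paragraph above, so treat this as an estimate rather than a guarantee):

```python
# Rough rule of thumb from the paragraph above: a full refresh may need
# almost double the memory used to store the dataset itself.
dataset_size_gb = 45                # made-up example figure
full_refresh_peak_gb = dataset_size_gb * 2
print(full_refresh_peak_gb)         # 90

# Compare the estimate with the per-dataset memory limit for your SKU;
# 100GB is the PPU figure mentioned above, other SKUs vary.
sku_max_memory_gb = 100
print(full_refresh_peak_gb <= sku_max_memory_gb)  # True - but not by much
```

In other words a 45GB dataset may fit within the PPU limit at rest but come close to it during a full refresh, which is exactly the kind of thing the Premium Capacity Metrics App can help you spot.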

Other Import mode limits

There are a few other limitations of Import mode that should be mentioned. As documented here you can only have 1,999,999,997 distinct values in a single column, and there is a limit of 16,000 on the combined total of tables and the columns in those tables – but if you have that many tables and columns you have definitely made a mistake with how you model your data. Also, some DAX functions such as Median() only work on tables with fewer than 2 billion rows in them, as I blogged here.

Power BI can load all my data, but will my reports be fast enough to use?

Loading all your data into Power BI is one thing but what really matters is whether the reports you build are fast enough to use. In my experience performance problems relating to data volume are not as common as performance problems caused by inefficient DAX in measures or poor modelling decisions, although they do occur – for example distinct counts on columns with a very large number of unique values can be slow. If you’re following best practices you’re unlikely to encounter them unless you’re working with the kind of data volumes that require Premium. DAX Studio’s Server Timings feature can help you understand why your queries are slow and whether data volumes are an issue.

This is a lot to take in – what does it all mean?

Let me finish up by making a few broad generalisations about how much data you can load into Power BI Import mode. Assuming you have followed all the best practices around modelling data you should be able to work with tables with up to a few million rows, probably tens of millions of rows, in Shared capacity and tables with up to a few billion rows in Premium. If your data source is Excel then Power BI can definitely handle the amount of data you have; if you are working with a relational database like SQL Server then it’s still very likely Import mode will work; even if you’re working with hundreds of gigabytes of data or more in a source like Azure Synapse, Snowflake or BigQuery then Import mode may still be ok. You will need to know Power BI very well to get good performance in Import mode with the largest data volumes but it is certainly possible and I know of several customers that are doing it.

Calling The Power BI REST API From TSQL In Azure SQL DB

Something cool has just been announced for Azure SQL DB: the ability to call a limited number of REST APIs direct from TSQL. The announcement is here:

https://devblogs.microsoft.com/azure-sql/azure-sql-database-external-rest-endpoints-integration-public-preview/

There are code samples here:

https://github.com/Azure-Samples/azure-sql-db-invoke-external-rest-endpoints

And guess what, the Power BI REST API is one of the APIs you can call! This means, for example, you can run DAX queries via the Execute Queries endpoint (bearing in mind all the limitations) from TSQL:

https://github.com/Azure-Samples/azure-sql-db-invoke-external-rest-endpoints/blob/main/power-bi.ipynb
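To give you a flavour, here’s a rough sketch based on the samples linked above (the dataset ID is a placeholder, and you need a database scoped credential holding valid authentication details for the Power BI API – see the samples for the setup):

```sql
-- Sketch only: the dataset ID is a placeholder and the credential must
-- already exist as a database scoped credential (see the linked samples)
DECLARE @response NVARCHAR(MAX);

EXEC sp_invoke_external_rest_endpoint
    @url = N'https://api.powerbi.com/v1.0/myorg/datasets/<dataset id>/executeQueries',
    @method = N'POST',
    @payload = N'{"queries":[{"query":"EVALUATE VALUES(''Sales''[Country])"}]}',
    @credential = [https://api.powerbi.com],
    @response = @response OUTPUT;

SELECT @response;
```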

I think this opens up a lot of possibilities. What do you think? Leave a comment!

Why You Should Defragment Your Fact Tables If You’re Using Incremental Refresh In Power BI Premium

A lot of people are building large datasets in Power BI Premium nowadays, and if you’re building a large dataset you’re very likely to be using Power BI’s incremental refresh functionality with your fact tables or manually creating and refreshing partitions in them. If so, you should occasionally do a special kind of refresh to shrink the dictionaries on your fact tables.

To illustrate why you need to do this I created a Power BI dataset with one table containing one column, where each value was a GUID (ie a unique text value). Here’s the M I used to generate it:

let
    Source = {1..10000000},
    Output = #table(type table [MyGUID=text], 
        List.Transform(
            Source, 
            each {Text.NewGuid()})
    )
in
    Output

After publishing the dataset I copied the original partition so there were two partitions in the table, each of which contained ten million rows. When you use incremental refresh in a table Power BI automatically generates and manages partitions for you in the background – and there isn’t much difference between doing this and creating partitions manually like I’m doing here.

Here’s what DAX Studio’s Model Metrics functionality showed when I first refreshed the dataset:

The Dictionary column is the column to focus on in this table. Dictionaries are the data structures created by Power BI’s storage engine, the Vertipaq engine, to hold all the distinct values in a column – the more distinct values there are, the larger the dictionary. As Marco points out here the DMVs that DAX Studio uses to measure dictionary size don’t always return consistent or accurate results but the values it returns are good enough for the purposes of this test.

I then did several Full refreshes on just one of the two partitions. Here’s what DAX Studio showed at this point:

As you can see, the size of the dictionary for the MyGUID column (and therefore the overall size of the dataset) has grown a lot – about 1GB – even though the cardinality of the column has stayed the same. This is because dictionaries are held at the table level, not at the partition level, so by refreshing just one partition I was adding 10 million new text values to the dictionary without deleting any of the old values that had been added to the dictionary by previous refreshes and which were no longer present in the table.

Finally, I ran the following Refresh command against the workspace’s XMLA Endpoint from SQL Server Management Studio:

{
  "refresh": {
    "type": "defragment",
    "objects": [
      {
        "database": "TableWithManyDistinctValues",
        "table": "MyTable"
      }
    ]
  }
}

[Note that XMLA Endpoints are only available if you’re using Premium or PPU]

This refresh is of type defragment and it is documented here. This is the explanation of what it does from the docs:

Defragment the data in the specified table. As data is added to or removed from a table, the dictionaries of each column can become polluted with values that no longer exist in the actual column values. The defragment option will clean up the values in the dictionaries that are no longer used.

After that you can see the dictionary for the table, and therefore the size of the dataset, was back to around its original size:

This is an extreme example of course, but I can imagine something nearly this bad happening in the real world if you use incremental refresh on a table with a text column containing addresses or user comments and if you don’t keep much history. This isn’t new behaviour or functionality: this happens in Analysis Services Tabular models too and is discussed in books like Marco and Alberto’s “Tabular Modelling in Microsoft SQL Server Analysis Services” but not many people are aware it still applies to Power BI datasets. This type of refresh can be quite slow, but it is likely to be faster than the alternative – running a full refresh on the whole dataset – which is also the only option if you’re not using Premium.
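To make the mechanics concrete, here’s a toy Python model (this is not how the Vertipaq engine is actually implemented, just an illustration of the behaviour described above): the dictionary is shared across partitions, a partition refresh only ever adds values, and defragmentation rebuilds it from the values still present in the table:

```python
import uuid

class Table:
    """Toy model of a Vertipaq table: a shared dictionary plus per-partition rows."""
    def __init__(self):
        self.dictionary = set()   # table-level dictionary of distinct values
        self.partitions = {}      # partition name -> list of values

    def refresh_partition(self, name, rows):
        # A partition refresh replaces the partition's rows and adds any new
        # values to the dictionary, but never removes stale dictionary entries
        self.partitions[name] = rows
        self.dictionary.update(rows)

    def defragment(self):
        # Rebuild the dictionary from the values actually present in the table
        live = set()
        for rows in self.partitions.values():
            live.update(rows)
        self.dictionary = live

t = Table()
t.refresh_partition("P1", [str(uuid.uuid4()) for _ in range(1000)])
t.refresh_partition("P2", [str(uuid.uuid4()) for _ in range(1000)])
print(len(t.dictionary))  # 2000 distinct values

# Refresh P1 three more times with fresh GUIDs: the column's cardinality
# stays at 2000, but the dictionary keeps growing because old values are
# never evicted
for _ in range(3):
    t.refresh_partition("P1", [str(uuid.uuid4()) for _ in range(1000)])
print(len(t.dictionary))  # 5000

t.defragment()
print(len(t.dictionary))  # back to 2000
```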

So why should you run a refresh of type defragment on your dataset? There are two reasons:

  1. If you’re getting close to the size limits imposed by the Premium capacity SKU you’re using, and you want to make sure that you don’t hit them.
  2. Large datasets generally result in slower query performance, so reducing the size of your dataset may result in faster queries.

How often should you do this? In most cases not very often; it depends on how much your dataset shrinks when you defragment it. Once a year may be sufficient and I doubt anyone would need to do it more than once a month. One way to automate this would be to use Power Automate to call the Enhanced Refresh API, as I showed in this series of posts.

Performance Tuning Table Visuals With Filters Applied In Power BI

Do you have a Power BI report with a table or a matrix on it, where there is a filter on the rows of the table? It’s a very common scenario indeed. Is the table or matrix slow to render? If so, this post is for you!

Consider the following table visual:

There are four columns: Date, Town and two measures. One measure called [Fast Measure] is, as the name suggests, very quick to execute; the other measure, called [Slow Measure], is very complex and slow. The definitions are irrelevant here. Notice that there is a filter on this table visual so only the rows where [Fast Measure] is greater than 1 are shown.

If I use Performance Analyzer to measure the time taken to render this table, it takes around 17.5 seconds. However, if I remove the filter on [Fast Measure], the table only takes 8 seconds to render. Why? The filter is on the fast measure and surely more rows are returned without the filter, so wouldn’t the slow measure be evaluated more?

It turns out that the DAX generated for tables with filters applied isn’t as well-optimised as it could be. Yes, I work on the Power BI team and yes this is something that should be addressed, but before that happens I thought it would be good to make as many people aware of this as possible so they can tune their reports. I have worked with a lot of customers recently who are running into this problem without realising it.

Here’s the DAX query for the table without the filter:

DEFINE
  VAR __DS0Core = 
    SUMMARIZECOLUMNS(
      'Date'[Date],
      'Property Transactions'[Town],
      "Fast_Measure", 'Property Transactions'[Fast Measure],
      "Slow_Measure", 'Property Transactions'[Slow Measure]
    )

  VAR __DS0PrimaryWindowed = 
    TOPN(501, __DS0Core, 'Date'[Date], 1, 'Property Transactions'[Town], 1)

EVALUATE
  __DS0PrimaryWindowed

ORDER BY
  'Date'[Date], 'Property Transactions'[Town]

Here’s the DAX query for the table with the filter:

DEFINE
  VAR __ValueFilterDM0 = 
    FILTER(
      KEEPFILTERS(
        SUMMARIZECOLUMNS(
          'Date'[Date],
          'Property Transactions'[Town],
          "Fast_Measure", 'Property Transactions'[Fast Measure],
          "Slow_Measure", 'Property Transactions'[Slow Measure]
        )
      ),
      [Fast_Measure] > 1
    )

  VAR __DS0Core = 
    SUMMARIZECOLUMNS(
      'Date'[Date],
      'Property Transactions'[Town],
      __ValueFilterDM0,
      "Fast_Measure", 'Property Transactions'[Fast Measure],
      "Slow_Measure", 'Property Transactions'[Slow Measure]
    )

  VAR __DS0PrimaryWindowed = 
    TOPN(501, __DS0Core, 'Date'[Date], 1, 'Property Transactions'[Town], 1)

EVALUATE
  __DS0PrimaryWindowed

ORDER BY
  'Date'[Date], 'Property Transactions'[Town]

Notice how, in this second query, the filter is applied in an additional variable called __ValueFilterDM0. The side effect of implementing the filter in this way is that all the measures in the table, including [Slow Measure], are evaluated twice. This explains why, in this example, the filtered table is twice as slow as the unfiltered table. The performance of your table may be different: it depends on the number of measures, their definition, the nature of the filter and many other factors. Measuring the performance of your table with and without the filter applied will tell you how much of an impact this issue is having in your report.
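Here’s a toy Python illustration of the shape of the problem (this is not how the DAX engine actually works internally, just a model of why the extra variable doubles the work): the same summarised table is built twice, once to evaluate the filter and once for the final result, so every measure runs twice:

```python
# Toy model: count how many times each "measure" is evaluated when the
# summary table is built twice - once for the filter variable and once
# for the final result
evaluations = {"fast": 0, "slow": 0}

def fast_measure(g):
    evaluations["fast"] += 1
    return g * 2

def slow_measure(g):
    evaluations["slow"] += 1
    return g * 100  # imagine this being expensive

groups = [1, 2, 3, 4]

def summarize():
    # Analogous to SUMMARIZECOLUMNS: evaluate both measures per group
    return [(g, fast_measure(g), slow_measure(g)) for g in groups]

# Analogous to __ValueFilterDM0: build the full summary just to decide
# which rows pass the filter
passing = {g for (g, fast, slow) in summarize() if fast > 1}
# Analogous to __DS0Core: build the summary again for the final result
result = [row for row in summarize() if row[0] in passing]

print(evaluations["slow"])  # 8 = 2 passes x 4 groups
```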

What can you do about this? One solution is to apply the filter inside the measures rather than on the visual. In this case, creating two new measures with the following definitions:

Fast Measure Filtered =
VAR f = [Fast Measure] 
RETURN IF ( f > 1, f )

Slow Measure Filtered =
IF ( [Fast Measure] > 1, [Slow Measure] )

…and then using these measures on columns instead of the original measures, removing the visual-level filter, like so:

In my example this results in the table rendering in 8 seconds again, but again your mileage may vary. What’s more I’ve turned off totals in the table and therefore avoided the problem of making these measures return the same totals as the totals in the original filtered table. There may be other solutions (I suspect calculation groups would be worth investigating) that work better in your dataset and report.

Are you affected by this problem? How much impact does it have? Do you have a different solution? Let me know in the comments…

Understanding The “Unable to save the changes since the new dataset of size” Error In Power BI

If you’re refreshing a Power BI dataset that contains a large amount of data you may encounter the following error:

Data source error Unable to save the changes since the new dataset of size 1120 MB exceeds the limit of 1024 MB. If the dataset is hosted on a PowerBI Premium capacity, you may also reach out to your capacity administrator to see if the limit can be adjusted.

You’ll get this error at the end of a refresh if the size of your dataset is larger than the size allowed by Power BI. The maximum size of a dataset varies depending on whether you are using Shared capacity (commonly known as “Pro”) or Premium capacity:

  • In Shared/Pro the maximum size of a dataset is 1GB (ie the 1024MB from the error message shown above), as documented here.
  • In Premium:
    • The default maximum size of a dataset is 10GB if the “Large dataset storage format” option (documented here and here) is turned off for your dataset.
    • If the “Large dataset storage format” option is turned on, the maximum size of your dataset depends on the amount of memory available in the SKU of the Power BI Premium capacity you’re using. There’s a table in the documentation that tells you how much memory is available for each SKU: look in the “RAM per dataset” column and also note what it says in the footnotes about the maximum size of a dataset being slightly less than the amount of memory listed.

What can you do if you run into this error? You have two choices:

  1. If you’re using a Power BI Pro licence, buy Premium Per User licences or buy Premium capacity and move your dataset into a workspace on PPU or Premium. If you’re already on Premium, upgrade your capacity and make sure that the “Large dataset storage format” option is turned on for your dataset. This will increase the dataset size limits.
  2. Reduce the size of your dataset so it fits within the limits.

Option #2 is likely to be the preferred option in most cases. There are a lot of detailed blog posts and articles out there on how to reduce the size of your dataset (including this one in the docs) but basically you should use DAX Studio’s Model Metrics feature to see what’s taking up the most space in your dataset, then delete unnecessary tables and columns, try to reduce the number of distinct values in each column, and reduce the number of rows you’re loading into each table (perhaps by reducing the amount of historic data). Quite often a few small changes can significantly shrink the size of your dataset and also help refresh times and query performance.

Speed Up Power BI Dataset Refresh Performance In Premium Or PPU By Changing The “Parallel Loading Of Tables” Setting

Do you have a large dataset in Power BI Premium or Premium Per User? Do you have more than six tables that take a significant amount of time to refresh? If so, you may be able to speed up the performance of your dataset’s refresh by increasing the number of tables that are refreshed in parallel, using a feature that was released in August 2022 but which you may have missed.

Some of you may be thinking: haven’t you blogged about refresh parallelism before? Yes: last year I wrote a post about setting the maxParallelism property when refreshing through TMSL scripts, and a few months ago I showed how you could change the same property when refreshing using the Enhanced Refresh API. That was all fairly complex though and what I’m going to show you here is, in contrast, very easy to implement.

To illustrate this I created an Import dataset containing nine tables, each of which loaded data from a different CSV file. Each table contained a couple of million rows and took 20-30 seconds to refresh. I then published to a PPU workspace in the Power BI Service and used the technique described in Phil Seamark’s “Visualise your Power BI refresh” blog post to capture what happened during refresh. The refresh took 44 seconds and, as you can see from this screenshot of Phil’s report, only six of the tables were refreshed in parallel at any one time:

[Ignore the y axis on this graph – I have no idea what’s going on with it]

This is as you would expect: the default number of objects that can be refreshed in parallel in Premium or Premium Per User is 6. This is also the maximum number of objects that can be refreshed in parallel in Shared capacity.

I then went to my original .pbix file, opened the Options dialog, went to the Current File/Data Load pane and changed the new “Parallel loading of tables” setting on my dataset from Default to Custom and entered the value 9:

This setting sets the maxParallelism property in the refresh commands generated by Power BI in the Power BI Service; it is fully documented here. Rerunning the refresh again in the Power BI Service I found that the overall time taken was down to 29 seconds and that all the tables were refreshed in parallel:
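For reference, a TMSL Refresh command with this property set looks something like the following (the database name here is hypothetical):

```json
{
  "refresh": {
    "type": "full",
    "maxParallelism": 9,
    "objects": [
      {
        "database": "MyDataset"
      }
    ]
  }
}
```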

Here’s the Refresh command from the Profiler trace showing maxParallelism set to 9:

In summary: refresh performance got about a third faster by making this simple change, so if you’re using Power BI Premium and not already doing advanced things with TMSL scripts or the Enhanced Refresh API, this is definitely something you should experiment with. As I said last year you can try setting this property to values up to 30, but there’s no guarantee you’ll get the amount of parallelism you ask for (it depends on the Premium capacity SKU and a few other factors). What’s more, setting this property too high could result in slower refresh because it might result in Power BI overloading your data source with queries.

[Thanks to Akshai Mirchandani for answering my questions about this subject]

Diagnosing Calculation Group-Related Performance Problems In Power BI Using EvaluateAndLog

A few weeks ago I wrote a post showing how you can use the new EvaluateAndLog DAX function to diagnose performance problems relating to the use of the Switch function. Did you know that calculation groups can experience similar performance problems though? In some scenarios limited evaluation takes place for all calculation items, not just the one you expect; luckily you can use EvaluateAndLog to diagnose this too. In this post I’ll show you a simple example.

Let’s say you have a dataset containing the following table, called Sales:

There’s also a calculation group with three calculation items:

The Measure Value calculation item returns the value of SelectedMeasure but also uses EvaluateAndLog:

EVALUATEANDLOG( SELECTEDMEASURE() )

The Measure Value * 2 and Measure Value * 3 calculation items do the same thing but multiply the value returned by SelectedMeasure by two and three respectively, for example:

EVALUATEANDLOG( SELECTEDMEASURE() * 2 )

There’s another table in the dataset, with no relationship to the Sales table, called Calc Item Names containing a single column listing the three calculation items’ names:

Last of all, there is a measure called Sales Amount:

Sales Amount = SUM(Sales[Sales])

…and a measure called Dynamic Sales Amount Not Optimised that takes the selection made on the Calc Item Names column of the Calc Item Names table and uses it to dynamically select the calculation item with that name and apply it to the Sales Amount measure:

Dynamic Sales Amount Not Optimised =
VAR SelectedCalcItem =
    SELECTEDVALUE ( 'Calc Item Names'[Calc Item Names] )
RETURN
    CALCULATE ( [Sales Amount], 'My Calc Group'[Name] = SelectedCalcItem )

All this allows you to build the following report where a slicer controls the calculation item applied in the Dynamic Sales Amount Not Optimised measure:

As the name suggests, though, there’s a problem with the Dynamic Sales Amount Not Optimised measure that EvaluateAndLog can help uncover. Using Jeffrey Wang’s DAXDebugOutput tool to capture what happens when Measure Value is selected in the slicer, you can see three DAXEvaluationLog events. The first is the one you’d expect: since Measure Value is selected in the slicer, the Measure Value calculation item is evaluated:

The other two events are for the other two, unselected calculation items though – and similar to the Switch problem, some evaluation taking place for all calculation items can cause performance problems:

How do you avoid this problem? It turns out that if you use TreatAs to apply the calculation item inside Calculate, like so:

Dynamic Sales Amount Optimised =
CALCULATE (
    [Sales Amount],
    TREATAS ( VALUES ( 'Calc Item Names'[Calc Item Names] ), 'My Calc Group'[Name] )
)

…you only get one DAXEvaluationLog event when the query runs and only the calculation item you selected is evaluated:

[Thanks to Krystian Sakowski for originally diagnosing the problem in this post and coming up with the fix]

Interview With Me On ‘The Knee-Deep In Tech’ Podcast

Last week, while I was at the always-excellent dataMinds Connect conference in Belgium, I was interviewed by Heini Ilmarinen and Alexander Arvidsson for the ‘Knee-Deep In Tech’ podcast. You can listen to it here:

https://www.kneedeepintech.com/posts/podcast-episode-208/

We had a very interesting chat about my career, the Power BI CAT team and what it does, why prioritising features is a lot harder than you think, technical training and the genius of Miss Excel, and more.

Finding The Tables, Columns And Measures Used By A DAX Query In Power BI

If you’re monitoring activity in Power BI, for example using Log Analytics, you’ll know that you can capture the DAX queries generated by your published Power BI reports. How do you make sense of them though? You may want to know which tables, columns or measures are being used by a query – maybe so you can work out which ones aren’t being used and can be deleted. I always thought the only way to do this would be to parse the DAX query, which would be incredibly difficult to do. Yesterday, though, Igor Cotruta pointed out on Twitter that there’s a really easy way to do this that I didn’t know about, using the DISCOVER_CALC_DEPENDENCY DMV. I’ve used this DMV before and blogged about it here, but what was new to me is that you can pass a DAX query into the QUERY restriction (it is all documented here). To see how it works let’s look at a simple example.

Consider the following dataset consisting of a single table with a calculated column and a few measures:

Now consider a table visual built on this dataset with a slicer linked to it:

You can get the DAX query generated by this visual using Performance Analyzer’s Copy Query feature (described in this article). You can then pass this DAX query into the WHERE clause of a query against the DISCOVER_CALC_DEPENDENCY DMV like so:

SELECT 
REFERENCED_OBJECT_TYPE, 
REFERENCED_TABLE,
REFERENCED_OBJECT,
REFERENCED_EXPRESSION
FROM 
$SYSTEM.DISCOVER_CALC_DEPENDENCY 
WHERE 
QUERY = 
'
DEFINE
  VAR __DS0FilterTable = 
    TREATAS({"Apples"}, ''Sales''[Product])

  VAR __DS0Core = 
    SUMMARIZECOLUMNS(
      ROLLUPADDISSUBTOTAL(''Sales''[Country], "IsGrandTotalRowTotal"),
      __DS0FilterTable,
      "Sales_Amount", ''Sales''[Sales Amount],
      "Effective_Tax_Rate", ''Sales''[Effective Tax Rate]
    )

  VAR __DS0PrimaryWindowed = 
    TOPN(502, __DS0Core, [IsGrandTotalRowTotal], 0, ''Sales''[Country], 1)

EVALUATE
  __DS0PrimaryWindowed

ORDER BY
  [IsGrandTotalRowTotal] DESC, ''Sales''[Country]

'

[Note that the DAX query I’m passing in contains single quotes that have to be escaped, which explains why all the table names are surrounded by two single quotes]

The query can be run from DAX Studio connected to either Power BI Desktop or a dataset published to Power BI Premium via the XMLA Endpoint. Here’s the output:

As you can see, this gives you the table, columns (including the hidden RowNumber column), measures and calculated column directly referenced by the query as well as the DAX definitions of the measures and calculated column. It does not mention the [Tax Amount] measure, which the [Effective Tax Rate] measure references; you can get the full list of measures and calculated columns in the dataset and the dependencies between them by querying the DMV without the WHERE clause though. Neither does the output tell you that the Product column is being sliced by the value “Apples”, which would also be useful. Overall, though, I think this is a really powerful feature.

What Is The Difference Between A Live Connection And DirectQuery Mode In Power BI?

It’s very easy to get confused between Live connections and DirectQuery mode in Power BI: the customers I work with frequently get them mixed up, partly because they are similar concepts and partly because other BI tools use the term “live connection” to refer to what Power BI calls DirectQuery mode. In this post I will explain the difference between these two terms.

Power BI Datasets, Azure Analysis Services and SQL Server Analysis Services

The first thing to understand is that when a Power BI report runs it can only get data from one of three places: a Power BI dataset; an Azure Analysis Services (AAS) database; or an on-premises instance of SQL Server Analysis Services (SSAS). There are some exceptions to this rule but we can ignore them for the purposes of this post. Each of these three sources runs a variation of the same engine, ie the Analysis Services engine, and can be queried in DAX. These sources may in turn get data from other sources when they are queried, but as far as the Power BI report is concerned the only sources it can talk to directly are Power BI datasets, AAS databases and SSAS databases.

Live connections

Normally, when you open up Power BI Desktop and start developing, your .pbix file contains the definition of a Power BI dataset and a report that connects to it. When you publish this .pbix file they are split up and you see separate dataset and report objects in your Power BI workspace.

However, in Power BI Desktop you may instead choose to connect to a Power BI dataset that has already been published to the Power BI Service, or an instance of AAS or SSAS, and use that as the source for the report. The main advantage of doing this is that different developers can work on the dataset/AAS database/SSAS database definition and the report at the same time. When you connect to an already-published dataset or an AAS or SSAS database from Power BI Desktop you are creating a Live connection.

To connect to a published Power BI dataset and create a Live connection choose the “Power BI datasets” option under the Get Data menu:

If you want to create a Live connection to AAS or SSAS choose the Analysis Services source and then make sure the “Connect live” option is selected:

After you’ve created the Live connection you’ll see a message telling you this in the bottom right-hand corner of the screen in Power BI Desktop:

The key thing to understand is that when you create a Live connection your .pbix file no longer contains a dataset definition – it only contains a report, and that report connects to a dataset, AAS database or SSAS database that already exists elsewhere.

The concept of a Live connection only really makes sense in Power BI Desktop: as I said, in the Power BI Service reports and datasets are separate objects anyway. A report can only have one Live connection because a report can only ever connect to a single dataset, AAS database or SSAS database.

If you publish a report with a Live connection to an already-published Power BI dataset, then the only new thing that will appear in the Power BI Service is the report. If you publish a report with a Live connection to AAS or SSAS then you will see a new dataset appear in the Power BI Service, but this isn’t a real dataset: it’s just a pointer to the AAS or SSAS database where the data is actually held.

DirectQuery mode

There are two fundamental ways a Power BI dataset, AAS database or SSAS Tabular database can work with data: Import mode and DirectQuery mode. In Import mode the data needed for your dataset is copied into the dataset and stored inside it. In DirectQuery mode, when a Power BI report requests data from the dataset by running a DAX query against it, the dataset in turn generates queries that are run on your data source to get the data it needs on demand. If your data source is a SQL Server database then Power BI will generate SQL queries to get the data it needs; for other sources it may generate queries in other query languages. A Power BI dataset is always a semantic layer, an easily consumable interface for your BI data: in Import mode it is a semantic layer and it also stores the data it needs; in DirectQuery mode it is just a semantic layer.
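To make this translation concrete, here is a sketch of what happens in DirectQuery mode against a SQL Server source. The table, column and measure names are hypothetical, and the SQL shown is only indicative – the SQL Power BI actually generates is typically much more complex, with derived tables and TOP clauses:

```
// DAX query sent by a report visual (hypothetical model):
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Color],
    "Sales Amount", [Sales Amount]
)

/* In DirectQuery mode the dataset might translate this into SQL
   roughly like the following:

   SELECT p.[Color], SUM ( s.[SalesAmount] ) AS [Sales Amount]
   FROM dbo.FactSales AS s
   INNER JOIN dbo.DimProduct AS p
       ON s.[ProductKey] = p.[ProductKey]
   GROUP BY p.[Color]
*/
```

In Import mode the same DAX query would instead be answered entirely from the data already stored inside the dataset.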

The only data sources that Power BI can use in DirectQuery mode are those that have some kind of query language that Power BI can generate. You can’t connect to a CSV file in DirectQuery mode, for example, because there’s no way of running queries against it unless you load the data in it into some other form of database. Other data sources may have rich query languages but Power BI will only be able to connect to them in DirectQuery mode if a suitable connector exists for that source. Most commonly-used relational databases can be used in DirectQuery mode. If your data source supports DirectQuery mode you’ll see that as an option on the dialog when you first connect:

The key thing to understand here is that the term “DirectQuery mode” describes where the data for a dataset is kept, ie outside the dataset as opposed to inside it in Import mode.

Composite models

Different tables in a single Power BI dataset can have different storage modes: some can be in Import mode and some can be in DirectQuery mode. Furthermore, different DirectQuery tables in the same Power BI dataset can connect to different data sources. When you have different tables using different storage modes, or different DirectQuery tables connecting to different data sources, you have what is called a “composite model”.
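Behind the scenes, a table’s storage mode is a property of its partitions in the dataset definition. As a heavily simplified, illustrative TMSL-style fragment (the table names are hypothetical and the partition sources are elided), a composite model might look something like this:

```json
{
  "tables": [
    {
      "name": "Date",
      "partitions": [
        { "name": "Date", "mode": "dual" }
      ]
    },
    {
      "name": "Sales",
      "partitions": [
        { "name": "Sales", "mode": "directQuery" }
      ]
    }
  ]
}
```

Here the Date table uses Dual mode, so it can behave as either Import or DirectQuery depending on the query, while the Sales table is always queried from its source in DirectQuery mode.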

DirectQuery on Power BI datasets and Analysis Services

One final twist is that you can create Power BI datasets in DirectQuery mode that use other Power BI datasets, AAS databases or SSAS databases as a data source. This means you can have DirectQuery datasets that connect to other datasets, which can in turn potentially connect to other datasets. This is useful because, for example, you can have a single, centrally-developed dataset containing all of your company’s sales data, and other Power BI developers can then create their own datasets that connect to this central dataset and extend it by adding extra tables of data and new measures, without ever duplicating the data held in the central dataset.

Although the way things behave behind the scenes is somewhat different, a Power BI dataset in DirectQuery mode that connects to another Power BI dataset as its source is conceptually very similar to a Power BI dataset in DirectQuery mode that connects to a relational database like SQL Server: it just generates DAX queries to get the data it needs when it is queried rather than SQL queries.

Conclusion

Live connections and DirectQuery mode are two distinct concepts. Live connections occur when you create reports in Power BI Desktop that connect to already-published Power BI datasets, AAS databases or SSAS databases. DirectQuery mode is one of the two fundamental ways that a Power BI dataset can work with data (the other being Import mode) and is the mode where a dataset does not contain any data but instead queries other data sources to get the data it needs when it is itself queried.

[Thanks are due to my friend Peter Myers – this blog post expands on some explanations we developed last year in a session we co-presented at various Power BI user groups called “Demystifying Power BI Datasets”. Several recordings of this session exist on YouTube, for example here.]