Filtering Data Loaded Into A Workspace Database In Analysis Services Tabular 2017 And Azure Analysis Services

The first mistake that all new Analysis Services Tabular developers make is this one: they create a new project in SSDT, they connect to their source database, they select the tables they want to work with, they click Import, and they then realise that trying to load a fact table with several million rows of data into their Workspace Database (whether that’s a separate Workspace Database instance or an Integrated Workspace) is not a good idea when they either end up waiting for several hours or SSDT crashes because it has run out of memory. You of course need to filter your data down to a manageable size before you start developing in SSDT. Traditionally, this has been done at the database level, for example using views, but modern data sources in SSAS 2017 and Azure Analysis Services allow for a new approach using M.

Here’s a simple example of how to do this using the Adventure Works DW database. Imagine you are developing a Tabular model and you have just connected to the relational database, clicked on the FactInternetSales table and clicked Edit to open the Query Editor window before importing. You’ll see something like this:

image

…that’s to say there’ll be a single query visible in the Query Editor with the same name as your source table. The M code visible in the Advanced Editor will be something like this:

let
    Source = 
	#"SQL/localhost;Adventure Works DW",
    dbo_FactInternetSales = 
	Source{[Schema="dbo",Item="FactInternetSales"]}[Data]
in
    dbo_FactInternetSales

At this point the query is importing all of the data from this table, but the aim here is to:

  1. Filter the data down to a much smaller number of rows for the Workspace Database
  2. Load all the data in the table after the database has been deployed to the development server

To do this, stay in the Query Editor and create a new Parameter by going to the menu at the top of the Query Editor and clicking Query/Parameters/New Parameter, and creating a new parameter called FilterRows of type Decimal Number with a Current Value of 10:

image

The parameter will now show up as a new query in the Queries pane on the left of the screen:

image

Note that at the time of writing there is a bug in the Query Editor in SSDT that means that when you create a parameter, close the Query Editor, then reopen it, the parameter is no longer recognised as a parameter – it is shown as a regular query that returns a single value with some metadata attached. Hopefully this will be fixed soon but it it’s not a massive problem for this approach.

Anyway, with the parameter created you can now use the number that it returns to filter the rows in your table. You could, for example, decide to implement the following logic:

  • If the parameter returns 0, load all the data in the table
  • If the parameter returns a value larger than 0, interpret that as the number of rows to import from the table

Here’s the updated M code from the FactInternetSales query above to show how to do this:

let
    Source = 
	#"SQL/localhost;Adventure Works DW",
    dbo_FactInternetSales = 
	Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
    FilterLogic = 
	if 
		FilterRows<=0 
	then 
		dbo_FactInternetSales 
	else 
		Table.FirstN(dbo_FactInternetSales, FilterRows)
in
    FilterLogic 

The FactInternetSales query will now return just 10 rows because the FilterRows parameter returns the value of 10:

image

And yes, query folding does take place for this query.

You now have a filtered subset of rows for development purposes, so you can click the Import button and carry on with your development as usual. Only 10 rows of data will be imported into the Workspace Database:

image

What happens when you need to deploy to development though?

First, edit the FilterRows parameter so that it returns the value 0. To do this, in the Tabular Model Explorer window, right-click on the Expressions folder (parameters are classed as Expressions, ie queries whose output is not loaded into Analysis Services) and select Edit Expressions:

image

Once the bug I mentioned above has been fixed it should be easy to edit the value that the parameter returns in the Manage Parameters pane; for now you need to open the Advanced Editor window by clicking the button shown below on the toolbar, and then edit the value in the M code directly:

image

Then close the Advanced Editor and click Import. Nothing will happen now – the data for FactInternetSales stays filtered until you manually trigger a refresh in SSDT – and you can deploy to your development server as usual. When you do this, all of the data will be loaded from the source table into your development database:

image

At this point you should go back to the Query Editor and edit the FilterRows parameter so that it returns its original value, so that you don’t accidentally load the full dataset next time you process the data in your Workspace Database.

It would be a pain to have to change the parameter value every time you wanted to deploy, however, and luckily you don’t have to do this if you use BISM Normalizer – a free tool that all serious SSAS Tabular developers should have installed. One of its many features is the ability to do partial deployments, and if you create a new Tabular Model Comparison (see here for detailed instructions on how to do this) it will show the differences between the project and the version of the database on your development server. One of the differences it will pick up is the difference between the value of the parameter in the project and on in the development database, and you can opt to Skip updating the parameter value when you do a deployment from BISM Normalizer:

image

Analysis Services Query Analyzer

Last week a new, free tool for analysing the performance of MDX queries on SSAS Multidimensional was released: Analysis Services Query Analyzer. You can get all the details and download it here:

https://ssasqueryanalyzer.github.io/

…and here’s a post on LinkedIn by one of the authors, Francesco De Chirico, explaining why he decided to build it:

https://www.linkedin.com/pulse/asqa-10-released-francesco-de-chirico/

I’ve played around with it a bit and I’m very impressed – it’s a really sophisticated and powerful tool, and one I’m going to spend some time learning because I’m sure it will be very useful to me. It’s an AddIn for SQL Server Management Studio and captures Profiler and Perfmon data for an MDX query when you run it, does so for a cold and warm cache, and then displays the results in a variety of graphs and charts. And it even has an MDX formatter built in!

image

image

image

You can find a full list of features here, and when you download the tool it comes with very detailed documentation. Definitely a must-have for all serious SSAS Multidimensional developers.

Setting SQL Server Connection String Properties In Power BI and SSAS Tabular Modern Data Sources

It may not be immediately obvious, but you cannot set your own connection string properties when connecting to SQL Server using the built-in SQL Server connector from either Power BI or a modern data source in Azure SSAS/SSAS Tabular 2017:

image

All you can do is configure the options that are available in the UI, which in the current version of SSDT looks like this:

image

…and which are documented in the Sql.Databases() M function here.

It turns out that the restriction on using your own connection string properties in the built-in SQL Server connector is a deliberate design decision on the part of the Power Query team because, behind the scenes, they use different providers in different circumstances to optimise performance, and because allowing arbitrary connection string properties might make maintaining backwards compatibility difficult in the future.

While your average Power BI user is unlikely to even notice this, for SSAS Tabular developers it could be a big problem: complete control over the connection string is often necessary in enterprise BI scenarios. What are the alternatives then? Well you can use the OLE DB and ODBC connectors instead:

image

Both of these connectors do allow you to set your own connection string properties. For example here’s the UI for a new ODBC connection in SSDT:

image 

The documentation for the Odbc.DataSource and OleDb.DataSource M functions has more detail on how these connectors can be used and how connection string properties can be set. Remember also that the OLE DB Provider for SQL Server was un-deprecated in October 2017.

However, apart from possible performance differences between the two (which you should test yourself – Henk van der Valk wrote a good post on this for SSAS MD and most of what he said is relevant for Tabular) there’s one less-than-obvious difference between these two options: the OLE DB connector does not appear to support query folding right now whereas the ODBC connector does. Of course this isn’t an issue if you’re writing your own SQL queries to import data, but if you do want to use M functions for partitioning (as I show here) you’re likely to get very poor performance with the OLE DB connector.

Using Your Own SQL Queries For Tables With Modern Data Sources In SSAS 2016+ And Azure Analysis Services

While the integration of the Power Query engine into Analysis Services Tabular 2017 and Azure Analysis Services with modern data sources will certainly bring a lot of benefits, I think it’s fair to say that the implementation has not been entirely painless. One problem is that it is no longer obvious how to specify your own SQL query to populate a table or partition in your Tabular model – and while the Query Editor is great, there are a lot of cases where this is necessary. In this post I’ll show you how to do this.

If you’re used to using the Power Query UI in Excel or Power BI Desktop, you’ll notice that when you connect to a SQL Server database using the SQL Server connector in SSDT:

image

…there is no option to enter your own SQL query when you do so:

image

This is deliberate. In Analysis Services, unlike Power BI and Excel, there is a distinction made between data sources and other M queries that return data from those data sources, one that makes a lot of sense in my opinion. While it is possible to enter your own SQL for other data source types, such as OLE DB connections, a data source object is really intended just to define a connection to a data source and not to define what data you want from that data source.

[You may also notice that there’s a “SQL statement” property on a SQL Server data source visible in the Visual Studio properties pane, but I don’t recommend you use it – it doesn’t seem to work well with the rest of the SSDT/Power Query UI]

To import a table or view in your database all you have to do is right-click on your data source and choose Import New Tables; my blog post from September last year describes how to do this, and how to use M functions for creating partitions.

To use your own SQL queries though you need to write some M code. First, import a table – any table, but preferably a small one – and get to the Query Editor UI. In this case I’ve imported the DimDate table from the Adventure Works DW database:

image

Next, select your query in the Queries pane on the left-hand side of the screen and open the Advanced Editor either by clicking on the relevant button in the toolbar (shown above) or by right-clicking on the query name in the Queries pane. You’ll see the following dialog:

image

The M code will be something like this:

let
    Source = #"SQL/localhost;Adventure Works DW",
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate

In this example the Source step creates a reference to the data source you have already created, and the dbo_DimDate step gets the contents of the DimDate table from this data source.

You can modify this code to use your own SQL by using the Value.NativeQuery() function (which I have blogged about here):

let
    Source = #"SQL/localhost;Adventure Works DW",
    MyQuery = 
	Value.NativeQuery(
		#"SQL/localhost;Adventure Works DW",
		"SELECT DISTINCT FiscalYear FROM DimDate"
	)
in
    MyQuery

Here what I’ve done is replaced the dbo_DimDate step in the previous query with a step called MyQuery that uses Value.NativeQuery() to run my own SQL.

image

Now all you need to do is click Import and you have the output of the query loaded into SSAS. It would be nice if there was UI support for using your own SQL queries when importing data in the future. Note that, as soon as you use this method, any other steps or queries further downstream will not be able to perform query folding, so you should make sure that you do as much of your filtering and transformation in the SQL as possible otherwise you may encounter performance problems.

The documentation describes a similar – but not identical – workflow for achieving the same result here. Personally I think it’s counter-intuitive that you should click on Expressions to create a Table object! Expressions are used for functions and other M code that is shared by the M queries used by Tables.

An alternative to doing all this is to go back to the old way of doing things and use a legacy data source rather a modern data source in SSDT. You lose the ability to use the Query Editor and M if you do this, but in a lot of cases you probably won’t care. The 17.4 release of SSDT for Visual Studio 2015, released in December 2017, has exposed a property that allows you to create legacy data sources again easily. In Visual Studio, go to the Tools menu and select Options and in the Options dialog go Analysis Services Tabular/Data Import and check “Enable legacy data sources”:

image

When you do this, you’ll notice two new options when you right-click on Data Sources in the Tabular Model Explorer pane: Import From Data Source (Legacy) and Existing Connections (Legacy).

image

This gives you access to the Table Import wizard that was available in previous versions of Analysis Services Tabular, which not only allows you to enter your own SQL but also creates a legacy data source that in turn makes it easy to use your own SQL when creating partitions.

Using the Resource Usage Profiler Event For SSAS Multidimensional Query Performance Tuning, Part 3: Aggregations And Indexes

Building aggregations in your SSAS Multidimensional will make your queries faster, right? While that’s true, they will only make a noticeable difference to performance if your query has Storage Engine-related problems rather than Formula Engine-related problems. What’s more, even when do you have Storage Engine-related problems there are some cases where you may find that aggregations don’t give you the kind of performance boost that you expect. In this post I’ll explain why this can happen, how you can use the Resource Usage Profiler event (as described in parts 1 and 2 of this series) to find out when this is happening, and how you can deal with the problem.

Aggregations make queries go faster for two reasons:

  1. Most importantly they contain pre-aggregated data. For example your fact table might contain data at the Day granularity but an aggregation might contain fact data aggregated up to the Year granularity. This means that when a query needs to get data at the Year granularity, SSAS does not need to read the fact table-level data stored in the partition, it can read the data it needs direct from the aggregation without needing to aggregate any data at query time.
  2. Secondly, because the size of an aggregation is usually a lot smaller than the size of the fact table-level data stored in a partition, it is much faster to read data from an aggregation.

However, regarding this second point, there’s a catch: you’ll know if you’ve read the previous posts in this series that SSAS builds indexes on fact data so it can scan that data very quickly, but most builds of SSAS do not build indexes on aggregations. I say ‘most’ because there were a few builds of SSAS that did build indexes on aggregations, but as this article explains this feature was turned off soon after it was introduced because it was found that the time spent building those aggregations was not usually worth any gain in query performance that resulted.

The Resource Usage Profiler event can be used to monitor the number of rows read and rows scanned when SSAS reads data from an aggregation.Taking the original test query and cube from the previous posts in this series:

SELECT 
{[Measures].[My Measure]}
ON 0
FROM
[MyCube]
WHERE([ID].[ID].&[1])

…the Resource Usage event shows that when run on a cold cache and no aggregations are built on the cube, SSAS scans 256 out of 5000 fact rows of data – a single page – and returns one row:

image

However if this cube has an aggregation built on the ID attribute of the ID dimension, the same granularity as the fact table:

image

…when the query runs and SSAS reads data from this aggregation rather than the fact data, the Resource Usage event returns the following:

image

Notice that the ROWS_SCANNED value is now 5000. This is because the aggregation has no indexes built on it so SSAS has to scan all the rows in the aggregation. Reading data from an aggregation at the same granularity as the fact data is therefore a lot less efficient than reading data from the original fact data.

Of course, since most aggregations are usually much smaller than the original fact data, the lack of indexes is not so important because scanning all the data is going to be very quick anyway. However, on very large cubes you may need to build some very large aggregations and find that even when your queries hit these aggregations, performance is still bad because of this lack of indexes. If you see this happening, and can see the ROWS_SCANNED value in the Resource Usage event reporting very high values, then it might be a good idea to enable the building of indexes on aggregations.

You can do this by changing the values of the AggIndexBuildEnabled and AggIndexBuildThreshold server properties in the msmdsrv.ini file. Setting AggIndexBuildEnabled to 1 allows SSAS to build indexes on aggregations. It’s not necessary to build indexes on all aggregations though: you can specify that only aggregations larger than a certain number of rows have indexes built using the AggIndexBuildThreshold property. The only public documentation for these properties is given in two articles here and here, and I strongly recommend you read these articles so that you understand the implications of doing this on your processing times. You should only consider changing these properties if you are a very experienced SSAS developer and you monitor the effects carefully – I’m not even sure if changing these properties is supported by Microsoft.

Using the Resource Usage Profiler Event For SSAS Multidimensional Query Performance Tuning, Part 2: Many-To-Many Relationships And Non-Materialised Referenced Relationships

You probably know that using many-to-many relationships or non-materialised referenced relationships can be bad for Analysis Services Multidimensional query performance. How can you measure their impact, though? In the first post in this series I showed how the Resource Usage Profiler event could be used to to monitor Storage Engine activity; in this post I’ll use it to show the effect of using these features on the amount of Storage Engine activity that takes place during query execution.

Many-to-many relationships

In my previous post I built a very simple SSAS cube with one measure group called Fact from a table with 5000 rows, then built a dimension called ID from that same table with 5000 members on its only hierarchy. For this post I added a second measure group called Fact Bridge based on the same table, added a new role-playing copy of the existing dimension called M2M ID, and then created a many-to-many relationship from this new dimension to the original measure group via the new measure group.

image

Because all the dimensions and measure groups are built from the same table, one member on the M2M ID dimension is linked to just one member on the ID dimension, so selecting a member on the M2M ID dimension will give the same result as selecting the member with the same key on the ID dimension. Even though SSAS see a many-to-many relationship, in the data it’s a one-to-one relationship.

As I showed in my last post, selecting one member from the ID dimension in a query like the following:

select 
{[Measures].[My Measure]} 
on 0,
{[ID].[ID].&[1]}
on 1
from
[M2M Test]

image

…results in a single page of 256 rows being read and a single row being returned from the cube, as shown by the Resource Usage event in a Profiler trace:

image

If the query is changed to use the M2M ID dimension instead:

select 
{[Measures].[My Measure]} 
on 0,
{[M2M ID].[ID].&[1]}
on 1
from
[M2M Test]

image

…the same result is returned, because the member [M2M ID].[ID].&[1] is associated with one member from the [ID].[ID] hierarchy, the member [ID].[ID].&[1], via the intermediate measure group. However the Resource Usage event shows something very different:

image

The ROWS_SCANNED value has gone from 256 to 5256, and ROWS_RETURNED has gone from 1 to 5003! Why? Part of the explanation is that we now have two measure groups that must be scanned, and the Resource Usage statistics are totals for all Storage Engine activity across all measure groups. In this query the Fact Bridge measure group is scanned first to resolve the many-to-many relationship between the M2M ID and ID dimensions, and then the Fact measure group is scanned to get the value for the measure My Measure. The Fact Bridge measure group only accounts for 256 rows scanned and 1 row returned though, the remaining 5000 rows scanned are from the main Fact measure group. The problem here is that SSAS does not translate the filter on the M2M ID dimension into a filter on the ID dimension (this is a limitation of the way SSAS handles many-to-many) so all the rows on the main Fact measure group get scanned in this query.

This explains something that I have blogged about before here, namely that if you partition your measure group by a dimension that is used in a many-to-many relationship you’ll see that all partitions are scanned and not just the partitions you expect to be scanned. The Resource Usage event shows that even when you don’t see unexpected partition scans happening, using a many-to-many relationship in a query can result in a lot of extra Storage Engine activity and therefore potentially worse query performance.

Non-materialised referenced relationships

Something similar happens when you use non-materialised referenced relationships (although materialised referenced relationships are OK). To test this I created another variation on my original cube, with just one measure group and the ID dimension as before but now with a new, role-playing instance of the ID dimension joining to the measure group through the ID dimension using a non-materialised referenced relationship.

image

image

The following query returns the same result as the two queries above:

select {[Measures].[My Measure]} on 0,
{[Ref ID].[ID].&[1]}
on 1
from
[RefDimTest]

image

…but again, the Resource Usage event shows the entire measure group is being scanned when this query runs:

image

I’m not much of a fan of referenced relationships anyway – you can usually get rid of them by redesigning your SSAS dimensions or your underlying dimensional model – so this one more reason not to use them.

Using the Resource Usage Profiler Event For SSAS Multidimensional Query Performance Tuning, Part 1: Rows Scanned And Rows Returned

If you’re performance tuning SSAS Multidimensional Storage Engine issues, the Resource Usage Profiler event can provide a lot of useful information about what’s going on behind the scenes when you run a query. This is something I have blogged about in the past (and it will be useful to read this post before carrying on) but recently I’ve done some more research into this area and found out a lot more things about what this event tells you.

For my testing I created a very simple cube from a single fact table. The table contained 5000 rows and two columns: a dimension key column containing the values 1 to 5000, and a measure column that always contained the value 1. From this I built a single measure group with a single measure called My Measure, and a single dimension built from the dimension key column with 5000 members on it called ID.

Consider the following MDX query:

SELECT 
{[Measures].[My Measure]}
ON 0
FROM
[MyCube]
WHERE([ID].[ID].&[1])

Here’s the output:

image

In this query the WHERE clause filters the output to one member from the ID dimension, which in turn returns data from one row in the underlying fact table, and so My Measure returns 1. If the query is run on cold cache, the Resource Usage Profiler event returns the following:

image

The ROWS_RETURNED value here returns 1, which is what you might expect – the query results show data from one row in the underlying fact table. The ROWS_SCANNED value is 256 though. Why? Chapter 20 of the book “Microsoft SQL Server 2008 Analysis Services Unleashed” has a lot of detail about how SSAS MD stores data on disk, but the important point here is that the data in a partition is stored on disk in segments, with each segment made up of pages, and when a query is run SSAS will scan all the pages that it thinks contain data. The ever-reliable Akshai Mirchandani of the SSAS dev team helped me with the remaining information I needed:

  • There are 65536 rows of data per segment
  • There are 256 pages per segment
  • There are therefore, at most, 256 rows per page

So in this case ROWS_SCANNED shows 256 because one complete page was scanned.

Modifying the query to slice on two members from ID like so:

SELECT 
{[Measures].[My Measure]}
ON 0
FROM
[MyCube]
WHERE([ID].[ID].&[1]:[ID].[ID].&[2])

image

…results in a ROWS_RETURNED of 2 and a ROWS_SCANNED that is still 256, because the two rows must be stored in the same page:

image

…while asking for 257 members from ID in the WHERE clause like so:

SELECT 
{[Measures].[My Measure]}
ON 0
FROM
[MyCube]
WHERE([ID].[ID].&[1]:[ID].[ID].&[257])

Results in a ROWS_RETURNED of 257 and a ROWS_SCANNED of 257 – obviously 2 pages are now being scanned to get the data needed for the query.

image

Finally, the query:

SELECT 
{[Measures].[My Measure]}
ON 0
FROM
[MyCube]
WHERE([ID].[ID].&[5000])

…returns a rows scanned of 136:

image

This must be because the final page, which doesn’t contain the full 256 rows, is scanned. 5000-136=4864, and 4864/256=19, so there must be 20 pages of data: 19 pages of 256 rows and one final page of 136 rows.

I don’t think it’s worth getting too hung up on the exact values that ROWS_SCANNED and ROWS_RETURNED, especially given that they return totals for all Storage Engine activity for all measure groups for the whole query, but knowing that they tell you roughly how much work is being done my the Storage Engine means that you can use them to watch for warning signs that something isn’t working properly when you’re performance tuning queries. In subsequent parts of this series I’ll show some practical examples of this.

%d bloggers like this: