Microsoft Technical Computing Initiative

An interesting announcement here from Microsoft about its new Technical Computing Initiative:

http://blogs.technet.com/microsoft_blog/archive/2010/05/17/modeling-the-world.aspx

Lots of the usual PR-speak and vagueness, but here are the main points from the post above:

In terms of technology, the initiative will focus on three key areas:

  1. Technical computing to the cloud: Microsoft will help lead the way in giving scientists, engineers and analysts the computing power of the cloud.  We’re also working to give existing high-performance computing users the ability to augment their on-premises systems with cloud resources that enable ‘just-in-time’ processing. This platform will help ensure processing resources are available whenever they are needed—reliably, consistently and quickly. 
  2. Simplify parallel development: Today, computers are shipping with more processing power than ever, including multiple cores. But most modern software only uses a small amount of the available processing power. Parallel programs are extremely difficult to write, test, and troubleshoot.  We know that a consistent model for parallel programming can help more developers unlock the tremendous power in today’s computers and enable a new generation of technical computing. We’re focused on delivering new tools to automate and simplify writing software through parallel processing from the desktop… to the cluster… to the cloud.    
  3. Develop powerful new technical computing tools and applications: Scientists, engineers and analysts are pushing common tools (i.e., spreadsheets and databases) to the limits with complex, data-intensive models. They need easy access to more computing power using simpler tools to increase the speed of their work, and we’re building a platform with this objective in mind. We expect that these efforts will yield new, easy-to-use tools and applications that automate data acquisition, modeling, simulation, visualization, workflow and collaboration.

And from this article on the Wall Street Journal, here’s a practical example of what will be delivered:

Muglia offers an example of how Microsoft plans to make high-performance computing more accessible: Today many financial services firms use the company’s Excel spreadsheet application to develop financial models, but if the firms need the power of a supercomputer to crunch numbers, they often have to write specialized applications in programming languages like Fortran that a much smaller group of users are fluent in.

Microsoft’s Technical Computing group is working on software that will allow a program like Excel to run in parallel on thousands of machines so the application can be used to tackle monster financial computing chores on its own, Muglia says.

It’s been a while since there was any wild speculation on this blog but I can’t resist it – all this talk of running Excel in parallel on multiple machines and the cloud makes me wonder if this is going to work with PowerPivot too? Or rather, will this work with whatever PowerPivot/Vertipaq becomes when it grows up into a corporate BI tool?

PowerPivot Samples Available

There are some new PowerPivot samples available for download – sample data in workbooks, plus DAX calculation examples. You can get them here:

http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=eac83429-c6e5-48a6-87cf-00a4141e5441

http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=1ae63bfb-c303-44e3-ae44-7413d499495d

Microsoft BI Indexing Connector

Just seen this on the SharePoint BI blog, the Microsoft BI Indexing Connector:

http://blogs.msdn.com/sharepointbi/archive/2010/05/14/announcing-the-microsoft-business-intelligence-indexing-connector.aspx

From the blog post:

With this new Indexing connector, users have a dedicated Report tab where they can find the reports they are looking for, use refiners to further narrow their searches, and even get a preview of the report before opening it in the browser or client…

…In addition to discovering the documents, the MSBIIC will also crawl the data sources, revealing items that are not necessarily in the report itself but critical to the user’s discovery and as part of the refiners.

What happens when you do a Process Update on a dimension?

Over the last few days I’ve been involved in an interesting thread on the SSAS forum about what happens when you do a Process Update on a dimension. It’s a topic that isn’t widely understood (indeed, I didn’t know all the details myself until today) but it’s nonetheless very important: one of the commonest performance-related problems I see in my consultancy work is partitions that have aggregations designed for them, but where those aggregations aren’t in a processed state because a Process Update has been run on one or more dimensions. Anyway, just now Akshai Mirchandani from the dev team posted a really good overview on that thread of what actually happens when you run a Process Update, so I thought I’d copy it here to make sure it gets a wider audience:

Here is a quick summary of what happens when you do ProcessUpdate:

1. After the dimension has been updated, the server analyzes the changes that occurred to the dimension. In 2005, this analysis was pretty simple and would often incorrectly detect that major changes had occurred that required clearing of indexes and aggregations. In 2008, this code was improved such that it more often would realize that nothing significant has occurred. It’s a fairly small (but useful) optimization — I guess nobody thought it was worth documenting!

2. Based on this analysis, the server will decide whether or not indexes and aggregations need to be cleared. If no (e.g. because records were only added and not deleted/updated), then the partitions won’t be affected.

3. If indexes/aggregations need to be cleared, then the server will check if ProcessAffectedObjects was enabled — if yes, then instead of clearing the indexes/aggregations it will rebuild the indexes/aggregations.

4. The act of clearing the indexes/aggregations also shows up as "partition processing operations" in Profiler — that’s one of the things that has been confusing some of you.

5. When aggregations are cleared, only the flexible aggregations need to be cleared because we’re guaranteed by the rigid relationships that the members cannot have moved and therefore the rollups cannot have changed. However, indexes can still have changed and therefore you may still see the partition processing jobs kick off to clear the indexes.

6. ProcessIndexes and ProcessClearIndexes take care of building both bitmap indexes (aka map) and aggregations — the context is that both aggregations and bitmap indexes are generically considered "indexes".

Really the main takeaway here is that if you ProcessUpdate a dimension, you should strongly consider either doing ProcessAffectedObjects or an explicit ProcessIndexes on the affected partitions so that bitmap indexes and flexible aggregations get rebuilt. The advantage of explicitly doing ProcessIndexes is that you can bring your cube online earlier and have the indexes/aggregations get processed more lazily in the background — a number of customers prefer to do that because their processing windows are too small to wait for the indexes to get processed.
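
To make that takeaway a little more concrete, here’s a rough sketch of what the ‘explicit ProcessIndexes’ option might look like in XMLA: a batch that runs a Process Update on a dimension and then rebuilds the indexes and aggregations on one of the affected partitions. All of the object IDs below (database, dimension, cube, measure group and partition) are made up for the purposes of illustration – you’d need to substitute the IDs from your own database, and repeat the second command for each affected partition:

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <!-- Process Update the dimension that has changed -->
  <Process>
    <Object>
      <DatabaseID>Adventure Works DW 2008</DatabaseID>
      <DimensionID>Dim Customer</DimensionID>
    </Object>
    <Type>ProcessUpdate</Type>
  </Process>
  <!-- Then rebuild the bitmap indexes and flexible aggregations on an affected partition -->
  <Process>
    <Object>
      <DatabaseID>Adventure Works DW 2008</DatabaseID>
      <CubeID>Adventure Works</CubeID>
      <MeasureGroupID>Fact Internet Sales</MeasureGroupID>
      <PartitionID>Internet_Sales_2004</PartitionID>
    </Object>
    <Type>ProcessIndexes</Type>
  </Process>
</Batch>

Of course, as Akshai says, you don’t have to run the two commands in the same batch – you could run the Process Update on its own to bring the cube online sooner, and then run the ProcessIndexes commands later in the background.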

Also related to this topic, I thought I’d highlight a great post by Darren Gosbell where he shows how to find out whether or not your aggregations are processed:

http://geekswithblogs.net/darrengosbell/archive/2008/12/02/ssas-are-my-aggregations-processed.aspx

Tuning SSRS-Generated MDX Parameter Queries

Sometimes you’ll find yourself in the position of building SSRS reports where you have parameters with a large number of available values. Using the Adventure Works cube as an example, if you were to drag the Customer attribute from the Customer dimension onto the filter area of the Query Designer for a simple query and check the Parameters box like so:

[Screenshot: the Query Designer, with the Customer attribute in the filter pane and the Parameters box checked]

…you’d end up with a parameter where you can choose any customer to filter on – and there are 18485 customers on that hierarchy.

If you right-click on your data source in the Report Data pane (in BIDS in SSRS 2008) and check the Show Hidden Datasets option, you can see the MDX query that BIDS generates to return the list of available values for the parameter:

[Screenshot: the Report Data pane with Show Hidden Datasets enabled, showing the auto-generated parameter dataset]

Here’s what the query looks like for the Customer hierarchy of the Customer dimension:

WITH
MEMBER [Measures].[ParameterCaption]
AS [Customer].[Customer].CURRENTMEMBER.MEMBER_CAPTION
MEMBER [Measures].[ParameterValue]
AS [Customer].[Customer].CURRENTMEMBER.UNIQUENAME
MEMBER [Measures].[ParameterLevel]
AS [Customer].[Customer].CURRENTMEMBER.LEVEL.ORDINAL
SELECT
{[Measures].[ParameterCaption]
, [Measures].[ParameterValue]
, [Measures].[ParameterLevel]}
ON COLUMNS
, [Customer].[Customer].ALLMEMBERS
ON ROWS
FROM [Adventure Works]

On my laptop this query executes in just over 1.5 seconds. Not bad, you might think, for a query that returns a fairly large number of rows. But we can do better!

This query returns all customers on rows, with three columns: the caption of each member, its unique name, and the ordinal of its level (which is used to indent each member’s caption in the dropdown list for the parameter, so you can easily distinguish between members on different levels). These values are returned as calculated members, but they can also be obtained as member properties, and this is the key to tuning the query. So, if you create a new OLEDB connection to the cube (i.e. you don’t use the built-in Analysis Services connection type, but instead create a data source that connects to the cube using the OLEDB connection type)…

[Screenshot: creating a data source that connects to the cube using the OLEDB connection type]

…and then create a new dataset with the following query:

WITH
MEMBER MEASURES.DUMMY AS NULL
SELECT
{MEASURES.DUMMY}
ON COLUMNS
, [Customer].[Customer].ALLMEMBERS
DIMENSION PROPERTIES UNIQUE_NAME, MEMBER_CAPTION, LEVEL_NUMBER
ON ROWS 
FROM [Adventure Works]
CELL PROPERTIES VALUE

You’ll find you get almost exactly the same data back, although this time the query returns in around 0.2 seconds.

A few things need to be noted here. First of all, although MDX allows you to put an empty set on columns, in the query above I had to create a dummy calculated measure that returned null, because otherwise the query didn’t return any rows from the OLEDB data source. Secondly, in the new query the All Member’s unique name and caption come out as nulls – that’s normal behaviour for flattened rowsets (which is what you get when you run queries through an OLEDB connection), unfortunately, and again something we’re going to have to work around ourselves. Thirdly, we also need to create a column with indented member names – the original parameter dataset did this using an SSRS expression in a calculated field – although in this case, where there’s only one level underneath the All Member, we could probably skip this and not hurt usability.
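
Just to illustrate that first point, this is the variation I mean – the same query with an empty set on columns instead of the dummy measure. It’s valid MDX, but run through the OLEDB connection it returned no rows, which is why the dummy calculated measure is needed:

SELECT
{}
ON COLUMNS
, [Customer].[Customer].ALLMEMBERS
DIMENSION PROPERTIES UNIQUE_NAME, MEMBER_CAPTION, LEVEL_NUMBER
ON ROWS
FROM [Adventure Works]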

To trap the nulls and make sure the All Customers member appears as a parameter option, you can use a calculated field on the new dataset with an expression like this:

=iif(
Fields!Customer_Customer_Customer_UNIQUE_NAME.Value is Nothing
, "[Customer].[Customer].[All Customers]"
, Fields!Customer_Customer_Customer_UNIQUE_NAME.Value)

And to generate the indented captions you can use an expression like this:

=iif(
Fields!Customer_Customer_Customer_UNIQUE_NAME.Value is Nothing
, "All Customers"
, " " + Fields!Customer_Customer_Customer.Value)

You then need to delete the original parameter dataset, point the report parameter at the new dataset and bind these two columns to its value and label fields. And lo and behold, you have a report that runs just over a second faster than it did before. This might seem like a lot of hassle to go through for such a small gain, but if you have more than one large parameter the time savings will add up and your users will notice the difference.

BI User Group, London, May 20th

Just a quick post to say I’ll be hosting and speaking at a BI user group event in London on May 20th. My session will be on ‘Implementing Common Business Calculations in DAX’ and there’ll be one other session from someone still to be confirmed. More details and registration here:

http://sqlserverfaq.com/events/223/Business-Intelligence-Implementing-common-business-calcs-using-DAX-in-PowerPivot-Chris-Webb.aspx

Counting Returning Customers in DAX

As promised, I’m going to be putting up a few more DAX examples here – although they may not demonstrate any new and interesting concepts, I’m still learning the language and blogging is a good way for me to practise implementing common calculations.

So, today’s problem is: how do we find the number of distinct customers who bought something in the current time period and who have also bought something in the past? That’s to say we want to count the number of returning customers, as opposed to completely new customers who have never bought anything from us before. This is, basically, a variation on the problem of how to get a distinct count in DAX that Marco has already dealt with comprehensively here, but with aspects of a time intelligence calculation (a topic which is well covered here). I’ll be using two tables from Adventure Works to illustrate this: FactInternetSales and DimDate.

What we need to start off with is to find the set of distinct customers who bought something in the current time period. The following simple DAX expression finds this set and gives me the distinct count:

=COUNTROWS(DISTINCT(FactInternetSales[CustomerKey]))

We now need to filter these customers so that we only return the ones who bought something in the range of dates from the very first date we have data for, up to the day before the first date in the current time range. This can be accomplished with the DatesBetween, FirstDate and DateAdd functions. With the DatesBetween function, if you pass a Blank value to the first parameter it will give you the first date you have data for as your start date (see here for another example of this); for the end date in the range, we find the first date in the current date range with FirstDate, then get the day before using DateAdd:

DATESBETWEEN(
DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)
)

We then need to use this date range inside the Filter function as follows:

=CALCULATE(COUNTROWS(DISTINCT(FactInternetSales[CustomerKey]))
, FILTER(DISTINCT(FactInternetSales[CustomerKey])
, CALCULATE(SUM(FactInternetSales[SalesAmount]),
DATESBETWEEN(DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY))
, ALL(DimDate))>0))

However, there’s one last problem to solve: for the first date in the time dimension, using the DateAdd function in this way will return a Blank value, and using Blank as both the start and the end date will result in us getting all the dates in the column back from the DatesBetween function. This means that for this first date we get the same value as the distinct count, because we are finding the set of all customers who bought something on that first date and then applying a filter to see whether these customers bought something on any date, which of course they did:

[Screenshot: results showing the calculation returning the same value as the plain distinct count for the first date]

So we need to use the IF function to check whether using DateAdd to get the previous date returns a Blank, and if it does, return a Blank ourselves. Here’s the final expression:

=IF(
DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)=BLANK()
, BLANK()
,CALCULATE(COUNTROWS(DISTINCT(FactInternetSales[CustomerKey]))
, FILTER(DISTINCT(FactInternetSales[CustomerKey])
, CALCULATE(SUM(FactInternetSales[SalesAmount]),
DATESBETWEEN(DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY))
, ALL(DimDate))>0)))

[Screenshot: results showing the returning customer counts, with a blank for the first date]

One last point: even though I’ve written this expression on the RC version of PowerPivot, I still need to use ALL(DimDate) on the last line because I am joining FactInternetSales and DimDate using the OrderDateKey column, which is an integer surrogate key. This wouldn’t be necessary if I was joining the two tables using a datetime column. Personally I think this is going to confuse no end of people and is a bit rubbish – especially because, in many data warehouses, the only way to join a fact table and a time dimension table is using an integer surrogate key; even worse, if you try to bring a datetime column from a time dimension table down onto the fact table using a calculated column, you’ll get a circular reference error. Hopefully this will get fixed in the next release…

UPDATE: I’ve found a better way of doing this in DAX. Instead of using the Filter function, which can be a bit slow, you can just use the filter arguments of the Calculate function itself. Here’s the new version:

=IF(
CALCULATE(COUNTROWS(), DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY))=0
, BLANK()
, CALCULATE(
COUNTROWS(DISTINCT(vwFactInternetSalesWithDate[CustomerKey]))
,DISTINCT(vwFactInternetSalesWithDate[CustomerKey])
, DATESBETWEEN(DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)
)))

Here, the two filter arguments I’m using are:

  1. DISTINCT(vwFactInternetSalesWithDate[CustomerKey]), to filter by the customers who bought something in the current time period, and
  2. DATESBETWEEN(DimDate[FullDateAlternateKey], BLANK(),  DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)), to filter by the set of all dates up to the day before the first date in the current period

ANDing these two arguments together gives us a context that contains only the rows for customers who bought something in the current time period and for dates up to the day before the current time period starts. It’s a little harder to understand what’s going on here but a lot more elegant, I think.


SQL Server Metadata Toolkit 2008

One of the few sessions I was able to see last week at the PASS European Conference (I had to leave very early, unfortunately) was Markus Raatz’s excellent “New Treasures for Microsoft BI Found on Codeplex”. Among other things, he showed off a new project on Codeplex that I hadn’t seen before: SQL Server Metadata Toolkit 2008. It’s an updated version of the SQL Server Metadata Toolkit that was released for 2005 a few years ago, and allows you to trace metadata dependencies between the relational database, SSIS packages and SSAS objects. So if, for example, you change a column name in a dimension table you should be able to work out which SSIS packages and which SSAS dimensions will be broken. You can download it here:

http://sqlmetadata.codeplex.com/

I’ve not had a chance to download it yet, but it looks like it will be very useful.

Analysis Services and Solid State Disks

There’s a lot of discussion going on in the SQL Server relational world about solid state disks and their benefits, but for some reason very little has been said about how useful they could be for Analysis Services. And yet, with their impressive performance for random reads, SSAS and SSDs would seem to be the perfect combination. I think the reason for the silence is that there’s less general understanding of SSAS performance issues out there in the blogosphere, and probably also that, because there are fewer really large installations of SSAS, not many people have had the chance to test SSDs with SSAS properly.

Anyway, the other week I was lucky enough to spend some time testing SSDs with SSAS for a customer of mine. For reasons of privacy I can’t share any significant details about what I found, and before I go on I need to point out that what I was doing was comparing the performance of a particular SSAS cube on a particular existing hard disk configuration with the same SSAS cube on a particular SSD configuration. So while it’s certainly possible to say that, in lab tests, SSDs can perform extremely well for certain tasks, in the real world you have to ask the following question: what benefit will an SSD give me for my cube, over my existing disk configuration? So in fact even if I could share the details of my tests they wouldn’t be all that useful.

That said, here are a few high-level findings:

  • Backup on the SSD was approximately 50% faster than on their existing, very respectable, hard disk configuration; Restore was 65% faster. These were the clearest examples of the SSD showing massive performance improvements.
  • For various reasons I couldn’t do full tests on processing performance, but Process Index operations were about 10-15% faster, as was Synchronisation.
  • Performance of individual queries was not significantly faster, maybe 1-2%; there were some cases where, with large numbers of concurrent users, I saw a bigger improvement of around maybe 10%. This didn’t surprise me at all because I knew that any query performance issues the customer has are related to the formula engine and not the storage engine, so IO wasn’t much of a problem in the first place.

These tests also led to some interesting behind-the-scenes discussions about how you can actually measure the impact of IO on SSAS storage engine operations, which Teo has already blogged about here and which are well worth reading:
http://prologika.com/CS/blogs/blog/archive/2010/04/13/analysis-services-i-o-time.aspx
http://prologika.com/CS/blogs/blog/archive/2010/04/16/using-xperf-to-test-analysis-services.aspx

For this particular customer, my recommendation was that it wasn’t worth the time, money and effort to move to SSDs, at least from a query performance point of view. In fact, based on several years’ experience of tuning SSAS implementations, I find that in most cases IO is not one of the most important factors in determining query performance – it’s more often the case that the problem lies in the formula engine, and even when it doesn’t, an appropriate partitioning strategy and aggregation design can work wonders.

However, let me be clear: I am not making some general statement that SSDs are not useful for SSAS. What I am doing is giving the usual ‘your mileage may vary’ and ‘it depends’ answer. There are definitely going to be SSAS implementations where SSDs will make a massive difference to query performance, and where no other tuning technique can have the same impact. For example, I’ve heard that with distinct count measures (especially when there are a large number of concurrent users) SSDs can lead to very impressive improvements in query performance; I would assume that where there are large many-to-many relationships you would also benefit accordingly. What you need to do before you spend money on SSDs is to understand the causes of any performance issues you currently have and make sure you tune your cube to the best of your abilities.

Has anyone else had any experience with SSAS and SSDs they’d like to share?

UPDATE: the SQLCat team have published this technical note on the subject: http://sqlcat.com/technicalnotes/archive/2010/09/20/analysis-services-distinct-count-optimization-using-solid-state-devices.aspx

SQLBits VI in the bag

Wow, another SQLBits is over, and in my opinion it was the best yet. I know I always say that after we’ve done one, but this time we broke our attendance record (I think the official figure was 425 people, more than 50 more than our previous highest attendance) and everything went very smoothly indeed. We even came through the volcanic ash problem relatively unscathed: only one speaker couldn’t make it because of the flight cancellations, amazingly, although several European attendees were grounded and a lot of people will be having an unplanned weekend break in London right now. After his flight got cancelled, Bob Duffy got his pregnant wife to drive him overnight from Dublin so he could speak – that’s what I call dedication to the cause!

A big thank you goes to all my fellow members of the SQLBits committee, to all the volunteers who worked so hard on the day and the night before, to all our sponsors and to everyone who turned out and helped make the day such a success. Now, where shall we go for the next one? And how am I going to get to Germany next week for PASS Europe?

UPDATE: and if you did come to SQLBits, please fill out our feedback surveys online –
Conference Feedback
Session Feedback