Oracle 11g materialised views

You may (or may not) have noticed that Oracle 11g got released recently. One of the new features that caught my eye is discussed in these two posts by Seth Grimes and Doug Henschen on the Intelligent Enterprise blog:
Seems like Oracle have used a new OLAP engine (i.e. not Express or Essbase) as the basis for managing large numbers of materialised views. When I first read this I thought it was the kind of feature that only a die-hard relational-database-lovin’ OLAP-denier could ever get excited about, but then I realised that that’s probably the point. It would be cool if SQL Server could use Analysis Services in the same way; it would open the eyes of a lot of new people to the power of Analysis Services.

Resurrect the XMLA Council!

A few weeks ago I had an interesting email conversation with Nick Goodman and Julian Hyde after Nick noticed that the XMLA Council’s web site, which used to be at www.xmla.org, now redirects to a site ‘brought to you by Simba Technologies’. Now I have nothing against Simba – in fact they have a number of interesting products – but the lack of an independent web site highlights the fact that the XMLA Council is in effect dead, having not met (so I understand) for several years now. A few days later I saw that Andrew Wiles had blogged on this topic too:
http://andrewwiles.spaces.live.com/blog/cns!43141EE7B38A8A7A!199.entry

As Andrew points out, at present XMLA interoperability is something of a myth. There are a few tools that do manage it, such as Rex, which (unlike Mosha) I was able to use successfully against AS2005 as well as Mondrian, although it’s probably not worth the bother (it’s nowhere near as good as SQL Management Studio for running MDX queries). JPivot is another open source tool that claims to work against AS and Mondrian, but although I know other people have got it working against AS2K (see for example here: http://forums.pentaho.org/showthread.php?t=49954) and AS2005, I’ve never been able to do so against AS2005 despite several hours of effort. The only commercial product that works against multiple platforms that I’ve had experience of is Panorama (though I know there are others out there), and as this blog entry suggests it’s been a hard slog for them to work with SAP’s bizarre implementation of MDX.

I guess what happened with the XMLA Council is what happens with a lot of initiatives like this: there was an initial burst of enthusiasm that dissipated once its ideals came into conflict with the demands of real-world product development. I know it’s pie-in-the-sky to expect perfect cross-platform interoperability, but I think the present situation could be made a lot better, and the time has come to resurrect the XMLA Council – and I think it’s up to Microsoft to take the lead on this. I don’t want to suggest that Microsoft have some kind of moral obligation to do this as de facto owners of the spec, rather that while everyone would benefit from increased interoperability, Microsoft would benefit most. The first reason is that the XMLA-compatible client tool market is dominated by tools that work against AS and are sold by Microsoft partners, and they would be able to expand their potential customer base by supporting other servers like SAP BW and Essbase. More importantly though, the client tool that everyone really wants to use is Excel, and if it were possible to hook Excel 2007 up to other OLAP engines it would cement its position as the BI client tool of choice. Reporting Services’ support for SAP BW and Essbase shows that Microsoft are interested in supporting competing OLAP tools, so is it unrealistic to expect that Excel could support the same platforms?
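In case anyone reading this hasn’t looked at the spec itself, it’s worth remembering how simple XMLA is at heart: a client just POSTs SOAP messages containing Discover and Execute calls. Stripped of its SOAP envelope, an Execute request carrying an MDX query looks something like this (the Catalog value here is just an example – it’s whatever your database is called):

<Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
  <Command>
    <Statement>select {[Measures].[Internet Sales Amount]} on 0 from [Adventure Works]</Statement>
  </Command>
  <Properties>
    <PropertyList>
      <Catalog>Adventure Works DW</Catalog>
      <Format>Multidimensional</Format>
    </PropertyList>
  </Properties>
</Execute>

Given how small the surface area is, you’d think interoperability would be easy – the devil, as ever, is in how each server interprets the MDX inside the Statement element.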

When are named sets in the WITH clause evaluated?

Following on from my postings of last week about dynamically-generated sets, I had a really interesting email yesterday from my colleague at Solid Quality Mentors, Francesco De Chirico. He sent me a variant of one of my queries which worked when he wasn’t expecting it to, and to be honest, when I looked at it I couldn’t work out why it worked either. However, after a bit of testing I realised he’d discovered something quite important that I at least didn’t know; as with my last post on sets in the MDX Script, it all makes sense when you think about it (see also Mosha’s comment on my last post for some explanation).

Consider the following query:

with
set mycheapset as {[Department].[Departments].&[2],[Department].[Departments].&[3]}
select {} on 0,
mycheapset on 1
from [Adventure Works]

It returns in a second on a cold cache on my laptop, as you’d expect. Now consider the same query with an extra, expensive set declaration in it that isn’t referenced in the query:

with
set myexpensiveset as bottomcount(
{[Department].[Departments].&[2],[Department].[Departments].&[3]}
*
{[Scenario].[Scenario].&[1],[Scenario].[Scenario].&[2] }
*
[Account].[Account].[Account].members
*
[Date].[Date].[Date].members
, 10,[Measures].[Amount])
set mycheapset as {[Department].[Departments].&[2],[Department].[Departments].&[3]}
select {} on 0,
mycheapset on 1
from [Adventure Works]

This now executes on a cold cache in 13 seconds on my laptop despite returning exactly the same results. Even a query which doesn’t reference any of these sets executes in a minimum of 13 seconds:

with
set myexpensiveset as bottomcount(
{[Department].[Departments].&[2],[Department].[Departments].&[3]}
*
{[Scenario].[Scenario].&[1],[Scenario].[Scenario].&[2] }
*
[Account].[Account].[Account].members
*
[Date].[Date].[Date].members
, 10,[Measures].[Amount])
set mycheapset as {[Department].[Departments].&[2],[Department].[Departments].&[3]}
select {[Measures].[Internet Sales Amount]} on 0
from [Adventure Works] 

As with named sets in the MDX Script, named sets in the WITH clause are being evaluated when the query runs regardless of whether they’re referenced or not.

This has obvious implications for performance tuning: if you’ve got an MDX query which has a lot of expensive set definitions in the WITH clause that aren’t used, then you really need to delete them! However, you might have queries in custom-built applications or in something like SSRS where sets are present that may or may not be necessary, depending on certain parameters. Here’s an example of the type of thing I’m talking about:

with
set myexpensiveset as extract(bottomcount(
{[Department].[Departments].&[2],[Department].[Departments].&[3]}
*
{[Scenario].[Scenario].&[1],[Scenario].[Scenario].&[2] }
*
[Account].[Account].[Account].members
*
[Date].[Date].[Date].members
, 10,[Measures].[Amount]), [Department].[Departments])

set mycheapset as {[Department].[Departments].&[2],[Department].[Departments].&[3]}

member measures.test as
iif([Department].[Departments].currentmember is [Department].[Departments].&[2],
count(myexpensiveset), count(mycheapset))

select {[Measures].test} on 0,
[Date].[Calendar Year].members on 1
from [Adventure Works]
where([Department].[Departments].&[3])

In this case the WHERE clause might be parameterised and could contain either Department 2 (which would return the count of the expensive set) or Department 3 (which would return the count of the cheap set). You can rewrite this to be much more efficient by taking advantage of the fact that sets are evaluated after the WHERE clause so you only evaluate the expensive set when you need to:

with
set myiifset as
iif([Department].[Departments].currentmember is [Department].[Departments].&[2],
extract(bottomcount(
{[Department].[Departments].&[2],[Department].[Departments].&[3]}
*
{[Scenario].[Scenario].&[1],[Scenario].[Scenario].&[2] }
*
[Account].[Account].[Account].members
*
[Date].[Date].[Date].members
, 10,[Measures].[Amount]), [Department].[Departments]),
{[Department].[Departments].&[2],[Department].[Departments].&[3]})

member measures.test as count(myiifset)

select {[Measures].test} on 0,
[Date].[Calendar Year].members on 1
from [Adventure Works]
where([Department].[Departments].&[3])

When are named sets in the MDX Script evaluated?

Interesting discovery I made last week: I was tuning a cube and noticed that all my queries, when they were run on a cold cache, were much slower than I was expecting – even the most basic query seemed to take at least 40 seconds. After a lot of head-scratching I looked in Profiler and found the answer: there were two named sets in the MDX Script that used very complex expressions and together took 40 seconds to evaluate. Commenting them out reduced all my query times by 40 seconds. The problem was that I wasn’t referencing these sets in any of my queries…!

I had thought that named sets in the MDX Script were evaluated the first time they were actually used in a query but this is demonstrably not the case. Consider the following query:

with set mytest as bottomcount(
{[Department].[Departments].&[2],[Department].[Departments].&[3]}
*
{[Scenario].[Scenario].&[1],[Scenario].[Scenario].&[2] }
*
[Account].[Account].[Account].members
*
[Date].[Date].[Date].members
, 10,[Measures].[Amount])
select {} on 0,
mytest on 1
from [Adventure Works]

Executed on a cold cache on the Adventure Works cube it returns in 13 seconds on my laptop. Consider also the following very basic query:

select {[Measures].[Internet Sales Amount]} on 0,
[Date].[Calendar Year].members on 1
from [Adventure Works]

This returns in under a second on a cold cache on the Adventure Works cube. Now go into Visual Studio to edit the Adventure Works cube and add the set from the first query as a named set at the end of the MDX Script, like so:

create set myexpensiveset as
bottomcount(
{[Department].[Departments].&[2],[Department].[Departments].&[3]}
*
{[Scenario].[Scenario].&[1],[Scenario].[Scenario].&[2] }
*
[Account].[Account].[Account].members
*
[Date].[Date].[Date].members
, 10,[Measures].[Amount]);

If you then clear the cache and re-execute the second query, which used to complete in only 1 second, it will now take 13 seconds despite the fact that it doesn’t reference this set in any way. If you take a look in Profiler you can see that the Execute MDX Script event, which is fired the first time you run a query on a cold cache, is now taking 13 seconds, and that’s what’s causing the query as a whole to take so long. So named sets in the MDX Script are evaluated when the MDX Script is executed, and that takes place the first time you run a query after the cache has been cleared – either by running a ClearCache command or by processing your cube.
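For anyone who hasn’t used it before, the ClearCache command I’m talking about is an XMLA command you can run from an XMLA query window in SQL Management Studio; something like the following should do the job for the whole database (the DatabaseID here is the one from my copy of the Adventure Works sample – yours may differ):

<ClearCache xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>Adventure Works DW</DatabaseID>
  </Object>
</ClearCache>

Remember that, as described above, clearing the cache like this also means the MDX Script gets re-executed on the next query – which is exactly why it’s such a useful tool for spotting this problem in Profiler.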

olap4j

I’ve just come across olap4j (http://www.olap4j.org/) which might be of interest for anyone out there working with Java to access OLAP cubes – as the website says, it’s like JDBC for OLAP (and so I guess you can also say it’s like ADOMD.Net for Java). It’s designed to work with a range of OLAP servers including AS2005. The interesting thing about it for me isn’t the cross-platform support, though, but that in addition to support for MDX it will have an API for programmatically generating queries – something I think would be useful to have in Microsoft’s own data access layer. Implemented correctly it would remove the need for programmers to learn how to write efficient MDX, and hopefully bring about a kind of standardisation of queries which would make writing calculations on the cube easier.

Using Dynamically-Generated Sets in the MDX Script

Even more fun: I’ve just found that you can use the technique for dynamically generating sets within a cube’s MDX Script. For example, add the following set declarations to the end of the Adventure Works cube’s MDX Script:

create hidden set myyears as [Date].[Calendar].[Calendar Year].members;

create hidden set mytest as
generate(myyears,
strtoset("
intersect({},
{topcount(descendants([Date].[Calendar].currentmember,[Date].[Calendar].[Date]), 10,[Measures].[Internet Sales Amount])} as
[Top days for "
+ myyears.current.item(0).item(0).name + "])"
));

Then connect to the cube with your favourite client tool and, bingo! you see all the sets available to use in your queries:

[screenshot: the new named sets listed in the client tool]

The only thing to watch out for is that, for some reason, you need to declare the set you pass in to the first parameter of the Generate function separately as I have above, rather than inline.

I can see this as being quite useful – if you need to create a large number of very similar sets on your cube it’s much more manageable than declaring each set individually since you only need to write the set expression once.

Cumulative Update Package 2 for SP2

For those of you who like to live life on the bleeding edge, the second cumulative update package for SP2 (build 3175) is now available for download here:
Some relationally-minded friends of mine have commented that these hotfix rollups always seem to contain more than their fair share of AS fixes, and I have to admit that they are probably right. I also heard about one welcome change that I believe is now in there: the AS dev team have got round to making the query-cancellation functionality a lot more responsive, so no more hours wasted waiting for the query to end after you clicked cancel. Hurray!

Advanced Ranking and Dynamically-Generated Named Sets in MDX

Calculating ranks is a classic MDX problem, and as is usually the case Mosha has an excellent blog entry on the subject which I would encourage you to read if you haven’t done so already:
http://sqljunkies.com/WebLog/mosha/archive/2006/03/14/mdx_ranking.aspx

His advice can be summarised fairly simply: where possible, declare an ordered named set to use in the Rank function rather than trying to do the ordering every time you want to calculate a rank, because it’ll give you much better performance. This is all you need to know for 99% of the rank calculations you’ll ever write; however, I’ve been doing a bit of thinking about the remaining 1% of scenarios and here’s what I’ve come up with.
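Just to recap the pattern for anyone who hasn’t read Mosha’s post, the single-criterion approach looks something like this against Adventure Works (here I’m ranking Customers by Internet Sales Amount – the set gets ordered once, and every evaluation of the rank measure reuses it):

with
set orderedcustomers as
order([Customer].[Customer].[Customer].members
, [Measures].[Internet Sales Amount], BDESC)
member measures.customerrank as
rank([Customer].[Customer].currentmember, orderedcustomers)
select {[Measures].[Internet Sales Amount], measures.customerrank} on 0,
[Customer].[Customer].[Customer].members on 1
from [Adventure Works]

The alternative – putting the Order call inside the calculated member itself – forces the set to be re-sorted for every cell, which is exactly what Mosha warns against.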

First of all it’s worth pointing out that the Order function isn’t always the most efficient way of ordering a set. As Mosha points out in this blog entry:
http://sqljunkies.com/WebLog/mosha/archive/2007/04/19/stored_procs_best_practices.aspx
… there’s a performance bug in the Order function which makes it very slow. Mosha’s own example query:

with member y as
count(
Order( [Customer].[Customer].[Customer].MEMBERS
*[Product].[Category].[Category].MEMBERS
*[Product].[Style].[Style].MEMBERS
, Measures.[Internet Sales Amount], BDESC))
select y on 0 from [Adventure Works]

…runs in 40 seconds on my laptop, and while he shows how a sproc can improve performance, he doesn’t mention that if you rewrite it to use the TopCount function you get even better performance. The following query runs in 4 seconds on a cold cache on my machine:

with
member y as count(topcount(
   {[Customer].[Customer].[Customer].MEMBERS
  *[Product].[Category].[Category].MEMBERS
  *[Product].[Style].[Style].MEMBERS} as myset
, count(myset), Measures.[Internet Sales Amount]))
select y on 0
from [Adventure Works]

The TopCount function uses a different algorithm to the Order function and it’s optimised to return the top n members of an ordered set; for a relatively small set like the one in this example it performs better than the existing buggy implementation of Order but that may not always be the case. And of course it’s highly likely that the Order function will be fixed in a future service pack so at some point it will start performing better than TopCount. As a result, use this approach at your own risk and test thoroughly!

But let’s get back onto the subject of ranking proper. The obvious problem that Mosha doesn’t deal with in his article is what happens when you have to calculate a rank on more than one criterion. To take Mosha’s example, what about calculating the rank of Employees by Reseller Sales Amount for several Months, where those Months are to appear on Columns? If you know what those Months are going to be in advance it’s fairly straightforward, because you can create multiple named sets to use; the worst problem you’re going to have is that your query is going to be pretty long. But what if you don’t know what those Months are going to be, or how many of them there are? For example, you might be filtering Months on another criterion or using the TopPercent function. There’s no way you can create the named sets you need to get good performance if you don’t know how many sets you’re going to need, is there? If you had complete control over the code of your client tool you could dynamically generate your MDX query string to give you all the named sets you needed, but that would be a pain even if it was possible at all (and it wouldn’t be with many off-the-shelf tools like ProClarity Desktop). Well, one solution to this problem simply uses one named set for all your Months:

WITH
SET MyMonths as TopPercent([Date].[Calendar].[Month].Members, 20, [Measures].[Reseller Sales Amount])
SET MyEmployees as [Employee].[Employee].[Employee].MEMBERS
SET OrderedEmployees AS
GENERATE(MyMonths,
ORDER(
[Employee].[Employee].[Employee].members
, ([Measures].[Reseller Sales Amount],[Date].[Calendar].CurrentMember), BDESC)
, ALL)
MEMBER [Measures].[Employee Rank] AS
RANK([Employee].[Employee].CurrentMember,
SubSet(
OrderedEmployees
,(RANK([Date].[Calendar].CurrentMember, MyMonths)-1) * Count(MyEmployees)
, Count(MyEmployees)
))

SELECT
MyMonths
*
[Measures].[Employee Rank]
ON 0
,MyEmployees ON 1
from [Adventure Works]

The set MyMonths contains the months that I’m interested in, and as I said, because it uses the TopPercent function I don’t know in advance how many Months it will contain. However, I do know that there’s a static number of Employees that I want to rank, so in my OrderedEmployees set I use the Generate function to create a concatenated list of ordered Employees for each Month (note the use of the ALL flag here to make sure Generate doesn’t do a distinct on the set it returns). In my Employee Rank calculation I can then use the SubSet function to pick out the section of this set which contains the ordered list of Employees for the current Month: it’s the subset that starts at index (RANK([Date].[Calendar].CurrentMember, MyMonths)-1) * Count(MyEmployees) and is Count(MyEmployees) members long.

BUT… of course this only works because we know there are the same number of Employees each month. What happens if we change the calculation and ask whether the current Employee is in the top 75% of Employees by Reseller Sales Amount for each month? In this case 8 Employees make up the top 75% for August 2003, but there are 10 for September 2003, and so on, so this approach isn’t any use.

The solution to this problem came to me while I was driving home down the motorway on the way back from my in-laws’ house on Saturday afternoon, and when it did my wife asked me why I had suddenly started smiling so broadly – this is something that will get all you MDX fetishists out there (all three of you) equally excited. Basically it’s a way of dynamically generating named sets within a query. Let’s take a look at the whole solution first:

WITH
SET MyMonths as TopPercent([Date].[Calendar].[Month].Members, 20, [Measures].[Reseller Sales Amount])
SET MyEmployees as [Employee].[Employee].[Employee].MEMBERS
SET MyMonthsWithEmployeeSets as
Generate(
MyMonths
, Union(
{[Date].[Calendar].CurrentMember}
,
StrToSet("
Intersect({},
{TopPercent(MyEmployees, 75, ([Measures].[Reseller Sales Amount],[Date].[Calendar].CurrentMember))
as EmployeeSet" + Cstr(MyMonths.CurrentOrdinal) + "})")
))

MEMBER [Measures].[TopEmployee] AS
iif(
RANK([Employee].[Employee].CurrentMember,
StrToSet("EmployeeSet" + Cstr(Rank([Date].[Calendar].CurrentMember, MyMonths)))
)<>0, "Yes", "No")

SELECT
MyMonthsWithEmployeeSets
*
[Measures].[TopEmployee]
ON 0
,MyEmployees ON 1
from [Adventure Works]

This executes in 2 seconds on a cold cache on my laptop, compared to 52 seconds for the equivalent query which evaluates the TopPercent for every single cell, so it’s definitely a big improvement. In the set declaration for MyMonthsWithEmployeeSets I’m using the Generate function to iterate over the set of Months I’m going to display on columns and return exactly the same set, but while doing so I find the set of the top 75% of Employees for each Month and store it in a named set declared inline. The way I do this is to return a set containing the current Month from the iteration and union it with an expression which returns an empty set; the top 75% set is evaluated and stored inside the expression which returns the empty set. The fun bit is that the expression which returns the empty set is inside a string which is passed to StrToSet, and as a result I can dynamically generate the names of my named sets using a static string plus the result of the CurrentOrdinal function. In the example above I end up creating five named sets called EmployeeSet1 to EmployeeSet5, one for each Month. Thanks to the fact that I can reference these sets in another calculated member so long as it’s evaluated further down the execution tree (see http://www.sqlserveranalysisservices.com/OLAPPapers/ReverseRunningSum.htm), assuming I construct my SELECT statement appropriately I can then get at the contents of these sets within my TopEmployee calculated member, using another call to StrToSet and some string manipulation to determine the name of the set that I’m after. How cool is that?

New release of the Analysis Services Stored Procedure Project

After months of procrastination, there’s finally a new release of the Analysis Services Stored Procedure Project available here:
Thanks to Darren Gosbell for doing the build and tidying up the wiki. Several of the new features have already been blogged about here and on Darren’s blog so I won’t go into too many details, but there’s a lot of cool new stuff there to enjoy. My favourite is Greg Galloway’s CreatePartitions function that allows you to automatically create partitions with slices based on the members of an MDX set – great for creating proof of concept cubes; his functions to dump AS memory and disk usage information are also impressive. Darren’s XMLADiscover class contains a load of useful stuff, particularly the ClearCache function which clears the cache on a database or a particular cube. There’s also some stuff that hasn’t even been included in the project yet but hopefully will be soon, so stay tuned…
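If memory serves, the ClearCache function is as easy to use as the rest of the project – something like the following, run from an MDX query window after the assembly has been registered, should clear the cache of the current database (I’m writing this from memory, so check the wiki for the exact signature and any overloads that take a cube name):

call ASSP.ClearCache()

Much more convenient than hand-crafting the equivalent XMLA ClearCache command every time you want to test a query on a cold cache.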

PASS Germany Bootcamp

Just heard from Markus Fischer that PASS Germany are planning another (German-language) bootcamp. Details here:
There are two tracks, a DBA one and a Reporting Services one.