Comparing MDX Query Results With Tablediff
Recently I was working on a project where the customer wanted to upgrade from SP1 to SP2, but found that when they did so the results of some of the calculations on their cube changed. Unfortunately these changes weren't easy to spot: their MDX Script was >1500 lines long, and the bugs they were running into depended on the query being run and on whether the cache was cold or warm (in fact one of them was the bug I blogged about here). When testing various workarounds and hotfix builds, I ran into a problem I've hit on numerous occasions before: when I changed something in the MDX Script, how did I know it didn't break something else?
In the past what I've done is take a few MDX queries, run them, copy and paste the results into Excel, then make my changes, rerun the same queries, paste the new results into Excel again and use Excel formulas to compare the old and new resultsets. In this case, though, that approach just wasn't feasible and I started looking for a better way of comparing large numbers of MDX resultsets. There is definitely a tool to be written here (probably one which runs batches of queries, serialises the cellsets, saves them to disk, then compares the cellsets and highlights any cells with differences) but I didn't have time to write it. I was talking the problem over with Didier Simon of PSS, though, and he made a great suggestion: use tablediff.exe, the command line utility that comes with SQL Server for comparing relational tables. You can read more about it here:
http://msdn2.microsoft.com/en-us/library/ms162843.aspx
This turned out to be the key. I cannibalised bits of my Analysis Services cache warmer and created two SSIS packages. The first ran a batch of MDX queries, retrieved from a relational table (originally saved from a Profiler trace), through SQL Server using OpenRowset and saved the results to tables using SELECT INTO. Here's what the control flow looked like:
And here’s an example of the kind of SQL query I was generating:
select identity(int, 1,1) as idcolumn, * into Query1
from openrowset('MSOLAP', 'Datasource=localhost; Initial Catalog=Adventure Works DW',
'with cell calculation test for ''(*)'' as ''cstr(round(measures.currentmember,5))''
select Measures.[Internet Sales Amount] on 0,
[Date].[Calendar Year].Members on 1
from [Adventure Works]')
A couple of things to note. Tablediff needs a column which can uniquely identify each row in the tables it's comparing, hence the identity column. I also found that I was running into the precision problems described here, and as a result tablediff was flagging cells as having different values when in fact the values differed only by a minuscule amount, so I created a calculated cell assignment in the WITH clause which rounded every value to five decimal places. I could do this because I knew none of my queries had WITH clauses already, but another solution would have been to create a session-scoped calculated cell assignment which did the same thing.
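For reference, a session-scoped version would look something like the following. This is an untested sketch and the calculation name is made up; run outside the MDX Script, a CREATE CELL CALCULATION statement takes the same FOR and AS clauses as the WITH clause version above:

//session-scoped equivalent of the WITH CELL CALCULATION above (sketch, untested)
CREATE CELL CALCULATION [Adventure Works].[RoundForCompare]
FOR '(*)' AS 'cstr(round(measures.currentmember,5))'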
After running this package I had a SQL Server database with as many tables containing results as I had queries. I could then make my changes to the cube and rerun the same package to dump a new set of results to a different database, and then run my second package to compare the results. This was, again, a fairly simple package to put together:
All I did was use a ForEach loop to loop over the tables in the database containing the first set of results, and then use an Execute Process task to call tablediff to compare each table with the equivalent table in the second database. If it found any differences, I used the -et argument to save them to another table; I also used the -c argument to get column-level differences and the -b argument to specify the number of bytes to compare for large objects, necessary because the SELECT INTO creates an ntext column for all the member names on the Rows axis.
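To give an idea, the command lines the Execute Process task was building looked something like this (the database and table names here are made up, the -b value is arbitrary, and the path to tablediff.exe will depend on your installation):

"C:\Program Files\Microsoft SQL Server\90\COM\tablediff.exe" ^
 -sourceserver localhost -sourcedatabase ResultsBefore -sourcetable Query1 ^
 -destinationserver localhost -destinationdatabase ResultsAfter -destinationtable Query1 ^
 -c -b 5000 -et Query1Diffs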
Overall, it’s not an ideal solution (coding a proper app is definitely the way to go) but it did the job, and hopefully this will be useful to someone else…
Partitions and Aggregations
Something that's easy to forget (at least I'm always forgetting it) when creating an aggregation design is the impact that partitioning can have on it. Aggregations are only created on a per-partition basis (you can't aggregate across partitions), so, for example, if you're partitioning by Month there's little point building aggregations at granularities higher than Month; after all, if you only have one Month of data in your partition, then that data aggregated to the Month, Quarter or Year levels will give identical figures for each aggregation! In fact building aggregations above the granularity of Month can be a bit of a waste: an aggregation at Month granularity can be used by queries at Quarter and Year granularity, but an otherwise identical aggregation on the same partition at Year granularity can only be used by queries at Year granularity.
The best thing to do to ensure this doesn’t happen is to set the AggregationUsage property to None for all attributes above the granularity that you’re partitioning by, and also to make sure your partition row counts are set correctly, before you run either of the Aggregation Design wizards. This should ensure that your aggregation design properly reflects your partitioning strategy.
One of the questions I’ve always meant to research further on this topic is whether partitioning could hurt query performance, given that it effectively prevents certain aggregations being built. Taking the example we’ve been using so far, what happens if you’re partitioning by Month and you’ve got a slow query at Year granularity that you’d like to build an aggregation for – would partitioning by Year rather than Month, so you could build a true Year-level aggregation, make sense? Well, I’ve just had a chance to test this out on a fairly large cube (with approx 100 million rows in the main fact table, and Product and Customer dimensions with 1 million+ members on their key attributes) and interestingly partitioning seems to make no difference at all. I created two measure groups, one partitioned by Month and one partitioned by Year, and then created one aggregation on the former at Month/Product and one aggregation on the latter at Year/Product. I then ran a query that returned the top 10 Products at Year on each measure group and the performance was identical. Clearly this is not something to be worried about, thank goodness…
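For illustration, the test queries were along these lines. This is just a sketch with made-up cube and measure names, not the exact query I ran:

select [Measures].[Sales Amount] on 0,
TopCount([Product].[Product].[Product].Members, 10, [Measures].[Sales Amount]) on 1
from [TestCube]
where [Date].[Calendar Year].&[2004]

With the Year in the WHERE clause the TopCount has to be evaluated at Year granularity, which is exactly the case where you might worry that a Month-partitioned measure group with Month-granularity aggregations would be slower than one with a true Year-level aggregation.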
[Thanks are due to Eric Jacobsen of the SQLCat team for talking through some of these issues after the BI Conference earlier this year. He accurately predicted that partitioning wouldn’t hurt performance because the AS query engine is very efficient at merging the data caches of partitions.]
SQLBits II: The SQL
Since the first SQLBits was such a success, we’re doing it again – this time in Birmingham (that’s the UK one, obviously, not the one in America) on the 1st of March 2008 at the Lakeside Conference Centre. Put the date in your diary! For more information see: http://www.sqlbits.com/
If you came to the last one then you can expect the same thing again only bigger and better, and of course it remains free to attend. If you didn’t come to the last one and you’re wondering whether to believe the hype, then here are some comments plucked from the blogosphere that will hopefully sway you:
http://blog.benhall.me.uk/2007/10/sqlbits-post-conference.html
http://www.simple-talk.com/community/blogs/andras/archive/2007/10/08/37975.aspx
http://geekswithblogs.net/twickers/archive/2007/10/07/115914.aspx
Registrations aren’t open yet but at this point we’re looking for sponsors (http://www.sqlbits.com/information/Sponsorship.aspx) and speakers (http://www.sqlbits.com/information/SessionSubmission.aspx). If you’ve got a SQL Server-related product or service to promote (and that includes you, Microsoft BI vendors and consultancies) then there will be no better opportunity to meet potential customers. And if you’re dying to impart your hard-won technical knowledge to your peers and want to take your first steps on the conference circuit, again SQLBits is the place to be; in particular for the BI track I’d be happy to hear from anyone new willing to speak about SSAS, SSRS, SSIS, PerformancePoint or MDM.
Multiple Assignments Bug
I don't usually blog about the bugs I find in AS because, well, bugs usually get fixed pretty quickly and tend only to be seen in very specific scenarios. However I'm going to make an exception here because this bug is very easy to reproduce, it can lead to inconsistent values being returned, and it doesn't look like it's going to be fixed any time soon.
Here’s the repro on Adventure Works (I used the Simple version, but I’ll assume it works on the regular version too):
1. Comment out the MDX Script apart from the Calculate statement.
2. Add the two following assignments:

([Measures].[Reseller Sales Amount])=1;
([Measures].[Reseller Sales Amount], [Employee].[Employees].&[291])=2;

3. Run the following query on a cold cache and note that the value returned for the employee Stephen Y. Jiang is 11 (all other non-leaf members show values aggregated up from their leaf members too):

select [Measures].[Reseller Sales Amount] on 0,
[Employee].[Employees].members on 1
from [Adventure Works]

4. Now, without clearing the cache, run the following query, which only returns data for Stephen Y. Jiang, and note that the value returned is now 1:

select [Measures].[Reseller Sales Amount] on 0,
[Employee].[Employees].&[272] on 1
from [Adventure Works]

5. Now rerun the query from step 3 and note that the value for Stephen Y. Jiang is still 1.
I have a case open with PSS about this, and from what I can gather the problem is to do with making multiple assignments to the same cell; I've only reproduced it with a parent/child hierarchy, but I'm told it could appear on a regular hierarchy as well. In my opinion it's the values returned in step 3 that are incorrect, since an assignment on the member Jae B. Pak should only trigger an aggregation from leaf level for that member's siblings; the values from step 4 for Stephen Y. Jiang and the other non-leaf members are right.
The job I'm working on at the moment is a financial application, and the only workarounds seem to be liberal use of FREEZE or making sure that you never assign values to the same cell more than once, neither of which is exactly feasible in my situation: I'm already dealing with a vast amount of code in the MDX Script and getting either approach to work is next to impossible. I would assume that most financial applications will have parent/child hierarchies (e.g. on an Accounts dimension) and multiple overlapping assignments on them, so be warned! I wonder if this is going to be a problem for PerformancePoint applications?
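For anyone considering the FREEZE route, here's the shape of it using the assignments from the repro above. This is only a sketch of the technique, not a complete fix; whether it's practical depends entirely on how your script is structured:

([Measures].[Reseller Sales Amount], [Employee].[Employees].&[291])=2;
//pin the current value of this cell so that subsequent assignments cannot change it
FREEZE([Measures].[Reseller Sales Amount], [Employee].[Employees].&[291]);
([Measures].[Reseller Sales Amount])=1;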
Enterprise Cube
Picked up from Dan Meyers and Peter Koller, it looks like Microsoft are working on some packaged BI solutions aimed at specific verticals which will be sold under the brand name ‘Enterprise Cube’. Apart from Mark Kromer’s blog, which Dan points to, the only other information I can find about it is this session abstract from the Australian BI summit, which describes the first of these solutions aimed at the telco industry:
Microsoft Enterprise Cube – BPM Solutions for Today’s Business Needs
Haresh Khatwani, Group Product Manager MEC, Microsoft & Daniel Yu, Product Manager, Industry Solutions Group, Microsoft
Microsoft Enterprise Cube is the only packaged end-to-end BI offering at Microsoft and provides the framework for pain-point-specific BI solutions for industry verticals. Using case studies from leading telecommunications companies in the USA and Korea, this session will show you how the MEC framework enables modularity, quick installation and quick time-to-market, and how your organisation can leverage the Microsoft BI product stack to report, analyze, trend and forecast business data without affecting live business support systems.
I'm always sceptical about the technical value of generalised industry solutions, except when the data involved varies so little in format that you can genuinely stick a pre-defined cube on top of anyone's data with the minimum of customisation. Solutions on the Microsoft BI platform that I know of which can do this include Data Intelligence's PharmaAnalytics solution for pharma market research data and PrecisionPoint's solution for Dynamics, and I'm sure there are hundreds of other similar ones out there. Perhaps they could be brought into the 'Enterprise Cube' fold too under some kind of partner programme?
From a marketing point of view, though, I think it makes a lot of sense – it will bring on board a lot of customers and partners who would otherwise not look at the Microsoft BI platform, and as Peter comments this is something the other BI vendors have been doing for a long time and makes Microsoft look more credible in this space.
Viewing Partition Slice Information
The other week I discovered that it was possible to view the partition slice information discussed here using the DISCOVER_PARTITION_DIMENSION_STAT schema rowset. I’ve always thought it would be cool to be able to visualise this information so you could easily check to see whether your dataid ranges overlapped or not; to do this, though, you would need to see the slice information for all your partitions in one resultset and DISCOVER_PARTITION_DIMENSION_STAT only returns data for one partition. So this week I dusted off my coding skills and wrote a sproc that loops through every partition in a measure group, calls DISCOVER_PARTITION_DIMENSION_STAT and concatenates the results as well as checking to see which partitions have ranges which overlap each other. I’ve added it to the source code for the Analysis Services Stored Procedure Project, which you can get hold of here:
http://www.codeplex.com/ASStoredProcedures
Note that the function isn't part of the latest release; you'll have to download the code and compile it yourself. As an example of what it does, if you connect to the Adventure Works database in SQL Server Management Studio and then run the following statement:
call assp.DiscoverPartitionSlices("Adventure Works", "Internet Sales")
…you’ll get a resultset showing all the partition slice information for every attribute on every partition in the Internet Sales measure group.
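For comparison, here's roughly the raw call the sproc makes for each partition: a Discover request for the DISCOVER_PARTITION_DIMENSION_STAT rowset, which you can also run yourself from an XMLA query window in Management Studio. The partition name below is an assumption about how the Adventure Works sample names its partitions:

<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
  <RequestType>DISCOVER_PARTITION_DIMENSION_STAT</RequestType>
  <Restrictions>
    <RestrictionList>
      <DATABASE_NAME>Adventure Works DW</DATABASE_NAME>
      <CUBE_NAME>Adventure Works</CUBE_NAME>
      <MEASURE_GROUP_NAME>Internet Sales</MEASURE_GROUP_NAME>
      <!-- partition name is a guess; check your own partition names -->
      <PARTITION_NAME>Internet_Sales_2004</PARTITION_NAME>
    </RestrictionList>
  </Restrictions>
  <Properties>
    <PropertyList/>
  </Properties>
</Discover>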
Having done this I was able to put together a simple SSRS report to display this information. I say simple, but it probably took me longer to create the report than to write the sproc because I couldn't work out how to visualise the ranges in a useful way; in the end I used the RangeColumn chart style from Dundas, which seemed to be the best fit. Here's an example of what the report returned for the Date attribute on the Date dimension and the Internet Sales measure group:
I cheated a bit by setting the IndexBuildThreshold server property very low so that indexes would be created for the 2001 and 2002 partitions, but it shows what a healthy set of slices looks like (the Internet Sales measure group is partitioned by Calendar Year). Another attribute worth looking at is the Product attribute on the Product dimension:
What we see here is that in 2001 and 2002 only a small number of products were actually sold compared to 2003 and 2004. In this situation it might be worth trying to reorder the members, or to set a slice on the 2001 and 2002 partitions, so that AS knows about this distribution: if you then ran a query asking for the sales across all time periods of a Product that was only sold in 2003/4, AS would not need to scan the 2001 and 2002 partitions because it would know in advance that there would be no data for that product in them. One of the improvements I would like to make to the sproc is to display member names and unique names as well as dataids so that the slice ranges are easier to understand; Mosha also had the idea that it would be good to be able to take these unique names and use them to set a slice on a partition for exactly the scenario I've just described.
Lastly, I found what is almost certainly a bug in AS when I was looking at the Calendar Semester attribute on the Date dimension:
Calendar Semester has, as you would expect, a many-to-one relationship with the Year attribute (each semester belongs to exactly one year), so we would expect to see no overlaps as a result; but as the report shows, that isn't the case. When you run a query on this attribute, for example:
select measures.[internet sales amount] on 0,
{[Date].[Calendar Semester].&[2004]&[1]} on 1
from [adventure works]
you see partitions being hit when they shouldn't be (in this case the 2003 and 2004 partitions), and this sometimes even seems to happen when you have a slice explicitly set on the partition using the Year attribute, which is pretty bad. Greg Galloway did a bit of digging on this and discovered that the SQL generated returned rows in the same mixed-up order as the dataids of the members, and that fixing the SQL so members come back in the order you want the dataids assigned stops this happening to a certain extent. Mosha says that it's a problem with the decode tables created within the dimensions, but whatever the cause you may want to check your dimensions to see if it's happening to you. This kind of problem has been noticed before and I knew about the Disable Prefetch Facts connection string property that was introduced to deal with it, so I wonder if this is related in some way? Anyway, I logged the bug on Connect and I'll blog about any new information that I get on it:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=308793