Handling Time Zones

As a general rule, whenever several different people ask me about the same problem I begin to think it’s time to blog about it. In this particular case the problem is how to display data by different time zones; I’ve been asked about this a few times recently, possibly because AS2005’s ability to handle multiple languages and currency conversion means that people now feel they can build ‘international’ cubes, and this requirement arises as a result. For example, imagine you were collecting usage stats for your web application and were interested in seeing how many visitors you had at different times of the day, but users in your UK and US offices wanted to analyse this data by their local time. In your cube you’d need a Time Zone dimension and/or attribute so that, when you chose the GMT member, the selection 3:00 on 1st November 2005 on your Time dimension would show the same values as if you’d chosen the EST Time Zone and selected 22:00 on 31st October 2005. Note that this isn’t the same problem that Ralph Kimball talks about here, which doesn’t allow you to analyse data in anything other than the local time zone of the event or the standardised time zone.
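That translation can be sanity-checked in a couple of lines (a sketch using Python’s zoneinfo module against the tz database, purely to verify the offsets – US Daylight Saving Time ended on 30th October 2005, so EST is five hours behind GMT on these dates):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+, backed by the tz database

# 3:00 on 1st November 2005 in GMT...
gmt_time = datetime(2005, 11, 1, 3, 0, tzinfo=ZoneInfo("Etc/GMT"))

# ...is 22:00 on 31st October 2005 in US Eastern time (EST, since
# DST had already ended for the year)
est_time = gmt_time.astimezone(ZoneInfo("America/New_York"))
print(est_time)  # 2005-10-31 22:00:00-05:00
```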
The obvious first step towards solving this problem is to hold the data in your fact table at one common time zone, probably GMT. However doing the time zone translation isn’t as simple as finding how many hours + or – each time zone is from GMT, of course. The big complicating factor is Daylight Saving Time, and as Wikipedia shows it’s a pretty big complicating factor – in my opinion you wouldn’t want to write code to model these rules; you’d be much better off having a table which, for every hour of every day in every time zone you were interested in, gave the equivalent hour and day in GMT. So if you were interested in 20 time zones and had 5 years of data in your fact table, you would have approximately 5*365*24*20=876,000 rows in this table. Populating this table would be a pain though… I wonder if there’s a web service out there which can do this conversion for you, operated by people who keep on top of all the changes surrounding Daylight Saving Time, such as the recent ones Wikipedia mentions in Indiana?
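In fact, rather than hand-coding DST rules you could generate the table from the tz database, which is maintained by exactly the sort of people who do keep on top of these changes. Here’s a rough sketch of the idea in Python (the function name and the three-column layout are just my invention):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

def build_timezone_lookup(start, end, zones):
    """For every hour between start and end (both in GMT), emit one
    row per time zone giving the equivalent local date and time."""
    rows = []
    current = start
    while current < end:
        for zone in zones:
            # astimezone applies the correct UTC offset for that
            # instant, DST included
            local = current.astimezone(ZoneInfo(zone))
            rows.append((current.isoformat(), zone, local.isoformat()))
        current += timedelta(hours=1)
    return rows

# One day's worth of rows for two zones: 24 hours x 2 zones = 48 rows
rows = build_timezone_lookup(
    datetime(2005, 11, 1, tzinfo=timezone.utc),
    datetime(2005, 11, 2, tzinfo=timezone.utc),
    ["Europe/London", "America/New_York"],
)
print(len(rows))  # 48
```

Scaled up to 5 years of hours and 20 time zones, that loop produces the 876,000-odd rows mentioned above.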
How you can then use this table in your AS2005 cube design is the really interesting question. My first thought was that even though this isn’t a many-to-many relationship (unless you wanted the duplicated hours that occur when DST moves backwards to be modelled as one member rather than two, which doesn’t seem sensible) you could model it as such: the table described in the previous paragraph would become your intermediate measure group, which joins to the fact table via a GMT Time dimension; this Time dimension could then be reused (as a role-playing dimension) as your Time in Time Zone dimension, joined just to the intermediate measure group along with a separate Time Zone dimension. You could then hide your GMT Time dimension, and your users would just have to select whichever time zone they wanted to analyse by on the Time Zone dimension; the Time in Time Zone dimension would then behave exactly as described in my first paragraph.
However there are some drawbacks to this approach. First and foremost, you couldn’t make use of any aggregations above the hour attribute, because the many-to-many relationship is resolved at this level. This would mean that if you wanted to run queries at the Month or Year level they might not perform as fast as you’d like. Also, the larger the table you’ve built your intermediate measure group from, the slower it will be to resolve the many-to-many relationship, although partitioning this measure group by Time Zone and possibly Year will mitigate this effect.
The alternative is to model all this as a single regular dimension. You’d have a GMT Hour attribute, an Hour in Time Zone attribute (with a one-to-many relationship between the two) and a Time Zone attribute; the key thing to note here is that every hour in a time zone becomes a single member, so the overall dimension size is much larger than in the many-to-many approach. Since AS2005 can (so I’m told) handle dimensions of many millions of members even on 32-bit, the size issue wouldn’t be the showstopper it would have been in AS2000, although it would presumably have some impact on overall performance. The fact that you could now use aggregations above the Hour level would, on the other hand, be a big factor in improving performance, which I would guess would more than make up for the negative impact of dimension size. Judicious hiding of attributes and the use of user hierarchies (for example Time Zone->Year->Month->Day->Hour) would mean that the complexity of the dimension could be hidden from the user. The only real drawback I can think of is that because every Hour, Month, Year etc would be a separate member on the dimension, it would be difficult for users to create reports in an ad-hoc tool which selected a couple of months on a visible axis and then allowed them to slice by time zone – something that wouldn’t be an issue with the many-to-many approach.
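To make the difference concrete, here’s a sketch of how the rows behind this single regular dimension might be generated (Python against the tz database again; the column names are my own, not anything a wizard produces). Every (time zone, GMT hour) pair becomes one member, with the GMT Hour attribute related to it one-to-many:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

def build_time_dimension(start, hours, zones):
    """One member per (time zone, GMT hour) pair; each GMT hour is
    related to one local-hour member per time zone."""
    rows = []
    for h in range(hours):
        gmt = start + timedelta(hours=h)
        for zone in zones:
            local = gmt.astimezone(ZoneInfo(zone))
            rows.append({
                "TimeZone": zone,
                "HourInTimeZone": local.isoformat(),
                "GMTHour": gmt.isoformat(),
            })
    return rows

# Two days of hours in three zones: 48 x 3 = 144 members, but still
# only 48 distinct GMT hours - the one-to-many relationship
rows = build_time_dimension(
    datetime(2005, 11, 1, tzinfo=timezone.utc), 48,
    ["Etc/GMT", "Europe/London", "America/New_York"],
)
print(len(rows))  # 144
```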
 
One other approach I thought about was using a stored procedure called from an MDX Script assignment to perform the time zone translation – but as I said earlier, trying to handle all the wrinkles of DST changes in code would be a real nightmare, and calling a stored procedure multiple times to get values for each Hour probably wouldn’t perform well either.
I’d be interested to hear other people’s thoughts on how to solve this problem; even better, it would be good to hear from anyone who’s had to actually implement a solution. So post some comments! It would be useful to know whether the regular dimension approach does in fact perform better than the many-to-many approach, for instance…
UPDATE (March 2015)  – almost ten years after writing this post, I’ve been asked to revise it. With rather more experience of SSAS than I had back then, I can say that my proposed solution for a single dimension is just wrong and wouldn’t work, and using a stored proc wouldn’t be feasible either. The only workable solution would be the many-to-many approach.

SQL2005 RTM

I’m a few days late onto this news (I dunno, you take a day off on holiday and this happens) but in case you didn’t know, SQL2005 was released to manufacturing last Thursday. It’s available to download from MSDN if you’re a subscriber:
 
 

BI Certification from Microsoft

Euan Garden notes that the new SQL 2005 certifications have been announced, one of which is the MCITP: Business Intelligence Developer. See here for more details:
 
Looks like I’ll have to get one of these.

A new form of MDX

I guess those of you in North America have already seen this on TV:
(Thanks to George for sending the link)
 
"Coming to a store near you in November" – to coincide with the launch of SQL2005, no doubt.

Even more Excel BI stuff

After my last post earlier today, I came across a whole heap of new links I thought I should share. First of all, here’s the press release on new Office BI functionality, which includes a few screenshots:
Unfortunately I couldn’t get into the LiveMeeting recording that it links to. Here’s more reaction:
 
Most important of all, though, the video of the PDC presentation on Excel BI is now available to download at
Look for the following presentation:
OFF323: Building Business Intelligence Solutions Using “Excel 12” and SQL Server 2005 Analysis Services
 
I’ve just watched it and wow, it really looks like Excel 12 lives up to the hype. In fact there’s nothing that it’s doing that plenty of other tools haven’t done for ages, but this is EXCEL. And there are many pleasing touches such as the date filtering and all of the new formatting functionality that Excel 12 gives you. Now all Microsoft need to do is buy Tableau and incorporate their visualisation capabilities into Excel and you will have the killer BI app.
 
And continuing the train of thought about Excel Services (as I should be calling Excel server) and AS, it also struck me that there’s a certain overlap between it and Reporting Services too, at least as far as intranet reporting goes. Why go to the trouble of creating an RS report when you can simply push your spreadsheet up to SharePoint and see it in the Excel Services thin client? Obviously you can’t do this for all the scenarios you need to use RS for, but hmm, I wonder if you can build a custom data extension so you can point RS at an instance of Excel Services and grab data from a spreadsheet for your RS reports…
 

Scorecards Pricing and Excel Server

An interesting article here on the pricing scheme for "Office Business Scorecard Manager 2005" which lets out a few more details on the new Excel server functionality:
 
Despite the fact that Excel 12 will have much better AS integration (the Excel blog promises more details ‘in a few weeks’ in this recent posting, so I’m keeping my eyes peeled), as I and several other people have remarked, an Excel server will end up treading on the toes of AS in a few scenarios. Smaller apps, yes, when all the data can be put in the spreadsheet (and you can now put more on a sheet, as this post on the Excel blog notes) and also financial apps where you are most likely to have a group of power users who prefer to build things themselves without IT involvement. Take, for instance, this comment about the Excel server from the Information Week article:
"The main goal here is to ensure workgroups are seeing and working with the same numbers"
Which is, let’s face it, another variation on the theme of ‘one version of the truth’ that BI vendors have been playing for years.
 
In the short/medium term having an Excel server is a development to be welcomed, but in the long term is there going to be pressure to improve the scalability/performance of the Excel server so it can handle larger and larger apps? (Isn’t this how Essbase started out as a product?) Microsoft are going to have to be careful to position AS and Excel server appropriately. It also strikes me that another undesirable effect is that users will want to do more and more of their calculations in Excel, where they can define them themselves, and only take the raw figures from AS – so formulae end up being replicated all over the place with many subtle variations creeping in, taking us away from that ‘one version of the truth’ goal.

Reviewing BI Suites

It looks like the normally-reliable Business Intelligence Pipeline site has bitten off more than it can chew with its recent attempt to review the leading BI platforms. Not only have readers got upset about certain products being missing from the review – the result of a ‘no-review’ policy on the part of the vendors – but a quick look at the review of Analysis Services/Reporting Services reveals a number of very basic mistakes and misunderstandings. For example: "Chart generation, prevalent among other participants in our evaluation, was not an option with Reporting Services". Eh? Is this meant to mean you can’t have charts in RS reports? Of course you can. And even more worryingly: "Microsoft SQL Analysis Services is an extension to SQL Server 2000, with client access provided through its Enterprise Manager"; several other comments made me think that the literal interpretation of this sentence is in fact what the reviewer meant. Enterprise Manager as a client tool? What does this mean for the quality of the other reviews? And why, while we’re at it, did they review AS+RS and not AS plus ProClarity/Panorama, which is surely a more sensible combination?
 
To be fair, any attempt to review such a vast area as BI platforms, with all the different architectures, functionality etc on offer, is doomed to failure. Certainly all the Gartner reports on this subject – which people pay substantial amounts of money for – seem to be pretty useless and provide less information than would be revealed by spending a few hours Googling. The only worthwhile BI reviews I’ve ever seen are on the OLAP Report website, and even there Nigel Pendse focuses on OLAP engines and the odd front-end, not BI platforms as a whole.
 

Who reads my blog?

Sorry, this is completely off-topic, but I’ve got a little tired of the poor information that MSN Spaces provide on blog visitors and I’m curious to find out who is reading. So I’m going to run a little experiment with Site Meter – just click on the link below to see some detailed stats about visits to this blog:
 

Site Meter

 
Pity there’s no way of putting this permanently on my front page that I can see.

Usage-Based Partitioning

I was reading Dave Wickert’s excellent white paper "Project REAL: Analysis Services Technical Drilldown" the other day (you can get it here), specifically the section on P39 about partitioning. In it he discusses the new functionality in AS2005 which automatically determines which members from your dimensions have data in a given partition, and goes on to talk about the new possibilities this opens up in terms of partitioning strategy. Here’s an excerpt:
 

The partitions in the Project REAL database seem to violate one of the basic best practices of SQL Server 2000. There is no data slice set for the partitions. In SQL Server 2000, partitions must have the data slice set so that the run-time engine knows which partition to access. This is similar to specifying a hint to a relational query optimizer. In SQL Server 2005, this is no longer necessary. Processing the partition now automatically builds a histogram-like structure in the MOLAP storage. This structure identifies which members from all dimensions are included in the partition. Thus, so long as the storage method is MOLAP, the data slice is an optional (and unused) property. However, the data slice is used with ROLAP storage or when proactive caching involves a ROLAP access phase. In both of these circumstances, the actual fact data is never moved so the system does not have a chance to identify a member. In this case, setting the data slice for the partition remains a necessary and critical step if you expect the system to perform well.

Because the MOLAP structures dynamically determine the data slice, a new type of partitioning technique is possible with SQL Server 2005. The best way to describe this technique is via a simple example.

Suppose a system that you are designing has a product dimension of 1,000 products. Of these, the top 5 products account for 80% of the sales (roughly evenly distributed). The remaining 995 products account for the other 20% of the sales. An analysis of the end-user query patterns show that analysis based on product is a common and effective partitioning scheme. For example, most of the reports include a breakdown by product. Based on this analysis, you create six partitions. You create one partition each for the top 5 products and then one “catchall” partition for the remainder. It is easy to create a catchall partition. In the query binding, add a WHERE clause to the SQL statement as in the following code.

In the top five partitions (1 through 5) use the following code.

      SELECT * FROM <fact table>
      WHERE SK_Product_ID = <SK_TopNthProduct#>

In the catchall partition use the following code.

      SELECT * FROM <fact table>
      WHERE SK_Product_ID NOT IN (<SK_TopProduct#>,
                                  <SK_2ndTopProduct#>,
                                  <SK_3rdTopProduct#>,
                                  <SK_4thTopProduct#>,
                                  <SK_5thTopProduct#>)

This technique requires a lot of administrative overhead in SQL Server 2000 Analysis Services. In SQL Server 2000, the data slice must identify each and every member in the partition—even if there are thousands and thousands of members. To implement the example, you would need to create the catchall partition data slice with 995 members in it. This is in addition to the administrative challenge of updating that list as new members are added to the dimension. In SQL Server 2005 Analysis Services, the automatic building of the data slice in the partition eliminates the administrative overhead.

 
 This got me thinking… if we’ve got a Usage-Based Optimisation wizard for helping design the right aggregations for a cube, surely it’s possible to do something similar so that we can design partitions on the basis of the queries that users actually run? Here’s an idea on how it might work (nb this would be a strategy to use in addition to partitioning by Time, Store or other ‘obvious’ slices rather than a replacement):
  • First, get a log of all the queries that users are actually running. Unfortunately the Query Log in AS2005, like AS2000, doesn’t record the actual MDX of the queries run. The only way to do this is to use Profiler; I was a bit worried about whether doing this would have an adverse impact on query performance, but when I put the question to MS they indicated it shouldn’t be much (Richard Tkachuk also mentioned, as an aside, that turning off Flight Recorder should result in an increase of a few % in query performance – a tip to remember for production boxes, I think). Once you’ve run your trace you can then export all of the MDX statements from it to a text file very easily.
  • You’d then need a bit of code to extract the unique names of all the members mentioned explicitly in these queries. It should be a fairly simple task if you get the right regular expression, I think. Note that this ignores queries which use any kind of set expression – my thinking was that individually named members are going to be the most interesting because they’re going to be the ones which slice the queries the most: if users are querying on all the countries in the cube that’s not going to be any use for partitioning, but if they have a particular product in the WHERE clause that is much more useful to know about.
  • Then you could do some data mining to cluster these members by their propensity to appear in a query together. The idea is that each of the resulting clusters would translate into a partition; those members which didn’t fall nicely into a cluster and those members that didn’t get extracted in step #2 would have their data fall into one of Dave’s ‘catch-all’ partitions. Imagine this scenario: the UK branch of the Adventure Works corporation suddenly finds there is massive demand for bikes after petrol (‘gas’, for you Americans) prices rise massively. As a result, analysts in the UK run lots of queries which are sliced by the Product Category [Bikes] and the Country [UK]. You’d hope that this pattern would emerge in the clustering and result in a single partition containing all the data for ([Bikes], [UK]), so in the future similar queries run much faster.
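To give an idea of what the second step above might involve, here’s a first stab at the member-extraction code (a sketch only – the regular expression would need refining to cope with member names containing escaped brackets, and the sample queries are made up):

```python
import re
from collections import Counter

# Matches explicit member references such as [Product].[Category].[Bikes]
# or key-based ones like [Product].[Category].&[1]
MEMBER_RE = re.compile(r"(?:\[[^\]]+\]\.)+&?\[[^\]]+\]")

def extract_members(mdx_queries):
    """Count how often each explicitly-named member appears in a set
    of MDX statements exported from a Profiler trace."""
    counts = Counter()
    for query in mdx_queries:
        for member in MEMBER_RE.findall(query):
            counts[member] += 1
    return counts

log = [
    "SELECT [Measures].[Sales] ON 0 FROM [Adventure Works] "
    "WHERE ([Product].[Category].[Bikes], [Geography].[Country].[UK])",
    "SELECT [Measures].[Sales] ON 0 FROM [Adventure Works] "
    "WHERE [Product].[Category].[Bikes]",
]
counts = extract_members(log)
print(counts["[Product].[Category].[Bikes]"])  # 2
```

The resulting counts – or better, per-query member lists – would then be the input to the clustering in the third step.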

What does everyone think? There seems to be a lot of activity these days in the comments section of my blog, so I thought I’d invite feedback. Can anyone see a fatal flaw in this approach?

 

Tableau v1.5 released

Version 1.5 of Tableau, in my opinion probably the best-looking, easiest-to-use and most innovative (but unfortunately also rather expensive and fat-client-only) AS client tool, has just been released. You can see a list of all the new features here, chief of which is support for AS2005. If you’re looking for an AS client tool I strongly recommend you download a trial and take a look, even if you don’t think it can meet all your requirements – it really shows up how poor the other client tools out there are in user-interface terms.
 
I did a tiny bit of beta testing on this release and remained as impressed as I was when I first saw it. However the discovery that you can’t use Time Utility dimensions with the tool – a modelling technique which is going to be very common with AS2005 since that’s what the Time Intelligence Wizard builds to hang all your time calculations such as YTD and Previous Period Growth off – was a bit of a disappointment. I found the dev team very intelligent and responsive to feedback, though, and they’ve promised to look at this problem for the next release…