Ah, October 6th at last – the date when I was promised All Would Be Revealed. I’d been hearing rumours of something very new and exciting in the world of Microsoft BI for a while but never had any details (they probably reasoned that telling an inveterate blogger like me something top secret would be asking for trouble, but honestly I can keep my mouth shut when I need to); Mosha and Marco both mentioned it recently but didn’t give anything away either.
Kilimanjaro is the code name for the next release of SQL Server, due 2010
Project Madison is the code name for what’s being done with DATAllegro
Project Gemini is the new, exciting thing: an in-memory storage mode for Analysis Services. To quote Tom Casey in the Intelligent Enterprise article: "It’s essentially another storage mode for Microsoft SQL Server Analysis Services with access via MDX, so existing applications will be able to take advantage of the performance enhancements." But it’s clearly more than that – from the Forrester blog entry above: "Its Gemini tool (to be available for beta testing sometime in 2009 and general availability in 2010) will not only enable power users to build their own models and BI applications, but easily make them available to power users, almost completely taking IT out of the loop. In Gemini, the in-memory, on the fly modeling will be done via a familiar Excel interface. Once a new model and an application is built in Excel, a power user can then publish the application to Sharepoint, making it instantly available to casual users. Not only that, but the act of publishing the model to Sharepoint also creates a SQLServer Analysis Services cube, which can be instantaneously accessed by any other BI, even non Microsoft, tool"
So, self-service cube design and in-memory capabilities. Sounds very, very reminiscent of Qlikview and other similar tools; and given that Qlikview is by all accounts growing rapidly, it’s an obvious market for MS to get into. I guess what will happen is that end users will get a kind of turbo-charged version of the cube wizard where they choose some tables containing the data they want to work with, and it builds a cube that works in ROLAP-ish mode on top of this new in-memory data store. We’ll also get even better query performance too (from COP? pointer-based? data structures).
All in all, super-exciting and despite all the hype about end-user empowerment I’m sure there’ll be even more opportunity for the likes of me to earn consultancy fees off this doing MDX work, tuning etc. But the point about end-user empowerment brings me back to Qlikview: I’ve never seen it, but it’s interesting because I’ve heard some very positive reports about it and some very negative ones too. From what I can make out it is very fast and easy-to-use, and has some great visualisation capabilities, but I’ve also heard it’s very limited in terms of the calculations you can do (at least compared to MDX); I’ve also heard that it’s marketed on the basis that you don’t need a data warehouse to use it – which perhaps explains some of its popularity, but also explains more of the negative comments that it’s had, because of course if you don’t build a data warehouse you’re going to run into all kinds of data quality and data integration issues. Perhaps this last point explains why Qlikview does so appallingly in the BI Survey’s rankings of how well products perform in a competitive evaluation. So something to be wary of if you’re giving tools to end users…
Anyway, if you’re at the BI Conference and have any more details or thoughts on this, please leave a comment!
Windows High Performance Computing (HPC) server is due to launch at the beginning of November (see here for a news report on it, here and here for some details). Hmm, I see SQL Server is listed on the ‘supported applications’ page… surely there’s got to be some kind of tie-in here with the whole MatrixDB/DATAllegro MPP stuff?
It’s ridiculous, really, that OLAP and data mining functionality have co-existed inside Analysis Services for years now yet they never seem to be used together. I only ever work with the OLAP side of things and just play around with data mining – I’ve never done a serious project with it, unfortunately – and from what I can see the people out there working with Analysis Services data mining don’t tend to use it with cubes; it’s two separate worlds. Anyway, to get to the point, while I was preparing for SQLBits a few weeks ago I came up with a simple idea of how you could use AS data mining to generate forecast data which could then be loaded into a cube, an approach that I think could be easily adapted for use in most real production systems. Now, having spent an evening in a Munich hotel room testing this out, I present my findings.
The data set I’m using here was compiled by Tony Rogerson and put into a dimensional model by Allan Mitchell, and it contains the details of postings to SQL Server newsgroups over the past 8 years. For the purposes of this example I aggregated the data so it contained a single fact table with one measure, the number of postings to all newsgroups per day, which joined to just one Time dimension. It’s interesting data in fact. Looking at the high level at the number of postings per year from 2002 to 2007, you can see that newsgroup postings hit a peak in 2004 and then went into decline:
I guess this is probably a result of the introduction of the MSDN Forums and their growth at the expense of newsgroups. Looking at the day level, you can see an obvious pattern where the number of postings is much higher during the week than at weekends:
The first thing I did was build a simple cube from this data, with a single sum measure representing the number of postings made and a simple Time dimension with Years, Quarters, Months and Dates. This allowed me to explore the data and create the above graphs in Excel. However I only had data up to April 29th 2008, and what I wanted to see was a forecast in my cube of postings for the rest of 2008 and 2009. Here’s what I did to get that:
I created a new dimension in my cube called Scenario. It was built from a table containing the following two rows:
This dimension then had a single attribute hierarchy with two members on it, Actual and Forecast.
I then added a new foreign key column to my existing fact table so I could join the new Scenario dimension to it. This column always contained the value 1 because all the data in my fact table only represented Actual values.
Next I created a new data mining model using the Microsoft Time Series algorithm based off my fact table. The Time dimension foreign key column was used as my Key Time column and the measure representing the number of postings was set to be an Input as well as Predictable. Now I’m certainly not a data mining expert and I didn’t spend any time tweaking the model, but it seemed to give reasonably good results at the day level picking up the weekday/weekend variation already noted above: Obviously if you were doing this for real you’d want to spend a bit more time making sure your mining model was giving you good results.
With my mining model processed, I was then able to use the following DMX query to give me a flattened resultset showing the predicted number of postings per day from April 30th 2008 until the end of 2009: SELECT FLATTENED PredictTimeSeries([Forecasts].[Postings], 611) From [Forecasts] One thing I found in order to get this to work nicely was that I had to change the ‘meaningful’ surrogate keys on the Time dimension that Allan had provided me with (eg the key for April 30th 2008 was 20080430, May 1st 2008 was 20080501) to meaningless integer surrogate keys where each new day had a key value one greater than the previous day (eg April 30th 2008 was key 3043, May 1st was key 3044 etc). This meant that when I ran the above DMX query it returned surrogate key values for each predicted value that I could use with my time dimension – before I did this I found the DMX query just added 1 to the date key for each predicted date, giving me incorrect key values like 20080430, 20080431, 20080432 and so on for April 30th, May 1st, May 2nd and so on.
I then took the results of this query and used SSIS to load them into a second fact table in my relational data source. The SSIS data flow task looked like this: Here’s what I did at each step:
Used an OLEDB data source, connected up to Analysis Services, to run the initial DMX prediction query
Added a derived column with the value 2, as the foreign key for my Scenario dimension, to show that this was all Forecast data.
Used a data conversion transform to convert the two columns returned by the DMX query from bigints to ints.
Used an OLEDB destination to load the data into a new Forecast fact table (with a structure identical to the existing fact table) in SQL Server.
Finally I went back to my cube and added a new partition to my existing measure group, pointing at the new Forecasts fact table, and reprocessed. This meant I now had data for the Actual member on my Scenario dimension up to April 29th 2008, and data for the Forecast member from April 30th 2008 until the end of 2009. And voila – forecast data loaded into my cube at Day level, which is then of course easy to explore, see aggregated up to Month, Quarter and Year level, and so on. Here’s what the data looked like in an Excel pivot chart: And of course if I didn’t include the Scenario dimension in my query I could look at Actual and Forecast values aggregated up to say the Month level:
Some thoughts on how you’d implement this in a real production system:
You’re not going to want to (or probably be able to) use the Time Series algorithm at the granularity of your fact table. It might make sense to use it at Day granularity, but you’d probably want to aggregate your other dimensions up to a much higher level (for example aggregate from Product to Product Category, Customer to Country and so on) and then generate the forecasts. This would mean in turn you couldn’t just create a new partition in your existing measure group to hold the forecast data, you’d need to create a new measure group. However with some simple MDX you’d be able to integrate the two sets of values and get a similar result to the one I’ve achieved here.
You’d want to automate the training of the data mining model, and this can be done pretty easily in SSIS using the Analysis Services processing task.
You might want to store the set of forecasts you generate each day, so you could compare different forecasts made on different days or compare an old forecast with the actual data that subsequently came in. To do this you’d need to add a new dimension to your Forecast measure group which would be Forecast Date (ie the date the forecast data was generated), and maybe make your forecast measures semi-additive (eg last nonempty) with regard to this new Forecast Date dimension.
As I said, you’d want to spend a lot more time than I did making sure your data mining model gave you good predictions; the aim of this blog entry is to describe how you’d load the forecast data into AS, not make sure the forecasts were accurate! Trying to forecast 1.5 years into the future as I did is probably a bit optimistic; it might be better to limit your predictions to a few days or a month. I used AS2005 here but AS2008 has got a much improved Time Series algorithm you can find out about here. And if you want to learn more about AS data mining your starting point should be www.sqlserverdatamining.com; I see also that there’s a 2008 version of the DM team’s book out soon too – I liked the 2005 version and it was certainly a lot more helpful than BOL for understanding all this.
So overall a fun exercise and one I might work up into a presentation for a user group meeting or something similar. I’d be very interested to hear from anyone who is doing this for real though, to find out how well it works in practice; if I get the chance to implement it with one of my customers I’ll blog about it.
And wouldn’t it be cool if, in the next version of Analysis Services, you could automatically add a Forecast partition to your cube in BIDS and have all this work done for you through a wizard?