It’s ridiculous, really, that OLAP and data mining functionality have co-existed inside Analysis Services for years now yet they never seem to be used together. I only ever work with the OLAP side of things and just play around with data mining – I’ve never done a serious project with it, unfortunately – and from what I can see the people out there working with Analysis Services data mining don’t tend to use it with cubes; it’s two separate worlds. Anyway, to get to the point, while I was preparing for SQLBits a few weeks ago I came up with a simple idea of how you could use AS data mining to generate forecast data which could then be loaded into a cube, an approach that I think could be easily adapted for use in most real production systems. Now, having spent an evening in a Munich hotel room testing this out, I present my findings.
The data set I’m using here was compiled by Tony Rogerson and put into a dimensional model by Allan Mitchell, and it contains the details of postings to SQL Server newsgroups over the past 8 years. For the purposes of this example I aggregated the data so it contained a single fact table with one measure, the number of postings to all newsgroups per day, which joined to just one Time dimension. It’s interesting data in fact. Looking at the high level at the number of postings per year from 2002 to 2007, you can see that newsgroup postings hit a peak in 2004 and then went into decline:
I guess this is probably a result of the introduction of the MSDN Forums and their growth at the expense of newsgroups. Looking at the day level, you can see an obvious pattern where the number of postings is much higher during the week than at weekends:
The first thing I did was build a simple cube from this data, with a single sum measure representing the number of postings made and a simple Time dimension with Years, Quarters, Months and Dates. This allowed me to explore the data and create the above graphs in Excel. However I only had data up to April 29th 2008, and what I wanted to see was a forecast in my cube of postings for the rest of 2008 and 2009. Here’s what I did to get that:
- I created a new dimension in my cube called Scenario. It was built from a table containing the following two rows:
1 Actual 2 Forecast
This dimension then had a single attribute hierarchy with two members on it, Actual and Forecast.
- I then added a new foreign key column to my existing fact table so I could join the new Scenario dimension to it. This column always contained the value 1 because all the data in my fact table only represented Actual values.
- Next I created a new data mining model using the Microsoft Time Series algorithm based off my fact table. The Time dimension foreign key column was used as my Key Time column and the measure representing the number of postings was set to be an Input as well as Predictable.
Now I’m certainly not a data mining expert and I didn’t spend any time tweaking the model, but it seemed to give reasonably good results at the day level picking up the weekday/weekend variation already noted above:
Obviously if you were doing this for real you’d want to spend a bit more time making sure your mining model was giving you good results.
- With my mining model processed, I was then able to use the following DMX query to give me a flattened resultset showing the predicted number of postings per day from April 30th 2008 until the end of 2009:
One thing I found in order to get this to work nicely was that I had to change the ‘meaningful’ surrogate keys on the Time dimension that Allan had provided me with (eg the key for April 30th 2008 was 20080430, May 1st 2008 was 20080501) to meaningless integer surrogate keys where each new day had a key value one greater than the previous day (eg April 30th 2008 was key 3043, May 1st was key 3044 etc). This meant that when I ran the above DMX query it returned surrogate key values for each predicted value that I could use with my time dimension – before I did this I found the DMX query just added 1 to the date key for each predicted date, giving me incorrect key values like 20080430, 20080431, 20080432 and so on for April 30th, May 1st, May 2nd and so on.
- I then took the results of this query and used SSIS to load them into a second fact table in my relational data source. The SSIS data flow task looked like this:
Here’s what I did at each step:
- Used an OLEDB data source, connected up to Analysis Services, to run the initial DMX prediction query
- Added a derived column with the value 2, as the foreign key for my Scenario dimension, to show that this was all Forecast data.
- Used a data conversion transform to convert the two columns returned by the DMX query from bigints to ints.
- Used an OLEDB destination to load the data into a new Forecast fact table (with a structure identical to the existing fact table) in SQL Server.
And of course if I didn’t include the Scenario dimension in my query I could look at Actual and Forecast values aggregated up to say the Month level:
Some thoughts on how you’d implement this in a real production system:
- You’re not going to want to (or probably be able to) use the Time Series algorithm at the granularity of your fact table. It might make sense to use it at Day granularity, but you’d probably want to aggregate your other dimensions up to a much higher level (for example aggregate from Product to Product Category, Customer to Country and so on) and then generate the forecasts. This would mean in turn you couldn’t just create a new partition in your existing measure group to hold the forecast data, you’d need to create a new measure group. However with some simple MDX you’d be able to integrate the two sets of values and get a similar result to the one I’ve achieved here.
- You’d want to automate the training of the data mining model, and this can be done pretty easily in SSIS using the Analysis Services processing task.
- You might want to store the set of forecasts you generate each day, so you could compare different forecasts made on different days or compare an old forecast with the actual data that subsequently came in. To do this you’d need to add a new dimension to your Forecast measure group which would be Forecast Date (ie the date the forecast data was generated), and maybe make your forecast measures semi-additive (eg last nonempty) with regard to this new Forecast Date dimension.
- As I said, you’d want to spend a lot more time than I did making sure your data mining model gave you good predictions; the aim of this blog entry is to describe how you’d load the forecast data into AS, not make sure the forecasts were accurate! Trying to forecast 1.5 years into the future as I did is probably a bit optimistic; it might be better to limit your predictions to a few days or a month. I used AS2005 here but AS2008 has got a much improved Time Series algorithm you can find out about here. And if you want to learn more about AS data mining your starting point should be www.sqlserverdatamining.com; I see also that there’s a 2008 version of the DM team’s book out soon too – I liked the 2005 version and it was certainly a lot more helpful than BOL for understanding all this.
So overall a fun exercise and one I might work up into a presentation for a user group meeting or something similar. I’d be very interested to hear from anyone who is doing this for real though, to find out how well it works in practice; if I get the chance to implement it with one of my customers I’ll blog about it.
And wouldn’t it be cool if, in the next version of Analysis Services, you could automatically add a Forecast partition to your cube in BIDS and have all this work done for you through a wizard?