Using AS Data Mining to Add Forecast Values to a Cube

It’s ridiculous, really, that OLAP and data mining functionality have co-existed inside Analysis Services for years now, yet they never seem to be used together. I only ever work with the OLAP side of things and just play around with data mining – I’ve never done a serious project with it, unfortunately – and from what I can see the people out there working with Analysis Services data mining don’t tend to use it with cubes; it’s two separate worlds. Anyway, to get to the point: while I was preparing for SQLBits a few weeks ago I came up with a simple idea of how you could use AS data mining to generate forecast data that could then be loaded into a cube, an approach that I think could be easily adapted for use in most real production systems. Now, having spent an evening in a Munich hotel room testing this out, I present my findings.

The data set I’m using here was compiled by Tony Rogerson and put into a dimensional model by Allan Mitchell, and it contains the details of postings to SQL Server newsgroups over the past 8 years. For the purposes of this example I aggregated the data so it contained a single fact table with one measure, the number of postings to all newsgroups per day, joined to just one Time dimension. It’s interesting data, in fact. Looking at the number of postings per year from 2002 to 2007 at a high level, you can see that newsgroup postings hit a peak in 2004 and then went into decline:

[Chart: newsgroup postings per year, 2002–2007]

I guess this is probably a result of the introduction of the MSDN Forums and their growth at the expense of newsgroups. Looking at the day level, you can see an obvious pattern where the number of postings is much higher during the week than at weekends:

[Chart: newsgroup postings per day]

The first thing I did was build a simple cube from this data, with a single sum measure representing the number of postings made and a simple Time dimension with Years, Quarters, Months and Dates. This allowed me to explore the data and create the above graphs in Excel. However, I only had data up to April 29th 2008, and what I wanted to see was a forecast in my cube of postings for the rest of 2008 and all of 2009. Here’s what I did to get that:

  1. I created a new dimension in my cube called Scenario. It was built from a table containing the following two rows:

    1 Actual
    2 Forecast

    This dimension then had a single attribute hierarchy with two members on it, Actual and Forecast.

  2. I then added a new foreign key column to my existing fact table so I could join the new Scenario dimension to it. This column always contained the value 1, because all the data in my existing fact table represented Actual values. (There’s a T-SQL sketch of steps 1 and 2 after this list.)
  3. Next I created a new data mining model using the Microsoft Time Series algorithm, based on my fact table; a rough DMX equivalent is sketched after this list. The Time dimension foreign key column was used as my Key Time column, and the measure representing the number of postings was set to be an Input as well as Predictable.
    [Screenshot: the mining model definition]
    Now, I’m certainly not a data mining expert and I didn’t spend any time tweaking the model, but it seemed to give reasonably good results at the day level, picking up the weekday/weekend variation already noted above:
    [Chart: the model’s predictions at the day level]
    Obviously if you were doing this for real you’d want to spend a bit more time making sure your mining model was giving you good results.
  4. With my mining model processed, I was then able to use the following DMX query to give me a flattened resultset showing the predicted number of postings per day from April 30th 2008 until the end of 2009 – 611 days in all, which is where the second parameter to PredictTimeSeries comes from:
    SELECT FLATTENED
      PredictTimeSeries([Forecasts].[Postings], 611)
    FROM
      [Forecasts]
    One thing I had to do to get this to work nicely was change the ‘meaningful’ surrogate keys on the Time dimension that Allan had provided me with (eg the key for April 30th 2008 was 20080430, May 1st 2008 was 20080501) to meaningless integer surrogate keys, where each new day had a key value one greater than the previous day’s (eg April 30th 2008 was key 3043, May 1st was key 3044, and so on – see the ROW_NUMBER() sketch after this list). This meant that when I ran the above DMX query it returned surrogate key values for each predicted value that I could use with my Time dimension. Before I made this change, the DMX query just added 1 to the date key for each predicted date, giving me nonexistent key values like 20080431 and 20080432 for May 1st and May 2nd.
  5. I then took the results of this query and used SSIS to load them into a second fact table in my relational data source. The SSIS data flow task looked like this:
    [Screenshot: the SSIS data flow]
    Here’s what I did at each step:
    1. Used an OLEDB data source, connected up to Analysis Services, to run the initial DMX prediction query.
    2. Added a derived column with the value 2, as the foreign key for my Scenario dimension, to show that this was all Forecast data.
    3. Used a data conversion transform to convert the two columns returned by the DMX query from bigints to ints.
    4. Used an OLEDB destination to load the data into a new Forecast fact table (with a structure identical to the existing fact table) in SQL Server.
  6. Finally I went back to my cube and added a new partition to my existing measure group, pointing at the new Forecast fact table, and reprocessed. This meant I now had data for the Actual member on my Scenario dimension up to April 29th 2008, and data for the Forecast member from April 30th 2008 until the end of 2009. And voilà – forecast data loaded into my cube at Day level, which is then of course easy to explore, see aggregated up to Month, Quarter and Year level, and so on. Here’s what the data looked like in an Excel pivot chart:
    [Excel pivot chart: Actual and Forecast postings at Day level]
    And of course if I didn’t include the Scenario dimension in my query I could look at Actual and Forecast values together, aggregated up to, say, the Month level:
    [Excel pivot chart: Actual and Forecast postings at Month level]
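
To make steps 1 and 2 concrete, here’s a minimal T-SQL sketch of the relational changes. All of the table and column names (DimScenario, FactPostings, ScenarioKey) are my own inventions, so adapt them to your schema:

    -- Hypothetical names throughout; adapt to your own schema
    CREATE TABLE dbo.DimScenario (
        ScenarioKey int NOT NULL PRIMARY KEY,
        ScenarioName varchar(50) NOT NULL
    );

    INSERT INTO dbo.DimScenario (ScenarioKey, ScenarioName) VALUES (1, 'Actual');
    INSERT INTO dbo.DimScenario (ScenarioKey, ScenarioName) VALUES (2, 'Forecast');

    -- The new foreign key column on the existing fact table defaults to 1,
    -- marking every existing row as Actual data
    ALTER TABLE dbo.FactPostings
        ADD ScenarioKey int NOT NULL CONSTRAINT DF_FactPostings_Scenario DEFAULT 1;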
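
I built the mining model in step 3 in BIDS rather than in code, but a roughly equivalent DMX definition would look something like this. This is a sketch only: the [Forecasts] model and [Postings] column names match the DMX query in step 4, while [Time Key] is an assumed name for the Key Time column:

    -- A sketch of the model definition; [Time Key] is a hypothetical column name
    CREATE MINING MODEL [Forecasts] (
        [Time Key] LONG KEY TIME,
        [Postings] LONG CONTINUOUS PREDICT
    ) USING Microsoft_Time_Series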
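
And here’s one way you could generate the meaningless, contiguous integer surrogate keys described in step 4, using ROW_NUMBER() over the original yyyymmdd-style keys (again, hypothetical table and column names):

    -- Each date gets a key exactly one greater than the previous day's,
    -- ordered by the original yyyymmdd-style key
    SELECT
        ROW_NUMBER() OVER (ORDER BY TimeKey) AS NewTimeKey,
        TimeKey AS OldTimeKey
    FROM dbo.DimTime;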

Some thoughts on how you’d implement this in a real production system:

  • You’re not going to want to (or probably be able to) use the Time Series algorithm at the granularity of your fact table. It might make sense to use it at Day granularity, but you’d probably want to aggregate your other dimensions up to a much higher level (for example, aggregate from Product to Product Category, from Customer to Country and so on) and then generate the forecasts. In turn, this would mean you couldn’t just create a new partition in your existing measure group to hold the forecast data; you’d need to create a new measure group. However, with some simple MDX you’d be able to integrate the two sets of values and get a similar result to the one I’ve achieved here – see the sketch after this list.
  • You’d want to automate the training of the data mining model, and this can be done pretty easily in SSIS using the Analysis Services processing task.
  • You might want to store the set of forecasts you generate each day, so you could compare different forecasts made on different days, or compare an old forecast with the actual data that subsequently came in. To do this you’d need to add a new dimension to your Forecast measure group, Forecast Date (ie the date the forecast data was generated), and maybe make your forecast measures semi-additive (eg LastNonEmpty) with regard to it.
  • As I said, you’d want to spend a lot more time than I did making sure your data mining model gave you good predictions; the aim of this blog entry is to describe how you’d load forecast data into AS, not to make sure the forecasts were accurate! Trying to forecast 1.5 years into the future, as I did, is probably a bit optimistic; it might be better to limit your predictions to a few days or a month ahead. I used AS2005 here, but AS2008 has a much improved Time Series algorithm you can find out about here. And if you want to learn more about AS data mining your starting point should be www.sqlserverdatamining.com; I also see that there’s a 2008 version of the DM team’s book out soon – I liked the 2005 version, and it was certainly a lot more helpful than BOL for understanding all this.
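
To illustrate the MDX mentioned in the first bullet above, here’s a sketch of a calculated measure that stitches the two measure groups together. The measure names Measures.[Postings] (the actuals) and Measures.[Postings Forecast] (the forecast measure group) are assumptions:

    -- Hypothetical measure names: show the forecast value wherever no actual value exists
    CREATE MEMBER CURRENTCUBE.Measures.[Postings Actual Or Forecast] AS
        IIF(
            ISEMPTY(Measures.[Postings]),
            Measures.[Postings Forecast],
            Measures.[Postings]
        );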

So overall a fun exercise and one I might work up into a presentation for a user group meeting or something similar. I’d be very interested to hear from anyone who is doing this for real though, to find out how well it works in practice; if I get the chance to implement it with one of my customers I’ll blog about it.

And wouldn’t it be cool if, in the next version of Analysis Services, you could automatically add a Forecast partition to your cube in BIDS and have all this work done for you through a wizard?

11 thoughts on “Using AS Data Mining to Add Forecast Values to a Cube”

  1. Great post Chris. I like the functionality of sending the forecast records to the fact table by using a DMX query. Doing DM on SSAS cubes is slower than running it on a view or table, at least in SSAS 2005, and the DM structure is also a little harder to understand than the relational version. Market basket analysis on SSAS cubes with MDX is more difficult and does not achieve the same visual results as running association rules on the same cube. It is important to know more about this technology and these methods.

  2. Great article – we’re always getting questions about OLAP/Data Mining integration – now I can send them your way! 🙂
    I have another recommendation for getting more granular forecasts using SQL Server 2008. Using the Time Series enhancements in SQL Server 2008 you can create a Time Series model using one set of data and then apply it, with a PREDICTION JOIN, to forecast from another set of data. This allows you to create a model at, say, the Category or Subcategory level, where the data patterns are more stable, and then use that model against the individual Product-level data to get forecast amounts for each product.

  3. Hi Chris, thanks for this useful article! I totally agree that the DM world is still not as tightly integrated as it should be – actually it is integrated at the application level, but not at the data model level. That’s why I’d like to mention a little case we had to face in integrating DM into the OLTP and OLAP worlds: a customer needed to show each customer’s profile in their OLTP application, displaying the cluster node it belongs to according to a clustering DM model and the probability of buying other products according to a decision tree model. They also wanted to see DM model metadata, such as the model creation date, because the models were frequently retrained. Besides that, they wanted to see both values as Customer dimension attributes when doing BI reports.
    So we decided to integrate at the relational level in two different ways: first by creating a CLR function within the SQL Server engine that retrieved prediction data and metadata from the DM model, and second by firing prediction queries through a linked server to the processed DM model. Doing it at the relational level resulted in a more unified solution, as we did not have to worry about it at the application, OLAP or BI level, but I think that due to performance issues it only applies to databases with a small number of records.
    Given that relational, OLAP and DM answer different problems, the user sometimes needs to see them as a single informational unit. I think integration should be native from the lower data levels, so as to achieve a more consistent and less redundant architecture; integrating at the top level of the architecture could lead to more integration issues if components do not provide data as a service.

  4. Really a good one, Chris… I see you haven’t ventured into this enough, but I have the job of predicting sales (revenue, cost, margin & units) for different sales channels, segments, countries and a whole bunch of product types at a weekly level. Would the SSAS Time Series DM model help me do forecasting at this level of complexity? Or should I turn to my developers in SAS (who would take a lot of time going back and forth with BRDs) to do this? Also, I’m unable to find anything on building models with multiple input Xs and dependent Ys for forecasting in SSAS (2008/2012). I hope you may have found something since the time this blog was posted. Please do let me know.

  5. Chris: When you created your mining model, what dimension did you use as your source? When creating a model you are prompted for a dimension – which did you use? And did you base your model on an existing database or an existing cube?
