Despite the worldwide financial meltdown, all talk on the MS BI blogosphere has been about Project Gemini this week. Even though no-one knows much about it – and certainly no-one knows everything about it either, as we’ve been told there are more announcements to come – the reaction has been pretty clear: the back-end technology looks cool, but the strategy of putting Gemini in Excel on the desktop is an invitation to bypass the data warehouse and create spreadmart hell, whatever MS says about Sharepoint-based management of Gemini models. I’ve already linked to just about everyone who’s expressed this opinion apart from Mick Horne, whose three posts here, here and here are well worth reading as they put the case against Gemini in its currently proposed form very clearly. Choice quote: "Where is the single version of the truth in this architecture? I’ve just spent 4 years of my life trying to convince users to stop using Excel as a data store and here are Microsoft positively encouraging it. Hell will freeze over before this capability is used responsibly in most organisations".
So what would I like to see MS actually do with Gemini? From the comments on Marco’s blog, at the moment it sounds like the priority is to get the Gemini storage mode working for local cubes and Excel rather than put everything in place for the full server version of Analysis Services. I would prefer the emphasis to be reversed and have Gemini storage mode ready as soon as possible on the server side. It’s clearly going to provide a massive performance boost when it does arrive (and let’s not forget it’s at least two years away), and with other COP databases and data warehouse appliances improving every day there’s going to be significantly more competitive pressure on Analysis Services and SQL Server-based data warehouse projects in the future. I’d hate to see current MS BI shops start abandoning their Analysis Services implementations because they can get better query performance and scalability elsewhere.
But whatever we bloggers say I’m sure we’ll get Gemini in the form we’ve already been shown. The reason is that, whatever the rights and wrongs of it from a BI consultant’s point of view, the people who use Excel will definitely want this and so there will be an overwhelming commercial case for it. This kind of desktop, DIY BI is in a way similar to illegal drugs: there are always some people that want it, and a certain number of them are always going to do it even though they know they shouldn’t, so you’ve got two choices – either legalise it and then hope to control it, as with Gemini, or throw all your efforts into outlawing it. With the first option you run the risk of encouraging what you wanted to discourage and ending up with a worse problem; with the second option you run the risk of people resenting your rules so much that they end up being widely ignored. What’s the best option?
UPDATE: one last link for you – Nigel Pendse of the OLAP Report gives an unreservedly positive review here:
I’ve now had a chance to watch the demos and read all the first-hand accounts of what was announced yesterday (see Marco, Mosha, Tim Kent, Jeremy Kashel, Richard Tkachuk). Here are some unstructured thoughts and questions:
- As per my comment yesterday about Qlikview, self-service BI is undoubtedly what many end users want – but as BI professionals we know only too well that it can be dangerous; in fact just about every blog entry I’ve read on Gemini has made this point. The question of whether it’s a good idea to let your power users do what Gemini lets them do is likely to cause all kinds of heated religious dispute: I was involved in an argument about this at SQLBits recently, and over on Brent Ozar’s blog (see here and here) you can see the same discussion being had. Although I completely understand the scenario that MS describes in its demos of users needing to work with data that isn’t and will never be in a data warehouse, I lean slightly to the side of those who see self-service BI vendors as selling "snake oil". But being a BI consultant I would, wouldn’t I? All this talk of Gemini representing the ‘constellation of twins’, power users and the IT department working together happily, is something of a fairy tale…
- In a comment on my blog yesterday, Mosha stated that there was no cube wizard needed for Gemini. But looking at the demo there’s certainly a step where you connect to data sources and choose the tables and fields you want to work with, so whether or not you call it a cube wizard in the strictest sense, you still need to have some understanding of your data before you can do anything with it. And whatever the demo says, the application you’re using can only take you part of the way; there’s no way a model can be 100% inferred. What happens if fields that mean the same thing have two different names in two different data sources, or if there are two fields which mean different things but have the same name? And, even for many power users, the question of what a table or a join or even a database actually is will still need some explanation.
- While we’re at it – and I know this is a bit of a tangent – expecting power users to understand basic technical concepts is one thing but in many cases (as this excellent blog entry points out) "people have no way of knowing which questions are meaningful ones to ask, and which are meaningless". Not that I’m saying your average BI consultant/IT guy has a better idea either, far from it.
- I was pleased to see mention of data cleaning functionality in the Gemini addin. Is this coming from Zoomix?
- Certainly the Gemini pivot table demo was very impressive. Is this what pivot tables will look like in Office.Next? If so, are we going to see Excel finally grow up to being a full-featured AS client tool for power users in the same way Proclarity Desktop was?
- Moving on, on one hand we’ve got Project Madison, which gives us, in SQL Server, the ability to query vast amounts of data very quickly. Since this is in SQL Server, I would expect to be able to use AS in ROLAP mode on top. On the other hand we have Project Gemini, which will give us a super-fast in-memory storage mode for AS but for slightly smaller data volumes. Where do the two meet? Will we be able to create a HOLAP-like solution where your raw data stays in SQL Server/Madison and you can create Gemini-mode aggregations? And can you persist the data in Gemini to disk easily, in case of hardware failure? How long does it take to load data into Gemini?
- Apart from Qlikview, the other product being mentioned in the same breath as Gemini is TM1, which is of course primarily used for financial apps. So what will the benefits of Gemini be for PerformancePoint and home-grown AS financial cubes? Not only faster storage engine queries, but also faster calculations (although I know only too well that sometimes you can have poor query performance due to calculations even on a warm storage engine cache, even in AS2008). And will you be able to do writeback on a Gemini partition? Now that would be a major performance benefit.
- Having said that the need to be able to write MDX will keep people like me in a job, it’s worth noting that it should indeed be possible to make it easy to write many MDX calculations in Excel. Indeed, one of the cool features of the Intelligencia Query MDX generator is precisely this: the ability to turn spreadsheet style formulas into MDX calculations. And yes, Andrew is in the process of getting this functionality patented.
- I love the idea of Gemini being AS, but I can imagine that some more relationally orientated people would want the ability to query this new data store with SQL. Of course AS actually can be queried with SQL but it’s a very limited subset; it would be great to see tighter integration between AS and the relational engine (along the lines of Oracle’s new cube-based materialised views) so the performance gains that AS gives you can be made available to the relational engine.
- Which thought in turn leads onto whether Madison style MPP can be applied to the Analysis Services engine itself (as I wondered here), either directly or if AS was more tightly integrated with the relational engine. So many permutations of these technologies are possible…
- As with PerformancePoint and Excel Services, there seems to be yet another dependency on Sharepoint here for the management of Gemini models. Of course some central repository is necessary and it makes sense to use Sharepoint rather than reinvent the wheel, but as Microsoft Watch points out this cross-dependency helps MS sell more licenses. And as anyone who has tried to sell a MS BI solution will tell you, selling more server products can be a problem – it’s not necessarily the licence cost but the perfectly valid "we don’t use Sharepoint here, we use X and we don’t want to have to support Sharepoint just for this" response that has to be overcome. I think this issue part-explains why I’ve seen so little use of Excel Services with Analysis Services in my work when it seems such a compelling proposition for almost all companies.
- Lastly, given the current financial crisis, something tells me that when the first CTPs of all this appear next year consultants like me will have plenty of free time to test it out. I know pundits out there are saying that the BI industry will weather any recession because companies will want to compete on information, but I’m sceptical – in my experience most companies don’t make rational decisions in circumstances like these (is that heresy coming from a BI consultant?), they just cut budgets and fire staff without thinking much. And IT consultants, perceived as a cost and of lesser importance to the health of the business than things like, say, the CEO’s bonus, always feel the pain first. Ho hum.
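Going back to the spreadsheet-formulas-to-MDX point above: as a toy sketch of the general idea only – this is emphatically not how Intelligencia Query does it, and every cell reference, measure name and member key below is invented – a translator can simply map cell references to the MDX tuple expressions they display and rewrite the formula text:

```python
import re

# Toy sketch only: every name here is invented, and real products like
# Intelligencia Query are far more sophisticated than this.
# Map spreadsheet cell references to the MDX tuples they display.
CELL_TO_TUPLE = {
    "B2": "([Measures].[Sales Amount], [Date].[Calendar Year].&[2008])",
    "C2": "([Measures].[Order Quantity], [Date].[Calendar Year].&[2008])",
}

def formula_to_mdx(member_name, formula):
    """Rewrite a formula like '=B2/C2' as an MDX calculated member."""
    body = formula.lstrip("=")
    for cell, tuple_expr in CELL_TO_TUPLE.items():
        # Word boundaries stop 'B2' matching inside e.g. 'B22'
        body = re.sub(r"\b" + cell + r"\b", tuple_expr, body)
    return "WITH MEMBER [Measures].[%s] AS %s" % (member_name, body)

print(formula_to_mdx("Avg Price", "=B2/C2"))
```

A real implementation would of course have to parse the formula properly rather than use regular expressions, and would derive the cell-to-tuple mapping from the layout of the report itself.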
Ah, October 6th at last – the date when I was promised All Would Be Revealed. I’d been hearing rumours of something very new and exciting in the world of Microsoft BI for a while but never had any details (they probably reasoned that telling an inveterate blogger like me something top secret would be asking for trouble, but honestly I can keep my mouth shut when I need to); Mosha and Marco both mentioned it recently but didn’t give anything away either.
Anyway, to coincide with the keynote at the BI Conference, more details have shown up on the web:
Here’s what I gather:
- Kilimanjaro is the code name for the next release of SQL Server, due 2010
- Project Madison is the code name for what’s being done with DATAllegro
- Project Gemini is the new, exciting thing: an in-memory storage mode for Analysis Services. To quote Tom Casey in the Intelligent Enterprise article:
"It’s essentially another storage mode for Microsoft SQL Server Analysis Services with access via MDX, so existing applications will be able to take advantage of the performance enhancements."
But it’s clearly more than that – from the Forrester blog entry above:
"Its Gemini tool (to be available for beta testing sometime in 2009 and general availability in 2010) will not only enable power users to build their own models and BI applications, but easily make them available to power users, almost completely taking IT out of the loop. In Gemini, the in-memory, on the fly modeling will be done via a familiar Excel interface. Once a new model and an application is built in Excel, a power user can then publish the application to Sharepoint, making it instantly available to casual users. Not only that, but the act of publishing the model to Sharepoint also creates a SQLServer Analysis Services cube, which can be instantaneously accessed by any other BI, even non Microsoft, tool"
So, self-service cube design and in-memory capabilities. Sounds very, very reminiscent of Qlikview and other similar tools; and given that Qlikview is by all accounts growing rapidly, it’s an obvious market for MS to get into. I guess what will happen is that end users will get a kind of turbo-charged version of the cube wizard where they choose some tables containing the data they want to work with, and it builds a cube that works in ROLAP-ish mode on top of this new in-memory data store. We’ll also get even better query performance too (from COP? pointer-based? data structures).
All in all, super-exciting, and despite all the hype about end-user empowerment I’m sure there’ll be even more opportunity for the likes of me to earn consultancy fees off this doing MDX work, tuning etc. But the point about end-user empowerment brings me back to Qlikview: I’ve never seen it, but it’s interesting because I’ve heard some very positive reports about it and some very negative ones too. From what I can make out it is very fast and easy to use, and has some great visualisation capabilities, but I’ve also heard it’s very limited in terms of the calculations you can do (at least compared to MDX); I’ve also heard that it’s marketed on the basis that you don’t need a data warehouse to use it – which perhaps explains some of its popularity, but also explains more of the negative comments that it’s had, because of course if you don’t build a data warehouse you’re going to run into all kinds of data quality and data integration issues. Perhaps this last point explains why Qlikview does so appallingly in the BI Survey’s rankings of how well products perform in a competitive evaluation. So something to be wary of if you’re giving tools to end users…
Anyway, if you’re at the BI Conference and have any more details or thoughts on this, please leave a comment!
A couple of interesting (and possibly BI-related) technologies are coming soon from Microsoft:
- I see today via Nick Carr that soon we’ll be able to run Windows and SQL Server on Amazon Elastic Compute Cloud (EC2). I wonder if that includes Analysis Services too? If so, that would be handy.
- Also announced today, the new MS cloud operating system, coming within the month
- Windows High Performance Computing (HPC) server is due to launch at the beginning of November (see here for a news report on it, here and here for some details). Hmm, I see SQL Server is listed on the ‘supported applications’ page… surely there’s got to be some kind of tie-in here with the whole MatrixDB/DATAllegro MPP stuff?
It’s ridiculous, really, that OLAP and data mining functionality have co-existed inside Analysis Services for years now yet they never seem to be used together. I only ever work with the OLAP side of things and just play around with data mining – I’ve never done a serious project with it, unfortunately – and from what I can see the people out there working with Analysis Services data mining don’t tend to use it with cubes; it’s two separate worlds. Anyway, to get to the point, while I was preparing for SQLBits a few weeks ago I came up with a simple idea of how you could use AS data mining to generate forecast data which could then be loaded into a cube, an approach that I think could be easily adapted for use in most real production systems. Now, having spent an evening in a Munich hotel room testing this out, I present my findings.
The data set I’m using here was compiled by Tony Rogerson and put into a dimensional model by Allan Mitchell, and it contains the details of postings to SQL Server newsgroups over the past 8 years. For the purposes of this example I aggregated the data so it contained a single fact table with one measure, the number of postings to all newsgroups per day, which joined to just one Time dimension. It’s interesting data, in fact. Looking at a high level at the number of postings per year from 2002 to 2007, you can see that newsgroup postings hit a peak in 2004 and then went into decline:
I guess this is probably a result of the introduction of the MSDN Forums and their growth at the expense of newsgroups. Looking at the day level, you can see an obvious pattern where the number of postings is much higher during the week than at weekends:
The first thing I did was build a simple cube from this data, with a single sum measure representing the number of postings made and a simple Time dimension with Years, Quarters, Months and Dates. This allowed me to explore the data and create the above graphs in Excel. However I only had data up to April 29th 2008, and what I wanted to see was a forecast in my cube of postings for the rest of 2008 and 2009. Here’s what I did to get that:
- I created a new dimension in my cube called Scenario. It was built from a table containing the following two rows:
1  Actual
2  Forecast
This dimension then had a single attribute hierarchy with two members on it, Actual and Forecast.
- I then added a new foreign key column to my existing fact table so I could join the new Scenario dimension to it. This column always contained the value 1 because all the data in my fact table only represented Actual values.
- Next I created a new data mining model using the Microsoft Time Series algorithm based off my fact table. The Time dimension foreign key column was used as my Key Time column and the measure representing the number of postings was set to be an Input as well as Predictable.
Now I’m certainly not a data mining expert and I didn’t spend any time tweaking the model, but it seemed to give reasonably good results at the day level picking up the weekday/weekend variation already noted above:
Obviously if you were doing this for real you’d want to spend a bit more time making sure your mining model was giving you good results.
- With my mining model processed, I was then able to use the following DMX query to give me a flattened resultset showing the predicted number of postings per day from April 30th 2008 until the end of 2009:
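The query itself appeared as a screenshot, so here it is sketched rather than reproduced: the model and column names below are my guesses, not the ones actually used, but a Microsoft Time Series prediction query of this kind boils down to a PredictTimeSeries call, built here as a Python string:

```python
# All names here are hypothetical - the original query was shown as an
# image and the real model/column names are unknown. PredictTimeSeries
# steps a Microsoft Time Series model forward n steps beyond the
# training data; FLATTENED turns the nested resultset into a plain rowset.
steps = 611  # April 30th 2008 through December 31st 2009, inclusive

dmx = (
    f"SELECT FLATTENED PredictTimeSeries([Posting Count], {steps}) "
    f"FROM [Postings Forecast Model]"
)
print(dmx)
```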
One thing I found was that, in order to get this to work nicely, I had to change the ‘meaningful’ surrogate keys on the Time dimension that Allan had provided me with (eg the key for April 30th 2008 was 20080430, May 1st 2008 was 20080501) to meaningless integer surrogate keys where each new day had a key value one greater than the previous day (eg April 30th 2008 was key 3043, May 1st was key 3044 etc). This meant that when I ran the above DMX query it returned surrogate key values for each predicted value that I could use with my time dimension – before I did this I found the DMX query just added 1 to the date key for each predicted date, giving me incorrect key values like 20080430, 20080431, 20080432 and so on for April 30th, May 1st, May 2nd and so on.
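To make the problem concrete, here’s a small Python sketch of the behaviour just described (the 3043 starting key comes from the example above; everything else is just illustration):

```python
from datetime import date, timedelta

# The Time Series algorithm effectively just adds 1 to the Key Time
# value for each predicted step, so 'meaningful' yyyymmdd keys produce
# non-existent dates at month ends, while sequential keys stay valid.
def yyyymmdd_key(d):
    return d.year * 10000 + d.month * 100 + d.day

start = date(2008, 4, 30)
days = [start + timedelta(days=i) for i in range(3)]

predicted_yyyymmdd = [yyyymmdd_key(start) + i for i in range(3)]
actual_yyyymmdd = [yyyymmdd_key(d) for d in days]
print(predicted_yyyymmdd)  # 20080431 and 20080432 are not real dates
print(actual_yyyymmdd)

# With meaningless sequential keys, incrementing always lands on a real day
seq_key = {d: 3043 + i for i, d in enumerate(days)}
predicted_seq = [seq_key[start] + i for i in range(3)]
print(predicted_seq == [seq_key[d] for d in days])  # True
```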
- I then took the results of this query and used SSIS to load them into a second fact table in my relational data source. The SSIS data flow task looked like this:
Here’s what I did at each step:
- Used an OLEDB data source, connected up to Analysis Services, to run the initial DMX prediction query
- Added a derived column with the value 2, as the foreign key for my Scenario dimension, to show that this was all Forecast data.
- Used a data conversion transform to convert the two columns returned by the DMX query from bigints to ints.
- Used an OLEDB destination to load the data into a new Forecast fact table (with a structure identical to the existing fact table) in SQL Server.
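The same four steps can be mimicked in a few lines of Python to make the logic explicit (the sample rows are invented; in the real package they come from the OLEDB source running the prediction query):

```python
# Invented sample rows standing in for the DMX source output:
# (time surrogate key, predicted postings), both bigints in SSIS terms.
dmx_rows = [(3043, 412), (3044, 398), (3045, 125)]

FORECAST_SCENARIO_KEY = 2  # the Forecast member of the Scenario dimension

fact_rows = []
for time_key, postings in dmx_rows:
    # Data conversion transform: bigint -> int, which would fail on
    # values outside the 32-bit range
    assert -2**31 <= time_key < 2**31 and -2**31 <= postings < 2**31
    # Derived column transform: tag every row as Forecast
    fact_rows.append((int(time_key), FORECAST_SCENARIO_KEY, int(postings)))

# OLEDB destination: these rows would be inserted into the new
# Forecast fact table in SQL Server
print(fact_rows)
```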
And of course if I didn’t include the Scenario dimension in my query I could look at Actual and Forecast values aggregated up to, say, the Month level:
Some thoughts on how you’d implement this in a real production system:
- You’re not going to want to (or probably be able to) use the Time Series algorithm at the granularity of your fact table. It might make sense to use it at Day granularity, but you’d probably want to aggregate your other dimensions up to a much higher level (for example aggregate from Product to Product Category, Customer to Country and so on) and then generate the forecasts. This would mean in turn you couldn’t just create a new partition in your existing measure group to hold the forecast data, you’d need to create a new measure group. However with some simple MDX you’d be able to integrate the two sets of values and get a similar result to the one I’ve achieved here.
- You’d want to automate the training of the data mining model, and this can be done pretty easily in SSIS using the Analysis Services processing task.
- You might want to store the set of forecasts you generate each day, so you could compare different forecasts made on different days or compare an old forecast with the actual data that subsequently came in. To do this you’d need to add a new dimension to your Forecast measure group which would be Forecast Date (ie the date the forecast data was generated), and maybe make your forecast measures semi-additive (eg last nonempty) with regard to this new Forecast Date dimension.
- As I said, you’d want to spend a lot more time than I did making sure your data mining model gave you good predictions; the aim of this blog entry is to describe how you’d load the forecast data into AS, not make sure the forecasts were accurate! Trying to forecast 1.5 years into the future as I did is probably a bit optimistic; it might be better to limit your predictions to a few days or a month. I used AS2005 here but AS2008 has got a much improved Time Series algorithm you can find out about here. And if you want to learn more about AS data mining your starting point should be www.sqlserverdatamining.com; I see also that there’s a 2008 version of the DM team’s book out soon too – I liked the 2005 version and it was certainly a lot more helpful than BOL for understanding all this.
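On the first bullet’s point about integrating the two measure groups with some simple MDX: the logic is just "use the actual value where one exists, otherwise fall back to the forecast" – in MDX, something along the lines of IIF(ISEMPTY([Measures].[Postings]), [Measures].[Forecast Postings], [Measures].[Postings]). A minimal Python sketch of that fallback, with invented monthly numbers:

```python
# Invented monthly totals: actuals stop part-way through the year,
# forecasts overlap and continue beyond them.
actual = {"2008-03": 9500, "2008-04": 9100}
forecast = {"2008-04": 9300, "2008-05": 8800, "2008-06": 8600}

# Prefer the actual value; fall back to the forecast where it's empty -
# the same decision the IIF/ISEMPTY calculated measure would make.
months = sorted(set(actual) | set(forecast))
combined = {m: actual.get(m, forecast.get(m)) for m in months}
print(combined)
```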
So overall a fun exercise and one I might work up into a presentation for a user group meeting or something similar. I’d be very interested to hear from anyone who is doing this for real though, to find out how well it works in practice; if I get the chance to implement it with one of my customers I’ll blog about it.
And wouldn’t it be cool if, in the next version of Analysis Services, you could automatically add a Forecast partition to your cube in BIDS and have all this work done for you through a wizard?