COP Databases

COP (Column Oriented Processing) databases seem to have been making something of a comeback recently. You may be wondering what a COP database is; the best short summary is from the OLAP Report’s glossary:

This is a class of database that is quite different to, but is nevertheless sometimes confused with, OLAP. COP databases perform high speed row counting and filtering at a very detailed level, using column-based structures, rather than rows, tables or cubes. COPs analyze detail and then occasionally aggregate the results, whereas OLAPs are largely used to report and analyze aggregated results with occasional drilling down to detail.

Although less well known and recognized than OLAP, COP databases have also been in use for more than 30 years (for example, TAXIR), which means that, just like OLAP, they predate relational databases. COP products are used for very different applications to OLAP and examples include Alterian, Sand Nucleus, smartFOCUS, Sybase IQ, Synera, etc.

As Nigel says, they have been around for ages but have never seemed to be as popular as OLAP. I have no idea why this is, because the technology is fundamentally good – my very first project using OLAP Services back in the SQL 7 days was an attempt to replace a proprietary COP database, one that was still desktop-based and not even 32-bit code, and we still struggled to match its performance on some queries. Maybe, for it to take off, it just needed a big software company to buy into the sector in the way that Oracle and Microsoft did with OLAP. There are certainly some big companies using it though – for example Tesco, the biggest supermarket chain in the UK:

Anyway, the prompt for this blog entry was several people asking me about COP databases over the last few months and then, this morning, reading this entry on Shawn Rogers’ blog about a new COP product called Paraccel:

It got me thinking… I wonder if Microsoft should think about buying or developing a COP database? I have no idea whether it makes sense or not, but you could integrate it as another engine within Analysis Services and perhaps even create a hybrid of OLAP and COP. People often complain about how bad AS is at querying transaction-level data, and drillthrough at the moment is very difficult to get good performance from; I wonder if a COP would help here. Certainly Paraccel’s AMIGO feature where it can synch with an existing relational database sounds very much like processing a cube (only faster); give it the ability to be queried in MDX and think of the fun… and in the long run, maybe all this stuff should have closer integration with the relational database, as Oracle are doing.

SQLBits Podcast

While at SQLBits, Dave McMahon of the Next Generation User Group recorded a podcast with various people there, including me. It’s now available to download here:

Can I state for the record that I did do a presentation for them earlier this year, in Oxford!

Craig Murphy has a lot of other SQLBits-related podcasts and videos on his blog too, including a picture of the top of my head taken while I was recording the above podcast:

Why does AS query my ROLAP dimension? Or, another good reason to use Unique Names

One of my particular hobby horses is always using full unique names when writing MDX, rather than just referring to the name of the member. Consider the following query in Adventure Works:

select {[Measures].[Internet Sales Amount]} on 0,
{[Source Currency].[Source Currency].&[United Kingdom Pound]}
on 1
from [adventure works]

OK it doesn’t make much sense in itself, but we know what it is doing. Now take a look at this query:

select {[Measures].[Internet Sales Amount]} on 0,
{[United Kingdom Pound]}
on 1
from [adventure works]

It returns the same results, but it’s bad for two reasons. First, the reason I always knew about: if there’s another member with the same name (as there is in Adventure Works, where there is more than one Currency dimension) then you can’t be sure which member the query will return. For example, you might have a Region on your Geography dimension called ‘West’ and a Customer with the surname ‘West’ – only a unique name can identify a member unambiguously.

The other reason I just found out about from Mosha, and it explains some weird behaviour I had noticed several times in the past but never understood. When you have a query (or worse, a calculation) that doesn’t use unique names, AS will search through the members on all dimensions until it finds a member with the right name. This is not too bad when all of your dimensions are MOLAP, but if you have a ROLAP dimension and AS decides to search through that then you can end up with a big performance problem. If you rerun the second query above with a Profiler trace running, you will see AS query the Internet Sales Order Details ROLAP dimension, and this makes the query substantially slower. Note that you might be lucky and AS might find a member with a matching name before it comes to search the ROLAP dimension, but you probably don’t want to take that risk!

Project Gatineau

Interesting post from Jamie Thomson’s non-SSIS blog on Microsoft’s web analytics offering Gatineau, and how it looks very much like an Analysis Services solution:

I’ve blogged about adCenter’s use of Analysis Services before and I’ve heard the same stories as Jamie about a mutant version of AS. I wonder whether someone might consider telling us more about this, if this is (as I suspect) what he’s working on at the moment?

Will MDX Go Mainstream?

I always look out for Mark Whitehorn’s articles in the Register, if only because I’m tickled to see any mention of MDX on the front page of the UK’s foremost IT news site. Here’s his latest:

While I agree with his judgement that "MDX will become a highly saleable skill" (it already is for me!), and while I agree with all his arguments that MDX is a good thing and better than SQL for BI queries, I can’t agree with his central argument that application developers are going to start learning it on a massive scale. I would really, really, really like to believe it but I can’t. A small minority of developers who develop specialist analytical applications will need to learn a bit, but everyone else will rely on third-party products like Dundas OLAP Services (what’s going to happen about that with the MS acquisition of the Dundas products? Anyone know?) or the Intelligencia OLAP Controls. It’s not that MDX is difficult per se, but that people who are used to thinking in SQL – and developers are always going to need to know some SQL – find it very difficult to start thinking in MDX, and that’s a big hurdle to overcome. And as the existence of not just the tools I mentioned but the entire AS client tool market proves, it’s relatively easy to write a generic MDX query generator that will work well on just about any cube, whereas you can’t write a generic SQL query generator that will work on any set of tables without building a metadata layer over the top (e.g. in the way Report Builder needs its Report Models); and once you’ve built that metadata layer you might as well have built a cube anyway.

Fun and Games with Schema Rowsets

One of the (few) new features in AS2008 is Dynamic Management Views, which give you a table-like way of looking at all kinds of useful metadata and management-style information. However, as Darren points out here:
…this isn’t really anything you can’t get from schema rowsets at the moment, and the great set of functions that Darren wrote for the Analysis Services Stored Procedure Project make it very easy to get at this information in MDX.

I’ve been playing around with schema rowsets a bit over the last week or so, mostly because I’m working on an updated version of my SSIS cache-warmer (more of which soon), but it’s given me a number of interesting ideas for other things that could be possible. For instance, one of the big missing features from AS at the moment, in my opinion, is a query governor: it’s all too easy for one user to bring your server to its knees by running the query from hell. In the article above, Darren talks about the information in the DISCOVER_SESSIONS and DISCOVER_CONNECTIONS rowsets, and how you can use the ids they return to kill sessions and connections; both of them contain other interesting information (which I think probably needs some time spent researching to make sure I know exactly what it means) on the time since the last command, the CPU time taken by the last command, what the last command was, how long the last command took to run and so on. With this information I guess it would be possible to build a rudimentary query governor in SSIS: basically a package that was scheduled to run every ten seconds or so, checked the overall CPU usage (or other perfmon counters) on the machine and, if it was greater than a certain threshold, looked at the sessions that were open, tried to work out which were the guilty parties and killed them.
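As a sketch of what the decision logic inside such a package might look like – note that the column names and thresholds below are my own illustrative assumptions, not the actual DISCOVER_SESSIONS column names, and the actual kill would be an XMLA Cancel sent with the offending session id:

```python
# Sketch of the decision logic for a rudimentary query governor.
# The session dicts stand in for rows from the DISCOVER_SESSIONS rowset;
# the key names are illustrative, not the real rowset column names.

CPU_THRESHOLD_MS = 30_000  # assumed per-session cut-off: 30s of CPU time

def sessions_to_kill(sessions, server_cpu_pct, cpu_pct_threshold=90):
    """Return session ids worth cancelling, worst offender first.

    sessions: list of dicts with 'session_id' and 'last_command_cpu_ms'.
    Does nothing unless overall server CPU is above cpu_pct_threshold.
    """
    if server_cpu_pct < cpu_pct_threshold:
        return []
    offenders = [s for s in sessions
                 if s["last_command_cpu_ms"] > CPU_THRESHOLD_MS]
    # kill the heaviest CPU consumers first
    offenders.sort(key=lambda s: s["last_command_cpu_ms"], reverse=True)
    return [s["session_id"] for s in offenders]
```

In a real package you’d obviously want to be much more careful about whom you kill – long-running processing commands look just as guilty as runaway queries by these measures.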

There’s other useful stuff available from schema rowsets too that doesn’t seem to be documented anywhere. If you look at the DISCOVER_SCHEMA_ROWSETS rowset (try running call assp.Discover("DISCOVER_SCHEMA_ROWSETS") if you have ASSP installed) you can see what schema rowsets are available. One interesting one I found was DISCOVER_PARTITION_DIMENSION_STAT; for example run


…on Adventure Works. What it’s showing is the partitioning and indexing information for an individual partition, information discussed in this SQLCat blog entry from earlier this year:

This article sends you off to the info*.xml files created when you process a partition to get this information, but here it is ready to use. The last three columns the rowset returns are the important ones: they tell you whether each attribute is indexed in the partition and if so what the maximum and minimum members in the partition are. These maximums and minimums are expressed as DataIDs, AS’s own internal surrogate keys, and you can find out what a member’s DataID is by using the (wait for it) DataID function as follows:

with member measures.showdataid as
DataID([Date].[Date].CurrentMember)
select measures.showdataid on 0,
[Date].[Date].members on 1
from [adventure works]

When I first found out about this I thought it would be really cool to be able to visualise this information somehow so you could check to see whether your partition slicing was in a healthy state or not. For instance, if you look at the 2001 partition from the Internet Sales measure group:


…you can see there are no indexes at all. This is because there is so little data in it, it falls below the Index Build Threshold server property, and this in turn means that every time you run a query then that partition is going to be scanned. Another thing you might need to watch out for, if you’re not using the Slice property on each partition (and you really should be, even on MOLAP partitions, despite what some people say) is the scenario the SQLCat team describe where your partitions end up with overlapping DataID ranges, meaning that AS ends up scanning partitions unnecessarily when you run queries. I’m sure I could put together some kind of SSRS report showing all the partitions in a measure group side by side, and perhaps using a stacked bar chart or something try to represent the DataID ranges for a given attribute.
