Data Mining Amazon Wish Lists

I’ve seen this article linked to by several people today and thought it was worth posting up myself:
Not true data mining as BI folks understand it, but once you’ve got the data just think what you could do with it using AS2005 data mining (something similar to what Amazon are already doing with it, I guess). By the way, if anyone wants to buy me a present

NON EMPTY limitations

I was putting together a ridiculously large query for my previous post to try to test performance of the HAVING clause, and tried to run the following on Adventure Works:
 
SELECT [Measures].[Internet Order Quantity] ON 0,
NON EMPTY
[Customer].[Customer].MEMBERS
*
[Date].[Date].
MEMBERS
*
[Product].[Product].
MEMBERS
*
[Geography].[City].
MEMBERS
ON
1
FROM [Adventure Works]
 
To my surprise, I got the following error message:
 

Executing the query …

The expression contains a function that cannot operate on a set with more than 4,294,967,296 tuples.

Execution complete

 
Obviously some kind of internal limit here to do with NON EMPTY: the query runs if you remove it. Now 4,294,967,296 tuples is a lot of tuples but imagine that you had a million customers and ten thousand products, and you wanted to find out which customers had bought which products – you’d want to find the non empty combinations in a set of tuples which is still larger than that limit. OK in any normal cube this would probably bring back far too many rows to be a useful query, but what if you had a very specific slice in the WHERE clause (eg adding WHERE([Ship Date].[Date].&[1])  onto the end of the above query still gives the same message) so that you’d only expect to return a very small number of rows in your results? There are certainly going to be ways to work around this if you are writing your own MDX, but I wonder if any of the commercially available client tools which use NON EMPTY know to do this yet? Hmm…
 
 

The HAVING clause

The fact that there’s now a HAVING clause at all in AS2005 MDX doesn’t seem to be publicly documented anywhere, as far as I know; it’s one of those hidden features we found out about while researching ‘MDX Solutions’. While George has a full write-up on how it behaves in the book (coming soon to a store near you) here’s a quick overview of what it does…
 
Basically, it allows you to filter the contents of an axis without having to use the FILTER function. So, for example, the following Adventure Works query only returns the tuples on the ROWS axis which have Internet Order Quantity greater than 1000:
 

SELECT [Measures].[Internet Order Quantity] ON 0,
NON EMPTY
[Date].[Date].MEMBERS
*
[Product].[Subcategory].
MEMBERS
*
[Geography].[Country].
MEMBERS
*
[Customer].[Gender].
MEMBERS
HAVING
[Measures].[Internet Order Quantity]>1000
ON 1
FROM
[Adventure Works]

 
Of course this isn’t anything you couldn’t do before in AS2K, so what’s the point of it? Well, first of all, if you know how you’d do this using a FILTER then you’ve probably already realised that using the HAVING clause makes your MDX much easier to read; as someone who in the past wrote the MDX generation component for a client tool, I can also say that it will also make generating MDX in code easier.
 
There also seem to be some performance benefits to using it too. Since I was told the HAVING condition is applied after the NON EMPTY, I rewrote the above query to use a combination of NONEMPTY and FILTER as follows:
 

SELECT

[Measures].[Internet Order Quantity] ON 0,
FILTER(
NONEMPTY(
[Date].[Date].
MEMBERS
*
[Product].[Subcategory].
MEMBERS
*
[Geography].[Country].
MEMBERS
*
[Customer].[Gender].
MEMBERS
,[Measures].[Internet Order Quantity])
,[Measures].[Internet Order Quantity]>1000)
ON 1
FROM
[Adventure Works]

 
I was expecting this query to perform exactly the same as the first one; however the version which uses HAVING seems to consistently perform better on my machine, although only slightly so. I’d be interested to know whether this is reproducible and why this is. 

Netscan

Netscan is a site I have a look at every few months, and since I’ve just noticed it’s had an upgrade (although this could have happened a while ago) I thought it would be worth a blog entry:
 
As you can see, it’s a very good example of a web-based dashboard providing information/stats on newsgroup postings. The one new feature I was particularly impressed by was the graphical representation of the life of a newsgroup thread – expand one of the nodes in the ‘Thread Tracker’ section to see this; in my opinion this is exactly the kind of creative use of graphics which helps people understand data much more efficiently than a dry text-based approach. I wonder what technology they’ve used for it all? I would guess the UI is custom-coded, but I’m sure you can do something very similar with RS2005 and the rest of the SQL BI suite.
 
It’s also interesting from another point of view in that for a change the data is something that has relevance to me, as opposed to being some customer’s sales figures or accounts data. I’m there listed at #5 on the ‘Author Tracker’ rankings (using my onlyforpostingtonewsgroups hotmail account) for the whole of 2005, although if you were to rank on number of replies instead of days active I wouldn’t do quite as well; Deepak comes in at #1 on any measure though and you can understand why he’s the MVP given the amount of work he must put in. Also compare the data for Quarter 3 and Quarter 4 2005: in the ‘Report Card’ section for Q3 you can see that most measures were down lots compared to the same period 2004, whereas in Q4 the measures were way up. I guess that’s because AS2K is a mature product so the people working with it have less need to ask questions, or perhaps all Microsoft BI projects were put on hold for SQL2005; certainly I’d say that the increase in the Q4 figures must be due to the release of SQL2005 and people having to come to grips with the new functionality.