SDS: the new relational features announced

After all the rumours, here’s the official announcement of the new relational features that are coming to SQL Data Services:

Given that the team have already made noises about adding BI features to SDS soon, I can’t wait to see what form they’ll take. Of course there are already lots of ways of doing BI with data stored online, as my last blog entry showed; there are also a couple of startups like Birst and GoodData who do very sophisticated BI things in the cloud already. But I hope Microsoft has something up its sleeve, and that I can run an MDX query against it…

Guardian Data Store – free data, and some ideas on how to play with it

I was reading the Guardian (a UK newspaper) online today and saw that they have just launched something called Open Platform, basically a set of tools that allow you to access and build applications on top of their data and content. The thing that really caught my eye was the Data Store, which makes available all of the numeric data they would usually publish in tables and graphs in the paper in Google Spreadsheet format. Being a data guy I find free, interesting data irresistible: I work with data all day long, and building systems to help other people analyse data is what I do for a living, but usually I’m not that interested in analysing the data I work with myself because it’s just a company’s sales figures or something equally dull. However, give me information on the best-selling singles of 2008 or crime stats, for example, and I start thinking of the fun stuff I could do with it. If you saw Donald Farmer’s fascinating presentation at PASS 2008 where he used data mining to analyse the Titanic passenger list to see if he could work out the rules governing who survived and who didn’t, you’ll know what I mean.

Given that all the data’s in Google Spreadsheets anyway, the first thing I thought of doing was using Panorama’s free pivot table gadget to analyse the data OLAP-style (incidentally, if you saw it when it first came out and thought it was a bit slow, like I did, take another look – it’s got a lot better in the last few months). Using the data I mentioned above on best-selling singles, here’s what I did to get the gadget working:

  1. Opened the link to the spreadsheet:
  2. Followed the link at the very bottom of the page to edit the page.
  3. On the new window, clicked File/Create a Copy on the menu to open yet another window, this time with a version of the data that can be edited (the previous window contained only read-only data)
  4. Right-clicked on column J and selected Insert 1 Right, to create a new column on the right-hand side.
  5. Added a column header, typed Count in the header row, and then filled the entire column with the value 1 by typing 1 into the first row and then dragging it down. I needed this column to create a new measure for the pivot table.
  6. Edited the ‘Artist(s)’ column to be named ‘Artist’ because apparently Panorama doesn’t like brackets
  7. Selected the whole data set (the range I used was Sheet1!B2:K102) and then went to Insert/Gadget and chose Analytics for Google Spreadsheets. It took me a moment to work out I had to scroll to the top of the sheet to see the Panorama dialog that appeared.
  8. Clicked Apply and Close, waited a few seconds while the cube was built, ignored the tutorial that started, spent a few minutes learning how to use the tool the hard way having ignored the tutorial, and bingo! I had my pivot table open. Here’s a screenshot showing the count of singles broken down by gender and country of origin.
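For anyone who would rather see the idea than the clicks, here is a rough Python sketch of what the Count column and the pivot are doing: every row contributes 1, and the pivot sums those 1s per gender and country. The rows, column names and values below are placeholders I made up for illustration, not the Guardian's actual data, and this has nothing to do with how the Panorama gadget is implemented.

```python
from collections import Counter

# Hypothetical sample rows: placeholder names and values,
# NOT the real best-selling singles data.
rows = [
    {"Artist": "Artist A", "Gender": "Female", "Country": "UK"},
    {"Artist": "Artist B", "Gender": "Male", "Country": "US"},
    {"Artist": "Artist C", "Gender": "Female", "Country": "UK"},
    {"Artist": "Artist D", "Gender": "Group", "Country": "US"},
]


def pivot_count(rows, row_field, col_field):
    """Count rows per (row_field, col_field) pair.

    This is equivalent to adding a column of 1s and summing it.
    """
    counts = Counter()
    for row in rows:
        counts[(row[row_field], row[col_field])] += 1  # each row contributes 1
    return counts


def print_pivot(counts):
    """Print the counts as a simple tab-separated cross-tab."""
    row_keys = sorted({r for r, _ in counts})
    col_keys = sorted({c for _, c in counts})
    print("\t".join([""] + col_keys))
    for r in row_keys:
        cells = [str(counts.get((r, c), 0)) for c in col_keys]
        print("\t".join([r] + cells))


counts = pivot_count(rows, "Gender", "Country")
print_pivot(counts)
```

Swapping the two field names in the `pivot_count` call gives the same counts with rows and columns transposed, which is roughly what dragging fields around in the pivot table does.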


Of course, this isn’t the only way you can analyse data in Google spreadsheets. Sisense Prism, which I reviewed here a few months ago, has a free version which can connect to Google spreadsheets and work with limited amounts of data. I still have it installed on my laptop, so I had a go connecting – it was pretty easy so I won’t go through the steps, although I didn’t work out how to get it to recognise the column headers as column headers and that polluted the data a bit. Here’s a screenshot of a dashboard I put together very quickly:


Lastly, having mentioned Donald Farmer’s Titanic demo I thought it would be good to do some data mining. The easiest way for me was obviously to use the Microsoft Excel data mining addin, which comes in two flavours: the version (available here) that needs to be able to connect to an instance of Analysis Services, and the version that can connect to an instance of Analysis Services in the cloud (available here; Jamie MacLennan and Brent Ozar’s blog entries on this are worth reading, and there’s even a limited web-based interface for it too). Here’s what I did:

  1. Installed the data mining addin, obviously
  2. In the copy of the spreadsheet, I clicked File/Export/.xls to export to Excel, then clicked Open
  3. In Excel, selected the data and on the Home tab on the ribbon clicked the Format as a Table button
  4. The Table Tools tab having appeared on the ribbon automatically, I then pressed the Analyze Key Influencers button
  5. In the dialog that appeared, I chose Genre from the dropdown to try to work out which of the other columns influenced the genre of the music
  6. Clicked I Agree and Do Not Remind Me Again on the Connecting to the Internet dialog
  7. Added a report comparing Pop to Rock

Here’s what I got out:


From this we can see very clearly that if you’re from the UK or under 25 you’re much more likely to be producing Pop, that Groups are more likely to produce Rock, and various other interesting facts.
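To give a feel for the kind of comparison a Pop versus Rock report makes, here is a minimal Python sketch that tallies, for each value in the other columns, what share of the Pop-or-Rock rows are Pop. To be clear, this is not the addin's algorithm, just a plain frequency count; the function name `compare_classes`, the column names and the sample rows are all made up for illustration and are not the real data.

```python
from collections import defaultdict

# Hypothetical sample rows: invented purely to exercise the code,
# NOT taken from the Guardian spreadsheet.
rows = [
    {"Country": "UK", "Type": "Solo", "Genre": "Pop"},
    {"Country": "UK", "Type": "Solo", "Genre": "Pop"},
    {"Country": "UK", "Type": "Group", "Genre": "Rock"},
    {"Country": "US", "Type": "Group", "Genre": "Rock"},
    {"Country": "US", "Type": "Solo", "Genre": "Pop"},
    {"Country": "US", "Type": "Group", "Genre": "Rock"},
]


def compare_classes(rows, target, class_a, class_b):
    """For each (column, value), return the share of class_a among
    rows whose target is class_a or class_b."""
    tallies = defaultdict(lambda: [0, 0])  # (column, value) -> [count_a, count_b]
    for row in rows:
        if row[target] == class_a:
            idx = 0
        elif row[target] == class_b:
            idx = 1
        else:
            continue  # ignore rows belonging to any other class
        for column, value in row.items():
            if column != target:
                tallies[(column, value)][idx] += 1
    return {key: a / (a + b) for key, (a, b) in tallies.items()}


shares = compare_classes(rows, "Genre", "Pop", "Rock")

# List the values that lean most strongly one way or the other first.
for (column, value), share in sorted(
    shares.items(), key=lambda kv: -abs(kv[1] - 0.5)
):
    print(f"{column}={value}: {share:.0%} Pop vs {1 - share:.0%} Rock")
```

With only a handful of rows the percentages mean nothing, of course; the point is just that a value "favours" one genre when its share sits well away from 50%.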

So, lots of fun certainly (at least for a data geek like me), but everything I’ve shown here is intended as a serious business tool. It’s not hard to imagine that, in a few years’ time when more and more data is available online through spreadsheets or cloud-based databases, we’ll be doing exactly what I’ve demonstrated here with that boring business data you and I have to deal with in our day jobs.

Analysis Services and the System File Cache

Earlier this week Greg Galloway sent me an email about some new code he’d added to the Analysis Services Stored Procedure Project to clear the Windows system file cache:

I thought this was quite interesting: several times when I’ve been doing performance tuning I’ve noticed that the same query, running on a cold Analysis Services cache, runs much slower when the cube has just been processed. This I put down to some form of caching happening at the OS level or below that; I also vaguely knew that it was a good idea to limit the system file cache, having seen the following on Jesse Orosz’s blog:

Anyway, doing some more research on this subject I came across the following blog entry that discusses the problem of excessive caching in more detail:
…and announces a new tool called the Microsoft Windows Dynamic Cache Service that aims to provide a better way of managing the system file cache:

Has anyone got any experience with this? From what I can see, installing the Dynamic Cache Service on a 64-bit SSAS box with a big cube on looks like a good idea – has anyone tried it? If you have, or are willing to, can you let me know how you get on? Comments are welcome…
