Using PowerPivot to Analyse Windows Search Results

My whole life is on my laptop hard drive, and as a consequence my laptop hard drive has a lot of stuff on it. Luckily, when I need to find something, there’s always Windows Search to help me out – but even so a single search query can return a lot of data. How can we make sense of all the data in the Windows Search Index? Well, we can use PowerPivot of course!

I first got the idea for doing this when I saw ‘Microsoft OLE DB Provider for Search’ in my list of OLE DB Providers; a quick look around on the net revealed the following useful sources of information about it:
http://msdn.microsoft.com/en-us/library/bb231256(v=VS.85).aspx
http://www.thejoyofcode.com/Using_Windows_Search_in_your_applications.aspx

If you can query Windows Search using SQL, I thought, then I should be able to take the data returned by one of these SQL queries and load it into PowerPivot. After a lot of trial and error I managed it, and it works rather well. Here’s how to do it…

First of all, you need to make sure you have the OLE DB Provider for Search installed. If you don’t, you need to download and install the Windows SDK. Then you can open up a new Excel workbook and open the PowerPivot window. Next, you need to create a new connection by going to the Home tab, clicking the Get External Data/From Other Sources button, and then clicking the Others (OLEDB/ODBC) option.

[screenshot: the Others (OLEDB/ODBC) option]

Next, type in the following connection string:

Provider=Search.CollatorDSO;Extended Properties="Application=Windows"

Do not try to click the Build button and select the provider from the list – when I did this, I got the error “Provider is no longer available. Ensure that the provider is installed properly”. Next, choose the “Write a query to specify the data to import” option (again, if you choose the “Select from a list of tables…” you’ll get an error) and enter your SQL query.

Here’s where the fun begins. From the two links above, you can see that there are loads of possibilities as to what you can query. Here’s a sample query that returns a list of all the files in the Documents folder on my c:\ drive and below, along with their file types, the folder path and the file size in bytes:

SELECT  
System.ItemTypeText,
System.Size, System.ItemFolderNameDisplay,
CAST(System.ItemFolderPathDisplay as DBTYPE_BSTR),
System.DateCreated,
System.DateAccessed,
System.FileExtension,
CAST(System.ItemName AS DBTYPE_BSTR) 
FROM SYSTEMINDEX
WHERE SCOPE='file:C:\Users\Chris Webb\Documents'

Notice that I’ve had to cast some columns to DBTYPE_BSTR – I found that if I didn’t do this, the columns simply didn’t appear in the query results in PowerPivot. Other things you can do here in the query include searching for all items that include particular words or phrases, or which are above a certain size, or have a particular file extension.
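
For example, here’s a sketch of a query that should return all the Excel workbooks in the same folder that are bigger than 1MB and contain the word ‘budget’. It’s untested and purely illustrative: the .xlsx extension, the 1MB size threshold and the search term are placeholder values I’ve picked myself, so check the documentation linked above for the exact predicate syntax.

SELECT
System.Size,
System.FileExtension,
CAST(System.ItemName AS DBTYPE_BSTR)
FROM SYSTEMINDEX
WHERE SCOPE='file:C:\Users\Chris Webb\Documents'
AND System.FileExtension='.xlsx'
AND System.Size > 1048576
AND CONTAINS('budget')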

With that done, you’re good to go. In a pivot table you can slice and dice all the data returned to your heart’s content. Here, for example, are the top five files with the .ABF extension (ie SSAS backup files) from my c:\ drive:

[screenshot: PivotTable showing the top five .ABF files]

With a separate time dimension table joined to System.DateCreated you can do even more. Here’s the total size in bytes of the files on my c:\ drive, broken down by the year they were created:

[screenshot: PivotTable showing total file size in bytes by year created]

You can also use the DAX time intelligence functionality. I added a running total calculation that shows the growth in the total size in MB of all files, over time, based on the creation date of each file. Here’s the formula:

=CALCULATE(SUM(Files[SYSTEMSIZE]), DATESBETWEEN(Time[Date], BLANK(), LASTDATE(Time[Date])))/1024/1024

This chart shows that running sum from November 2008, when I bought the laptop, to today:

[screenshot: chart showing the running total from November 2008 to today]

There are plenty of tools out there that help you analyse this type of data, but I doubt any of them can do what the PowerPivot/Excel combo can do. And it’s this kind of personal BI that PowerPivot makes easy. The only thing missing is an API that would allow you to build the SQL query used here dynamically: imagine having an interface where users could type their own search terms and then be able to analyse the results in PowerPivot at the click of a button. Hopefully PowerPivot will get an API soon. And as I’ve said in the past, wouldn’t it be cool if Bing could do this kind of thing with web search results and data found on the web?

New Stuff in the TechEd BI Keynote

I wasn’t able to go to the MS BI Conference/TechEd in New Orleans this year, unfortunately, but I’ve just watched Ted Kummert’s keynote from today and it’s got lots of interesting new BI-related stuff in it. You can view it here:
http://www.msteched.com/2010/NorthAmerica/Keynote02

If you don’t have any patience with the normal keynote backslapping, I suggest you skip forward to around 01:20:00, when Amir Netz comes on stage to do some demos. Here’s what he shows:

  • 01:25:20 PivotViewer Extensions for Reporting Services. As far as I can see, this is a new bit of SSRS functionality that does something like the following: execute a data-driven subscription to generate multiple SSRS reports, then load them into the new Silverlight control-based version of Live Labs Pivot so that the output of these reports can be analysed. This will be available in the next thirty days, so I guess early July. It makes for a very visually appealing demo, but if I’m right about what’s happening here then it seems a bit of a hack. Live Labs Pivot needs to be properly integrated into the BI stack!
  • 01:32:30 the ability to define KPIs in PowerPivot. The funny thing is that although I’ve worked with hundreds of companies using SSAS since KPIs were introduced in SSAS 2005, I have never, ever seen anyone using KPIs in production! But this is slightly different and it’s got a nice visual designer, so I can see how KPIs in PowerPivot would be very useful.
  • 01:34:45 new record view for viewing data in the PowerPivot designer, a better way of working with wide tables and editing the calculations on them. Handy.
  • 01:36:30 importing a PowerPivot solution, from Excel, into BI Development Studio. You get the same PowerPivot designer UI in BIDS as you get in Excel, but now you are able to use source control, develop offline and so on.
  • 01:37:55 lineage and impact analysis for DAX calculations – a diagram showing the dependencies between DAX calculations. Good, but lineage and impact analysis is needed for much more than just DAX calculations – it needs to cover everything from the structure of the relational data warehouse to SSIS, SSAS and SSRS reports.
  • 01:39:30 2 billion rows of data loaded into a server-based instance of SSAS running in in-memory mode, from a project developed in BIDS. And of course, it’s fast!

So only a few hints at how Vertipaq will be used in corporate BI scenarios, but what’s here is encouraging…

Thoughts on Analysis Services in the Cloud

So far this year I’ve indulged myself a few times in a bit of futurology (here and here, for example) regarding directions the Microsoft BI stack might take. The one area I haven’t touched on recently, though, is what Analysis Services in the cloud might look like; I did speculate a bit here, but that was a while ago now and before several relevant technologies had been announced. It’s certainly coming, and presumably somebody somewhere is working on it right now in some top-secret bunker in Redmond, so maybe a few public comments on what we, the user community, would want from it would be helpful…? Anyway, welcome or not, here are some thoughts…

So why would you want or need Analysis Services in the cloud rather than regular Analysis Services? I can think of a few things it should be able to do to justify its existence:

  • It should be cheaper than hosting all the infrastructure in-house and it should be scalable to the Nth degree. OK, so these are the standard reasons dragged out for cloud-based anything, but with Analysis Services there are two obvious times when resource usage peaks – processing and when a big/complex query runs – and equally there are times when the server can be completely quiet; so the idea of being able to make use of near-infinite resources when you need them to make processing and querying super-quick, but only pay for what you use, is very appealing. It follows, of course, that the platform itself needs to be able to scale out to take advantage of those extra resources when they are available.
  • As well as being able to work with ‘traditional’ data sources such as your corporate data warehouse, it should be able to work with cloud-based data sources, whether relational or non-relational (like Amazon SimpleDB, Google’s recently-announced BigQuery, Azure Table Storage and all the rest), feeds (like OData or GData), linked data (RDF), web-based spreadsheets like Google Docs or the Excel web app, or completely unstructured data from anywhere on the web (maybe something like how Google Squared works). Supporting the integration of data modelled for all of these different types of database would be a challenge, but I think it should be possible.
  • It should be available as a data source for anything else on the web – your own apps, reports, web-based spreadsheets and so on – as well as for desktop apps like Excel. The really important thing, for me at least, would be for it to expose an XMLA interface to allow ad-hoc querying (note that Excel can’t talk directly to an XMLA provider, it only does OLEDB for OLAP, but it’s possible to bridge the two and Simba already sell an OLEDB provider that does this); grudgingly, I’ll admit a SQL query interface would be useful too. The ability to expose data via an OData feed would be a must as well.
  • I’d also like to see it support some basic ETL functionality too, because if it is as scalable and fast as I’d like it to be then it would have an obvious secondary use for large-scale number crunching: aggregating data, doing lookups, sorting, and many of the other things that you might do today in the SSIS data flow or that you might look at Hadoop for. Derived columns and lookups could all be done with DAX or MDX calculations; pivoting, sorting and filtering could all be done (and configured very easily with a good client tool) through the right MDX query. I can imagine it acting as a data source for itself: you’d load data into a cube or a table or whatever, create a query on top of it which is made available as a feed, then take the data from that feed and load it into another cube/table, and so on.
  • Following on from the last two points, it’s not enough for it to be able to act as a data source: Microsoft would need to come up with a decent web-based client tool specifically for use with it. And no, vanilla pivot tables on the web wouldn’t cut it, nor would SSRS in the cloud (not that I wouldn’t want that too); you’d need the wow factor of something like Live Labs Pivot as well as serious power-user functionality like the ProClarity desktop client, and it would probably need to be built using Silverlight or HTML5. I still think there’s an opportunity here to rethink what a client tool could be, blur the line between BI client tool, spreadsheet and database, and come up with something really new.

I wouldn’t want, and don’t expect to get, a recognisable version of Analysis Services 2008 in the cloud in the way that SQL Azure is recognisable as server-based SQL Server. While I still see an important role for Analysis Services as we have it today in corporate BI scenarios, I don’t think there’s any point transferring it to the cloud with exactly the same functionality. Some things, like dimension security, would still be needed, but others, like cell security, we could probably live without. PowerPivot in the cloud would make more sense as a starting point, so long as it was not just a straight copy of PowerPivot on the desktop: the ability to scale to really, really large data volumes, as in the first bullet above, would be the key feature and the only real reason why customers would want BI in the cloud. And it’s not just the scalability of simple queries either: queries that use complex calculations would need to scale too. You know what, though? When I look at DAX I can’t help but think it was designed with this requirement in mind; I can see how the evaluation of DAX expressions could be easily parallelised.

So all this sounds pretty ambitious, even a bit pie-in-the-sky. The way I see it, though, as far as BI-in-the-cloud goes there’s everything still to play for and if Microsoft doesn’t deliver then someone else will. Could it be Google, or Amazon, or some startup we’ve never heard of? Now that Google BigQuery has a SQL interface, it’s only going to be a matter of time before someone builds a BI app on top of it (I wonder if Mondrian can be made to work with it?) – and it certainly seems to be fast. Microsoft needs to think beyond using BI to defend the Excel/Sharepoint franchise and start thinking about the future! With ten years of BI experience behind it, Microsoft should be in a strong position to move forward into the cloud but it’s only going to succeed if it’s innovative. I understand there will be some new PowerPivot-related product announcements at the BI Conference this week; I’m keeping my fingers crossed.

As always, your comments and ideas are welcome…

The ‘Function XXXX only works with contiguous date selections’ Error

In my last post on DAX, I mentioned I’d come across the following error when running a query in PowerPivot:

ERROR – CALCULATION ABORTED: MdxScript(Sandbox) (22, 4) Function ‘DATEADD’ only works with contiguous date selections.

The DAX expression I was using was as follows:

=CALCULATE(SUM(FactInternetSales[SalesAmount])
    , DATEADD(DimDate[FullDateAlternateKey], -1, YEAR)
    , ALL(DimDate))

The error was happening when I ran queries with dates on rows or columns while using slicers that had multiple years selected but only a few months, like this:

[screenshot: PivotTable with multiple years but only a few months selected in the slicers]

Clearly I had a non-contiguous date selection: I was expecting to see all the dates in January and February 2003 on rows, followed by all the dates in January and February 2004. But I couldn’t understand why I was getting an error on what, to me, seemed like a perfectly reasonable query!

So I asked my friends at Microsoft and Marius Dumitru explained that this wasn’t really an error, it was actually a deliberate feature designed to stop users running queries where calculations would return misleading results, or where the semantics of what was actually happening would be hard to understand. I can understand the motivation for doing this – for example, consider what would happen in the scenario above if there was no error, and I didn’t have dates on rows or columns – no time intelligence calculation would return any meaningful values here. But I still maintain that my scenario above should return values, because in this case the slicers are only serving to filter which dates are being displayed on a visible axis, and when the dates themselves are visible on an axis then it makes sense to see the result of the calculation.

Time to open a Connect suggestion, then, in the hope that this scenario will be allowed in a future version. Please vote here:
https://connect.microsoft.com/SQLServer/feedback/details/565032/loosen-restrictions-on-contiguous-date-selections-error

Hadoop on Azure?

Here’s something interesting I’ve just seen on James Dixon’s blog: apparently Microsoft is preparing to provide Hadoop on Windows Azure. Here’s the article James links to:

http://www.sdtimes.com/link/34319

I wonder if this is just a stopgap, in response to customer demand, as the article suggests it might be? How does this fit with the recently-announced Technical Computing Initiative? Was Project Dryad a dead end?

Time Intelligence Without Date Columns in DAX

One of the things that annoys me a little bit in DAX is the way all the built-in time intelligence functionality relies so much on having columns of type Date. What happens if you don’t have a Date column to use? Many Time dimension tables that I see in data warehouses don’t: they use integers to represent dates instead, often in the form YYYYMMDD (eg 20100525 would represent the 25th of May 2010). And what happens if you need to do something time-intelligence-like on a non-time dimension? For example, you might have a dimension representing different budgets, and you might want to show the difference between the current budget and the previous budget. There are no equivalents of MDX’s .PrevMember or .NextMember, or of any other hierarchy-based functions, because there are no hierarchies in PowerPivot.

Anyway, after some thought (and a lot of help from Tomislav and Jeffrey Wang from the dev team) I’ve found that it is possible to create time-intelligence calculations without dates. Let’s take a simple PowerPivot model based on Adventure Works DW, with the DimDate table joined to FactInternetSales on the OrderDateKey column. Since there’s a column of type Date on DimDate, FullDateAlternateKey, we can use the time intelligence functionality to create a calculation that returns the previous year’s Sales Amount as follows:

=CALCULATE(SUM(FactInternetSales[SalesAmount])
    , DATEADD(DimDate[FullDateAlternateKey], -1, YEAR)
    , ALL(DimDate))

Here’s what a query using this calculation looks like:

[screenshot: PivotTable showing the previous year’s Sales Amount calculation]

The above calculation works by returning the sum of SalesAmount for all the dates that are one year before the dates in the current context. So if you have the year 2003 on rows in your query, then for that cell the filter context would contain all of the dates in the year 2003; these would then be shifted back one year and you’d get the sum of SalesAmount for all the dates in the year 2002. Incidentally, this explains why in Adventure Works you get strange values for the year 2004: it’s because in DimDate you only have dates from January 1st 2004 to August 31st 2004, so the calculation only returns the sum of SalesAmount for January 1st 2003 to August 31st 2003.

Now, what would happen if we didn’t have the FullDateAlternateKey column and we wanted to do the same calculation? We need to recreate this algorithm and shift our dates back one year without using the DateAdd function. The way to do it relies on the format of the keys used in the DimDate[DateKey] column, the YYYYMMDD format I mentioned before. If we have a date key like 20100525 and we want to find the equivalent date in the previous year, all we need to do is subtract 10000 from that key value. In our expression, then, we need to take the set of currently selected dates, find for each of those dates the date whose key value is 10000 less, and then sum up SalesAmount for those dates. Here’s the expression:

=CALCULATE(SUM(FactInternetSales[SalesAmount])
, FILTER(ALL(DimDate[DateKey])
, COUNTROWS(FILTER(VALUES(DimDate[DateKey]), DimDate[DateKey]-10000=EARLIER(DimDate[DateKey])))>0)
, ALL(DimDate))

Notice the use of two nested Filter functions here, and the way that the Earlier function is used to retrieve the value of DimDate[DateKey] from the outer row context. The outer Filter iterates over the set of all values of DimDate[DateKey], regardless of year or any other selection made, thanks to the All function; for each of those keys the inner Filter checks whether the current selection contains a key that is exactly 10000 greater, so the outer Filter ends up returning only the keys that are 10000 less than the dates that have currently been selected.

It works in almost the same way as the original calculation, although I’ve noticed a few differences. First of all, the grand total for the new calculation is the sum of the values shown for each year, which is not the case for the original version:

[screenshot: PivotTable comparing the grand totals of the two calculations]

Secondly, on February 29th in a leap year the DateAdd function returns the value for February 28th in the previous year, whereas our new approach (unsurprisingly) does not: 20040229 minus 10000 gives 20030229, a key that doesn’t exist in DimDate, so nothing is returned for that date:

[screenshot: PivotTable showing the two calculations’ values for February 29th]

It’s probably a matter of taste what should happen here; either a null or the value from February 28th in the previous year makes sense to me. Note, though, that as shown in the last-but-one screenshot the year totals for 2004 are identical for both calculations, so the value for February 28th is not counted twice.

Thirdly, with dates still on rows and measures on columns, when you add English Month Name to a slicer and choose one month the new expression works OK but the original expression gives the following error:

ERROR – CALCULATION ABORTED: MdxScript(Sandbox) (22, 4) Function ‘DATEADD’ only works with contiguous date selections.

[screenshot: the DATEADD error with a single month selected in the slicer]

This particular error will be the subject of my next blog post, but in this case I think the DAX time intelligence functions are behaving a bit rubbishly when they should be returning results.

Google Prediction API

Here’s something interesting for anyone into cloud-based data mining: Google have announced a new prediction API. More details here:

http://code.google.com/apis/predict/
http://code.google.com/apis/predict/docs/getting-started.html

Here’s the summary from the site:

The Prediction API enables access to Google’s machine learning algorithms to analyze your historic data and predict likely future outcomes. Upload your data to Google Storage for Developers, then use the Prediction API to make real-time decisions in your applications. The Prediction API implements supervised learning algorithms as a RESTful web service to let you leverage patterns in your data, providing more relevant information to your users. Run your predictions on Google’s infrastructure and scale effortlessly as your data grows in size and complexity.

Looks exciting! Now, why didn’t Microsoft do this? Clearly the ideas were there (remember the cloud version of the Excel DM addin?), but it seems like all the key members of the SQL data mining team had to leave MS to pursue their dreams: http://predixionsoftware.com/ 

Microsoft Technical Computing Initiative

An interesting announcement here from Microsoft about its new Technical Computing Initiative:

http://blogs.technet.com/microsoft_blog/archive/2010/05/17/modeling-the-world.aspx

Lots of the usual PR-speak and vagueness, but from the post above here are the main points:

In terms of technology, the initiative will focus on three key areas:

  1. Technical computing to the cloud: Microsoft will help lead the way in giving scientists, engineers and analysts the computing power of the cloud.  We’re also working to give existing high-performance computing users the ability to augment their on-premises systems with cloud resources that enable ‘just-in-time’ processing. This platform will help ensure processing resources are available whenever they are needed—reliably, consistently and quickly. 
  2. Simplify parallel development: Today, computers are shipping with more processing power than ever, including multiple cores. But most modern software only uses a small amount of the available processing power. Parallel programs are extremely difficult to write, test, and troubleshoot.  We know that a consistent model for parallel programming can help more developers unlock the tremendous power in today’s computers and enable a new generation of technical computing. We’re focused on delivering new tools to automate and simplify writing software through parallel processing from the desktop… to the cluster… to the cloud.    
  3. Develop powerful new technical computing tools and applications: Scientists, engineers and analysts are pushing common tools (i.e., spreadsheets and databases) to the limits with complex, data-intensive models. They need easy access to more computing power using simpler tools to increase the speed of their work, and we’re building a platform with this objective in mind. We expect that these efforts will yield new, easy-to-use tools and applications that automate data acquisition, modeling, simulation, visualization, workflow and collaboration.

And from this article on the Wall Street Journal, here’s a practical example of what will be delivered:

Muglia offers an example of how Microsoft plans to make high-performance computing more accessible: Today many financial services firms use the company’s Excel spreadsheet application to develop financial models, but if the firms need the power of a supercomputer to crunch numbers, they often have to write specialized applications in programming languages like Fortran that a much smaller group of users are fluent in.

Microsoft’s Technical Computing group is working on software that will allow a program like Excel to run in parallel on thousands of machines so the application can be used to tackle monster financial computing chores on its own, Muglia says.

It’s been a while since there was any wild speculation on this blog but I can’t resist it – all this talk of running Excel in parallel on multiple machines and the cloud makes me wonder if this is going to work with PowerPivot too? Or rather, will this work with whatever PowerPivot/Vertipaq becomes when it grows up into a corporate BI tool?

PowerPivot Samples Available

There are some new PowerPivot samples available for download – sample data in workbooks, plus DAX calculation examples. You can get them here:

http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=eac83429-c6e5-48a6-87cf-00a4141e5441

http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=1ae63bfb-c303-44e3-ae44-7413d499495d

Microsoft BI Indexing Connector

Just seen this on the Sharepoint BI blog, the Microsoft BI Indexing Connector:

http://blogs.msdn.com/sharepointbi/archive/2010/05/14/announcing-the-microsoft-business-intelligence-indexing-connector.aspx

From the blog post:

With this new Indexing connector, users have a dedicated Report tab where they can find the reports they are looking for, use refiners to further narrow their searches, and even get a preview of the report before opening it in the browser or client…

…In addition to discovering the documents, the MSBIIC will also crawl the data sources revealing items that are not necessary in the report itself but critical to the user’s discovery and as part of the refiners.