A Look At Google BigQuery

Over the years I’ve written quite a few posts about Google’s BI capabilities. Google never seems to get mentioned much as a BI tools vendor but to me it’s clear that it’s doing a lot in this area and is consciously building up its capabilities; you only need to look at things like Fusion Tables (check out these recently-added features), Google Refine and of course Google Docs to see that it’s pursuing a self-service, information-worker-led vision of BI that’s very similar to the one that Microsoft is pursuing with PowerPivot and Data Explorer.

Earlier this month Google announced the launch of BigQuery and I decided to take a look. Why would a Microsoft BI loyalist like me want to do this, you ask? Well, there are a number of reasons:

  • It looked like fun.
  • It’s actually described by Google themselves as an OLAP system here. I don’t agree with this classification – I think it’s better to describe it as an analytical database – but it was enough to make an old OLAP fan like me curious. It’s also a column-store database. While it’s hardly a complete BI solution in itself it could easily be used as the back-end for one, and I believe the French BI vendor BIME already supports it as a data source.
  • I’ve argued in the past that the only reason anyone would want to do cloud-based BI is to scale way beyond whatever on-premises hardware they could afford, and Google does make some impressive claims for its scalability: it should be able to provide quick response times for queries on terabytes of data. After all, if I can handle hundreds of millions of rows of data in PowerPivot on my laptop then I’m only going to be tempted to use a cloud-based solution if I need to work with much larger data volumes.
  • It’s based on Dremel, and I’ve heard that one of the developers who works on Dremel is someone we used to know well in the world of Microsoft BI.

To test it out I thought I’d see how it handled the largest csv file I happened to have handy: a 1.2 GB dataset consisting of two integer columns and 86,220,856 rows. I wasn’t able to load the file directly into BigQuery because it was too large (smaller files can be loaded in directly), so I first had to upload it to Google Cloud Storage and then load it into BigQuery from there. The upload to Google Cloud Storage took about an hour, and once that was complete it took about 15 minutes to load the data into BigQuery. Here’s the dialog for loading data into BigQuery – it’s pretty straightforward, as you can see:

[Screenshot: the BigQuery load data dialog]

The first query I ran was a simple count(*) to find the number of rows in the table, and that took a respectable 2.9 seconds:
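In other words, nothing more complicated than this:

select count(*) from [Edges.EdgesFull]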

Finding the number of distinct values in one of the columns was a little slower, at 3.9 seconds:
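Again the query itself was trivial – assuming it was run against the Source column, it would be something like:

select count(distinct Source) from [Edges.EdgesFull]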

As was getting the sum of the values in a column, at 4.0 seconds:
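Which, again assuming the Source column, would be something along these lines:

select sum(Source) from [Edges.EdgesFull]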

While not astoundingly fast, the fact that all these queries executed in under 5 seconds means that performance is good enough for ad hoc data analysis.

Next, I thought I’d have a crack at the problem that this dataset was actually intended for: the scenario described in this post on the Vertica blog from last year:

http://www.vertica.com/2011/09/21/counting-triangles/

Ignoring the my-technology-is-better-than-yours tone of this post, I thought this was an interesting problem: easy to understand but difficult to calculate quickly, and a good test for any product with pretensions to being a ‘big data’ platform. The two columns in the csv file are meant to represent the ids of members of a social network, with each row representing a relationship between two people. The problem to solve is to find how many ‘triangles’ there are in the data, ie situations where person A has a relationship with person B, person B has a relationship with person C and person C also has a relationship with person A.

It was fairly easy to adapt the SQL in the blog post to the BigQuery syntax, the only slight problem being that it only supports joins between two tables at a time and so you have to use subselects:

select count(*)
from
  (select
     e2.Source as e2Source, e2.Destination as e2Destination,
     e3.Source as e3Source, e3.Destination as e3Destination
   from
     (select * from [Edges.EdgesFull]) as e2
   join
     (select * from [Edges.EdgesFull]) as e3
   on e2.destination = e3.source
   where e2.source < e3.source) as e4
join
  (select * from [Edges.EdgesFull]) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source

On a small test dataset everything worked OK, but on the full dataset I ran the query and… got an error:

Reading the small print in the docs I realised I’d run into the following limitation:

BigQuery provides real-time query performance for JOIN statements where one of the two sides is small. Here small means less than 8MB of compressed data; compression rates vary, but are usually in the range of 2-10X as compared to the corresponding CSV. We expect this 8MB limit to continue to increase over time. For simplicity, we always require the table on the right side of any JOIN clause to be small. Outer joins with a small table on the "outer" side are unsupported, which means that we only support left outer joins.

Clearly my entire table was going to be much more than this limit. I had a go at filtering the data so that the first column was less than 2000 (the max value in each column is 4,847,570) and that worked, returning in 23.1 seconds:

select count(*)
from
  (select
     e2.Source as e2Source, e2.Destination as e2Destination,
     e3.Source as e3Source, e3.Destination as e3Destination
   from
     (select * from [Edges.EdgesFull] where source<2000) as e2
   join
     (select * from [Edges.EdgesFull] where source<2000) as e3
   on e2.destination = e3.source
   where e2.source < e3.source) as e4
join
  (select * from [Edges.EdgesFull] where source<2000) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source

A bit disappointing, but maybe this is a problem more suited to Pregel than Dremel? Certainly in more traditional OLAP scenarios, where you need to join a fact table to a dimension table, many dimension tables will be smaller than 8MB when compressed, so this limitation wouldn’t be such an issue.
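To illustrate – and this is just a sketch, using hypothetical [Sales.FactSales] and [Sales.DimProduct] tables rather than anything I actually tested – a typical star-schema query would put the small dimension table on the right-hand side of the join, which is exactly where BigQuery wants it:

select p.Category, sum(f.SalesAmount) as TotalSales
from [Sales.FactSales] as f
join [Sales.DimProduct] as p
on f.ProductKey = p.ProductKey
group by p.Category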

Overall I was impressed with the performance and ease-of-use of BigQuery, and I’ll be interested to see how it develops in the future and integrates with the rest of the Google stack (it’s already possible to hook it up to Google Docs with a bit of coding). I will, of course, be equally interested to see what Microsoft’s cloud BI and Office strategy comes up with to counter this.