About a year ago I wrote a post looking at Google BigQuery which finished on a bum note when I ran into a limitation with the size of tables that could be used in a join. Recently I found out that that particular limitation had been lifted with the introduction of the new JOIN EACH operator, and since my account was still active and the data was still uploaded, I thought I’d see if my query could be made to run.
Just as a bit of background (if you can’t be bothered to read my previous post): the query runs on a sample data set originally created by Vertica (see here) that consists of a 1.2GB csv file with two integer columns and 86,220,856 rows. The rows represent the edges in a social network, and the aim of the query is to find the number of triangles – the number of cases where person A is connected to person B, B is connected to C and A is also connected to C. The query joins this large table to itself not once, but twice.
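To make the triangle definition concrete, here’s a minimal brute-force sketch in Python of what the query computes: ordered triples a < b < c where the edges a→b, b→c and c→a all exist. The toy edge list is made up, and this O(n³) loop is illustration only – the real query does this over 86 million rows:

```python
def count_triangles(edges):
    """Count triples a < b < c where the edges (a, b), (b, c) and (c, a)
    all exist - the same triangles the query counts."""
    edge_set = set(edges)
    nodes = sorted({n for e in edges for n in e})
    count = 0
    for i, a in enumerate(nodes):
        for j, b in enumerate(nodes[i + 1:], i + 1):
            if (a, b) not in edge_set:
                continue
            for c in nodes[j + 1:]:
                if (b, c) in edge_set and (c, a) in edge_set:
                    count += 1
    return count

# Made-up toy graph: one triangle (1, 2, 3) plus a dangling edge
print(count_triangles([(1, 2), (2, 3), (3, 1), (1, 4)]))  # prints 1
```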
Anyway, to cut a long story short the query did work this time. Here’s the new version, rewritten to use JOIN EACH:
select count(*)
from
  (select e2.Source as e2Source, e2.Destination as e2Destination,
          e3.Source as e3Source, e3.Destination as e3Destination
   from (select * from [Edges.EdgesFull]) as e2
   join each (select * from [Edges.EdgesFull]) as e3
   on e2.destination = e3.source
   where e2.source < e3.source) as e4
join each (select * from [Edges.EdgesFull]) as e1
on e1.destination = e4.e2source and e4.e3destination = e1.source
where e1.source < e4.e2source
Here’s the evidence:
It took 98.7 seconds to run, which I think is pretty good. Some remarks:
I’m not 100% sure this is the correct result, but I’m fairly confident.
I’m a long way from being 100% sure this is the most efficient way to write the query.
When the original Vertica blog post came out it sparked a p*ssing contest between vendors who wanted to show they could do better than Vertica. You can see an example of other people’s results here; clearly other platforms could do a lot better, and remember these results are two years old.
Obviously you can’t judge the performance of a database on one query.
That’s all. My mind is now at rest, and I’ll go back to being a Microsoft fanboy tomorrow.
Over the years I’ve written quite a few posts about Google’s BI capabilities. Google never seems to get mentioned much as a BI tools vendor but to me it’s clear that it’s doing a lot in this area and is consciously building up its capabilities; you only need to look at things like Fusion Tables (check out these recently-added features), Google Refine and of course Google Docs to see that it’s pursuing a self-service, information-worker-led vision of BI that’s very similar to the one that Microsoft is pursuing with PowerPivot and Data Explorer.
Earlier this month Google announced the launch of BigQuery and I decided to take a look. Why would a Microsoft BI loyalist like me want to do this, you ask? Well, there are a number of reasons:
It looked like fun.
It’s actually described by Google themselves as an OLAP system here. I don’t agree with this classification – I think it’s better to describe it as an analytical database – but it was enough to make an old OLAP fan like me curious. It’s also a column-store database. While it’s hardly a complete BI solution in itself, it could easily be used as the back-end for one, and I believe the French BI vendor BIME already supports it as a data source.
I’ve argued in the past that the only reason anyone would want to do cloud-based BI is to scale way beyond whatever on-premises hardware they could afford, and Google does make some impressive claims for its scalability: it should be able to provide quick response times for queries on terabytes of data. After all, if I can handle hundreds of millions of rows of data in PowerPivot on my laptop then I’m only going to be tempted by a cloud-based solution if I need to work with much larger data volumes.
It’s based on Dremel, and I’ve heard that one of the developers that works on Dremel is someone we used to know well in the world of Microsoft BI.
To test it out I thought I’d see how it handled the largest csv file I happened to have handy: a 1.2GB dataset consisting of two integer columns and 86,220,856 rows. I wasn’t able to load the file directly into BigQuery because it was too large (smaller files can be loaded in directly), so I first had to upload it to Google Cloud Storage and then load it into BigQuery from there. The upload to Google Cloud Storage took about an hour, and once that was complete it took about 15 minutes to load the data into BigQuery. Here’s the dialog for loading data into BigQuery – it’s pretty straightforward, as you can see:
The first query I ran was a simple count(*) to find the number of rows in the table, and that took a respectable 2.9 seconds:
Finding the number of distinct values in one of the columns was a little slower, at 3.9 seconds:
As was getting the sum of the values in a column, at 4.0 seconds:
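For reference, these three queries were simple aggregates along the lines of the ones below, shown here running against a made-up miniature stand-in table in Python’s built-in sqlite3 (standard SQL rather than BigQuery syntax; the Source and Destination column names are the ones from the dataset):

```python
import sqlite3

# Made-up four-row stand-in for the 86-million-row edges table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Edges (Source INTEGER, Destination INTEGER)")
conn.executemany("INSERT INTO Edges VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 1), (1, 3)])

print(conn.execute("SELECT count(*) FROM Edges").fetchone()[0])                # 4 rows
print(conn.execute("SELECT count(DISTINCT Source) FROM Edges").fetchone()[0])  # 3 distinct values
print(conn.execute("SELECT sum(Destination) FROM Edges").fetchone()[0])        # sum is 9
```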
While not astoundingly fast, the fact that all these queries executed in under 5 seconds means that performance is good enough for ad hoc data analysis.
Next, I thought I’d have a crack at the problem that this dataset was actually intended for: the scenario described in this post on the Vertica blog from last year:
Ignoring the my-technology-is-better-than-yours tone of this post, I thought this was an interesting problem: easy to understand but difficult to calculate quickly, and a good test for any product with pretensions to being a ‘big data’ platform. The two columns in the csv file represent the ids of members of a social network, with each row representing a relationship between two people. The problem to solve is to find how many ‘triangles’ there are in the data, ie situations where person A has a relationship with person B, person B has a relationship with person C and person C also has a relationship with person A.
It was fairly easy to adapt the SQL in the blog post to the BigQuery syntax, the only slight problem being that it only supports joins between two tables at a time and so you have to use subselects:
select count(*)
from
  (select e2.Source as e2Source, e2.Destination as e2Destination,
          e3.Source as e3Source, e3.Destination as e3Destination
   from (select * from [Edges.EdgesFull]) as e2
   join (select * from [Edges.EdgesFull]) as e3
   on e2.destination = e3.source
   where e2.source < e3.source) as e4
join (select * from [Edges.EdgesFull]) as e1
on e1.destination = e4.e2source and e4.e3destination = e1.source
where e1.source < e4.e2source
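As a sanity check, the same join logic runs fine on a toy graph using Python’s built-in sqlite3 (standard SQL, so plain JOIN and no square-bracket table names; the six-edge graph is made up and stores the one undirected triangle in both directions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Edges (Source INTEGER, Destination INTEGER)")
# Made-up toy graph: the undirected triangle {1, 2, 3}, each edge stored both ways
conn.executemany("INSERT INTO Edges VALUES (?, ?)",
                 [(1, 2), (2, 1), (2, 3), (3, 2), (1, 3), (3, 1)])

triangle_sql = """
SELECT count(*)
FROM (SELECT e2.Source AS e2Source, e3.Destination AS e3Destination
      FROM Edges AS e2
      JOIN Edges AS e3 ON e2.Destination = e3.Source
      WHERE e2.Source < e3.Source) AS e4
JOIN Edges AS e1
  ON e1.Destination = e4.e2Source AND e4.e3Destination = e1.Source
WHERE e1.Source < e4.e2Source
"""
print(conn.execute(triangle_sql).fetchone()[0])  # prints 1: one triangle found
```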
On a small test dataset everything worked OK, but on the full dataset I ran the query and… got an error:
Reading the small print in the docs I realised I’d run into the following limitation:
BigQuery provides real-time query performance for JOIN statements where one of the two sides is small. Here small means less than 8MB of compressed data; compression rates vary, but are usually in the range of 2-10X as compared to the corresponding CSV. We expect this 8MB limit to continue to increase over time. For simplicity, we always require the table on the right side of any JOIN clause to be small. Outer joins with a small table on the "outer" side are unsupported, which means that we only support left outer joins.
Clearly my entire table was going to be much more than this limit: even at 10x compression, my 1.2GB of csv would come to around 120MB, fifteen times the 8MB maximum. I had a go at filtering the data so that the first column was less than 2000 (the max value in each column is 4,847,570) and that worked, returning in 23.1 seconds:
select count(*)
from
  (select e2.Source as e2Source, e2.Destination as e2Destination,
          e3.Source as e3Source, e3.Destination as e3Destination
   from (select * from [Edges.EdgesFull] where source<2000) as e2
   join (select * from [Edges.EdgesFull] where source<2000) as e3
   on e2.destination = e3.source
   where e2.source < e3.source) as e4
join (select * from [Edges.EdgesFull] where source<2000) as e1
on e1.destination = e4.e2source and e4.e3destination = e1.source
where e1.source < e4.e2source
A bit disappointing, but maybe this is a problem more suited to Pregel than Dremel? Certainly in more traditional OLAP scenarios when you need to join a fact table to a dimension table, many dimension tables will be smaller than 8MB when compressed so this limitation wouldn’t be such an issue.
Overall I was impressed with the performance and ease-of-use of BigQuery, and I’ll be interested to see how it develops in the future and integrates with the rest of the Google stack (it’s already possible to hook it up to Google docs with a bit of coding). I will, of course, be equally interested to see what Microsoft’s cloud BI and Office strategy comes up with to counter this.
I’ve been following BigQuery and Google’s BI efforts for a while, and it looks like it’s only a matter of time before they have something resembling a pivot table built into a Google Docs spreadsheet – and that would be a very significant development in the world of web-based BI. Think about it: vast amounts of data stored in BigQuery, accessible through a SQL-like interface and able to be queried very quickly; a spreadsheet on the front; all that’s needed is some kind of OLAP-like layer in between to make it easy for end users to build their own queries. And Google have already done things close to this with Fusion Tables and their collaboration with Panorama.
Microsoft really need an answer to this. We know SSAS in the cloud is coming, and we have the Excel web app, but again we need pivot table support in the web app to complete the picture (see here and here for more discussion on this topic). I hope it comes soon!