About a year ago I wrote a post looking at Google BigQuery which finished on a bum note when I ran into a limitation with the size of tables that could be used in a join. Recently I found out that that particular limitation had been lifted with the introduction of the new JOIN EACH operator, and since my account was still active and the data was still uploaded, I thought I’d see if my query could be made to run.
Just as a bit of background (if you can’t be bothered to read my previous post): the query runs on a sample data set originally created by Vertica (see here) that consists of a 1.2GB csv file with two integer columns and 86,220,856 rows. The rows represent the edges in a social network, and the aim of the query is to find the number of triangles – the number of cases where person A is connected to person B, B is connected to C and A is also connected to C. The query joins this large table to itself not once, but twice.
Anyway, to cut a long story short the query did work this time. Here’s the new version, rewritten to use JOIN EACH:
select count(*)
from
(select
e2.Source as e2Source, e2.Destination as e2Destination,
e3.Source as e3Source, e3.Destination as e3Destination
from
(select * from [Edges.EdgesFull]) as e2
join each
(select * from [Edges.EdgesFull]) as e3
on e2.destination = e3.source
where e2.source < e3.source) as e4
join each
(select * from [Edges.EdgesFull]) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source
Here’s the evidence:
It took 98.7 seconds to run, which I think is pretty good. Some remarks:
- I’m not 100% sure this is the correct result, but I’m fairly confident.
- I’m a long way from being 100% sure this is the most efficient way to write the query.
- When the original Vertica blog post came out it sparked a p*ssing contest between vendors who wanted to to show they could do better than Vertica. You can see an example of other people’s results here; clearly other platforms could do a lot better, and remember these results are two years old.
- Obviously you can’t judge the performance of a database on one query.
That’s all. My mind is now at rest, and I’ll go back to being a Microsoft fanboy tomorrow.
Feels like good old times to read in Chris blog about features I implemented. Chris – please venture more outside of Microsoft walled BI garden, there are many interesting things there!
What would really make it like old times would be if you rewrote the query so it executed in 0.00000001 seconds!
Cool! I posted a link to this post in http://www.reddit.com/r/bigquery/comments/1iggfa/counting_triangles_in_a_social_graph_the_vertica/.
Can you make the dataset public? Then everyone can play with this query :).
(sharing is free in BQ)
Hi Felipe,
You can get the source data from here: https://github.com/vertica/Graph-Analytics—-Triangle-Counting
Regards,
Chris
Oh yes, raw data is cool. But a cooler feature of BigQuery (not too well known – yet), is that you can share your already ingested datasets. So if you are having the Edges.EdgesFull table still stored, it takes just 3 clicks to share it with the world: http://i.imgur.com/sSQIF50.png
Then anyone can run the queries you wrote, without the need to download and re-upload the full dataset :).
Thanks for doing this!
OK, I’ve done that now. Is there some kind of link to the dataset that I should share?
Cool! Now I just need to know your project id, to know in which namespace I can find this table.
So instead of writing:
select * from [Edges.EdgesFull]) as e2
I need to write:
select * from [your-project-id:Edges.EdgesFull]) as e2
Thanks!
OK, here’s the ID:
978003128907
Thanks! The equivalent query that now anyone can run is:
select count(*)
from
(select
e2.Source as e2Source, e2.Destination as e2Destination,
e3.Source as e3Source, e3.Destination as e3Destination
from
(select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e2
join each
(select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e3
on e2.destination = e3.source
where e2.source < e3.source) as e4
join each
(select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source
A cleaner one would be:
select count(*) from
(select
e2.Source as e2Source, e3.Destination as e3Destination
from [christopherpaulwebbuk:Edges.EdgesFull] as e2
join each [christopherpaulwebbuk:Edges.EdgesFull] as e3
on e2.destination = e3.source
where e2.source < e3.source) as e4
join each [christopherpaulwebbuk:Edges.EdgesFull] as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source
(but it doesn't get better performance, talking to the engineering team about it)
Thanks for doing this? What t-shirt size would you like? 🙂
Oops:
Thanks for doing this? –> Thanks for doing this!
No problem! If you can rewrite the query to make it faster, I would be very interested.
Waiting a year is one way to make it faster: Just took 40 seconds here. 🙂
select count(*)
from
(select
e2.Source as e2Source, e2.Destination as e2Destination,
e3.Source as e3Source, e3.Destination as e3Destination
from
(select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e2
join each
(select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e3
on e2.destination = e3.source
where e2.source < e3.source) as e4
join each
(select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source