Another Look At Google BigQuery

Chris Webb Google July 5, 2013July 4, 2013 2 Minutes

About a year ago I wrote a post looking at Google BigQuery which finished on a bum note when I ran into a limitation with the size of tables that could be used in a join. Recently I found out that that particular limitation had been lifted with the introduction of the new JOIN EACH operator, and since my account was still active and the data was still uploaded, I thought I’d see if my query could be made to run.

Just as a bit of background (if you can’t be bothered to read my previous post): the query runs on a sample data set originally created by Vertica (see here) that consists of a 1.2GB csv file with two integer columns and 86,220,856 rows. The rows represent the edges in a social network, and the aim of the query is to find the number of triangles – the number of cases where person A is connected to person B, B is connected to C and A is also connected to C. The query joins this large table to itself not once, but twice.

Anyway, to cut a long story short the query did work this time. Here’s the new version, rewritten to use JOIN EACH:

select count(*)
from
(select
e2.Source as e2Source, e2.Destination as e2Destination,
e3.Source as e3Source, e3.Destination as e3Destination
from
(select * from [Edges.EdgesFull]) as e2
join each
(select * from [Edges.EdgesFull]) as e3
on e2.destination = e3.source
where e2.source < e3.source) as e4
join each
(select * from [Edges.EdgesFull]) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source

Here’s the evidence:

It took 98.7 seconds to run, which I think is pretty good. Some remarks:

I’m not 100% sure this is the correct result, but I’m fairly confident.
I’m a long way from being 100% sure this is the most efficient way to write the query.
When the original Vertica blog post came out it sparked a p*ssing contest between vendors who wanted to to show they could do better than Vertica. You can see an example of other people’s results here; clearly other platforms could do a lot better, and remember these results are two years old.
Obviously you can’t judge the performance of a database on one query.

That’s all. My mind is now at rest, and I’ll go back to being a Microsoft fanboy tomorrow.

Tagged
BigQuery

Published by Chris Webb

My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel. View all posts by Chris Webb

Published July 5, 2013July 4, 2013

12 thoughts on “Another Look At Google BigQuery”

mosha says:

July 5, 2013 at 9:29 pm

Feels like good old times to read in Chris blog about features I implemented. Chris – please venture more outside of Microsoft walled BI garden, there are many interesting things there!

Loading...

Reply
1. Chris Webb says:
  
  July 5, 2013 at 9:42 pm
  
  What would really make it like old times would be if you rewrote the query so it executed in 0.00000001 seconds!
  
  Loading...
  
  Reply
Felipe Hoffa (@felipehoffa) says:

July 17, 2013 at 3:02 am

Cool! I posted a link to this post in http://www.reddit.com/r/bigquery/comments/1iggfa/counting_triangles_in_a_social_graph_the_vertica/.

Can you make the dataset public? Then everyone can play with this query :).

(sharing is free in BQ)

Loading...

Reply
1. Chris Webb says:
  
  July 17, 2013 at 10:31 am
  
  Hi Felipe,
  
  You can get the source data from here: https://github.com/vertica/Graph-Analytics—-Triangle-Counting
  
  Regards,
  
  Chris
  
  Loading...
  
  Reply
  1. Felipe Hoffa (@felipehoffa) says:
    
    July 17, 2013 at 6:42 pm
    
    Oh yes, raw data is cool. But a cooler feature of BigQuery (not too well known – yet), is that you can share your already ingested datasets. So if you are having the Edges.EdgesFull table still stored, it takes just 3 clicks to share it with the world: http://i.imgur.com/sSQIF50.png
    
    Then anyone can run the queries you wrote, without the need to download and re-upload the full dataset :).
    
    Thanks for doing this!
    
    Loading...
  2. Chris Webb says:
    
    July 17, 2013 at 11:02 pm
    
    OK, I’ve done that now. Is there some kind of link to the dataset that I should share?
    
    Loading...
  3. Felipe Hoffa (@felipehoffa) says:
    
    July 18, 2013 at 4:32 am
    
    Cool! Now I just need to know your project id, to know in which namespace I can find this table.
    
    So instead of writing:
    
    select * from [Edges.EdgesFull]) as e2
    
    I need to write:
    
    select * from [your-project-id:Edges.EdgesFull]) as e2
    
    Thanks!
    
    Loading...
  4. Chris Webb says:
    
    July 19, 2013 at 12:05 am
    
    OK, here’s the ID:
    978003128907
    
    Loading...
  5. Felipe Hoffa (@felipehoffa) says:
    
    July 19, 2013 at 10:27 pm
    
    Thanks! The equivalent query that now anyone can run is:
    
    select count(*)
    from
    (select
    e2.Source as e2Source, e2.Destination as e2Destination,
    e3.Source as e3Source, e3.Destination as e3Destination
    from
    (select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e2
    join each
    (select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e3
    on e2.destination = e3.source
    where e2.source < e3.source) as e4
    join each
    (select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e1
    on e1.destination = e4.e2source
    and e4.e3destination = e1.source
    where e1.source < e4.e2source
    
    A cleaner one would be:
    
    select count(*) from
    (select
    e2.Source as e2Source, e3.Destination as e3Destination
    from [christopherpaulwebbuk:Edges.EdgesFull] as e2
    join each [christopherpaulwebbuk:Edges.EdgesFull] as e3
    on e2.destination = e3.source
    where e2.source < e3.source) as e4
    join each [christopherpaulwebbuk:Edges.EdgesFull] as e1
    on e1.destination = e4.e2source
    and e4.e3destination = e1.source
    where e1.source < e4.e2source
    
    (but it doesn't get better performance, talking to the engineering team about it)
    
    Thanks for doing this? What t-shirt size would you like? 🙂
    
    Loading...
2. Felipe Hoffa (@felipehoffa) says:
  
  July 19, 2013 at 10:28 pm
  
  Oops:
  Thanks for doing this? –> Thanks for doing this!
  
  Loading...
  
  Reply
  1. Chris Webb says:
    
    July 19, 2013 at 11:52 pm
    
    No problem! If you can rewrite the query to make it faster, I would be very interested.
    
    Loading...
Felipe Hoffa (@felipehoffa) says:

August 20, 2014 at 3:59 am

Waiting a year is one way to make it faster: Just took 40 seconds here. 🙂

select count(*)
from
(select
e2.Source as e2Source, e2.Destination as e2Destination,
e3.Source as e3Source, e3.Destination as e3Destination
from
(select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e2
join each
(select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e3
on e2.destination = e3.source
where e2.source < e3.source) as e4
join each
(select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source

Loading...

Reply

Another Look At Google BigQuery

Like this:

Published by Chris Webb

12 thoughts on “Another Look At Google BigQuery”

Leave a ReplyCancel reply

Share this:

Like this:

Published by Chris Webb

12 thoughts on “Another Look At Google BigQuery”

Leave a ReplyCancel reply

Discover more from Chris Webb's BI Blog