Best Practices For Power BI On Databricks Webinar

I recently took part in a webinar with Denny Lee, Liping Huang and Marius Panga from Databricks on the subject of best practices for using Power BI on Databricks. You can view the recording on LinkedIn here:

https://www.linkedin.com/video/live/urn:li:ugcPost:7174102151407779841

…or on YouTube here:

My section at the beginning, covering Power BI best practices for Import and DirectQuery, doesn’t contain any new information – if you’ve been following the DirectQuery posts on this blog or read the DirectQuery guidance docs here and here, there won’t be any surprises. What I thought was really useful, though, was hearing the folks from Databricks talk about best practices on the Databricks side, which took up the majority of the webinar. Definitely worth checking out.

ApproximateDistinctCount DAX Function Now Works On More DirectQuery Sources

Some good news for those of you using DirectQuery mode in Power BI: the ApproximateDistinctCount DAX function, which returns an estimate of the number of distinct values in a column and which can be a lot faster than a true distinct count as returned by the DistinctCount function, is now available to use with BigQuery, Databricks and Snowflake sources. Previously it only worked with Azure SQL DB and Synapse; Redshift support is coming soon. You can use it in exactly the same way you would use the DistinctCount function, except that it only works in DirectQuery mode.

For example, I have a Power BI DirectQuery dataset that uses the New York Taxi data in Snowflake as its source. With the following two DAX measures defined on the Trip table:

Approximate Distinct Medallions = APPROXIMATEDISTINCTCOUNT('TRIP'[MEDALLIONID])
Distinct Medallions = DISTINCTCOUNT('TRIP'[MEDALLIONID])

…I can build a table visual that compares the output of the two measures (the difference turns out not to be that big):

…and see that the ApproximateDistinctCount DAX function is translated to the APPROX_COUNT_DISTINCT function in Snowflake SQL:

select { fn convert(count(distinct("MEDALLIONID")), SQL_DOUBLE) } + { fn convert(max("C1"), SQL_DOUBLE) } as "C1",
    approx_count_distinct("MEDALLIONID") as "C2"
from 
(
    select "DATEID",
        "MEDALLIONID",
        "HACKNEYLICENSEID",
        "PICKUPTIMEID",
        "DROPOFFTIMEID",
        "PICKUPGEOGRAPHYID",
        "DROPOFFGEOGRAPHYID",
        "PICKUPLATITUDE",
        "PICKUPLONGITUDE",
        "PICKUPLATLONG",
        "DROPOFFLATITUDE",
        "DROPOFFLONGITUDE",
        "DROPOFFLATLONG",
        "PASSENGERCOUNT",
        "TRIPDURATIONSECONDS",
        "TRIPDISTANCEMILES",
        "PAYMENTTYPE",
        "FAREAMOUNT",
        "SURCHARGEAMOUNT",
        "TAXAMOUNT",
        "TIPAMOUNT",
        "TOLLSAMOUNT",
        "TOTALAMOUNT",
        case
            when "MEDALLIONID" is null
            then CAST(1 as INTEGER)
            else CAST(0 as INTEGER)
        end as "C1"
    from "NYCDATA_DB"."NYCTAXIDATA"."TRIP"
) as "ITBL"
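If you want to quantify how close the approximation is, here’s a quick sketch of a third measure built on the two above (the measure name is my own addition):

Distinct Count Error % =
// Relative difference between the approximate and exact distinct counts;
// DIVIDE returns blank if the exact count is zero
DIVIDE (
    [Approximate Distinct Medallions] - [Distinct Medallions],
    [Distinct Medallions]
)

Bear in mind that in DirectQuery mode this measure fires both the exact and the approximate queries against the source, so it’s only really useful for checking accuracy during development.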

Distinct counts are often the slowest type of measure, but in my experience report consumers are very unwilling to accept seeing “near enough” numbers in their reports rather than numbers that are 100% accurate, even if the approximate distinct counts are much faster. I heard someone suggest using field parameters to let report consumers switch between fast approximate distinct counts for exploration and accurate distinct counts when they really need them, and I think this is a great compromise.
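If you want to try that idea, here’s a minimal sketch of the kind of calculated table the field parameters feature generates; the parameter table name is my own, and in practice you’d create it via Modeling > New parameter > Fields in Power BI Desktop, since the UI also sets the extended metadata that makes visuals treat the table as a field parameter:

Distinct Medallions Parameter = {
    // Each row is (display name, measure reference, sort order)
    ("Approximate Distinct Medallions", NAMEOF('TRIP'[Approximate Distinct Medallions]), 0),
    ("Distinct Medallions", NAMEOF('TRIP'[Distinct Medallions]), 1)
}

A slicer on this table then lets report consumers flip a visual between the fast approximate count and the exact one.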

Bonus links: if you need to do an approximate distinct count in Import mode, Phil Seamark shows how to do that here; he also wrote a great post on building aggregations for distinct counts (read it carefully – this is a really powerful technique!) here.