Benford’s Law And Power Query

Probably my favourite session at SQLBits the other week was Professor Mark Whitehorn on exploiting exotic patterns in data. One of the things he talked about was Benford’s Law, something I first heard about several years ago (in fact I’m sure I wrote a blog post on implementing Benford’s Law in MDX but I can’t find it), about the frequency distribution of digits in data. I won’t try to explain it myself but there are plenty of places you can read up on it, for example: . I promise, it’s a lot more interesting that it sounds!

Anyway, it struck me that it would be quite useful to have a Power Query function that could be used to find the distribution of the first digits in any list of numbers, for example for fraud detection purposes. The first thing I did was write a simple query that returned the expected distributions for the digits 1 to 9 according to Benford’s Law:

    //function to find the expected distribution of any given digit
    Benford = (digit as number) as number => Number.Log10(1 + (1/digit)),
    //get a list of values between 1 and 9
    Digits = {1..9},
    // get a list containing these digits and their expected distribution
    DigitsAndDist = List.Transform(Digits, each {_, Benford(_)}),
    //turn that into a table
    Output = #table({"Digit", "Distribution"}, DigitsAndDist)



Next I wrote the function itself:

//take a single list of numbers as a parameter
(NumbersToCheck as list) as table=>
    //remove any non-numeric values
    RemoveNonNumeric = List.Select(NumbersToCheck, 
                        each Value.Is(_, type number)),
    //remove any values that are less than or equal to 0
    GreaterThanZero = List.Select(RemoveNonNumeric, each _>0),
    //turn that list into a table
    ToTable = Table.FromList(GreaterThanZero, 
                        Splitter.SplitByNothing(), null, null, 
    RenameColumn = Table.RenameColumns(ToTable,{{"Column1", "Number"}}),
    //function to get the first digit of a number
    FirstDigit = (InputNumber as number) as 
                    number => 
    //get the distributions of each digit
    GetDistributions = Table.Partition(RenameColumn, 
                    "Number", 9, FirstDigit),
    //turn that into a table
    DistributionTable = Table.FromList(GetDistributions, 
                    Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    //add column giving the digit
    AddIndex = Table.AddIndexColumn(DistributionTable, "Digit", 1, 1),
    //show how many times each first digit occurred
    CountOfDigits = Table.AddColumn(AddIndex, 
                    "Count", each Table.RowCount([Column1])),
    RemoveColumn = Table.RemoveColumns(CountOfDigits ,{"Column1"}),
    //merge with table showing expected distributions
    Merge = Table.NestedJoin(RemoveColumn,{"Digit"},
    ExpandNewColumn = Table.ExpandTableColumn(Merge, "NewColumn", 
                            {"Distribution"}, {"Distribution"}),
    RenamedDistColumn = Table.RenameColumns(ExpandNewColumn,
                            {{"Distribution", "Expected Distribution"}}),
    //calculate actual % distribution of first digits
    SumOfCounts = List.Sum(Table.Column(RenamedDistColumn, "Count")),
    AddActualDistribution = Table.AddColumn(RenamedDistColumn, 
                            "Actual Distribution", each [Count]/SumOfCounts)

There’s not much to say about this code, apart from the fact that it’s a nice practical use case for the Table.Partition() function I blogged about here. It also references the first query shown above, called Benford, so that the expected and actual distributions can be compared.

Since this is a function that takes a list as a parameter, it’s very easy to pass it any column from any other Power Query query that’s in the same worksheet (as I showed here) for analysis. For example, I created a Power Query query on this dataset in the Azure Marketplace showing the number of minutes that each flight in the US was delayed in January 2012. I then invoked the function above, and pointed it at the column containing the delay values like so:


The output is a table (to which I added a column chart) which shows that this data follows the expected distribution very closely:


You can download my sample workbook containing all the code from here.

2 thoughts on “Benford’s Law And Power Query

  1. Fascinating example! I wonder if it also might be useful for working with generated sample data. For example, you could validate existing actual sales data to see what the adherence is to Benford’s law, then use that distribution as part of creating sample/random sales data. There are other ways to create randomized sample data off of existing data, but if you wanted to create samples for something that didn’t already exist (like a new market or new product), you’d have a better starting place if you new the distribution for existing numbers.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s