Probably my favourite session at SQLBits the other week was Professor Mark Whitehorn on exploiting exotic patterns in data. One of the things he talked about was Benford’s Law, something I first heard about several years ago (in fact I’m sure I wrote a blog post on implementing Benford’s Law in MDX but I can’t find it), about the frequency distribution of digits in data. I won’t try to explain it myself but there are plenty of places you can read up on it, for example: http://en.wikipedia.org/wiki/Benford%27s_law . I promise, it’s a lot more interesting that it sounds!

Anyway, it struck me that it would be quite useful to have a Power Query function that could be used to find the distribution of the first digits in any list of numbers, for example for fraud detection purposes. The first thing I did was write a simple query that returned the expected distributions for the digits 1 to 9 according to Benford’s Law:

[sourcecode language=”text” padlinenumbers=”true”]

let

//function to find the expected distribution of any given digit

Benford = (digit as number) as number => Number.Log10(1 + (1/digit)),

//get a list of values between 1 and 9

Digits = {1..9},

// get a list containing these digits and their expected distribution

DigitsAndDist = List.Transform(Digits, each {_, Benford(_)}),

//turn that into a table

Output = #table({"Digit", "Distribution"}, DigitsAndDist)

in

Output

[/sourcecode]

Next I wrote the function itself:

[sourcecode language=”text”]

//take a single list of numbers as a parameter

(NumbersToCheck as list) as table=>

let

//remove any non-numeric values

RemoveNonNumeric = List.Select(NumbersToCheck,

each Value.Is(_, type number)),

//remove any values that are less than or equal to 0

GreaterThanZero = List.Select(RemoveNonNumeric, each _>0),

//turn that list into a table

ToTable = Table.FromList(GreaterThanZero,

Splitter.SplitByNothing(), null, null,

ExtraValues.Error),

RenameColumn = Table.RenameColumns(ToTable,{{"Column1", "Number"}}),

//function to get the first digit of a number

FirstDigit = (InputNumber as number) as

number =>

Number.FromText(Text.Start(Number.ToText(InputNumber),1))-1,

//get the distributions of each digit

GetDistributions = Table.Partition(RenameColumn,

"Number", 9, FirstDigit),

//turn that into a table

DistributionTable = Table.FromList(GetDistributions,

Splitter.SplitByNothing(), null, null, ExtraValues.Error),

//add column giving the digit

AddIndex = Table.AddIndexColumn(DistributionTable, "Digit", 1, 1),

//show how many times each first digit occurred

CountOfDigits = Table.AddColumn(AddIndex,

"Count", each Table.RowCount([Column1])),

RemoveColumn = Table.RemoveColumns(CountOfDigits ,{"Column1"}),

//merge with table showing expected distributions

Merge = Table.NestedJoin(RemoveColumn,{"Digit"},

Benford,{"Digit"},"NewColumn",JoinKind.Inner),

ExpandNewColumn = Table.ExpandTableColumn(Merge, "NewColumn",

{"Distribution"}, {"Distribution"}),

RenamedDistColumn = Table.RenameColumns(ExpandNewColumn,

{{"Distribution", "Expected Distribution"}}),

//calculate actual % distribution of first digits

SumOfCounts = List.Sum(Table.Column(RenamedDistColumn, "Count")),

AddActualDistribution = Table.AddColumn(RenamedDistColumn,

"Actual Distribution", each [Count]/SumOfCounts)

in

AddActualDistribution

[/sourcecode]

There’s not much to say about this code, apart from the fact that it’s a nice practical use case for the Table.Partition() function I blogged about here. It also references the first query shown above, called Benford, so that the expected and actual distributions can be compared.

Since this is a function that takes a list as a parameter, it’s very easy to pass it any column from any other Power Query query that’s in the same worksheet (as I showed here) for analysis. For example, I created a Power Query query on this dataset in the Azure Marketplace showing the number of minutes that each flight in the US was delayed in January 2012. I then invoked the function above, and pointed it at the column containing the delay values like so:

The output is a table (to which I added a column chart) which shows that this data follows the expected distribution very closely:

You can download my sample workbook containing all the code from here.

Fascinating example! I wonder if it also might be useful for working with generated sample data. For example, you could validate existing actual sales data to see what the adherence is to Benford’s law, then use that distribution as part of creating sample/random sales data. There are other ways to create randomized sample data off of existing data, but if you wanted to create samples for something that didn’t already exist (like a new market or new product), you’d have a better starting place if you new the distribution for existing numbers.

Yes, that’s a good idea. In fact using Power Query to generate random data is on my list of things to blog about, so I’ll bear that in mind.

This is great way to plot the information! My next question would be how to drill down on this information to see what exactly the outliers are? Do we need to link relationships to the function?