Attaching Manually-Created Common Data Model Folders In Power BI Dataflows

Most of the focus on Power BI dataflows so far has been on the use of Power Query Online to load data from external data sources. However, when you create a dataflow you also have the option to attach an existing Common Data Model (CDM) folder stored in Azure Data Lake Storage Gen2:

image

The documentation has some very detailed information on this subject: you can read how to configure Power BI to use your own Azure Data Lake Storage Gen2 account here, and you can read about how to attach an existing CDM folder here.

The interesting thing for me is that this means that any external service – whether it’s a Microsoft service, a third party service or something you build yourself – can output data into a folder in Azure Data Lake Storage Gen2, and so long as it’s in the right format you can expose this data as a dataflow in Power BI. So, I thought, why not create some data manually, put it in the cloud and see if I can get Power BI to attach it as a dataflow?

It turns out it isn’t too difficult to do, and what’s more when you look at the documentation for the Common Data Model and specifically the documentation for the format of the model.json file there are all kinds of interesting features to check out.

Consider the following folder containing three files, all created manually and uploaded to my Azure Data Lake Storage Gen2 account using Azure Storage Explorer:

image

The two csv files are pretty simple:

image

image

The model.json file, which contains the metadata for the dataflow, is also not that complicated:

image

This model.json file describes a single entity called Sales with three columns: Month, Product and Sales. It also specifies that the data for this entity is stored in two partitions and these partitions are linked to the two csv files above (I’m pretty sure this is how incremental refresh is handled in dataflows too). I like the idea of another application being able to save new csv files to a folder and, after updating the model.json file to add a new partition, this data being appended onto the table of data returned by the entity.

It took a few goes to attach the folder to a dataflow, mostly because when it fails the error messages (at least at the time of writing) are completely unhelpful. Two things I ran into are:

  • Any spaces in folder or file paths need to be url encoded, ie replaced with %20
  • You really do need to grant permissions on the folder to the Power BI users who are going to access the dataflow. RTFM.

After that, the dataflow is visible in the Power BI service:

image

…and the data from the Sales entity can be loaded into your dataset in Power BI Desktop:

image

image

As I said, there’s more for me to research here. For example, looking in the documentation, you can create two types of entities – Local and Referenced (are the latter the same as Linked Entities?) – and you can also define relationships between entities (do these become relationships between tables in Power BI?). I think this opens up a lot of new possibilities for close integration between Power BI and other services!

25 responses

  1. Pingback: Attaching Manually-Created Common Data Model Folders In Power BI Dataflows — Chris Webb’s BI Blog | MS Excel | Power Pivot | DAX | SSIS |SQL

  2. I’m curious to know how the model.json file will work in conjunction with parquet files considering they also store metadata, e.g., field types, inside them.

      • Fingers crossed. However, I’d be surprised if it’s soon because one big limitation of Power Query for the past few years is that it cannot natively read parquet files. I assume they’ll either: (a) refactor the existing Mashup engine; or (b) fold M to a Spark proxy, e.g., Databricks. Either of this options ain’t easy.

      • I don’t think has anything to do with the Power Query engine – with an attached CDM folder there’s no need to use Power Query Online, and in Power BI Desktop my guess is that you’ll be going through a standard API to get data from the dataflow, however it’s created, rather than reading the parquet files directly.

  3. Pingback: Attaching Manually-Created Common Data Model Folders In Power BI Dataflows — Chris Webb’s BI Blog – SutoCom Solutions

  4. If Microsoft separates the SQL Serve storage from the engine and the storage can use Azure Data Lake with CDM metadata then you could go hog wild with SQL over CDM. What a dream. Chris, have you tried to define the relationship metadata in JSON and seen if Power Query can understand it? I doubt it at this point. I think only PowerApps CDS at this point fully utilizes the entire specification.

    • Suhail – With Azure SQL Data Warehouse, we do have PolyBase which can access remote data in ADLS Gen2 via an external table definition. Remote queries via data virtualization like that don’t perform optimally because the SQL engine has to pull full data files over in order to perform a where statement which retrieves only partial data from a file (i.e., there’s no compute engine natively in ADLS Gen2 currently to perform data pruning or predicate pushdown before the data goes over to the SQL database). Having said all that, the functionality is available so theoretically you can already start going a teensy bit wild with SQL over CDM. =) (For the sake of completeness I feel the need to say…PolyBase is strongly encouraged as the way to *load* data to Azure SQL DW, but use PolyBase with caution for remote queries for the reasons I mentioned above. There’s a lot more to it than that, but that’s the gist.)

      • I forgot about Polybase. That does allow SQL over CDM. I had a project using PowerApps and I was shuttling data between the CDM and SQL Server via Microsoft Flow because I had to do some complex calculations that could easily be done with the full breadth of T-SQL. Yes, there’s also a huge comfort level and proficiency I’ve built up with SQL where as using Power Query, Flow and the PowerApps business rules engine could have solved but would’ve taken longer to puzzle it out. Doing reporting off of CDM via Polybase, Power Query or staging the data using Azure Data Factory I can handle and fully agree staging the data in SQL is the way to go for data warehousing. It’s the complex read/write business logic where I have trouble but then there’s such a big ecosystem around CDM now that I want to be leveraging it for reporting and application development.

  5. Pingback: Defining Relationships Between Entities In The Common Data Model To Automatically Create Relationships In Power BI « Chris Webb's BI Blog

  6. Pingback: Attaching Manually-Created Common Data Model Folders In Power BI Dataflows | Pardaan.com

  7. Hi Chris,

    Nice posts 🙂

    Maybe you have some knowledge of an issue which I am facing with Data Flows using a on-prem source on a shared capacity.

    I have stumpled upon a (to me) strange error. I have a PowerBI Dataflow storing data on an underlying ADL Gen2 Storage. The Workspace is not on a Premium Capacity and the source data resides on a on-premise Oracle. So I have setup an on-prem gateway in order to fetch the data.

    Running a refresh (manual or scheduled) produces the following error:

    Refresh can’t run: This dataflow uses Azure Data Lake Storage Gen2 and an associated gateway with shared capacity. You can enable refresh by removing the gateway or by upgrading this workspace to Power BI Premium capacity.

    Of course removing the gateway makes no sense as I need this for the on-prem data. And the documentation says nothing about premium capacity required for refresh from on-prem sources.

    As of now I have had to create the data flow as an external data flow. Just like you have demonstrated in this post. Instead of loading files via PBI Dataflow I use ADF and manually maintain the model.json which is pretty irritating as the Oracle source is an old Ax a lot of columns in each entity.

    Any insights from you on the above?

  8. Any idea whether or not the ADLS gen2 storage account has to be in the same tenant as Power BI?

    I’m testing via data from a different tenant but am getting a 500 error when trying to create it. I’ve set up the read/execute right via ASE for the Power BI user, the Json URL is in the right format and the file is simple like yours…

      • Both. I’ve tried creating the storage account in the same Azure Region but it still fails. I’ll have to try getting a storage account created in the same Azure Directory/Tenant to see if that works. So far no luck across different tenants.

      • Are you using dfs or blob as your endpoint? Should be dfs since its ADLS gen 2 but it seems to vary in the documentation…and ive noticed the self-hosted PBI storage uses the blog endpoint.

      • Chris – I got ADLS deployed in the same tenant/directory as PBI and it works!

        Also I tested creating a CDM folder with a reference entity from another CDM folder and it worked! This seems to be how MSFT is deploying “Linked Entities”…but this works with only Pro! The key is in pointing to the other CDM json file and then defining your own modelID which gets referenced in the entity section.

        Amazing potential here…

  9. Hi, I’ve tried the same steps but keep receiving the message below when adding the CDM folder to dataflow:

    CDM folder could not be added.
    You cannot add the CDM folder, because you do not have permission to it. Please contact the CDM folder creator to grant you access and try again.

    Already gave access to my user and to power bi groups on the folder, files under it and the file system

    Any help would really be appreciated

    Thanks

    • …in time, with one difference, I’m not using “powerbi” file system, I’m using another file system I’ve already had on the lake

      Regards

  10. Made it!

    But only using “powerbi” file system, anyone made it with file systems already existing, just by changing access rights?

    Another problem was that I had to use csv. Semicolon separated columns didn’t work either.

    And still doesn’t work with the parquet 😦

    Waiting for the day it connects to Gen 2 as easy as to Gen 1 nowadays

    Sheers

  11. Actually it worked, all i had to do was give some access rights, the same that are given when you use Power BI to create a CDM

    It’s a good shortcut to make some discovery levereaging data you already have on the lake, also delivering good viz already

    Could automate model.json creation on ingestion, and data would be ready to use right away

    Regards

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: