A few weeks ago I wrote an article introducing dataflows in Power BI and discussed why you should care about them. There are a few questions that have been spinning around in my head since then, specifically:
- What are the implications of the new dataflows for the Power BI Golden Dataset?
- Should I extract my queries from my Power BI workbooks and create dataflows instead?
I will come back to question 1 another time. Today I am going to explain how and why you may want to move your queries to dataflows.
Why Would You Want to Move Queries to Dataflows?
Power Query is a user-friendly ETL tool (Extract, Transform and Load). Traditionally, ETL has been done using more complicated tools (such as SQL Server Integration Services – SSIS), and the resulting data is stored in a data mart or data warehouse for consumption by anyone who needs a standard view of the data.
Power BI Desktop can consume tables directly from a data warehouse and simply load the table into Power BI – dead easy.
But Power Query is also a powerful ETL tool in its own right, and it can be used to transform and reshape the source data directly inside Power BI Desktop (and then PowerBI.com) – see the short example after this list. This is very useful if:
- You don’t have a data warehouse, and/or
- You need some variation of what is in your data warehouse, and/or
- You have other data sources that are not in a data warehouse but are still important to you.
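To make this concrete, here is a minimal sketch of the kind of shaping you might do in Power Query. The file path, table and column names are all made up for illustration:

```
let
    // Hypothetical source: a CSV extract that is not in any data warehouse
    Source = Csv.Document(File.Contents("C:\Data\RegionalSales.csv"), [Delimiter = ",", Encoding = 65001]),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // Keep only the columns needed for reporting
    Selected = Table.SelectColumns(Promoted, {"OrderDate", "Region", "SalesAmount"}),
    // Remove rows with no sales value
    Filtered = Table.SelectRows(Selected, each [SalesAmount] <> null)
in
    Filtered
```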
Taking this approach (manipulating the data in Power Query) is perfectly fine if you have a single workbook, but what if you have 10 similar workbooks all needing the same transformation?
Worse still, what if you are one of many people in a company all doing the same thing with multiple workbooks?
It is possible that you will end up with hundreds of Power Query queries embedded in workbooks that are either identical or, possibly worse, slightly different.
One of the things that dataflows allow you to do is remove the Power Query code from the Power BI workbook and make it available for consumption by multiple workbooks from the same source. See image below.
How to Move your Queries to Dataflows
I set out to move my queries from a sample workbook into dataflows. First I created a new workspace to store my dataflows (I called it myData). I noticed that when you have an otherwise blank workspace (no workbooks loaded), you will see the welcome screen as shown below. It wasn’t immediately obvious how to create a dataflow from here: all you have to do is click “skip” and you will then be able to create one.
From there I clicked “Dataflows” from the menu at the top of the page, and then the create button.
I copied the queries from my current PBIX file one at a time. To do this, I just went into the query editor in Power BI Desktop, right-clicked on a query (1 below) and selected copy (2).
I then went back to PowerBI.com and added a new blank query data source inside the dataflow.
I then pasted the code from the clipboard directly over the starting Power Query code in the dataflow as shown below.
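To give an idea of what gets pasted, a simple query might look something like this sketch. The server, database and table names below are placeholders, not my real ones:

```
let
    // Placeholder server and database names
    Source = Sql.Database("MyServer", "AdventureWorks"),
    // Navigate to the required table
    Sales = Source{[Schema = "dbo", Item = "FactInternetSales"]}[Data],
    // Any shaping steps from the original query come across unchanged
    Renamed = Table.RenameColumns(Sales, {{"SalesAmount", "Total Sales"}})
in
    Renamed
```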
In my case, I am using a very simple query. Given that, it was probably easier to create the new queries by connecting directly to my SQL database from inside the dataflow rather than using copy and paste.
If you have a more complex query, then copy and paste is definitely the way to go.
When I pasted the code, I was prompted to select a gateway. I already had an enterprise gateway configured to my PC/Server so I selected it from the list.
Note: dataflows will not work with a personal gateway. If you have a personal gateway, you may want to consider uninstalling it and installing the enterprise version instead.
Don’t be scared; it is not hard to do, and it is not too different from the Personal Gateway.
I then had to enter my credentials for the PC/Server to make sure the dataflow could access the gateway.
From there it was just a matter of repeating the process for all the other tables in my PBIX file. In my case I was just connecting to my SQL Server DB, so it was faster to create a brand new entity for each of my SQL tables directly as a new SQL Database source, and that is what I did.
Create a new SQL connection
To do this, I simply added a new data source and selected SQL Server.
I then entered the details for the database AND the credentials for the gateway.
Follow the normal Power Query steps to complete the import for the other tables.
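Each entity created this way ends up as a short query, roughly like this sketch (again, the names are placeholders):

```
let
    // One entity per SQL table; placeholder server, database and table names
    Source = Sql.Database("MyServer", "AdventureWorks"),
    Customers = Source{[Schema = "dbo", Item = "DimCustomer"]}[Data]
in
    Customers
```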
New Dataflows Are Ready
At this point, I had my 5 tables configured as dataflows as shown below. Technically each of these “tables” is called an “entity” inside dataflows.
This is a clue that dataflows can do a lot more than just serve up a table to a user (like a data warehouse).
Refresh the Tables
I then saved the dataflow (I called it AdventureWorks) and ran the refresh when prompted. Just a reminder for those of you not in the USA: you will need to change the Locale of your connections so that things such as dates are treated correctly. The default is English (US) and I need English (Australia).
This change should only have to be done once for each data source, but there is currently a bug that prevents the change from being saved.
I know MS is working on this. Given it isn’t currently working correctly, I actually had to change the date columns to data type “text”, otherwise the queries always failed when they found a date like 28/1/2004.
Once the data was loaded as text into Power BI Desktop, I was able to change the data type back to date again, picking up the correct Locale information from my PC. Hardly ideal, but it is a workaround until this bug is fixed.
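If it helps, here is a minimal sketch of that conversion step as applied in Power BI Desktop. The inline #table simply stands in for the entity loaded from the dataflow, and the column name is a placeholder:

```
let
    // Stand-in for the entity loaded from the dataflow, with the date still typed as text
    Source = #table(type table [OrderDate = text], {{"28/1/2004"}}),
    // Convert to a proper date using the English (Australia) locale so 28/1/2004 parses correctly
    Typed = Table.TransformColumnTypes(Source, {{"OrderDate", type date}}, "en-AU")
in
    Typed
```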
Re-point the Tables in the PBIX
There are a few ways you can do this, but the easiest way is to get under the hood into the code in Power Query. The process is to:
- Create a new connection to one of the entities in the new dataflow from Power BI Desktop (get data\Power BI Dataflows). Repeat for all the other tables.
- Then go into Power Query (edit queries), select the Advanced Editor for the first new query, and copy the M code (it will look something like the sketch after this list).
- Then open the query for the original matching table, go into the Advanced Editor and replace the existing code with the new code.
- Repeat the copy and paste process for the other tables.
- Delete the duplicate queries (the new ones, not the original ones).
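For reference, the M code that Power BI Desktop generates for a dataflow entity looks roughly like the sketch below. The exact navigation keys may vary, and the GUIDs and entity name here are placeholders:

```
let
    // Connect to the Power BI dataflows service; null returns all workspaces you can access
    Source = PowerBI.Dataflows(null),
    // Navigate workspace -> dataflow -> entity; the IDs below are placeholder GUIDs
    Workspace = Source{[workspaceId = "aaaaaaaa-0000-0000-0000-000000000001"]}[Data],
    Dataflow = Workspace{[dataflowId = "bbbbbbbb-0000-0000-0000-000000000002"]}[Data],
    Entity = Dataflow{[entity = "DimCustomer"]}[Data]
in
    Entity
```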
The advantages of using this approach include:
- Any calculated columns you have in the table will keep working
- There is no need to move any measures from the tables.
Note: You cannot simply delete the original tables and rename the new tables in Power Query and expect it to work. You must copy and paste the new code into the original query for each table you want to keep.
Publish the Workbook to PowerBI.com
One REALLY cool feature of dataflows is that I can now publish my PBIX file back to PowerBI.com, but it doesn’t need to be published into the same workspace that contains my dataflows.
For this demo, I published my updated version of my workbook to “My Workspace”, not to the myData workspace I created earlier.
Refresh Dataflows and Workbooks
The last thing I did was to set up scheduled refreshes for both the dataflows and the workbooks. Of course, I needed the scheduled refresh for the dataflow to run and complete first, followed by the refresh of the workbook.
Dataflows Refresh
I switched back to the dataflows workspace and selected schedule refresh.
I set the schedule to refresh each day at 8:00 am.
I then switched back to My Workspace and did the same thing with the workbook dataset.
Before moving on, I had to set up my credentials.
I clicked “edit credentials” and the following window popped up. At first I was confused, as I tried to edit the data in the first 2 text boxes, but that is not what is needed. Just click “sign in” and then log into your account.
After logging in, I was able to schedule the dataset refresh. I set it to 8:30 am, giving the dataflow enough time to refresh first.
Benefits
The benefits of doing it this way include:
- You have 1 master query for each table.
- You can load these tables into multiple workbooks reusing a common source.
- If you need to edit the query, you edit it in one place and it will flow through to every workbook.
- Other people can use the same queries (as long as they are given access to the workspace containing the dataflows).
- There is one version of the truth.
- Dataflows can use traditional data warehouses as the source, but they can also use Excel spreadsheets and CSV files. In this sense, dataflows can extend the concept of an enterprise data warehouse to a broader set of data, including data not managed by IT.
- There is less load on the gateway, as each source query passes through the gateway only once (when the dataflow refreshes) instead of potentially many times (once for every workbook that refreshes).
What do you think?
I would love to hear what you think about setting up your queries this way. Do you agree it is a good approach? What other benefits do you see?