Azure Data Factory (ADF) is a great addition to your cloud-based ETL tool set. However, not all of your data is necessarily accessible from the public internet. These instructions go through the steps required to give ADF access to your internal or VNet datasets.
For ADF, we need to set up and configure an Integration Runtime service (formerly called the Data Management Gateway) behind the firewall. This provides secure communication and data transfer between your ADF and your internal data sources.
1. Setting up the Azure Data Factory Integration Runtime
From the Edit view of your Azure Data Factory:
- Select ‘Connections’ at the bottom of the left-hand menu
- On the right-hand side, select the ‘Integration Runtimes’ tab
- Click ‘+ New’
- Select the ‘Perform data movement and dispatch activities to external computes.’ option
- Then select ‘Private Network’
- Then give it a name and description
- Once the new Runtime is created, you will be shown the Authentication Keys. You will need one of these for the next steps, so take a copy of at least one of them.
- Now download the ‘Azure Data Factory Integration Runtime’ installer onto the server it will run on, either from the link on the screen or from https://www.microsoft.com/en-us/download/details.aspx?id=39717
- You will now need to install this on a server inside your network or VNet. The server you install it on must be able to connect to the desired data sources.
You can install this service on more than one server to create a resilient, high-availability cluster, but I won’t go through that here.
For this demo, I have decided to install it onto the SQL Server itself.
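Before installing, it can be worth confirming that the chosen server really can reach the data source. Below is a minimal sketch of a TCP reachability check in Python. The host name and port 1433 (the default SQL Server port) are assumptions for illustration; substitute your own values. A successful TCP connection only shows the port is open, not that your credentials will work.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host name and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical target: a SQL Server on this machine, default port 1433.
    print(can_reach("localhost", 1433))
```

Run this on the server that will host the Integration Runtime, not on your own workstation, since it is that server's network path that matters.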
- Once installed, you’ll be asked to enter one of the Authentication Keys you copied earlier. This enables the Integration Runtime instance to register itself with your Azure Data Factory service.
- You can now test a connection to your database using either Basic authentication or credentials stored in Azure Key Vault.
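For reference, the runtime created through the portal in the steps above corresponds to a small JSON definition inside the Data Factory. The sketch below builds that definition in Python so you can see its shape. The helper name, the runtime name ‘OnPremRuntime’ and the description are made up for illustration, and I am assuming the usual self-hosted layout where the type is ‘SelfHosted’; compare it with the code view of your own factory before relying on it.

```python
import json


def build_ir_definition(name: str, description: str) -> dict:
    """Build the JSON-style definition of a self-hosted Integration Runtime."""
    return {
        "name": name,
        "properties": {
            # Assumed type value for a runtime installed inside your own network.
            "type": "SelfHosted",
            "description": description,
        },
    }


if __name__ == "__main__":
    # Hypothetical name and description, matching the 'name and description' step.
    definition = build_ir_definition(
        "OnPremRuntime", "Runtime installed on the SQL Server host"
    )
    print(json.dumps(definition, indent=2))
```

Note that the Authentication Keys are not part of this definition; they are issued by the service once the runtime exists.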
2. Create a Linked Service in Azure Data Factory
- Create a new Linked Service by clicking ‘+ New’ on the ‘Connections’ -> ‘Linked Services’ tab.
- Now select the type of service you want to connect to behind your firewall. In my case it’s a self-hosted MS SQL Server.
- Give it a name and make sure you select the new Integration Runtime you created earlier.
- Enter the credentials and test the connection to ensure all is working well.
- You can now create your Datasets and Pipelines using this Linked Service in the normal way. Schema discovery and data previews should be available too. You just need to select the new Linked Service in the ‘Connection’ tab.
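The key detail in the Linked Service steps above is the reference to the Integration Runtime. As a rough sketch, the Python below builds the kind of JSON definition a SQL Server Linked Service ends up with, where ‘connectVia’ points at the runtime. The helper name, the names ‘OnPremSql’ and ‘OnPremRuntime’, and the connection string are all placeholders I have made up; the exact properties depend on your connector and authentication choice, so treat this as an illustration rather than a template.

```python
import json


def build_sql_linked_service(name: str, runtime_name: str, connection_string: str) -> dict:
    """Build a JSON-style SQL Server Linked Service routed through a given runtime."""
    return {
        "name": name,
        "properties": {
            "type": "SqlServer",
            "typeProperties": {
                # Placeholder only; keep real secrets in Azure Key Vault.
                "connectionString": connection_string,
            },
            # This reference is what sends traffic through the self-hosted runtime.
            "connectVia": {
                "referenceName": runtime_name,
                "type": "IntegrationRuntimeReference",
            },
        },
    }


if __name__ == "__main__":
    linked_service = build_sql_linked_service(
        "OnPremSql",      # hypothetical Linked Service name
        "OnPremRuntime",  # hypothetical runtime name from section 1
        "Server=<server>;Database=<database>;User ID=<user>;Password=<password>;",
    )
    print(json.dumps(linked_service, indent=2))
```

If the connection test fails, the runtime reference is the first thing I would check: without it, ADF tries to reach the server from the public cloud rather than from inside your network.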