What (t*****k) happened to our Azure Data Factory costs

Daniel Kloyber
Aug 17, 2020 · 4 min read

The first Azure joke I ever heard was: “Azure Pricing — you need a PhD to understand that”.

Sad… but true.

There are not many people left that I haven’t told this yet: we are building a new data warehouse.

A NEW data warehouse, using Azure Data Factory, Azure SQL, Azure Data Lake and a couple more systems that we managed to integrate.

And yes, I know, we are Microsoft-centric, and yes, I can imagine there are many opinions about this, but I don’t care right now.

I wanna tell you the story of our first beginner trap: how we fell for it and how we plan to avoid such traps in the future.

In general, we are uber-happy with the pricing of the Azure services we are using.

In a good month we are probably below 100 Euros, for our current (beginner) business needs, but we still get all the coolness. To be honest, costs are a huge factor for us, especially these days.

We are actually so happy with the pricing that, after keeping a keen eye on the costs for the first couple of months, we now only look at the end-of-the-month invoice.

So imagine my surprise when I saw that we paid considerably more (still not that much) for July than for the previous couple of months. So… I had to investigate…

This is the story of how we needlessly spent ~90 Euros of Azure consumption:

  • A key part of our DWH is an exchange rates dimension that allows us to integrate different locations operating in national currencies.
  • We use the ECB rates for now, and at the time, we used to fetch them from here: https://api.exchangeratesapi.io/
  • The response is a JSON file that we flatten using a Data Flow and then write to a SQL table. We flatten it because we need to add exchange rates from other sources as well. The end result is a SQL table similar to this one:
The actual currency rate data we put in our Data Warehouse
  • Historical data was fetched (all at once) using a very simple/basic/stupid Python script similar to this one:

import datetime
import requests

baseurl = 'https://api.exchangeratesapi.io/'
currency_name = 'EUR'

# Fetch the last 20 days of EUR-based rates, one JSON file per day
for x in range(0, 20):
    day = datetime.date.today() - datetime.timedelta(days=x)
    currency_url = baseurl + str(day) + '?base=' + currency_name
    r = requests.get(url=currency_url)
    filename = str(day) + '_' + currency_name + '.json'
    with open(filename, 'w') as f:
        f.write(r.text)  # write the raw JSON response for that day

  • We then copy the files to an Azure Data Lake folder, convert them to CSV, un-pivot them, add some columns and push them to SQL using a Data Flow (conceptually, the flatten/un-pivot step looks like the sketch right after this list).
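
For context, outside of ADF the flatten/un-pivot step is conceptually something like the following Python sketch. This is not what actually runs in the Data Flow; the file pattern and column names are just illustrative, and the 'rates'/'base'/'date' keys assume the response shape the API returned at the time.

import csv
import glob
import json

# Hypothetical file pattern and output name -- adjust to your folder layout
input_glob = '*_EUR.json'
output_csv = 'rates_flat.csv'

with open(output_csv, 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['date', 'base_currency', 'currency', 'rate'])
    for path in sorted(glob.glob(input_glob)):
        with open(path) as f:
            payload = json.load(f)
        # Un-pivot: one row per (date, base, currency) instead of one wide record per day
        for currency, rate in payload.get('rates', {}).items():
            writer.writerow([payload.get('date'), payload.get('base'), currency, rate])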

When I noticed the excessive costs for last month, I somehow guessed that this pipeline could be the cause, so we confirmed it in the ADF monitoring reports:

We have four similar jobs, one for each currency we were interested in.

The actual costs are caused by the “Data movement activities”: the 133 DIU-hours cost around 30 Euros.
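
As a quick sanity check on those numbers, dividing the cost by the reported DIU-hours gives the effective per-unit rate we were paying for data movement. The snippet only re-does that division; the real rate is whatever the Azure pricing page lists for your region.

# Back-of-the-envelope from the monitoring report above
diu_hours = 133
cost_eur = 30.0
print(round(cost_eur / diu_hours, 3))  # roughly 0.23 EUR per DIU-hour of data movement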

30 EUROS for one freaking pipeline run — THAT CANNOT BE RIGHT.

We moved around 4000 files through this pipeline, and as you can see, the costs for vCore-hour are very decent.

Turns out the actual problem was the number of files being fed to the ADF Data Flow.

It’s still a mystery to me how DIU-hours are calculated, but it seems that when you parallelize a lot of file ingestion operations using “Data flows”, they all add up on your DIU balance. It’s not like running one pipeline with parallel actions; it’s more like running many (thousands of) pipelines in parallel.

This seems to be what made our DIU counter explode.

I did the following test: I merged our data into larger CSVs (6000 rows each) instead of 6000 CSVs with one row each, and the costs went back to less than 30 cents per pipeline run.
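
A minimal sketch of how such a merge can be done before handing the data to ADF (hypothetical file names; it assumes the small CSVs all share the same header row):

import csv
import glob

# Hypothetical inputs: the many one-row CSVs, merged into a single large file
input_glob = 'rates_*.csv'
output_csv = 'merged_rates.csv'

paths = sorted(glob.glob(input_glob))
header_written = False
with open(output_csv, 'w', newline='') as out:
    writer = csv.writer(out)
    for path in paths:
        with open(path, newline='') as f:
            reader = csv.reader(f)
            header = next(reader)        # every small file carries the same header
            if not header_written:
                writer.writerow(header)
                header_written = True
            writer.writerows(reader)     # append the data rows to the merged file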

To summarize the lesson: when using ‘Data flows’, prepare the data upfront so that instead of many small files you feed in one large file. The cost savings can be huge.
