What a real ETL pipeline costs (and why)
Asking what an ETL pipeline costs is a bit like asking what a house costs. Here's what actually moves the number, so you can tell a fair quote from a cheap one that bites you later.
"How much does an ETL pipeline cost?" is one of the most common questions we get, and it is a bit like asking what a house costs. The honest answer is that it depends, but the things it depends on are not a mystery. Once you know what actually moves the number, you can look at any quote, ours or anyone else's, and tell whether it is fair.
So here is what you are really paying for.
There are two costs, not one
Almost every confusing conversation about pipeline pricing comes from mixing up two very different things: the one-time cost to build it, and the ongoing cost to run it. A pipeline is not a website you launch and forget. It is a small machine that runs on a schedule, week after week, and that machine has running costs. Get clear on which number you are talking about and half the confusion disappears.
What drives the build cost
How many sources, and how messy each one is
One clean source with a tidy structure is a small job. Ten sources, each with its own quirks, layout and edge cases, are not ten times harder. They are worse, because now you also have to make them all agree on one shape. The mess inside each source matters as much as the count. A site that rearranges its layout every month costs more to support than one that has looked the same for years.
How hard the transformation is
Pulling raw data is rarely the hard part. The work is in what happens next: cleaning it, validating it, standardising dates and currencies and units, removing duplicates, and reshaping everything into the schema you actually want. "Just give me the data" usually means "give me the data after all the annoying work is done," and that annoying work is most of the build.
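To make that concrete, here is a minimal sketch of the cleaning step using only the Python standard library. The rows, field names and date layouts are invented for illustration, and the day-first reading of slash dates is an assumption you would have to confirm for each source.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical raw rows, as two differently formatted sources might deliver them.
RAW_ROWS = [
    {"sku": " ab-1 ", "price": "1,299.00", "date": "03/02/2024"},
    {"sku": "AB-1", "price": "1299", "date": "2024-02-03"},
    {"sku": "cd-2", "price": "15.50", "date": "2024-02-04"},
]

# Assumed date layouts; day-first is a guess that must be confirmed per source.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Try each known layout and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def clean_row(row):
    """Normalise one raw row into the target schema."""
    return {
        "sku": row["sku"].strip().upper(),
        "price": Decimal(row["price"].replace(",", "")),
        "date": parse_date(row["date"]),
    }


def transform(rows):
    """Clean every row, then drop duplicates on (sku, date), keeping the first."""
    seen = set()
    result = []
    for row in rows:
        cleaned = clean_row(row)
        key = (cleaned["sku"], cleaned["date"])
        if key in seen:
            continue
        seen.add(key)
        result.append(cleaned)
    return result


CLEAN_ROWS = transform(RAW_ROWS)
```

Even in this toy version, most of the lines deal with inconsistency between sources rather than with fetching anything, which is roughly how the build effort tends to split.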
Where it has to go, and who gets told
Dropping a CSV in one place is simple. Loading into a database, pushing to S3, syncing to a client's own system, and firing off email and Slack notifications on every run involves far more moving parts. Every part is something that can break, which means it has to be built properly.
Whether you are scraping or calling APIs
If the data comes from clean APIs, collection is the easy bit. If it comes from scraping sites that would rather you didn't, you are also paying for the machinery that keeps it alive at scale: proxies, browser automation, throttling, retries and monitoring. We wrote a whole post on that, because it is its own craft.
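As one small piece of that machinery, retries with backoff might look something like the sketch below. The function name, parameters and defaults are our own illustration rather than any library's API, and a real version would catch only errors known to be transient instead of every exception.

```python
import random
import time


def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff and jitter on failure.

    `fetch` is any callable that takes a URL and either returns a result or
    raises. `sleep` is injectable so the waiting can be stubbed out in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error so it gets noticed
            # Wait 1x, 2x, 4x ... the base delay, plus up to 50% random jitter
            # so that many workers do not all retry at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

The growing delay is also a simple form of throttling: a struggling or suspicious site gets progressively less traffic from you, not more.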
What drives the ongoing cost
This is the part people forget, and it is the part that bites. A running pipeline has real monthly costs:
- Proxies and API fees. If you scrape at volume, residential proxies are a genuine line item. If you lean on paid APIs, their bill scales with your usage.
- Compute and storage. Something has to run the jobs and hold the data. Usually modest, but never zero.
- Maintenance. Sources change. When a site redesigns or an API shifts, someone has to fix the pipeline before the bad data piles up. This is the cost most cheap quotes quietly leave out.
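If it helps to see the arithmetic, the three line items above can be added up in a few lines. Every figure in the example call is a made-up placeholder, not a market price or a quote, and the parameter names are ours.

```python
def monthly_running_cost(proxy_gb, proxy_rate_per_gb, api_calls, api_rate_per_1k,
                         compute_and_storage, maintenance_hours, hourly_rate):
    """Add up the recurring line items for one month of running a pipeline."""
    breakdown = {
        "proxies": proxy_gb * proxy_rate_per_gb,
        "api_fees": api_calls / 1000 * api_rate_per_1k,
        "compute_and_storage": compute_and_storage,
        "maintenance": maintenance_hours * hourly_rate,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Placeholder inputs only: swap in your own volumes and your vendors' rates.
EXAMPLE = monthly_running_cost(
    proxy_gb=50, proxy_rate_per_gb=8.0,
    api_calls=200_000, api_rate_per_1k=0.5,
    compute_and_storage=60.0,
    maintenance_hours=4, hourly_rate=90.0,
)
```

The point of writing it down is that maintenance sits in the same sum as everything else. A quote that leaves it out has not made it free.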
Why the cheapest pipeline is usually the most expensive
You can get a pipeline built cheaply. The problem shows up later. A pipeline with no validation and no monitoring does not fail loudly, it fails quietly, feeding you subtly wrong data for weeks while you make decisions on it. By the time anyone notices, you pay twice: once to find the damage, and once to rebuild the thing properly. The money you save up front tends to come back with interest.
The boring, slightly more expensive version, the one with checks, alerts and a clean handover, is cheaper over any timeframe that actually matters.
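To show what "checks" can mean in practice, here is a minimal sketch of a batch check that stops a run instead of passing bad data along. The thresholds, field names and the ValidationError class are assumptions for illustration; a real pipeline would tune them per source and wire the failure into whatever alerting it uses.

```python
class ValidationError(Exception):
    """Raised when a batch looks wrong enough that the run should stop."""


def validate_batch(rows, required_fields, min_rows, max_missing_ratio=0.05):
    """Return the rows unchanged if the batch looks sane, otherwise raise."""
    # A sudden drop in volume is often the first sign that a source changed.
    if len(rows) < min_rows:
        raise ValidationError(f"expected at least {min_rows} rows, got {len(rows)}")
    # A field that is suddenly empty usually means a layout or schema change.
    for field in required_fields:
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        if missing > max_missing_ratio * len(rows):
            raise ValidationError(
                f"{field!r} missing in {missing} of {len(rows)} rows"
            )
    return rows
```

Raising an exception here is the loud failure: the run stops, the alert fires, and the wrong data never reaches the people making decisions on it.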
How to think about a number
Rather than quote a figure that would be meaningless without your details, here is the framing we use. A small single-source pipeline with light transformation is a modest one-off build with small running costs. A multi-source pipeline with heavy cleaning, several destinations, notifications and anti-blocking is a larger build with a real monthly bill behind it. Most projects sit somewhere on that line, and where yours lands depends entirely on the answers to the questions above.
If you can tell us your sources, roughly how much data, how often you need it, and where it has to end up, we can give you a straight answer quickly. No vague "it depends," just a real number and the reasoning behind it.