A gentle introduction to unit testing, mocking and patching for beginners
In this story, I would like to open a discussion about unit testing in data engineering. Although there are plenty of articles on Python unit testing on the internet, the topic still looks vague and under-covered when applied to data pipelines. We will talk about data pipelines, the parts they consist of, and how we can test them to ensure continuous delivery. Each step of a data pipeline can be considered a function or process, and ideally it should be tested not only as a unit but also as a whole, integrated into a single data-flow process. I'll try to summarize the techniques I often use to mock, patch and test data pipelines, including integration and automated tests.
What is unit testing in the data world?
Testing is a crucial part of any software development lifecycle: it helps developers make sure the code is reliable and can be easily maintained in the future. Consider our data pipeline as a set of processing steps or functions. In that case, unit testing is the technique of writing tests to ensure that each unit of our code, i.e. each step of our data pipeline, doesn't produce unintended results and is fit for purpose.
In a nutshell, each step of a data pipeline is a method or function which needs to be tested.
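To make this concrete, here is a minimal sketch of unit-testing a single pipeline step. The `transform()` function and its field names are hypothetical examples, not taken from any specific pipeline:

```python
# A hypothetical pipeline step: drop invalid rows and normalise the
# name field. Field names ("name", "value") are illustrative.

def transform(records):
    """One pipeline step: keep valid rows, normalise the name field."""
    return [
        {"name": r["name"].strip().lower(), "value": r["value"]}
        for r in records
        if r.get("name") and r.get("value") is not None
    ]

def test_transform_drops_invalid_rows_and_normalises_names():
    raw = [
        {"name": "  Alice ", "value": 10},
        {"name": "", "value": 5},        # dropped: empty name
        {"name": "Bob", "value": None},  # dropped: missing value
    ]
    assert transform(raw) == [{"name": "alice", "value": 10}]
```

Because the step is a pure function of its input, the test needs no database, no network and no fixtures beyond a few hand-written rows, which is exactly what makes step-per-function pipeline design easy to test.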
Data pipelines differ: in fact, they often vary greatly in terms of data sources, processing steps and final destinations for our data. Whenever we transform data from point A to point B, there is a data pipeline. There are different design patterns [1] and techniques for building these data-processing graphs, and I wrote about them in one of my previous articles.
Take a look at the simple data pipeline example below. It demonstrates a common use case in which data is processed across multiple clouds. Our data pipeline starts from the…