Advanced ETL techniques for beginners
In this story I will cover advanced data engineering techniques in Python. Python is, without doubt, the most popular programming language for data work. During my almost twelve-year career in data engineering I have run into all sorts of situations where code had issues, and this story is a brief summary of how I resolved them and learned to write better code. I will show a few techniques that make ETL pipelines faster and the code behind them more efficient.
List comprehensions
Imagine you are looping through a list of tables. Typically, we would do this:
data_pipelines = ['p1','p2','p3']
processed_tables = []
for table in data_pipelines:
    processed_tables.append(table)
But instead, we could use a list comprehension. Not only is it faster, it also reduces the code, making it more concise:
processed_tables = [table for table in data_pipelines]
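In practice a comprehension usually transforms or filters as it goes, which is where the conciseness really pays off. Here is a minimal sketch, assuming a hypothetical naming convention where temporary pipelines are prefixed with tmp_:

data_pipelines = ['p1', 'p2', 'p3', 'tmp_p4']

# Transform and filter in a single pass: uppercase the names
# and skip the temporary pipelines
processed_tables = [table.upper() for table in data_pipelines if not table.startswith('tmp_')]
print(processed_tables)  # ['P1', 'P2', 'P3']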
For example, looping through a very large dataset and transforming (ETL) each row has never been easier. Strictly speaking, the expression inside join() below is a generator expression, the lazy sibling of a list comprehension, which avoids building an intermediate list:

def etl(item):
    # Do some data transformation here
    return json.dumps(item)

data = u"\n".join(etl(item) for item in json_data)
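The same idea scales to files that do not fit in memory. Here is a minimal sketch, assuming a hypothetical newline delimited input file input.jsonl where each line is one JSON record:

import json

def etl(item):
    # Hypothetical transformation: flag each record as processed
    item['processed'] = True
    return json.dumps(item)

# The generator expression streams one row at a time,
# so the whole file is never held in memory
with open('input.jsonl') as source, open('output.jsonl', 'w') as target:
    target.writelines(etl(json.loads(line)) + '\n' for line in source)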
List comprehensions are extremely useful for ETL processing of large data files. Imagine a data file we need to transform into a newline-delimited format. Try running this example in your Python environment:
import io
import json

def etl(item):
    return json.dumps(item)

# Text file loaded as a blob
blob = """
[
{"id":"1","first_name":"John"},
{"id":"2","first_name":"Mary"}
]
"""

json_data = json.loads(blob)
data_str = u"\n".join(etl(item) for item in json_data)
print(data_str)

data_file = io.BytesIO(data_str.encode())
# This data file is ready for BigQuery as newline delimited JSON
print(data_file)
The output will be newline delimited JSON. This is a standard format for data in the BigQuery data warehouse, and the file is ready to be loaded into a table:
{"id": "1", "first_name": "John"}
{"id": "2", "first_name": "Mary"}
<_io.BytesIO object at 0x10c732430>
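From here the loading step itself is straightforward. Below is a minimal sketch of loading data_file with the official google-cloud-bigquery client, assuming default credentials are configured and using a hypothetical destination table ID:

from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my_project.my_dataset.my_table'  # hypothetical destination table

# Tell BigQuery the file is newline delimited JSON and let it
# detect the schema from the data
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)

load_job = client.load_table_from_file(data_file, table_id, job_config=job_config)
load_job.result()  # Wait for the load job to complete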
Generators