Advanced ETL techniques for beginners
In this story I will cover advanced data engineering techniques in Python. Python is, without doubt, the most popular programming language for data work. During my almost twelve-year career in data engineering I have run into all sorts of situations where code had issues, and this story is a brief summary of how I resolved them and learned to write better code. I will show a few techniques that make ETL pipelines faster and the code behind them more efficient.
List comprehensions
Imagine you are looping through a list of tables. Typically, we would do this:
data_pipelines = ['p1','p2','p3']
processed_tables = []
for table in data_pipelines:
    processed_tables.append(table)
But instead, we could use a list comprehension. Not only is it faster, it also reduces the code, making it more concise:
processed_tables = [table for table in data_pipelines]
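In practice a comprehension usually transforms or filters as it goes, which is where the conciseness really pays off. Here is a minimal sketch, assuming a hypothetical naming convention where temporary pipelines are prefixed with tmp_:

data_pipelines = ['p1', 'p2', 'p3', 'tmp_p4']

# Transform and filter in a single pass: uppercase the names
# and skip the temporary pipelines
processed_tables = [table.upper() for table in data_pipelines if not table.startswith('tmp_')]
print(processed_tables)  # ['P1', 'P2', 'P3']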
For example, looping through a very large dataset and transforming (ETL) each row has never been easier. Strictly speaking, the expression inside join() below is a generator expression, the lazy sibling of a list comprehension, which avoids building an intermediate list:

def etl(item):
    # Do some data transformation here
    return json.dumps(item)

data = u"\n".join(etl(item) for item in json_data)
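The same idea scales to files that do not fit in memory. Here is a minimal sketch, assuming a hypothetical newline delimited input file input.jsonl where each line is one JSON record:

import json

def etl(item):
    # Hypothetical transformation: flag each record as processed
    item['processed'] = True
    return json.dumps(item)

# The generator expression streams one row at a time,
# so the whole file is never held in memory
with open('input.jsonl') as source, open('output.jsonl', 'w') as target:
    target.writelines(etl(json.loads(line)) + '\n' for line in source)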
List comprehensions are extremely useful for ETL processing of large data files. Imagine a data file we need to transform into a newline-delimited format. Try running this example in your Python environment:
import io
import json

def etl(item):
    return json.dumps(item)

# Text file loaded as a blob
blob = """
[
{"id":"1","first_name":"John"},
{"id":"2","first_name":"Mary"}
]
"""

json_data = json.loads(blob)
data_str = u"\n".join(etl(item) for item in json_data)
print(data_str)

data_file = io.BytesIO(data_str.encode())
# This data file is ready for BigQuery as newline delimited JSON
print(data_file)
The output will be newline delimited JSON. This is a standard format for data in the BigQuery data warehouse, and the file is ready to be loaded into a table:
{"id": "1", "first_name": "John"}
{"id": "2", "first_name": "Mary"}
<_io.BytesIO object at 0x10c732430>
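From here the loading step itself is straightforward. Below is a minimal sketch of loading data_file with the official google-cloud-bigquery client, assuming default credentials are configured and using a hypothetical destination table ID:

from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my_project.my_dataset.my_table'  # hypothetical destination table

# Tell BigQuery the file is newline delimited JSON and let it
# detect the schema from the data
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)

load_job = client.load_table_from_file(data_file, table_id, job_config=job_config)
load_job.result()  # Wait for the load job to complete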
Generators