You're going to encounter a lot of data inconsistencies: nulls, negative values, string inconsistencies, and more. If these aren't handled early in your data analysis workflow, querying and analysing your data will be a pain later on.
Now, I’ve done data cleaning before using SQL and Excel, not really with Python. So, to learn about Pandas (one of Python’s data analysis libraries), I’ll be dabbling in some data cleaning.
In this article, I’ll be sharing with you a repeatable, beginner-friendly data cleaning workflow. By the end of this article, you should be pretty confident in using Python for data cleaning and analysis.
The Dataset we’ll be working with
I’ll be working with a synthetic, messy HR dataset containing typical real-world errors (inconsistent dates, mixed data types, compound columns). This dataset is from Kaggle, and it is designed for practising data cleaning, transformation, exploratory analysis, and preprocessing for data visualisation and machine learning.
The dataset contains over 1,000 rows and 13 columns, including employee information such as names, department-region combinations, contact details, status, salaries, and performance scores. It includes examples of:
- Duplicates
- Missing values
- Inconsistent date formats
- Erroneous entries (e.g., non-numeric salary values)
- Compound columns (e.g., “Department_Region” like “Cloud Tech-Texas” that can be split)
It contains columns like:
- Employee_ID: Unique synthetic ID (e.g., EMP1001)
- First_Name, Last_Name: Randomly generated personal names
- Name: Full name (may be redundant with first/last)
- Age: Includes missing values
- Department_Region: Compound column (e.g., “HR-Florida”)
- Status: Employee status (Active, Inactive, Pending)
- Join_Date: Inconsistent format (YYYY/MM/DD)
- Salary: Includes invalid entries (e.g., “N/A”)
- Email, Phone: Synthetic contact information
- Performance_Score: Categorical performance rating
- Remote_Work: Boolean flag (True/False)
You can access the dataset here and play around with it.
The dataset is fully synthetic. It does not contain any real individuals’ data and is safe to use for public, academic, or commercial projects.
This dataset is in the public domain under the CC0 1.0 Universal license. You are free to use, modify, and distribute it without restriction.
Overview of the Cleaning Workflow
The data cleaning workflow I’ll be working with consists of 5 simple stages.
- Load
- Inspect
- Clean
- Review
- Export
Let’s dive deeper into each of these stages.
Step 1 — Load the CSV (And Handle the First Hidden Issues)
There are some things to keep in mind before loading your dataset. This is an optional step, and we probably won't encounter most of these issues in our dataset, but it doesn't hurt to know them. Here are some key things to consider while loading.
Encoding issues (utf-8, latin-1)
Encoding defines how characters are stored as bytes in the file. Python and Pandas usually default to UTF-8, which handles most modern text and special characters globally. However, if the file was created in an older system or a non-English environment, it might use a different encoding, most commonly Latin-1.
So if you try to read a Latin-1 file with UTF-8, Pandas will encounter bytes it doesn’t recognise as valid UTF-8 sequences. You’ll typically see a UnicodeDecodeError when you try to read a CSV with encoding issues.
If perhaps the default load fails, you could try to specify a different encoding:
# First attempt (the default)
try:
    df = pd.read_csv('messy_data.csv')
except UnicodeDecodeError:
    # Second attempt with a common alternative
    df = pd.read_csv('messy_data.csv', encoding='latin-1')
Wrong delimiters
CSV stands for “Comma Separated Values,” but in reality, many files use other characters as separators, like semicolons (common in Europe), tabs, or even pipes (|). Pandas typically defaults to the comma (,).
So, if your file uses a semicolon (;) but you load it with the default comma delimiter, Pandas will treat the entire row as a single column. The result would be a DataFrame with a single column containing entire lines of data, making it impossible to work with.
The fix is pretty simple. You can try checking the raw file (opening it in a text editor like VS Code or Notepad++ is best) to see what character separates the values. Then, pass that character to the sep argument like so:
# If the file uses semicolons
df = pd.read_csv('messy_data.csv', sep=';')
# If the file uses tabs (TSV)
df = pd.read_csv('messy_data.csv', sep='\t')
Columns that import incorrectly
Sometimes, Pandas guesses the data type for a column based on the first few rows, but later rows contain unexpected data (e.g., text mixed into a column that started with numbers).
For instance, Pandas may correctly identify 0.1, 0.2, 0.3 as floats, but if row 100 contains the value N/A, Pandas might force the entire column into an object (string) type to accommodate the mixed values. This sucks because you lose the ability to perform fast, vectorised numeric operations on that column until you clean up the bad values.
To fix this, I use the dtype argument to tell Pandas what data type a column should be explicitly. This prevents silent type casting.
df = pd.read_csv('messy_data.csv', dtype={'price': float, 'quantity': 'Int64'})
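If the column already contains junk values that a strict dtype would choke on, an alternative (a minimal sketch, reusing the hypothetical price column from above) is to load everything as-is and coerce afterwards with pd.to_numeric:
# Load first, then coerce; values that can't be parsed (e.g. 'N/A', 'unknown') become NaN
df = pd.read_csv('messy_data.csv')
df['price'] = pd.to_numeric(df['price'], errors='coerce')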
Reading the first few rows
You could save time by checking the first few rows directly during the loading process using the nrows parameter. This is great, especially when you’re working with large datasets, as it allows you to test encoding and delimiters without loading the entire 10 GB file.
# Load only the first 50 rows to confirm encoding and delimiter
temp_df = pd.read_csv('large_messy_data.csv', nrows=50)
print(temp_df.head())
Once you’ve confirmed the arguments are correct, you can load the full file.
Let’s load the Employee dataset. I don’t expect to see any issues here.
import pandas as pd
df = pd.read_csv('Messy_Employee_dataset.csv')
df
Output:
1020 rows × 12 columns
Now we can move on to Step 2: Inspection.
Step 2 — Inspect the Dataset
I treat this phase like a forensic audit. I’m looking for evidence of chaos hidden beneath the surface. If I rush this step, I guarantee myself a world of pain and analytical errors down the line. I always run these four crucial checks before writing any transformation code.
The following methods give me the full health report on my 1,020 employee records:
1. df.head() and df.tail(): Understanding the Boundaries
I always start with a visual check. I use df.head() and df.tail() to see the first and last five rows. This is my quick sanity check to see if all columns look aligned and if the data visually makes sense.
My Finding:
When I ran df.head(), I noticed my Employee ID was sitting in a column, and the DataFrame was using the default numerical index (0, 1, 2, …) instead.
While I know I could set Employee ID as the index, for now, I’ll leave it. The bigger immediate visual risk I’m looking for here is data misaligned in the wrong column or obvious leading/trailing spaces on names that will cause trouble later.
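For reference, the check itself is just two calls:
# Visually inspect both ends of the DataFrame
print(df.head())   # first 5 rows
print(df.tail())   # last 5 rows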
2. df.info(): Spotting Datatype Problems and Missingness
This is the most critical method. It tells me the column names, the data types (Dtype), and the exact number of non-null values.
My Findings on 1,020 Rows:
- Missing Age: My total entry count is 1,020, but the Age column only has 809 non-null values. That's a significant amount of missing data that I'll have to decide how to handle later: do I impute it, or do I drop the rows?
- Missing Salary: The Salary column has 996 non-null values, which is only a minor gap, but still something I must resolve.
- The ID Type Check: The Employee ID is correctly listed as an object (string). This is exactly what I want. IDs are identifiers, not numbers to be averaged, and the string type prevents Pandas from accidentally stripping leading zeros.
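The call behind these findings is a single line:
# Column names, dtypes, and non-null counts for every column
df.info()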
3. Data Integrity Check: Duplicates and Unique Counts
After checking dtypes, I need to know if I have duplicate records and how consistent my categorical data is.
- Checking for Duplicates: I ran df.duplicated().sum() and got a result of 0. This is perfect! It means I don't have identical rows cluttering up my dataset.
- Checking Unique Values (df.nunique()): I use this to understand the diversity within each column. Low counts in categorical columns are fine, but I look for columns that should be unique but aren't, or columns that have too many unique values, suggesting typos. (A minimal version of both checks appears after this list.)
- Employee_ID has 1,020 unique records. This is perfect; it means every record is unique.
- The First_Name / Last_Name field has eight unique records. That’s a little odd. This confirms the dataset’s synthetic nature. My analysis won’t be skewed by a large variety of names, since I’ll treat them as standard strings.
- Department_Region has 36 unique records. There’s high potential for typos. 36 unique values for region/department is too many. I will need to check this column for spelling variations (e.g., “HR” vs. “Human Resources”) in the next step.
- Email (64 unique records). With 1,020 employees, having only 64 unique emails suggests many employees share the same placeholder email. I will flag this for exclusion from analysis, as it’s useless for identifying individuals.
- Phone (1020 unique records). This is perfect because it confirms phone numbers are unique identifiers.
- Age / Performance Score / Status / Remote Work (2–4 unique records). These low counts are expected for categorical or ordinal data, meaning they are ready for encoding.
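Here is that minimal version of both integrity checks:
# Count fully duplicated rows
print(df.duplicated().sum())
# Count unique values per column to spot suspicious cardinality
print(df.nunique())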
4. df.describe(): Catching Odd and Impossible Values
I use df.describe() to get a statistical summary of all my numerical columns. This is the place where truly impossible values—the “red flags”—show up instantly. I mostly focus on the min and max rows.
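The summary itself comes from a single call:
# Statistical summary of the numeric columns; scan the min and max rows
print(df.describe())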
My Findings:
I immediately noticed a problem in what I expected to be the Phone Number column, which Pandas mistakenly converted to a numerical type.
- Mean: -4.942253e+09
- Min: -9.994973e+09
- Max: -3.896086e+06
- 25%: -7.341992e+09
- 50%: 4.943997e+09
- 75%: -2.520391e+09
It turns out the phone number values had been parsed as huge numbers, most of them negative! This confirms two things:
- Pandas incorrectly inferred this column as a number, even though phone numbers are strings.
- There must be characters in the text that Pandas couldn't interpret cleanly (for example, parentheses, dashes, or country codes). I need to convert this to an object type and clean it up completely.
5. df.isnull().sum(): Quantifying Missing Data
While df.info() gives me non-null counts, df.isnull().sum() gives me the total count of nulls, which is a cleaner way to quantify my next steps.
My Findings:
Age has 211 nulls (1,020 - 809 = 211), and Salary has 24 nulls (1,020 - 996 = 24). This precise count sets the stage for Step 3.
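For reference, the exact check is:
# Total missing values per column
print(df.isnull().sum())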
This inspection process is my safety net. If I had missed the negative phone numbers, any analytical step that involved numerical data would have failed or, worse, produced skewed results without warning.
By identifying the need to treat Phone Number as a string and the significant missing values in Age now, I have a concrete cleaning list. This prevents runtime errors and, critically, ensures that my final analysis is based on plausible, non-corrupted data.
Step 3 — Standardise Column Names, Correct Dtypes, and Handle Missing Values
With my list of flaws in hand (missing Age, missing Salary, the terrible negative Phone Numbers, and the messy categorical data), I move into the heavy lifting. I treat this step in three sub-phases: ensuring consistency, fixing corruption, and filling gaps.
1. Standardising Column Names and Setting the Index (The Consistency Rule)
Before I do any serious data manipulation, I enforce strict consistency on column names. Why? Because typing df['Employee ID '] instead of df['employee_id'] is a frustrating, easy-to-make mistake. Once the names are clean, I set the index.
My golden rule is snake_case and lowercase everywhere, and ID columns should be the index.
I use a simple command to strip whitespace, replace spaces with underscores, and convert everything to lowercase.
# The Standardization Command
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
# Before: ['Employee_ID', 'First_Name', 'Phone']
# After: ['employee_id', 'first_name', 'phone']
Now that our columns are standardised, I can move on to setting employee_id as the index.
# Set the Employee ID as the DataFrame Index
# This is crucial for efficient lookups and clean merges later.
df.set_index('employee_id', inplace=True)
# Let’s review it real quick
print(df.index)
Output:
Index(['EMP1000', 'EMP1001', 'EMP1002', 'EMP1003', 'EMP1004', 'EMP1005',
'EMP1006', 'EMP1007', 'EMP1008', 'EMP1009',
...
'EMP2010', 'EMP2011', 'EMP2012', 'EMP2013', 'EMP2014', 'EMP2015',
'EMP2016', 'EMP2017', 'EMP2018', 'EMP2019'],
dtype='object', name='employee_id', length=1020)
Perfect, everything is in place.
2. Fixing Data Types and Corruption (Tackling the Negative Phone Numbers)
My df.describe() check revealed the most urgent structural flaw: the Phone column, which was imported as a garbage numerical type. Since phone numbers are identifiers (not quantities), they must be strings.
In this phase, I’ll convert the entire column to a string type, which will turn all those negative scientific notation numbers into human-readable text (though still full of non-digit characters). I will leave the actual text cleaning (removing parentheses, dashes, etc.) for a dedicated standardisation step (Step 4).
# Fix the Phone dtype immediately
# Note: The column name is now 'phone' due to standardization in 3.1
df['phone'] = df['phone'].astype(str)
3. Handling Missing Values (The Age & Salary Gaps)
Finally, I address the gaps revealed by df.info(): the 211 missing Age values and the 24 missing Salary values (out of 1,020 total rows). My strategy depends entirely on the column’s role and the magnitude of the missing data:
- Salary (24 missing values): In this case, removing or dropping all missing values would be the best strategy. Salary is a critical metric for financial analysis. Imputing it risks skewing conclusions. Since only a small fraction (2.3%) is missing, I choose to drop the incomplete records.
- Age (211 missing values). Filling the missing values is the best strategy here. Age is often a feature for predictive modelling (e.g., turnover). Dropping 20% of my data is too costly. I’ll fill the missing values using the median age to avoid skewing the distribution with the mean.
I execute this strategy with two separate commands:
# 1. Removal: Drop rows missing the critical 'salary' data
df.dropna(subset=['salary'], inplace=True)
# 2. Imputation: Fill missing 'age' with the median
median_age = df['age'].median()
df['age'] = df['age'].fillna(median_age)
After these commands, I would run df.info() or isnull().sum() again just to confirm that the non-null counts for salary and age now reflect a clean dataset.
# Rechecking the null counts for salary and age
df['salary'].isnull().sum()
df['age'].isnull().sum()
Output:
np.int64(0)
So far so good!
By addressing the structural and missing data issues here, the subsequent steps can focus entirely on value standardisation, such as the messy 36 unique values in department_region—which we tackle in the next phase.
Step 4 — Value Standardization: Making Data Consistent
My DataFrame now has the right structure, but the values inside are still dirty. This step is about consistency. If “IT,” “i.t,” and “Info. Tech” all mean the same department, I need to force them into a single, clean value (“IT”). This prevents errors in grouping, filtering, and any statistical analysis based on categories.
1. Cleaning Corrupted String Data (The Phone Number Fix)
Remember the corrupted phone column from Step 2? It’s currently a mess of negative scientific notation numbers that we converted to strings in Step 3. Now, it’s time to extract the actual digits.
So, I’ll be removing every non-digit character (dashes, parentheses, dots, etc.) and converting the result into a clean, unified format. Regular expressions (.str.replace()) are perfect for this. I use \D to match any non-digit character and replace it with an empty string.
# The phone column is currently a string like '-9.994973e+09'
# We use regex to remove everything that isn't a digit
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)
# We can also truncate or format the resulting string if needed
# For example, keeping only the last 10 digits:
df['phone'] = df['phone'].str.slice(-10)
print(df['phone'])
Output:
employee_id
EMP1000 1651623197
EMP1001 1898471390
EMP1002 5596363211
EMP1003 3476490784
EMP1004 1586734256
...
EMP2014 2470739200
EMP2016 2508261122
EMP2017 1261632487
EMP2018 8995729892
EMP2019 7629745492
Name: phone, Length: 996, dtype: object
Looks much better now. It's always good practice to clean identifiers that contain noise like this (IDs with leading characters, zip codes with extensions, and so on).
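As a purely hypothetical illustration (the employee IDs in this dataset are already clean), the same regex trick strips textual prefixes from ID-like values:
# Illustrative only: keep just the digits from prefixed IDs
ids = pd.Series(['ID-0012', 'ID-0047', 'ID-0103'])
print(ids.str.replace(r'\D', '', regex=True))   # 0012, 0047, 0103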
2. Separating and Standardizing Categorical Data (Fixing the 36 Regions)
My df.nunique() check revealed 36 unique values in the department_region column. When I reviewed all the unique values in the column, the output revealed that they are all neatly structured as department-region (e.g., devops-california, finance-texas, cloud tech-new york).
I guess one way to solve this is to split this single column into two dedicated columns. I’ll split the column at the hyphen (-) and assign the parts to new columns: department and region.
# 1. Split the combined column into two new, clean columns
df[['department', 'region']] = df['department_region'].str.split('-', expand=True)
Next, I'll drop the department_region column since it's pretty much useless now.
# 2. Drop the redundant combined column
df.drop('department_region', axis=1, inplace=True)
Let's review our new columns:
print(df[['department', 'region']])
Output:
department region
employee_id
EMP1000 devops california
EMP1001 finance texas
EMP1002 admin nevada
EMP1003 admin nevada
EMP1004 cloud tech florida
... ... ...
EMP2014 finance nevada
EMP2016 cloud tech texas
EMP2017 finance new york
EMP2018 hr florida
EMP2019 devops illinois
[996 rows x 2 columns]
After splitting, the new department column has only 6 unique values (e.g., ‘devops’, ‘finance’, ‘admin’, etc.). This is great news. The values are already standardised and ready for analysis! I guess we could always map all similar departments to one single category. But I’m gonna skip that. I don’t want to get too advanced in this article.
3. Converting Date Columns (The Join_Date Fix)
The Join_Date column is usually read in as a string (object) type, which makes time-series analysis impossible. This means we have to convert it to a proper Pandas datetime object.
pd.to_datetime() is the core function. I often use errors='coerce' as a safety net; if Pandas can’t parse a date, it converts that value to NaT (Not a Time), which is a clean null value, preventing the whole operation from crashing.
# Convert the join_date column to datetime objects
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
The conversion of dates enables powerful time-series analysis, like calculating average employee tenure or identifying turnover rates by year.
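As a quick sketch of what this unlocks (not part of the core cleaning, and the tenure calculation here is just an illustration), I can count how many dates were coerced to NaT and estimate tenure in years:
# How many join dates failed to parse?
print(df['join_date'].isna().sum())
# Rough tenure in years, measured against today's date
tenure_years = (pd.Timestamp.today() - df['join_date']).dt.days / 365.25
print(tenure_years.describe())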
After this step, every value in the dataset is clean, uniform, and correctly formatted. The categorical columns (like department and region) are ready for grouping and visualisation, and the numerical columns (like salary and age) are ready for statistical modeling. The dataset is officially ready for analysis.
Step 5 — Final Quality Check and Export
Before closing the notebook, I always perform one last audit to ensure everything is perfect, and then I export the data so I can perform analysis on it later.
The Final Data Quality Check
This is quick. I re-run the two most critical inspection methods to confirm that all my cleaning commands actually worked:
- df.info(): I confirm there are no more missing values in the critical columns (age, salary) and that the data types are correct (phone is a string, join_date is datetime).
- df.describe(): I ensure the statistical summary shows plausible numbers. The phone column should now be absent from this output (since it's a string), and age and salary should have logical minimum and maximum values.
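The audit itself is just the same two calls from Step 2, run one last time:
# Final audit before export
df.info()
print(df.describe())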
If these checks pass, I know the data is reliable.
Exporting the Clean Dataset
The final step is to save this cleaned version of the data. I usually save it as a new CSV file to keep the original messy file intact for reference. I use index=False here if I don’t want the employee_id (which is now the index) to be saved as a separate column, or index=True if I want to save the index as the first column in the new CSV.
# Exporting the clean DataFrame to a new CSV file
# We use index=True to keep our primary key (employee_id) in the exported file
df.to_csv('cleaned_employee_data.csv', index=True)
By exporting with a clear, new filename (e.g., _clean.csv), you officially mark the end of the cleaning phase and provide a clean slate for the next phase of the project.
Conclusion
Honestly, I used to feel overwhelmed by a messy dataset. The missing values, the weird data types, the cryptic columns — it felt like facing the blank page syndrome.
But this structured, repeatable workflow changed everything. By focusing on Load, Inspect, Clean, Review, and Export, we established order instantly: standardizing column names, making the employee_id the index, and using smart strategies for imputation and splitting messy columns.
Now, I can jump straight into the fun analysis part without constantly second-guessing my results. If you struggle with the initial data cleaning step, try out this workflow. I’d love to hear how it goes. If you want to play around with the dataset, you can download it here.
Wanna connect? Feel free to say hi on these platforms
YouTube
Medium


