Data Cleaning in Python
Data cleaning is the process of identifying and fixing problems in a dataset before analysis. Real-world data often contains issues such as:
- Missing values
- Inconsistent formats
- Duplicate records
- Incorrect data types
- Outliers or invalid values
Cleaning data ensures your analysis and visualizations are accurate and reliable.
This guide focuses on data cleaning using the pandas library.
Installing Required Libraries
Install pandas if it is not already installed:
pip install pandas
Import pandas:
import pandas as pd
Example Dataset
Consider the following dataset stored in a CSV file:
name,age,city,salary
Alice,30,Denver,70000
Bob,,Boulder,65000
Charlie,25,Denver,72000
Alice,30,Denver,70000
Dana,29,,68000
Common problems include:
- Missing values
- Duplicate rows
- Inconsistent formatting
Loading the Dataset
Load the data into a pandas DataFrame.
data = pd.read_csv("employees.csv")
View the first rows:
data.head()
Inspect the structure:
data.info()
Identifying Missing Values
Missing values often appear as NaN.
Check for missing values:
data.isnull()
Count missing values per column:
data.isnull().sum()
Handling Missing Values
There are several strategies for dealing with missing data.
Removing Rows With Missing Values
data_clean = data.dropna()
This removes rows containing any missing values.
Filling Missing Values
You can replace missing values with a default value.
Example:
data["age"] = data["age"].fillna(0)
Fill missing values with the column mean:
data["age"] = data["age"].fillna(data["age"].mean())
Removing Duplicate Records
Duplicate rows can distort analysis results.
Check for duplicates:
data.duplicated()
Remove duplicates:
data = data.drop_duplicates()
Fixing Data Types
Columns sometimes load with incorrect types.
Check types:
data.dtypes
Convert a column type:
data["age"] = data["age"].astype(int)
Convert dates:
data["date"] = pd.to_datetime(data["date"])
Cleaning Text Data
Text fields often contain formatting issues such as extra spaces or inconsistent capitalization.
Remove leading and trailing spaces:
data["city"] = data["city"].str.strip()
Convert text to lowercase:
data["city"] = data["city"].str.lower()
Standardize capitalization:
data["city"] = data["city"].str.title()
Renaming Columns
Sometimes column names are inconsistent or unclear.
Rename columns:
data = data.rename(columns={
"salary": "annual_salary"
})
Filtering Invalid Data
Remove rows with unrealistic values.
Example: remove negative ages
data = data[data["age"] >= 0]
Example: filter salary range
data = data[data["salary"] > 30000]
Detecting Outliers
Outliers can distort analysis.
Basic detection using summary statistics:
data.describe()
Example: remove extreme values
data = data[data["salary"] < 200000]
Standardizing Values
Categorical variables often contain inconsistent labels.
Example problem:
Denver
denver
DENVER
Standardize values:
data["city"] = data["city"].str.lower()
Then optionally capitalize:
data["city"] = data["city"].str.title()
Creating Clean Columns
Sometimes you may need derived variables.
Example:
data["salary_k"] = data["salary"] / 1000
Saving the Clean Dataset
Once data is cleaned, save it for further analysis.
data.to_csv("clean_employees.csv", index=False)
You can also save to Excel:
data.to_excel("clean_employees.xlsx")
Example Data Cleaning Workflow
A typical workflow might look like this:
import pandas as pd
# load data
data = pd.read_csv("employees.csv")
# remove duplicates
data = data.drop_duplicates()
# fill missing ages
data["age"] = data["age"].fillna(data["age"].mean())
# clean city names
data["city"] = data["city"].str.strip().str.title()
# remove invalid rows
data = data[data["salary"] > 30000]
# save cleaned data
data.to_csv("employees_clean.csv", index=False)
Data Cleaning Checklist
When preparing a dataset for analysis, consider checking:
- Missing values
- Duplicate records
- Incorrect data types
- Inconsistent text values
- Outliers
- Invalid or unrealistic values
Cleaning these issues ensures that later analysis and visualization are trustworthy.
Summary
Data cleaning is a critical step in any data workflow. Using pandas, you can:
- Detect missing values
- Remove duplicates
- Standardize text fields
- Correct data types
- Filter invalid data
Clean data leads to more accurate analysis, better visualizations, and reliable conclusions.