Jobs & Careers
Build a Data Cleaning & Validation Pipeline in Under 50 Lines of Python
Image by Author | Ideogram
Data is messy. Whether you're pulling information from APIs or analyzing real-world datasets, you'll inevitably run into duplicates, missing values, and invalid entries. Instead of writing the same cleaning code repeatedly, a well-designed pipeline saves time and ensures consistency across your data science projects.
In this article, we’ll build a reusable data cleaning and validation pipeline that handles common data quality issues while providing detailed feedback about what was fixed. By the end, you’ll have a tool that can clean datasets and validate them against business rules in just a few lines of code.
Why Data Cleaning Pipelines?
Think of data pipelines like assembly lines in manufacturing. Each step performs a specific function, and the output from one step becomes the input for the next. This approach makes your code more maintainable, testable, and reusable across different projects.
A Simple Data Cleaning Pipeline
Image by Author | diagrams.net (draw.io)
Our pipeline will handle three core responsibilities:
- Cleaning: Remove duplicates and handle missing values (a starting point; you can add as many cleaning steps as you need)
- Validation: Ensure data meets business rules and constraints
- Reporting: Track what changes were made during processing
Setting Up the Development Environment
Make sure you're using a recent version of Python. If you're working locally, create a virtual environment and install the required packages:
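One way to do this on macOS/Linux looks like the following (the package names are inferred from the imports used throughout this article):

```shell
# Create and activate an isolated environment
python -m venv venv
source venv/bin/activate

# Install the libraries used in this tutorial
pip install pandas numpy pydantic
```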
You can also use Google Colab or similar notebook environments if you prefer.
Defining the Validation Schema
Before we can validate data, we need to define what “valid” looks like. We’ll use Pydantic, a Python library that uses type hints to validate data types.
from typing import Optional
from pydantic import BaseModel, field_validator

class DataValidator(BaseModel):
    name: str
    age: Optional[int] = None
    email: Optional[str] = None
    salary: Optional[float] = None

    @field_validator('age')
    @classmethod
    def validate_age(cls, v):
        if v is not None and (v < 0 or v > 100):
            raise ValueError('Age must be between 0 and 100')
        return v

    @field_validator('email')
    @classmethod
    def validate_email(cls, v):
        if v and '@' not in v:
            raise ValueError('Invalid email format')
        return v
This schema models the expected data using Pydantic's syntax. Note that the @field_validator decorator must be stacked on top of @classmethod. The validators ensure that age falls within reasonable bounds and that emails contain the '@' symbol.
Building the Pipeline Class
Our main pipeline class encapsulates all cleaning and validation logic:
from typing import Any, Dict
import numpy as np
import pandas as pd
from pydantic import ValidationError

class DataPipeline:
    def __init__(self):
        self.cleaning_stats = {'duplicates_removed': 0, 'nulls_handled': 0, 'validation_errors': 0}
The constructor initializes a statistics dictionary to track changes made during processing. This gives you visibility into data quality and a record of the cleaning steps applied over time.
Writing the Data Cleaning Logic
Let's add a clean_data method to handle common data quality issues like missing values and duplicate records:
    def clean_data(self, df: pd.DataFrame) -> pd.DataFrame:
        initial_rows = len(df)

        # Remove duplicates first so they don't skew the median
        df = df.drop_duplicates()
        self.cleaning_stats['duplicates_removed'] = initial_rows - len(df)

        # Handle missing values
        null_count = int(df.isnull().sum().sum())
        numeric_columns = df.select_dtypes(include=[np.number]).columns
        df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())

        string_columns = df.select_dtypes(include=['object']).columns
        df[string_columns] = df[string_columns].fillna('Unknown')
        self.cleaning_stats['nulls_handled'] = null_count

        return df
This approach is smart about handling different data types. Numeric missing values get filled with the median (more robust than mean against outliers), while text columns get a placeholder value. The duplicate removal happens first to avoid skewing our median calculations.
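To see why the median is the more robust choice, here's a quick standalone check using Python's built-in statistics module on salary values like those in the sample data later in this article:

```python
import statistics

salaries = [50000, 60000, 50000, 75000]
print(statistics.median(salaries))  # 55000.0
print(statistics.mean(salaries))    # 58750

# Add one extreme outlier: the median barely moves, but the mean jumps
salaries_with_outlier = salaries + [1_000_000]
print(statistics.median(salaries_with_outlier))  # 60000
print(statistics.mean(salaries_with_outlier))    # 247000
```

Filling missing values with the mean here would inject the outlier's influence into every imputed row; the median keeps imputation anchored to typical values.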
Adding Validation with Error Tracking
The validation step processes each row individually, collecting both valid data and detailed error information:
    def validate_data(self, df: pd.DataFrame) -> tuple[pd.DataFrame, list[dict]]:
        valid_rows = []
        errors = []

        for idx, row in df.iterrows():
            try:
                validated_row = DataValidator(**row.to_dict())
                valid_rows.append(validated_row.model_dump())
            except ValidationError as e:
                errors.append({'row': idx, 'errors': str(e)})

        self.cleaning_stats['validation_errors'] = len(errors)
        return pd.DataFrame(valid_rows), errors
This row-by-row approach ensures that one bad record doesn’t crash the entire pipeline. Valid rows continue through the process while errors are captured for review. This is important in production environments where you need to process what you can while flagging problems.
Orchestrating the Pipeline
The process method ties everything together:
    def process(self, df: pd.DataFrame) -> Dict[str, Any]:
        cleaned_df = self.clean_data(df.copy())
        validated_df, validation_errors = self.validate_data(cleaned_df)

        return {
            'cleaned_data': validated_df,
            'validation_errors': validation_errors,
            'stats': self.cleaning_stats
        }
The return value is a comprehensive report that includes the cleaned data, any validation errors, and processing statistics.
Putting It All Together
Here’s how you’d use the pipeline in practice:
# Create sample messy data
sample_data = pd.DataFrame({
    'name': ['Tara Jamison', 'Jane Smith', 'Lucy Lee', None, 'Clara Clark', 'Jane Smith'],
    'age': [25, -5, 25, 35, 150, -5],
    'email': ['taraj@email.com', 'invalid-email', 'lucy@email.com', 'jane@email.com', 'clara@email.com', 'invalid-email'],
    'salary': [50000, 60000, 50000, None, 75000, 60000]
})

pipeline = DataPipeline()
result = pipeline.process(sample_data)
The pipeline automatically removes the duplicate 'Jane Smith' record, fills the missing name with 'Unknown', fills the missing salary with the median value, and flags validation errors for the rows with out-of-range ages and an invalid email.
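You can verify the deduplication and median logic directly with pandas, outside the pipeline (a quick sanity check, not part of the pipeline itself):

```python
import pandas as pd

# Same sample data as above
sample_data = pd.DataFrame({
    'name': ['Tara Jamison', 'Jane Smith', 'Lucy Lee', None, 'Clara Clark', 'Jane Smith'],
    'age': [25, -5, 25, 35, 150, -5],
    'email': ['taraj@email.com', 'invalid-email', 'lucy@email.com', 'jane@email.com', 'clara@email.com', 'invalid-email'],
    'salary': [50000, 60000, 50000, None, 75000, 60000]
})

deduped = sample_data.drop_duplicates()
print(len(sample_data) - len(deduped))  # 1 duplicate removed
print(deduped['salary'].median())       # 55000.0 -> used to fill the missing salary
```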
🔗 You can find the complete script on GitHub.
Extending the Pipeline
This pipeline serves as a foundation you can build upon. Consider these enhancements for your specific needs:
- Custom cleaning rules: Add methods for domain-specific cleaning like standardizing phone numbers or addresses.
- Configurable validation: Make the Pydantic schema configurable so the same pipeline can handle different data types.
- Advanced error handling: Implement retry logic for transient errors or automatic correction for common mistakes.
- Performance optimization: For large datasets, consider using vectorized operations or parallel processing.
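As one sketch of a custom cleaning rule, a hypothetical normalize_phone helper (the name and the digits-only output format are assumptions, not part of the original pipeline) could standardize phone numbers before validation:

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip everything except digits from a phone number string."""
    return re.sub(r'\D', '', raw)

print(normalize_phone('(555) 123-4567'))  # 5551234567
print(normalize_phone('555.123.4567'))    # 5551234567
```

A method like this could be called inside clean_data for each phone column, so every downstream consumer sees one canonical format.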
Wrapping Up
Data pipelines aren’t just about cleaning individual datasets. They’re about building reliable, maintainable systems.
This pipeline approach ensures consistency across your projects and makes it easy to adjust business rules as requirements change. Start with this basic pipeline, then customize it for your specific needs.
The key is having a reliable, reusable system that handles the mundane tasks so you can focus on extracting insights from clean data. Happy data cleaning!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
HCLSoftware Launches Domino 14.5 With Focus on Data Privacy and Sovereign AI
HCLSoftware, a global enterprise software leader, launched HCL Domino 14.5 on July 7 as a major upgrade, specifically targeting governments and organisations operating in regulated sectors that are concerned about data privacy and digital independence.
A key feature of the new release is Domino IQ, a sovereign AI extension built into the Domino platform. This new tool gives organisations full control over their AI models and data, helping them comply with regulations such as the European AI Act.
It also removes dependence on foreign cloud services, making it easier for public sector bodies and banks to protect sensitive information.
“The importance of data sovereignty and avoiding unnecessary foreign government influence extends beyond SaaS solutions and AI. Specifically for collaboration – the sensitive data within email, chat, video recordings and documents. With the launch of Domino+ 14.5, HCLSoftware is helping over 200 government agencies safeguard their sensitive data,” said Richard Jefts, executive vice president and general manager at HCLSoftware.
The updated Domino+ collaboration suite now includes enhanced features for secure messaging, meetings, and file sharing. These tools are ready to deploy and meet the needs of organisations that handle highly confidential data.
The platform is supported by IONOS, a leading European cloud provider. Achim Weiss, CEO of IONOS, added, “Today, more than ever, true digital sovereignty is the key to Europe’s digital future. That’s why at IONOS we are proud to provide the sovereign cloud infrastructure for HCL’s sovereign collaboration solutions.”
Other key updates in Domino 14.5 include achieving BSI certification for information security, the integration of security information and event management (SIEM) tools to enhance threat detection and response, and full compliance with the European Accessibility Act, ensuring that all web-based user experiences are inclusive and accessible to everyone.
With the launch of Domino 14.5, HCLSoftware is aiming to be a trusted technology partner for public sector and highly regulated organisations seeking control, security, and compliance in their digital operations.
Mitsubishi Electric Invests in AI-Assisted PLM Systems Startup ‘Things’
Mitsubishi Electric Corporation announced on July 7 that its ME Innovation Fund has invested in Things, a Japan-based startup that develops and provides AI-assisted product lifecycle management (PLM) systems for the manufacturing industry.
This startup specialises in comprehensive document management, covering everything from product planning and development to disposal. According to the company, this marks the 12th investment made by Mitsubishi’s fund to date.
Through this investment, Mitsubishi Electric aims to combine its extensive manufacturing and control expertise with Things’ generative AI technology. The goal is to accelerate the development of digital transformation (DX) solutions that tackle various challenges facing the manufacturing industry.
In recent years, Japan’s manufacturing sector has encountered several challenges, including labour shortages and the ageing of skilled technicians, which hinder the transfer of expertise. In response, DX initiatives, such as the implementation of PLM and other digital systems, have progressed rapidly. However, these initiatives have faced challenges related to development time, cost, usability, and scalability.
Komi Matsubara, an executive officer at Mitsubishi Electric Corporation, stated, “Through our collaboration with Things, we expect to generate new value by integrating our manufacturing expertise with Things’ generative AI technology. We aim to leverage this initiative to enhance the overall competitiveness of the Mitsubishi Electric group.”
Things launched its ‘PRISM’ PLM system in May 2023, utilising generative AI to improve the structure and usage of information in manufacturing. PRISM offers significant cost and scalability advantages, enhancing user interfaces and experiences while effectively implementing proofs of concept across a wide range of companies.
Atsuya Suzuki, CEO of Things, said, “We are pleased to establish a partnership with Mitsubishi Electric through the ME Innovation Fund. By combining our technology with Mitsubishi Electric’s expertise in manufacturing and control, we aim to accelerate the global implementation of pioneering DX solutions for manufacturing.”
AI to Track Facial Expressions to Detect PTSD Symptoms in Children
A research team from the University of South Florida (USF) has developed an AI system that can identify post-traumatic stress disorder (PTSD) in children.
The project addresses a longstanding clinical dilemma: diagnosing PTSD in children who may not have the emotional vocabulary, cognitive development or comfort to articulate their distress. Traditional methods such as subjective interviews and self-reported questionnaires often fall short. This is where AI steps in.
“Even when they weren’t saying much, you could see what they were going through on their faces,” Alison Salloum, professor at the USF School of Social Work, reportedly said. Her observations during trauma interviews laid the foundation for collaboration with Shaun Canavan, an expert in facial analysis at USF’s Bellini College of Artificial Intelligence, Cybersecurity, and Computing.
The study introduces a privacy-first, context-aware classification model that analyses subtle facial muscle movements. However, instead of using raw footage, the system extracts non-identifiable metrics such as eye gaze, mouth curvature, and head position, ensuring ethical boundaries are respected when working with vulnerable populations.
“We don’t use raw video. We completely get rid of subject identification and only keep data about facial movement,” Canavan reportedly emphasised. The AI also accounts for conversational context, whether a child is speaking to a parent or a therapist, which significantly influences emotional expressivity.
Across 18 therapy sessions, with over 100 minutes of footage per child and approximately 185,000 frames each, the AI identified consistent facial expression patterns in children diagnosed with PTSD. Notably, children were more expressive with clinicians than with parents, a finding that aligns with psychological literature suggesting shame or emotional avoidance often inhibits open communication at home.
While still in its early stages, the tool is not being pitched as a replacement for therapists. Instead, it’s designed as a clinical augmentation, a second set of ‘digital’ eyes that can pick up on emotional signals even trained professionals might miss in real time.
“Data like this is incredibly rare for AI systems,” Canavan added. “That’s what makes this so promising. We now have an ethically sound, objective way to support mental health assessments.”
If validated on a larger scale, the system could transform mental health diagnostics for children—especially for pre-verbal or very young patients—by turning non-verbal cues into actionable insights.