
Build a Data Cleaning & Validation Pipeline in Under 50 Lines of Python


 

Data is messy. Whether you’re pulling information from APIs or working with real-world datasets, you’ll inevitably run into duplicates, missing values, and invalid entries. Instead of writing the same cleaning code repeatedly, a well-designed pipeline saves time and ensures consistency across your data science projects.

In this article, we’ll build a reusable data cleaning and validation pipeline that handles common data quality issues while providing detailed feedback about what was fixed. By the end, you’ll have a tool that can clean datasets and validate them against business rules in just a few lines of code.

🔗 Link to the code on GitHub

 

Why Data Cleaning Pipelines?

 
Think of data pipelines like assembly lines in manufacturing. Each step performs a specific function, and the output from one step becomes the input for the next. This approach makes your code more maintainable, testable, and reusable across different projects.

 

A Simple Data Cleaning Pipeline
Image by Author | diagrams.net (draw.io)

 

Our pipeline will handle three core responsibilities:

  • Cleaning: Remove duplicates and handle missing values (a starting point; add as many cleaning steps as your data needs)
  • Validation: Ensure data meets business rules and constraints
  • Reporting: Track what changes were made during processing

 

Setting Up the Development Environment

 
Please make sure you’re using a recent version of Python. If you’re working locally, create a virtual environment and install the required packages: pandas, NumPy, and Pydantic.

You can also use Google Colab or similar notebook environments if you prefer.

 

Defining the Validation Schema

 
Before we can validate data, we need to define what “valid” looks like. We’ll use Pydantic, a Python library that uses type hints to validate data.

from typing import Any, Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
from pydantic import BaseModel, ValidationError, field_validator


class DataValidator(BaseModel):
    name: str
    age: Optional[int] = None
    email: Optional[str] = None
    salary: Optional[float] = None
    
    @field_validator('age')
    @classmethod
    def validate_age(cls, v):
        if v is not None and (v < 0 or v > 100):
            raise ValueError('Age must be between 0 and 100')
        return v
    
    @field_validator('email')
    @classmethod
    def validate_email(cls, v):
        if v and '@' not in v:
            raise ValueError('Invalid email format')
        return v

 

This schema models the expected data using Pydantic’s syntax. To use the @field_validator decorator, you’ll need the @classmethod decorator as well. The validation logic ensures that age falls within reasonable bounds and that emails contain the ‘@’ symbol.
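As a quick sanity check, you can exercise the schema on its own before wiring it into the pipeline. Here’s a minimal sketch (the sample values are illustrative): a well-formed record comes back as a model instance, while an out-of-range age raises a ValidationError.

from pydantic import ValidationError

# A well-formed record validates and comes back as a model instance
print(DataValidator(name='Tara Jamison', age=25, email='taraj@email.com'))

# An out-of-range age trips the custom validator
try:
    DataValidator(name='Jane Smith', age=150)
except ValidationError as e:
    print(e)  # includes our message: Age must be between 0 and 100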

 

Building the Pipeline Class

 
Our main pipeline class encapsulates all cleaning and validation logic:

class DataPipeline:
    def __init__(self):
        self.cleaning_stats = {'duplicates_removed': 0, 'nulls_handled': 0, 'validation_errors': 0}

 

The constructor initializes a statistics dictionary to track the changes made during processing. This gives you a closer look at data quality and keeps a record of the cleaning steps applied over time.

 

Writing the Data Cleaning Logic

 
Let’s add a clean_data method to handle common data quality issues like missing values and duplicate records:

def clean_data(self, df: pd.DataFrame) -> pd.DataFrame:
    initial_rows = len(df)
    
    # Remove duplicates
    df = df.drop_duplicates()
    self.cleaning_stats['duplicates_removed'] = initial_rows - len(df)
    
    # Handle missing values
    nulls_before = int(df.isnull().sum().sum())
    
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())
    
    string_columns = df.select_dtypes(include=['object']).columns
    df[string_columns] = df[string_columns].fillna('Unknown')
    
    self.cleaning_stats['nulls_handled'] = nulls_before
    return df

 

This approach handles different data types sensibly. Numeric missing values get filled with the median (more robust than the mean against outliers), while text columns get a placeholder value. The duplicate removal happens first so that repeated rows don’t skew the median calculations.
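To see why the median is the safer default, compare it with the mean on a small column that contains one extreme outlier (the numbers below are illustrative):

import pandas as pd

salaries = pd.Series([50000, 52000, 51000, None, 1_000_000])  # one extreme outlier
print(salaries.mean())    # 288250.0 -- pulled up by the outlier
print(salaries.median())  # 51500.0  -- barely affected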

 

Adding Validation with Error Tracking

 
The validation step processes each row individually, collecting both valid data and detailed error information:

def validate_data(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, List[Dict[str, Any]]]:
    valid_rows = []
    errors = []
    
    for idx, row in df.iterrows():
        try:
            validated_row = DataValidator(**row.to_dict())
            valid_rows.append(validated_row.model_dump())
        except ValidationError as e:
            errors.append({'row': idx, 'errors': str(e)})
    
    self.cleaning_stats['validation_errors'] = len(errors)
    return pd.DataFrame(valid_rows), errors

 

This row-by-row approach ensures that one bad record doesn’t crash the entire pipeline. Valid rows continue through the process while errors are captured for review. This is important in production environments where you need to process what you can while flagging problems.

 

Orchestrating the Pipeline

 
The process method ties everything together:

def process(self, df: pd.DataFrame) -> Dict[str, Any]:
    cleaned_df = self.clean_data(df.copy())
    validated_df, validation_errors = self.validate_data(cleaned_df)
    
    return {
        'cleaned_data': validated_df,
        'validation_errors': validation_errors,
        'stats': self.cleaning_stats
    }

 

The return value is a comprehensive report that includes the cleaned data, any validation errors, and processing statistics.

 

Putting It All Together

 
Here’s how you’d use the pipeline in practice:

# Create sample messy data
sample_data = pd.DataFrame({
    'name': ['Tara Jamison', 'Jane Smith', 'Lucy Lee', None, 'Clara Clark', 'Jane Smith'],
    'age': [25, -5, 25, 35, 150, -5],
    'email': ['taraj@email.com', 'invalid-email', 'lucy@email.com', 'jane@email.com', 'clara@email.com', 'invalid-email'],
    'salary': [50000, 60000, 50000, None, 75000, 60000]
})

pipeline = DataPipeline()
result = pipeline.process(sample_data)

 

The pipeline automatically removes the duplicate record, fills the missing name with ‘Unknown’, fills the missing salary with the median value, and flags validation errors for the out-of-range ages and the invalid email.
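You can then inspect each part of the report. For the sample data above, the output would look roughly like this (exact numbers depend on the cleaning steps you include):

print(result['stats'])
# e.g. {'duplicates_removed': 1, 'nulls_handled': 2, 'validation_errors': 2}

print(result['cleaned_data'])       # DataFrame of rows that passed validation
print(result['validation_errors'])  # list of {'row': ..., 'errors': ...} entries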

🔗 You can find the complete script on GitHub.

 

Extending the Pipeline

 
This pipeline serves as a foundation you can build upon. Consider these enhancements for your specific needs:

Custom cleaning rules: Add methods for domain-specific cleaning like standardizing phone numbers or addresses.

Configurable validation: Make the Pydantic schema configurable so the same pipeline can handle different data types.

Advanced error handling: Implement retry logic for transient errors or automatic correction for common mistakes.

Performance optimization: For large datasets, consider using vectorized operations or parallel processing.
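As an example of the first idea, here’s a hypothetical extra cleaning method you could add to DataPipeline. The phone_number column and the method itself are illustrative, not part of the script above; it simply keeps the digits in each value:

def standardize_phone_numbers(self, df: pd.DataFrame, column: str = 'phone_number') -> pd.DataFrame:
    # Hypothetical domain-specific rule: '(555) 123-4567' -> '5551234567'
    if column in df.columns:
        df[column] = (
            df[column]
            .astype(str)
            .str.replace(r'\D', '', regex=True)  # drop everything that isn't a digit
            .replace('', pd.NA)                  # turn empty results back into missing values
        )
    return df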

 

Wrapping Up

 
Data pipelines aren’t just about cleaning individual datasets. They’re about building reliable, maintainable systems.

This pipeline approach ensures consistency across your projects and makes it easy to adjust business rules as requirements change. Start with this basic pipeline, then customize it for your specific needs.

The key is having a reliable, reusable system that handles the mundane tasks so you can focus on extracting insights from clean data. Happy data cleaning!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.




