Automate Data Quality Reports with n8n: From CSV to Professional Analysis
The Data Quality Bottleneck Every Data Scientist Knows
You’ve just received a new dataset. Before diving into analysis, you need to understand what you’re working with: How many missing values? Which columns are problematic? What’s the overall data quality score?
Most data scientists spend 15-30 minutes manually exploring each new dataset: loading it into pandas, running .info(), .describe(), and .isnull().sum(), then creating visualizations to understand missing data patterns. This routine gets tedious when you’re evaluating multiple datasets daily.
What if you could paste any CSV URL and get a professional data quality report in under 30 seconds? No Python environment setup, no manual coding, no switching between tools.
The Solution: A 4-Node n8n Workflow
n8n (pronounced “n-eight-n”) is an open-source workflow automation platform that connects different services, APIs, and tools through a visual, drag-and-drop interface. While most people associate workflow automation with business processes like email marketing or customer support, n8n can also assist with automating data science tasks that traditionally require custom scripting.
Unlike writing standalone Python scripts, n8n workflows are visual, reusable, and easy to modify. You can connect data sources, perform transformations, run analyses, and deliver results—all without switching between different tools or environments. Each workflow consists of “nodes” that represent different actions, connected together to create an automated pipeline.
Our automated data quality analyzer consists of four connected nodes:
- Manual Trigger – Starts the workflow when you click “Execute”
- HTTP Request – Fetches any CSV file from a URL
- Code Node – Analyzes the data and generates quality metrics
- HTML Node – Creates a beautiful, professional report
Building the Workflow: Step-by-Step Implementation
Prerequisites
- n8n account (free 14-day trial at n8n.io)
- Our pre-built workflow template (JSON file provided)
- Any CSV dataset accessible via public URL (we’ll provide test examples)
Step 1: Import the Workflow Template
Rather than building from scratch, we’ll use a pre-configured template that includes all the analysis logic:
- Download the workflow file
- Open n8n and click “Import from File”
- Select the downloaded JSON file – all four nodes will appear automatically
- Save the workflow with your preferred name
The imported workflow contains four connected nodes with all the complex parsing and analysis code already configured.
Step 2: Understanding Your Workflow
Let’s walk through what each node does:
Manual Trigger Node: Starts the analysis when you click “Execute Workflow.” Perfect for on-demand data quality checks.
HTTP Request Node: Fetches CSV data from any public URL. Pre-configured to handle most standard CSV formats and return the raw text data needed for analysis.
Code Node: The analysis engine, which includes robust CSV parsing logic to handle common variations in delimiter usage, quoted fields, and missing value formats (a Python sketch of this logic appears after the node walkthrough). It automatically:
- Parses CSV data with intelligent field detection
- Identifies missing values in multiple formats (null, empty, “N/A”, etc.)
- Calculates quality scores and severity ratings
- Generates specific, actionable recommendations
HTML Node: Transforms the analysis results into a beautiful, professional report with color-coded quality scores and clean formatting.
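The template implements this analysis in JavaScript inside the Code node. If you want to prototype or verify the same checks outside n8n, here is a minimal Python sketch of the kind of logic involved; the missing-value token list and the per-cell scoring formula are assumptions for illustration, not the template’s exact implementation, so scores may differ.

```python
import io
import urllib.request

import pandas as pd

CSV_URL = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv"
MISSING_TOKENS = ["", "null", "NULL", "N/A", "n/a", "NA", "-"]  # assumed token list

# Fetch the raw CSV text (the HTTP Request node's job in the workflow).
raw = urllib.request.urlopen(CSV_URL).read().decode("utf-8")

# Parse and flag missing values (roughly the Code node's job).
df = pd.read_csv(io.StringIO(raw), na_values=MISSING_TOKENS, keep_default_na=True)

total_cells = df.shape[0] * df.shape[1]
missing_per_column = df.isnull().sum()
quality_score = round(100 * (1 - missing_per_column.sum() / total_cells), 2)

print(f"Records: {df.shape[0]}, Columns: {df.shape[1]}")
print(f"Columns with missing data: {int((missing_per_column > 0).sum())}")
print(f"Overall quality score: {quality_score}%")
```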
Step 3: Customizing for Your Data
To analyze your own dataset:
- Click on the HTTP Request node
- Replace the URL with your CSV dataset URL:
  - Current: https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv
  - Your data: https://your-domain.com/your-dataset.csv
- Save the workflow
That’s it! The analysis logic automatically adapts to different CSV structures, column names, and data types.
Step 4: Execute and View Results
- Click “Execute Workflow” in the top toolbar
- Watch the nodes process – each will show a green checkmark when complete
- Click on the HTML node and select the “HTML” tab to view your report
- Copy the report or take screenshots to share with your team
The entire process takes under 30 seconds once your workflow is set up.
Understanding the Results
The color-coded quality score gives you an immediate assessment of your dataset:
- 95-100%: Perfect (or near perfect) data quality, ready for immediate analysis
- 85-94%: Excellent quality with minimal cleaning needed
- 75-84%: Good quality, some preprocessing required
- 60-74%: Fair quality, moderate cleaning needed
- Below 60%: Poor quality, significant data work required
Note: This implementation uses a straightforward missing-data-based scoring system. Advanced quality metrics like data consistency, outlier detection, or schema validation could be added to future versions.
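To make that note concrete, assuming the score is simply the percentage of non-missing cells, the mapping from score to the bands listed above can be expressed as a small helper:

```python
def quality_band(score: float) -> str:
    """Map a 0-100 quality score to the bands described above."""
    if score >= 95:
        return "Perfect or near perfect: ready for immediate analysis"
    if score >= 85:
        return "Excellent: minimal cleaning needed"
    if score >= 75:
        return "Good: some preprocessing required"
    if score >= 60:
        return "Fair: moderate cleaning needed"
    return "Poor: significant data work required"

print(quality_band(99.42))  # the score from the example report below
```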
Here’s what the final report looks like:
Our example analysis shows a 99.42% quality score – indicating the dataset is largely complete and ready for analysis with minimal preprocessing.
Dataset Overview:
- 173 Total Records: A small but sufficient sample size ideal for quick exploratory analysis
- 21 Total Columns: A manageable number of features that allows focused insights
- 4 Columns with Missing Data: A few select fields contain gaps
- 17 Complete Columns: The majority of fields are fully populated
Testing with Different Datasets
To see how the workflow handles varying data quality patterns, try these example datasets:
- Iris Dataset (https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv): typically shows a perfect score (100%) with no missing values.
- Titanic Dataset (https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv): demonstrates a more realistic 67.6% score due to substantial missing data in columns like Age and Cabin.
- Your Own Data: upload to GitHub raw or use any public CSV URL.
Based on your quality score, you can determine next steps. Above 95%, proceed directly to exploratory data analysis; at 85-94%, do minimal cleaning of the identified problematic columns; at 75-84%, plan moderate preprocessing; at 60-74%, plan targeted cleaning strategies for multiple columns; and below 60%, evaluate whether the dataset is suitable for your analysis goals or whether significant data work is justified. The workflow adapts automatically to any CSV structure, allowing you to quickly assess multiple datasets and prioritize your data preparation efforts.
Next Steps
1. Email Integration
Add a Send Email node to automatically deliver reports to stakeholders by connecting it after the HTML node. This transforms your workflow into a distribution system where quality reports are automatically sent to project managers, data engineers, or clients whenever you analyze a new dataset. You can customize the email template to include executive summaries or specific recommendations based on the quality score.
2. Scheduled Analysis
Replace the Manual Trigger with a Schedule Trigger to automatically analyze datasets at regular intervals, perfect for monitoring data sources that update frequently. Set up daily, weekly, or monthly checks on your key datasets to catch quality degradation early. This proactive approach helps you identify data pipeline issues before they impact downstream analysis or model performance.
3. Multiple Dataset Analysis
Modify the workflow to accept a list of CSV URLs and generate a comparative quality report across multiple datasets simultaneously. This batch processing approach is invaluable when evaluating data sources for a new project or conducting regular audits across your organization’s data inventory. You can create summary dashboards that rank datasets by quality score, helping prioritize which data sources need immediate attention versus those ready for analysis.
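As a rough illustration of this batch idea outside n8n, the sketch below scores several datasets with the same per-cell completeness formula assumed earlier; the URLs are the examples used in this article, and the scores will not necessarily match the template’s exact output.

```python
import pandas as pd

URLS = {
    "recent-grads": "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv",
    "iris": "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv",
    "titanic": "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
}

results = []
for name, url in URLS.items():
    df = pd.read_csv(url)
    score = 100 * (1 - df.isnull().sum().sum() / df.size)  # per-cell completeness
    results.append(
        {"dataset": name, "rows": len(df), "columns": df.shape[1], "quality_score": round(score, 2)}
    )

# Rank datasets by quality score, highest first.
report = pd.DataFrame(results).sort_values("quality_score", ascending=False)
print(report.to_string(index=False))
```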
4. Different File Formats
Extend the workflow to handle other data formats beyond CSV by modifying the parsing logic in the Code node. For JSON files, adapt the data extraction to handle nested structures and arrays, while Excel files can be processed by adding a preprocessing step to convert XLSX to CSV format. Supporting multiple formats makes your quality analyzer a universal tool for any data source in your organization, regardless of how the data is stored or delivered.
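As a hedged sketch of that preprocessing step (the file names are placeholders, and the JSON is assumed to be a top-level list of objects), pandas can handle both conversions before the data enters the workflow:

```python
import json

import pandas as pd

# XLSX -> CSV (reading .xlsx files requires the openpyxl package).
pd.read_excel("dataset.xlsx").to_csv("dataset.csv", index=False)

# Nested JSON -> flat CSV, ready for the same quality checks.
with open("dataset.json") as f:
    records = json.load(f)  # assumes a top-level list of objects
pd.json_normalize(records).to_csv("dataset_flat.csv", index=False)
```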
Conclusion
This n8n workflow demonstrates how visual automation can streamline routine data science tasks while maintaining the technical depth that data scientists require. By leveraging your existing coding background, you can customize the JavaScript analysis logic, extend the HTML reporting templates, and integrate with your preferred data infrastructure — all within an intuitive visual interface.
The workflow’s modular design makes it particularly valuable for data scientists who understand both the technical requirements and business context of data quality assessment. Unlike rigid no-code tools, n8n allows you to modify the underlying analysis logic while providing visual clarity that makes workflows easy to share, debug, and maintain. You can start with this foundation and gradually add sophisticated features like statistical anomaly detection, custom quality metrics, or integration with your existing MLOps pipeline.
Most importantly, this approach bridges the gap between data science expertise and organizational accessibility. Your technical colleagues can modify the code while non-technical stakeholders can execute workflows and interpret results immediately. This combination of technical sophistication and user-friendly execution makes n8n ideal for data scientists who want to scale their impact beyond individual analysis.
Born in India and raised in Japan, Vinod brings a global perspective to data science and machine learning education. He bridges the gap between emerging AI technologies and practical implementation for working professionals, focusing on accessible learning pathways for complex topics like agentic AI, performance optimization, and AI engineering, as well as practical machine learning implementations. He also mentors the next generation of data professionals through live sessions and personalized guidance.
Fi.Money Launches Protocol to Connect Personal Finance Data with AI Assistants
Fi.Money, a money management platform based in India, has launched what it says is the first consumer-facing implementation of a model context protocol (MCP) for personal finance.
Fi MCP is designed to bring together users’ complete financial lives, including bank accounts, mutual funds, loans, insurance, EPF, real estate, gold, and more, seamlessly into the AI assistants of their choice, the company said in a statement.
Users can choose to share this consolidated data with any AI tool, enabling private, intelligent conversations about their money, fully on their terms, it added.
Until now, users have had to stitch together insights from various finance apps, statements, and spreadsheets. When turning to AI tools like ChatGPT or Gemini for advice, they’ve relied on manual inputs, guesswork, or generic prompts.
There was no structured, secure, consent-driven way to help AI understand their actual financial data without sharing screenshots or uploading statements and reports.
The company said that with Fi’s new MCP feature, users can see their entire financial life in a single, unified view.
This data can be privately exported in an AI-readable format or configured for near-real-time syncing with AI assistants.
Once connected, users can ask personal, data-specific questions such as, “Can I afford a six-month career break?” or “What are the mistakes in my portfolio?” and receive context-aware responses based on their actual financial information.
As per the statement, the launch comes at a time when Indian consumers are increasingly seeking digital-first, integrated financial tools. Building on India’s pioneering digital infrastructure, Fi’s MCP represents the next layer of consumer-facing innovation, one that empowers consumers to activate their own data.
Fi Money is the first in the world to let individuals use AI meaningfully with their own money, the company claimed. While most AIs lack context about one’s finances, Fi’s MCP changes that by giving users an AI that actually understands their money.
The Fi MCP is available to all Fi Money users. Any user can download the Fi Money app, consolidate their finances in a few minutes, and start using their data with their preferred AI assistant.
“This is the first time any personal finance app globally has enabled users to securely connect their actual financial data with tools like ChatGPT, Gemini, or Claude,” Sujith Narayanan, co-founder of Fi.Money, said in the statement.
“With MCP, we’re giving users not just a dashboard, but a secure bridge between their financial data and the AI tools they trust. It’s about helping people ask better questions and get smarter answers about their money,” he added.
Large Language Models: A Self-Study Roadmap
Large language models are a big step forward in artificial intelligence. They can predict and generate text that sounds like it was written by a human. LLMs learn the rules of language, like grammar and meaning, which allows them to perform many tasks. They can answer questions, summarize long texts, and even create stories. The growing need for automatically generated and organized content is driving the expansion of the large language model market. According to one report, Large Language Model (LLM) Market Size & Forecast:
“The global LLM Market is currently witnessing robust growth, with estimates indicating a substantial increase in market size. Projections suggest a notable expansion in market value, from USD 6.4 billion in 2024 to USD 36.1 billion by 2030, reflecting a substantial CAGR of 33.2% over the forecast period”
This means 2025 might be the best year to start learning LLMs. Mastering advanced LLM concepts calls for a structured, stepwise approach that covers fundamentals, model architectures, training and optimization, deployment, and advanced retrieval methods. This roadmap presents a step-by-step path to gaining expertise in LLMs. So, let’s get started.
Step 1: Cover the Fundamentals
You can skip this step if you already know the basics of programming, machine learning, and natural language processing. However, if you are new to these concepts, consider learning them from the following resources:
- Programming: You need to learn the basics of programming in Python, the most popular programming language for machine learning. These resources can help you learn Python:
- Machine Learning: After you learn programming, cover the basic concepts of machine learning before moving on to LLMs. The key is to focus on concepts like supervised vs. unsupervised learning, regression, classification, clustering, and model evaluation. The best course I found to learn the basics of ML is:
- Natural Language Processing: It is very important to learn the fundamental topics of NLP if you want to learn LLMs. Focus on the key concepts: tokenization, word embeddings, attention mechanisms, etc. I have given a few resources that might help you learn NLP:
Step 2: Understand Core Architectures Behind Large Language Models
Large language models rely on various architectures, with transformers being the most prominent foundation. Understanding these different architectural approaches is essential for working effectively with modern LLMs. Here are the key topics and resources to enhance your understanding:
- Understand the transformer architecture, with emphasis on self-attention, multi-head attention, and positional encoding.
- Start with Attention Is All You Need, then explore different architectural variants: decoder-only models (GPT series), encoder-only models (BERT), and encoder-decoder models (T5, BART).
- Use libraries like Hugging Face’s Transformers to access and implement various model architectures (see the short sketch after this list).
- Practice fine-tuning different architectures for specific tasks like classification, generation, and summarization.
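As a quick, minimal sketch of the Hugging Face point above (the model names are small, publicly available checkpoints chosen purely for illustration), the pipeline API lets you try an encoder-only and a decoder-only model in a few lines:

```python
from transformers import pipeline

# Encoder-only (BERT): masked-word prediction.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Large language models are built on the [MASK] architecture.")[0]["token_str"])

# Decoder-only (GPT-2): left-to-right text generation.
generator = pipeline("text-generation", model="gpt2")
print(generator("Attention is all you", max_new_tokens=10)[0]["generated_text"])
```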
Recommended Learning Resources
Step 3: Specializing in Large Language Models
With the basics in place, it’s time to focus specifically on LLMs. These courses are designed to deepen your understanding of their architecture, ethical implications, and real-world applications:
- LLM University – Cohere (Recommended): Offers both a sequential track for newcomers and a non-sequential, application-driven path for seasoned professionals. It provides a structured exploration of both the theoretical and practical aspects of LLMs.
- Stanford CS324: Large Language Models (Recommended): A comprehensive course exploring the theory, ethics, and hands-on practice of LLMs. You will learn how to build and evaluate LLMs.
- Maxime Labonne Guide (Recommended): This guide provides a clear roadmap for two career paths: LLM Scientist and LLM Engineer. The LLM Scientist path is for those who want to build advanced language models using the latest techniques. The LLM Engineer path focuses on creating and deploying applications that use LLMs. It also includes The LLM Engineer’s Handbook, which takes you step by step from designing to launching LLM-based applications.
- Princeton COS597G: Understanding Large Language Models: A graduate-level course that covers models like BERT, GPT, T5, and more. Ideal for those aiming to engage in deep technical research, it explores both the capabilities and limitations of LLMs.
- Fine Tuning LLM Models – Generative AI Course: When working with LLMs, you will often need to fine-tune them, so consider learning efficient fine-tuning techniques such as LoRA and QLoRA, as well as model quantization. These approaches can help reduce model size and computational requirements while maintaining performance. This course teaches fine-tuning using QLoRA and LoRA, as well as quantization, using Llama 2, Gradient, and the Google Gemma model (a minimal LoRA setup sketch follows this list).
- Finetune LLMs to teach them ANYTHING with Huggingface and Pytorch | Step-by-step tutorial: It provides a comprehensive guide on fine-tuning LLMs using Hugging Face and PyTorch. It covers the entire process, from data preparation to model training and evaluation, enabling viewers to adapt LLMs for specific tasks or domains.
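As a minimal, hedged sketch of what a LoRA setup looks like with Hugging Face PEFT (GPT-2 is used as a small placeholder base model; dataset preparation and the training loop are omitted):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "gpt2"  # placeholder; swap in Llama 2, Gemma, etc. if you have access
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

lora_config = LoraConfig(
    r=8,             # rank of the low-rank update matrices
    lora_alpha=16,   # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights will train
```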
Step 4: Build, Deploy & Operationalize LLM Applications
Learning a concept theoretically is one thing; applying it practically is another. The former strengthens your understanding of fundamental ideas, while the latter enables you to translate those concepts into real-world solutions. This section focuses on integrating large language models into projects using popular frameworks, APIs, and best practices for deploying and managing LLMs in production and local environments. By mastering these tools, you’ll efficiently build applications, scale deployments, and implement LLMOps strategies for monitoring, optimization, and maintenance.
- Application Development: Learn how to integrate LLMs into user-facing applications or services.
- LangChain: A popular framework for building LLM applications quickly and efficiently. Learn how to build applications with it.
- API Integrations: Explore how to connect various APIs, like OpenAI’s, to add advanced features to your projects (see the sketch after this list).
- Local LLM Deployment: Learn to set up and run LLMs on your local machine.
- LLMOps Practices: Learn the methodologies for deploying, monitoring, and maintaining LLMs in production environments.
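As a minimal sketch of the API-integration bullet above (the model name is a placeholder, and the client reads OPENAI_API_KEY from your environment):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize what an LLM is in one sentence."},
    ],
)
print(response.choices[0].message.content)
```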
Recommended Learning Resources & Projects
Building LLM applications:
Local LLM Deployment:
Deploying & Managing LLM applications In Production Environments:
GitHub Repositories:
- Awesome-LLM: It is a curated collection of papers, frameworks, tools, courses, tutorials, and resources focused on large language models (LLMs), with a special emphasis on ChatGPT.
- Awesome-langchain: This repository is the hub to track initiatives and projects related to LangChain’s ecosystem.
Step 5: RAG & Vector Databases
Retrieval-Augmented Generation (RAG) is a hybrid approach that combines information retrieval with text generation. Instead of relying only on pre-trained knowledge, RAG retrieves relevant documents from external sources before generating responses. This improves accuracy, reduces hallucinations, and makes models more useful for knowledge-intensive tasks.
- Understand RAG and its architectures: Standard RAG, Hierarchical RAG, Hybrid RAG, etc.
- Vector Databases: Understand how to implement vector databases with RAG. Vector databases store and retrieve information based on semantic meaning rather than exact keyword matches. This makes them ideal for RAG-based applications as these allow for fast and efficient retrieval of relevant documents.
- Retrieval Strategies: Implement dense retrieval, sparse retrieval, and hybrid search for better document matching (a dense-retrieval sketch follows this list).
- LlamaIndex & LangChain: Learn how these frameworks facilitate RAG.
- Scaling RAG for Enterprise Applications: Understand distributed retrieval, caching, and latency optimizations for handling large-scale document retrieval.
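Here is a minimal dense-retrieval sketch (a toy in-memory corpus and a small open embedding model; a production system would use a vector database rather than NumPy arrays):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "RAG retrieves relevant documents before generating an answer.",
    "Vector databases index embeddings for semantic search.",
    "LoRA is a parameter-efficient fine-tuning method.",
]
query = "How does retrieval-augmented generation work?"

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)

scores = doc_emb @ query_emb.T  # cosine similarity, since embeddings are normalized
best = int(np.argmax(scores))
print("Top document:", docs[best])  # pass this as context to the generator
```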
Recommended Learning Resources & Projects
Basic Foundational courses:
Advanced RAG Architectures & Implementations:
Enterprise-Grade RAG & Scaling:
Step 6: Optimize LLM Inference
Optimizing inference is crucial for making LLM-powered applications efficient, cost-effective, and scalable. This step focuses on techniques to reduce latency, improve response times, and minimize computational overhead.
Key Topics
- Model Quantization: Reduce model size and improve speed using techniques like 8-bit and 4-bit quantization (e.g., GPTQ, AWQ); a short 4-bit loading sketch follows this list.
- Efficient Serving: Deploy models efficiently with frameworks like vLLM, TGI (Text Generation Inference), and DeepSpeed.
- LoRA & QLoRA: Use parameter-efficient fine-tuning methods to enhance model performance without high resource costs.
- Batching & Caching: Optimize API calls and memory usage with batch processing and caching strategies.
- On-Device Inference: Run LLMs on edge devices using tools like GGUF (for llama.cpp) and optimized runtimes like ONNX and TensorRT.
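As a hedged sketch of 4-bit loading with Transformers and bitsandbytes (the model ID is a placeholder and may require access approval; a CUDA GPU and the bitsandbytes package are assumed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # places layers on the available GPU(s)
)

inputs = tokenizer("Quantization reduces memory use by", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```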
Recommended Learning Resources
- Efficiently Serving LLMs – Coursera – A guided project on optimizing and deploying large language models efficiently for real-world applications.
- Mastering LLM Inference Optimization: From Theory to Cost-Effective Deployment – YouTube – A tutorial discussing the challenges and solutions in LLM inference. It focuses on scalability, performance, and cost management. (Recommended)
- MIT 6.5940 Fall 2024 TinyML and Efficient Deep Learning Computing – It covers model compression, quantization, and optimization techniques to deploy deep learning models efficiently on resource-constrained devices. (Recommended)
- Inference Optimization Tutorial (KDD) – Making Models Run Faster – YouTube – A tutorial from the Amazon AWS team on methods to accelerate LLM runtime performance.
- Large Language Model inference with ONNX Runtime (Kunal Vaishnavi) – A guide on optimizing LLM inference using ONNX Runtime for faster and more efficient execution.
- Run Llama 2 Locally On CPU without GPU GGUF Quantized Models Colab Notebook Demo – A step-by-step tutorial on running LLaMA 2 models locally on a CPU using GGUF quantization.
- Tutorial on LLM Quantization w/ QLoRA, GPTQ and Llamacpp, LLama 2 – Covers various quantization techniques like QLoRA and GPTQ.
- Inference, Serving, PagedAttention and vLLM – Explains inference optimization techniques, including PagedAttention and vLLM, to speed up LLM serving.
Wrapping Up
This guide covers a comprehensive roadmap to learning and mastering LLMs in 2025. I know it might seem overwhelming at first, but trust me — if you follow this step-by-step approach, you’ll cover everything in no time. If you have any questions or need more help, do comment.
Kanwal Mehreen Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
HCLSoftware Launches Domino 14.5 With Focus on Data Privacy and Sovereign AI
HCLSoftware, a global enterprise software leader, launched HCL Domino 14.5 on July 7 as a major upgrade, specifically targeting governments and organisations operating in regulated sectors that are concerned about data privacy and digital independence.
A key feature of the new release is Domino IQ, a sovereign AI extension built into the Domino platform. This new tool gives organisations full control over their AI models and data, helping them comply with regulations such as the European AI Act.
It also removes dependence on foreign cloud services, making it easier for public sector bodies and banks to protect sensitive information.
“The importance of data sovereignty and avoiding unnecessary foreign government influence extends beyond SaaS solutions and AI. Specifically for collaboration – the sensitive data within email, chat, video recordings and documents. With the launch of Domino+ 14.5, HCLSoftware is helping over 200+ government agencies safeguard their sensitive data,” said Richard Jefts, executive vice president and general manager at HCLSoftware.
The updated Domino+ collaboration suite now includes enhanced features for secure messaging, meetings, and file sharing. These tools are ready to deploy and meet the needs of organisations that handle highly confidential data.
The platform is supported by IONOS, a leading European cloud provider. Achim Weiss, CEO of IONOS, added, “Today, more than ever, true digital sovereignty is the key to Europe’s digital future. That’s why at IONOS we are proud to provide the sovereign cloud infrastructure for HCL’s sovereign collaboration solutions.”
Other key updates in Domino 14.5 include achieving BSI certification for information security, the integration of security information and event management (SIEM) tools to enhance threat detection and response, and full compliance with the European Accessibility Act, ensuring that all web-based user experiences are inclusive and accessible to everyone.
With the launch of Domino 14.5, HCLSoftware is aiming to be a trusted technology partner for public sector and highly regulated organisations seeking control, security, and compliance in their digital operations.