Jobs & Careers

A Gentle Introduction to Principal Component Analysis (PCA) in Python

Published

July 4, 2025

Image by Author | Ideogram

Principal component analysis (PCA) is one of the most popular techniques for reducing the dimensionality of high-dimensional data. This is an important data transformation process in various real-world scenarios and industries like image processing, finance, genetics, and machine learning applications where data contains many features that need to be analyzed more efficiently.

The reasons for the significance of dimensionality reduction techniques like PCA are manifold, with three of them standing out:

Efficiency: reducing the number of features in your data lowers the computational cost of data-intensive processes like training advanced machine learning models.
Interpretability: projecting your data into a low-dimensional space while keeping its key patterns and properties makes it easier to interpret and to visualize in 2D or 3D, which sometimes reveals insights.
Noise reduction: high-dimensional data often contain redundant or noisy features that, once detected by methods like PCA, can be eliminated while preserving (or even improving) the effectiveness of subsequent analyses.

Hopefully, at this point I have convinced you about the practical relevance of PCA when handling complex data. If that’s the case, keep reading, as we’ll start getting practical by learning how to use PCA in Python.

How to Apply Principal Component Analysis in Python

Thanks to libraries like Scikit-learn, which provide ready-made implementations of the PCA algorithm, applying it to your data is relatively straightforward, as long as the data are numerical, preprocessed, and free of missing values, with feature values standardized to avoid issues like variance dominance. Standardization is particularly important because PCA is a statistical method that relies on feature variances to determine principal components: new features derived from the original ones and orthogonal to each other.
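To see why standardization matters, here is a minimal sketch on synthetic data (not MNIST). The two features and their scales are assumptions made up for illustration: both are independent, but the second is measured on a scale 1000 times larger. Without scaling, the first principal component is dominated by the large-scale feature; after scaling, the variance is split roughly evenly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features; the second has a 1000x larger scale
X_demo = np.column_stack([
    rng.normal(0, 1, 500),
    rng.normal(0, 1000, 500),
])

# PCA on raw data vs. PCA on standardized data
raw_pca = PCA(n_components=2).fit(X_demo)
scaled_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

print("raw:   ", raw_pca.explained_variance_ratio_)
print("scaled:", scaled_pca.explained_variance_ratio_)

# Principal components are orthogonal unit vectors
gram = scaled_pca.components_ @ scaled_pca.components_.T
print(np.allclose(gram, np.eye(2)))
```

The last check illustrates the orthogonality mentioned above: the matrix of pairwise dot products between components is the identity.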

We will start our example of using PCA from scratch in Python by importing the necessary libraries, loading the MNIST dataset of low-resolution images of handwritten digits, and putting it into a Pandas DataFrame:

import pandas as pd
from torchvision import datasets

# Download the MNIST training split (PIL images paired with integer labels)
mnist_data = datasets.MNIST(root="./data", train=True, download=True)

# Flatten each 28x28 image into one row: [label, pixel_0, ..., pixel_783]
data = []
for img, label in mnist_data:
    img_array = list(img.getdata())
    data.append([label] + img_array)

columns = ["label"] + [f"pixel_{i}" for i in range(28*28)]
mnist_data = pd.DataFrame(data, columns=columns)

In the MNIST dataset, each instance is a 28×28 square image, with a total of 784 pixels, each containing a numerical code associated with its gray level, ranging from 0 for black (no intensity) to 255 for white (maximum intensity). These data must first be rearranged into a one-dimensional array, rather than the two-dimensional 28×28 grid of the original image. This process, called flattening, takes place in the above code. The final dataset, in DataFrame format, contains a total of 785 variables: one for each of the 784 pixels plus the label, an integer between 0 and 9 indicating the digit originally written in the image.
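As a small illustration of flattening, the sketch below uses NumPy on a made-up 28×28 array (a stand-in for a real MNIST image, not data from the dataset). It shows how the pixel at row r and column c ends up at position r * 28 + c in the flattened vector when the image is read row by row.

```python
import numpy as np

# Hypothetical 28x28 grayscale "image" with values in 0..255
image = (np.arange(28 * 28) % 256).reshape(28, 28)

# Flatten row by row into a 1-D vector of 784 values
flat = image.reshape(-1)
print(flat.shape)

# The pixel at (row, col) lands at index row * 28 + col
row, col = 3, 5
print(flat[row * 28 + col] == image[row, col])
```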

MNIST Dataset | Source: TensorFlow

In this example, we won’t need the label, which is useful for other use cases like image classification. Since we may want it for future analysis, we will store it in its own variable, separate from the pixel features:

X = mnist_data.drop('label', axis=1)

y = mnist_data.label

Although we will not apply a supervised learning technique after PCA, we may want to do so in future analyses, so we will split the dataset into training (80%) and testing (20%) subsets. There is another reason for doing this, which I will clarify a bit later.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state=42)

Preprocessing the data and making it suitable for the PCA algorithm is as important as applying the algorithm itself. In our example, preprocessing entails scaling the original pixel intensities in the MNIST dataset to a standardized range with a mean of 0 and a standard deviation of 1 so that all features have equal contribution to variance computations, avoiding dominance issues in certain features. To do this, we will use the StandardScaler class from sklearn.preprocessing, which standardizes numerical features:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Notice the use of fit_transform for the training data, whereas for the test data we used transform. This is the other reason we split the data earlier: in data transformations like standardization of numerical attributes, the transformation must be consistent across the training and test sets. The fit_transform method is used on the training data because it first calculates the statistics that drive the transformation (here, each feature’s mean and standard deviation) from the training set (fitting) and then applies the transformation. The transform method, used on the test data, applies the same transformation “learned” from the training data without recomputing those statistics. This ensures that a downstream model sees the test data on the same scale as the training data, preserving consistency and avoiding data leakage.
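The difference between the two methods can be checked on a tiny, made-up example (the numbers below are assumptions chosen for easy arithmetic, not MNIST values). The scaler’s mean and standard deviation come only from the training values, and the test values are shifted and scaled with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data: training mean is 10
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[10.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train, then scales
test_scaled = scaler.transform(test)        # reuses the training mean/std

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())

# The test set is NOT re-centered on its own mean (25); it uses the training mean (10)
expected = (test - train.mean()) / train.std()
print(np.allclose(test_scaled, expected))
```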

Now we can apply the PCA algorithm. In Scikit-learn’s implementation, PCA takes an important argument: n_components. When set to an integer, it is the number of principal components to retain. When set to a float between 0 and 1, it is the proportion of the data’s variance that the retained components must explain: values closer to 1 retain more components and capture more variance, whereas values closer to 0 keep fewer components and apply a more aggressive dimensionality reduction. For example, setting n_components to 0.95 retains just enough components to capture 95% of the variance in the (standardized) data, which may be appropriate for reducing dimensionality while preserving most of the information. If the dimensionality drops significantly under this setting, many of the original features carried little variance of their own or were largely redundant with other features.

from sklearn.decomposition import PCA

pca = PCA(n_components = 0.95)
X_train_reduced = pca.fit_transform(X_train_scaled)

X_train_reduced.shape

Using the shape attribute of the resulting dataset after applying PCA, we can see that the dimensionality of the data has been drastically reduced from 784 features to just 325, while still retaining 95% of the variance in the standardized training data.
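To inspect how many components were kept and how much variance they explain, a fitted PCA object exposes n_components_ and explained_variance_ratio_. The sketch below is an illustration on a stand-in dataset: it assumes Scikit-learn’s small built-in 8×8 digits dataset (load_digits) rather than MNIST, so that it runs quickly and without downloads, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for MNIST: 1,797 digit images of 8x8 pixels (64 features)
X_small, _ = load_digits(return_X_y=True)
X_small_scaled = StandardScaler().fit_transform(X_small)

# Keep just enough components to explain 95% of the variance
pca_small = PCA(n_components=0.95).fit(X_small_scaled)

# Cumulative share of variance explained by the retained components
cumulative = np.cumsum(pca_small.explained_variance_ratio_)

print("components kept:", pca_small.n_components_, "of", X_small.shape[1])
print("variance explained:", cumulative[-1])
```

The same two attributes can be read from the pca object fitted on MNIST above to confirm the component count and the explained variance.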

Is this a good result? The answer largely depends on the later application or type of analysis you want to perform with your reduced data. For instance, if you want to build a classifier of digit images, you can build two classification models: one trained on the original, high-dimensional dataset, and one trained on the reduced dataset. If the second classifier shows no significant loss of accuracy, good news: you have a faster classifier (dimensionality reduction normally implies greater efficiency in training and inference) with classification performance similar to that obtained with the original data.
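The comparison described above could look like the following sketch. It rests on several assumptions that are not part of the original tutorial: it uses Scikit-learn’s small 8×8 digits dataset instead of MNIST to keep the run short, logistic regression as an arbitrary choice of classifier, and pipelines so that scaling and PCA are fitted on the training data only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset: 8x8 digit images (64 features)
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=0.2, random_state=42)

# Model 1: all original features
full_model = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=2000))

# Model 2: same classifier on PCA-reduced features (95% of variance)
reduced_model = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=2000))

full_model.fit(X_tr, y_tr)
reduced_model.fit(X_tr, y_tr)

acc_full = full_model.score(X_te, y_te)
acc_reduced = reduced_model.score(X_te, y_te)
kept = reduced_model.named_steps["pca"].n_components_

print(f"original features: {X_d.shape[1]}, accuracy: {acc_full:.3f}")
print(f"reduced features:  {kept}, accuracy: {acc_reduced:.3f}")
```

If the two accuracies are close, the reduced representation is a reasonable trade-off for this task; results on MNIST itself may differ.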

Wrapping Up

This article illustrated through a Python step-by-step tutorial how to apply the PCA algorithm from scratch, starting from a dataset of handwritten digit images with high dimensionality.

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.

Piyush Goyal Announces Second Tranche of INR 10,000 Cr Deep Tech Fund

Published

July 6, 2025

Siddharth Jindal

IIT Madras and its alumni association (IITMAA) held the sixth edition of their global innovation and alumni summit, ‘Sangam 2025’, in Bengaluru on 4 and 5 July. The event brought together over 500 participants, including faculty, alumni, entrepreneurs, investors and students.

Union Commerce and Industry Minister Shri Piyush Goyal, addressing the summit, announced a second tranche of ₹10,000 crore under the government’s ‘Fund of Funds’, this time focused on supporting India’s deep tech ecosystem. “This money goes to promote innovation, absorption of newer technologies and development of contemporary fields,” he said.

The Minister added that guidelines for the fund are currently being finalised to direct capital towards strengthening the entire technology lifecycle, not just startups, from early-stage research through to commercial deployment.

He also referred to the recent Cabinet decision approving $12 billion (₹1 lakh crore) for the Department of Science and Technology in the form of a zero-interest 50-year loan. “It gives us more flexibility to provide equity support, grant support, low-cost support and roll that support forward as technologies get fine-tuned,” he said.

Goyal said the government’s push for indigenous innovation stems from cost advantages as well. “When we work on new technologies in India, our cost is nearly one-sixth, one-seventh of what it would cost in Switzerland or America,” he said.

The Minister underlined the government’s focus on emerging technologies such as artificial intelligence, machine learning, and data analytics. “Today, our policies are structured around a future-ready India… an India that is at the forefront of Artificial Intelligence, Machine Learning, computing and data analytics,” he said.

He also laid out a growth trajectory for the Indian economy. “From the 11th largest GDP in the world, we are today the fifth largest. By the end of Calendar year 2025, or maybe anytime during the year, we will be the fourth-largest GDP in the world. By 2027, we will be the third largest,” Goyal said.

Sangam 2025 featured a pitch fest that saw 20 deep tech and AI startups present to over 250 investors and venture capitalists. Selected startups will also receive institutional support from the IIT Madras Innovation Ecosystem, which has incubated over 500 ventures in the last decade.

Key speakers included Aparna Chennapragada (Chief Product Officer, Microsoft), Srinivas Narayanan (VP Engineering, OpenAI), and Tarun Mehta (Co-founder and CEO, Ather Energy), all IIT Madras alumni. The summit also hosted Kris Gopalakrishnan (Axilor Ventures, Infosys), Dr S. Somanath (former ISRO Chairman) and Bengaluru South MP Tejasvi Surya.

Prof. V. Kamakoti, Director, IIT Madras, said, “IIT Madras is committed to playing a pivotal role in shaping ‘Viksit Bharat 2047’. At the forefront of its agenda are innovation and entrepreneurship, which are key drivers for National progress.”

Ms. Shyamala Rajaram, President of IITMAA, said, “Sangam 2025 is a powerful confluence of IIT Madras and its global alumni — sparking bold conversations on innovation and entrepreneurship.”

Prof. Ashwin Mahalingam, Dean (Alumni and Corporate Relations), IIT Madras, added, “None of this would be possible without the unwavering support of our alumni community. Sangam 2025 embodies the strength of that network.”

Serve Machine Learning Models via REST APIs in Under 10 Minutes

Published

July 4, 2025

Kanwal Mehreen

Image by Author | Canva

If you like building machine learning models and experimenting with new stuff, that’s really cool — but to be honest, it only becomes useful to others once you make it available to them. For that, you need to serve it — expose it through a web API so that other programs (or humans) can send data and get predictions back. That’s where REST APIs come in.

In this article, you will learn how to go from a simple machine learning model to a production-ready API using FastAPI, one of Python’s fastest and most developer-friendly web frameworks, in just under 10 minutes. And we won’t stop at a “make it run” demo; we will also add things like:

Validating incoming data
Logging every request
Adding background tasks to avoid slowdowns
Gracefully handling errors

So, let me just quickly show you how our project structure is going to look before we move to the code part:

ml-api/
│
├── model/
│   ├── train_model.py        # Script to train and save the model
│   └── iris_model.pkl        # Trained model file
│
├── app/
│   ├── main.py               # FastAPI app
│   └── schema.py             # Input data schema using Pydantic
│
├── requirements.txt          # All dependencies
└── README.md                 # Optional documentation

Step 1: Install What You Need

We’ll need a few Python packages for this project: FastAPI for the API, Scikit-learn for the model, and a few helpers like joblib and pydantic. You can install them using pip:

pip install fastapi uvicorn scikit-learn joblib pydantic

And save your environment:

pip freeze > requirements.txt

Step 2: Train and Save a Simple Model

Let’s keep the machine learning part simple so we can focus on serving the model. We’ll use the famous Iris dataset and train a random forest classifier to predict the type of iris flower based on its petal and sepal measurements.

Here’s the training script. Create a file called train_model.py in a model/ directory:

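from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import joblib, os

# Load the data and hold out 20% for later evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train on the training portion only (features first, then labels)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Save the model; the path is relative to the directory you run the script from
os.makedirs("model", exist_ok=True)
joblib.dump(clf, "model/iris_model.pkl")
print("✅ Model saved to model/iris_model.pkl")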

This script loads the data, splits it, trains the model, and saves it using joblib. Run it once to generate the model file:

python model/train_model.py
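Before wiring the model into an API, it can be worth checking that a saved model loads back and predicts as expected. The following is a minimal, self-contained sketch of that round trip. As an assumption made for this example, it retrains a model and writes it to a temporary directory instead of reading model/iris_model.pkl, so it can run on its own; the fixed random_state is also an addition for reproducibility.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Save to a throwaway directory, then load the model back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_model.pkl")
    joblib.dump(clf, path)
    loaded = joblib.load(path)

# The reloaded model should behave exactly like the original
same_predictions = bool((loaded.predict(X_test) == clf.predict(X_test)).all())
test_accuracy = loaded.score(X_test, y_test)

print("identical predictions:", same_predictions)
print("held-out accuracy:", test_accuracy)
```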

Step 3: Define What Input Your API Should Expect

Now we need to define how users will interact with your API. What should they send, and in what format?

We’ll use Pydantic, the data validation library that FastAPI builds on, to create a schema that describes and validates incoming data. Specifically, we’ll ensure that users provide four positive float values: sepal length, sepal width, petal length, and petal width.

In a new file app/schema.py, add:

from pydantic import BaseModel, Field

class IrisInput(BaseModel):
    sepal_length: float = Field(..., gt=0, lt=10)
    sepal_width: float = Field(..., gt=0, lt=10)
    petal_length: float = Field(..., gt=0, lt=10)
    petal_width: float = Field(..., gt=0, lt=10)

Here, we’ve added value constraints (greater than 0 and less than 10) to keep our inputs clean and realistic.
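To see the constraints in action outside the API, the sketch below repeats the schema so it runs on its own and tries one valid and one invalid input. The negative sepal length is a made-up value used only to trigger the validation error.

```python
from pydantic import BaseModel, Field, ValidationError

class IrisInput(BaseModel):
    sepal_length: float = Field(..., gt=0, lt=10)
    sepal_width: float = Field(..., gt=0, lt=10)
    petal_length: float = Field(..., gt=0, lt=10)
    petal_width: float = Field(..., gt=0, lt=10)

# A valid input passes and is exposed as typed attributes
ok = IrisInput(sepal_length=6.1, sepal_width=2.8,
               petal_length=4.7, petal_width=1.2)
print(ok.petal_width)

# An out-of-range value raises ValidationError naming the offending field
bad_fields = []
try:
    IrisInput(sepal_length=-1, sepal_width=2.8,
              petal_length=4.7, petal_width=1.2)
except ValidationError as exc:
    bad_fields = [err["loc"][0] for err in exc.errors()]

print(bad_fields)
```

When the same schema is used as a FastAPI request body, an invalid payload is rejected with a 422 response before the endpoint function runs.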

Step 4: Create the API

Now it’s time to build the actual API. We’ll use FastAPI to:

Load the model
Accept JSON input
Predict the class and probabilities
Log the request in the background
Return a clean JSON response

Let’s write the main API code inside app/main.py:

from fastapi import FastAPI, HTTPException, BackgroundTasks
from app.schema import IrisInput
import numpy as np, joblib, logging

# Load the model
model = joblib.load("model/iris_model.pkl")

# Set up logging
logging.basicConfig(filename="api.log", level=logging.INFO,
                    format="%(asctime)s - %(message)s")

# Create the FastAPI app
app = FastAPI()

@app.post("/predict")
def predict(input_data: IrisInput, background_tasks: BackgroundTasks):
    try:
        # Format the input as a NumPy array
        data = np.array([[input_data.sepal_length,
                          input_data.sepal_width,
                          input_data.petal_length,
                          input_data.petal_width]])
        
        # Run prediction
        pred = model.predict(data)[0]
        proba = model.predict_proba(data)[0]
        species = ["setosa", "versicolor", "virginica"][pred]

        # Log in the background so it doesn’t block response
        background_tasks.add_task(log_request, input_data, species)

        # Return prediction and probabilities
        return {
            "prediction": species,
            "class_index": int(pred),
            "probabilities": {
                "setosa": float(proba[0]),
                "versicolor": float(proba[1]),
                "virginica": float(proba[2])
            }
        }

    except Exception:
        logging.exception("Prediction failed")
        raise HTTPException(status_code=500, detail="Internal error")

# Background logging task
def log_request(data: IrisInput, prediction: str):
    # model_dump() is the Pydantic v2 method; on Pydantic v1 use data.dict()
    logging.info(f"Input: {data.model_dump()} | Prediction: {prediction}")

Let’s pause and understand what’s happening here.

We load the model once when the app starts. When a user hits the /predict endpoint with valid JSON input, we convert that into a NumPy array, pass it through the model, and return the predicted class and probabilities. If something goes wrong, we log it and return a friendly error.

Notice the BackgroundTasks part — this is a neat FastAPI feature that lets us do work after the response is sent (like saving logs). That keeps the API responsive and avoids delays.

Step 5: Run Your API

To launch the server, use uvicorn like this:

uvicorn app.main:app --reload

Visit: http://127.0.0.1:8000/docs
You’ll see an interactive Swagger UI where you can test the API.
Try this sample input:

{
  "sepal_length": 6.1,
  "sepal_width": 2.8,
  "petal_length": 4.7,
  "petal_width": 1.2
}

Or you can use curl to make the request like this:

curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d \
'{
  "sepal_length": 6.1,
  "sepal_width": 2.8,
  "petal_length": 4.7,
  "petal_width": 1.2
}'

Both of them generate the same response:

{
  "prediction": "versicolor",
  "class_index": 1,
  "probabilities": {
    "setosa": 0.0,
    "versicolor": 1.0,
    "virginica": 0.0
  }
}
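The same request can also be sent from Python. The sketch below uses only the standard library; the helper names build_request and predict_remote are hypothetical, and the URL assumes the server is running locally on the default uvicorn port as above. Only the request is built when the script runs; calling predict_remote requires the API to be up.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/predict"  # local server address assumed

payload = {
    "sepal_length": 6.1,
    "sepal_width": 2.8,
    "petal_length": 4.7,
    "petal_width": 1.2,
}

def build_request(features: dict, url: str = API_URL) -> urllib.request.Request:
    """Build a JSON POST request for the /predict endpoint."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict_remote(features: dict, url: str = API_URL) -> dict:
    """Send the request and decode the JSON reply (server must be running)."""
    with urllib.request.urlopen(build_request(features, url), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

request = build_request(payload)
print(request.get_method(), request.full_url)
# With the server running: print(predict_remote(payload))
```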

Optional Step: Deploy Your API

You can deploy the FastAPI app on:

Render.com (zero config deployment)
Railway.app (for continuous integration)
Heroku (via Docker)

You can also extend this into a production-ready service by adding authentication (such as API keys or OAuth) to protect your endpoints, monitoring requests with Prometheus and Grafana, and using Redis or Celery for background job queues. You can also refer to my article: Step-by-Step Guide to Deploying Machine Learning Models with Docker.

Wrapping Up

That’s it, and it’s already better than most demos. What we’ve built is more than just a toy example. It:

Validates input data automatically
Returns meaningful responses with prediction confidence
Logs every request to a file (api.log)
Uses background tasks so the API stays fast and responsive
Handles failures gracefully

And all of it in under 100 lines of code.

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.

AI-Powered Face Authentication Hits Record 15.87 Crore in June as Aadhaar Transactions Soar

Published

July 4, 2025

C P Balasubramanyam

The adoption of artificial intelligence in India’s digital identity infrastructure is scaling new highs, with Aadhaar’s AI-driven face authentication technology witnessing an unprecedented 15.87 crore transactions in June 2025.

This marks a dramatic surge from 4.61 crore transactions recorded in the same month last year, showcasing the growing trust and reliance on facial biometrics for secure and convenient identity verification, according to an official statement from the electronics & IT ministry.

According to data released by the Unique Identification Authority of India (UIDAI), a total of 229.33 crore Aadhaar authentication transactions were carried out in June 2025, reflecting a 7.8% year-on-year growth.

The steady rise highlights Aadhaar’s critical role in India’s expanding digital economy and its function as an enabler for accessing welfare schemes and public services.

Since its inception, Aadhaar has facilitated over 15,452 crore authentication transactions.

The AI/ML-powered face authentication solution, developed in-house by UIDAI, operates seamlessly across Android and iOS platforms, allowing users to verify their identity with a simple face scan, the ministry informed.

“This not only enhances user convenience but also strengthens the overall security framework,” it said.

More than 100 government ministries, departments, financial institutions, oil marketing companies, and telecom service providers are actively using face authentication to ensure smoother, faster, and safer delivery of services and entitlements.

The system’s rapid expansion underscores how AI is reshaping the landscape of digital public infrastructure in India, it said.

UIDAI’s face authentication technology, with nearly 175 crore cumulative transactions so far, is increasingly becoming central to Aadhaar’s verification ecosystem.

In addition to face authentication, Aadhaar’s electronic Know Your Customer (e-KYC) service recorded over 39.47 crore transactions in June 2025 alone. E-KYC continues to streamline onboarding and compliance processes across banking, financial services, and other regulated sectors, reinforcing Aadhaar’s position as a foundation for ease of living and doing business in India, the ministry shared.
