
How to Learn AI for Data Analytics in 2025


 

Data analytics has changed. It is no longer sufficient to know tools like Python, SQL, and Excel to be a data analyst.

As a data professional at a tech company, I am experiencing firsthand the integration of AI into every employee’s workflow. There is an ocean of AI tools that can now access and analyze your entire database and help you build data analytics projects, machine learning models, and web applications in minutes.

If you are an aspiring data professional and aren't using these AI tools, you are losing out. And soon, you will be surpassed by other data analysts who are using AI to optimize their workflows.

In this article, I will walk you through AI tools that will help you stay ahead of the competition and 10X your data analytics workflows.

With these tools, you can:

  • Build and deploy creative portfolio projects to get hired as a data analyst
  • Use plain English to create end-to-end data analytics applications
  • Speed up your data workflows and become a more efficient data analyst

Additionally, this article will be a step-by-step guide on how to use AI tools to build data analytics applications. We will focus on two AI tools in particular – Cursor and Pandas AI.


 

AI Tool 1: Cursor

 
Cursor is an AI code editor that has access to your entire codebase. You just have to type a prompt into Cursor’s chat interface, and it will access all the files in your directory and edit code for you.

If you are a beginner and can’t write a single line of code, you can even start with an empty code folder and ask Cursor to build something for you. The AI tool will then follow your instructions and create code files according to your requirements.

Here is a guide on how you can use Cursor to build an end-to-end data analytics project without writing a single line of code.

 

Step 1: Cursor Installation and Setup

Let’s see how we can use Cursor AI for data analytics.

To install Cursor, just go to www.cursor.com, download the version that is compatible with your OS, follow the installation instructions, and you will be set up in seconds.

Here’s what the Cursor interface looks like:

 

Cursor AI Interface

 

To follow along with this tutorial, download the train.csv file from the Sentiment Analysis Dataset on Kaggle.

Then create a folder named “Sentiment Analysis Project” and move the downloaded train.csv file into it.

Finally, create an empty file named app.py. Your project folder should now look like this:

 

Sentiment Analysis Project Folder

 

This will be our working directory.

Now, open this folder in Cursor by navigating to File -> Open Folder.

The right side of the screen has a chat interface where you can type prompts into Cursor. Notice that there are a few selections here. Let’s select “Agent” in the drop-down.

This tells Cursor to explore your codebase and act as an AI assistant that will refactor and debug your code.

Additionally, you can choose which language model you’d like to use with Cursor (GPT-4o, Gemini-2.5-Pro, etc). I suggest using Claude-4-Sonnet, a model that is well-known for its advanced coding capabilities.

 

Step 2: Prompting Cursor to Build an Application

Let’s now type this prompt into Cursor, asking it to build an end-to-end sentiment analysis model using the training dataset in our codebase:

Create a sentiment analysis web app that:

1. Uses a pre-trained DistilBERT model to analyze the sentiment of text (positive, negative, or neutral)
2. Has a simple web interface where users can enter text and see results
3. Shows the sentiment result with appropriate colors (green for positive, red for negative)
4. Runs immediately without needing any training

Please connect all the files properly so that when I enter text and click analyze, it shows me the sentiment result right away.

 

After you enter this prompt into Cursor, it will automatically generate code files to build the sentiment analysis application.
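To give you a sense of what Cursor typically produces for a prompt like this, here is a minimal sketch of such an app. It assumes the Hugging Face transformers library and Streamlit for the interface; the files Cursor actually generates for you may be structured differently.

# app.py - a rough sketch, not necessarily the exact code Cursor generates
# Assumes: pip install streamlit transformers torch; run with: streamlit run app.py
import streamlit as st
from transformers import pipeline

@st.cache_resource
def load_model():
    # Defaults to a DistilBERT model fine-tuned for sentiment analysis
    return pipeline("sentiment-analysis")

st.title("Sentiment Analysis")
text = st.text_area("Enter text to analyze")

if st.button("Analyze") and text.strip():
    result = load_model()(text)[0]  # e.g. {'label': 'POSITIVE', 'score': 0.99}
    color = "green" if result["label"] == "POSITIVE" else "red"
    st.markdown(
        f"<span style='color:{color}'>{result['label']} ({result['score']:.2f})</span>",
        unsafe_allow_html=True,
    )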
 

Step 3: Accepting Changes and Running Commands

As Cursor creates new files and generates code, you need to click on “Accept” to confirm the changes made by the AI agent.

After Cursor writes out all the code, it might prompt you to run some commands on the terminal. Executing these commands will allow you to install the required dependencies and run the web application.

Just click on “Run,” which allows Cursor to run these commands for us:

 

Run Command Cursor

 

Once Cursor has built the application, it will tell you to copy and paste this link into your browser:

 

Cursor App Link

 

Doing so will lead you to the sentiment analysis web application, which looks like this:

 

Sentiment Analysis App with Cursor

 

This is a fully-fledged web application that employers can interact with. You can paste any sentence into this app and it will predict the sentiment, returning a result to you.

I find tools like Cursor to be incredibly powerful if you are a beginner in the field and want to productionize your projects.

Most data professionals don't know front-end technologies like HTML and CSS, which means we're unable to showcase our projects in an interactive application.

Our code often sits in Kaggle notebooks, which doesn’t give us a competitive advantage over hundreds of other applicants doing the exact same thing.

A tool like Cursor, however, can set you apart from the competition. It can help you turn your ideas into reality by coding out exactly what you tell it to.

 

AI Tool 2: Pandas AI

 
Pandas AI lets you manipulate and analyze Pandas data frames without writing any code.

You just have to type prompts in plain English, which reduces the complexity that comes with performing data preprocessing and EDA.

If you don’t already know, Pandas is a Python library that you can use to analyze and manipulate data.

You read data into something known as a Pandas data frame, which then allows you to perform operations on your data.

Let’s go through an example of how you can perform data preprocessing, manipulation, and analysis with Pandas AI.

For this demo, I will be using the Titanic Survival Prediction dataset on Kaggle (download the train.csv file).

For this analysis, I suggest using a Python notebook environment, like a Jupyter Notebook, a Kaggle Notebook, or Google Colab. The complete code for this analysis can be found in this Kaggle Notebook.

 

Step 1: Pandas AI Installation and Setup

Once you have your notebook environment ready, type the command below to install Pandas AI:

!pip install pandasai

Next, load the Titanic dataframe with the following lines of code:

import pandas as pd

train_data = pd.read_csv('/kaggle/input/titanic/train.csv')

 

Now let’s import the following libraries:

import os
from pandasai import SmartDataframe
from pandasai.llm.openai import OpenAI

 

Next, we must create a Pandas AI object to analyze the Titanic train dataset.

Here’s what this means:

Pandas AI is a library that connects your Pandas data frame to a Large Language Model. You can use Pandas AI to connect to GPT-4o, Claude-3.5, and other LLMs.

By default, Pandas AI uses a language model called Bamboo LLM. To connect Pandas AI to the language model, you can visit this website to get an API key.

Then, enter the API key into this block of code to create a Pandas AI object:

# Set the PandasAI API key
# By default, unless you choose a different LLM, it will use BambooLLM.
# You can get your free API key by signing up at https://app.pandabi.ai
os.environ['PANDASAI_API_KEY'] = 'your-pandasai-api-key'  # Replace with your actual key

# Create SmartDataframe with default LLM (Bamboo)
smart_df = SmartDataframe(train_data) 

 

Personally, I faced some issues retrieving the Bamboo LLM API key, so I decided to get an API key from OpenAI instead and used the GPT-4o model for this analysis.

One caveat to this approach is that OpenAI's API isn't free: you must purchase API credits to use these models.

To do this, navigate to OpenAI's website and purchase credits from the billing page. Then go to the "API keys" page and create your API key.

Now that you have the OpenAI API key, you need to enter it into this block of code to connect the GPT-4o model to Pandas AI:

# Set your OpenAI API key 
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# Initialize OpenAI LLM
llm = OpenAI(api_token=os.environ["OPENAI_API_KEY"], model="gpt-4o")

config = {
    "llm": llm,
    "enable_cache": False,
    "verbose": False,
    "save_logs": True
}

# Create SmartDataframe with explicit configuration
smart_df = SmartDataframe(train_data, config=config)

 

We can now use this Pandas AI object to analyze the Titanic dataset.
 

Step 2: EDA and Data Preprocessing with Pandas AI

First, let’s start with a simple prompt asking Pandas AI to describe this dataset:

smart_df.chat("Can you describe this dataset and provide a summary, format the output as a table.")

You will see a result that looks like this, with a basic statistical summary of the dataset:

 

Titanic Dataset Description

 

Typically we’d write some code to get a summary like this. With Pandas AI, however, we just need to write a prompt.

This will save you a ton of time if you're a beginner who wants to analyze some data but doesn't know how to write Python code.
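For comparison, here is roughly what you would write in plain Pandas to get a similar summary:

# A plain-Pandas summary of the same dataset
train_data.info()
train_data.describe(include="all")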

Next, let’s perform some exploratory data analysis with Pandas AI:

I'm asking it for the relationship between the "Survived" variable and a few other variables in the dataset:

smart_df.chat("Are there correlations between Survived and the following variables: Age, Sex, Ticket Fare. Format this output as a table.")

The above prompt should provide you with correlation coefficients between "Survived" and each of the other variables.
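If you wanted to reproduce this in plain Pandas (using the standard Titanic column names), it would look something like this:

# Encode Sex numerically, then correlate the selected columns with Survived
encoded = train_data.assign(Sex=train_data["Sex"].map({"male": 0, "female": 1}))
encoded[["Survived", "Age", "Sex", "Fare"]].corr()["Survived"]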

Next, let’s ask Pandas AI to help us visualize the relationship between these variables:

1. Survived and Age

smart_df.chat("Can you visualize the relationship between the Survived and Age columns?")

The above prompt should give you a histogram that looks like this:

 

Titanic Dataset Age Distribution

 

This visual tells us that younger passengers were more likely to survive the sinking.

2. Survived and Gender

smart_df.chat("Can you visualize the relationship between the Survived and Sex")

You should get a bar chart showcasing the relationship between “Survived” and “Gender.”

3. Survived and Fare

smart_df.chat("Can you visualize the relationship between the Survived and Fare")

The above prompt rendered a box plot, telling me that passengers who paid higher fare prices were more likely to survive the sinking of the Titanic.

Note that LLMs are non-deterministic, which means that the output you’ll get might differ from mine. However, you will still get a response that will help you better understand the dataset.

Next, we can perform some data preprocessing with prompts like these:

Prompt Example 1

smart_df.chat("Analyze the quality of this dataset. Identify missing values, outliers, and potential data issues that would need to be addressed before we build a model to predict survival.")

Prompt Example 2

smart_df.chat("Let's drop the cabin column from the dataframe as it has too many missing values.")

Prompt Example 3

smart_df.chat("Let's impute the Age column with the median value.")
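For reference, the last two prompts each map to a single line of plain Pandas:

# Drop the Cabin column (too many missing values)
train_data = train_data.drop(columns=["Cabin"])

# Impute missing ages with the median age
train_data["Age"] = train_data["Age"].fillna(train_data["Age"].median())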

If you’d like to go through all the preprocessing steps I used to clean this dataset with Pandas AI, you can find the complete prompts and code in my Kaggle notebook.

In less than 5 minutes, I was able to preprocess this dataset by handling missing values, encoding categorical variables, and creating new features. This was done without writing much Python code, which is especially helpful if you are new to programming.

 

How to Learn AI for Data Analytics: Next Steps

 
In my opinion, the main selling point of tools like Cursor and Pandas AI is that they allow you to analyze data and make code edits within your programming interface.

This is far better than having to copy and paste code out of your programming IDE into an interface like ChatGPT.

Additionally, as your codebase grows (i.e. if you have thousands of lines of code and over 10 datasets), it is incredibly useful to have an integrated AI tool that has all the context and can understand the connection between these code files.

If you’re looking to learn AI for data analytics, here are some more tools that I’ve found helpful:

  • GitHub Copilot: This tool is similar to Cursor. You can use it within your programming IDE to generate code suggestions, and it even has a chat interface you can interact with.
  • Microsoft Copilot in Excel: This AI tool helps you automatically analyze data in your spreadsheets.
  • Python in Excel: This is an extension that allows you to run Python code within Excel. While this isn’t an AI tool, I’ve found it incredibly useful as it allows you to centralize your data analysis without having to switch between different applications.

 
 

Natassha Selvaraj is a self-taught data scientist with a passion for writing. Natassha writes on everything data science-related, a true master of all data topics. You can connect with her on LinkedIn or check out her YouTube channel.




7 DuckDB SQL Queries That Save You Hours of Pandas Work


 

The Pandas library has one of the fastest-growing communities. This popularity has opened the door for alternatives like Polars. In this article, we will explore one such alternative: DuckDB.

DuckDB is an SQL database that you can run right inside your notebook. It is easy to install, needs no server or configuration, and works alongside Pandas.

Unlike other SQL databases, there are no local setup headaches: you are writing queries instantly after installation. DuckDB handles filtering, joins, and aggregations with clean SQL syntax and often performs significantly better than Pandas on large datasets.
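As a quick taste of what this looks like (a minimal sketch, assuming DuckDB is already installed), you can query an in-memory Pandas DataFrame directly:

import duckdb
import pandas as pd

demo = pd.DataFrame({"x": [1, 2, 3]})
# DuckDB can scan Pandas DataFrames in the current scope by variable name
duckdb.sql("SELECT SUM(x) AS total FROM demo").df()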

So enough with the terms, let’s get started!

 

Data Project – Uber Business Modeling

 
We will use DuckDB in a Jupyter Notebook, combining it with Python for data analysis. To make things more exciting, we will work on a real-life data project. Let's get started!

 
A Data Project Example for DuckDB SQL Queries
 

Here is the link to the data project we’ll be using in this article. It’s a data project from Uber called Partner’s Business Modeling.

Uber used this data project in its recruitment process for data science positions, and you will be asked to analyze the data for two different scenarios.

  • Scenario 1: Compare the cost of two bonus programs designed to get more drivers online during a busy day.
  • Scenario 2: Calculate and compare the annual net income of a traditional taxi driver vs one who partners with Uber and buys a car.

 

Loading Dataset

Let's load the dataframe first. We will need it shortly, when we register this dataset with DuckDB in the following sections.

import pandas as pd
df = pd.read_csv("dataset_2.csv")

 

Exploring the Dataset

Here are the first few rows:

 
A Data Project Example for DuckDB SQL Queries
 

Let’s see all the columns.
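One line of Pandas does it:

# List all the column names
df.columns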

 

Here is the output.

 
A Data Project Example for DuckDB SQL Queries

 

Connect DuckDB and Register the DataFrame

Good, it is a really straightforward dataset, but how can we connect DuckDB with this dataset?
First, if you have not installed it yet, install DuckDB.
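In a notebook, installation is a single command:

!pip install duckdb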

 

Connecting with DuckDB is easy. Also, if you want to read the documentation, check it out here.

Now, here is the code to make a connection and register the dataframe.

import duckdb
con = duckdb.connect()

# Register the DataFrame as a table named "data" so the queries below can use FROM data
con.register("data", df)

 
Connect DuckDB and Register the DataFrame
 

Good, let’s start exploring seven queries that will save you hours of Pandas work!

 

1. Multi-Criteria Filtering for Complex Eligibility Rules

 
One of the most significant advantages of SQL is how naturally it handles filtering, especially multi-condition filtering.

 

Implementation of Multi-Criteria Filtering in DuckDB vs Pandas

DuckDB allows you to apply multiple filters using SQL's WHERE clause and boolean logic, which scales well as the number of filters grows.

SELECT 
    *
FROM data
WHERE condition_1
  AND condition_2
  AND condition_3
  AND condition_4

 

Now let's see how we'd write the same logic in Pandas. There, it is expressed using chained boolean masks with brackets, which can get verbose as the conditions pile up.

filtered_df = df[
    (df["condition_1"]) &
    (df["condition_2"]) &
    (df["condition_3"]) &
    (df["condition_4"])
]

 

Both methods are readable for basic use, but DuckDB feels more natural and cleaner as the logic gets more complex.

 

Multi-Criteria Filtering for the Uber Data Project

In this case, we want to find drivers who qualify for a specific Uber bonus program.

According to the rules, the drivers must:

  • Be online for at least 8 hours
  • Complete at least 10 trips
  • Accept at least 90% of ride requests
  • Have a rating of 4.7 or above

Now all we have to do is write a query that applies all these filters. Here is the code.

SELECT 
    COUNT(*) AS qualified_drivers,
    COUNT(*) * 50 AS total_payout
FROM data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7

 

But to execute this code from Python, we need to wrap the query in con.execute(""" ... """) and call .fetchdf(), as shown below:

con.execute("""
SELECT 
    COUNT(*) AS qualified_drivers,
    COUNT(*) * 50 AS total_payout
FROM data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
""").fetchdf()

 

We will do this throughout the article. Now that you know how to run it in a Jupyter notebook, we’ll show only the SQL code from now on, and you’ll know how to convert it to the Pythonic version.
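If you'd rather not repeat that wrapper each time, you could define a small helper like this (purely optional):

def run_sql(query):
    # Run a SQL string against the DuckDB connection and return a Pandas DataFrame
    return con.execute(query).fetchdf()

Every query below can then be run as run_sql(""" ... """).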
Good. Now, remember that the data project wants us to calculate the total payout for Option 1.

 
Multi-Criteria Filtering
 

We've counted the qualified drivers, and we multiply this count by $50, because the payout is $50 per driver; that is what COUNT(*) * 50 does.
Here is the output.

 
Multi-Criteria Filtering

 

2. Fast Aggregation to Estimate Business Incentives

 
SQL is great for quick aggregation, especially when you need to summarize data across rows.

 

Implementation of Aggregation in DuckDB vs Pandas

DuckDB lets you aggregate values across rows using SQL functions like SUM and COUNT in one compact block.

SELECT 
    COUNT(*) AS num_rows,
    SUM(column_name) AS total_value
FROM data
WHERE some_condition

 

In pandas, you first need to filter the dataframe, then separately count and sum using chaining methods.

filtered = df[df["some_condition"]]
num_rows = filtered.shape[0]
total_value = filtered["column_name"].sum()

 

DuckDB is more concise and easier to read, and does not require managing intermediate variables.

 

Aggregation in Uber Data Project

Good, let’s move on to the second bonus scheme, Option 2. According to the project description, drivers will receive $4 per trip if:

  • They complete at least 12 trips.
  • Have a rating of 4.7 or better.

This time, instead of just counting the drivers, we need to add the number of trips they completed since the bonus is paid per trip, not per person.

SELECT 
    COUNT(*) AS qualified_drivers,
    SUM("Trips Completed") * 4 AS total_payout
FROM data
WHERE "Trips Completed" >= 12
  AND Rating >= 4.7

 

The count here tells us how many drivers qualify. However, to calculate the total payout, we will calculate their trips and multiply by $4, as required by Option 2.

 
Aggregation in DuckDB
 

Here is the output.

 
Aggregation in DuckDB
 

With DuckDB, we don't need to loop through the rows or build custom aggregations. The SUM function takes care of everything we need.

 

3. Detect Overlaps and Differences Using Boolean Logic

 
In SQL, you can easily combine the conditions by using Boolean Logic, such as AND, OR, and NOT.

 

Implementation of Boolean Logic in DuckDB vs Pandas

DuckDB supports boolean logic natively in the WHERE clause using AND, OR, and NOT.

SELECT *
FROM data
WHERE condition_a
  AND condition_b
  AND NOT (condition_c)

 

Pandas requires a combination of logical operators with masks and parentheses, including the use of “~” for negation.

filtered = df[
    (df["condition_a"]) &
    (df["condition_b"]) &
    ~(df["condition_c"])
]

 

While both are functional, DuckDB is easier to reason about when the logic involves exclusions or nested conditions.

 

Boolean Logic for Uber Data Project

Now that we have calculated Option 1 and Option 2, what comes next? It is time to compare them. Remember our next question:

 
Boolean Logic in DuckDB
 

This is where we can use Boolean Logic. We’ll use a combination of AND and NOT.

SELECT COUNT(*) AS only_option1
FROM data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
  AND NOT ("Trips Completed" >= 12 AND Rating >= 4.7)

 

Here is the output.

 
Boolean Logic in DuckDB
 

Let’s break it down:

  • The first four conditions are here for Option 1.
  • The NOT(..) part is used to exclude drivers who also qualify for Option 2.

It is pretty straightforward, right?

 

4. Quick Cohort Sizing with Conditional Filters

 
Sometimes, you want to understand how big a specific group or cohort is within your data.

 

Implementation of Conditional Filters in DuckDB vs Pandas

DuckDB handles cohort filtering and percentage calculation with one SQL query, even including subqueries.

SELECT 
  ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM data), 2) AS percentage
FROM data
WHERE condition_1
  AND condition_2
  AND condition_3

 

Pandas requires filtering, counting, and manual division to calculate percentages.

filtered = df[
    (df["condition_1"]) &
    (df["condition_2"]) &
    (df["condition_3"])
]
percentage = round(100.0 * len(filtered) / len(df), 2)

 

DuckDB here is cleaner and faster. It minimizes the number of steps and avoids repeated code.

 

Cohort Sizing For Uber Data Project

Now we are at the last question of Scenario 1. Here, Uber wants us to find the drivers who fell short on certain metrics, like trips completed and acceptance rate, yet still had high ratings. Specifically, the drivers who:

  • Completed less than 10 trips
  • Had an acceptance rate lower than 90%
  • Had a rating higher than 4.7

Now, these are three separate filters, and we want to calculate the percentage of drivers satisfying all of them. Let's see the query.

SELECT 
  ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM data), 2) AS percentage
FROM data
WHERE "Trips Completed" < 10
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) < 90
  AND Rating > 4.7

 

Here is the output.

 
Cohort Sizing in DuckDB
 

Here, we filtered the rows where all three conditions were satisfied, counted them, and divided them by the total number of drivers to get a percentage.
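If you want to sanity-check that number in plain Pandas (assuming the same column names used in the queries above), it only takes a few lines:

# Recompute the cohort percentage directly on the DataFrame
accept_rate = df["Accept Rate"].str.replace("%", "").astype(float)
mask = (df["Trips Completed"] < 10) & (accept_rate < 90) & (df["Rating"] > 4.7)
round(100.0 * mask.sum() / len(df), 2)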

 

5. Basic Arithmetic Queries for Revenue Modeling

 
Now, let’s say you want to do some basic math. You can write expressions directly into your SELECT statement.

 

Implementation of Arithmetic in DuckDB vs Pandas

DuckDB allows arithmetic to be written directly in the SELECT clause like a calculator.

SELECT 
    daily_income * work_days * weeks_per_year AS annual_revenue,
    weekly_cost * weeks_per_year AS total_cost,
    (daily_income * work_days * weeks_per_year) - (weekly_cost * weeks_per_year) AS net_income
FROM data

 

Pandas requires multiple intermediate calculations in separate variables for the same result.

daily_income = 200
weeks_per_year = 49
work_days = 6
weekly_cost = 500

annual_revenue = daily_income * work_days * weeks_per_year
total_cost = weekly_cost * weeks_per_year
net_income = annual_revenue - total_cost

 

DuckDB simplifies the math logic into a readable SQL block, whereas Pandas gets a bit cluttered with variable assignments.

 

Basic Arithmetic in Uber Data Project

In Scenario 2, Uber asked us to calculate how much money (after expenses) a traditional taxi driver makes per year without partnering with Uber. The expenses include gas, vehicle rent, and insurance.

 
Basic Arithmetic in DuckDB
 

Now let’s calculate the annual revenue and subtract the expenses from it.

SELECT 
    200 * 6 * (52 - 3) AS annual_revenue,
    200 * (52 - 3) AS gas_expense,
    500 * (52 - 3) AS rent_expense,
    400 * 12 AS insurance_expense,
    (200 * 6 * (52 - 3)) 
      - (200 * (52 - 3) + 500 * (52 - 3) + 400 * 12) AS net_income

 

Here is the output.

 
Basic Arithmetic in DuckDB
 

With DuckDB, you can write this like a SQL math block. You don't need Pandas DataFrames or manual looping!

 

6. Conditional Calculations for Dynamic Expense Planning

 
What if your cost structure changes based on certain conditions?

 

Implementation of Conditional Calculations in DuckDB vs Pandas

DuckDB lets you apply conditional logic using arithmetic adjustments inside your query.

SELECT 
    original_cost * 1.05 AS increased_cost,
    original_cost * 0.8 AS discounted_cost,
    0 AS removed_cost,
    (original_cost * 1.05 + original_cost * 0.8) AS total_new_cost

 

Pandas uses the same logic with multiple math lines and manual updates to variables.

weeks_worked = 49
gas = 200
insurance = 400

gas_expense = gas * 1.05 * weeks_worked
insurance_expense = insurance * 0.8 * 12
rent_expense = 0
total = gas_expense + insurance_expense

 

DuckDB turns what would be a multi-step logic in pandas into a single SQL expression.

 

Conditional Calculations in Uber Data Project

In this scenario, we now model what happens if the driver partners with Uber and buys a car. The expenses change as follows:

  • Gas cost increases by 5%
  • Insurance decreases by 20%
  • No more rent expense
con.execute("""
SELECT 
    200 * 1.05 * 49 AS gas_expense,
    400 * 0.8 * 12 AS insurance_expense,
    0 AS rent_expense,
    (200 * 1.05 * 49) + (400 * 0.8 * 12) AS total_expense
""").fetchdf()

 

Here is the output.

 
Conditional Calculations in DuckDB

 

7. Goal-Driven Math for Revenue Targeting

 
Sometimes, your analysis can be driven by a business goal like hitting a revenue target or covering a one-time cost.

 

Implementation of Goal-Driven Math in DuckDB vs Pandas

DuckDB handles multi-step logic using CTEs. It makes the query modular and easy to read.

WITH vars AS (
  SELECT base_income, cost_1, cost_2, target_item
),
calc AS (
  SELECT 
    base_income - (cost_1 + cost_2) AS current_profit,
    cost_1 * 1.1 + cost_2 * 0.8 + target_item AS new_total_expense
  FROM vars
),
final AS (
  SELECT 
    current_profit + new_total_expense AS required_revenue,
    required_revenue / 49 AS required_weekly_income
  FROM calc
)
SELECT required_weekly_income FROM final

 

Pandas requires nesting of calculations and reuse of earlier variables to avoid duplication.

weeks = 49
original_income = 200 * 6 * weeks
original_cost = (200 + 500) * weeks + 400 * 12
net_income = original_income - original_cost

# new expenses + car cost
new_gas = 200 * 1.05 * weeks
new_insurance = 400 * 0.8 * 12
car_cost = 40000

required_revenue = net_income + new_gas + new_insurance + car_cost
required_weekly_income = required_revenue / weeks

 

DuckDB allows you to build a logic pipeline step by step, without cluttering your notebook with scattered code.

 

Goal-Driven Math in Uber Data Project

Now that we have modeled the new costs, let’s answer the final business question:

How much more does the driver need to earn per week to do both?

  • Pay off a $40,000 car within a year
  • Maintain the same yearly net income

Now let’s write the code representing this logic.

WITH vars AS (
  SELECT 
    52 AS total_weeks_per_year,
    3 AS weeks_off,
    6 AS days_per_week,
    200 AS fare_per_day,
    400 AS monthly_insurance,
    200 AS gas_per_week,
    500 AS vehicle_rent,
    40000 AS car_cost
),
base AS (
  SELECT 
    total_weeks_per_year,
    weeks_off,
    days_per_week,
    fare_per_day,
    monthly_insurance,
    gas_per_week,
    vehicle_rent,
    car_cost,
    total_weeks_per_year - weeks_off AS weeks_worked,
    (fare_per_day * days_per_week * (total_weeks_per_year - weeks_off)) AS original_annual_revenue,
    (gas_per_week * (total_weeks_per_year - weeks_off)) AS original_gas,
    (vehicle_rent * (total_weeks_per_year - weeks_off)) AS original_rent,
    (monthly_insurance * 12) AS original_insurance
  FROM vars
),
compare AS (
  SELECT *,
    (original_gas + original_rent + original_insurance) AS original_total_expense,
    (original_annual_revenue - (original_gas + original_rent + original_insurance)) AS original_net_income
  FROM base
),
new_costs AS (
  SELECT *,
    gas_per_week * 1.05 * weeks_worked AS new_gas,
    monthly_insurance * 0.8 * 12 AS new_insurance
  FROM compare
),
final AS (
  SELECT *,
    new_gas + new_insurance + car_cost AS new_total_expense,
    original_net_income + new_gas + new_insurance + car_cost AS required_revenue,
    required_revenue / weeks_worked AS required_weekly_revenue,
    original_annual_revenue / weeks_worked AS original_weekly_revenue
  FROM new_costs
)
SELECT 
  ROUND(required_weekly_revenue, 2) AS required_weekly_revenue,
  ROUND(required_weekly_revenue - original_weekly_revenue, 2) AS weekly_uplift
FROM final

 

Here is the output.

 
Goal-Driven Math in DuckDB

 

Final Thoughts

 
In this article, we explored how to connect to DuckDB and analyze data. Instead of writing long chains of Pandas functions, we used SQL queries. We did this using a real-life data project that Uber used in its data scientist recruitment process.
For data scientists working on analysis-heavy tasks, it’s a lightweight but powerful alternative to Pandas. Try using it on your next project, especially when SQL logic fits the problem better.
 
 

Nate Rosidi is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.






Fi.Money Launches Protocol to Connect Personal Finance Data with AI Assistants



Fi.Money, a money management platform based in India, has launched what it says is the first consumer-facing implementation of a model context protocol (MCP) for personal finance. 

Fi MCP is designed to bring together users' complete financial lives, including bank accounts, mutual funds, loans, insurance, EPF, real estate, gold, and more, seamlessly into AI assistants of their choice, the company said in a statement.

Users can choose to share this consolidated data with any AI tool, enabling private, intelligent conversations about their money, fully on their terms, it added. 

Until now, users have had to stitch together insights from various finance apps, statements, and spreadsheets. When turning to AI tools like ChatGPT or Gemini for advice, they’ve relied on manual inputs, guesswork, or generic prompts. 

There was no structured, secure, consent-driven way to help AI understand your actual financial data without sharing screenshots or uploading statements and reports.

The company said that with Fi’s new MCP feature, users can see their entire financial life in a single, unified view. 

This data can be privately exported in an AI-readable format or configured for near-real-time syncing with AI assistants. 

Once connected, users can ask personal, data-specific questions such as, “Can I afford a six-month career break?” or “What are the mistakes in my portfolio?” and receive context-aware responses based on their actual financial information.

As per the statement, the launch comes at a time when Indian consumers are increasingly seeking digital-first, integrated financial tools. Building on India’s pioneering digital infrastructure, Fi’s MCP represents the next layer of consumer-facing innovation, one that empowers consumers to activate their own data. 

Fi Money is the first in the world to let individuals use AI meaningfully with their own money, the company claimed. While most AIs lack context about one’s finances, Fi’s MCP changes that by giving users an AI that actually understands their money.

The Fi MCP is available to all Fi Money users. Any user can download the Fi Money app, consolidate their finances in a few minutes, and start using their data with their preferred AI assistant. 

“This is the first time any personal finance app globally has enabled users to securely connect their actual financial data with tools like ChatGPT, Gemini, or Claude,” Sujith Narayanan, co-founder of Fi.Money, said in the statement.

“With MCP, we’re giving users not just a dashboard, but a secure bridge between their financial data and the AI tools they trust. It’s about helping people ask better questions and get smarter answers about their money,” he added.




BRICS Leaders Call For Data Protection Against Unauthorised AI Use



Leaders from the BRICS coalition of developing countries are set to advocate for safeguards against unauthorised AI usage to prevent excessive data gathering and to establish systems for fair compensation, as outlined in a draft statement seen by Reuters.

Leading tech companies, predominantly located in wealthier nations, have pushed back against demands to pay copyright fees for content used in training AI systems.

On July 6, the heads of the 11 largest emerging economies ratified the Joint Declaration of the 17th BRICS Summit in Rio de Janeiro. 

Prime Minister Narendra Modi stated that India views AI as a tool to augment human values and abilities, emphasising that both concerns and the promotion of innovation in AI governance should be prioritised equally. He stressed the importance of collective efforts in developing Responsible AI.

He argued that in the 21st century, humanity’s prosperity and progress are increasingly reliant on technology, particularly artificial intelligence. While AI offers significant potential to transform daily life, it also raises important concerns related to risks, ethics, and bias. “We see AI as a medium to enhance human values and capabilities,” the Prime Minister said. 

Modi also invited the BRICS partners to the “AI Impact Summit” that India will host next year.

For the first time, AI governance is a key focus in the BRICS agenda, highlighting a Global South perspective on this technology. 

In their joint declaration, the countries recognise that AI offers a unique opportunity for progress. Still, effective global governance is crucial for addressing risks and meeting the needs of all countries, particularly in the Global South.

“A collective global effort is needed to establish AI governance that upholds our shared values, addresses risks, builds trust, and ensures broad and inclusive international collaboration and access,” the countries said in a joint statement.


