AWS Launches Space Accelerator to Support 40 Startups Across India, Japan, and Australia
Amazon Web Services (AWS) has launched the AWS Space Accelerator: APJ 2025, a 10-week programme for space-tech startups in India, Japan, Australia, and New Zealand.
The initiative aims to support up to 40 startups working in space infrastructure, earth observation, and drone systems by offering business mentorship, AWS technical support, and up to $100,000 in cloud credits.
Applications are open from July 8 to September 5, 2025, with the programme starting in September and culminating in a demo day in December 2025. The initiative will run both virtually and in person.
AWS partners such as T-Hub, Minfy, Fusic, and Ansys, alongside space agencies like IN-SPACe and the Australian Space Agency, will help deliver the programme.
T-Hub will manage operations and host the initiative in India. Minfy Technologies will provide AWS training and technical support in both India and Australia.
In Japan, Fusic will offer technical mentorship, while Ansys will assist startups globally with simulation and design testing.
Startups in Different Sectors
This accelerator builds upon the 2024 India edition, which supported 24 startups across various fields, including propulsion, satellite imagery, and quantum key distribution. Several of these participants secured customer contracts or investments and advanced core technologies.
The new programme targets startups focused on earth observation and remote sensing, satellite manufacturing and propulsion, and drone technologies that complement space infrastructure. These focus areas aim to improve agriculture, climate resilience, and connectivity in remote areas.
By leveraging AWS cloud tools, participants can experiment rapidly, reduce costs, and validate solutions before large-scale deployment, addressing key challenges such as capital-intensive testing and the need for specialised talent.
Local Partnerships to Boost Innovation
Clint Crosier, director of aerospace and satellite at AWS, said, “The collaboration with Australian Space Agency, IN-SPACe, iLAuNCH, and Sky Perfect JSAT underscores our commitment to working with local space agencies and industry leaders.”
He added that through this accelerator programme, the company aims not only to support individual startups but also to build a robust community that can drive economic growth and technological advancement throughout the region.
The accelerator arrives at a time of regional momentum: India’s space industry is projected to grow to $44 billion by 2033, while Japan aims to expand its space economy to ¥8 trillion (~$52 billion) by the early 2030s.
Building Modern Data Lakehouses on Google Cloud with Apache Iceberg and Apache Spark
Sponsored Content
The landscape of big data analytics is constantly evolving, with organizations seeking more flexible, scalable, and cost-effective ways to manage and analyze vast amounts of data. This pursuit has led to the rise of the data lakehouse paradigm, which combines the low-cost storage and flexibility of data lakes with the data management capabilities and transactional consistency of data warehouses. At the heart of this revolution are open table formats like Apache Iceberg and powerful processing engines like Apache Spark, all empowered by the robust infrastructure of Google Cloud.
The Rise of Apache Iceberg: A Game-Changer for Data Lakes
For years, data lakes, typically built on cloud object storage like Google Cloud Storage (GCS), offered unparalleled scalability and cost efficiency. However, they often lacked the crucial features found in traditional data warehouses, such as transactional consistency, schema evolution, and performance optimizations for analytical queries. This is where Apache Iceberg shines.
Apache Iceberg is an open table format designed to address these limitations. It sits on top of your data files (like Parquet, ORC, or Avro) in cloud storage, providing a layer of metadata that transforms a collection of files into a high-performance, SQL-like table. Here’s what makes Iceberg so powerful:
- ACID Compliance: Iceberg brings Atomicity, Consistency, Isolation, and Durability (ACID) properties to your data lake. This means that data writes are transactional, ensuring data integrity even with concurrent operations. No more partial writes or inconsistent reads.
- Schema Evolution: One of the biggest pain points in traditional data lakes is managing schema changes. Iceberg handles schema evolution seamlessly, allowing you to add, drop, rename, or reorder columns without rewriting the underlying data. This is critical for agile data development.
- Hidden Partitioning: Iceberg intelligently manages partitioning, abstracting away the physical layout of your data. Users no longer need to know the partitioning scheme to write efficient queries, and you can evolve your partitioning strategy over time without data migrations.
- Time Travel and Rollback: Iceberg maintains a complete history of table snapshots. This enables “time travel” queries, allowing you to query data as it existed at any point in the past. It also provides rollback capabilities, letting you revert a table to a previous good state, invaluable for debugging and data recovery.
- Performance Optimizations: Iceberg’s rich metadata allows query engines to prune irrelevant data files and partitions efficiently, significantly accelerating query execution. It avoids costly file listing operations, directly jumping to the relevant data based on its metadata.
By providing these data warehouse-like features on top of a data lake, Apache Iceberg enables the creation of a true “data lakehouse,” offering the best of both worlds: the flexibility and cost-effectiveness of cloud storage combined with the reliability and performance of structured tables.
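To make these capabilities more concrete, here is a minimal sketch using Spark SQL. It assumes a SparkSession named spark that is already configured with an Iceberg catalog called demo (with the Iceberg SQL extensions enabled) and an existing table demo.db.events; the table, column, timestamp, and snapshot ID below are illustrative, not taken from this article.
Python
# Schema evolution: add a column without rewriting existing data files
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (country STRING)")

# Time travel: query the table as it existed at an earlier point in time
spark.sql("""
    SELECT * FROM demo.db.events
    TIMESTAMP AS OF '2025-01-01 00:00:00'
""").show()

# Rollback: revert the table to a previous snapshot (snapshot ID is illustrative)
spark.sql("CALL demo.system.rollback_to_snapshot('db.events', 1234567890)")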
Google Cloud’s BigLake tables for Apache Iceberg in BigQuery offer a fully managed table experience similar to standard BigQuery tables, but all of the data is stored in customer-owned storage buckets. Supported features include:
- Table mutations via GoogleSQL data manipulation language (DML)
- Unified batch and high throughput streaming using the Storage Write API through BigLake connectors such as Spark
- Iceberg V2 snapshot export and automatic refresh on each table mutation
- Schema evolution to update column metadata
- Automatic storage optimization
- Time travel for historical data access
- Column-level security and data masking
Here’s an example of how to create an empty BigLake Iceberg table using GoogleSQL:
SQL
CREATE TABLE PROJECT_ID.DATASET_ID.my_iceberg_table (
  name STRING,
  id INT64
)
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  file_format = "PARQUET",
  table_format = "ICEBERG",
  storage_uri = "gs://BUCKET/PATH");
You can then load data into the table, using LOAD DATA INTO to import from a file or INSERT INTO ... SELECT to copy rows from another table.
SQL
# Load from file
# Load from file
LOAD DATA INTO PROJECT_ID.DATASET_ID.my_iceberg_table
FROM FILES (
  uris = ['gs://bucket/path/to/data'],
  format = "PARQUET");

# Load from table
INSERT INTO PROJECT_ID.DATASET_ID.my_iceberg_table
SELECT name, id
FROM PROJECT_ID.DATASET_ID.source_table;
In addition to the fully managed offering, Apache Iceberg is also supported as a read-only external table in BigQuery. Use this to point to an existing path that already contains data files.
SQL
CREATE OR REPLACE EXTERNAL TABLE PROJECT_ID.DATASET_ID.my_external_iceberg_table
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  format = "ICEBERG",
  uris = ['gs://BUCKET/PATH/TO/DATA'],
  require_partition_filter = FALSE);
Apache Spark: The Engine for Data Lakehouse Analytics
While Apache Iceberg provides the structure and management for your data lakehouse, Apache Spark is the processing engine that brings it to life. Spark is a powerful open-source, distributed processing system renowned for its speed, versatility, and ability to handle diverse big data workloads. Spark’s in-memory processing, robust ecosystem of tools including ML and SQL-based processing, and deep Iceberg support make it an excellent choice.
Apache Spark is deeply integrated into the Google Cloud ecosystem. Benefits of using Apache Spark on Google Cloud include:
- A true serverless Spark experience, with no cluster management, using Google Cloud Serverless for Apache Spark.
- A fully managed Spark experience with flexible cluster configuration and management via Dataproc.
- Accelerated Spark jobs using the new Lightning Engine for Apache Spark preview feature.
- Runtimes configurable with GPUs and drivers preinstalled.
- AI/ML jobs powered by a robust set of libraries available by default in Spark runtimes, including XGBoost, PyTorch, and Transformers.
- PySpark code written directly inside BigQuery Studio via Colab Enterprise notebooks, along with Gemini-powered PySpark code generation.
- Easy connectivity to your data in BigQuery native tables, BigLake Iceberg tables, external tables, and GCS.
- Integration with Vertex AI for end-to-end MLOps.
Iceberg + Spark: Better Together
Together, Iceberg and Spark form a potent combination for building performant and reliable data lakehouses. Spark can leverage Iceberg’s metadata to optimize query plans, perform efficient data pruning, and ensure transactional consistency across your data lake.
Your Iceberg tables and BigQuery native tables are accessible via BigLake metastore. This exposes your tables to BigQuery-compatible open source engines, including Spark.
Python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("BigLake Metastore Iceberg") \
    .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
    .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
    .getOrCreate()
spark.conf.set("viewsEnabled", "true")

# Use the BigLake metastore catalog
spark.sql("USE `CATALOG_NAME`;")
spark.sql("USE NAMESPACE DATASET_NAME;")

# Configure Spark for temp results
spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

# List the tables in the dataset
df = spark.sql("SHOW TABLES;")
df.show()

# Query the tables
sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()
Extending the functionality of BigLake metastore, the Iceberg REST catalog (in preview) lets you access Iceberg data with any data processing engine. Here’s how to connect to it using Spark:
Python
import google.auth
from google.auth.transport.requests import Request
from google.oauth2 import service_account
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
catalog = ""
spark = SparkSession.builder.appName("") \
    .config("spark.sql.defaultCatalog", catalog) \
    .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog}.type", "rest") \
    .config(f"spark.sql.catalog.{catalog}.uri", "https://biglake.googleapis.com/iceberg/v1beta/restcatalog") \
    .config(f"spark.sql.catalog.{catalog}.warehouse", "gs://") \
    .config(f"spark.sql.catalog.{catalog}.token", "") \
    .config(f"spark.sql.catalog.{catalog}.oauth2-server-uri", "https://oauth2.googleapis.com/token") \
    .config(f"spark.sql.catalog.{catalog}.header.x-goog-user-project", "") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.hadoop.HadoopFileIO") \
    .config(f"spark.sql.catalog.{catalog}.rest-metrics-reporting-enabled", "false") \
    .getOrCreate()
Completing the Lakehouse
Google Cloud provides a comprehensive suite of services that complement Apache Iceberg and Apache Spark, enabling you to build, manage, and scale your data lakehouse with ease while leveraging many of the open-source technologies you already use:
- Dataplex Universal Catalog: Dataplex Universal Catalog provides a unified data fabric for managing, monitoring, and governing your data across data lakes, data warehouses, and data marts. It integrates with BigLake Metastore, ensuring that governance policies are consistently enforced across your Iceberg tables, and enabling capabilities like semantic search, data lineage, and data quality checks.
- Google Cloud Managed Service for Apache Kafka: Run fully managed Kafka clusters on Google Cloud, including Kafka Connect. Data streams can be written directly to BigQuery, including to managed Iceberg tables, with low-latency reads.
- Cloud Composer: A fully managed workflow orchestration service built on Apache Airflow.
- Vertex AI: Use Vertex AI to manage the full end-to-end ML Ops experience. You can also use Vertex AI Workbench for a managed JupyterLab experience to connect to your serverless Spark and Dataproc instances.
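As an illustration of how orchestration could tie these pieces together, here is a minimal Cloud Composer (Airflow) DAG sketch that submits a serverless Spark batch which, for example, refreshes an Iceberg table. It assumes the Airflow Google provider is installed; the project ID, bucket, and script path are placeholders, not values from this article.
Python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

# Placeholders -- substitute your own project, region, and script location.
PROJECT_ID = "your-project-id"
REGION = "us-central1"
PYSPARK_URI = "gs://your-bucket/jobs/update_iceberg_table.py"

with DAG(
    dag_id="iceberg_lakehouse_refresh",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Submit a serverless Spark batch that appends new data to an Iceberg table
    refresh_iceberg_table = DataprocCreateBatchOperator(
        task_id="refresh_iceberg_table",
        project_id=PROJECT_ID,
        region=REGION,
        batch={"pyspark_batch": {"main_python_file_uri": PYSPARK_URI}},
    )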
Conclusion
The combination of Apache Iceberg and Apache Spark on Google Cloud offers a compelling solution for building modern, high-performance data lakehouses. Iceberg provides the transactional consistency, schema evolution, and performance optimizations that were historically missing from data lakes, while Spark offers a versatile and scalable engine for processing these large datasets.
To learn more, check out our free webinar on July 8th at 11AM PST where we’ll dive deeper into using Apache Spark and supporting tools on Google Cloud.
Author: Brad Miro, Senior Developer Advocate – Google
Replit and Microsoft Bring Vibe Coding to Enterprises Through Azure Partnership
AI coding platform Replit has entered a strategic partnership with Microsoft to bring its agentic software development platform to enterprise customers through Microsoft Azure. The collaboration enables business users across departments to create and deploy secure, production-ready applications using natural language, without writing code.
As part of the integration, Replit is connecting with several Microsoft services, including Azure Container Apps, Azure Virtual Machines, and Neon Serverless Postgres on Azure. This will allow enterprise users to develop on Replit and deploy directly to Microsoft’s infrastructure.
The partnership also allows customers to purchase Replit via the Azure Marketplace, simplifying procurement and adoption.
“Our mission is to empower entrepreneurial individuals to transform ideas into software — regardless of their coding experience or whether they’re launching a startup or innovating within an enterprise,” said Amjad Masad, CEO and co-founder of Replit. “Forward-thinking companies like Zillow are already using Replit to build internal tools and address unique business challenges.”
Deb Cupp, President of Microsoft Americas, said the partnership supports Microsoft’s broader vision of enabling everyone to do more with technology. “Our collaboration with Replit democratizes application development, enabling business teams across enterprises to innovate and solve problems without traditional technical barriers,” she said.
Replit’s platform, already used by more than 500,000 business users, supports application building in teams like Product, Sales, Marketing, Operations, and Design. Common use cases include internal tool development and rapid prototyping, especially where off-the-shelf SaaS solutions are not sufficient.
The integration also aligns with enterprise security needs. Replit is SOC 2 Type II compliant and offers governance controls required by large organisations. Upcoming features include support for direct deployment into a customer’s own Azure environment.
Replit was among the first to evaluate Anthropic models on Azure Databricks through the Mosaic AI gateway, illustrating how organisations can operationalise new AI models across Microsoft’s ecosystem.
The service will soon be available through the Azure Marketplace, with apps deployed on Replit-managed Azure infrastructure, combining ease of use with enterprise-grade scalability and compliance.
Build ETL Pipelines for Data Science Workflows in About 30 Lines of Python
You know that feeling when you have data scattered across different formats and sources, and you need to make sense of it all? That’s exactly what we’re solving today. Let’s build an ETL pipeline that takes messy data and turns it into something actually useful.
In this article, I’ll walk you through creating a pipeline that processes e-commerce transactions. Nothing fancy, just practical code that gets the job done.
We’ll grab data from a CSV file (like you’d download from an e-commerce platform), clean it up, and store it in a proper database for analysis.
What Is an Extract, Transform, Load (ETL) Pipeline?
Every ETL pipeline follows the same pattern. You grab data from somewhere (Extract), clean it up and make it better (Transform), then put it somewhere useful (Load).
ETL Pipeline | Image by Author | diagrams.net (draw.io)
The process begins with the extract phase, where data is retrieved from various source systems such as databases, APIs, files, or streaming platforms. During this phase, the pipeline identifies and pulls relevant data while maintaining connections to disparate systems that may operate on different schedules and formats.
Next, the transform phase represents the core processing stage, where extracted data undergoes cleaning, validation, and restructuring. This step addresses data quality issues, applies business rules, performs calculations, and converts data into the required format and structure. Common transformations include data type conversions, field mapping, aggregations, and the removal of duplicates or invalid records.
Finally, the load phase transfers the now transformed data into the target system. This step can occur through full loads, where entire datasets are replaced, or incremental loads, where only new or changed data is added. The loading strategy depends on factors such as data volume, system performance requirements, and business needs.
Step 1: Extract
The “extract” step is where we get our hands on data. In the real world, you might be downloading this CSV from your e-commerce platform’s reporting dashboard, pulling it from an FTP server, or getting it via API. Here, we’re reading from an available CSV file.
import pandas as pd

def extract_data_from_csv(csv_file_path):
    try:
        print(f"Extracting data from {csv_file_path}...")
        df = pd.read_csv(csv_file_path)
        print(f"Successfully extracted {len(df)} records")
        return df
    except FileNotFoundError:
        print(f"Error: {csv_file_path} not found. Creating sample data...")
        # Fall back to generated sample data (helper defined in the full script)
        csv_file = create_sample_csv_data()
        return pd.read_csv(csv_file)
Now that we have the raw data from its source (raw_transactions.csv), we need to transform it into something usable.
Step 2: Transform
This is where we make the data actually useful.
def transform_data(df):
    print("Transforming data...")
    df_clean = df.copy()

    # Remove records with missing emails
    initial_count = len(df_clean)
    df_clean = df_clean.dropna(subset=['customer_email'])
    removed_count = initial_count - len(df_clean)
    print(f"Removed {removed_count} records with missing emails")

    # Calculate derived fields
    df_clean['total_amount'] = df_clean['price'] * df_clean['quantity']

    # Extract date components
    df_clean['transaction_date'] = pd.to_datetime(df_clean['transaction_date'])
    df_clean['year'] = df_clean['transaction_date'].dt.year
    df_clean['month'] = df_clean['transaction_date'].dt.month
    df_clean['day_of_week'] = df_clean['transaction_date'].dt.day_name()

    # Create customer segments
    df_clean['customer_segment'] = pd.cut(df_clean['total_amount'],
                                          bins=[0, 50, 200, float('inf')],
                                          labels=['Low', 'Medium', 'High'])

    return df_clean
First, we’re dropping rows with missing emails because incomplete customer data isn’t helpful for most analyses.
Then we calculate total_amount by multiplying price and quantity. This seems obvious, but you’d be surprised how often derived fields like this are missing from raw data.
The date extraction is really handy. Instead of just having a timestamp, now we have separate year, month, and day-of-week columns. This makes it easy to analyze patterns like “do we sell more on weekends?”
The customer segmentation using pd.cut() can be particularly useful. It automatically buckets customers into spending categories. Now instead of just having transaction amounts, we have meaningful business segments.
Step 3: Load
In a real project, you might be loading into a database, sending to an API, or pushing to cloud storage.
Here, we’re loading our clean data into a proper SQLite database.
import sqlite3

def load_data_to_sqlite(df, db_name="ecommerce_data.db", table_name="transactions"):
    print(f"Loading data to SQLite database '{db_name}'...")
    conn = sqlite3.connect(db_name)
    try:
        df.to_sql(table_name, conn, if_exists="replace", index=False)

        # Verify the load by counting the rows that landed in the table
        cursor = conn.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
        record_count = cursor.fetchone()[0]
        print(f"Successfully loaded {record_count} records to '{table_name}' table")
        return f"Data successfully loaded to {db_name}"
    finally:
        conn.close()
Now analysts can run SQL queries, connect BI tools, and actually use this data for decision-making.
SQLite works well for this because it’s lightweight, requires no setup, and creates a single file you can easily share or back up. The if_exists="replace" parameter means you can run this pipeline multiple times without worrying about duplicate data.
We’ve added verification steps so you know the load was successful. There’s nothing worse than thinking your data is safely stored only to find an empty table later.
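If you wanted an incremental load instead of a full replacement (as discussed earlier), one option is to append only records you haven’t seen before. Here is a minimal sketch of that idea; it assumes the data has a unique transaction_id column and that the table already exists from a previous run, both of which are illustrative assumptions rather than part of the original pipeline.
Python
import sqlite3
import pandas as pd

def load_data_incrementally(df, db_name="ecommerce_data.db", table_name="transactions"):
    conn = sqlite3.connect(db_name)
    try:
        # Fetch IDs already in the table so we only append new transactions
        existing_ids = pd.read_sql(f"SELECT transaction_id FROM {table_name}", conn)
        new_rows = df[~df['transaction_id'].isin(existing_ids['transaction_id'])]

        # Append only the rows that are not already stored
        new_rows.to_sql(table_name, conn, if_exists="append", index=False)
        print(f"Appended {len(new_rows)} new records")
    finally:
        conn.close()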
Running the ETL Pipeline
This orchestrates the entire extract, transform, load workflow.
def run_etl_pipeline():
    print("Starting ETL Pipeline...")

    # Extract
    raw_data = extract_data_from_csv('raw_transactions.csv')

    # Transform
    transformed_data = transform_data(raw_data)

    # Load
    load_result = load_data_to_sqlite(transformed_data)

    print("ETL Pipeline completed successfully!")
    return transformed_data
Notice how this ties everything together. Extract, transform, load, done. You can run this and immediately see your processed data.
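To see the loaded data for yourself, you could query the resulting SQLite file directly. Here is a quick check, assuming the pipeline above has already run and created ecommerce_data.db:
Python
import sqlite3
import pandas as pd

# Read the processed table back and summarise spend by customer segment
conn = sqlite3.connect("ecommerce_data.db")
summary = pd.read_sql(
    "SELECT customer_segment, COUNT(*) AS orders, SUM(total_amount) AS revenue "
    "FROM transactions GROUP BY customer_segment",
    conn,
)
print(summary)
conn.close()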
You can find the complete code on GitHub.
Wrapping Up
This pipeline takes raw transaction data and turns it into something an analyst or data scientist can actually work with. You’ve got clean records, calculated fields, and meaningful segments.
Each function does one thing well, and you can easily modify or extend any part without breaking the rest.
Now try running it yourself. Also try to modify it for another use case. Happy coding!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.