Noida’s Suhora Brings Hyperspectral Satellite Services to India with Orbital Sidekick



Noida-based Suhora Technologies has partnered with US-based Orbital Sidekick (OSK) to launch high-resolution hyperspectral satellite services in India. The collaboration makes Suhora the first Indian company to offer commercial, operational hyperspectral data across the visible-to-near-infrared and short-wave infrared (VNIR-SWIR) spectrum.

Under this agreement, Suhora will integrate OSK’s hyperspectral data into its flagship SPADE platform, enabling detailed material detection and classification. The service will support applications in mining, environmental monitoring, and strategic analytics.

Krishanu Acharya, CEO and co-founder of Suhora Technologies, said, “This partnership with Orbital Sidekick marks an important step for the global geospatial community. The strategic applications enabled by this collaboration stand to benefit users globally, including India.”

Rupesh Kumar, CTO and co-founder, added that the new capability will allow “precise mineral mapping, real-time environmental monitoring and anomaly detections.”

Suhora’s SPADE, a subscription-based SaaS platform, currently aggregates data from SAR, Optical, and Thermal satellites. With OSK’s hyperspectral data, SPADE will offer up to 472 contiguous spectral bands at 8.3-metre spatial resolution.

These capabilities will aid in applications such as rare earth mineral mapping, oil spill detection, and methane leak monitoring.

Suhora will leverage OSK’s GHOSt constellation of five satellites, which provides high revisit rates and full VNIR-SWIR coverage. The partnership is expected to deliver better temporal frequency and more precise data than other global hyperspectral missions such as EnMAP, PRISMA, and NASA’s EMIT.

Tushar Prabhakar, co-founder and COO of Orbital Sidekick, commented, “Suhora’s strong local presence and expertise will be instrumental in delivering these powerful insights and driving significant value for our clients.”




Building Modern Data Lakehouses on Google Cloud with Apache Iceberg and Apache Spark



Sponsored Content

 

 

 

The landscape of big data analytics is constantly evolving, with organizations seeking more flexible, scalable, and cost-effective ways to manage and analyze vast amounts of data. This pursuit has led to the rise of the data lakehouse paradigm, which combines the low-cost storage and flexibility of data lakes with the data management capabilities and transactional consistency of data warehouses. At the heart of this revolution are open table formats like Apache Iceberg and powerful processing engines like Apache Spark, all empowered by the robust infrastructure of Google Cloud.

 

The Rise of Apache Iceberg: A Game-Changer for Data Lakes

 

For years, data lakes, typically built on cloud object storage like Google Cloud Storage (GCS), offered unparalleled scalability and cost efficiency. However, they often lacked the crucial features found in traditional data warehouses, such as transactional consistency, schema evolution, and performance optimizations for analytical queries. This is where Apache Iceberg shines.

Apache Iceberg is an open table format designed to address these limitations. It sits on top of your data files (like Parquet, ORC, or Avro) in cloud storage, providing a layer of metadata that transforms a collection of files into a high-performance, SQL-like table. Here’s what makes Iceberg so powerful:

  • ACID Compliance: Iceberg brings Atomicity, Consistency, Isolation, and Durability (ACID) properties to your data lake. This means that data writes are transactional, ensuring data integrity even with concurrent operations. No more partial writes or inconsistent reads.
  • Schema Evolution: One of the biggest pain points in traditional data lakes is managing schema changes. Iceberg handles schema evolution seamlessly, allowing you to add, drop, rename, or reorder columns without rewriting the underlying data. This is critical for agile data development.
  • Hidden Partitioning: Iceberg intelligently manages partitioning, abstracting away the physical layout of your data. Users no longer need to know the partitioning scheme to write efficient queries, and you can evolve your partitioning strategy over time without data migrations.
  • Time Travel and Rollback: Iceberg maintains a complete history of table snapshots. This enables “time travel” queries, allowing you to query data as it existed at any point in the past. It also provides rollback capabilities, letting you revert a table to a previous good state, invaluable for debugging and data recovery.
  • Performance Optimizations: Iceberg’s rich metadata allows query engines to prune irrelevant data files and partitions efficiently, significantly accelerating query execution. It avoids costly file listing operations, directly jumping to the relevant data based on its metadata.

By providing these data warehouse-like features on top of a data lake, Apache Iceberg enables the creation of a true “data lakehouse,” offering the best of both worlds: the flexibility and cost-effectiveness of cloud storage combined with the reliability and performance of structured tables.
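To make these capabilities concrete, here is a minimal PySpark sketch of schema evolution and time travel on an Iceberg table. It is illustrative only: it assumes an existing Spark session (`spark`) with Iceberg’s SQL extensions enabled and a catalog named `demo`; the catalog, namespace, and table names are placeholders, and the Google Cloud-specific catalog configurations appear later in this article.


Python

# Minimal sketch: assumes `spark` has Iceberg's SQL extensions enabled and an
# Iceberg catalog named `demo` (all names here are placeholders).
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.sales")
spark.sql("CREATE TABLE IF NOT EXISTS demo.sales.orders (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO demo.sales.orders VALUES (1, 19.99), (2, 250.00)")

# Schema evolution: add a column without rewriting existing data files
spark.sql("ALTER TABLE demo.sales.orders ADD COLUMNS (region STRING)")

# Time travel: list table snapshots, then query the table as of the earliest one
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM demo.sales.orders.snapshots ORDER BY committed_at"
).first()[0]
spark.sql(f"SELECT * FROM demo.sales.orders VERSION AS OF {first_snapshot}").show()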

Google Cloud’s BigLake tables for Apache Iceberg in BigQuery offer a fully managed table experience similar to standard BigQuery tables, with all of the data stored in customer-owned storage buckets. Supported features include:

  • Table mutations via GoogleSQL data manipulation language (DML)
  • Unified batch and high throughput streaming using the Storage Write API through BigLake connectors such as Spark
  • Iceberg V2 snapshot export and automatic refresh on each table mutation
  • Schema evolution to update column metadata
  • Automatic storage optimization
  • Time travel for historical data access
  • Column-level security and data masking

Here’s an example of how to create an empty BigLake Iceberg table using GoogleSQL:


SQL

CREATE TABLE PROJECT_ID.DATASET_ID.my_iceberg_table (
  name STRING,
  id INT64
)
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://BUCKET/PATH');

 

You can then load data into the table using LOAD DATA INTO to import from files, or INSERT INTO to copy rows from another table.


SQL

-- Load from file
LOAD DATA INTO PROJECT_ID.DATASET_ID.my_iceberg_table
FROM FILES (
  uris = ['gs://bucket/path/to/data'],
  format = 'PARQUET');

-- Load from another table
INSERT INTO PROJECT_ID.DATASET_ID.my_iceberg_table
SELECT name, id
FROM PROJECT_ID.DATASET_ID.source_table;

 

In addition to the fully managed offering, Apache Iceberg is also supported as a read-only external table in BigQuery. Use this to point at an existing path with data files.


SQL

CREATE OR REPLACE EXTERNAL TABLE PROJECT_ID.DATASET_ID.my_external_iceberg_table
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  format="ICEBERG",
  uris =
    ['gs://BUCKET/PATH/TO/DATA'],
  require_partition_filter = FALSE);

 

 

Apache Spark: The Engine for Data Lakehouse Analytics

 

While Apache Iceberg provides the structure and management for your data lakehouse, Apache Spark is the processing engine that brings it to life. Spark is a powerful open-source, distributed processing system renowned for its speed, versatility, and ability to handle diverse big data workloads. Spark’s in-memory processing, robust ecosystem of tools including ML and SQL-based processing, and deep Iceberg support make it an excellent choice.

Apache Spark is deeply integrated into the Google Cloud ecosystem. Benefits of using Apache Spark on Google Cloud include:

  • Access a true serverless Spark experience without cluster management using Google Cloud Serverless for Apache Spark.
  • Get a fully managed Spark experience with flexible cluster configuration and management via Dataproc.
  • Accelerate Spark jobs using the new Lightning Engine for Apache Spark preview feature.
  • Configure your runtime with GPUs and drivers preinstalled.
  • Run AI/ML jobs using a robust set of libraries available by default in Spark runtimes, including XGBoost, PyTorch, and Transformers.
  • Write PySpark code directly inside BigQuery Studio via Colab Enterprise notebooks, along with Gemini-powered PySpark code generation.
  • Easily connect to your data in BigQuery native tables, BigLake Iceberg tables, external tables, and GCS.
  • Integrate with Vertex AI for end-to-end MLOps.

 

Iceberg + Spark: Better Together

 

Together, Iceberg and Spark form a potent combination for building performant and reliable data lakehouses. Spark can leverage Iceberg’s metadata to optimize query plans, perform efficient data pruning, and ensure transactional consistency across your data lake.

Your Iceberg tables and BigQuery native tables are accessible via BigLake metastore. This exposes your tables to open source engines with BigQuery compatibility, including Spark.


Python

from pyspark.sql import SparkSession

# Create a Spark session configured to use BigLake metastore as an Iceberg catalog
spark = SparkSession.builder \
    .appName("BigLake Metastore Iceberg") \
    .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
    .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
    .getOrCreate()
spark.conf.set("viewsEnabled", "true")

# Use the BigLake metastore catalog and dataset (namespace)
spark.sql("USE `CATALOG_NAME`;")
spark.sql("USE NAMESPACE DATASET_NAME;")

# Configure Spark to materialize temporary query results
spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

# List the tables in the dataset
df = spark.sql("SHOW TABLES;")
df.show()

# Query the tables
sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

 

BigLake metastore also provides an Iceberg REST catalog (in preview), which lets any Iceberg-compatible data processing engine access your Iceberg data. Here’s how to connect to it using Spark:


Python

import google.auth
from google.auth.transport.requests import Request
from google.oauth2 import service_account
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

catalog = ""
spark = SparkSession.builder.appName("") \
    .config("spark.sql.defaultCatalog", catalog) \
    .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog}.type", "rest") \
    .config(f"spark.sql.catalog.{catalog}.uri", "https://biglake.googleapis.com/iceberg/v1beta/restcatalog") \
    .config(f"spark.sql.catalog.{catalog}.warehouse", "gs://") \
    .config(f"spark.sql.catalog.{catalog}.token", "") \
    .config(f"spark.sql.catalog.{catalog}.oauth2-server-uri", "https://oauth2.googleapis.com/token") \
    .config(f"spark.sql.catalog.{catalog}.header.x-goog-user-project", "") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.hadoop.HadoopFileIO") \
    .config(f"spark.sql.catalog.{catalog}.rest-metrics-reporting-enabled", "false") \
    .getOrCreate()
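Once the session is created, the REST catalog behaves like any other Spark catalog. A quick, hedged usage check, reusing the DATASET_NAME and table placeholders from the earlier example, might look like this:


Python

# Browse and query Iceberg tables exposed through the REST catalog
# (namespace and table names are the placeholders used earlier)
spark.sql("SHOW NAMESPACES").show()
spark.sql("SHOW TABLES IN DATASET_NAME").show()
spark.sql("SELECT COUNT(*) FROM DATASET_NAME.ICEBERG_TABLE_NAME").show()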

 

 

Completing the lakehouse

 

Google Cloud provides a comprehensive suite of services that complement Apache Iceberg and Apache Spark, enabling you to build, manage, and scale your data lakehouse with ease while leveraging many of the open-source technologies you already use:

  • Dataplex Universal Catalog: Dataplex Universal Catalog provides a unified data fabric for managing, monitoring, and governing your data across data lakes, data warehouses, and data marts. It integrates with BigLake Metastore, ensuring that governance policies are consistently enforced across your Iceberg tables, and enabling capabilities like semantic search, data lineage, and data quality checks.
  • Google Cloud Managed Service for Apache Kafka: Run fully managed Kafka clusters on Google Cloud, including Kafka Connect. Streams can be written directly into BigQuery, including into managed Iceberg tables, with low-latency reads.
  • Cloud Composer: A fully managed workflow orchestration service built on Apache Airflow.
  • Vertex AI: Use Vertex AI to manage the full end-to-end ML Ops experience. You can also use Vertex AI Workbench for a managed JupyterLab experience to connect to your serverless Spark and Dataproc instances.

 

Conclusion

 

The combination of Apache Iceberg and Apache Spark on Google Cloud offers a compelling solution for building modern, high-performance data lakehouses. Iceberg provides the transactional consistency, schema evolution, and performance optimizations that were historically missing from data lakes, while Spark offers a versatile and scalable engine for processing these large datasets.

To learn more, check out our free webinar on July 8th at 11AM PST where we’ll dive deeper into using Apache Spark and supporting tools on Google Cloud.

Author: Brad Miro, Senior Developer Advocate – Google

 
 




Replit and Microsoft Bring Vibe Coding to Enterprises Through Azure Partnership



AI coding platform Replit has entered a strategic partnership with Microsoft to bring its agentic software development platform to enterprise customers through Microsoft Azure. The collaboration enables business users across departments to create and deploy secure, production-ready applications using natural language, without writing code.

As part of the integration, Replit is connecting with several Microsoft services, including Azure Container Apps, Azure Virtual Machines, and Neon Serverless Postgres on Azure. This will allow enterprise users to develop on Replit and deploy directly to Microsoft’s infrastructure.

The partnership also allows customers to purchase Replit via the Azure Marketplace, simplifying procurement and adoption.

“Our mission is to empower entrepreneurial individuals to transform ideas into software — regardless of their coding experience or whether they’re launching a startup or innovating within an enterprise,” said Amjad Masad, CEO and co-founder of Replit. “Forward-thinking companies like Zillow are already using Replit to build internal tools and address unique business challenges.”

Deb Cupp, President of Microsoft Americas, said the partnership supports Microsoft’s broader vision of enabling everyone to do more with technology. “Our collaboration with Replit democratizes application development, enabling business teams across enterprises to innovate and solve problems without traditional technical barriers,” she said.

Replit’s platform, already used by more than 500,000 business users, supports application building in teams like Product, Sales, Marketing, Operations, and Design. Common use cases include internal tool development and rapid prototyping, especially where off-the-shelf SaaS solutions are not sufficient.

The integration also aligns with enterprise security needs. Replit is SOC 2 Type II compliant and offers governance controls required by large organisations. Upcoming features include support for direct deployment into a customer’s own Azure environment.

Replit was among the first to evaluate Anthropic models on Azure Databricks through the Mosaic AI gateway, illustrating how organisations can operationalise new AI models across Microsoft’s ecosystem.

The service will soon be available through the Azure Marketplace, with apps deployed on Replit-managed Azure infrastructure, combining ease of use with enterprise-grade scalability and compliance.




Build ETL Pipelines for Data Science Workflows in About 30 Lines of Python





 

You know that feeling when you have data scattered across different formats and sources, and you need to make sense of it all? That’s exactly what we’re solving today. Let’s build an ETL pipeline that takes messy data and turns it into something actually useful.

In this article, I’ll walk you through creating a pipeline that processes e-commerce transactions. Nothing fancy, just practical code that gets the job done.

We’ll grab data from a CSV file (like you’d download from an e-commerce platform), clean it up, and store it in a proper database for analysis.

🔗 Link to the code on GitHub

 

What Is an Extract, Transform, Load (ETL) Pipeline?

 
Every ETL pipeline follows the same pattern. You grab data from somewhere (Extract), clean it up and make it better (Transform), then put it somewhere useful (Load).

 

ETL Pipeline | Image by Author | diagrams.net (draw.io)

 

The process begins with the extract phase, where data is retrieved from various source systems such as databases, APIs, files, or streaming platforms. During this phase, the pipeline identifies and pulls relevant data while maintaining connections to disparate systems that may operate on different schedules and formats.

Next, the transform phase represents the core processing stage, where extracted data undergoes cleaning, validation, and restructuring. This step addresses data quality issues, applies business rules, performs calculations, and converts data into the required format and structure. Common transformations include data type conversions, field mapping, aggregations, and the removal of duplicates or invalid records.

Finally, the load phase transfers the now transformed data into the target system. This step can occur through full loads, where entire datasets are replaced, or incremental loads, where only new or changed data is added. The loading strategy depends on factors such as data volume, system performance requirements, and business needs.

 

Step 1: Extract

 
The “extract” step is where we get our hands on data. In the real world, you might be downloading this CSV from your e-commerce platform’s reporting dashboard, pulling it from an FTP server, or getting it via API. Here, we’re reading from an available CSV file.

import pandas as pd

def extract_data_from_csv(csv_file_path):
    try:
        print(f"Extracting data from {csv_file_path}...")
        df = pd.read_csv(csv_file_path)
        print(f"Successfully extracted {len(df)} records")
        return df
    except FileNotFoundError:
        print(f"Error: {csv_file_path} not found. Creating sample data...")
        csv_file = create_sample_csv_data()
        return pd.read_csv(csv_file)
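
The fallback helper create_sample_csv_data() isn’t shown in this excerpt; it lives in the linked GitHub repo. A hypothetical stand-in, just so the snippet runs end to end, could look like this (the column names mirror what the transform step below expects):

def create_sample_csv_data(csv_file_path="raw_transactions.csv"):
    # Hypothetical stand-in for the repo's helper: writes a tiny sample CSV
    # with the columns the transform step expects
    sample = pd.DataFrame({
        "transaction_id": [1, 2, 3],
        "customer_email": ["a@example.com", None, "c@example.com"],
        "price": [19.99, 99.50, 250.00],
        "quantity": [2, 1, 1],
        "transaction_date": ["2024-06-01", "2024-06-02", "2024-06-03"],
    })
    sample.to_csv(csv_file_path, index=False)
    return csv_file_path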

 

Now that we have the raw data from its source (raw_transactions.csv), we need to transform it into something usable.

 

Step 2: Transform

 
This is where we make the data actually useful.

def transform_data(df):
    print("Transforming data...")
    
    df_clean = df.copy()
    
    # Remove records with missing emails
    initial_count = len(df_clean)
    df_clean = df_clean.dropna(subset=['customer_email'])
    removed_count = initial_count - len(df_clean)
    print(f"Removed {removed_count} records with missing emails")
    
    # Calculate derived fields
    df_clean['total_amount'] = df_clean['price'] * df_clean['quantity']
    
    # Extract date components
    df_clean['transaction_date'] = pd.to_datetime(df_clean['transaction_date'])
    df_clean['year'] = df_clean['transaction_date'].dt.year
    df_clean['month'] = df_clean['transaction_date'].dt.month
    df_clean['day_of_week'] = df_clean['transaction_date'].dt.day_name()
    
    # Create customer segments
    df_clean['customer_segment'] = pd.cut(df_clean['total_amount'], 
                                        bins=[0, 50, 200, float('inf')], 
                                        labels=['Low', 'Medium', 'High'])
    
    return df_clean

 

First, we’re dropping rows with missing emails because incomplete customer data isn’t helpful for most analyses.

Then we calculate total_amount by multiplying price and quantity. This seems obvious, but you’d be surprised how often derived fields like this are missing from raw data.

The date extraction is really handy. Instead of just having a timestamp, now we have separate year, month, and day-of-week columns. This makes it easy to analyze patterns like “do we sell more on weekends?”

The customer segmentation using pd.cut() can be particularly useful. It automatically buckets customers into spending categories. Now instead of just having transaction amounts, we have meaningful business segments.

 

Step 3: Load

 
In a real project, you might be loading into a database, sending to an API, or pushing to cloud storage.

Here, we’re loading our clean data into a proper SQLite database.

import sqlite3

def load_data_to_sqlite(df, db_name="ecommerce_data.db", table_name="transactions"):
    print(f"Loading data to SQLite database '{db_name}'...")
    
    conn = sqlite3.connect(db_name)
    
    try:
        df.to_sql(table_name, conn, if_exists="replace", index=False)
        
        cursor = conn.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
        record_count = cursor.fetchone()[0]
        
        print(f"Successfully loaded {record_count} records to '{table_name}' table")
        
        return f"Data successfully loaded to {db_name}"
        
    finally:
        conn.close()

 

Now analysts can run SQL queries, connect BI tools, and actually use this data for decision-making.

SQLite works well for this because it’s lightweight, requires no setup, and creates a single file you can easily share or backup. The if_exists="replace" parameter means you can run this pipeline multiple times without worrying about duplicate data.

We’ve added verification steps so you know the load was successful. There’s nothing worse than thinking your data is safely stored only to find an empty table later.
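
For a quick sanity check after the pipeline runs, you can query the database directly; a small example, assuming the default ecommerce_data.db and transactions names used above:

import sqlite3
import pandas as pd

# Peek at the loaded table: orders and revenue per customer segment
conn = sqlite3.connect("ecommerce_data.db")
summary = pd.read_sql(
    "SELECT customer_segment, COUNT(*) AS orders, ROUND(SUM(total_amount), 2) AS revenue "
    "FROM transactions GROUP BY customer_segment",
    conn,
)
print(summary)
conn.close()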

 

Running the ETL Pipeline

 
This orchestrates the entire extract, transform, load workflow.

def run_etl_pipeline():
    print("Starting ETL Pipeline...")
    
    # Extract
    raw_data = extract_data_from_csv('raw_transactions.csv')
    
    # Transform  
    transformed_data = transform_data(raw_data)
    
    # Load
    load_result = load_data_to_sqlite(transformed_data)
    
    print("ETL Pipeline completed successfully!")
    
    return transformed_data

 

Notice how this ties everything together. Extract, transform, load, done. You can run this and immediately see your processed data.

You can find the complete code on GitHub.

 

Wrapping Up

 
This pipeline takes raw transaction data and turns it into something an analyst or data scientist can actually work with. You’ve got clean records, calculated fields, and meaningful segments.

Each function does one thing well, and you can easily modify or extend any part without breaking the rest.

Now try running it yourself. Also try to modify it for another use case. Happy coding!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.




