Pre-SIGMOD 2026 Talks @ BMSCE

2026-05-30

#databases #genai #research

Anshul Paruchuri

Editor's Note:
We were invited to the Pre-Conference Workshop at BMSCE for the ACM SIGMOD 2026 Conference, happening in Bangalore. The workshop featured several excellent talks and a lot of insightful discussion. We had a blast, and would love to see more events like this in the future!
This article first appeared on Anshul's blog! (link)


Test Automation for Low-Code ETL Workflows

Speaker: Meenakshi D'Souza (IIT Madras)

Introduction

ETL (Extract, Transform, Load) is a data integration process that extracts data from various sources, transforms it according to the desired requirements, and then loads it into a target (typically a data warehouse).

No-code/low-code ETL platforms provide features for business users to develop ETL pipelines without writing much code.

Low-code open-source ETL tools: Apache Airflow, Apache NiFi, CDAP.

Objective

Can we design a set of functional testing plugins for ETL workflows, mainly for data pipelines using ETL tools?

Structure of the ETL Process

Data from multiple sources is transformed using a well-defined set of syntactic rules, applied step by step, and then loaded into a sink.

Typically, this is represented as a DAG. Each node in the DAG is a transformation step, and each edge connects one transformation step to the next.
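To make the DAG view concrete, here is a minimal sketch in Python. The three step functions and the sample rows are hypothetical and are not taken from the talk; the only library used is the standard-library `graphlib`, which runs each step after all of its upstream steps.

```python
from graphlib import TopologicalSorter

# Hypothetical transformation steps: each returns a list of row dicts.
def extract():
    return [{"name": " Asha ", "amount": "10"}, {"name": "Ravi", "amount": "-5"}]

def clean(rows):
    return [{"name": r["name"].strip(), "amount": int(r["amount"])} for r in rows]

def keep_positive(rows):
    return [r for r in rows if r["amount"] > 0]

STEPS = {"extract": extract, "clean": clean, "keep_positive": keep_positive}

# Each node maps to the set of nodes it depends on (its incoming edges).
DAG = {"extract": set(), "clean": {"extract"}, "keep_positive": {"clean"}}

def run_pipeline(dag, steps):
    results = {}
    # static_order() yields every node after all of its dependencies.
    for node in TopologicalSorter(dag).static_order():
        inputs = [results[parent] for parent in sorted(dag[node])]
        results[node] = steps[node](*inputs)
    return results

results = run_pipeline(DAG, STEPS)
sink = results["keep_positive"]  # the rows that would be loaded into the sink
```

A functional test for such a pipeline can then assert on the output of any individual node in `results`, not only on the final sink.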

Proposed Framework (EasyTest ETL)

EasyTest ETL consists of three plugins: Assertion, Fixture and Mutation.

This project presents a general framework for plugins to facilitate functional testing of low-code ETL workflows.

Fueling Enterprise AI through Robust Data Ingestion

Speaker: Prasad M Deshpande (Databricks)

Introduction

Typical enterprise data is spread over multiple sources. The first step is to get all this data into a single platform. 3 steps in the process:

Why is Ingestion Hard?

There are hundreds of source types, each with different APIs, protocols and quirks. Enterprises also have these requirements:

Ingestion has two phases: Snapshot (initial copy) and Incremental (keeping the data fresh).

Sources can be of 2 types: SaaS Systems and Database Systems

Another challenge for SaaS sources is API rate limits, which call for smart backoff strategies.
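The talk did not spell out a specific backoff strategy. One common choice is exponential backoff with jitter, sketched below. The `RateLimited` exception and the `fetch` callable are hypothetical stand-ins for whatever a real API client raises and calls.

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical error raised when the source reports a rate limit."""

def call_with_backoff(fetch, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call fetch(), retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # The cap doubles on every attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: wait a random time up to the cap so that parallel
            # readers do not all retry at the same moment.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists only so the function can be exercised without real waiting, for example by passing `delays.append` in a test.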

Overall Flow

Source > Reader > Buffer > Merger > Destination

Incremental vs Batch Ingestion

Cursor

Challenge: How do we fetch only new data? This can be done using cursor logic.

Steps:

  1. Find a column that changes whenever a record is updated.
  2. Keep track of the maximum value seen so far.

The ideal cursor field changes whenever the record is updated, always increases and is strictly ordered by update time.
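The cursor logic can be sketched as follows. This is a minimal illustration, assuming a hypothetical `updated_at` column and an in-memory dictionary standing in for the source system.

```python
# Hypothetical source table keyed by id; updated_at plays the role of the cursor field.
source = {
    1: {"id": 1, "updated_at": 100},
    2: {"id": 2, "updated_at": 105},
    3: {"id": 3, "updated_at": 110},
}

def fetch_since(cursor):
    """Return records whose cursor field is strictly greater than the cursor."""
    changed = [r for r in source.values() if r["updated_at"] > cursor]
    return sorted(changed, key=lambda r: r["updated_at"])

def incremental_sync(cursor):
    """Fetch only new or updated records and advance the cursor."""
    records = fetch_since(cursor)
    # The new cursor is the maximum value seen so far; unchanged if nothing came back.
    new_cursor = max((r["updated_at"] for r in records), default=cursor)
    return records, new_cursor

first_batch, cursor = incremental_sync(0)        # initial run fetches everything
source[2] = {"id": 2, "updated_at": 120}         # record 2 is updated later
second_batch, cursor = incremental_sync(cursor)  # only record 2 comes back
```

The strict `>` comparison shows why the cursor field should be strictly ordered: if two updates could share the same value, one of them might land after the cursor has already moved past it and would be missed.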

Chunking Large Datasets

Naive Approach

Robust Chunking

Keyset Pagination Solution
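Keyset pagination generally means that each chunk resumes from the last key seen instead of using a growing OFFSET. The sketch below is an illustration under assumptions, not the speaker's implementation: the `events` table, its columns and the chunk size are all hypothetical, and SQLite is used only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 11)],
)

def read_in_chunks(conn, chunk_size):
    """Yield chunks of rows ordered by id, resuming after the last id seen."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # the key of the last row is the resume point

chunks = list(read_in_chunks(conn, 4))
```

Because every query seeks directly to `id > last_id` through the primary key index, the cost of a chunk does not grow with how far into the table the reader already is, and `last_id` doubles as a checkpoint.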

Unstructured Data for AI

Challenges:

ACLs and Permission Problem

Approach 1: Data Storage

Approach 2: Platform Security

Database Sources

These sources support CDC (change data capture), which reads changes from the transaction log. The two phases here are CDC (read the change stream) and Snapshot (capture the current state by querying tables).

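CDC needs to start before the snapshot, so that changes made while the snapshot is running are not lost.

Each CDC record carries a sequence number. The correct starting sequence number S is the LSN (log sequence number) at CDC start.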
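One way to read this ordering is sketched below with a toy in-memory log. The event format and the helper names are hypothetical. The snapshot is taken after CDC has started, so some changes show up both in the snapshot and in the stream. Replaying every event with an LSN of at least S on top of the snapshot still converges, because upserts and deletes keyed by primary key are idempotent.

```python
def apply_event(state, event):
    """Apply one change event to a key -> value state (idempotent per key)."""
    if event["op"] == "delete":
        state.pop(event["key"], None)
    else:  # "upsert" covers both inserts and updates
        state[event["key"]] = event["value"]

# Hypothetical transaction log of the source database.
log = [
    {"lsn": 1, "op": "upsert", "key": "a", "value": 1},
    {"lsn": 2, "op": "upsert", "key": "b", "value": 2},
    {"lsn": 3, "op": "upsert", "key": "a", "value": 10},
    {"lsn": 4, "op": "delete", "key": "b"},
    {"lsn": 5, "op": "upsert", "key": "c", "value": 3},
]

def source_state_at(lsn):
    """State of the source table after all events up to and including lsn."""
    state = {}
    for event in log:
        if event["lsn"] <= lsn:
            apply_event(state, event)
    return state

cdc_start_lsn = 3              # S: the change stream is read from LSN 3 onward
snapshot = source_state_at(4)  # snapshot runs later and already reflects LSN 3 and 4

destination = dict(snapshot)
for event in log:
    if event["lsn"] >= cdc_start_lsn:  # includes events the snapshot already saw
        apply_event(destination, event)
```

Had CDC started only after the snapshot finished, any change committed in between would be present in neither the snapshot nor the stream.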

Snapshot Split Problem

A full scan of a 1 TB table at 100 MB/s takes nearly 3 hours. The solution is to divide the table into splits, read them in parallel, checkpoint each split individually, and retry failed splits independently. Splits should ideally be even.

Use composite key partitioning for even splits.
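The arithmetic and the split-and-retry idea can be sketched as follows. This is a simplified illustration: it splits a single integer key range evenly instead of doing composite key partitioning, and `read_range` is a hypothetical callable that reads one key range from the source.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_hours(table_bytes, bytes_per_second):
    """Hours needed for a sequential full scan."""
    return table_bytes / bytes_per_second / 3600

def make_splits(min_key, max_key, n):
    """Divide the inclusive key range into n half-open ranges [lo, hi) of near-equal width."""
    step, extra = divmod(max_key - min_key + 1, n)
    splits, lo = [], min_key
    for i in range(n):
        hi = lo + step + (1 if i < extra else 0)
        splits.append((lo, hi))
        lo = hi
    return splits

def read_split(split, read_range, max_retries=3):
    """Read one split, retrying only this split if it fails."""
    for attempt in range(max_retries):
        try:
            return read_range(*split)
        except OSError:
            if attempt == max_retries - 1:
                raise

def parallel_snapshot(splits, read_range, workers=4):
    """Read all splits in parallel; results come back in split order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: read_split(s, read_range), splits))

hours = scan_hours(10**12, 100 * 10**6)  # 1 TB at 100 MB/s, roughly 2.78 hours
```

Equal-width key ranges only produce even splits when keys are spread uniformly, which is the skew problem that a better partitioning scheme is meant to address. In a real reader, the set of completed splits would also be persisted as the per-split checkpoint.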

Challenges

Merging

The merger must reconcile these technical hurdles to ensure data integrity.

PostgreSQL Architecture

Speaker: Pavan Deolasee (EDB)

Architecture

Postgres uses processes, not threads. This helps with portability, debugging and fault isolation, but comes at the cost of higher per-connection process overhead.

Query Processing

Parser > Analyzer > Rewriter > Planner > Executor

Planner/Optimizer is the most active contribution area in core Postgres.

Storage

MVCC

MVCC stands for Multiversion Concurrency Control. Tuples are never updated in place: an update always creates a new version of the row, which is stored alongside the old version.

This ensures readers and writers do not block each other. It gives accurate snapshot isolation, and rollbacks are easy.
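A toy model of this mechanism is sketched below. It is loosely modeled on the `xmin`/`xmax` fields that Postgres keeps on each tuple, but heavily simplified: every write is treated as its own immediately committed transaction, and there is no vacuum. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TupleVersion:
    key: str
    value: str
    xmin: int                   # transaction that created this version
    xmax: Optional[int] = None  # transaction that superseded it, if any

class ToyTable:
    def __init__(self):
        self.versions = []  # versions are only ever appended, never overwritten
        self.next_txid = 1

    def write(self, key, value):
        txid = self.next_txid
        self.next_txid += 1
        for v in self.versions:
            if v.key == key and v.xmax is None:
                v.xmax = txid  # the old version stays, only marked as superseded
        self.versions.append(TupleVersion(key, value, xmin=txid))
        return txid

    def snapshot(self):
        """A snapshot is simply the last committed transaction id."""
        return self.next_txid - 1

    def read(self, key, snapshot):
        """Return the version visible to the given snapshot, if any."""
        for v in self.versions:
            created = v.xmin <= snapshot
            still_live = v.xmax is None or v.xmax > snapshot
            if v.key == key and created and still_live:
                return v.value
        return None

table = ToyTable()
table.write("k", "v1")
old_snapshot = table.snapshot()
table.write("k", "v2")  # creates a second version next to the first
```

A reader holding `old_snapshot` still sees `v1` after the second write, which is why the writer never has to block it.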

WAL

WAL (write-ahead logging) is used for durability and replication. Changes are recorded in the log before the corresponding data pages are written to disk.
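The log-first rule can be illustrated with a toy key-value store. This is a sketch of the general write-ahead idea, not of the Postgres WAL format; the `ToyStore` class and its JSON-lines log are hypothetical.

```python
import json
import os
import tempfile

class ToyStore:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}  # stands in for the data pages
        self.replay()

    def replay(self):
        """Rebuild the state by replaying every logged change in order."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the change to the log and force it to stable storage.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps({"key": key, "value": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only then apply the change to the data itself.
        self.data[key] = value

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "wal.log")
    store = ToyStore(path)
    store.put("a", 1)
    store.put("a", 2)
    store.put("b", 3)
    recovered = ToyStore(path)  # simulates a restart after a crash
    recovered_data = dict(recovered.data)
```

Since the log alone is enough to rebuild the state, shipping the same log to another node and replaying it there is the basic idea behind log-based replication.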

Postgres supports both physical and logical replication.

Pluggable Indexes

Built-in Index Types

Plugins

Extensibility

Extensibility is a design principle in Postgres. It is built to be extended without forking.

The primary mode of contributions is through the pgsql-hackers mailing list.

Rethinking Relational DBs in the Age of GenAI

Speaker: Carsten Binnig (TU Darmstadt)

Introduction

Almost every critical system depends on relational DBs, and the cloud has made them more scalable. However, using these DBs comes with a high overhead, and not much changed even when everything moved to the cloud.

Original Promises:

However, data tends to be unstructured, and does not come in tables.

Relational tax: Overheads are rooted in the design of the relational model.

Query tax: Query authoring is complex.

Tuning tax: DBs require massive tuning.

Cutting these Relational Taxes with AI

Query and Data Tax

Major Issues of using LLMs as DBs

A pure LLM/RAG-style approach for natural language queries leads to some issues.

  1. Limited to simple NL queries.
  2. Limited data understanding if data is already structured.
  3. Enterprise data is stored in structured form.
  4. Black-box processing.
  5. High cost.

A Solution

LLM-Augmented DBs: Extend DBs with LLMs as needed. LLMs and DBs can be used to complement each other.

Relational + LLM-based operators. Use LLM-driven Multimodal filters.

LLMs can also be used for query planning.

Carsten's project is called CAESURA. (Code)

How it works: given an NL query and the set of available logical operators, the LLM is asked to create a logical plan that could execute the query. This logical plan is then converted into a physical plan (the actual execution strategy on the tech stack being used). The LLM is thus used to reason over both the data and the logical operations.
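The split between a logical and a physical plan can be illustrated with the toy sketch below. It is not CAESURA's code or API. The planner is a hard-coded stand-in for the LLM call, and the operator names, plan format and sample rows are all hypothetical.

```python
def fake_llm_planner(question, logical_operators):
    """Stand-in for an LLM call: returns a hard-coded logical plan."""
    plan = [
        {"op": "filter", "args": {"column": "year", "equals": 2020}},
        {"op": "count", "args": {}},
    ]
    # A real system would have to validate whatever the model returns.
    assert all(step["op"] in logical_operators for step in plan)
    return plan

# Physical operators: the concrete implementation chosen for each logical operator.
PHYSICAL_OPERATORS = {
    "filter": lambda rows, column, equals: [r for r in rows if r[column] == equals],
    "count": lambda rows: len(rows),
}

def execute(plan, rows):
    """Run the logical plan by dispatching each step to its physical operator."""
    result = rows
    for step in plan:
        result = PHYSICAL_OPERATORS[step["op"]](result, **step["args"])
    return result

rows = [
    {"title": "A", "year": 2020},
    {"title": "B", "year": 2019},
    {"title": "C", "year": 2020},
]
plan = fake_llm_planner("How many items are from 2020?", list(PHYSICAL_OPERATORS))
answer = execute(plan, rows)
```

Keeping the plan as plain data makes it inspectable before execution, which speaks to the black-box concern raised for pure LLM approaches.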

Text to SQL in Enterprise Data