2026-05-30
#databases #genai #research
Anshul Paruchuri
Editor's Note:
We were invited to attend the Pre-Conference Workshop at BMSCE for the ACM SIGMOD 2026 Conference, taking place in Bangalore. There were several excellent talks and a lot of insightful discussion. We had a blast, and would love to see more events like this in the future!
This article first appeared on Anshul's blog! (link)
Speaker: Meenakshi D'Souza (IIT Madras)
ETL (Extract, Transform, Load) is a data integration process that combines data from various sources, transforms it according to the desired requirements, and then loads it into a target (typically a data warehouse).
No-code/low-code ETL platforms provide features for business users to develop ETL pipelines without writing much code.
Low-code open source ETL tools: Apache Airflow, Apache NiFi, CDAP.
Can we design a set of functional testing plugins for ETL workflows, mainly for data pipelines using ETL tools?
Data from multiple sources is transformed using a well-defined set of syntactic rules, applied step by step, and then loaded into a sink.
Typically, this is represented as a DAG. Each node in the DAG is a transformation step, and each edge connects one transformation step to the next.
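To make the DAG view concrete, here is a minimal sketch of running transformation steps in dependency order. The step names, the sample rows, and `run_pipeline` are all made up for illustration; real ETL tools have their own pipeline definitions.

```python
from graphlib import TopologicalSorter

# Hypothetical three-step pipeline: each node is a transformation step.
def extract(_inputs):
    return [{"name": " Asha ", "amount": "10"}, {"name": "Ravi", "amount": "-5"}]

def clean(inputs):
    (rows,) = inputs
    return [{"name": r["name"].strip(), "amount": int(r["amount"])} for r in rows]

def keep_positive(inputs):
    (rows,) = inputs
    return [r for r in rows if r["amount"] > 0]

STEPS = {"extract": extract, "clean": clean, "keep_positive": keep_positive}

# Edges of the DAG, written as "node -> set of nodes it depends on".
DEPENDENCIES = {"extract": set(), "clean": {"extract"}, "keep_positive": {"clean"}}

def run_pipeline(steps, dependencies):
    """Run every step after all of its predecessors and collect the outputs."""
    results = {}
    for node in TopologicalSorter(dependencies).static_order():
        inputs = [results[dep] for dep in sorted(dependencies[node])]
        results[node] = steps[node](inputs)
    return results

results = run_pipeline(STEPS, DEPENDENCIES)
```

Here `results["keep_positive"]` holds the rows that would be loaded into the sink.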
The proposed framework consists of three plugins: Assertion, Fixture, and Mutation.
This project presents a general framework for plugins to facilitate functional testing of low-code ETL workflows.
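The notes do not record what each plugin does, so the sketch below only borrows the usual testing meanings of the three words: a fixture supplies known input data, an assertion checks a property of a step's output, and a mutation deliberately breaks the step to see whether the assertion notices. Every name here is hypothetical and none of it is taken from the speaker's framework.

```python
def dedupe_by_id(rows):
    """Transformation under test: keep the first row seen for each id."""
    seen, out = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(row)
    return out

def fixture_rows():
    """Fixture: a small, fixed input containing one duplicate id."""
    return [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]

def assert_unique_ids(rows):
    """Assertion: the output must not contain repeated ids."""
    ids = [row["id"] for row in rows]
    return len(ids) == len(set(ids))

def mutant_dedupe(rows):
    """Mutation: a deliberately broken variant that skips deduplication."""
    return list(rows)

# The real step should pass the assertion; the mutant should fail it.
original_passes = assert_unique_ids(dedupe_by_id(fixture_rows()))
mutant_killed = not assert_unique_ids(mutant_dedupe(fixture_rows()))
```

If `mutant_killed` were false, the assertion would be too weak to catch that class of bug.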
Speaker: Prasad M Deshpande (Databricks)
Typical enterprise data is spread over multiple sources. The first step is to get all this data into a single platform. 3 steps in the process:
There are hundreds of source types, each with different APIs, protocols and quirks. Enterprises also have these requirements:
Ingestion has two phases: Snapshot (initial copy) and Incremental (keeping the data fresh).
Sources can be of 2 types: SaaS Systems and Database Systems
Another challenge with SaaS sources is API rate limits, which call for smart backoff strategies.
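The talk did not spell out a specific strategy. A common choice is exponential backoff with jitter, sketched below. `RateLimitError`, `call_with_backoff`, and the default delays are assumptions for illustration, not part of any particular SaaS client.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical error raised when the API reports that we are rate limited."""

def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call fetch(), retrying on rate limits with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            # Double the wait ceiling each attempt, cap it, then pick a random
            # fraction of it so that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt) * rng()
            sleep(delay)
```

The `sleep` and `rng` parameters exist only so the function can be exercised without real waiting. If the API returns a retry hint such as a `Retry-After` header, honoring that value is usually preferable to guessing.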
Source > Reader > Buffer > Merger > Destination
Incremental vs Batch Ingestion
Challenge: How do we fetch only new data? This can be done using cursor logic.
Steps:
The ideal cursor field changes whenever the record is updated, always increases and is strictly ordered by update time.
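A minimal sketch of cursor-based incremental reads, assuming an `updated_at` cursor field and an `id` primary key (both hypothetical). It compares with `>=` so that rows sharing the last cursor value are not skipped, and relies on upserting by primary key to make the re-read rows harmless.

```python
def sync_once(source_rows, destination, state):
    """Copy rows changed since the saved cursor and advance the cursor."""
    cursor = state.get("cursor")
    # >= re-reads rows that share the last cursor value, so none are missed.
    batch = [r for r in source_rows if cursor is None or r["updated_at"] >= cursor]
    for row in sorted(batch, key=lambda r: r["updated_at"]):
        destination[row["id"]] = row  # upsert by primary key absorbs duplicates
    if batch:
        state["cursor"] = max(r["updated_at"] for r in batch)
    return len(batch)

source = [{"id": 1, "updated_at": 10, "v": "a"}, {"id": 2, "updated_at": 20, "v": "b"}]
destination, state = {}, {}
first = sync_once(source, destination, state)

# A new row lands with the same timestamp as the saved cursor, and row 1 changes.
source.append({"id": 3, "updated_at": 20, "v": "c"})
source[0] = {"id": 1, "updated_at": 30, "v": "a2"}
second = sync_once(source, destination, state)
```

The second run re-reads row 2 even though it has not changed, which is the price of not missing row 3.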
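Comparing against the saved cursor with >= avoids skipping records that share the last cursor value, but it causes duplicates.
Challenges:
Database sources support CDC (change data capture), which reads the transaction log. The two phases here are CDC (read the change stream) and Snapshot (capture the current state by querying tables).
CDC needs to start before the snapshot, so that changes made while the snapshot is running are not lost.
CDC records carry a sequence number; the correct sequence number S should be the LSN (log sequence number) at CDC start.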
A 1TB table at 100MB/s requires nearly 3 hours for a full scan. In such a situation, the solution is to read in parallel, checkpoint individually, and retry independently if issues occur. Splits should ideally be even.
Use composite key partitioning for even splits.
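A rough sketch of the read-in-parallel idea, with per-split checkpoints and retries. It splits a plain integer key range rather than a composite key, and `read_range` is a hypothetical stand-in for the actual table query. Even key ranges only give even row counts when keys are spread uniformly.

```python
from concurrent.futures import ThreadPoolExecutor

def even_splits(lo, hi, n):
    """Cut the half-open key range [lo, hi) into n contiguous, near-equal ranges."""
    step, extra = divmod(hi - lo, n)
    splits, start = [], lo
    for i in range(n):
        end = start + step + (1 if i < extra else 0)
        splits.append((start, end))
        start = end
    return splits

def read_split(split, read_range, checkpoints, max_retries=3):
    """Read one split, retrying it on its own and recording its checkpoint."""
    for attempt in range(max_retries):
        try:
            rows = read_range(*split)
            checkpoints[split] = "done"
            return rows
        except IOError:
            if attempt == max_retries - 1:
                raise

def parallel_snapshot(lo, hi, n, read_range):
    """Read all splits concurrently; a failure in one split does not restart the others."""
    checkpoints = {}
    with ThreadPoolExecutor(max_workers=n) as pool:
        parts = pool.map(lambda s: read_split(s, read_range, checkpoints),
                         even_splits(lo, hi, n))
        rows = [row for part in parts for row in part]
    return rows, checkpoints

# The full-scan estimate from the talk: 1 TB at 100 MB/s.
scan_hours = 1e12 / 100e6 / 3600
```

In a real pipeline the checkpoints would be persisted, so that a restarted job can skip splits that already finished.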
The merger must reconcile these issues to ensure data integrity.
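One way to read the sequence-number rule above is that snapshot rows are tagged with S (the LSN at CDC start), CDC events carry their own LSN, and the merger keeps the highest sequence number per primary key. That reading is an interpretation of the notes, and the event shape (`lsn`, `op`, `id`, `row`) is invented for this sketch.

```python
def merge(snapshot_rows, cdc_events, snapshot_seq):
    """Combine a snapshot with CDC events, letting the newest sequence number win."""
    latest = {}  # primary key -> (sequence number, row, or None if deleted)
    for row in snapshot_rows:
        latest[row["id"]] = (snapshot_seq, row)
    for event in sorted(cdc_events, key=lambda e: e["lsn"]):
        key = event["id"]
        # CDC started at snapshot_seq, so its events are never older than the
        # tag given to snapshot rows and are applied on top of them in order.
        if key not in latest or event["lsn"] >= latest[key][0]:
            row = None if event["op"] == "delete" else event["row"]
            latest[key] = (event["lsn"], row)
    return {key: row for key, (_, row) in latest.items() if row is not None}

snapshot = [{"id": 1, "v": "old"}, {"id": 2, "v": "keep"}]
events = [
    {"lsn": 101, "op": "update", "id": 1, "row": {"id": 1, "v": "new"}},
    {"lsn": 102, "op": "delete", "id": 2, "row": None},
    {"lsn": 103, "op": "insert", "id": 3, "row": {"id": 3, "v": "added"}},
]
merged = merge(snapshot, events, snapshot_seq=100)
```

Replaying an event that the snapshot already reflects is harmless here, because the last event for a key always decides its final state.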
Speaker: Pavan Deolasse (EDB)
Postgres uses processes, not threads. This helps with portability, debugging, and fault isolation, but leads to high process overhead.
Parser > Analyzer > Rewriter > Planner > Executor
Planner/Optimizer is the most active contribution area in core Postgres.
Multiversion Concurrency Control (MVCC): tuples are never updated in place. An update always creates a new version of the row, which sits next to the old version.
This ensures readers and writers do not block each other. It gives accurate snapshot isolation, and rollbacks are easy.
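A toy model of the idea, heavily simplified. Postgres does record creating and superseding transaction ids on each tuple as `xmin` and `xmax`, but the visibility check below ignores almost everything the real rules handle (in-progress transactions, command ids, hint bits, vacuum), and the function names are made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TupleVersion:
    value: str
    xmin: int                   # id of the transaction that created this version
    xmax: Optional[int] = None  # id of the transaction that superseded it, if any

def update(versions, new_value, txid):
    """Never overwrite: mark the current version as superseded and append a new one."""
    versions[-1].xmax = txid
    versions.append(TupleVersion(new_value, xmin=txid))

def visible(versions, committed):
    """Return the value seen by a snapshot that treats only `committed` txids as done."""
    for v in versions:
        created = v.xmin in committed
        superseded = v.xmax is not None and v.xmax in committed
        if created and not superseded:
            return v.value
    return None

row = [TupleVersion("v1", xmin=1)]
update(row, "v2", txid=2)

before_commit = visible(row, committed={1})     # reader whose snapshot predates tx 2
after_commit = visible(row, committed={1, 2})   # reader that sees tx 2 as committed
```

A reader whose snapshot predates transaction 2 keeps seeing the old version without waiting on the writer. If transaction 2 rolls back, it simply never enters any committed set, so the old version stays visible and nothing has to be undone in place.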
WAL (write-ahead logging) is used for durability and replication. Changes are recorded in the log before the modified data pages are written to disk.
Postgres supports both physical and logical replication.
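The log-first ordering can be illustrated with a tiny key-value store. This is only a sketch of the principle, not how Postgres structures its WAL: `TinyStore` and the JSON-lines log format are invented for the example.

```python
import json
import os
import tempfile

class TinyStore:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}

    def put(self, key, value):
        # 1. Append the change to the log and force it to stable storage.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps({"key": key, "value": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only then apply the change to the in-memory state.
        self.data[key] = value

    @classmethod
    def recover(cls, wal_path):
        """Rebuild state after a crash by replaying the log from the start."""
        store = cls(wal_path)
        if os.path.exists(wal_path):
            with open(wal_path, encoding="utf-8") as wal:
                for line in wal:
                    record = json.loads(line)
                    store.data[record["key"]] = record["value"]
        return store

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal.log")
    store = TinyStore(path)
    store.put("a", 1)
    store.put("a", 2)
    recovered = TinyStore.recover(path).data
```

Replaying the same log on another machine would rebuild the same state there, which is the intuition behind using the log for replication as well as durability.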
Extensibility is a design principle in Postgres. It is built to be extended without forking.
Contributions are made primarily through the pgsql-hackers mailing list.
Speaker: Carsten Binnig (TU Darmstadt)
Almost every critical system depends on relational databases, and the cloud has made them more scalable. However, using these databases still carries a high overhead, and not much changed when everything moved to the cloud.
Original Promises:
However, data tends to be unstructured, and does not come in tables.
Relational tax: Overheads are rooted in the design of the relational model.
Query tax: Query authoring is complex.
Tuning tax: DBs require massive tuning.
A pure LLM/RAG-style approach for natural language queries leads to some issues.
LLM-Augmented DBs: Extend DBs with LLMs as needed. LLMs and DBs can be used to complement each other.
Relational + LLM-based operators. Use LLM-driven Multimodal filters.
LLMs can also be used for query planning.
Carsten's project is called CAESURA. (Code)
How it works: given a natural-language query and a set of logical operators, the LLM is asked to create a logical plan that could answer the query. This logical plan is then converted into a physical plan (the actual execution strategy on the tech stack being used). The LLM is used to reason over both the data and the logical operations.
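The flow described above can be sketched as follows. This is not CAESURA's API: the operator names, the plan format, and both stub functions are hypothetical, and the LLM call is replaced by a function that returns a fixed plan so the example runs on its own.

```python
def stub_llm_planner(question, logical_operators):
    """Stand-in for the LLM call: returns a fixed logical plan for the demo question."""
    return [
        {"op": "scan", "table": "paintings"},
        {"op": "llm_filter", "condition": "depicts a horse"},
        {"op": "count"},
    ]

def stub_image_matches(row, condition):
    """Stand-in for a multimodal model judging one row against a condition."""
    return condition.split()[-1] in row["caption"]

def to_physical(step, tables):
    """Map one logical step to a concrete callable for this toy in-memory stack."""
    if step["op"] == "scan":
        return lambda _rows: list(tables[step["table"]])
    if step["op"] == "llm_filter":
        return lambda rows: [r for r in rows if stub_image_matches(r, step["condition"])]
    if step["op"] == "count":
        return lambda rows: len(rows)
    raise ValueError(f"unknown logical operator: {step['op']}")

def answer(question, tables):
    """Plan with the (stubbed) LLM, lower the plan to physical operators, run them in order."""
    logical_plan = stub_llm_planner(question, ["scan", "llm_filter", "count"])
    result = None
    for operator in [to_physical(step, tables) for step in logical_plan]:
        result = operator(result)
    return result

TABLES = {"paintings": [
    {"id": 1, "caption": "a horse in a field"},
    {"id": 2, "caption": "still life with fruit"},
    {"id": 3, "caption": "rider on a horse"},
]}
count = answer("How many paintings depict a horse?", TABLES)
```

The split matters because the logical plan stays independent of the execution engine: the same plan could be lowered to SQL for the relational steps and to model calls for the LLM-based filter.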