ADAPT-uiuc/RuleFlow

Generating Reusable Program Optimizations with LLMs


RuleFlow turns LLM-discovered Pandas optimizations into reusable rewrite rules and applies them automatically using a lightweight compiler.

ArXiv Paper: RuleFlow: Generating Reusable Program Optimizations with LLMs


📙 About

RuleFlow is a practical framework for optimizing Pandas notebooks. It:

  • Discovers optimized code variants using LLMs,
  • Generalizes the optimizations into reusable rewrite rules, and
  • Deploys the rules deterministically to new notebooks.

The key idea is to decouple discovery from deployment: LLMs are used offline to discover and generalize optimizations, while notebook rewriting is performed by a compiler without repeated LLM calls.


✨ Features

  • 🔍 Automatically discovers optimized Pandas code from real notebooks
  • 🔁 Synthesizes general rewrite rules from per-program optimizations
  • ⚙️ Applies rules using a lightweight compiler
  • 🔒 Runtime preconditions ensure safe rewrites
  • 🚫 No LLM calls during deployment
  • 🏎️ State-of-the-art performance on PandasBench

👀 Quick Example

# original
df = df.drop(['name'], axis=1)

# rewritten by RuleFlow
df.pop('name')

In our benchmark, this rewrite is up to 1770× faster on a representative dataframe size.
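A minimal sketch for trying this comparison locally. The dataframe size, the column names, and the `make_frame` and `time_it` helpers are illustrative assumptions, not the RuleFlow benchmark setup, so the measured ratio will differ from the figure quoted above.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows: int = 100_000, n_cols: int = 10) -> pd.DataFrame:
    """Build a synthetic frame with a 'name' column (sizes are arbitrary assumptions)."""
    rng = np.random.default_rng(0)
    data = {f"c{i}": rng.random(n_rows) for i in range(n_cols)}
    data["name"] = np.arange(n_rows)
    return pd.DataFrame(data)


def time_it(fn, frame: pd.DataFrame) -> float:
    """Return wall-clock seconds for one call of fn on a private copy of frame."""
    work = frame.copy()
    start = time.perf_counter()
    fn(work)
    return time.perf_counter() - start


def original(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop(["name"], axis=1)  # returns a new DataFrame object
    return df


def rewritten(df: pd.DataFrame) -> pd.DataFrame:
    df.pop("name")  # removes the column from the same object
    return df


base = make_frame()
t_drop = time_it(original, base)
t_pop = time_it(rewritten, base)
print(f"drop: {t_drop:.6f}s  pop: {t_pop:.6f}s")
```

Note that `pop` mutates the frame in place, so any other reference to the same object also loses the column, whereas `df = df.drop(...)` rebinds `df` to a new frame.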


🧠 Overview

RuleFlow framework diagram

RuleFlow is organized as a three-stage pipeline:

  • SnippetGen – discovers optimized code snippets
  • RuleGen – converts snippet pairs into generalized rewrite rules
  • CodeGen – applies these rules to rewrite user notebooks

🚀 Quick Start

SnippetGen (Docker)

SnippetGen uses Docker for reproducibility.

Make sure your OPENAI_API_KEY is set in snippetgen/Dockerfile (see line 17).

cd snippetgen
docker build -t snippetgen .

Prepare the dataset from the repository root:

python3 snippetgen/prepare_dataset.py <path/to/directory/containing/training/data>

Generate randomized strings used by SnippetGen:

cd snippetgen/synthRun
python gen_rand_strings.py

RuleGen and CodeGen

Create environments for RuleGen and CodeGen:

chmod +x setup_env.sh
./setup_env.sh

Set your OpenAI key in:

rulegen/.env

Set the path to the notebooks you want to rewrite in:

codegen/codegen.py

▶️ Running the Pipeline

Use the driver script from the repository root:

python3 ruleflow.py <stage>

Allowed values:

  • all — run the full pipeline
  • snippetgen — run only SnippetGen
  • rulegen — run only RuleGen
  • codegen — run only CodeGen
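The mapping from the `<stage>` argument to the stages that run can be sketched as follows. This is a hypothetical illustration of the command-line contract described above, not the contents of `ruleflow.py`; `STAGES`, `resolve_stages`, and `build_parser` are assumed names, and the stage order is taken from the pipeline overview.

```python
import argparse

# Stage order follows the pipeline overview: SnippetGen, then RuleGen, then CodeGen.
STAGES = ("snippetgen", "rulegen", "codegen")


def resolve_stages(stage: str) -> list[str]:
    """Expand 'all' into every stage, or return the single requested stage."""
    if stage == "all":
        return list(STAGES)
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return [stage]


def build_parser() -> argparse.ArgumentParser:
    """Parser that accepts exactly one positional <stage> argument."""
    parser = argparse.ArgumentParser(prog="ruleflow.py")
    parser.add_argument("stage", choices=("all",) + STAGES)
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["rulegen"])
print(resolve_stages(args.stage))
```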

🛠️ Pipeline Details

Stage 1 — SnippetGen (Discovery)

SnippetGen Diagram

Original Code:
df = df.drop(['name'], axis=1)

Optimized Code:
df.pop('name')

SnippetGen Example: original and optimized code pair generated by SnippetGen

SnippetGen extracts cells from real notebooks and uses an agent (candidateGen) to generate semantically equivalent optimized rewrites. The candidates are validated and improved using algorithmic equivalence checks (equivCheck), optimization checks (optCheck), and LLM feedback (feedbackGen).
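To make the validation step concrete, here is a minimal sketch of an equivalence check and an optimization check on a single sample dataframe. The functions `equiv_check` and `opt_check` are hypothetical stand-ins written for illustration. They are not the repository's `equivCheck` or `optCheck` implementations, whose details are not described here.

```python
import time
from typing import Callable

import pandas as pd
from pandas.testing import assert_frame_equal

Snippet = Callable[[pd.DataFrame], pd.DataFrame]


def equiv_check(original: Snippet, candidate: Snippet, sample: pd.DataFrame) -> bool:
    """True if both snippets yield equal frames on independent copies of sample."""
    try:
        assert_frame_equal(original(sample.copy()), candidate(sample.copy()))
    except Exception:
        # A mismatch (AssertionError) or a crashing snippet both count as failure.
        return False
    return True


def opt_check(original: Snippet, candidate: Snippet, sample: pd.DataFrame,
              repeats: int = 5) -> bool:
    """True if the candidate's best-of-N wall-clock time beats the original's."""
    def best(fn: Snippet) -> float:
        times = []
        for _ in range(repeats):
            work = sample.copy()  # each run starts from a fresh copy
            start = time.perf_counter()
            fn(work)
            times.append(time.perf_counter() - start)
        return min(times)
    return best(candidate) < best(original)


def original(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop(["name"], axis=1)


def good_candidate(df: pd.DataFrame) -> pd.DataFrame:
    df.pop("name")
    return df


def bad_candidate(df: pd.DataFrame) -> pd.DataFrame:
    df.pop("age")  # removes the wrong column, so the result differs
    return df


sample = pd.DataFrame({"name": ["a", "b"], "age": [1, 2], "x": [0.5, 1.5]})
print(equiv_check(original, good_candidate, sample))
print(equiv_check(original, bad_candidate, sample))
```

Checking on one sample input can only refute equivalence, not prove it, which is one reason the deployed rules also carry runtime preconditions.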


Stage 2 — RuleGen (Bridge)

RuleGen Diagram

LHS:
@{Name: v1} = @{Name: v1}.drop([@{Const(str: c1)}], axis=1)

RHS:
@{v1}.pop(@{c1})

Runtime Preconditions:
[isinstance(@{v1}, pandas.DataFrame), @{c1} in @{v1}.columns]

RuleGen Example: rule generated by RuleGen

RuleGen converts original–optimized code pairs into generalized rewrite rules. It decomposes rule synthesis into four modular steps, each supported by a dedicated agent.
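One way to read the LHS pattern in the example rule is as a structural match over a Python syntax tree. The sketch below hand-codes that single pattern with the standard `ast` module and returns bindings for the rule's `v1` and `c1` placeholders. The function `match_drop_rule` is a hypothetical illustration and does not reflect how RuleGen or CodeGen represent or match rules internally.

```python
import ast
from typing import Optional


def match_drop_rule(stmt: ast.stmt) -> Optional[dict]:
    """Return {'v1': name, 'c1': column} if stmt is `v1 = v1.drop([c1], axis=1)`."""
    if not (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1):
        return None
    target, call = stmt.targets[0], stmt.value
    if not (isinstance(target, ast.Name) and isinstance(call, ast.Call)):
        return None
    func = call.func
    # The receiver of .drop must be the same name as the assignment target.
    if not (isinstance(func, ast.Attribute) and func.attr == "drop"
            and isinstance(func.value, ast.Name) and func.value.id == target.id):
        return None
    # Exactly one positional argument: a one-element list holding a string constant.
    if len(call.args) != 1 or not isinstance(call.args[0], ast.List):
        return None
    elts = call.args[0].elts
    if len(elts) != 1:
        return None
    if not (isinstance(elts[0], ast.Constant) and isinstance(elts[0].value, str)):
        return None
    # Exactly one keyword argument: axis=1 (an int literal, not True).
    if len(call.keywords) != 1:
        return None
    kw = call.keywords[0]
    if not (kw.arg == "axis" and isinstance(kw.value, ast.Constant)
            and type(kw.value.value) is int and kw.value.value == 1):
        return None
    return {"v1": target.id, "c1": elts[0].value}


stmt = ast.parse("df = df.drop(['name'], axis=1)").body[0]
print(match_drop_rule(stmt))  # {'v1': 'df', 'c1': 'name'}
```

A purely syntactic match cannot tell whether `df` is a DataFrame or whether the column exists, which is what the rule's runtime preconditions cover.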


Stage 3 — CodeGen (Deployment)

CodeGen Diagram

if isinstance(df, pd.DataFrame) and 'Date' in df.columns:
    df.pop('Date')
else:
    df = df.drop(['Date'], axis=1)

CodeGen Example: application of a rewrite rule by CodeGen

CodeGen is a lightweight compiler that applies the rewrite rules produced by RuleGen to rewrite user notebooks.
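The guarded output in the example above can be sketched as a small code generator: given the original statement and the bindings for the rule's placeholders, it emits an `if`/`else` that takes the rewritten path only when the runtime preconditions hold and otherwise falls back to the original code. The helper `guarded_rewrite` is a hypothetical illustration, not CodeGen's implementation. It assumes a single-line original statement and that pandas is imported as `pd` in the notebook.

```python
import ast

import pandas as pd


def guarded_rewrite(original_src: str, v1: str, c1: str) -> str:
    """Emit an if/else that uses the fast path only when the preconditions hold."""
    guard = f"isinstance({v1}, pd.DataFrame) and {c1!r} in {v1}.columns"
    code = (
        f"if {guard}:\n"
        f"    {v1}.pop({c1!r})\n"
        f"else:\n"
        f"    {original_src}\n"
    )
    ast.parse(code)  # raises SyntaxError if the generated text is not valid Python
    return code


generated = guarded_rewrite("df = df.drop(['Date'], axis=1)", "df", "Date")
print(generated)

# Run the generated code against a small frame to see its effect.
namespace = {"pd": pd, "df": pd.DataFrame({"Date": [1, 2], "x": [3, 4]})}
exec(generated, namespace)
print(list(namespace["df"].columns))  # ['x']
```

Keeping the original statement in the `else` branch means that inputs failing the preconditions behave exactly as they did before the rewrite.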


📊 Results

🏆 State of the Art (SOTA)

Speedups of different frameworks over Pandas across PandasBench

RuleFlow runs the most notebooks (101 of 102) and achieves the following speedups (×) over each framework:

Framework   # Notebooks   Mean     Median   Max       Min
DIAS        97            1.54     1.13     4.30      0.80
Modin       72            112.79   32.78    1914.89   0.48
Dask        3             12.32    6.45     28.85     1.68
Koalas      10            140.15   100.36   377.09    7.25

At a finer granularity, individual rewrite rules achieve speedups of up to 199× over Dias and 1704× over Modin. A single rule can apply to as many as 72 notebooks, and a single notebook can match up to 13 distinct rules.


📈 Yield Analysis

SnippetGen Yield

The low yield highlights that per-program LLM optimization is impractical: most generated candidates are either incorrect or fail to provide speedups. This motivates RuleFlow’s rule-centric design.


📄 About the Paper

RuleFlow is presented in our research paper, which introduces a Pandas optimization framework that combines the creativity of large language models with the determinism of compiler-based optimization. The paper evaluates RuleFlow on the PandasBench benchmark and shows that it outperforms prior compiler-based and systems-based Pandas optimization frameworks.


📎 Citation

If you use RuleFlow in your work, please cite our paper:

@misc{singh2026ruleflowgeneratingreusable,
      title={RuleFlow: Generating Reusable Program Optimizations with LLMs},
      author={Avaljot Singh and Dushyant Bharadwaj and Stefanos Baziotis and Kaushik Varadharajan and Charith Mendis},
      year={2026},
      eprint={2602.09051},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2602.09051}, 
}
