Generating Reusable Program Optimizations with LLMs

RuleFlow turns LLM-discovered Pandas optimizations into reusable rewrite rules and applies them automatically using a lightweight compiler.

ArXiv Paper: RuleFlow: Generating Reusable Program Optimizations with LLMs

📙 About

RuleFlow is a practical framework for optimizing Pandas notebooks. It:

Discovers optimized code variants using LLMs,
Generalizes the optimizations into reusable rewrite rules, and
Deploys the rules deterministically to new notebooks.

The key idea is to decouple discovery from deployment: LLMs are used offline to discover and generalize optimizations, while notebook rewriting is performed by a compiler without repeated LLM calls.

✨ Features

🔍 Automatically discovers optimized Pandas code from real notebooks
🔁 Synthesizes general rewrite rules from per-program optimizations
⚙️ Applies rules using a lightweight compiler
🔒 Runtime preconditions ensure safe rewrites
🚫 No LLM calls during deployment
🏎️ State-of-the-art performance on PandasBench

👀 Quick Example

# original
df = df.drop(['name'], axis=1)

# rewritten by RuleFlow
df.pop('name')

In our benchmark, this rewrite is up to 1770× faster on a representative dataframe size.
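To see why the two forms are interchangeable here, the following minimal sketch runs both on a small, made-up dataframe and compares the results. The data and variable names are illustrative assumptions, not taken from the benchmark.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
data = {"name": ["a", "b", "c"], "age": [1, 2, 3], "city": ["x", "y", "z"]}

# Original form: drop() builds and returns a new DataFrame.
dropped = pd.DataFrame(data)
dropped = dropped.drop(["name"], axis=1)

# Rewritten form: pop() removes the column from the existing object
# and returns the removed column as a Series.
popped = pd.DataFrame(data)
removed = popped.pop("name")

# Both variables now hold a dataframe without the 'name' column.
same_result = dropped.equals(popped)
print(same_result)
```

One difference to keep in mind: `pop` mutates the dataframe in place, so any other variable that refers to the same object also loses the column, whereas `drop` leaves the original object untouched.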

🧠 Overview

RuleFlow is organized as a three-stage pipeline:

SnippetGen – discovers optimized code snippets
RuleGen – converts snippet pairs into generalized rewrite rules
CodeGen – applies these rules to rewrite user notebooks

🚀 Quick Start

SnippetGen (Docker)

SnippetGen uses Docker for reproducibility.

Make sure your OPENAI_API_KEY is set in snippetgen/Dockerfile (see line 17).

cd snippetgen
docker build -t snippetgen .

Prepare the dataset from the repository root:

python3 snippetgen/prepare_dataset.py <path/to/directory/containing/training/data>

Generate randomized strings used by SnippetGen:

cd snippetgen/synthRun
python gen_rand_strings.py

RuleGen and CodeGen

Create environments for RuleGen and CodeGen:

chmod +x setup_env.sh
./setup_env.sh

Set your OpenAI key in:

rulegen/.env

Set the path to the notebooks you want to rewrite in:

codegen/codegen.py

▶️ Running the Pipeline

Use the driver script from the repository root:

python3 ruleflow.py <stage>

Allowed values:

all — run the full pipeline
snippetgen — run only SnippetGen
rulegen — run only RuleGen
codegen — run only CodeGen
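For readers who want to wrap the driver in their own tooling, here is a minimal sketch of how a stage argument like the one above could be mapped to an ordered list of stages. It is not the actual contents of `ruleflow.py`; the function names `resolve_stages` and `parse_args` are hypothetical.

```python
import argparse

# Stage names as listed above, in pipeline order.
STAGES = ("snippetgen", "rulegen", "codegen")


def resolve_stages(stage):
    """Map the CLI argument to the ordered list of stages to run."""
    if stage == "all":
        return list(STAGES)
    if stage in STAGES:
        return [stage]
    raise ValueError(f"unknown stage: {stage!r}")


def parse_args(argv):
    """Parse a single positional <stage> argument (hypothetical CLI shape)."""
    parser = argparse.ArgumentParser(prog="ruleflow.py")
    parser.add_argument("stage", choices=("all",) + STAGES)
    return parser.parse_args(argv)


# Demonstration with an explicit argument list instead of sys.argv.
args = parse_args(["all"])
for name in resolve_stages(args.stage):
    print(f"would run stage: {name}")
```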

🛠️ Pipeline Details

Stage 1 — SnippetGen (Discovery)

Example original and optimized code pair generated by SnippetGen:

Original code: `df = df.drop(['name'], axis=1)`
Optimized code: `df.pop('name')`

SnippetGen extracts cells from real notebooks and uses an agent (candidateGen) to generate semantically equivalent optimized rewrites. The candidates are validated and improved using algorithmic equivalence checks (equivCheck), optimization checks (optCheck), and LLM feedback (feedbackGen).
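The validation idea can be illustrated with a small test-based sketch: run the original and the candidate on the same input, compare the resulting dataframes, and time both. This is an assumption about the general shape of such checks, not the implementation of equivCheck or optCheck; the helpers `run_snippet`, `equiv_check`, and `speedup` are hypothetical, and agreement on one input is evidence of equivalence rather than a proof.

```python
import timeit

import pandas as pd


def run_snippet(source, frame):
    """Execute a snippet on a private copy of `frame` and return the final `df`."""
    env = {"df": frame.copy(), "pd": pd}
    exec(source, env)
    return env["df"]


def equiv_check(original, candidate, frame):
    """True if both snippets leave `df` in the same state on this input."""
    return run_snippet(original, frame).equals(run_snippet(candidate, frame))


def speedup(original, candidate, frame, repeats=5):
    """Ratio of best-of-N run times; each timing includes the frame copy."""
    t_orig = min(timeit.repeat(lambda: run_snippet(original, frame),
                               number=1, repeat=repeats))
    t_cand = min(timeit.repeat(lambda: run_snippet(candidate, frame),
                               number=1, repeat=repeats))
    return t_orig / max(t_cand, 1e-12)


# Hypothetical input data for the check.
frame = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]})
original = "df = df.drop(['name'], axis=1)"
good_candidate = "df.pop('name')"
bad_candidate = "df = df.drop(['value'], axis=1)"  # drops the wrong column

print(equiv_check(original, good_candidate, frame))
print(equiv_check(original, bad_candidate, frame))
print(speedup(original, good_candidate, frame) > 0)
```

A candidate that fails the equivalence check would be rejected or sent back for another round of feedback; a candidate that passes but shows no speedup is not worth turning into a rule.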

Stage 2 — RuleGen (Bridge)

Example rule generated by RuleGen:

LHS: `@{Name: v1} = @{Name: v1}.drop([@{Const(str: c1)}], axis=1)`
RHS: `@{v1}.pop(@{c1})`
Runtime preconditions: `[isinstance(@{v1}, pandas.DataFrame), @{c1} in @{v1}.columns]`

RuleGen converts original–optimized code pairs into generalized rewrite rules. It decomposes rule synthesis into four modular steps, each supported by a dedicated agent.
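To make the rule format concrete, the sketch below stores the example rule as plain strings and fills its placeholders from a set of bindings. The `Rule` class, the `instantiate` helper, and the regular expression are illustrative assumptions about how the `@{...}` notation could be handled, not RuleGen's actual data model.

```python
import re
from dataclasses import dataclass

# Matches @{v1}, @{Name: v1}, and @{Const(str: c1)}; group 1 is the variable name.
PLACEHOLDER = re.compile(r"@\{(?:[^{}]*:\s*)?(\w+)\)?\}")


@dataclass(frozen=True)
class Rule:
    lhs: str
    rhs: str
    preconditions: tuple


def instantiate(template, bindings):
    """Replace every placeholder in `template` with its bound source text."""
    return PLACEHOLDER.sub(lambda m: bindings[m.group(1)], template)


# The example rule shown above, copied as strings.
drop_to_pop = Rule(
    lhs="@{Name: v1} = @{Name: v1}.drop([@{Const(str: c1)}], axis=1)",
    rhs="@{v1}.pop(@{c1})",
    preconditions=(
        "isinstance(@{v1}, pandas.DataFrame)",
        "@{c1} in @{v1}.columns",
    ),
)

# Hypothetical bindings produced by matching the LHS against user code.
bindings = {"v1": "df", "c1": "'Date'"}

print(instantiate(drop_to_pop.lhs, bindings))
print(instantiate(drop_to_pop.rhs, bindings))
for condition in drop_to_pop.preconditions:
    print(instantiate(condition, bindings))
```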

Stage 3 — CodeGen (Deployment)

Example application of a rewrite rule by CodeGen:

if isinstance(df, pd.DataFrame) and 'Date' in df.columns:
    df.pop('Date')
else:
    df = df.drop(['Date'], axis=1)

CodeGen is a lightweight compiler that applies the rewrite rules produced by RuleGen to rewrite user notebooks.
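As a rough illustration of what deterministic rule application can look like, the sketch below uses Python's standard `ast` module to find top-level statements of the form `v = v.drop(['c'], axis=1)` and wrap them in a guarded fast path. It hard-codes one rule, only looks at top-level statements, and assumes pandas is imported as `pd` in the rewritten code; the functions `match_drop` and `rewrite` are hypothetical and are not CodeGen's implementation.

```python
import ast


def match_drop(stmt):
    """Return (variable, column) if stmt is `v = v.drop(['c'], axis=1)`, else None."""
    if not (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1):
        return None
    target, call = stmt.targets[0], stmt.value
    if not (isinstance(target, ast.Name) and isinstance(call, ast.Call)):
        return None
    func = call.func
    if not (isinstance(func, ast.Attribute) and func.attr == "drop"
            and isinstance(func.value, ast.Name)
            and func.value.id == target.id):
        return None
    if len(call.args) != 1 or len(call.keywords) != 1:
        return None
    cols, kw = call.args[0], call.keywords[0]
    if not (isinstance(cols, ast.List) and len(cols.elts) == 1
            and isinstance(cols.elts[0], ast.Constant)
            and isinstance(cols.elts[0].value, str)):
        return None
    if not (kw.arg == "axis" and isinstance(kw.value, ast.Constant)
            and type(kw.value.value) is int and kw.value.value == 1):
        return None
    return target.id, cols.elts[0].value


def rewrite(source):
    """Wrap each matching top-level statement in a precondition guard."""
    tree = ast.parse(source)
    new_body = []
    for stmt in tree.body:
        match = match_drop(stmt)
        if match is None:
            new_body.append(stmt)
            continue
        var, col = match
        guarded = ast.parse(
            f"if isinstance({var}, pd.DataFrame) and {col!r} in {var}.columns:\n"
            f"    {var}.pop({col!r})\n"
            f"else:\n"
            f"    pass\n"
        ).body[0]
        guarded.orelse = [stmt]  # keep the original statement as the fallback
        new_body.append(guarded)
    tree.body = new_body
    return ast.unparse(ast.fix_missing_locations(tree))


rewritten = rewrite("df = df.drop(['Date'], axis=1)\ntotal = len(df)")
print(rewritten)
```

Keeping the original statement in the `else` branch means the rewritten notebook behaves as before whenever a runtime precondition does not hold.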

📊 Results

🏆 State of the Art (SOTA)

Speedups of different frameworks over Pandas across PandasBench

RuleFlow runs the most notebooks of any evaluated framework (101 of 102) and achieves the following speedups (×) over each baseline:

Framework	# Notebooks	Mean	Median	Max	Min
DIAS	97	1.54	1.13	4.30	0.80
Modin	72	112.79	32.78	1914.89	0.48
Dask	3	12.32	6.45	28.85	1.68
Koalas	10	140.15	100.36	377.09	7.25

At a finer granularity, individual rewrite rules achieve speedups of up to 199× over Dias and 1704× over Modin. A single rule can apply to as many as 72 notebooks, and a single notebook can match up to 13 distinct rules.

📈 Yield Analysis

The low yield, meaning the small share of LLM-generated candidates that are both correct and faster, shows that per-program LLM optimization is impractical: most generated candidates are either incorrect or fail to provide speedups. This motivates RuleFlow’s rule-centric design.

📄 About the Paper

RuleFlow is presented in our research paper, which introduces a Pandas optimization framework that combines the creativity of large language models with the determinism of compiler-based optimization. The paper evaluates RuleFlow on the PandasBench benchmark and shows that it outperforms prior compiler-based and systems-based Pandas optimization frameworks.


📎 Citation

If you use RuleFlow in your work, please cite our paper:

@misc{singh2026ruleflowgeneratingreusable,
      title={RuleFlow : Generating Reusable Program Optimizations with LLMs}, 
      author={Avaljot Singh and Dushyant Bharadwaj and Stefanos Baziotis and Kaushik Varadharajan and Charith Mendis},
      year={2026},
      eprint={2602.09051},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2602.09051}, 
}
