
Conversation

@timmoon10
Collaborator

@timmoon10 timmoon10 commented Jan 24, 2026

Description

This PR adds a grouped linear op, which can be used in the grouped MLP block in Mixture-of-Experts models. It also adds an experimental fused operation for a grouped MLP block, using a CuTe DSL kernel that computes an MXFP8 grouped GEMM and SwiGLU.
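For orientation, here is a rough sketch of how the new ops could compose into an MoE grouped MLP block with te_ops.Sequential. The module names come from this PR, but the constructor arguments and forward signature shown here are assumptions for illustration, not the finalized API:

import torch
import transformer_engine.pytorch.ops as te_ops

# Hypothetical composition; constructor arguments and the forward call are
# assumptions for illustration, not the API introduced by this PR.
num_groups, hidden_size, ffn_size = 8, 1024, 4096
grouped_mlp = te_ops.Sequential(
    te_ops.GroupedLinear(num_groups, hidden_size, 2 * ffn_size),  # FC1: gate + linear units
    te_ops.ScaledSwiGLU(),                                        # SwiGLU with post-scaling
    te_ops.GroupedLinear(num_groups, ffn_size, hidden_size),      # FC2
)

x = torch.randn(4096, hidden_size, device="cuda", dtype=torch.bfloat16)
split_sizes = [512] * num_groups            # tokens routed to each group/expert
probs = torch.rand(4096, 1, device="cuda")  # per-token scales for the post-scaled SwiGLU
# y = grouped_mlp(x, split_sizes, probs)    # argument plumbing is an assumption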

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add a grouped linear operation
  • Add a post-scaled SwiGLU op with support for interleaving SwiGLU gate and linear units (see the layout sketch after this list)
  • Add a fused operation for grouped MLP
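
Where the interleaving is relevant: instead of the FC1 output holding all gate units followed by all linear units, the two can alternate in fixed-width chunks (32-wide, per the review summary below). A minimal plain-PyTorch sketch of de-interleaving such an output and applying the post-scaled SwiGLU, assuming that chunked layout:

import torch
import torch.nn.functional as F

def deinterleave_swiglu(h, scale, chunk=32):
    """h: FC1 output whose last dim alternates 32 gate channels, 32 linear
    channels, and so on (the exact layout is an assumption for illustration).
    Returns the post-scaled SwiGLU output."""
    n = h.shape[-1]
    h = h.reshape(*h.shape[:-1], n // (2 * chunk), 2, chunk)
    gate, lin = h.unbind(dim=-2)                     # separate interleaved chunks
    gate = gate.reshape(*gate.shape[:-2], n // 2)
    lin = lin.reshape(*lin.shape[:-2], n // 2)
    return F.silu(gate) * lin * scale                # SwiGLU with post-scaling

x = torch.randn(16, 256)
probs = torch.rand(16, 1)
y = deinterleave_swiglu(x, probs)
assert y.shape == (16, 128)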

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

timmoon10 and others added 30 commits January 7, 2026 00:15
Refactor fusion functions to remove index bookkeeping. Refactor fused ops to use consistent operation order.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Test is too permissive since the test should still be failing. The weights are not properly interleaved yet.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 marked this pull request as ready for review January 25, 2026 01:00
@timmoon10
Collaborator Author

/te-ci pytorch L1

@greptile-apps
Contributor

greptile-apps bot commented Jan 25, 2026

Greptile Overview

Greptile Summary

Adds a grouped linear operation and an experimental fused grouped MLP for Mixture-of-Experts models. The implementation includes a new GroupedLinear operation that splits the input tensor and applies a separate linear transformation to each group, a ScaledSwiGLU activation with post-scaling support, and an experimental fused operation ForwardGroupedMLP_CuTeGEMMSwiGLU_MXFP8 that uses a CuTe DSL kernel from cuDNN to compute FC1 + SwiGLU + FC2 in fewer kernel launches on SM100+ GPUs with MXFP8 quantization.

Key changes:

  • GroupedLinear supports MXFP8 quantization with packed weight buffers for efficient grouped GEMMs
  • ScaledSwiGLU enables post-scaling with optional 32-wide gate/activation interleaving
  • Experimental fusion uses CuTe DSL kernel (SM100+ only) to compute grouped GEMM + SwiGLU + post-scale in a single kernel
  • Helper function noop_cat added for efficient tensor concatenation without copying when tensors are already contiguous in memory (see the sketch after this list)
  • Comprehensive test coverage added for both operations
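
The noop_cat bullet above may be easier to see with a sketch. The idea, as described here, is to detect when the inputs are already adjacent slices of one contiguous buffer and return a view instead of copying; this is an illustrative guess at the idea, not the helper's actual implementation, and it only handles concatenation along dim 0:

import torch

def noop_cat_sketch(tensors):
    """Concatenate along dim 0 without copying when the inputs are already
    back-to-back slices of one contiguous buffer (illustrative only)."""
    base = tensors[0]
    ptr = base.data_ptr()
    for t in tensors:
        if not t.is_contiguous() or t.data_ptr() != ptr:
            return torch.cat(tensors)          # not adjacent: fall back to a copy
        ptr += t.numel() * t.element_size()
    out_shape = (sum(t.shape[0] for t in tensors),) + tuple(base.shape[1:])
    # All chunks share one buffer and are adjacent: reinterpret without a copy.
    return base.as_strided(out_shape, base.stride(), base.storage_offset())

buf = torch.randn(6, 4)
parts = list(buf.chunk(3, dim=0))
merged = noop_cat_sketch(parts)
assert merged.data_ptr() == buf.data_ptr()     # no data movement happened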

Issues previously reported have been addressed:
All previously flagged issues (undefined variables, duplicate condition checks, typos, missing f-string prefixes, incorrect attribute access, gradient accumulation flag handling) appear to have been fixed in the current version.

Confidence Score: 4/5

  • Safe to merge with minor considerations - the experimental fusion is properly gated behind SM100+ checks and MXFP8 recipe detection
  • The implementation is well-structured with comprehensive tests, proper error handling, and hardware capability checks. All previously reported issues have been addressed. Score is 4 rather than 5 because the experimental CuTe DSL fusion is complex and hardware-specific (SM100+ only), requiring thorough hardware testing that cannot be verified from code review alone.
  • The experimental fusion in forward_grouped_mlp.py requires SM100+ hardware validation. All other files appear production-ready.

Important Files Changed

Filename | Overview
transformer_engine/pytorch/ops/basic/grouped_linear.py | Implements the grouped linear operation with MXFP8 quantization support, including parameter initialization and gradient accumulation
transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py | Implements the experimental fused operation for grouped MLP using a CuTe DSL kernel for MXFP8, SM100+ only
transformer_engine/pytorch/ops/basic/swiglu.py | Adds the ScaledSwiGLU operation with post-scaling support and optional gate/activation interleaving

Sequence Diagram

sequenceDiagram
    participant User
    participant Sequential as te_ops.Sequential
    participant GroupedLinear1 as GroupedLinear (FC1)
    participant ScaledSwiGLU
    participant GroupedLinear2 as GroupedLinear (FC2)
    participant FusedOp as ForwardGroupedMLP_CuTeGEMMSwiGLU_MXFP8
    participant CuTeKernel as CuDNN CuTe DSL Kernel

    Note over User,CuTeKernel: Regular Path (No Fusion)
    User->>Sequential: forward(input, split_sizes, scales)
    Sequential->>GroupedLinear1: forward(input, split_sizes)
    GroupedLinear1->>GroupedLinear1: Split input by split_sizes
    GroupedLinear1->>GroupedLinear1: Quantize to MXFP8 if enabled
    GroupedLinear1->>GroupedLinear1: general_grouped_gemm(weights, inputs)
    GroupedLinear1-->>Sequential: fc1_output
    Sequential->>ScaledSwiGLU: forward(fc1_output, scales)
    ScaledSwiGLU->>ScaledSwiGLU: Remove gate interleaving if needed
    ScaledSwiGLU->>ScaledSwiGLU: Compute SwiGLU activation
    ScaledSwiGLU->>ScaledSwiGLU: Apply post-scaling (output * scales)
    ScaledSwiGLU-->>Sequential: swiglu_output
    Sequential->>GroupedLinear2: forward(swiglu_output, split_sizes)
    GroupedLinear2->>GroupedLinear2: Split input by split_sizes
    GroupedLinear2->>GroupedLinear2: Quantize to MXFP8 if enabled
    GroupedLinear2->>GroupedLinear2: general_grouped_gemm(weights, inputs)
    GroupedLinear2-->>Sequential: final_output
    Sequential-->>User: final_output

    Note over User,CuTeKernel: Fused Path (MXFP8 + SM100+)
    User->>Sequential: forward(input, split_sizes, scales)
    Sequential->>FusedOp: fuser_forward(input, split_sizes, scales)
    FusedOp->>FusedOp: Quantize FC1 inputs to MXFP8
    FusedOp->>FusedOp: Pack FC1 data/scales with gate swapping
    FusedOp->>CuTeKernel: grouped_gemm_swiglu_wrapper_sm100()
    Note right of CuTeKernel: Single kernel:<br/>FC1 GEMM + SwiGLU + post-scale
    CuTeKernel-->>FusedOp: FC2 inputs (MXFP8, row+col quantized)
    FusedOp->>FusedOp: Unpack FC2 inputs and undo gate swap
    FusedOp->>FusedOp: Construct MXFP8Tensor objects
    FusedOp->>FusedOp: general_grouped_gemm(FC2 weights, FC2 inputs)
    FusedOp-->>Sequential: final_output
    Sequential-->>User: final_output
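
For readers who prefer code to a sequence diagram, the regular (unfused) path above roughly corresponds to the following plain-PyTorch loop over groups. It is an illustration only; the [gate | linear] layout of the FC1 output and the shape of the scales are assumptions:

import torch
import torch.nn.functional as F

def grouped_mlp_reference(x, split_sizes, scales, fc1_weights, fc2_weights):
    """Unfused reference for the sequence diagram above (illustration only)."""
    outputs = []
    for xg, sg, w1, w2 in zip(x.split(split_sizes), scales.split(split_sizes),
                              fc1_weights, fc2_weights):
        h = xg @ w1.t()                       # FC1 GEMM for one group
        gate, lin = h.chunk(2, dim=-1)        # layout assumption: [gate | linear]
        act = F.silu(gate) * lin              # SwiGLU
        act = act * sg                        # post-scale by routing probabilities
        outputs.append(act @ w2.t())          # FC2 GEMM for one group
    return torch.cat(outputs)

# The fused path replaces the per-group FC1 GEMM + SwiGLU + post-scale with a
# single MXFP8 CuTe DSL kernel launch on SM100+ GPUs, as shown in the diagram.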

quantizer.optimize_for_gemm = True
fc1_xs = tex.split_quantize(fc1_x, split_sizes_cpu, fc1_input_quantizers)

# Pack data tensors
Member

Maybe a silly question: is this packing and unpacking code just for verification, or will it be in the final version?

Collaborator Author

I'm working on getting rid of the concatenations, but the permutes are no-ops. The kernel API expects tensors with non-contiguous dims: https://github.com/NVIDIA/cudnn-frontend/blob/main/python/cudnn/grouped_gemm/grouped_gemm_swiglu/api.py#L240-L245
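
A quick way to see why the permutes cost nothing: torch.Tensor.permute only swaps strides and returns a view, which gives exactly the non-contiguous layout the kernel API linked above asks for:

import torch

x = torch.randn(8, 16)
y = x.permute(1, 0)                  # stride swap only; no copy, no kernel launch
assert y.data_ptr() == x.data_ptr()  # same underlying buffer
assert not y.is_contiguous()         # non-contiguous view, as the kernel API expects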

)

# Fused kernel for FC1 + SwiGLU + post-scale
fc1_kernel_out = self.grouped_gemm_swiglu_kernel()(
Contributor

After SwiGLU, the output usually needs to be multiplied by permuted_probs. Is this weighted SwiGLU supported?

Review suggestions from @greptile-apps

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Review suggestion from @greptile-apps

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

Comment on lines +537 to +539
accumulate_into_main_grad = not getattr(
    weight_param, "overwrite_main_grad", False
)
Contributor

accumulate_into_main_grad is reassigned in the loop, so the last group's setting applies to all groups in the GEMM call on line 576. If different weight groups have different overwrite_main_grad settings, this causes incorrect gradient accumulation behavior. Either check consistency across groups or use per-group flags.
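
A minimal sketch of the consistency check suggested here, assuming the flags are collected while iterating over the weight parameters (weight_params and the surrounding loop are illustrative, not the PR's actual variable names):

# Illustrative only: collect one flag per weight group instead of reassigning
# a single variable inside the loop, then require the groups to agree.
accumulate_flags = [
    not getattr(weight_param, "overwrite_main_grad", False)
    for weight_param in weight_params
]
if len(set(accumulate_flags)) > 1:
    raise ValueError(
        "overwrite_main_grad must be set consistently across all weight groups"
    )
accumulate_into_main_grad = accumulate_flags[0]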

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments


Labels

performance Performance issues


3 participants