SpecBridge provides a DreaMS-conditioned adapter for spectra->molecule mapping and a training pipeline with synthetic and real (MGF) data.
- Create conda environment:
  conda env create -f environment.yml
  conda activate specbridge
- Install SpecBridge:
  pip install -e .
- Install DreaMS dependency:
  cd DreaMS
  pip install -e .
  cd ..
For detailed setup instructions, troubleshooting, and alternative installation methods, see SETUP_ENVIRONMENT.md.
DreaMS pre-trained weights are available at: https://zenodo.org/records/10997887
SpecBridge pre-trained adapters, datasets, and candidate files are available at: https://zenodo.org/records/18357418 (DOI: 10.5281/zenodo.18357418)
SpecBridge model weights are also available on Hugging Face: https://huggingface.co/Spony/SpecBridge (weights only)
The following datasets are supported with their corresponding candidate files. Download from Zenodo:
- MassSpecGym (MSGYM):
  - MGF: SpecBridge_MassSpecGym_dataset.mgf
  - Candidates: SpecBridge_MSGYM_candidates.pkl
- Spectraverse:
  - MGF: SpecBridge_Spectraverse_dataset.mgf
  - Candidates: SpecBridge_Spectraverse_candidates.pkl
- MSnLib:
  - MGF: SpecBridge_MSnLib_dataset.mgf
  - Candidates: SpecBridge_MSnLib_candidates.pkl
Quick Download: Use the provided script to download all files:
./download_from_zenodo.sh
This will automatically download files to the correct directories:
- Checkpoints → runs/msgym/, runs/msnlib/, runs/spectraverse/
- Datasets → data/SpecBridge_*_dataset.mgf
- Candidates → data/SpecBridge_*_candidates.pkl
Files keep their Zenodo names. Make sure to use the matching candidate file for your MGF dataset when running evaluation.
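The dataset/candidate pairing above can be captured in a small lookup so that a mismatched pair is caught before evaluation starts. This is a minimal sketch, not part of SpecBridge: `matching_candidates` is a hypothetical helper, and it assumes the candidate file sits in the same directory as the MGF file, as the download script arranges.

```python
from pathlib import Path

# Dataset -> candidate file pairs, as listed in this README.
CANDIDATES_FOR_MGF = {
    "SpecBridge_MassSpecGym_dataset.mgf": "SpecBridge_MSGYM_candidates.pkl",
    "SpecBridge_Spectraverse_dataset.mgf": "SpecBridge_Spectraverse_candidates.pkl",
    "SpecBridge_MSnLib_dataset.mgf": "SpecBridge_MSnLib_candidates.pkl",
}


def matching_candidates(mgf_path):
    """Return the candidate file expected next to a known MGF dataset.

    Hypothetical helper: assumes both files live in the same directory.
    """
    mgf = Path(mgf_path)
    try:
        name = CANDIDATES_FOR_MGF[mgf.name]
    except KeyError:
        raise ValueError(f"no known candidate file for {mgf.name}") from None
    return mgf.with_name(name)
```

The returned path can then be passed to `--candidates` alongside the MGF given to `--mgf`.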
To train the adapter on your MGF data:
python -m specbridge.train.train \
--mgf path/to/your/data.mgf \
--dreams-ckpt path/to/ssl_model.ckpt \
--fold train \
--batch-size 128 \
--epochs 2 \
--cond-dim 2048 \
--mapper-hidden 2048 \
--no-gaussian \
--supcon-k 4 \
--w-con 0 \
--w-con-mapped 0 \
--w-map 5.0 \
--w-ortho 1e-3 \
--w-supcon 1.0 \
--supcon-temp 0.07 \
--log-every 50 \
--save-every 200 \
--outdir runs/your_experiment_name \
--mol-space chemberta \
--chemberta-model Derify/ChemBERTa_augmented_pubchem_13m \
--lr 1e-4 \
--n-blocks 8 \
--unfreeze-last 2 \
--unfreeze-after 0
After training, you can use eval_all.sh to automatically evaluate all checkpoints in a run directory:
# Edit eval_all.sh to configure:
# - RUN_DIR: path to your training run directory
# - MGF: path to your MGF dataset
# - CANDS: path to candidate file matching your dataset
# - FOLD: evaluation fold (train/val/test)
# - Other parameters (batch size, embedding space, etc.)
# Run evaluation
./eval_all.sh
The script will:
- Loop through all checkpoints (ckpt_*.pt) in the run directory
- Evaluate each checkpoint on the specified dataset
- Generate a summary CSV file (eval_summary_${FOLD}_all.csv) with metrics (R@1, R@5, R@20, MRR, median_rank)
- Skip checkpoints that have already been evaluated
- Display the top 5 checkpoints by R@5
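For readers unfamiliar with the metrics in the summary, the sketch below shows their standard definitions, computed from the 1-based rank of the true molecule within each query's candidate list. It is illustrative only: `retrieval_metrics` is a hypothetical function, and SpecBridge's own evaluation code may differ in details such as tie handling.

```python
from statistics import median


def retrieval_metrics(ranks, ks=(1, 5, 20)):
    """Summarize 1-based ranks of the true molecule among its candidates."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    n = len(ranks)
    # R@k: fraction of queries whose true molecule is ranked within the top k.
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / n for k in ks}
    # MRR: mean of the reciprocal ranks.
    metrics["MRR"] = sum(1.0 / r for r in ranks) / n
    metrics["median_rank"] = median(ranks)
    return metrics
```

For example, `retrieval_metrics([1, 3, 10, 2])` gives R@1 = 0.25, R@5 = 0.75, R@20 = 1.0, and a median rank of 2.5.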
To evaluate a single checkpoint on a dataset with candidates:
python -m specbridge.eval.candidates \
--mgf path/to/your/data.mgf \
--dreams-ckpt path/to/ssl_model.ckpt \
--adapter-ckpt path/to/adapter/ckpt.pt \
--candidates path/to/candidates.pkl \
--fold-query test \
--use-mapped \
--deterministic-map \
--no-gaussian \
--batch-size 32 \
--cond-dim 2048 \
--mapper-hidden 2048 \
--mol-space chemberta \
--chemberta-model Derify/ChemBERTa_augmented_pubchem_13m

- specbridge/ - Core package code
  - adapters/ - DreaMS adapter implementation
  - data/ - Data loading and processing
  - eval/ - Evaluation scripts
  - models/ - Model definitions
  - losses/ - Loss functions
  - train/ - Training scripts
    - train.py - Main training script
  - utils/ - Utility functions
- DreaMS/ - DreaMS dependency
Pre-trained SpecBridge adapter checkpoints, datasets, and candidate files are available on Zenodo:
📦 Download from Zenodo | DOI: 10.5281/zenodo.18357418
Model weights are also available on Hugging Face (weights only; datasets and candidates are on Zenodo).
The Zenodo dataset includes:
- MSGYM: SpecBridge_MSGYM_checkpoint.pt (step 1200, R@5=0.91528, MRR=0.87518)
- MSnLib: SpecBridge_MSnLib_checkpoint.pt (step 26000, R@5=0.59368, MRR=0.56004)
- Spectraverse: SpecBridge_Spectraverse_checkpoint.pt (step 20600, R@5=0.43532, MRR=0.39055)
- SpecBridge_MassSpecGym_dataset.mgf - MassSpecGym dataset
- SpecBridge_MSnLib_dataset.mgf - MSnLib dataset with train/val/test folds
- SpecBridge_Spectraverse_dataset.mgf - Spectraverse dataset
- SpecBridge_MSGYM_candidates.pkl - MSGYM candidate dictionary (SMILES format)
- SpecBridge_MSnLib_candidates.pkl - MSnLib candidate dictionary
- SpecBridge_Spectraverse_candidates.pkl - Spectraverse candidate dictionary
Note: Files are restricted access. Please request access through the Zenodo record page if needed.
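Because the training and evaluation commands select spectra by fold, it can help to check how many spectra each fold holds before a run. The sketch below counts fold labels using only the standard MGF block structure (`BEGIN IONS` / `KEY=VALUE` / `END IONS`). The field name `FOLD` is an assumption, not confirmed by this README; inspect your MGF headers and pass the actual key if it differs.

```python
from collections import Counter


def count_folds(mgf_lines, fold_key="FOLD"):
    """Count spectra per fold value in MGF text.

    fold_key is an assumed header name; adjust it to match your files.
    """
    counts = Counter()
    in_spectrum = False
    fold = None
    for raw in mgf_lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            in_spectrum, fold = True, None
        elif line == "END IONS":
            if in_spectrum:
                counts[fold if fold is not None else "<missing>"] += 1
            in_spectrum = False
        elif in_spectrum and "=" in line:
            # Header lines are KEY=VALUE; peak lines contain no "=".
            key, value = line.split("=", 1)
            if key.strip().upper() == fold_key.upper():
                fold = value.strip()
    return counts
```

Usage: open the file and pass the handle, e.g. `with open("data/SpecBridge_MSnLib_dataset.mgf") as fh: print(count_folds(fh))`.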
Download Script: Use ./download_from_zenodo.sh to automatically download all files to the correct local directories.
See pyproject.toml for dependencies. Main requirements:
- Python >= 3.10
- PyTorch >= 2.1
- NumPy >= 1.24
- pyteomics >= 4.7.5
Optional dependencies:
- rdkit-pypi for molecular processing
- wandb for experiment tracking
All pre-trained checkpoints, datasets, and candidate files are available on Zenodo:
📦 Zenodo Dataset | DOI: 10.5281/zenodo.18357418
Quick Download: Use the provided script to download all files:
./download_from_zenodo.sh
The script will automatically:
- Download checkpoints to runs/specbridge_align_chemberta_pub_v3g_*/ckpt_*.pt
- Download datasets to data/*.mgf
- Download candidates to data/*.pkl
- Create necessary directories
- Skip files that already exist
After downloading, files will be organized as:
- Checkpoints:
  - runs/msgym/SpecBridge_MSGYM_checkpoint.pt
  - runs/msnlib/SpecBridge_MSnLib_checkpoint.pt
  - runs/spectraverse/SpecBridge_Spectraverse_checkpoint.pt
- Datasets:
  - data/SpecBridge_MassSpecGym_dataset.mgf
  - data/SpecBridge_MSnLib_dataset.mgf
  - data/SpecBridge_Spectraverse_dataset.mgf
- Candidates:
  - data/SpecBridge_MSGYM_candidates.pkl
  - data/SpecBridge_MSnLib_candidates.pkl
  - data/SpecBridge_Spectraverse_candidates.pkl
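To confirm that a download produced this layout, a short check along the following lines may be useful. It is a sketch only: `missing_files` is a hypothetical helper, and the paths are copied from the listing above, relative to the repository root.

```python
from pathlib import Path

# Expected files after download, relative to the repository root.
EXPECTED_FILES = [
    "runs/msgym/SpecBridge_MSGYM_checkpoint.pt",
    "runs/msnlib/SpecBridge_MSnLib_checkpoint.pt",
    "runs/spectraverse/SpecBridge_Spectraverse_checkpoint.pt",
    "data/SpecBridge_MassSpecGym_dataset.mgf",
    "data/SpecBridge_MSnLib_dataset.mgf",
    "data/SpecBridge_Spectraverse_dataset.mgf",
    "data/SpecBridge_MSGYM_candidates.pkl",
    "data/SpecBridge_MSnLib_candidates.pkl",
    "data/SpecBridge_Spectraverse_candidates.pkl",
]


def missing_files(root="."):
    """Return the expected files that are not present under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list from `missing_files()` means every checkpoint, dataset, and candidate file is in place.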
If you use SpecBridge in your research, please cite:
@misc{wang2026specbridgebridgingmassspectrometry,
title={SpecBridge: Bridging Mass Spectrometry and Molecular Representations via Cross-Modal Alignment},
author={Yinkai Wang and Yan Zhou Chen and Xiaohui Chen and Li-Ping Liu and Soha Hassoun},
year={2026},
eprint={2601.17204},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.17204},
}