otk (ecDNA Analysis Toolkit) is a deep learning-based tool for analyzing extrachromosomal DNA (ecDNA). At the gene level, it predicts whether each gene is an ecDNA cargo gene; at the sample level, it classifies the focal amplification type.
- Deep learning-based ecDNA cargo gene prediction
- Sample-level focal amplification type classification
- Support for analysis from BAM files or processed copy number data
- Efficient command-line interface
- GPU acceleration support
- Python 3.8+
- PyTorch 2.0+
- NumPy
- Pandas
- scikit-learn
- Click (command-line interface)
- Clone the repository:
git clone https://github.com/WangLabCSU/otk.git
cd otk
- Install with pip:
pip install -e .
The following dependencies will be installed automatically:
- pandas>=2.0
- numpy>=1.24
- torch>=2.0
- scikit-learn>=1.3
- tqdm>=4.65
- click>=8.1
- matplotlib>=3.7
- seaborn>=0.12
- pyyaml>=6.0
otk provides two main command-line subcommands: train and predict.
Use the otk train command to train the model:
otk train --config configs/model_config.yml --output models/ --gpu 0
Parameters:
- --config, -c: Path to configuration file (default: configs/model_config.yml)
- --output, -o: Output directory for trained models (default: models/)
- --gpu, -g: GPU device ID to use (default: 0)
Use the otk predict command for predictions:
otk predict --model models/best_model.pth --input data/test_data.csv --output predictions/ --gpu -1
Parameters:
- --model, -m: Path to trained model (required)
- --input, -i: Path to input data file (required)
- --output, -o: Output directory for predictions (default: predictions/)
- --gpu, -g: GPU device ID to use (default: -1, i.e., use CPU)
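When otk predict is one step in a larger Python pipeline, the command can be assembled programmatically. The following is a minimal sketch that uses only the flags documented above; the helper names `build_predict_command` and `run_predict` are hypothetical and are not part of otk.

```python
import shutil
import subprocess


def build_predict_command(model, input_path, output="predictions/", gpu=-1):
    """Assemble the argument list for `otk predict` from the documented flags."""
    return [
        "otk", "predict",
        "--model", str(model),
        "--input", str(input_path),
        "--output", str(output),
        "--gpu", str(gpu),
    ]


def run_predict(model, input_path, output="predictions/", gpu=-1):
    """Run `otk predict` as a subprocess; fails early if otk is not on PATH."""
    if shutil.which("otk") is None:
        raise RuntimeError("the otk executable was not found on PATH")
    command = build_predict_command(model, input_path, output, gpu)
    return subprocess.run(command, check=True)


# Print the command without running it.
print(" ".join(build_predict_command("models/best_model.pth", "data/test_data.csv")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.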
Input data should be in CSV format with the following columns:
Required identifier columns:
- sample: Tumor sample ID
- gene_id: Gene ID
Copy number features:
- segVal: Total gene copy number
- minor_cn: Minor gene copy number
- intersect_ratio: Proportion of overlap between copy number detection segment and gene region
Sample-level genomic features (same value for all genes in a sample):
- purity: Tumor purity estimate
- ploidy: Tumor genome ploidy estimate
- AScore: Aneuploidy score
- pLOH: Proportion of genome with loss of heterozygosity (LOH)
- cna_burden: Proportion of genome with copy number alterations
Copy number signature features:
- CN1 to CN19: 19 copy number signature activity estimates
Clinical features:
- age: Patient age
- gender: Patient gender (0/1 encoded)
Tumor type features (one-hot encoded, 24 cancer types):
type_BLCA, type_BRCA, type_CESC, type_COAD, type_DLBC, type_ESCA, type_GBM, type_HNSC,
type_KICH, type_KIRC, type_KIRP, type_LGG, type_LIHC, type_LUAD, type_LUSC, type_OV,
type_PRAD, type_READ, type_SARC, type_SKCM, type_STAD, type_THCA, type_UCEC, type_UVM
Gene frequency features:
- freq_Linear: Prior estimated frequency of gene in linear focal amplifications
- freq_BFB: Prior estimated frequency of gene in breakage-fusion-bridge (BFB) events
- freq_Circular: Prior estimated frequency of gene in circular focal amplifications (ecDNA)
- freq_HR: Prior estimated frequency of gene in homologous recombination events
Target column (for training data):
y: Binary label indicating whether the gene is detected as an ecDNA cargo gene (1) or not (0)
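A quick header check before training or prediction can catch missing columns early. The following is a minimal sketch that builds the column list described above and reports which names are absent from a CSV header. The function names are hypothetical, the column order is illustrative, and the intermediate signature names (CN2, CN3, and so on) are assumed to follow the CN1 to CN19 pattern.

```python
import csv
import io

CANCER_TYPES = [
    "BLCA", "BRCA", "CESC", "COAD", "DLBC", "ESCA", "GBM", "HNSC",
    "KICH", "KIRC", "KIRP", "LGG", "LIHC", "LUAD", "LUSC", "OV",
    "PRAD", "READ", "SARC", "SKCM", "STAD", "THCA", "UCEC", "UVM",
]


def expected_columns(include_target=False):
    """Build the documented column list (the order here is illustrative)."""
    columns = ["sample", "gene_id"]
    columns += ["segVal", "minor_cn", "intersect_ratio"]
    columns += ["purity", "ploidy", "AScore", "pLOH", "cna_burden"]
    columns += [f"CN{i}" for i in range(1, 20)]  # assumed names CN1 .. CN19
    columns += ["age", "gender"]
    columns += [f"type_{name}" for name in CANCER_TYPES]
    columns += ["freq_Linear", "freq_BFB", "freq_Circular", "freq_HR"]
    if include_target:
        columns.append("y")  # only present in training data
    return columns


def missing_columns(csv_text, include_target=False):
    """Return the expected columns that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = set(header)
    return [name for name in expected_columns(include_target) if name not in present]


demo = "sample,gene_id,segVal\nS1,G1,4\n"
print(len(expected_columns()), "identifier and feature columns expected")
print("first missing columns:", missing_columns(demo)[:3])
```

For a file on disk, the same check applies to the first line read from the file.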
Prediction results are saved as a CSV file with the following columns:
Gene-level predictions:
- sample: Tumor sample ID
- gene_id: Gene ID
- prediction_prob: Probability of being an ecDNA cargo gene (0-1)
- prediction: Binary classification result (0 = not ecDNA cargo, 1 = ecDNA cargo)
Sample-level predictions:
- sample_level_prediction_label: Sample-level focal amplification type classification:
  - nofocal: No focal amplification detected
  - noncircular: Non-circular focal amplification detected
  - circular: Circular focal amplification (ecDNA) detected
- sample_level_prediction: Numerical encoding of sample-level classification (0 = nofocal, 1 = noncircular, 2 = circular)
Note: Sample-level classification follows these rules:
- If any gene in the sample is predicted as ecDNA cargo (prediction = 1), the sample is classified as circular
- If no gene is predicted as ecDNA cargo but any gene has segVal > ploidy + 2, the sample is classified as noncircular
- Otherwise, the sample is classified as nofocal
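The three sample-level rules can be written as a short function. This is a sketch of the logic as stated in this README, not otk's own implementation; the function name `classify_sample` and the list-of-dicts input are assumptions made for illustration.

```python
# Numerical encoding documented for sample_level_prediction.
LABEL_CODES = {"nofocal": 0, "noncircular": 1, "circular": 2}


def classify_sample(genes):
    """Apply the sample-level rules to the genes of one sample.

    `genes` is a list of dicts with the keys "prediction", "segVal" and "ploidy".
    Returns the label and its numerical code.
    """
    if any(gene["prediction"] == 1 for gene in genes):
        label = "circular"  # at least one predicted ecDNA cargo gene
    elif any(gene["segVal"] > gene["ploidy"] + 2 for gene in genes):
        label = "noncircular"  # amplified, but no predicted cargo gene
    else:
        label = "nofocal"
    return label, LABEL_CODES[label]


sample_genes = [
    {"prediction": 0, "segVal": 2.0, "ploidy": 2.1},
    {"prediction": 0, "segVal": 6.5, "ploidy": 2.1},
]
print(classify_sample(sample_genes))
```

The comparison is strict, so a gene with segVal exactly equal to ploidy + 2 does not trigger the noncircular label.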
otk supports multiple model architectures through a unified interface:
| Model | Type | Description |
|---|---|---|
| xgb_new | XGBoost | Optimized with feature engineering |
| xgb_paper | XGBoost | Paper reproduction (11 features) |
| baseline_mlp | Neural Network | Simple MLP baseline |
| transformer | Neural Network | Transformer architecture |
| deep_residual | Neural Network | Deep residual network |
| optimized_residual | Neural Network | Optimized residual network |
| dgit_super | Neural Network | Deep gated interaction transformer |
| tabpfn | TabPFN | TabPFN ensemble |
All models inherit from BaseEcDNAModel and provide:
- fit(X_train, y_train, X_val, y_val) - Training
- predict_proba(X) - Probability prediction
- predict(X) - Binary prediction
- save(path) / load(path) - Persistence
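To illustrate what such a shared interface looks like, here is a minimal sketch of an abstract base class with the same five methods, plus a toy subclass. Both classes are hypothetical stand-ins: the real BaseEcDNAModel is not shown in this README, and the 0.5 threshold in `predict` and the JSON persistence are assumptions.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class EcDNAModelSketch(ABC):
    """Hypothetical stand-in for BaseEcDNAModel; not the project's actual class."""

    @abstractmethod
    def fit(self, X_train, y_train, X_val, y_val):
        """Train on the training data, with validation data available for monitoring."""

    @abstractmethod
    def predict_proba(self, X):
        """Return one probability per input row."""

    def predict(self, X, threshold=0.5):
        """Binarize probabilities; the 0.5 cutoff is an assumption."""
        return [1 if p >= threshold else 0 for p in self.predict_proba(X)]

    @abstractmethod
    def save(self, path):
        """Write the fitted state to disk."""

    @classmethod
    @abstractmethod
    def load(cls, path):
        """Rebuild a model from a file written by save()."""


class ConstantRateModel(EcDNAModelSketch):
    """Toy model: predicts the training positive rate for every row."""

    def __init__(self, rate=0.0):
        self.rate = rate

    def fit(self, X_train, y_train, X_val, y_val):
        self.rate = sum(y_train) / len(y_train)
        return self

    def predict_proba(self, X):
        return [self.rate for _ in X]

    def save(self, path):
        with open(path, "w") as handle:
            json.dump({"rate": self.rate}, handle)

    @classmethod
    def load(cls, path):
        with open(path) as handle:
            return cls(**json.load(handle))


model = ConstantRateModel().fit([[1.0], [2.0], [3.0], [4.0]], [1, 0, 0, 0], [[5.0]], [0])
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.json")
    model.save(path)
    restored = ConstantRateModel.load(path)
print(restored.predict_proba([[9.0]]), restored.predict([[9.0]]))
```

Because every model exposes the same methods, evaluation code can loop over models without knowing which architecture it is handling.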
All models use the same 80/10/10 data split with seed=2026 for reproducibility.
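A fixed seed makes the split identical across runs and across models. The sketch below shows one way to produce a reproducible 80/10/10 partition of row indices. It assumes the three parts are train, validation, and test, and that rows are split independently; this README does not say whether otk splits by row or by sample, and `split_indices` is a hypothetical name.

```python
import random


def split_indices(n_rows, seed=2026, percents=(80, 10, 10)):
    """Shuffle row indices with a fixed seed and cut them into three parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local generator, global state untouched
    n_train = n_rows * percents[0] // 100
    n_val = n_rows * percents[1] // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the last part
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

If genes from one sample should never appear on both sides of the split, the same function can be applied to a list of sample IDs instead of row indices.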
Use the unified training script:
# Train single model
python train_unified.py --model xgb_new
# Train all models
python train_unified.py --all
Model configuration uses YAML format, with example configuration files located in configs/. You can modify parameters in the configuration files as needed, such as model architecture and training parameters.
# Train model with default configuration
otk train
# Train model with custom configuration file
otk train --config my_config.yml
# Make predictions using a trained model
otk predict --model models/best_model.pth --input test_data.csv
The following performance metrics are recorded during model training:
- auPRC (Area under Precision-Recall Curve)
- AUC (Area under ROC Curve)
- F1 Score
- Precision
- Recall
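For readers unfamiliar with these metrics, the sketch below computes them in plain Python on a small made-up example. It illustrates the standard definitions and is not the code otk uses; scikit-learn, which is among the dependencies, provides equivalents such as `average_precision_score` and `roc_auc_score`. The auPRC here is the step-wise average precision and assumes no tied scores.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive (no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / sum(y_true)


# Made-up labels and scores, for illustration only.
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("precision, recall, F1:", precision_recall_f1(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
print("auPRC:", average_precision(y_true, scores))
```

auPRC is usually the more informative summary when positives are rare, since precision reacts directly to false positives.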
We welcome community contributions! If you have any questions or suggestions, please submit them through GitHub Issues.
- Fork the repository
- Create a feature branch
- Implement features or fix bugs
- Run tests
- Submit a Pull Request
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you use otk in your research, please cite the following paper:
Wang, S., Wu, C. Y., He, M. M., Yong, J. X., Chen, Y. X., Qian, L. M., ... & Zhao, Q. (2024). Machine learning-based extrachromosomal DNA identification in large-scale cohorts reveals its clinical implications in cancer. Nature Communications, 15(1), 1-17.
- Project homepage: https://github.com/WangLabCSU/otk
- Email: wangshx@csu.edu.cn