kTYPr is a command-line tool for predicting K-antigen type classifications in Escherichia coli using HMM-based annotation profiling. It supports both whole-genome and flanking-region prediction modes and can be run in parallel on multiple files.
- kTYPr
- Installation
- Usage
- Options
- Example
- Output
- Add custom K-types
- kTYPr as Python package
- BLAST-based approach
- Citation
- Need Help?
If you already have Python and pip set up:
git clone https://github.com/SushiLab/kTYPr.git
cd kTYPr
pip install .

This will install ktypr and its dependencies.
To install in editable/development mode (so changes to the code apply immediately):
pip install -e .

Note that this installation can be done within an existing conda environment (slower, can take minutes) or a venv environment (faster, takes seconds).
If you use Miniconda or Anaconda:
git clone https://github.com/SushiLab/kTYPr.git
cd kTYPr
conda env create -f environment.yml
conda activate ktypr_env
pip install .

This will create and activate a new environment named ktypr_env with all dependencies and install ktypr into it.
After installation, you can run the tool with:
ktypr --help
ktypr -i <input_path>

`-i`, `--input`: Path to a single file (e.g., `.faa`, `.fna`, `.gbk`), a directory containing multiple annotation files, or a `.txt` file listing full paths (one per line) to the genome/annotation files.
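When the genomes are spread across several folders, the `.txt` input can be generated rather than written by hand. The following is a minimal sketch and not part of kTYPr: the helper name `write_genome_list`, the set of extensions, and the default file name are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumed set of extensions; adjust it to the files you actually have.
GENOME_EXTENSIONS = {".faa", ".fna", ".gbk", ".fasta", ".fa"}


def write_genome_list(directories, list_path="genome_list.txt"):
    """Write the full path of every genome file found, one per line."""
    paths = []
    for directory in directories:
        for entry in sorted(Path(directory).iterdir()):
            if entry.is_file() and entry.suffix.lower() in GENOME_EXTENSIONS:
                # resolve() yields the full path expected in the list file.
                paths.append(str(entry.resolve()))
    Path(list_path).write_text("\n".join(paths) + ("\n" if paths else ""))
    return paths
```

The resulting file can then be passed to the tool with `ktypr -i genome_list.txt`.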
| Option | Description |
|---|---|
| `-i`, `--input` | Input path, which can be:<br>• A genome file (`.fasta`, `.gbk`, etc.)<br>• An annotation file (`.faa`)<br>• A directory containing such files<br>• A `.txt` file listing paths to input files.<br>This is a required argument. |
| `-o`, `--output` | Output directory where all results will be saved. Defaults to `./ktypr_results`. The directory will be created if it does not exist. |
| `-m`, `--mode` | Prediction mode:<br>• `0` (default): Flanking mode. Only genes located upstream and downstream of the kpsC gene, within a specified flanking window, are used for prediction.<br>• `1`: Whole genome mode. All annotated genes in the genome are considered. |
| `-f`, `--flank` | Flanking window size in base pairs around the kpsC gene to evaluate (default: 30000 bp up- and downstream). This option is ignored when using whole genome mode (`-m 1`). |
| `-n`, `--n-jobs` | Number of parallel jobs to run:<br>• `-1` (default): use all available CPU cores.<br>• `1`: run sequentially without parallelism.<br>• Any other positive integer specifies the exact number of CPU cores to use. |
| `-p`, `--prefix` | Optional prefix to prepend to all output file names. If not provided, the base name of each input genome or annotation file will be used as the prefix. |
| `-s`, `--short` | Flag to use metagenomic mode in Prodigal gene calling for all input sequences. Recommended when short sequences are provided. |
| `-r`, `--reannotate` | Flag to force re-annotation of genes using Prodigal, even if annotations are already present in the genome file. Useful to ensure consistent annotations when needed. |
| `-c`, `--clinker` | Flag to produce clinker reports. Only available in flanking mode when a K-type is assigned. This can be computationally expensive, so it does not run by default. |
| `-v`, `--verbose` | Enable verbose mode for detailed logging and debugging information during the run. |
| `-ko`, `--keep_output` | Keep intermediate output files:<br>• `0`: do not keep intermediate files (recommended for large collections).<br>• `1` (default): keep intermediate files. |
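For scripted runs, the options above can be assembled into an argument list before calling the tool. This is a hedged sketch: `build_ktypr_command` is a hypothetical helper, not something kTYPr provides, and it only uses flags listed in the table. The checks mirror the documented constraints (clinker reports only in flanking mode, flank size ignored in whole genome mode).

```python
import shlex


def build_ktypr_command(input_path, output_dir="./ktypr_results", mode=0,
                        flank=30000, n_jobs=-1, prefix=None, clinker=False,
                        verbose=False):
    """Assemble a ktypr argument list from the documented options."""
    if mode not in (0, 1):
        raise ValueError("mode must be 0 (flanking) or 1 (whole genome)")
    if n_jobs != -1 and n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    if clinker and mode == 1:
        raise ValueError("clinker reports are only available in flanking mode")

    command = ["ktypr", "-i", str(input_path), "-o", str(output_dir),
               "-m", str(mode), "-n", str(n_jobs)]
    if mode == 0:
        # The flanking window is ignored in whole genome mode, so pass it only here.
        command += ["-f", str(flank)]
    if prefix is not None:
        command += ["-p", prefix]
    if clinker:
        command.append("-c")
    if verbose:
        command.append("-v")
    return command


command = build_ktypr_command("genome_list.txt", mode=1, prefix="ecoli_run")
print(shlex.join(command))
```

The returned list can be handed to `subprocess.run(command, check=True)` once ktypr is installed and on the `PATH`.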
Please note the following aspects:
- kTYPr accepts annotated genomes (as `.faa` or `.gbk`); in this case, gene calling will not be run. Use `-r` to force reannotation (only if a GenBank file is provided).
- We use pyrodigal in `single` mode for gene annotation by default. This can raise errors when the sequence is not long enough to train a gene caller specific to the input genome (min. length is 20,000 bases). In these cases, `meta` mode is activated automatically; take into account that this can produce less accurate gene sequences (shorter or longer than they should be).
  - For consistent gene calling, you can set `-s` to treat all provided sequences as short sequences and call genes with pyrodigal metagenomic mode.
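To check in advance which inputs are likely to fall back to `meta` mode, the sequence length can be compared against the 20,000-base minimum mentioned above. The sketch below is illustrative only: the function names are hypothetical, it uses a plain-text FASTA reader from the standard library, and it assumes the total length across all records is what matters, which may differ from how kTYPr applies the check internally.

```python
MIN_SINGLE_MODE_LENGTH = 20_000  # minimum length quoted above for single mode


def read_fasta_lengths(path):
    """Return {record_id: sequence_length} for a plain-text FASTA file."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Use the first word of the header as the record identifier.
                current = line[1:].split()[0] if len(line) > 1 else ""
                lengths[current] = 0
            elif line and current is not None:
                lengths[current] += len(line)
    return lengths


def suggest_gene_calling_mode(path):
    """Suggest 'single' when the input is long enough to train on, else 'meta'."""
    total = sum(read_fasta_lengths(path).values())
    return "single" if total >= MIN_SINGLE_MODE_LENGTH else "meta"
```

If a collection mixes inputs from both categories, running everything with `-s` keeps the gene calls comparable across genomes.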
Run K-type prediction in flanking mode on a folder containing .fa files using all available cores (-n -1), producing clinker reports (-c), saving results to a custom folder (-o results/), and printing the running processes in the terminal (-v):

ktypr -i ./test/genomes/fasta -o ./results/ -n -1 -c -v

Run in whole-genome mode (--mode 1) on a text file of paths, with a custom prefix to name all the files. Single-genome outputs are not kept (-ko 0), only the combined prediction files. Note that the custom flanking size (--flank) is ignored in whole-genome mode:

ktypr -i genome_list.txt --flank 25000 -p ecoli_run -ko 0 --mode 1

The expected runtime for a single genome prediction should be less than 30 seconds on a typical desktop computer.
For each input genome, kTYPr creates a folder containing the following files:
| Filename suffix | Description |
|---|---|
| `.gff` | Annotation file in GFF format. Contains coordinates and features of predicted genes. |
| `.faa` | Protein sequences of all annotated genes in FASTA format. |
| `_flanks.faa` | Protein sequences extracted from the flanking region around the kpsC gene (only in flanking mode). |
| `_hits.tsv.gz` | Compressed TSV file listing all detected HMM hits (annotations) with their scores and locations. |
| `_filtered_hits.tsv.gz` | Compressed TSV file with filtered HMM hits after applying score thresholds. |
| `_ktypr.tsv` | Summary TSV file containing the final K-antigen type prediction results for the genome or annotation set. |
| `.gbk` | Full genome file with annotations in GenBank format, optionally including re-annotation results. |
| `clink.html` | Clinker HTML report against the best K-antigen type predicted. |
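The compressed hit tables can be inspected without decompressing them on disk. The snippet below is a small sketch using only the standard library; `read_tsv_gz` is a hypothetical helper, and it assumes the files start with a header row, which is not stated above.

```python
import csv
import gzip


def read_tsv_gz(path):
    """Load a gzip-compressed TSV file as a list of dictionaries keyed by header."""
    # Text mode ("rt") lets csv read the decompressed stream line by line.
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `read_tsv_gz("my_genome_filtered_hits.tsv.gz")` would return one dictionary per hit, where the file name is a placeholder for one of your own outputs.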
If more than one genome is given, a summary table of all classifications is also created as `<prefix>results_ktypr.tsv` in the selected output directory. For each genome or annotation set, it reports the final predicted K-antigen type (K-type) together with detailed match statistics for conserved regions and all candidate K-types.
Column descriptions:
| Column name | Description |
|---|---|
| `genome_id` | Genome identifier (extracted from the basename of the genome file). |
| `predicted` | Best-matching K-type predicted for the genome based on gene content and HMM match scores. |
| `pred_nr_genes` | Number of genes expected for the predicted K-type (based on its reference profile). |
| `pred_genes_in_genome` | Number of those expected genes actually found in the genome. |
| `pred_acc_bitscore` | Sum of HMM bitscores across all hits matching the predicted K-type. A proxy for overall match strength. |
| `pred_frac_bitscore` | `pred_acc_bitscore` normalized by the maximum accumulated bitscore for that K-type in RefSeq profiled genomes. |
| `pred_is_complete` | Indicates if all expected genes for the predicted K-type were found in the genome (1 = complete, 0 = incomplete). |
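These columns make it straightforward to keep only the genomes whose predicted K-type is complete. The following sketch relies on the column names in the table above; the helper name `complete_predictions` is hypothetical, and the accepted spellings of the completeness flag are an assumption about how the value is written to the file.

```python
import csv


def complete_predictions(summary_path):
    """Return {genome_id: predicted} for rows whose predicted K-type is complete."""
    complete = {}
    with open(summary_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Accept a few spellings of the flag, since the exact format is assumed.
            if row["pred_is_complete"].strip() in {"1", "1.0", "True"}:
                complete[row["genome_id"]] = row["predicted"]
    return complete
```

Calling it on the summary file of a run returns a plain dictionary that can be merged with other per-genome metadata.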
These are followed by conserved K-locus genes shared across many K-types, such as those in the KpsEDCSMT and KpsFU operons:
| Column name | Description |
|---|---|
| `KpsEDCSMT_nr_genes` | Number of conserved EDCSMT genes expected. |
| `KpsEDCSMT_genes_in_genome` | Number of EDCSMT genes found in the genome. |
| `KpsEDCSMT_acc_bitscore` | Cumulative bitscore for the EDCSMT gene matches. |
| `KpsEDCSMT_frac_bitscore` | `KpsEDCSMT_acc_bitscore` normalized by the maximum KpsEDCSMT bitscore in RefSeq profiled genomes. |
| `KpsEDCSMT_is_complete` | Whether all EDCSMT genes were found (1) or not (0). |
| `KpsFU_nr_genes`, ... | Same structure as above, but for the conserved FU gene set. |
For further exploration, the following columns provide detailed information for each known K-type in our database:
| Column suffix | Description |
|---|---|
| `<KTYPE>_nr_genes` | Number of genes expected for this K-type. |
| `<KTYPE>_genes_in_genome` | Number of expected genes found in the genome. |
| `<KTYPE>_acc_bitscore` | Sum of HMM match bitscores for this K-type's genes. |
| `<KTYPE>_frac_bitscore` | `<KTYPE>_acc_bitscore` normalized by the maximum accumulated bitscore for that K-type in RefSeq profiled genomes. |
| `<KTYPE>_is_complete` | Whether the K-type is fully present in the genome (1) or not (0). |
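The per-K-type columns can be used to look at runner-up candidates for a genome, not only the best prediction. Below is a hedged sketch that ranks K-types in one summary row by their `<KTYPE>_frac_bitscore` value. The function name `rank_candidates` is hypothetical, the row is assumed to be a dictionary such as those produced by `csv.DictReader`, and the values are assumed to be numeric or empty.

```python
# Prefixes that share the column suffix but are not candidate K-types.
RESERVED_PREFIXES = ("pred", "KpsEDCSMT", "KpsFU")
SUFFIX = "_frac_bitscore"


def rank_candidates(row, top=3):
    """Rank K-types in one summary row by their <KTYPE>_frac_bitscore value."""
    scores = []
    for column, value in row.items():
        if not column.endswith(SUFFIX):
            continue
        ktype = column[: -len(SUFFIX)]
        if ktype in RESERVED_PREFIXES or value in ("", None):
            continue
        scores.append((ktype, float(value)))
    # Highest normalized bitscore first.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top]
```

A close second candidate can be a useful signal that a prediction deserves manual inspection.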
As an example of the collection output, test/output contains the results of running:

ktypr -i ./test/genomes/fasta -v -o ./test/output/ -p run1_

To add new custom K-types, users can:
- Download this repository.
- Produce HMMs, as done in the main publication, for each of the specific genes to detect and copy them into data/hmms.
- Add a new line to data/ktypr_definitions_v20250512.tsv with the K-type name followed by the names of the genes (corresponding to HMM files) that compose it.
- Add the cut-offs per HMM in data/hmm_cutoffs_v20250704.tsv. kTYPr employs curated internal thresholds to define a hit; if these are not known, you can set a cut-off of 0, but this can increase false positive hits.
- Add the maximum expected bitscore for the new K-type in data/max_bitscores_v20260107.tsv, which is used for the normalized bitscore in the output. If not known, simply include 1 as the maximum (this will make `<KTYPE>_frac_bitscore` match `<KTYPE>_acc_bitscore`).
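The effect of the maximum bitscore on the normalized column can be illustrated with a short calculation. This is a sketch of the normalization as described above (accumulated bitscore divided by the maximum), not kTYPr's own code; the function name and the numbers are illustrative.

```python
def frac_bitscore(acc_bitscore, max_bitscore=1.0):
    """Normalize an accumulated bitscore by the maximum recorded for the K-type."""
    if max_bitscore <= 0:
        raise ValueError("max_bitscore must be positive")
    return acc_bitscore / max_bitscore


print(frac_bitscore(850.0, 1000.0))  # 0.85
print(frac_bitscore(850.0))          # 850.0, identical to the accumulated bitscore
```

With a maximum of 1, the fraction is no longer bounded, so values for a custom K-type are not directly comparable with those of K-types that have a curated maximum.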
Once installed, you can use kTYPr within your Python environment as:

from ktypr import ktypr
# To show the arguments required:
help(ktypr)

The arguments described above are shared by this function. For example, to call it on a genome, directory, or list of files:

ktypr(<genome_path>, <output_directory>)

The collection of best genomes associated with a specific K-type can be used as a database for Kaptive, as done in EC-K-typing by Gladstone et al.
For this, you can simply download this multi-genbank and use it as:

kaptive assembly kTYPrDB_for_Kaptive.gbk <your_genomes> -o output.tsv

If you use kTYPr in your research, please cite:
In the terminal:

ktypr --help

In Python:

from ktypr import ktypr
help(ktypr)

You can always open an issue in this repo or contact the maintainer at smiravet@ethz.ch