🧬 kTYPr

kTYPr is a command-line tool for predicting K-antigen type classifications in Escherichia coli using HMM-based annotation profiling. It supports both whole-genome and flanking-region prediction modes and can be run in parallel on multiple files.

📑 Index

🧬 kTYPr
📦 Installation
- Option 1 – Install via pip
- Option 2 – Use a Conda environment
🚀 Usage
- Basic command
- Required arguments
🔧 Options
- Genome annotation considerations
🧪 Example
🗃 Output
- Collection results
➕ Add custom K-types
🐍 kTYPr as Python package
🔎 BLAST-based approach
📖 Citation
❓ Need Help?

📦 Installation

Option 1 – Install via `pip` (recommended for most users)

If you already have Python and pip set up:

git clone https://github.com/SushiLab/kTYPr.git
cd kTYPr
pip install .

This will install ktypr and its dependencies.

To install in editable/development mode (so changes to the code apply immediately):

pip install -e .

Note that this installation can be done inside an existing conda environment (slower, can take minutes) or a venv environment (faster, takes seconds).

Option 2 – Use a Conda environment (safer, isolated setup)

If you use Miniconda or Anaconda:

git clone https://github.com/SushiLab/kTYPr.git
cd kTYPr
conda env create -f environment.yml
conda activate ktypr_env
pip install .

This creates and activates a new environment named ktypr_env with all dependencies and installs ktypr into it. To install in editable mode instead, use pip install -e . as the last step.

🚀 Usage

After installation, you can run the tool with:

ktypr --help

Basic command

ktypr -i <input_path>

Required arguments

-i, --input: Path to a single file (e.g., .faa, .fna, .gbk), a directory containing multiple annotation files, or a .txt file listing full paths (one per line) to the genome/annotation files.
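
A list file for `-i` can be generated rather than typed by hand. The following is a minimal sketch, not part of kTYPr: the helper name `write_genome_list` is hypothetical, and the extension set is an assumption collected from the file types mentioned in this README.

```python
from pathlib import Path

# Extensions mentioned in this README; adjust to your own files (assumption).
GENOME_EXTENSIONS = {".fa", ".fasta", ".fna", ".faa", ".gbk"}


def write_genome_list(genome_dir, list_path, extensions=GENOME_EXTENSIONS):
    """Write one absolute path per line for every matching file in genome_dir."""
    paths = sorted(
        p.resolve()
        for p in Path(genome_dir).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    # One full path per line, as required for a .txt input.
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

The resulting file can then be passed as `ktypr -i genome_list.txt`.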

🔧 Options

Option	Description
`-i`, `--input`	Input path, which can be: • A genome file (`.fasta`, `.gbk`, etc.) • An annotation file (`.faa`) • A directory containing such files • A `.txt` file listing paths to input files. This is a required argument.
`-o`, `--output`	Output directory where all results will be saved. Defaults to `./ktypr_results`. The directory will be created if it does not exist.
`-m`, `--mode`	Prediction mode: • `0` (default): Flanking mode — only genes located upstream and downstream of the kpsC gene, within a specified flanking window, are used for prediction. • `1`: Whole genome mode — all annotated genes in the genome are considered.
`-f`, `--flank`	Flanking window size in base pairs around the kpsC gene to evaluate (default: `30000` bp up- and downstream). This option is ignored when using whole genome mode (`-m 1`).
`-n`, `--n-jobs`	Number of parallel jobs to run: • `-1` (default): use all available CPU cores. • `1`: run sequentially without parallelism. • Any other positive integer specifies the exact number of CPU cores to use.
`-p`, `--prefix`	Optional prefix to prepend to all output file names. If not provided, the base name of each input genome or annotation file will be used as the prefix.
`-s`, `--short`	Flag to run Prodigal gene calling in metagenomic mode on all input sequences. Recommended when short sequences are provided.
`-r`, `--reannotate`	Flag to force re-annotation of genes using Prodigal, even if annotations are already present in the genome file. Useful to ensure consistent annotations when needed.
`-c`, `--clinker`	Flag to produce clinker reports. Only available in flanking mode, and only when a K-type is assigned. This can be computationally expensive, so it does not run by default.
`-v`, `--verbose`	Enable verbose mode for detailed logging and debugging information during the run.
`-ko`, `--keep_output`	Keep intermediate output files: `0` = do not keep intermediate files (recommended for large collections); `1` = keep intermediate files (default).
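
When kTYPr is driven from a script, it can help to assemble the command from the options above in one place. This is a sketch under stated assumptions: `build_ktypr_command` is a hypothetical helper, not part of kTYPr, and it only uses flags listed in the table. It omits `-f` in whole-genome mode because the table says the flank is ignored there.

```python
import shlex


def build_ktypr_command(input_path, output_dir=None, mode=0, flank=None,
                        n_jobs=None, prefix=None, clinker=False, verbose=False):
    """Build an argv list for the ktypr CLI using only flags from the table above."""
    if mode not in (0, 1):
        raise ValueError("mode must be 0 (flanking) or 1 (whole genome)")
    if clinker and mode == 1:
        raise ValueError("clinker reports are only available in flanking mode")

    cmd = ["ktypr", "-i", str(input_path), "-m", str(mode)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if flank is not None and mode == 0:
        # -f is ignored in whole-genome mode, so it is only passed in flanking mode.
        cmd += ["-f", str(flank)]
    if n_jobs is not None:
        cmd += ["-n", str(n_jobs)]
    if prefix is not None:
        cmd += ["-p", str(prefix)]
    if clinker:
        cmd.append("-c")
    if verbose:
        cmd.append("-v")
    return cmd


# Print the command instead of running it.
print(shlex.join(build_ktypr_command("genome_list.txt", mode=1, prefix="ecoli_run")))
```

To execute the command, pass the returned list to `subprocess.run(cmd, check=True)`.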

Genome annotation considerations

Please note the following aspects:

- kTYPr accepts annotated genomes (as .faa or .gbk). In this case, gene calling is not run. Use -r to force re-annotation (only possible when a GenBank file is provided).
- By default, genes are called with pyrodigal in single mode. This can raise errors when the sequence is too short to train a gene caller specific to the input genome (the minimum length is 20,000 bases). In these cases, meta mode is activated automatically; note that this can produce less accurate gene sequences (shorter or longer than they should be).
  - For consistent gene calling, set -s to treat all provided sequences as short sequences and call genes with pyrodigal in metagenomic mode.
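
To see in advance which inputs are likely to fall back to meta mode, the sequence length can be checked before running kTYPr. The sketch below is not kTYPr code. It assumes a plain (uncompressed) nucleotide FASTA file and assumes the 20,000-base minimum applies to the total length of all records in the file, which this README does not state explicitly.

```python
MIN_TRAINING_LENGTH = 20_000  # minimum length quoted in the text above


def fasta_total_length(path):
    """Sum the sequence lengths of all records in a plain FASTA file."""
    total = 0
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):  # skip header lines
                total += len(line.strip())
    return total


def likely_needs_meta_mode(path, min_length=MIN_TRAINING_LENGTH):
    """Return True if the file is shorter than the single-mode training minimum."""
    return fasta_total_length(path) < min_length
```

If many files in a collection fall below the threshold, passing `-s` for the whole run keeps gene calling consistent across genomes.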

🧪 Example

Run K-type prediction in flanking mode on a folder of .fa files, using all available cores (-n -1), producing clinker reports (-c), saving results to a custom folder (-o ./results/), and logging progress to the terminal (-v):

ktypr -i ./test/genomes/fasta -o ./results/ -n -1 -c -v

Run in whole-genome mode (--mode 1) on a text file of paths, with a custom prefix for all output files (-p ecoli_run). Single-genome outputs are not kept, only the combined prediction files (-ko 0). Note that --flank is ignored in whole-genome mode, so the --flank 25000 value below has no effect:

ktypr -i genome_list.txt --flank 25000 -p ecoli_run -ko 0 --mode 1

A single genome prediction is expected to take less than 30 seconds on a typical desktop computer.

🗃 Output

For each input genome, kTYPr creates a dedicated folder containing the following files:

Filename suffix	Description
`.gff`	Annotation file in GFF format. Contains coordinates and features of predicted genes.
`.faa`	Protein sequences of all annotated genes in FASTA format.
`_flanks.faa`	Protein sequences extracted from the flanking region around the kpsC gene (only in flanking mode).
`_hits.tsv.gz`	Compressed TSV file listing all detected HMM hits (annotations) with their scores and locations.
`_filtered_hits.tsv.gz`	Compressed TSV file with filtered HMM hits after applying score thresholds.
`_ktypr.tsv`	Summary TSV file containing the final K-antigen type prediction results for the genome or annotation set.
`.gbk`	Full genome file with annotations in GenBank format, optionally including re-annotation results.
`clink.html`	Clinker HTML report against the best K-antigen type predicted.
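
The compressed hit tables can be inspected without decompressing them on disk. This is a generic sketch, not kTYPr code: `read_tsv_gz` is a hypothetical helper, and it assumes the first row of the file is a header. No column names are assumed, since they are not listed in this README.

```python
import csv
import gzip


def read_tsv_gz(path):
    """Return (header, rows) from a gzip-compressed, tab-separated file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file has no rows
        rows = list(reader)
    return header, rows
```

Printing `header` is a quick way to find out which columns a `_hits.tsv.gz` or `_filtered_hits.tsv.gz` file actually contains.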

Collection results

If more than one genome is given, a summary table of all classifications is also written to <prefix>results_ktypr.tsv in the selected output directory. It reports the K-antigen (K-type) prediction for each genome or annotation set, including the final predicted type and detailed match statistics for the conserved regions and all candidate K-types.

Column descriptions:

Column name	Description
`genome_id`	Genome identifier (extracted from the basename of the genome file).
`predicted`	Best-matching K-type predicted for the genome based on gene content and HMM match scores.
`pred_nr_genes`	Number of genes expected for the predicted K-type (based on its reference profile).
`pred_genes_in_genome`	Number of those expected genes actually found in the genome.
`pred_acc_bitscore`	Sum of HMM bitscores across all hits matching the predicted K-type. A proxy for overall match strength.
`pred_frac_bitscore`	`pred_acc_bitscore` normalized by the maximum accumulated bitscore for that K-type in RefSeq profiled genomes.
`pred_is_complete`	Indicates if all expected genes for the predicted K-type were found in the genome (`1` = complete, `0` = incomplete).

These are followed by conserved K-locus genes shared across many K-types, such as those in the KpsEDCSMT and KpsFU operons:

Column name	Description
`KpsEDCSMT_nr_genes`	Number of conserved EDCSMT genes expected.
`KpsEDCSMT_genes_in_genome`	Number of EDCSMT genes found in the genome.
`KpsEDCSMT_acc_bitscore`	Cumulative bitscore for the EDCSMT gene matches.
`KpsEDCSMT_frac_bitscore`	`KpsEDCSMT_acc_bitscore` normalized by the maximum KpsEDCSMT bitscore in RefSeq profiled genomes.
`KpsEDCSMT_is_complete`	Whether all EDCSMT genes were found (`1`) or not (`0`).
`KpsFU_nr_genes`, ...	Same structure as above, but for the conserved FU gene set.

For further exploration, the following columns provide detailed information for each known K-type in our database:

Column suffix	Description
`<KTYPE>_nr_genes`	Number of genes expected for this K-type.
`<KTYPE>_genes_in_genome`	Number of expected genes found in the genome.
`<KTYPE>_acc_bitscore`	Sum of HMM match bitscores for this K-type’s genes.
`<KTYPE>_frac_bitscore`	`<KTYPE>_acc_bitscore` normalized by the maximum accumulated bitscore for that K-type in RefSeq profiled genomes.
`<KTYPE>_is_complete`	Whether the K-type is fully present in the genome (`1`) or not (`0`).

As an example of the collection output, test/output contains the results of running:

ktypr -i ./test/genomes/fasta -v -o ./test/output/ -p run1_
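
The collection table can be filtered with a few lines of Python. The sketch below is illustrative only: it assumes the file is tab-separated with a header row holding the column names described above. The 0.8 threshold is an arbitrary example value, not a recommendation from the kTYPr authors.

```python
import csv


def _as_float(value):
    """Return value as a float, or None if it is empty or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def confident_predictions(tsv_path, min_frac=0.8):
    """Return (genome_id, predicted) pairs that are complete and score >= min_frac."""
    selected = []
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            complete = _as_float(row.get("pred_is_complete"))
            frac = _as_float(row.get("pred_frac_bitscore"))
            # Rows with missing or non-numeric values are skipped.
            if complete == 1 and frac is not None and frac >= min_frac:
                selected.append((row["genome_id"], row["predicted"]))
    return selected
```

Rows with an incomplete predicted K-type are excluded here, but they may still be worth inspecting through the per-K-type columns.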

➕ Add custom K-types

To add new custom K-types, users can:

1. Download this repository.
2. Produce HMMs, as done in the main publication, for each of the specific genes to detect, and copy them into data/hmms.
3. Add a new line to data/ktypr_definitions_v20250512.tsv with the K-type name followed by the names of the genes (corresponding to HMM files) that compose it.
4. Add the cut-off for each HMM to data/hmm_cutoffs_v20250704.tsv. kTYPr uses curated internal thresholds to define a hit. If these are not known, you can set a cut-off of 0, but this can increase false-positive hits.
5. Add the maximum expected bitscore for the new K-type to data/max_bitscores_v20260107.tsv; it is used for the normalized bitscore in the output. If it is not known, use 1 as the maximum (this makes <KTYPE>_frac_bitscore equal to <KTYPE>_acc_bitscore).
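
The effect of the maximum-bitscore entry on the output columns can be illustrated with a small calculation. This is a sketch of the arithmetic as described in this README, not kTYPr's actual implementation. The function name, the gene names, and the one-bitscore-per-gene input format are hypothetical.

```python
def summarize_ktype(expected_genes, hit_bitscores, max_bitscore=1.0):
    """Compute the per-K-type summary columns from hits that passed their cut-offs.

    expected_genes: gene names defined for the K-type.
    hit_bitscores: dict mapping gene name to the bitscore of its accepted hit.
    max_bitscore: maximum expected accumulated bitscore (1 if unknown).
    """
    found = [gene for gene in expected_genes if gene in hit_bitscores]
    acc = sum(hit_bitscores[gene] for gene in found)
    return {
        "nr_genes": len(expected_genes),
        "genes_in_genome": len(found),
        "acc_bitscore": acc,
        "frac_bitscore": acc / max_bitscore,
        "is_complete": int(len(found) == len(expected_genes)),
    }


# Hypothetical K-type with three genes, two of which were detected.
genes = ["geneA", "geneB", "geneC"]
hits = {"geneA": 100.0, "geneB": 50.0}
print(summarize_ktype(genes, hits, max_bitscore=300.0))
print(summarize_ktype(genes, hits))  # max of 1: frac_bitscore equals acc_bitscore
```

With a maximum of 1, the fraction is no longer bounded by 1, so it is not comparable to fractions of K-types with a known maximum.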

🐍 kTYPr as Python package

Once installed, you can use kTYPr from your Python environment:

from ktypr import ktypr

# To show the arguments required:
help(ktypr)

This function takes the same arguments as the command line described above. For example, to run it on a genome, a directory, or a list of files:

ktypr(<genome_path>, <output_directory>)

🔎 BLAST-based approach

The collection of best genomes associated with a specific K-type can be used as a database for Kaptive, as done in EC-K-typing by Gladstone et al.

For this, download this multi-GenBank file and use it as:

kaptive assembly kTYPrDB_for_Kaptive.gbk <your_genomes> -o output.tsv

📖 Citation

If you use kTYPr in your research, please cite:

Preprint

❓ Need Help?

In the terminal:

ktypr --help

In Python:

from ktypr import ktypr
help(ktypr)

You can always open an issue in this repo or contact the maintainer at smiravet@ethz.ch.
