
🧬 kTYPr

kTYPr is a command-line tool for predicting K-antigen type classifications in Escherichia coli using HMM-based annotation profiling. It supports both whole-genome and flanking-region prediction modes and can be run in parallel on multiple files.




📦 Installation

Option 1 – Install via pip (recommended for most users)

If you already have Python and pip set up:

git clone https://github.com/SushiLab/kTYPr.git
cd kTYPr
pip install .

This will install ktypr and its dependencies.

To install in editable/development mode (so changes to the code apply immediately):

pip install -e .

Note that this installation can be done inside an existing conda environment (slower, can take minutes) or a venv environment (faster, takes seconds).

Option 2 – Use a Conda environment (safer, isolated setup)

If you use Miniconda or Anaconda:

git clone https://github.com/SushiLab/kTYPr.git
cd kTYPr
conda env create -f environment.yml
conda activate ktypr_env
pip install .

This will create and activate a new environment named ktypr_env with all dependencies and install ktypr into it. For an editable install, use pip install -e . instead of pip install .


🚀 Usage

After installation, you can run the tool with:

ktypr --help

Basic command

ktypr -i <input_path> 

Required arguments

  • -i, --input: Path to a single file (e.g., .faa, .fna, .gbk), a directory containing multiple annotation files, or a .txt file listing full paths (one per line) to the genome/annotation files.
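The .txt input can be generated rather than written by hand. The following is a minimal sketch, assuming the genomes sit in one directory; the helper name write_input_list and the set of suffixes are illustrative assumptions (the suffixes are the ones mentioned in this README), not part of kTYPr.

```python
from pathlib import Path

# Suffixes mentioned in this README; this set is an assumption, adjust as needed.
GENOME_SUFFIXES = {".faa", ".fna", ".fasta", ".fa", ".gbk"}


def write_input_list(genome_dir, list_path):
    """Write the full path of each genome/annotation file in genome_dir
    to list_path, one path per line (hypothetical helper)."""
    paths = sorted(
        p.resolve()
        for p in Path(genome_dir).iterdir()
        if p.is_file() and p.suffix.lower() in GENOME_SUFFIXES
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

For example, write_input_list("./test/genomes/fasta", "genome_list.txt") produces a file that can then be passed as ktypr -i genome_list.txt.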

🔧 Options

Option Description
-i, --input Input path (required argument), which can be:
  • A genome file (.fasta, .gbk, etc.)
  • An annotation file (.faa)
  • A directory containing such files
  • A .txt file listing paths to input files
-o, --output Output directory where all results will be saved. Defaults to ./ktypr_results. The directory will be created if it does not exist.
-m, --mode Prediction mode:
  • 0 (default): Flanking mode. Only genes located upstream and downstream of the kpsC gene, within a specified flanking window, are used for prediction.
  • 1: Whole genome mode. All annotated genes in the genome are considered.
-f, --flank Flanking window size in base pairs around the kpsC gene to evaluate (default: 30000 bp up- and downstream). This option is ignored when using whole genome mode (-m 1).
-n, --n-jobs Number of parallel jobs to run:
  • -1 (default): use all available CPU cores.
  • 1: run sequentially without parallelism.
  • Any other positive integer specifies the exact number of CPU cores to use.
-p, --prefix Optional prefix to prepend to all output file names. If not provided, the base name of each input genome or annotation file will be used as the prefix.
-s, --short Flag to run Prodigal gene calling in metagenomic mode for all input sequences. Recommended when short sequences are provided.
-r, --reannotate Flag to force re-annotation of genes using Prodigal, even if annotations are already present in the genome file. Useful to ensure consistent annotations when needed.
-c, --clinker Flag to produce clinker reports. Only available in flanking mode, and only when a K-type is assigned. This can be computationally expensive, so it does not run by default.
-v, --verbose Enable verbose mode for detailed logging and debugging information during the run.
-ko, --keep_output Keep intermediate output files: 0 = do not keep intermediate files (recommended for large collections), 1 = keep intermediate files (default)
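To make the flanking mode (-m 0) and the -f window more concrete, here is a rough sketch of how a window around a gene could be computed and used to select neighbouring genes. It is an illustration only, not kTYPr's code: the function names, the 0-based coordinates, measuring the window from the gene boundaries, and keeping only genes that lie entirely inside the window are all assumptions.

```python
def flank_window(gene_start, gene_end, contig_length, flank=30000):
    """Return (start, end) of the region spanning `flank` bp on each side
    of a gene, clipped to the contig boundaries (assumed behaviour)."""
    start = max(0, gene_start - flank)
    end = min(contig_length, gene_end + flank)
    return start, end


def genes_in_window(genes, window):
    """Keep the (name, start, end) genes lying entirely inside the window."""
    w_start, w_end = window
    return [g for g in genes if g[1] >= w_start and g[2] <= w_end]
```

With the default of 30000 bp, a gene spanning positions 50000 to 52000 on a 100000 bp contig gives the window (20000, 82000); on a shorter contig the window is clipped at the contig ends.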

Genome annotation considerations

Please note the following aspects:

  • kTYPr accepts annotated genomes (as .faa or .gbk); in this case, gene calling is not run. Use -r to force re-annotation (only possible when a GenBank file is provided).
  • By default, we use pyrodigal in single mode for gene annotation. This can raise errors when the sequence is too short to train a gene caller specific to the input genome (the minimum length is 20,000 bases). In these cases, meta mode is activated automatically; note that this can produce less accurate gene sequences (shorter or longer than they should be).
    • For consistent gene calling, you can set -s to treat all provided sequences as short sequences and call genes with pyrodigal in metagenomic mode.
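The rule described above can be summarized in a few lines. This is a sketch of the decision logic as stated in this section, not kTYPr's actual implementation; the function name and the force_short parameter (standing in for the -s flag) are hypothetical, and whether the threshold applies per sequence or per genome is not specified here.

```python
# Minimum length for single-mode training, as stated in this section.
MIN_TRAINING_LENGTH = 20_000


def choose_gene_calling_mode(sequence_length, force_short=False):
    """Return "meta" or "single" following the rule described above.

    force_short mirrors the -s flag: all sequences are treated as short.
    """
    if force_short or sequence_length < MIN_TRAINING_LENGTH:
        return "meta"
    return "single"
```

Setting force_short=True for every input is what makes the gene calls comparable across a collection that mixes long and short sequences.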

🧪 Example

Run K-type prediction in flanking mode on a folder containing .fa files, using all available cores (-n -1), producing clinker reports (-c), saving results to a custom folder (-o ./results/), and printing the running processes to the terminal (-v):

ktypr -i ./test/genomes/fasta -o ./results/ -n -1 -c -v

Run in whole-genome mode (--mode 1) on a text file of paths, with a custom prefix to name all the files (-p). Do not keep single-genome outputs, only the combined prediction files (-ko 0). A custom flanking size is also passed here, but, as noted under Options, --flank is ignored in whole-genome mode:

ktypr -i genome_list.txt --flank 25000 -p ecoli_run -ko 0 --mode 1
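When the command line needs to be launched from a script (for example inside a larger pipeline), the call can be assembled programmatically. The sketch below only uses flags documented in the Options section; the helper names build_ktypr_command and run_ktypr are hypothetical and not part of kTYPr.

```python
import subprocess


def build_ktypr_command(input_path, output_dir=None, mode=0, flank=None,
                        n_jobs=None, prefix=None, keep_output=None,
                        clinker=False, verbose=False):
    """Assemble a ktypr argument list from the documented options."""
    cmd = ["ktypr", "-i", str(input_path), "-m", str(mode)]
    optional_values = [("-o", output_dir), ("-f", flank), ("-n", n_jobs),
                       ("-p", prefix), ("-ko", keep_output)]
    for flag, value in optional_values:
        if value is not None:  # leave unset options to the tool's defaults
            cmd += [flag, str(value)]
    if clinker:
        cmd.append("-c")
    if verbose:
        cmd.append("-v")
    return cmd


def run_ktypr(**kwargs):
    """Run ktypr and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_ktypr_command(**kwargs), check=True)
```

For instance, run_ktypr(input_path="genome_list.txt", mode=1, prefix="ecoli_run", keep_output=0) corresponds to the whole-genome example above without the --flank argument.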

The expected runtime for a single genome prediction should be less than 30 seconds on a typical desktop computer.


🗃 Output

For each input genome, kTYPr creates a dedicated folder containing the following files:

Filename suffix Description
.gff Annotation file in GFF format. Contains coordinates and features of predicted genes.
.faa Protein sequences of all annotated genes in FASTA format.
_flanks.faa Protein sequences extracted from the flanking region around the kpsC gene (only in flanking mode).
_hits.tsv.gz Compressed TSV file listing all detected HMM hits (annotations) with their scores and locations.
_filtered_hits.tsv.gz Compressed TSV file with filtered HMM hits after applying score thresholds.
_ktypr.tsv Summary TSV file containing the final K-antigen type prediction results for the genome or annotation set.
.gbk Full genome file with annotations in GenBank format, optionally including re-annotation results.
clink.html Clinker HTML report against the best K-antigen type predicted.
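The TSV outputs listed above, compressed or not, can be loaded with the Python standard library alone. This is a minimal sketch, assuming each file starts with a header row; the helper name read_tsv is illustrative, and the column names of the hit tables are not documented here, so rows are returned as generic dictionaries.

```python
import csv
import gzip


def read_tsv(path):
    """Read a TSV file (optionally gzip-compressed) with a header row
    into a list of dictionaries keyed by column name."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

The same function works for the _hits.tsv.gz, _filtered_hits.tsv.gz and _ktypr.tsv files of a genome folder.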

Collection results

If more than one genome is given, a summary table of all classifications is also written to <prefix>results_ktypr.tsv in the selected output directory. This file summarizes the K-antigen (K-type) prediction results for each genome or annotation set. It contains both the final predicted type and detailed match statistics for conserved regions and all candidate K-types.

Columns description:

Column name Description
genome_id Genome identifier (extracted from the basename of the genome file).
predicted Best-matching K-type predicted for the genome based on gene content and HMM match scores.
pred_nr_genes Number of genes expected for the predicted K-type (based on its reference profile).
pred_genes_in_genome Number of those expected genes actually found in the genome.
pred_acc_bitscore Sum of HMM bitscores across all hits matching the predicted K-type. A proxy for overall match strength.
pred_frac_bitscore pred_acc_bitscore normalized by the maximum accumulated bitscore for that K-type in RefSeq profiled genomes.
pred_is_complete Indicates if all expected genes for the predicted K-type were found in the genome (1 = complete, 0 = incomplete).

These are followed by conserved K-locus genes shared across many K-types, such as those in the KpsEDCSMT and KpsFU operons:

Column name Description
KpsEDCSMT_nr_genes Number of conserved EDCSMT genes expected.
KpsEDCSMT_genes_in_genome Number of EDCSMT genes found in the genome.
KpsEDCSMT_acc_bitscore Cumulative bitscore for the EDCSMT gene matches.
KpsEDCSMT_frac_bitscore KpsEDCSMT_acc_bitscore normalized by the maximum KpsEDCSMT bitscore in RefSeq profiled genomes
KpsEDCSMT_is_complete Whether all EDCSMT genes were found (1) or not (0).
KpsFU_nr_genes, ... Same structure as above, but for the conserved FU gene set.

For further exploration, the following columns provide detailed information for each known K-type in our database:

Column suffix Description
<KTYPE>_nr_genes Number of genes expected for this K-type.
<KTYPE>_genes_in_genome Number of expected genes found in the genome.
<KTYPE>_acc_bitscore Sum of HMM match bitscores for this K-type's genes.
<KTYPE>_frac_bitscore <KTYPE>_acc_bitscore normalized by the maximum accumulated bitscore for that K-type in RefSeq profiled genomes.
<KTYPE>_is_complete Whether the K-type is fully present in the genome (1) or not (0).

As an example of the collection output, test/output contains the results of running:

ktypr -i ./test/genomes/fasta -v -o ./test/output/ -p run1_
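The collection table can be explored with a short script. The sketch below relies only on the column names documented in this section; the function names are hypothetical, and it assumes that pred_is_complete is written as 1 or 0 and that empty cells mean no value.

```python
import csv
import math

# Column prefixes that are not candidate K-types (see the tables above).
NON_KTYPE_PREFIXES = ("pred", "KpsEDCSMT", "KpsFU")


def _as_float(value):
    """Convert a cell to float; return None for empty, missing or NaN cells."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(number) else number


def load_summary(path):
    """Load <prefix>results_ktypr.tsv as a list of row dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def complete_predictions(rows, min_frac_bitscore=0.0):
    """Return (genome_id, predicted, pred_frac_bitscore) for complete loci."""
    selected = []
    for row in rows:
        frac = _as_float(row.get("pred_frac_bitscore"))
        is_complete = _as_float(row.get("pred_is_complete")) == 1.0
        if is_complete and frac is not None and frac >= min_frac_bitscore:
            selected.append((row["genome_id"], row["predicted"], frac))
    return selected


def rank_ktypes(row):
    """Rank candidate K-types of one genome by <KTYPE>_frac_bitscore."""
    suffix = "_frac_bitscore"
    ranked = []
    for column, value in row.items():
        if column is None or not column.endswith(suffix):
            continue
        name = column[: -len(suffix)]
        score = _as_float(value)
        if name not in NON_KTYPE_PREFIXES and score is not None:
            ranked.append((name, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Ranking the candidates for a genome is a quick way to see how far the predicted type is ahead of the runner-up.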

➕ Add custom K-types

To add new custom K-types, users can:

  1. Download this repository.
  2. Produce HMMs, as done in the main publication, for each of the specific genes to detect, and copy them into data/hmms.
  3. Add a new line to data/ktypr_definitions_v20250512.tsv with the K-type name followed by the names of the genes (corresponding to HMM files) that compose it.
  4. Add the cut-off for each HMM to data/hmm_cutoffs_v20250704.tsv. kTYPr uses curated internal thresholds to define a hit; if these are not known, you can set a cut-off of 0, but this can increase false-positive hits.
  5. Add the maximum expected bitscore for the new K-type to data/max_bitscores_v20260107.tsv; it is used for the normalized bitscore in the output. If not known, simply use 1 as the maximum (this makes <KTYPE>_frac_bitscore equal to <KTYPE>_acc_bitscore).
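The effect of the maximum bitscore in step 5 can be shown with a one-line normalization. This sketch only restates the definition of <KTYPE>_frac_bitscore given in the output tables (accumulated bitscore divided by the maximum); the function name and the example values are illustrative, not kTYPr code or real scores.

```python
def frac_bitscore(acc_bitscore, max_bitscore=1.0):
    """Normalize an accumulated bitscore by the maximum expected bitscore.

    With the placeholder maximum of 1, the result equals acc_bitscore.
    """
    if max_bitscore <= 0:
        raise ValueError("max_bitscore must be positive")
    return acc_bitscore / max_bitscore
```

With a known maximum the value is a fraction of the best observed match; with the placeholder maximum of 1 it stays on the raw bitscore scale, so it should not be compared with the fractions of other K-types.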

🐍 kTYPr as Python package

Once installed, you can use kTYPr within your Python environment as follows:

from ktypr import ktypr

# To show the arguments required:
help(ktypr)

This function shares the arguments described above. For example, to call it on a genome, a directory, or a list of files:

ktypr(<genome_path>, <output_directory>)

🔎 BLAST-based approach

The collection of best genomes associated with a specific K-type can be used as a database for Kaptive, as done in EC-K-typing by Gladstone et al.

To do this, download this multi-GenBank file and use it as follows:

kaptive assembly kTYPrDB_for_Kaptive.gbk <your_genomes> -o output.tsv

📖 Citation

If you use kTYPr in your research, please cite:


❓ Need Help?

In the terminal:

ktypr --help

In Python:

from ktypr import ktypr
help(ktypr)

You can always open an issue in this repo or contact the maintainer at smiravet@ethz.ch.
