Version: 1.0.0
Purpose: Multimodal AI inference service for General Bots
BotModels is a Python-based AI inference service that provides multimodal capabilities to the General Bots platform. It serves as a companion to botserver (Rust), specializing in cutting-edge AI/ML models from the Python ecosystem including image generation, video creation, speech synthesis, and vision/captioning.
While botserver handles business logic, networking, and systems-level operations, BotModels exists solely to leverage the extensive Python AI/ML ecosystem for inference tasks that are impractical to implement in Rust.
For comprehensive documentation, see docs.pragmatismo.com.br or the BotBook for detailed guides, API references, and tutorials.
- Image Generation: Generate images from text prompts using Stable Diffusion
- Video Generation: Create short videos from text descriptions using Zeroscope
- Speech Synthesis: Text-to-speech using Coqui TTS
- Speech Recognition: Audio transcription using OpenAI Whisper
- Vision/Captioning: Image and video description using BLIP2
# Clone the repository
cd botmodels
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# or
.\venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Copy the example environment file and configure:
cp .env.example .env
Edit .env with your settings:
HOST=0.0.0.0
PORT=8085
API_KEY=your-secret-key
DEVICE=cuda
IMAGE_MODEL_PATH=./models/stable-diffusion-v1-5
VIDEO_MODEL_PATH=./models/zeroscope-v2
VISION_MODEL_PATH=./models/blip2
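These variables are read when the service starts. As an illustrative sketch only (the real src/core/config.py is not reproduced here and may use a different mechanism, e.g. pydantic-settings), the settings could be pulled from the environment like this:

```python
# Hypothetical sketch of src/core/config.py: read settings from environment variables.
# Assumes the .env values have already been exported or loaded (e.g., via python-dotenv).
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    host: str = os.getenv("HOST", "0.0.0.0")
    port: int = int(os.getenv("PORT", "8085"))
    api_key: str = os.getenv("API_KEY", "")
    device: str = os.getenv("DEVICE", "cuda")
    image_model_path: str = os.getenv("IMAGE_MODEL_PATH", "./models/stable-diffusion-v1-5")
    video_model_path: str = os.getenv("VIDEO_MODEL_PATH", "./models/zeroscope-v2")
    vision_model_path: str = os.getenv("VISION_MODEL_PATH", "./models/blip2")


settings = Settings()
```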
# Development mode
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --reload
# Production mode
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --workers 4
# With HTTPS (production)
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --ssl-keyfile key.pem --ssl-certfile cert.pem
- Rust vs. Python Rule:
- If logic is deterministic, systems-level, or performance-critical: Do it in Rust (botserver)
- If logic requires cutting-edge ML models, rapid experimentation with HuggingFace, or specific Python-only libraries: Do it here
- Inference Only: This service should NOT hold business state. It accepts inputs, runs inference, and returns predictions.
- Stateless: Treated as a sidecar to botserver.
- API First: Exposes strict HTTP/REST endpoints consumed by botserver.
- Runtime: Python 3.10+
- Web Framework: FastAPI (preferred over Flask for async/performance)
- ML Frameworks: PyTorch, HuggingFace Transformers, Diffusers
- Quality: ruff (linting), black (formatting), mypy (typing)
All endpoints require the X-API-Key header for authentication.
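As a non-authoritative sketch of how that check could be wired up (the actual src/api/dependencies.py may differ; the settings import assumes a config module exposing the configured API_KEY):

```python
# Illustrative X-API-Key dependency; module paths and names are assumptions.
import secrets

from fastapi import Header, HTTPException

from src.core.config import settings  # assumed to expose the configured API_KEY


async def verify_api_key(x_api_key: str = Header(...)) -> None:
    # Constant-time comparison avoids leaking key information through timing.
    if not settings.api_key or not secrets.compare_digest(x_api_key, settings.api_key):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
```

Routers would then attach the check with dependencies=[Depends(verify_api_key)].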
POST /api/image/generate
Content-Type: application/json
X-API-Key: your-api-key
{
"prompt": "a cute cat playing with yarn",
"steps": 30,
"width": 512,
"height": 512,
"guidance_scale": 7.5,
"seed": 42
}
POST /api/video/generate
Content-Type: application/json
X-API-Key: your-api-key
{
"prompt": "a rocket launching into space",
"num_frames": 24,
"fps": 8,
"steps": 50
}
POST /api/speech/generate
Content-Type: application/json
X-API-Key: your-api-key
{
"prompt": "Hello, welcome to our service!",
"voice": "default",
"language": "en"
}
POST /api/speech/totext
Content-Type: multipart/form-data
X-API-Key: your-api-key
file: <audio_file>
POST /api/vision/describe
Content-Type: multipart/form-data
X-API-Key: your-api-key
file: <image_file>
prompt: "What is in this image?" (optional)POST /api/vision/describe_video
Content-Type: multipart/form-data
X-API-Key: your-api-key
file: <video_file>
num_frames: 8 (optional)
POST /api/vision/vqa
Content-Type: multipart/form-data
X-API-Key: your-api-key
file: <image_file>
question: "How many people are in this image?"GET /api/healthInteractive API documentation:
- Swagger UI:
http://localhost:8085/api/docs - ReDoc:
http://localhost:8085/api/redoc
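As an illustrative client sketch (the response schemas are not documented in this README, so the example only inspects status and content type; host, port, and key must match your .env):

```python
# Hypothetical client calls for the JSON and multipart endpoints above.
import requests

BASE_URL = "http://localhost:8085"
HEADERS = {"X-API-Key": "your-secret-key"}  # must match API_KEY on the server

# JSON endpoint: image generation
resp = requests.post(
    f"{BASE_URL}/api/image/generate",
    headers=HEADERS,
    json={
        "prompt": "a cute cat playing with yarn",
        "steps": 30,
        "width": 512,
        "height": 512,
        "guidance_scale": 7.5,
        "seed": 42,
    },
    timeout=300,
)
print(resp.status_code, resp.headers.get("content-type"))

# Multipart endpoint: image captioning
with open("photo.jpg", "rb") as image_file:
    resp = requests.post(
        f"{BASE_URL}/api/vision/describe",
        headers=HEADERS,
        files={"file": image_file},
        data={"prompt": "What is in this image?"},
        timeout=300,
    )
print(resp.status_code, resp.headers.get("content-type"))
```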
key,value
botmodels-enabled,true
botmodels-host,0.0.0.0
botmodels-port,8085
botmodels-api-key,your-secret-key
botmodels-https,false
image-generator-model,../../../../data/diffusion/sd_turbo_f16.gguf
image-generator-steps,4
image-generator-width,512
image-generator-height,512
video-generator-model,../../../../data/diffusion/zeroscope_v2_576w
video-generator-frames,24
video-generator-fps,8
// Generate an image
file = IMAGE "a beautiful sunset over mountains"
SEND FILE TO user, file
// Generate a video
video = VIDEO "waves crashing on a beach"
SEND FILE TO user, video
// Generate speech
audio = AUDIO "Welcome to General Bots!"
SEND FILE TO user, audio
// Get image/video description
caption = SEE "/path/to/image.jpg"
TALK caption
┌───────────────┐    HTTPS    ┌───────────────┐
│   botserver   │ ──────────▶ │   botmodels   │
│    (Rust)     │             │   (Python)    │
└───────────────┘             └───────────────┘
        │                             │
        │ BASIC Keywords              │ AI Models
        │ - IMAGE                     │ - Stable Diffusion
        │ - VIDEO                     │ - Zeroscope
        │ - AUDIO                     │ - TTS/Whisper
        │ - SEE                       │ - BLIP2
        ▼                             ▼
┌───────────────┐             ┌───────────────┐
│    config     │             │    outputs    │
│     .csv      │             │    (files)    │
└───────────────┘             └───────────────┘
- Deprecate Legacy: Move away from outdated libs (e.g., old allennlp) in favor of HuggingFace Transformers and Diffusers
- Quantization: Always consider quantized models (bitsandbytes, GGUF) to reduce VRAM usage
- Lazy Loading: Do NOT load 10GB models at module import time. Load on startup lifecycle or first request with locking
- GPU Handling: Robustly detect CUDA/MPS (Mac) and fallback to CPU gracefully
- Type Hints: All functions MUST have type hints
- Error Handling: No bare except:. Catch precise exceptions and return structured JSON errors to botserver
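A minimal sketch of how a service module could follow the lazy-loading and GPU-fallback rules above, assuming PyTorch and Diffusers; the class and helper names are illustrative, not the actual src/services/image_service.py:

```python
# Illustrative only: lock-guarded lazy loading with CUDA/MPS/CPU fallback.
import threading

import torch
from diffusers import StableDiffusionPipeline
from PIL import Image


def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


class ImageService:
    def __init__(self, model_path: str) -> None:
        self._model_path = model_path
        self._pipe: StableDiffusionPipeline | None = None
        self._lock = threading.Lock()  # prevents double-loading under concurrent requests

    def _get_pipe(self) -> StableDiffusionPipeline:
        if self._pipe is None:  # nothing is loaded at import time
            with self._lock:
                if self._pipe is None:  # double-checked locking
                    device = pick_device()
                    dtype = torch.float16 if device == "cuda" else torch.float32
                    pipe = StableDiffusionPipeline.from_pretrained(
                        self._model_path, torch_dtype=dtype
                    )
                    self._pipe = pipe.to(device)
        return self._pipe

    def generate(self, prompt: str, steps: int = 30) -> Image.Image:
        # Precise errors (e.g., torch.cuda.OutOfMemoryError) should be caught by the
        # caller and turned into structured JSON responses; no bare except:.
        result = self._get_pipe()(prompt, num_inference_steps=steps)
        return result.images[0]
```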
botmodels/
├── src/
│   ├── api/
│   │   ├── v1/
│   │   │   └── endpoints/
│   │   │       ├── image.py
│   │   │       ├── video.py
│   │   │       ├── speech.py
│   │   │       └── vision.py
│   │   └── dependencies.py
│   ├── core/
│   │   ├── config.py
│   │   └── logging.py
│   ├── schemas/
│   │   └── generation.py
│   ├── services/
│   │   ├── image_service.py
│   │   ├── video_service.py
│   │   ├── speech_service.py
│   │   └── vision_service.py
│   └── main.py
├── outputs/
├── models/
├── tests/
├── requirements.txt
└── README.md
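The tests/ directory holds the pytest suite, run with the command below. As an illustrative sketch (assuming src.main exposes app and that the health endpoint also expects the API key header), a basic smoke test could look like:

```python
# tests/test_health.py (hypothetical): smoke-test the health endpoint.
from fastapi.testclient import TestClient

from src.main import app

client = TestClient(app)


def test_health() -> None:
    response = client.get("/api/health", headers={"X-API-Key": "your-secret-key"})
    assert response.status_code == 200
```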
pytest tests/
- Always use HTTPS in production
- Use strong, unique API keys
- Restrict network access to the service
- Consider running on a separate GPU server
- Monitor resource usage and set appropriate limits
For complete documentation, guides, and API references:
- docs.pragmatismo.com.br - Full online documentation
- BotBook - Local comprehensive guide with tutorials and examples
- General Bots Repository - Main project repository
- Python 3.10+
- CUDA-capable GPU (recommended, 8GB+ VRAM)
- 16GB+ RAM
- Inference Only: No business state, just predictions
- Modern Models: Use HuggingFace Transformers, Diffusers
- Type Safety: All functions must have type hints
- Lazy Loading: Don't load models at import time
- GPU Detection: Graceful fallback to CPU
- Version 1.0.0 - Do not change without approval
- GIT WORKFLOW - ALWAYS push to ALL repositories (github, pragmatismo)
See LICENSE file for details.