An experimental project for evaluating small AI models using lm-eval, with a beautiful interactive dashboard for visualizing results.
This project provides a complete workflow for:
- Running evaluations on tiny AI models using lm-eval
- Creating Wikipedia knowledge bases with vector embeddings
- Visualizing results with an interactive web dashboard
- Comparing performance across different model runs
- Analyzing detailed metrics and failure patterns
- Python 3.8+
- Virtual environment (recommended)
- CUDA/MPS support for GPU acceleration (optional)
1. Create and activate virtual environment:

        python -m venv venv
        source venv/bin/activate  # On Windows: venv\Scripts\activate

2. Install dependencies:

        pip install -r requirements.txt
Run a quick test with limited samples:

    lm_eval --model hf \
        --model_args pretrained=meta-llama/Llama-3.2-3B \
        --tasks hellaswag,mmlu \
        --limit 10 \
        --output_path ./results/ \
        --device mps

Run complete evaluation on all questions:

    lm_eval --model hf \
        --model_args pretrained=meta-llama/Llama-3.2-3B \
        --tasks hellaswag,mmlu \
        --output_path ./results/ \
        --device mps

Run evaluation with Wikipedia knowledge base augmentation:
Quick test (10 questions):

    python rag_eval.py \
        --model meta-llama/Llama-3.2-3B \
        --tasks mmlu_global_facts \
        --device mps \
        --limit 10

Full evaluation:

    python rag_eval.py \
        --model meta-llama/Llama-3.2-3B \
        --tasks mmlu_global_facts \
        --device mps

Customize retrieval:

    python rag_eval.py \
        --model meta-llama/Llama-3.2-3B \
        --tasks mmlu_global_facts \
        --device mps \
        --n_retrieval 5  # Retrieve 5 Wikipedia chunks instead of 3

- `--model hf`: Use HuggingFace models
- `--model_args pretrained=<model_name>`: Specify the model to evaluate
- `--tasks hellaswag,mmlu`: Tasks to run (HellaSwag and MMLU benchmarks)
- `--limit <number>`: Limit number of questions (useful for testing)
- `--output_path ./results/`: Directory to save results
- `--device mps`: Use Apple Silicon GPU (or `cuda` for NVIDIA, `cpu` for CPU-only)
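The same flags can also be driven from Python. The snippet below is a minimal sketch assuming lm-eval >= 0.4, where `simple_evaluate` mirrors the CLI options listed above; it is not the project's own tooling, just an illustration.

```python
# Minimal sketch of the quick test above, run programmatically.
# Assumes lm-eval >= 0.4, where simple_evaluate mirrors the CLI flags.
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-3B",
    tasks=["hellaswag", "mmlu"],
    limit=10,        # drop this for a full run
    device="mps",    # or "cuda" / "cpu"
)
print(json.dumps(results["results"], indent=2, default=str))
```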
Any HuggingFace model compatible with transformers:
- `meta-llama/Llama-3.2-3B` (used in examples)
- `microsoft/DialoGPT-medium`
- `EleutherAI/gpt-neo-2.7B`
- And many more...
- hellaswag: Commonsense reasoning (10,042 questions)
- mmlu: Massive multitask language understanding (15,908 questions across 57 subjects)
- arc: AI2 reasoning challenge
- truthfulqa: Truthfulness evaluation
- And 100+ other benchmarks
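To see which benchmarks your installed version actually supports, you can query lm-eval's task registry. This is a hedged sketch assuming lm-eval >= 0.4 exposes a `TaskManager`; the exact API may differ between versions.

```python
# Hedged sketch: list available benchmark names from lm-eval's task registry.
# Assumes lm-eval >= 0.4; older versions expose the registry differently.
from lm_eval.tasks import TaskManager

task_manager = TaskManager()
all_tasks = task_manager.all_tasks                 # registry of task names
print(f"{len(all_tasks)} tasks available")
print([name for name in all_tasks if name.startswith("mmlu")][:10])
```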
The project includes tools for creating a Wikipedia knowledge base using vector embeddings:
1. Download Simple Wikipedia:

        python simple-wiki-dl.py

2. Create embeddings:

        python create_embeddings_simple.py

3. Query the knowledge base:

        python query_embeddings.py --query "What is machine learning?"
- Vector embeddings using `sentence-transformers/all-mpnet-base-v2`
- ChromaDB storage for efficient similarity search
- Smart chunking - Articles split into manageable pieces
- Semantic search - Find relevant content by meaning, not just keywords
- Metadata tracking - Title, chunk ID, and article information
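For illustration, the core of the embedding step looks roughly like the sketch below. The collection name `simple_wikipedia` and the metadata keys are assumptions; see `create_embeddings_simple.py` for the real implementation.

```python
# Rough sketch of storing chunked articles in ChromaDB with metadata.
# Collection name and metadata keys are assumptions for illustration.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("simple_wikipedia")

title = "Photosynthesis"
chunks = [
    "Photosynthesis is the process plants use to turn light into food.",
    "It happens in the chloroplasts of plant cells.",
]

collection.add(
    ids=[f"{title}-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
    metadatas=[{"title": title, "chunk_id": i} for i in range(len(chunks))],
)
```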
    # Check collection statistics
    python query_embeddings.py --stats

    # Search for specific topics
    python query_embeddings.py --query "What is photosynthesis?"

    # Interactive search mode
    python query_embeddings.py

- Interactive Charts: Visualize performance across different tasks and subjects
- Key Metrics: Quick overview of overall accuracy, HellaSwag, and MMLU scores
- Sortable Results Table: Complete breakdown with visual accuracy bars and sorting
- Test Duration: Shows how long evaluations took to complete
- Question Breakdown: Shows total questions evaluated per test
- Model Comparison: Compare results across different evaluation runs
- Responsive Design: Works on desktop and mobile devices
1. Run an evaluation:

        lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B --tasks hellaswag,mmlu --limit 10 --output_path ./results/ --device mps

2. Start the dashboard server:

        cd dashboard
        python server.py

3. Open your browser: The dashboard will automatically open at http://localhost:8080

4. View your results:
   - Choose from available models in the dropdown
   - Select a specific evaluation run
   - View interactive charts and detailed results
   - Sort the results table by accuracy, task name, or error margin
The server provides REST API endpoints for programmatic access:
- `GET /api/models` - List all available models
- `GET /api/runs?model_id=<id>` - List runs for a specific model
- `GET /api/data/<run_id>` - Get detailed data for a specific run
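As a rough usage sketch, the endpoints can be called with the standard library; this assumes the server is running locally on the default port 8080 and that responses include an `id` field, which is an assumption about the payload shape.

```python
# Hedged sketch: call the dashboard API with the standard library only.
# Assumes the server runs at localhost:8080 and responses carry an "id" field.
import json
from urllib.request import urlopen

BASE = "http://localhost:8080"

def get_json(path):
    with urlopen(BASE + path) as resp:
        return json.loads(resp.read().decode("utf-8"))

models = get_json("/api/models")                            # all available models
runs = get_json(f"/api/runs?model_id={models[0]['id']}")    # runs for the first model
run_data = get_json(f"/api/data/{runs[0]['id']}")           # detailed data for one run
print(json.dumps(run_data, indent=2)[:500])
```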
    tiny-ai-models/
    ├── dashboard/
    │   ├── index.html                     # Interactive dashboard interface
    │   └── server.py                      # HTTP server with API endpoints
    ├── results/                           # Evaluation results (auto-generated)
    │   ├── meta-llama__Llama-3.2-3B/      # Baseline model results
    │   │   ├── results_2025-10-03T14-37-54.407411.json  # Limited test
    │   │   └── results_2025-10-03T23-47-07.831204.json  # Full evaluation
    │   └── rag-meta-llama__Llama-3.2-3B/  # RAG-enhanced model results
    │       └── results_2025-10-08T18:09:50.857457.json
    ├── simple_wikipedia/                  # Wikipedia dataset (auto-generated)
    ├── chroma_db/                         # Vector embeddings database (auto-generated)
    ├── simple-wiki-dl.py                  # Wikipedia downloader script
    ├── create_embeddings_simple.py        # Embedding creation script
    ├── query_embeddings.py                # Knowledge base query tool
    ├── rag_eval.py                        # RAG-enhanced evaluation script (NEW!)
    ├── requirements.txt                   # Python dependencies
    ├── README.md                          # This file
    └── notes.md                           # Project notes and development log
1. Test run (quick validation):

        lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B --tasks hellaswag,mmlu --limit 10 --output_path ./results/ --device mps

2. Full evaluation (comprehensive testing):

        lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B --tasks hellaswag,mmlu --output_path ./results/ --device mps

3. View results in dashboard:

        cd dashboard && python server.py

4. Analyze performance:
   - Compare limited vs full test results
   - Sort by accuracy to find best/worst subjects
   - Check test duration and question counts
   - Identify areas for improvement
Evaluate different models:

    # Smaller model
    lm_eval --model hf --model_args pretrained=microsoft/DialoGPT-medium --tasks hellaswag,mmlu --limit 100 --output_path ./results/ --device mps

    # Different model size
    lm_eval --model hf --model_args pretrained=EleutherAI/gpt-neo-2.7B --tasks hellaswag,mmlu --output_path ./results/ --device mps

Try other evaluation tasks:

    # Add more benchmarks
    lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B --tasks hellaswag,mmlu,arc,truthfulqa --limit 100 --output_path ./results/ --device mps

For more detailed output (if supported):

    lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B --tasks hellaswag,mmlu --limit 10 --output_path ./results/ --device mps --log_samples

By default, the server looks for results in `../results/`. To change this:

    python server.py 8080 /path/to/your/results
    python server.py 3000   # Use port 3000 instead of 8080

- Evaluation Engine: lm-eval (Language Model Evaluation Harness)
- Frontend: Vanilla JavaScript with Chart.js for visualizations
- Backend: Python HTTP server with JSON API
- Data Format: Expects lm-eval JSON output format
- Browser Support: Modern browsers with ES6+ support
- GPU Support: MPS (Apple Silicon), CUDA (NVIDIA), CPU fallback
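Because the dashboard simply reads lm-eval's JSON output, the same files can be inspected directly. The sketch below assumes the standard lm-eval layout with a `results` object keyed by task name and `acc,none` accuracy fields; the `acc_stderr,none` field name is an assumption.

```python
# Hedged sketch: read the newest results file for a model and print per-task
# accuracy. Assumes lm-eval's JSON layout with "results" keyed by task name.
import json
from pathlib import Path

run_files = sorted(Path("results/meta-llama__Llama-3.2-3B").glob("results_*.json"))
data = json.loads(run_files[-1].read_text())

for task, metrics in data["results"].items():
    acc = metrics.get("acc,none")
    if acc is None:
        continue
    line = f"{task:<40} acc={acc:.3f}"
    stderr = metrics.get("acc_stderr,none")  # field name is an assumption
    if isinstance(stderr, float):
        line += f" +/- {stderr:.3f}"
    print(line)
```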
"Model not found" error:
- Verify the model name is correct on HuggingFace
- Check internet connection for model download
- Ensure sufficient disk space for model storage
Out of memory errors:
- Use `--limit` to reduce question count
- Try CPU evaluation with `--device cpu`
- Consider smaller models

Slow evaluation:
- Use GPU acceleration (`--device mps` or `--device cuda`)
- Reduce `--limit` for testing
- Check available system resources
"No models found" error:
- Ensure your results directory contains model subdirectories
- Check that JSON files follow the naming pattern `results_*.json`
- Verify lm-eval completed successfully
Charts not displaying:
- Check browser console for JavaScript errors
- Ensure your JSON data contains the expected structure with `acc,none` fields
- Try refreshing the page
Server won't start:
- Check that port 8080 is available
- Ensure Python 3.6+ is installed
- Verify the results directory path is correct
- Start small: Use `--limit 10` for initial testing
- Monitor resources: Watch CPU/GPU usage during evaluation
- Save results: Results are automatically saved with timestamps
- Compare runs: Use the dashboard to compare different evaluation runs
Retrieval-Augmented Generation (RAG) enhances language models by providing relevant context from a knowledge base. Before answering each question, the system:
- Searches the Wikipedia knowledge base for relevant information
- Retrieves the most relevant chunks (default: 3)
- Augments the question prompt with this context
- Lets the model answer with enriched information
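In rough terms, the retrieve-then-augment step looks like the sketch below. The collection name and prompt template here are assumptions for illustration; `rag_eval.py` is the authoritative implementation.

```python
# Hedged sketch of retrieval-augmented prompting against the ChromaDB knowledge
# base. Collection name and prompt template are assumptions for illustration.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("simple_wikipedia")

def augment_prompt(question: str, n_retrieval: int = 3) -> str:
    # Embed the question and pull the most relevant Wikipedia chunks.
    query_embedding = embedder.encode(question).tolist()
    hits = collection.query(query_embeddings=[query_embedding], n_results=n_retrieval)
    context = "\n\n".join(hits["documents"][0])
    # Prepend the retrieved context so the model answers with enriched information.
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```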
The rag_eval.py script:
- Wraps the base Llama model with RAG capabilities
- Queries ChromaDB for relevant Wikipedia content for each question
- Injects retrieved context into prompts automatically
- Saves results in the same format as baseline evaluations
- Results appear in dashboard as a separate model for easy comparison
Early results show RAG improvement on global facts:
- Baseline model: 25% accuracy (100 questions)
- RAG-enhanced: 30% accuracy (10-question sample)
Results are directly comparable in the dashboard!
- `--n_retrieval N`: Number of Wikipedia chunks to retrieve (default: 3)
- `--chroma_path PATH`: Path to ChromaDB (default: `./chroma_db`)
- All standard lm-eval options work: `--limit`, `--device`, `--tasks`, etc.
- Scale up Wikipedia embeddings: COMPLETE!
  - All Simple Wikipedia articles processed
  - Comprehensive knowledge base ready
- Integrate with evaluations: COMPLETE!
  - RAG evaluation system implemented (`rag_eval.py`)
  - Wikipedia context enhances question answering
  - Dashboard shows baseline vs RAG comparison
- Run comprehensive RAG evaluations:
  - Test on full MMLU global_facts dataset (100 questions)
  - Try other MMLU subjects where Wikipedia helps
  - Experiment with different retrieval settings (n_retrieval=5, 7, 10)
- Enhanced dashboard features:
  - Add knowledge base integration to the dashboard
  - Show which Wikipedia articles were used for each evaluation
  - Display retrieval quality metrics
- Detailed failure analysis:
  - Implement detailed question-by-question evaluation logging
  - Show which specific questions the model got wrong
  - Analyze failure patterns by subject area
- Model comparison framework:
  - Compare multiple models side-by-side
  - Track improvement over time
  - Generate comparison reports
- Custom evaluation tasks:
  - Create domain-specific evaluation benchmarks
  - Test models on specialized knowledge areas
  - Build evaluation suites for specific use cases
- Performance optimization:
  - Optimize embedding creation for larger datasets
  - Implement parallel processing for faster evaluations
  - Add caching for frequently accessed data
- Extended knowledge bases:
  - Add other Wikipedia language versions
  - Include academic papers or technical documentation
  - Create domain-specific knowledge collections
- Retrieval-augmented generation (RAG):
  - Use Wikipedia knowledge to improve model responses
  - Implement context-aware evaluation methods
  - Study the impact of external knowledge on model performance
- Bias and fairness analysis:
  - Analyze model performance across different demographic groups
  - Study knowledge representation biases
  - Implement fairness metrics in evaluations