# Nayan's AI Assistant

An intelligent, self-aware AI chatbot that serves as a dynamic, interactive portfolio, powered by a sophisticated RAG pipeline and advanced NLP.
- Introduction
- Core Features
- Architecture & Tech Stack
- Setup and Local Installation
- Environment Configuration
- Future Improvements
- Get In Touch
## Introduction

Nayan's AI Assistant is a full-stack chatbot application designed to be more than just an information source: it's an intelligent, interactive representation of my professional profile. It leverages a Hybrid Retrieval-Augmented Generation (RAG) pipeline to give recruiters and collaborators accurate, context-aware answers about my skills, projects, and background.

This project was built to demonstrate a deep understanding of modern AI application development, from sophisticated NLP-powered guardrails and conversational memory to a complete, cloud-based analytics pipeline for monitoring user interactions in real time.
## Core Features

This project is more than a simple Q&A bot; it's an end-to-end showcase of modern AI application development.

- 🧠 **Hybrid RAG System:** A multi-step retrieval strategy ensures fast, accurate, and relevant answers. The logic prioritizes responses in the following order:
  1. **High-Confidence Semantic Match:** Uses a `BAAI/bge-small-en` sentence transformer to find the most similar question in a pre-computed vector database. An answer is returned if the cosine similarity score is ≥ 0.87.
  2. **Lexical Fuzzy Match:** If semantic search fails, it uses `fuzzywuzzy`'s token sort ratio to find a close match. An answer is returned if the score is ≥ 90.
  3. **Generative Fallback with Context:** For novel or nuanced questions, the bot uses OpenAI's GPT-4o model, providing it with the recent conversation history for context.
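The three-tier cascade can be sketched in a few lines. This is a simplified illustration, not the project's actual code: the stdlib `difflib` stands in for `fuzzywuzzy`, cosine similarity is computed by hand instead of via scikit-learn, and the final tier returns a sentinel where the real app would call GPT-4o with conversation history:

```python
import math
from difflib import SequenceMatcher

SEMANTIC_THRESHOLD = 0.87  # cosine similarity cutoff from the README
FUZZY_THRESHOLD = 0.90     # lexical cutoff (fuzzywuzzy uses a 0-100 scale)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def answer(query_vec, query_text, kb):
    """kb: list of dicts with 'vec', 'question', and 'answer' keys."""
    # 1. High-confidence semantic match against pre-computed embeddings.
    best = max(kb, key=lambda item: cosine(query_vec, item["vec"]))
    if cosine(query_vec, best["vec"]) >= SEMANTIC_THRESHOLD:
        return best["answer"], "semantic"
    # 2. Lexical fuzzy match (difflib's ratio stands in for token_sort_ratio).
    best = max(kb, key=lambda item: SequenceMatcher(None, query_text, item["question"]).ratio())
    if SequenceMatcher(None, query_text, best["question"]).ratio() >= FUZZY_THRESHOLD:
        return best["answer"], "fuzzy"
    # 3. Generative fallback (placeholder for the GPT-4o call with history).
    return None, "llm_fallback"
```

The key design point is that each tier only runs if the one above it fails its threshold, so the cheapest, most deterministic answer always wins.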
- 🛡️ **Intelligent Guardrails:**
  - **NER-Powered Scope Control:** Utilizes `spaCy` for Named Entity Recognition (NER) to detect whether a question mentions another person's name. This prevents the bot from answering questions outside its scope of representing Nayan Reddy Soma.
  - **Sensitive Topic Filtering:** A custom keyword filter deflects inappropriate or overly personal questions with professional, pre-defined responses.
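A minimal sketch of how such a guardrail gate might look. The keyword list and deflection messages below are illustrative, not the project's actual values, and the spaCy NER lookup (which would collect `PERSON` entities from the question) is stubbed out:

```python
# Illustrative values only; the real app uses its own keyword list and spaCy NER.
SENSITIVE_KEYWORDS = {"salary", "relationship", "religion", "politics"}
DEFLECTION = "I'd rather keep things professional. Ask me about Nayan's work!"

def person_names(text):
    """Stub: the real app runs spaCy and returns PERSON entities found in text."""
    return []

def check_guardrails(question):
    """Return a canned deflection string, or None if the question may proceed."""
    lowered = question.lower()
    if any(word in lowered for word in SENSITIVE_KEYWORDS):
        return DEFLECTION
    # Scope control: deflect if the question names someone other than Nayan.
    other_people = [n for n in person_names(question) if "nayan" not in n.lower()]
    if other_people:
        return "I can only answer questions about Nayan Reddy Soma."
    return None  # no guardrail triggered; continue to retrieval
```

Running the guardrails before retrieval keeps out-of-scope questions from ever reaching the RAG pipeline or the LLM.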
- 📊 **Real-time Analytics Pipeline:**
  - Every user interaction with the live chatbot is logged in real time to a Google Sheet using the `gspread` API.
  - Data points captured include `session_id`, `timestamp`, `user_query`, `final_response`, `response_source` (e.g., `fallback`, `llm_general`), and `response_time_ms`.
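Assembling one such log row could look like the sketch below. The helper name and column order are assumptions for illustration; in the live app the finished row would be appended to the sheet with gspread's `worksheet.append_row()`:

```python
import time
from datetime import datetime, timezone

def build_log_row(session_id, user_query, final_response, response_source, started_at):
    """Assemble one analytics row matching the columns logged by the chatbot.
    started_at is a time.monotonic() timestamp taken when the query arrived."""
    return [
        session_id,
        datetime.now(timezone.utc).isoformat(),              # timestamp
        user_query,
        final_response,
        response_source,                                     # e.g. "fallback", "llm_general"
        round((time.monotonic() - started_at) * 1000),       # response_time_ms
    ]
```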
- 📈 **Decoupled Analytics Dashboard:**
  - A separate Streamlit app (`analytics.py`) reads the live data from Google Sheets to provide insights on:
    - **KPIs:** Total users, total questions, average questions per user, and average response time.
    - **Performance:** A pie chart showing the distribution of response sources (how often the RAG system vs. the LLM provides an answer).
    - **Engagement:** A bar chart of daily usage and a table of the most frequently asked questions.
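The KPI math behind such a dashboard is straightforward. A stdlib-only sketch, assuming the logged rows have been read back into dicts (the real app uses pandas on the Google Sheets data):

```python
from collections import Counter
from statistics import mean

def compute_kpis(rows):
    """rows: list of dicts with session_id, response_source, and
    response_time_ms keys (a subset of the logged columns)."""
    sessions = {r["session_id"] for r in rows}
    return {
        "total_users": len(sessions),
        "total_questions": len(rows),
        "avg_questions_per_user": len(rows) / max(len(sessions), 1),
        "avg_response_time_ms": mean(r["response_time_ms"] for r in rows) if rows else 0,
        # Counts per source feed the response-source pie chart.
        "source_distribution": Counter(r["response_source"] for r in rows),
    }
```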
- 🗣️ **Context-Aware Follow-ups:** The chatbot remembers the context of the last interaction, allowing it to handle follow-up questions like "tell me more about that" or "why was that important?" with high relevance.
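One common way to feed that conversational memory to the LLM is to prepend the most recent turns to the prompt. An illustrative sketch; the history format and the three-turn window are assumptions, not the project's exact implementation:

```python
def build_prompt(history, question, max_turns=3):
    """Prepend recent (user, assistant) turns so the LLM can resolve
    references like 'tell me more about that' in the new question."""
    lines = []
    for user_msg, bot_msg in history[-max_turns:]:
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {bot_msg}")
    lines.append(f"User: {question}")
    return "\n".join(lines)
```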
## Architecture & Tech Stack

This project is built with a modern, end-to-end Python stack designed for performance and scalability.
| Category | Technology / Library | Purpose |
|---|---|---|
| Web Framework | Streamlit | For building the interactive chat UI and the analytics dashboard. |
| Backend Logic | Python 3.10+ | Core application logic, data processing, and integrations. |
| NLP (Retrieval) | sentence-transformers | To generate vector embeddings for semantic search (`BAAI/bge-small-en` model). |
| | scikit-learn | For calculating cosine similarity between text embeddings. |
| | fuzzywuzzy | For lexical fuzzy string matching as a secondary retrieval layer. |
| NLP (Guardrails) | spaCy (`en_core_web_sm`) | For Named Entity Recognition (NER) to power the smart scope-control guardrail. |
| Generative AI | OpenAI GPT-4o | The final generative layer for handling novel and conversational questions. |
| Database & Logging | Google Sheets API (`gspread`) | A robust and free solution for real-time logging and data collection from the cloud. |
| Data Analysis | pandas, plotly | For data manipulation and creating visualizations in the analytics dashboard. |
| Deployment | Streamlit Community Cloud | For hosting the live chatbot and analytics dashboard. |
| Dependencies | joblib, numpy | For serializing/deserializing the embedding file and numerical operations. |
| Environment Mgmt | python-dotenv | To manage local environment variables. |
## Setup and Local Installation

To run this project on your local machine, follow these steps:
1. **Clone the Repository:**

   ```bash
   git clone https://github.com/Nayan-Reddy/Nayan-chatbot.git
   cd Nayan-chatbot
   ```

2. **Set Up a Virtual Environment:**

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. **Install Dependencies:** The `requirements.txt` file is configured to install CPU-specific versions of PyTorch and the spaCy model directly for efficiency.

   ```bash
   pip install -r requirements.txt
   ```

4. **Generate Embeddings:** This is a crucial one-time preprocessing step. Run this script to create the `fallback_embeddings.pkl` file from your Q&A data.

   ```bash
   python generate_embeddings.py
   ```

5. **Configure Environment Variables/Secrets:** Follow the instructions in the Environment Configuration section below to set up your API keys and credentials.

6. **Run the Apps:** You can now run the chatbot and the analytics dashboard locally.

   - Run the chatbot:

     ```bash
     streamlit run app.py
     ```

   - Run the analytics dashboard:

     ```bash
     streamlit run analytics.py
     ```
## Environment Configuration

The application requires two separate files for credentials:
1. **For the GitHub API Token (via OpenAI):**

   - Create a file named `.env` in the project's root directory.
   - Add your GitHub token (used for the OpenAI proxy):

     ```bash
     # .env
     GITHUB_TOKEN="ghp_YOUR_TOKEN_HERE"
     ```
2. **For Google Sheets Logging:**

   - Create a folder named `.streamlit` in the project's root directory.
   - Inside that folder, create a file named `secrets.toml`.
   - Paste your Google Cloud Platform service account JSON credentials here. This file is used by both `app.py` and `analytics.py`:

     ```toml
     # .streamlit/secrets.toml
     [gcp_service_account]
     type = "service_account"
     project_id = "your-gcp-project-id"
     private_key_id = "your-private-key-id"
     private_key = "-----BEGIN PRIVATE KEY-----\nYOUR-PRIVATE-KEY\n-----END PRIVATE KEY-----\n"
     client_email = "your-client-email@your-gcp-project-id.iam.gserviceaccount.com"
     client_id = "your-client-id"
     # ... and so on for the rest of the JSON key file.
     ```
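For local runs, `python-dotenv`'s `load_dotenv()` reads the `.env` file into the process environment. As a rough illustration of what that step does (a minimal stdlib stand-in, not the library itself):

```python
import os

def load_dotenv_minimal(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv(): reads KEY=VALUE
    lines into os.environ without overwriting keys that are already set."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

The real library handles quoting, interpolation, and overrides more carefully, so prefer it in the actual app.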
## Future Improvements

This project has a strong foundation that can be extended with even more features:
- **User Feedback System:** Add thumbs-up/down buttons to log user satisfaction with responses directly into the Google Sheet for finer-grained analysis.
- **Advanced Analytics:** Use semantic clustering on the logged questions to identify common user intents not yet covered in `fallback_qna.json`.
- **Multi-Modal Capabilities:** Integrate tools to display images of projects or architecture diagrams directly in the chat when asked.
## Get In Touch

I'm a passionate data enthusiast actively seeking opportunities in data analytics and AI. If you're impressed by this project or have any questions, I'd love to connect!

- Email: nayanreddy007@gmail.com

