A simple local code search engine. Just point it to your source folder and query away!
Trained on Lightning AI's free platform.
Coded in Python 3.12
Feel free to skip ahead to the notebook if you just want to run the project quickly.
Each library serves a specific purpose:
Please use a virtual environment for all the following steps. Here's a quick start:
- `python -m venv .venv`
- `source .venv/bin/activate`
You do not have to install these packages manually; scroll down to the module install steps to get set up.
- `pip install watchdog`
  For observing local file changes.
- `pip install qdrant-client`
  The vector store in which your (source) functions will be stored and retrieved from.
- `pip install parso`
  Used to accurately chunk your (Python) source files.
- `pip install -U "huggingface_hub[cli]"`
  Required to use models hosted on Hugging Face.
- `pip install -U sentence-transformers`
  For the embedding model(s) the engine can use.
- `pip install "accelerate>=0.26.0"`
  Required for optimal training; alternatively you can use `pip install "transformers[torch]"`.
- `pip install datasets`
  Used to get data for fine-tuning the model(s).
- `pip install matplotlib`
  For plotting the loss (after training).
- `pip install fastapi`
  For quick local access through simple CRUD operations.
- `pip install uvicorn`
  Serves the engine locally.
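To illustrate what the parso-based chunking step does, here is a minimal sketch using the stdlib `ast` module instead (the project itself uses parso, which additionally tolerates syntax errors); the names below are illustrative, not the project's actual code:

```python
import ast

def top_level_functions(source: str) -> list[tuple[str, str]]:
    """Return (name, code) pairs for each top-level function in `source`."""
    tree = ast.parse(source)
    return [
        (node.name, ast.get_source_segment(source, node))
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

source = '''
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
'''

# Each extracted chunk is what would get embedded and stored in the vector DB.
for name, code in top_level_functions(source):
    print(name)
```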
In case you want to manually install these packages:

`pip install watchdog qdrant-client parso "accelerate>=0.26.0" datasets fastapi uvicorn matplotlib -U "huggingface_hub[cli]" sentence-transformers`

- `python -m pip install --upgrade build`
- `python -m build ./src`
- `pip install ./src/dist/scse-2025.0.0-py3-none-any.whl`
- If you're not already authenticated, please do so, or you won't be able to use the models hosted on Hugging Face:
  `hf auth login`
- If you're planning on using RAG, please install Docker and QDrant (see limitations for more info). Make sure Docker is running first, then:
  `docker compose -f "compose.yaml" up -d --build "qdrant"`
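If you don't yet have a `compose.yaml`, a minimal Qdrant service definition might look like this (the image and ports are Qdrant's defaults; the volume path is an arbitrary choice, adjust to taste):

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"   # REST API
      - "6334:6334"   # gRPC
    volumes:
      - ./qdrant_storage:/qdrant/storage
```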
- `python -m scse --source-path="test" --rag-proc --load-and-split-cosqa-test=5`
- `python -m scse --source-path="test" --fine-tune-dir="./tuned" --fine-tune` (optionally provide `--model-name`)
- `python -m scse --source-path="test" --fine-tune-dir="./tuned" --use-tuned` (optionally provide `--model-name`)

I started with the MultipleNegativesRankingLoss function; varying scales
and learning rates delivered little (to no) improvement. What I failed to
consider was using only the strongly labelled docs (those labelled 1). But
increasing the epochs and using (both/only) the strong labels also didn't yield
any better performance. I then considered focusing only on the loss function
instead of the data, which led me down a rabbit hole of papers. One of them
(which ultimately became the solution) used binary cross entropy. Switching to a
primitive implementation of it immediately yielded some better performance. I
also got inspiration for what values to use for the parameters there.
(Rabbit hole endpoint)
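For reference, binary cross entropy scores each (query, code) pair against its 0/1 relevance label directly. A bare-bones version over raw similarity scores might look like this (the sigmoid squashing below is an assumption for illustration, not necessarily the project's exact formulation):

```python
import math

def bce_loss(scores: list[float], labels: list[int]) -> float:
    """Mean binary cross entropy over raw similarity scores.

    Each score is squashed through a sigmoid to get p(label == 1),
    then compared against the 0/1 relevance label.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-score))  # sigmoid
        p = min(max(p, 1e-7), 1.0 - 1e-7)   # clamp to avoid log(0)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(scores)

# A confident correct prediction costs little; a confident wrong one costs a lot.
print(bce_loss([4.0], [1]))  # low loss
print(bce_loss([4.0], [0]))  # high loss
```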
(Uploading and) working in batches also helped speed things up. The batch and chunk sizes were chosen arbitrarily, but ensured a smoother experience overall.
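Batching itself is just slicing the work into fixed-size groups; a generic stdlib sketch (the size used below is an arbitrary placeholder, much like the project's own choice):

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

print(list(batched(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```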
- An interface exists for databases, but the project started with QDrant, hence it was listed as a dependency. Depending on your vector store, SAgentDB can be inherited and used in the SearchEngine; for example, QSAgentDB.
- Full CRUD support. This library currently only supports searching.
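A rough sketch of what such an interface could look like (the method names and signatures below are illustrative guesses, not the project's actual API), with an in-memory dict standing in for Qdrant-backed search:

```python
from abc import ABC, abstractmethod

class SAgentDB(ABC):
    """Abstract vector-store interface the SearchEngine depends on."""

    @abstractmethod
    def upsert(self, ids: list[str], vectors: list[list[float]]) -> None: ...

    @abstractmethod
    def search(self, vector: list[float], top_k: int = 5) -> list[str]: ...

class QSAgentDB(SAgentDB):
    """Qdrant-flavoured implementation (stubbed here with an in-memory dict)."""

    def __init__(self) -> None:
        self._store: dict[str, list[float]] = {}

    def upsert(self, ids, vectors):
        self._store.update(zip(ids, vectors))

    def search(self, vector, top_k=5):
        # Rank by squared Euclidean distance (a stand-in for Qdrant's ANN search).
        def dist(v): return sum((a - b) ** 2 for a, b in zip(vector, v))
        ranked = sorted(self._store, key=lambda i: dist(self._store[i]))
        return ranked[:top_k]
```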
- Ideally there would be mechanisms in place for retrying and for accurate logging. Due to time constraints this may be added after the fact.
- The code was written under ideal assumptions, meaning: "if I didn't encounter that bug, it must not exist". This is obviously absurd and will be addressed when time allows.
- (Stored) Functions are limited to top-level; I need to devise a smarter/more efficient way to handle nested functions.
- Stricter OOP. No getters/setters for all properties.