jalenna/code-search-engine

Code Search Engine

A simple local code search engine. Just point it to your source folder and query away!

Trained on Lightning AI's free platform.

Coded in Python 3.12

Getting Started

If you just want to run the project quickly, feel free to skip ahead to the notebook.

Dependencies

Each library serves a specific purpose:

Before you start (1/2)

Please use a virtual environment for all the following steps. Here's a quick start:

  • python -m venv .venv
  • source .venv/bin/activate

You do not have to install these packages manually; scroll down to Building the module to get set up.

RAG

  • pip install watchdog

    For observing local file changes.

  • pip install qdrant-client

    The vector store in which your (source) functions are stored and from which they are retrieved.

  • pip install parso

    Used to accurately chunk your (Python) source files.

Models & Training

  • pip install -U "huggingface_hub[cli]"

    Required to use models hosted on Hugging Face.

  • pip install -U sentence-transformers

    For the embedding model(s) the engine can use.

  • pip install "accelerate>=0.26.0"

    Required for optimal training. Alternatively, you can use:

    pip install "transformers[torch]"

  • pip install datasets

    Used to get data for fine-tuning the model(s).

  • pip install matplotlib

    For plotting the loss (after training).

Serving

  • pip install fastapi

    For quick local access through simple CRUD operations.

  • pip install uvicorn

    Serves the engine locally.

One-liner

In case you want to manually install these packages:

pip install -U watchdog qdrant-client parso "accelerate>=0.26.0" datasets fastapi uvicorn matplotlib "huggingface_hub[cli]" sentence-transformers

Building the module

  • python -m pip install --upgrade build
  • python -m build ./src
  • pip install ./src/dist/scse-2025.0.0-py3-none-any.whl

Before you start (2/2)

  • If you're not already authenticated with Hugging Face, log in first, or you won't be able to use the models hosted there.
    hf auth login
  • If you're planning on using RAG, please install Docker and Qdrant. See Limitations for more info.

Examples

RAG on local CoSQA

Make sure Docker is running first, then:

  • docker compose -f "compose.yaml" up -d --build "qdrant"
  • python -m scse --source-path="test" --rag-proc --load-and-split-cosqa-test=5

Tuning a model

python -m scse --source-path="test" --fine-tune-dir="./tuned" --fine-tune # Optionally provide --model-name

Using a tuned model

python -m scse --source-path="test" --fine-tune-dir="./tuned" --use-tuned # Optionally provide --model-name

Optimizations

I started with the MultipleNegativesRankingLoss function, but varying its scale and the learning rate delivered little to no improvement. What I had failed to consider was using only the strongly labelled docs (those labelled 1). But increasing the epochs and using either both labels or only the strong ones didn't yield any better performance either. I then shifted my focus from the data to the loss function, which led me into a rabbit hole of papers. One of them (which ultimately became the solution) used binary cross entropy. Switching to a primitive implementation of it immediately yielded better performance. I also took inspiration from that paper for the parameter values. (Rabbit hole endpoint)
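To make the switch to binary cross entropy concrete, here is a minimal sketch in plain Python. It is not the project's training code: the function names, the `scale` parameter, and the toy vectors are all assumptions chosen for illustration. It scores each (query, code) pair with cosine similarity, squashes the scaled score through a sigmoid, and penalises it against the 0/1 label.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def bce_loss(scores: list[float], labels: list[int], scale: float = 1.0) -> float:
    """Mean binary cross entropy over similarity scores and 0/1 labels.

    `scale` is a hypothetical temperature applied before the sigmoid.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for score, label in zip(scores, labels):
        prob = 1.0 / (1.0 + math.exp(-scale * score))
        total -= label * math.log(prob + eps) + (1 - label) * math.log(1.0 - prob + eps)
    return total / len(scores)


# Toy embeddings (made up): one query, one matching doc, one non-matching doc.
query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.0, 1.0]]
labels = [1, 0]

scores = [cosine_similarity(query, doc) for doc in docs]
loss = bce_loss(scores, labels, scale=5.0)
```

Unlike a ranking loss that only uses positives and in-batch negatives, this formulation uses every labelled pair directly, so pairs labelled 0 contribute to the gradient as explicit negatives.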

Uploading and working in batches also helped speed things up. The batch and chunk sizes were chosen arbitrarily but ensured a smoother experience overall.
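The batching idea can be sketched as follows. This is an illustration only: `batched`, `upload_in_batches`, the `upload` callback, and the default batch size of 64 are hypothetical and not taken from the project.

```python
from collections.abc import Callable, Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


def upload_in_batches(
    chunks: Iterable[T],
    upload: Callable[[list[T]], None],
    batch_size: int = 64,
) -> int:
    """Send chunks to `upload` one batch at a time and return the total count."""
    uploaded = 0
    for batch in batched(chunks, batch_size):
        upload(batch)
        uploaded += len(batch)
    return uploaded
```

On Python 3.12, `itertools.batched` provides similar behaviour but yields tuples instead of lists.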

Limitations

Vector Store

  • An interface exists for databases, but the project started with Qdrant, which is why it is listed as a dependency. To use a different vector store, inherit from SAgentDB and pass your subclass to the SearchEngine. See QSAgentDB for an example.
  • No full CRUD support yet. This library currently only supports searching.
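As a rough illustration of the inheritance pattern described above, the sketch below defines a stand-in interface and an in-memory implementation. The real SAgentDB methods are not documented here, so `VectorStore`, `InMemoryStore`, `add`, and `search` are hypothetical names and signatures, not the project's API.

```python
import math
from abc import ABC, abstractmethod


class VectorStore(ABC):
    """Hypothetical stand-in for the project's database interface."""

    @abstractmethod
    def add(self, key: str, vector: list[float]) -> None:
        """Store one embedding under a key (e.g. a function name)."""

    @abstractmethod
    def search(self, query: list[float], limit: int = 3) -> list[tuple[str, float]]:
        """Return up to `limit` (key, score) pairs, best match first."""


class InMemoryStore(VectorStore):
    """Toy backend that keeps vectors in a dict and ranks by cosine similarity."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, key: str, vector: list[float]) -> None:
        self._vectors[key] = vector

    def search(self, query: list[float], limit: int = 3) -> list[tuple[str, float]]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(
            ((key, cosine(query, vector)) for key, vector in self._vectors.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
        return ranked[:limit]
```

Code that depends only on the abstract class can then be handed any backend, which is the point of keeping the database behind an interface.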

Error Handling

Ideally there would be mechanisms in place for retrying failed operations and logging them accurately. Due to time constraints, this may be added later.
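One possible shape for such a mechanism is sketched below, using only the standard library. Nothing here exists in the project yet: `with_retries`, its parameters, and the logger name are assumptions made for illustration.

```python
import logging
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")
logger = logging.getLogger("retry_sketch")


def with_retries(operation: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Call `operation`, retrying with exponential backoff when it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            if attempt == attempts:
                logger.error("giving up after %d attempts: %s", attempts, error)
                raise
            # Delay doubles after each failed attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("attempt %d failed (%s); retrying in %.2fs", attempt, error, delay)
            time.sleep(delay)
    # Only reached when the loop never ran.
    raise ValueError("attempts must be at least 1")
```

A call to the vector store or a model download could be wrapped as `with_retries(lambda: some_call())`, where `some_call` is whatever operation is prone to transient failures.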

Tests

The code was written under ideal assumptions, meaning: "if I didn't encounter that bug, it must not exist". This is obviously absurd and will be addressed when time allows for it.

Other

  • Stored functions are limited to top-level ones; I still need to devise a smarter, more efficient way to handle nested functions.
  • Stricter OOP is still missing: not every property has getters/setters.
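For the nested-function limitation, one possible direction is a recursive walk that records each function under a dotted name. The sketch below uses the standard library's `ast` module in place of parso, so it is not how the project chunks files today; `collect_functions` and the sample source are hypothetical.

```python
import ast


def collect_functions(source: str) -> dict[str, str]:
    """Map dotted function names to their source text, including nested ones."""
    found: dict[str, str] = {}

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                name = f"{prefix}.{child.name}" if prefix else child.name
                # Classes only contribute to the dotted name; functions are stored.
                if not isinstance(child, ast.ClassDef):
                    found[name] = ast.get_source_segment(source, child) or ""
                visit(child, name)
            else:
                # Keep descending so functions inside if/try/with blocks are found.
                visit(child, prefix)

    visit(ast.parse(source), "")
    return found


SAMPLE = '''
def outer():
    def inner():
        return 1
    return inner()

class Box:
    def size(self):
        return 0
'''

functions = collect_functions(SAMPLE)
```

Here `functions` has the keys `outer`, `outer.inner`, and `Box.size`. parso exposes a different tree API, so the traversal would need adapting before it could replace the current chunking.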
