A simple local code search engine. Just point it to your source folder and query away!
Trained on Lightning AI's free platform.
Coded in Python 3.12
Feel free to skip ahead to the notebook if you just want to run the project quickly.
Each library serves a specific purpose:
Please use a virtual environment for all the following steps. Here's a quick start:
- `python -m venv .venv`
- `source .venv/bin/activate`
You do not have to install these packages manually; scroll down to the module install steps to get set up.
- `pip install watchdog`
  For observing local file changes.
- `pip install qdrant-client`
  The vector store in which your (source) functions will be stored and retrieved from.
- `pip install parso`
  Used to accurately chunk your (Python) source files.
- `pip install -U "huggingface_hub[cli]"`
  Required to use models hosted on Hugging Face.
- `pip install -U sentence-transformers`
  For the embedding model(s) the engine can use.
- `pip install "accelerate>=0.26.0"`
  Required for optimal training; alternatively you can use `pip install "transformers[torch]"`.
- `pip install datasets`
  Used to get data for fine-tuning the model(s).
- `pip install matplotlib`
  For plotting the loss (after training).
- `pip install fastapi`
  For quick local access through simple CRUD operations.
- `pip install uvicorn`
  Serves the engine locally.
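To illustrate what the parso-based chunking step does, here is a minimal sketch using the stdlib `ast` module instead (the project itself uses parso, which additionally tolerates syntax errors); the names below are illustrative, not the project's actual code:

```python
import ast

def top_level_functions(source: str) -> list[tuple[str, str]]:
    """Return (name, code) pairs for each top-level function in `source`."""
    tree = ast.parse(source)
    return [
        (node.name, ast.get_source_segment(source, node))
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

source = '''
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
'''

# Each extracted chunk is what would get embedded and stored in the vector DB.
for name, code in top_level_functions(source):
    print(name)
```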
In case you want to manually install these packages:

`pip install watchdog qdrant-client parso "accelerate>=0.26.0" datasets fastapi uvicorn matplotlib -U "huggingface_hub[cli]" sentence-transformers`

- `python -m pip install --upgrade build`
- `python -m build ./src`
- `pip install ./src/dist/scse-2025.0.0-py3-none-any.whl`
- If you're not already authenticated, please do so, or you won't be able to use the models hosted on Hugging Face:
  `hf auth login`
- If you're planning on using RAG, please install Docker and QDrant (see limitations for more info). Make sure Docker is running first, then:
  `docker compose -f "compose.yaml" up -d --build "qdrant"`
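If you don't yet have a `compose.yaml`, a minimal Qdrant service definition might look like this (the image and ports are Qdrant's defaults; the volume path is an arbitrary choice, adjust to taste):

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"   # REST API
      - "6334:6334"   # gRPC
    volumes:
      - ./qdrant_storage:/qdrant/storage
```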
- `python -m scse --source-path="test" --rag-proc --load-and-split-cosqa-test=5`
- `python -m scse --source-path="test" --fine-tune-dir="./tuned" --fine-tune` (optionally provide `--model-name`)
- `python -m scse --source-path="test" --fine-tune-dir="./tuned" --use-tuned` (optionally provide `--model-name`)

I started with the MultipleNegativesRankingLoss function; varying scales
and learning rates delivered little (to no) improvement. What I failed to
consider was using only the strongly labelled docs (those labelled 1). But
increasing the epochs and using (both/only) the strong labels also didn't yield
any better performance. I then considered focusing only on the loss function
instead of the data, which led me down a rabbit hole of papers. One of them
(which ultimately became the solution) used binary cross entropy. Switching to a
primitive implementation of it immediately yielded some better performance. I
also got inspiration for what values to use for the parameters there.
(Rabbit hole endpoint)
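For reference, binary cross entropy scores each (query, code) pair against its 0/1 relevance label directly. A bare-bones version over raw similarity scores might look like this (the sigmoid squashing below is an assumption for illustration, not necessarily the project's exact formulation):

```python
import math

def bce_loss(scores: list[float], labels: list[int]) -> float:
    """Mean binary cross entropy over raw similarity scores.

    Each score is squashed through a sigmoid to get p(label == 1),
    then compared against the 0/1 relevance label.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-score))  # sigmoid
        p = min(max(p, 1e-7), 1.0 - 1e-7)   # clamp to avoid log(0)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(scores)

# A confident correct prediction costs little; a confident wrong one costs a lot.
print(bce_loss([4.0], [1]))  # low loss
print(bce_loss([4.0], [0]))  # high loss
```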
(Uploading and) working in batches also helped speed things up. The batch and chunk sizes were chosen arbitrarily, but ensured a smoother experience overall.
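Batching itself is just slicing the work into fixed-size groups; a generic stdlib sketch (the size used below is an arbitrary placeholder, much like the project's own choice):

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

print(list(batched(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```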
- An interface exists for databases, but the project started with QDrant, hence it was listed as a dependency. Depending on your vector store, SAgentDB can be inherited and used in the SearchEngine; for example, QSAgentDB.
- Full CRUD support. This library currently only supports searching.
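A rough sketch of what such an interface could look like (the method names and signatures below are illustrative guesses, not the project's actual API), with an in-memory dict standing in for Qdrant-backed search:

```python
from abc import ABC, abstractmethod

class SAgentDB(ABC):
    """Abstract vector-store interface the SearchEngine depends on."""

    @abstractmethod
    def upsert(self, ids: list[str], vectors: list[list[float]]) -> None: ...

    @abstractmethod
    def search(self, vector: list[float], top_k: int = 5) -> list[str]: ...

class QSAgentDB(SAgentDB):
    """Qdrant-flavoured implementation (stubbed here with an in-memory dict)."""

    def __init__(self) -> None:
        self._store: dict[str, list[float]] = {}

    def upsert(self, ids, vectors):
        self._store.update(zip(ids, vectors))

    def search(self, vector, top_k=5):
        # Rank by squared Euclidean distance (a stand-in for Qdrant's ANN search).
        def dist(v): return sum((a - b) ** 2 for a, b in zip(vector, v))
        ranked = sorted(self._store, key=lambda i: dist(self._store[i]))
        return ranked[:top_k]
```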
- Ideally there would be mechanisms in place for retrying and for accurate logging. Due to time constraints this may be added after the fact.
- The code was written under ideal assumptions, meaning: "if I didn't encounter that bug, it must not exist". This is obviously absurd and will be addressed when time allows.
- (Stored) Functions are limited to top-level; I need to devise a smarter/more efficient way to handle nested functions.
- Stricter OOP. No getters/setters for all properties.