In-memory similarity search

The in_memory_similarity_search module computes cosine similarity between a query string and a list of candidate strings. It is suited for small to medium candidate sets that fit comfortably in memory.
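For reference, the standard textbook definition of cosine similarity between a query embedding and a candidate embedding is given below. The symbols q and c are introduced here for illustration and do not appear in the module's API.

```latex
\[
\operatorname{sim}(\mathbf{q}, \mathbf{c})
  = \frac{\mathbf{q} \cdot \mathbf{c}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{c} \rVert}
\]
```

The value lies in [-1, 1]; a higher value means the two vectors point in more similar directions.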

Installation

pip install alex-ber-utils[np]

NumPy is a required dependency for this module, and the [np] extra installs it automatically. If NumPy is absent, an ImportWarning is raised at import time and the module is not usable.

For production use with real language model embeddings:

pip install langchain-openai

Quick start

from alexber.utils.in_memory_similarity_search import (
    find_most_similar,
    SimpleEmbeddings,
)

embeddings = SimpleEmbeddings()

candidates = [
    "The weather is sunny today.",
    "It is raining outside.",
    "Python is a great programming language.",
]

index, text = find_most_similar(embeddings, "What is the forecast?", *candidates)
print(index, text)
# 0 The weather is sunny today.

`find_most_similar`

Returns the single best-matching candidate.

def find_most_similar(
    embeddings: Embeddings,
    input_text: str,
    /,
    *args: str,
    verbose: bool = True,
) -> tuple[int, str]:
    ...

embeddings (Embeddings, required): An object that implements the Embeddings protocol. It must expose an embed_documents(texts) method.

input_text (str, required): The query text to compare against the candidates.

*args (str): Positional candidate strings. Pass them as individual arguments: find_most_similar(emb, query, cand1, cand2, cand3).

verbose (bool, default True): When True, logs the similarity score for each candidate at INFO level.

Returns tuple[int, str] — the zero-based index and the text of the most similar candidate.

When *args is empty (no candidates supplied), the function returns (-1, input_text). The negative index signals that no real match was found. Always check for a negative index before using the result.
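One way to make the negative-index check hard to forget is to convert the sentinel into None at the call site. The helper below is a minimal sketch: match_or_none is a hypothetical name, not part of the library, and the tuples passed to it are simulated return values shaped as documented above.

```python
from typing import Optional, Tuple


def match_or_none(result: Tuple[int, str]) -> Optional[Tuple[int, str]]:
    """Return the (index, text) pair, or None when the index is the negative sentinel."""
    index, _text = result
    if index < 0:
        # No candidates were supplied, so there is no real match.
        return None
    return result


# Simulated return values (hypothetical data, not produced by the library here).
print(match_or_none((-1, "What is the forecast?")))  # None
print(match_or_none((0, "The weather is sunny today.")))
```

In real code the argument would be the result of find_most_similar(...), and callers would branch on None instead of inspecting the index.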

`find_most_similar_with_scores`

Returns all candidates ranked by similarity score, highest first.

def find_most_similar_with_scores(
    embeddings: Embeddings,
    input_text: str,
    /,
    *args: str,
    verbose: bool = True,
) -> list[tuple[tuple[int, str], float]]:
    ...

The return type is a list of ((index, text), score) tuples sorted descending by score.

from alexber.utils.in_memory_similarity_search import (
    find_most_similar_with_scores,
    SimpleEmbeddings,
)

embeddings = SimpleEmbeddings()

candidates = [
    "The weather is sunny today.",
    "It is raining outside.",
    "Python is a great programming language.",
]

results = find_most_similar_with_scores(
    embeddings, "What is the forecast?", *candidates
)

for (idx, text), score in results:
    print(f"[{idx}] {score:.4f}  {text}")
# [0] 0.9123  The weather is sunny today.
# [1] 0.8741  It is raining outside.
# [2] 0.5102  Python is a great programming language.

embeddings (Embeddings, required): An object implementing the Embeddings protocol.

input_text (str, required): The query text.

*args (str): Candidate strings to rank.

verbose (bool, default True): Log ranked results at INFO level.

Returns list[tuple[tuple[int, str], float]] — all candidates with scores, highest similarity first. When *args is empty, returns [((-1, input_text), 0.0)].
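Because the result is a plain list of ((index, text), score) tuples, post-processing such as keeping only the top few matches above a threshold needs no library support. The sketch below assumes only the documented return shape; keep_top is a hypothetical helper and the scores in sample are made up for illustration.

```python
from typing import List, Tuple

Ranked = List[Tuple[Tuple[int, str], float]]


def keep_top(results: Ranked, k: int, min_score: float = 0.0) -> Ranked:
    """Keep at most k real matches whose score is at least min_score."""
    # Drop the (-1, input_text) sentinel entry and anything below the threshold.
    real = [item for item in results if item[0][0] >= 0 and item[1] >= min_score]
    # The input is already sorted highest first, so slicing keeps the best k.
    return real[:k]


# Illustrative data in the documented shape (scores are invented).
sample: Ranked = [
    ((0, "The weather is sunny today."), 0.91),
    ((1, "It is raining outside."), 0.87),
    ((2, "Python is a great programming language."), 0.51),
]
top = keep_top(sample, k=2, min_score=0.6)
for (idx, text), score in top:
    print(idx, score, text)
```

Filtering on the index also means the empty-candidates result collapses to an empty list, which is often easier to handle downstream.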

`Embeddings` protocol

Any object with an embed_documents method satisfies the protocol:

from typing import List, Protocol

class Embeddings(Protocol):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """
        Embed a list of documents into a list of float vectors.

        Args:
            texts: List of strings to embed.

        Returns:
            List of embedding vectors, one per input string.
        """
        ...

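Both search functions declare embeddings and input_text as positional-only parameters (the / in the signature). They must be passed by position, with embeddings first, and cannot be passed by keyword. Because verbose follows *args, it can only be passed by keyword.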
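Since the protocol is structural, a custom backend does not need to inherit from anything. The sketch below redefines the protocol locally so that it runs standalone, and WordCountEmbeddings is a hypothetical toy class used only to show the required method shape.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class Embeddings(Protocol):
    # Local copy of the protocol shape shown above, so this sketch is self-contained.
    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...


class WordCountEmbeddings:
    """Hypothetical toy embedder: one dimension per vocabulary word."""

    def __init__(self, vocabulary: List[str]) -> None:
        self.vocabulary = vocabulary

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            words = text.lower().split()
            # Each component counts how often one vocabulary word occurs.
            vectors.append([float(words.count(word)) for word in self.vocabulary])
        return vectors


embedder = WordCountEmbeddings(["rain", "sun", "python"])
vectors = embedder.embed_documents(["Sun and rain", "python python"])
print(isinstance(embedder, Embeddings))  # True: the method is present
print(vectors)  # [[1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
```

An object like this could then be passed as the first argument to either search function in place of SimpleEmbeddings.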

`SimpleEmbeddings`

SimpleEmbeddings is a minimal, self-contained embedding implementation suitable for unit tests and learning exercises.

from alexber.utils.in_memory_similarity_search import SimpleEmbeddings

embeddings = SimpleEmbeddings(dims=1536)  # default dimension
vectors = embeddings.embed_documents(["hello world", "foo bar"])
print(len(vectors[0]))  # 1536

dims (int, default 1536): Dimension of each output vector. The default matches the dimension of OpenAI’s text-embedding-ada-002 model for compatibility in tests.

Internally SimpleEmbeddings maps each character in a text to a fixed-size vector by hashing the character to an index and incrementing that position. This produces character-frequency vectors rather than semantic embeddings.

SimpleEmbeddings does not produce meaningful semantic representations. It is provided for educational purposes and tests only. Do not use it in production — results will not reflect real language similarity.
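The character-counting idea can be illustrated with a few lines of plain Python. This is a sketch of the general technique, not the library's code: char_frequency_vector is a hypothetical name, and the bucket rule ord(ch) % dims is an assumption, since the actual hash used by SimpleEmbeddings is not specified here.

```python
from typing import List


def char_frequency_vector(text: str, dims: int) -> List[float]:
    """Count characters into dims buckets (assumed bucket rule: ord(ch) % dims)."""
    vector = [0.0] * dims
    for ch in text:
        # Map the character to a bucket and increment that position.
        vector[ord(ch) % dims] += 1.0
    return vector


a = char_frequency_vector("listen", 1536)
b = char_frequency_vector("silent", 1536)
print(a == b)  # True: anagrams share the same character counts
```

Two unrelated words built from the same letters get identical vectors, which is why embeddings of this kind say nothing about meaning.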

Using a production embedding backend

Install langchain-openai

pip install langchain-openai

Configure your API key

export OPENAI_API_KEY=sk-...

Pass OpenAIEmbeddings to the search functions

from langchain_openai import OpenAIEmbeddings
from alexber.utils.in_memory_similarity_search import find_most_similar

embeddings = OpenAIEmbeddings()

candidates = [
    "Contact customer support.",
    "View your billing history.",
    "Reset your password.",
]

index, text = find_most_similar(
    embeddings, "How do I change my password?", *candidates
)
print(index, text)
# 2 Reset your password.

OpenAIEmbeddings from langchain-openai implements the same embed_documents interface and is a drop-in replacement for SimpleEmbeddings. Any LangChain-compatible embedding class works the same way.
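Because only embed_documents is required, a backend can also be wrapped, for example to avoid paying for the same text twice when using a remote API. The following is a sketch under stated assumptions: CachingEmbeddings and CountingBackend are hypothetical classes written for this example, and the stand-in backend replaces a real one so the code runs without network access.

```python
from typing import Dict, List


class CachingEmbeddings:
    """Hypothetical wrapper that memoizes vectors per text for any embed_documents backend."""

    def __init__(self, inner) -> None:
        self.inner = inner
        self._cache: Dict[str, List[float]] = {}

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Deduplicate while keeping order, then embed only texts not seen before.
        missing = [t for t in dict.fromkeys(texts) if t not in self._cache]
        if missing:
            for text, vector in zip(missing, self.inner.embed_documents(missing)):
                self._cache[text] = vector
        return [self._cache[t] for t in texts]


class CountingBackend:
    """Stand-in backend for the demo; counts how many texts it was asked to embed."""

    def __init__(self) -> None:
        self.embedded = 0

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        self.embedded += len(texts)
        return [[float(len(t))] for t in texts]


backend = CountingBackend()
cached = CachingEmbeddings(backend)
cached.embed_documents(["a", "bb"])
cached.embed_documents(["bb", "ccc"])
print(backend.embedded)  # 3: "bb" was served from the cache on the second call
```

With a real backend, the wrapper would be constructed as CachingEmbeddings(OpenAIEmbeddings()) and passed to the search functions like any other embeddings object.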

Edge cases

| Situation | Behaviour |
| --- | --- |
| No candidates (`*args` is empty) | Returns `(-1, input_text)` from `find_most_similar`; `[((-1, input_text), 0.0)]` from `find_most_similar_with_scores`. |
| All-zero embedding vectors | Cosine similarity is `0.0` (division-by-zero is caught and set to `0.0`). |
| NaN or Inf in similarity matrix | Replaced with `0.0` automatically. |
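The numeric guards listed above can be sketched in pure Python for a single pair of vectors. This is an illustration of the behaviour described in the table, not the module's implementation (which operates on NumPy arrays); safe_cosine is a hypothetical name.

```python
import math
from typing import Sequence


def safe_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity that returns 0.0 instead of failing or yielding NaN/Inf."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0  # all-zero vector: avoid dividing by zero
    score = dot / norm
    # Replace NaN or Inf with 0.0 so downstream sorting stays well defined.
    return score if math.isfinite(score) else 0.0


print(safe_cosine([0.0, 0.0], [1.0, 2.0]))           # 0.0
print(safe_cosine([1.0, 0.0], [1.0, 0.0]))           # 1.0
print(safe_cosine([float("nan"), 1.0], [1.0, 1.0]))  # 0.0
```

Mapping degenerate cases to 0.0 keeps them in the ranking as weak matches instead of raising or producing values that cannot be sorted.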
