The in_memory_similarity_search module computes cosine similarity between a query string and a list of candidate strings. It is suited for small to medium candidate sets that fit comfortably in memory.

Installation

pip install alex-ber-utils[np]
NumPy is a required dependency for this module; the [np] extra installs it automatically. If NumPy is absent, an ImportWarning is issued at import time and the module is not usable.
For production use with real language model embeddings:
pip install langchain-openai

Quick start

from alexber.utils.in_memory_similarity_search import (
    find_most_similar,
    SimpleEmbeddings,
)

embeddings = SimpleEmbeddings()

candidates = [
    "The weather is sunny today.",
    "It is raining outside.",
    "Python is a great programming language.",
]

index, text = find_most_similar(embeddings, "What is the forecast?", *candidates)
print(index, text)
# Prints the index and text of the highest-scoring candidate, for example:
# 0 The weather is sunny today.

find_most_similar

Returns the single best-matching candidate.
def find_most_similar(
    embeddings: Embeddings,
    input_text: str,
    /,
    *args: str,
    verbose: bool = True,
) -> tuple[int, str]:
    ...
embeddings (Embeddings, required): An object that implements the Embeddings protocol; it must expose an embed_documents(texts) method.
input_text (str, required): The query text to compare against the candidates.
*args (str): Positional candidate strings. Pass them as individual arguments: find_most_similar(emb, query, cand1, cand2, cand3).
verbose (bool, default True): When True, logs the similarity score for each candidate at INFO level.
Returns tuple[int, str] — the zero-based index and the text of the most similar candidate.
When *args is empty (no candidates supplied), the function returns (-1, input_text). The negative index signals that no real match was found. Always check for a negative index before using the result.
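The negative-index check can be wrapped in a small guard. The sketch below is a hypothetical helper (match_or_default is not part of the library); it only assumes the (index, text) return shape described above.

```python
from typing import Optional, Tuple


def match_or_default(
    result: Tuple[int, str], default: Optional[str] = None
) -> Optional[str]:
    """Return the matched text, or `default` when the index is the -1 sentinel.

    Hypothetical helper, not part of alexber.utils.
    """
    index, text = result
    if index < 0:
        # No candidates were supplied; `text` is only the echoed query.
        return default
    return text


# The tuples mirror the documented return shape of find_most_similar.
print(match_or_default((2, "Reset your password.")))           # Reset your password.
print(match_or_default((-1, "How do I change my password?")))  # None
```

Returning a caller-chosen default keeps the sentinel from leaking into downstream code that would otherwise treat the echoed query as a real match.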

find_most_similar_with_scores

Returns all candidates ranked by similarity score, highest first.
def find_most_similar_with_scores(
    embeddings: Embeddings,
    input_text: str,
    /,
    *args: str,
    verbose: bool = True,
) -> list[tuple[tuple[int, str], float]]:
    ...
The return type is a list of ((index, text), score) tuples sorted descending by score.
from alexber.utils.in_memory_similarity_search import (
    find_most_similar_with_scores,
    SimpleEmbeddings,
)

embeddings = SimpleEmbeddings()

candidates = [
    "The weather is sunny today.",
    "It is raining outside.",
    "Python is a great programming language.",
]

results = find_most_similar_with_scores(
    embeddings, "What is the forecast?", *candidates
)

for (idx, text), score in results:
    print(f"[{idx}] {score:.4f}  {text}")
# Illustrative output; actual scores depend on the embedding backend:
# [0] 0.9123  The weather is sunny today.
# [1] 0.8741  It is raining outside.
# [2] 0.5102  Python is a great programming language.
embeddings (Embeddings, required): An object implementing the Embeddings protocol.
input_text (str, required): The query text.
*args (str): Candidate strings to rank.
verbose (bool, default True): Log ranked results at INFO level.
Returns list[tuple[tuple[int, str], float]] — all candidates with scores, highest similarity first. When *args is empty, returns [((-1, input_text), 0.0)].
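Because the full ranking is returned, callers often want only the strongest matches. The following is a minimal sketch of such post-processing, assuming only the documented list shape; filter_ranked, min_score, and top_k are hypothetical names, and the sample scores are made up.

```python
from typing import List, Tuple

Ranked = List[Tuple[Tuple[int, str], float]]


def filter_ranked(results: Ranked, min_score: float = 0.0, top_k: int = 3) -> Ranked:
    """Keep at most top_k real candidates scoring at least min_score.

    Hypothetical helper; assumes `results` is already sorted descending by score.
    """
    kept = [
        ((idx, text), score)
        for (idx, text), score in results
        if idx >= 0 and score >= min_score  # idx < 0 is the empty-candidates sentinel
    ]
    return kept[:top_k]


# Illustrative data in the documented shape (scores are invented).
sample = [((0, "sunny"), 0.91), ((1, "raining"), 0.87), ((2, "python"), 0.51)]
print(filter_ranked(sample, min_score=0.8))
# [((0, 'sunny'), 0.91), ((1, 'raining'), 0.87)]
print(filter_ranked([((-1, "query"), 0.0)]))
# []
```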

Embeddings protocol

Any object with an embed_documents method satisfies the protocol:
from typing import List, Protocol

class Embeddings(Protocol):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """
        Embed a list of documents into a list of float vectors.

        Args:
            texts: List of strings to embed.

        Returns:
            List of embedding vectors, one per input string.
        """
        ...
Both search functions declare embeddings and input_text as positional-only (the / in their signatures), so pass them positionally and in that order. Supplying either one as a keyword argument raises a TypeError.
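Since the protocol is structural, an object needs no base class to qualify; a matching embed_documents method is enough. As an illustration, the sketch below wraps any such object in a memoizing layer. CachedEmbeddings and LengthEmbeddings are hypothetical names invented for this example, and the Embeddings class is a local copy of the shape shown above so that the snippet runs on its own.

```python
from typing import Dict, List, Protocol


class Embeddings(Protocol):
    # Local copy of the protocol shape, so this sketch is self-contained.
    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...


class CachedEmbeddings:
    """Hypothetical wrapper that memoizes one vector per distinct text."""

    def __init__(self, inner: Embeddings) -> None:
        self._inner = inner
        self._cache: Dict[str, List[float]] = {}

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Embed only texts not seen before, de-duplicated in first-seen order.
        missing = [t for t in dict.fromkeys(texts) if t not in self._cache]
        if missing:
            for text, vec in zip(missing, self._inner.embed_documents(missing)):
                self._cache[text] = vec
        return [self._cache[t] for t in texts]


class LengthEmbeddings:
    """Stand-in backend for the demo: a one-dimensional vector per text."""

    def __init__(self) -> None:
        self.calls = 0

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        self.calls += 1
        return [[float(len(t))] for t in texts]


backend = LengthEmbeddings()
cached = CachedEmbeddings(backend)
print(cached.embed_documents(["ab", "abc", "ab"]))  # [[2.0], [3.0], [2.0]]
print(cached.embed_documents(["abc"]))              # [[3.0]]
print(backend.calls)                                # 1
```

A wrapper like this may be worth considering when the same candidate list is searched repeatedly against a paid embedding API, since each distinct text is then embedded only once.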

SimpleEmbeddings

SimpleEmbeddings is a minimal, self-contained embedding implementation suitable for unit tests and learning exercises.
from alexber.utils.in_memory_similarity_search import SimpleEmbeddings

embeddings = SimpleEmbeddings(dims=1536)  # default dimension
vectors = embeddings.embed_documents(["hello world", "foo bar"])
print(len(vectors[0]))  # 1536
dims (int, default 1536): Dimension of each output vector. The default matches the dimension of OpenAI’s text-embedding-ada-002 model for compatibility in tests.
Internally, SimpleEmbeddings maps each text to a fixed-size vector by hashing every character to an index and incrementing the value at that position. This produces character-frequency vectors rather than semantic embeddings.
SimpleEmbeddings does not produce meaningful semantic representations. It is provided for educational purposes and tests only. Do not use it in production — results will not reflect real language similarity.
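To see why character-frequency vectors carry no meaning, consider the rough sketch below. It is not the library's implementation: char_frequency_vector is a hypothetical function, and ord(ch) % dims is an assumed stand-in for whatever hash SimpleEmbeddings actually uses.

```python
from typing import List


def char_frequency_vector(text: str, dims: int = 16) -> List[float]:
    """Count characters into `dims` buckets.

    Assumption: ord(ch) % dims stands in for the library's real hash function.
    """
    vec = [0.0] * dims
    for ch in text:
        vec[ord(ch) % dims] += 1.0
    return vec


a = char_frequency_vector("listen")
b = char_frequency_vector("silent")
print(a == b)  # True: anagrams contain the same characters, so the vectors match
```

Two unrelated words that happen to share letters are indistinguishable under this scheme, which is the reason such an embedding is only appropriate for tests.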

Using a production embedding backend

1. Install langchain-openai

pip install langchain-openai

2. Configure your API key

export OPENAI_API_KEY=sk-...

3. Pass OpenAIEmbeddings to the search functions

from langchain_openai import OpenAIEmbeddings
from alexber.utils.in_memory_similarity_search import find_most_similar

embeddings = OpenAIEmbeddings()

candidates = [
    "Contact customer support.",
    "View your billing history.",
    "Reset your password.",
]

index, text = find_most_similar(
    embeddings, "How do I change my password?", *candidates
)
print(index, text)
# 2 Reset your password.
OpenAIEmbeddings from langchain-openai implements the same embed_documents interface and is a drop-in replacement for SimpleEmbeddings. Any LangChain-compatible embedding class works the same way.

Edge cases

| Situation | Behaviour |
| --- | --- |
| No candidates (*args is empty) | Returns (-1, input_text) from find_most_similar; [((-1, input_text), 0.0)] from find_most_similar_with_scores. |
| All-zero embedding vectors | Cosine similarity is 0.0 (division-by-zero is caught and set to 0.0). |
| NaN or Inf in similarity matrix | Replaced with 0.0 automatically. |
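The zero-vector and non-finite rules above can be pictured with a short sketch. It uses plain Python rather than NumPy so that it runs without extra dependencies, and safe_cosine is a hypothetical function that mirrors the described behaviour, not the module's own code.

```python
import math
from typing import Sequence


def safe_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity that returns 0.0 instead of failing or yielding NaN/Inf."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        # At least one vector is all zeros; avoid dividing by zero.
        return 0.0
    score = dot / norm
    # NaN or Inf components propagate into the score; replace them with 0.0.
    return score if math.isfinite(score) else 0.0


print(safe_cosine([0.0, 0.0], [1.0, 2.0]))            # 0.0
print(safe_cosine([float("nan"), 1.0], [1.0, 1.0]))   # 0.0
print(round(safe_cosine([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```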