server: semantic similarity queries via sqlite-vec + canonical scoring via cq-schema by javiermtorres · Pull Request #247 · mozilla-ai/cq

javiermtorres · 2026-04-07T13:52:39Z

Adds optional semantic similarity to the server's /query path using sqlite-vec and remote token embeddings, and migrates the server's scoring constants onto the canonical cq-schema source via the cq-sdk.

Closes #21.
Closes #332.

Semantic search (#21)

When TOKEN_EMBEDDING_URL is set and the semsearch extra is installed (sqlite-vec, numpy, httpx), SqliteStore:

loads sqlite-vec into every pooled DBAPI connection via the SQLAlchemy connect event;
creates a knowledge_units_vec table on schema bootstrap;
upserts an embedding row for each unit on insert/update;
in query(), fetches the per-domain candidate set joined with cosine distance via combined_query, then ranks units by relevance × confidence × Π(1 + logit(field)) so semantically closer units boost their score and far ones diminish it.

When the env var is unset (or the extra isn't installed) the path short-circuits via semsearch.is_enabled() and behaviour matches origin/main exactly. CI runs without the embedding server, so the modulation-specific test (test_query_returns_best_result_with_domain_overlap) skips by design.

Scoring canonicalisation (#332)

cq-sdk 0.10.0 reads its scoring constants from cq-schema's canonical scoring.values.json. With this change the server consumes those constants transitively via the SDK rather than holding its own copy.

bump cq-sdk pin from ~=0.9.1 to ~=0.10.0 (pulls cq-schema 0.0.1)
swap from .scoring imports in app.py, store/_sqlite.py, tests/test_store.py to from cq.scoring
delete server/backend/src/cq_server/scoring.py and server/backend/tests/test_scoring.py (the SDK's cq.scoring is now the canonical Python implementation; SDK tests cover the same behaviours)
add server/backend/tests/test_schema_oracle.py to validate that server-serialised KnowledgeUnits round-trip through cq_schema.load_schema("knowledge_unit")

Behaviour note: cq.scoring.calculate_relevance clamps to [0, 1]; the deleted local copy did not. In practice a no-op (Jaccard ∈ [0,1] and the four weights sum to 1.0), but it removes a class of latent bug from any future weight tweak.

Test plan

make test-server-backend — 302 passed, 1 skipped (embedding-server-dependent)
make lint-server-backend
Manual smoke against a live embedding endpoint (out of CI scope)

Summary by CodeRabbit

New Features
- Optional semantic search added (disabled by default) to improve knowledge discovery when enabled.
Documentation
- Backend README documents semantic search activation, required env vars and expected behaviour.
Tests
- New and extended unit/integration tests covering semantic-search scoring, insertion and query routing; conditional e2e checks when embeddings available.
Chores
- Optional “semsearch” extra added, test logging/pytest tweaks, and dev sync target updated to include the semsearch extra.
CI
- New CI workflow to run lint, schema checks and embedding-enabled tests.

javiermtorres · 2026-04-07T13:53:15Z

@copilot check if conflicts can be solved easily

Copilot · 2026-04-07T14:12:47Z

@copilot check if conflicts can be solved easily

Conflicts resolved in f7e3464. Here's what was needed:

plugins/cq/server/ (modify/delete): Main deleted the Python MCP server (replaced by Go CLI). Accepted the deletion and ported our sqlite-vec changes to the new location.
sdk/python/src/cq/store.py (content, rename-detected): Git detected that plugins/cq/server/cq_mcp/local_store.py was renamed to sdk/python/src/cq/store.py on main. Merged our sqlite-vec additions (_VEC_SCHEMA_SQL, embedding_model param, vec insert/update/delete/query logic) into main's updated version.
sdk/python/tests/test_store.py (content, rename-detected): Similarly ported embedding_model fixture and updated LocalStore constructor calls.
server/backend/uv.lock (content): Took main's pygments 2.20.0 version.
team-api/pyproject.toml (modify/delete): Accepted deletion (renamed to server/backend/pyproject.toml on main; sqlite-vec now goes via the SDK dependency).

Also added sqlite-vec==0.1.7a10 and sentence-transformers[onnx] to sdk/python/pyproject.toml.

peteski22 · 2026-04-09T19:19:48Z

@copilot rebase from origin/main and solve conflicts.

Copilot · 2026-04-09T19:25:48Z

@copilot rebase from origin/main and solve conflicts.

Done in 45f715e. The branch is now rebased on origin/main — the diff from main contains exactly our 3 intended files:

sdk/python/src/cq/store.py — sqlite-vec virtual table + embedding model
sdk/python/tests/test_store.py — embedding_model fixture + updated constructors
sdk/python/pyproject.toml — sqlite-vec==0.1.7a10 and sentence-transformers[onnx] deps

Also fixed an unintended regression where server/backend/uv.lock had been modified by an old merge commit (restored to match main).

javiermtorres · 2026-04-17T07:29:15Z

Since the SDK and server stores were split at some point during the development of this PR, the agent tried to reconcile changes and this resulted in unwanted functionality.
@peteski22 as discussed, this will be implemented in the server sqlite instance using an encoderfile, behind a clean interface so other alternatives can be plugged in as well.

javiermtorres · 2026-04-24T07:21:46Z

Removed the sdk changes and kept semsearch to the server store.

Copilot

Pull request overview

This PR adds an optional semantic-search path to the server’s SQLite-backed RemoteStore using sqlite-vec + a remote embedding endpoint, and refactors the server store/API/tests to use async store methods so vector/HTTP work can be awaited during indexing and querying.

Changes:

Introduces cq_server.semsearch for embedding generation, sqlite-vec schema management, and vector distance queries.
Refactors RemoteStore APIs (and FastAPI routes + tests) to be async and integrates semantic upsert/query behavior.
Updates server packaging/tooling to install semsearch deps via an optional extra and updates setup script target.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 14 comments.

Show a summary per file

File	Description
server/backend/src/cq_server/semsearch.py	New semantic search helper module (embedding fetch + sqlite-vec integration).
server/backend/src/cq_server/store/init.py	Integrates semsearch into `RemoteStore`, converts many methods to async, and changes query scoring behavior.
server/backend/src/cq_server/app.py	Makes API routes async and awaits store operations.
server/backend/src/cq_server/review.py	Makes review endpoints async where they call async store methods.
server/backend/src/cq_server/auth.py	Makes auth endpoints async where they call async store methods.
server/backend/tests/test_app.py	Converts tests to async and adds semsearch-specific query tests/expectations.
server/backend/tests/test_store.py	Converts tests to async and adjusts confidence expectations based on query-time scoring.
server/backend/tests/test_review.py	Converts tests to async to align with async auth/store calls.
server/backend/tests/test_queries.py	Converts tests to async to align with async store calls.
server/backend/tests/test_migrations.py	Converts seeding helper to async; currently has a missing await.
server/backend/tests/test_auth.py	Converts tests/helpers to async for async auth/store.
server/backend/pyproject.toml	Adds `semsearch` optional extra deps; adjusts pytest logging config.
Makefile	Installs server backend dev deps with `--extra semsearch`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-28T10:01:13Z

+                row_data_by_id.get(unit.id, ()),
+                field_logits,
+            )
+            unit.evidence.confidence = combined_relevance  # Update confidence to combined score for sorting


query() mutates the returned KnowledgeUnit by overwriting unit.evidence.confidence with a query-time relevance score. This breaks the semantics of confidence (which is validated as 0..1 in the model), can exceed bounds or become negative, and can leak into persistence if a caller later passes the mutated unit back to update() (tests here do exactly that). Keep confidence unchanged and use a separate local score for sorting (or return a (unit, score) structure at an API boundary that can represent ranking).

Suggested change

unit.evidence.confidence = combined_relevance # Update confidence to combined score for sorting

Still valid. During the merge the mutation moved to _sqlite.py:_score_semsearch_rows (unit.evidence.confidence = combined). Same concern: blends ranking score with the model's persisted 0..1 invariant, and a caller round-tripping a query() result through update() would persist the modulated value. The new test_query_returns_best_result_with_domain_overlap depends on the mutation for relative-order assertions, so resolving requires re-shaping that test too. Tracking together with #3153159271.

Copilot · 2026-04-28T10:01:13Z

+    for field_idx, logit_map in field_logits.items():
+        if field_idx < len(row_data):
+            value = row_data[field_idx]
+            logit = logit_map.get(value, 0.0)
+            # Use (1.0 + logit) to keep positive: high logits boost, low logits diminish
+            combined *= 1.0 + logit
+


combined *= 1.0 + logit can flip the score negative when logit < -1 (e.g., if v/mean_val > e with invert=True). If this score is later exposed as/derived from confidence, it can violate expected non-negativity and ordering. Consider using a strictly-positive transform (e.g., combined *= exp(logit)), clamping, or otherwise bounding the adjustment so scores remain well-behaved.

Still valid. The same multiplicative form lives in semsearch.compute_combined_relevance (moved during the merge from the deleted monolithic store/__init__.py to semsearch.py). If logit < -1 the score flips sign; needs real embeddings to trigger but matters for ranking robustness. Tracking.

Copilot · 2026-04-28T10:01:17Z

        # Query returns the unit.
        resp = client.get(
            "/query",
            params={"domains": ["api", "payments"], "languages": ["python"]},
        )
        assert len(resp.json()) == 1
-        assert resp.json()[0]["evidence"]["confidence"] == 0.5
+        assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.35)

        # Confirm boosts confidence.
        resp = client.post(f"/confirm/{unit_id}")
        assert resp.status_code == 200

        resp = client.get("/query", params={"domains": ["api", "payments"]})
-        assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.6)
+        assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.33)

        # Flag reduces confidence.
        resp = client.post(f"/flag/{unit_id}", json={"reason": "stale"})
        assert resp.status_code == 200

        resp = client.get("/query", params={"domains": ["api", "payments"]})
        result = resp.json()[0]
-        assert result["evidence"]["confidence"] == pytest.approx(0.45)
+        assert result["evidence"]["confidence"] == pytest.approx(0.2475)


These assertions assume /query returns a modified evidence.confidence that reflects query-time ranking (0.35/0.33/0.2475). That makes the test (and API behavior) dependent on semsearch enablement and on the ranking formula, and it conflates persisted evidence confidence with retrieval score. Prefer asserting the returned unit IDs/order and keep confidence checks tied to the persisted confidence updates from confirm/flag (which should stay within 0..1).

Still partially valid. The brittle exact-value assertions are gone, but the new test_query_returns_best_result_with_domain_overlap still asserts on result['evidence']['confidence'] returned from /query, which depends on the mutation in _sqlite.py:_score_semsearch_rows. Same root cause as #3153158840; tracked together.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

server/backend/tests/test_app.py (2)
320-320: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assertion does not verify semantic ranking.

Line 320 checks result_confidences[ku1["id"]] >= result_confidences[ku2["id"]], but both units are inserted with the default confidence of 0.5 (from _propose_payload and _approve_unit). The semsearch scoring multiplies relevance by confidence but does not modify the returned unit's evidence.confidence field. Therefore, this assertion is 0.5 >= 0.5, which is trivially true and does not verify that ku1 ranks above ku2 based on semantic similarity. To properly validate semantic ranking, assert that ku1 is the top-ranked result.
🐛 Proposed fix — verify ku1 is the top-ranked result
         assert ku1["id"] in result_confidences and ku2["id"] in result_confidences
         assert result["domains"] == ["astronomy"]
-        assert result_confidences[ku1["id"]] >= result_confidences[ku2["id"]]
+        assert result["id"] == ku1["id"], "Expected ku1 (exoplanet transit) to rank higher than ku2 (HII enrichment) via semantic similarity"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/tests/test_app.py` at line 320, The current assertion compares
result_confidences[ku1["id"]] >= result_confidences[ku2["id"]], which is
trivially true because both confidences default to 0.5; instead change the test
to assert semantic ranking by verifying that ku1 is the top-ranked result (e.g.,
assert that the first search result's id equals ku1["id"]). Update the assertion
in the test (around the existing result_confidences usage and after the search
call) to check the ordering of search results directly rather than comparing
evidence.confidence values produced by _propose_payload/_approve_unit and used
by semsearch. Ensure you reference ku1["id"] as the expected top result.
316-318: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assertion threshold inconsistent with subsequent checks.

Line 316 asserts len(results['data']) >= 1, but line 318 immediately requires both ku1 and ku2 to be present in result_confidences, implying at least 2 results are expected. This inconsistency could mask a failure where only 1 result is returned.
🧹 Proposed fix
-        assert len(results['data']) >= 1
+        assert len(results['data']) >= 2
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/tests/test_app.py` around lines 316 - 318, The assertion
expecting at least 1 result is inconsistent with the subsequent check that both
ku1 and ku2 must appear; update the test to require at least 2 results (e.g.,
change the length assertion to assert len(results['data']) >= 2) so it aligns
with the later membership checks for ku1["id"] and ku2["id"] in
result_confidences, ensuring the test fails when too few results are returned.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/backend/tests/test_app.py`:
- Around line 268-272: The test is incorrectly asserting len(results) == 1 on
the full JSON response object; update the assertion to check the number of items
in the data list instead. Replace the check with an assertion on results['data']
(e.g., assert len(results['data']) == 1 or assert len(results.get('data', []))
== 1) so that the test validates the actual result count before using result =
results['data'][0]; keep the subsequent assertions on result["domains"] and
result["insight"]["summary"] unchanged.

---

Duplicate comments:
In `@server/backend/tests/test_app.py`:
- Line 320: The current assertion compares result_confidences[ku1["id"]] >=
result_confidences[ku2["id"]], which is trivially true because both confidences
default to 0.5; instead change the test to assert semantic ranking by verifying
that ku1 is the top-ranked result (e.g., assert that the first search result's
id equals ku1["id"]). Update the assertion in the test (around the existing
result_confidences usage and after the search call) to check the ordering of
search results directly rather than comparing evidence.confidence values
produced by _propose_payload/_approve_unit and used by semsearch. Ensure you
reference ku1["id"] as the expected top result.
- Around line 316-318: The assertion expecting at least 1 result is inconsistent
with the subsequent check that both ku1 and ku2 must appear; update the test to
require at least 2 results (e.g., change the length assertion to assert
len(results['data']) >= 2) so it aligns with the later membership checks for
ku1["id"] and ku2["id"] in result_confidences, ensuring the test fails when too
few results are returned.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c57f750a-53c2-41b2-ba15-838985affffc

📥 Commits

Reviewing files that changed from the base of the PR and between 82ac6f3 and 399d7b1.

📒 Files selected for processing (2)

server/backend/src/cq_server/semsearch/queries.py
server/backend/tests/test_app.py

coderabbitai · 2026-06-11T07:46:28Z

Note

Docstrings generation - SUCCESS
Generated docstrings and committed to branch 21-semantic-similarity-queries (commit: 7dc8563aa48c0f06c5b5cb674e4b5d800a2a96ca)

@javiermtorres

Docstrings generation was requested by @javiermtorres. The following files were modified: * `server/backend/src/cq_server/core/db.py` * `server/backend/src/cq_server/repositories/knowledge.py` * `server/backend/src/cq_server/semsearch/__init__.py` * `server/backend/src/cq_server/semsearch/queries.py` * `server/backend/tests/conftest.py` * `server/backend/tests/db_helpers.py` * `server/backend/tests/test_semsearch.py` * `server/backend/tests/test_sqlite_store.py` * `server/backend/tests/test_store.py` These files were kept as they were: * `server/backend/tests/test_app.py` These file types are not supported: * `Makefile` * `server/backend/README.md` * `server/backend/pyproject.toml`

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)

server/backend/pyproject.toml (1)
43-49: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the duplicated jsonschema test dependency entry.

tests declares jsonschema>=4.23.0 twice (Line 44 and Line 48). Keep one entry to avoid redundant dependency resolution noise and lock churn.
Proposed fix
 tests = [
     "jsonschema>=4.23.0",
     "pytest>=9.0.3",
     "pytest-asyncio>=1.3.0",
     "httpx",
-    "jsonschema>=4.23.0",
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/pyproject.toml` around lines 43 - 49, In the tests dependency
list (the "tests" table entry) remove the duplicate "jsonschema>=4.23.0" so the
dependency appears only once; locate the tests = [...] block and delete the
redundant "jsonschema>=4.23.0" string, leaving a single instance to avoid
duplicate dependency entries and unnecessary lock churn.
server/backend/src/cq_server/semsearch/__init__.py (1)
29-29: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid logging potentially sensitive URL.

If TOKEN_EMBEDDING_URL contains authentication credentials (e.g., http://user:pass@host/), they will be logged at INFO level. Consider sanitising the URL before logging or logging only the host portion.
🛡️ Proposed fix
-        logger.info(f"Token embedding enabled using encoderfile endpoint at {_TOKEN_EMBEDDING_URL}")
+        from urllib.parse import urlparse
+        parsed = urlparse(_TOKEN_EMBEDDING_URL)
+        safe_url = f"{parsed.scheme}://{parsed.hostname}" + (f":{parsed.port}" if parsed.port else "")
+        logger.info(f"Token embedding enabled using encoderfile endpoint at {safe_url}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/__init__.py` at line 29, The log
statement prints the full _TOKEN_EMBEDDING_URL which may contain credentials;
update the logging in __init__.py to sanitize the URL before logging by parsing
_TOKEN_EMBEDDING_URL (use urllib.parse.urlparse) and log only the safe parts
(hostname and port or reconstructed URL without username/password) or a redacted
form, replacing credentials with placeholders; change the logger.info call that
references _TOKEN_EMBEDDING_URL to use the sanitized/redacted value instead.
server/backend/src/cq_server/core/db.py (1)
18-19: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Both files import private _ENABLED instead of using the public is_enabled() API. The semsearch module exposes is_enabled() as its public interface for checking enablement status. Both db.py and knowledge.py import the private _ENABLED variable directly, creating unnecessary coupling to the module's internal implementation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/core/db.py` around lines 18 - 19, Replace direct
imports of the private _ENABLED symbol with the semsearch public API
is_enabled() — e.g., remove "from ..semsearch import _ENABLED" and instead
import and call "is_enabled" where enable checks occur (use "from ..semsearch
import is_enabled as semsearch_is_enabled" or call semsearch.is_enabled()).
Update both db.py (references to _ENABLED and any conditional logic around
semsearch_load) and knowledge.py to call semsearch_is_enabled() (or
semsearch.is_enabled()) before using semsearch_load/other features so you rely
on the public API rather than the private _ENABLED variable.
server/backend/src/cq_server/semsearch/queries.py (2)
108-108: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Remove stale TODO comment.

The comment # check type of sqlalchemy conns appears to be a leftover reminder that was previously flagged. The type annotation has been added to conn: Connection, so this comment should be removed.
🧹 Proposed fix
-# check type of sqlalchemy conns
 async def query(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` at line 108, Remove the
stale TODO comment "# check type of sqlalchemy conns" from the code near the
Connection-typed variable (conn: Connection) in queries.py; since the type
annotation is already present on conn, simply delete that comment line to avoid
misleading TODOs and leave the surrounding code and the conn: Connection
annotation unchanged.
53-56: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Empty embeddings list causes np.average to produce unexpected results.

If results[0]["embeddings"] is an empty list (service returns a result but with no embeddings), np.average on an empty array will issue a warning and return nan, which then propagates to the serialisation and SQL insertion. Consider validating that embeddings are present.
🛡️ Proposed fix
         results = response.json().get("results")
         if not results:
             raise RuntimeError(f"Embedding API returned no embeddings for input: {request_data}")
+        embeddings_list = results[0].get("embeddings", [])
+        if not embeddings_list:
+            raise RuntimeError(f"Embedding API returned empty embeddings for input: {request_data}")
-        return np.average(np.array([embedding.get("embedding") for embedding in results[0]["embeddings"]]), axis=0)
+        return np.average(np.array([embedding.get("embedding") for embedding in embeddings_list]), axis=0)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 53 - 56, The
code currently computes np.average over results[0]["embeddings"] which can be an
empty list and yield NaN; update the logic in the function that handles the API
response (look for variables/results usage: response.json(), results,
request_data, and the return line using np.average) to explicitly validate that
results is non-empty and that results[0]["embeddings"] is a non-empty list
before calling np.average; if embeddings is empty, raise a RuntimeError
(including request_data and the raw response/results for debugging) so
downstream serialization/SQL insertion never receives NaN.
server/backend/tests/test_semsearch.py (1)
218-221: ⚠️ Potential issue | 🟡 Minor

Narrow the skip exception scope for _get_embeddings to only embedding/service failures

Catching Exception here can mask genuine semsearch regressions (e.g., unexpected response/JSON/embedding structure issues). _get_embeddings raises RuntimeError for “not enabled”/“no embeddings”, and otherwise relies on httpx to raise its own request/HTTP status errors—so skip only on those instead of all exceptions.
import httpx

try:
    await semsearch_queries._get_embeddings(["connectivity check"])
except (RuntimeError, httpx.RequestError, httpx.HTTPStatusError) as exc:
    pytest.skip(f"embedding server unavailable: {exc}")
Also consider aligning server/backend/tests/test_app.py—it currently catches built-in ConnectionError/TimeoutError, which won’t match httpx’s exception hierarchy.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/tests/test_semsearch.py` around lines 218 - 221, Narrow the
broad except in the test that calls semsearch_queries._get_embeddings so it only
skips when the embedding/service is unavailable: catch RuntimeError and httpx
request/HTTP status errors instead of Exception (e.g., catch RuntimeError,
httpx.RequestError, httpx.HTTPStatusError). Update the related test_app.py
handling (which currently catches ConnectionError/TimeoutError) to use httpx
exception types so both tests correctly distinguish service availability issues
from real semsearch regressions.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/backend/tests/conftest.py`:
- Around line 77-85: Fix the mixed tab/space indentation in the fixture
docstrings in conftest.py by converting all leading tabs to spaces (or vice
versa to match project style) so indentation is consistent; specifically update
the docstrings for the fixture that provides the UserRepository (the function
accepting repos (_RepoBundle) and returning UserRepository) and the other nearby
test fixtures to use only spaces for indentation so Ruff E101 is resolved.

In `@server/backend/tests/db_helpers.py`:
- Line 157: Several parameter docstring lines (e.g., the "unit (KnowledgeUnit):
The knowledge unit with updated fields; must correspond to an existing record in
the database." entry and other long lines at the indicated ranges) exceed the
120-char limit and must be wrapped; edit the docstrings in tests/db_helpers.py
(the docstring containing the "unit (KnowledgeUnit)" parameter and the other
long docstrings at the noted ranges) to reflow text to lines under 120
characters, preserving the existing triple-quote block and parameter
formatting/indentation, breaking long sentences into wrapped lines or separate
sentences as needed, and then run the linter (ruff) to confirm E501 is resolved.
- Around line 194-203: The docstring for get_review_status contains mixed tabs
and spaces causing Ruff E101; fix by converting all leading tab characters to
spaces (use 4-space indentation per PEP8) inside the triple-quoted string for
get_review_status and any other nearby docstrings flagged (the docstring blocks
around the later helper functions referenced in the comment), ensuring
consistent spacing throughout so no tabs remain.

In `@server/backend/tests/test_semsearch.py`:
- Line 35: The docstring line describing KnowledgeUnit is too long (Ruff E501);
split the sentence into multiple shorter lines so each stays within the
project's line-length limit. Edit the docstring that contains "KnowledgeUnit: A
newly created KnowledgeUnit configured for tests (private tier, empty Context,
created_by=\"tester\")." and break it into two or more logical segments (for
example after "private tier," or "empty Context,") while preserving punctuation
and meaning so the linter no longer flags E501.

In `@server/backend/tests/test_sqlite_store.py`:
- Around line 21-25: The docstring for the helper that "Builds a KnowledgeUnit
test fixture for the given domain" in server/backend/tests/test_sqlite_store.py
is too long and triggers Ruff E501; reflow it into multiple shorter lines (wrap
sentences so each line is under the configured line-length limit) and preserve
the content (mention KnowledgeUnit, Insight payload summary="s", detail="d",
action="a", empty Context, tier Tier.PRIVATE, created_by "alice") so the helper
docstring stays descriptive but passes linting.

In `@server/backend/tests/test_store.py`:
- Around line 29-34: Wrap the long docstring line in the "Create an Insight with
sensible default values" docstring (the paragraph that describes Parameters and
the overrides (Any) argument) so no single line exceeds the line-length limit;
break the sentence into multiple lines or sentences (e.g., after "defaults." and
before "Parameters:" or within the "overrides (Any):" description) to satisfy
Ruff E501 while keeping the same wording and meaning.

---

Outside diff comments:
In `@server/backend/pyproject.toml`:
- Around line 43-49: In the tests dependency list (the "tests" table entry)
remove the duplicate "jsonschema>=4.23.0" so the dependency appears only once;
locate the tests = [...] block and delete the redundant "jsonschema>=4.23.0"
string, leaving a single instance to avoid duplicate dependency entries and
unnecessary lock churn.

In `@server/backend/src/cq_server/core/db.py`:
- Around line 18-19: Replace direct imports of the private _ENABLED symbol with
the semsearch public API is_enabled() — e.g., remove "from ..semsearch import
_ENABLED" and instead import and call "is_enabled" where enable checks occur
(use "from ..semsearch import is_enabled as semsearch_is_enabled" or call
semsearch.is_enabled()). Update both db.py (references to _ENABLED and any
conditional logic around semsearch_load) and knowledge.py to call
semsearch_is_enabled() (or semsearch.is_enabled()) before using
semsearch_load/other features so you rely on the public API rather than the
private _ENABLED variable.

In `@server/backend/src/cq_server/semsearch/__init__.py`:
- Line 29: The log statement prints the full _TOKEN_EMBEDDING_URL which may
contain credentials; update the logging in __init__.py to sanitize the URL
before logging by parsing _TOKEN_EMBEDDING_URL (use urllib.parse.urlparse) and
log only the safe parts (hostname and port or reconstructed URL without
username/password) or a redacted form, replacing credentials with placeholders;
change the logger.info call that references _TOKEN_EMBEDDING_URL to use the
sanitized/redacted value instead.

In `@server/backend/src/cq_server/semsearch/queries.py`:
- Line 108: Remove the stale TODO comment "# check type of sqlalchemy conns"
from the code near the Connection-typed variable (conn: Connection) in
queries.py; since the type annotation is already present on conn, simply delete
that comment line to avoid misleading TODOs and leave the surrounding code and
the conn: Connection annotation unchanged.
- Around line 53-56: The code currently computes np.average over
results[0]["embeddings"] which can be an empty list and yield NaN; update the
logic in the function that handles the API response (look for variables/results
usage: response.json(), results, request_data, and the return line using
np.average) to explicitly validate that results is non-empty and that
results[0]["embeddings"] is a non-empty list before calling np.average; if
embeddings is empty, raise a RuntimeError (including request_data and the raw
response/results for debugging) so downstream serialization/SQL insertion never
receives NaN.

In `@server/backend/tests/test_semsearch.py`:
- Around line 218-221: Narrow the broad except in the test that calls
semsearch_queries._get_embeddings so it only skips when the embedding/service is
unavailable: catch RuntimeError and httpx request/HTTP status errors instead of
Exception (e.g., catch RuntimeError, httpx.RequestError, httpx.HTTPStatusError).
Update the related test_app.py handling (which currently catches
ConnectionError/TimeoutError) to use httpx exception types so both tests
correctly distinguish service availability issues from real semsearch
regressions.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cdbfde9b-c377-4bcb-99e4-b5e6ff37c277

📥 Commits

Reviewing files that changed from the base of the PR and between 399d7b1 and 20e3360.

⛔ Files ignored due to path filters (1)

server/backend/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (11)

server/backend/pyproject.toml
server/backend/src/cq_server/core/db.py
server/backend/src/cq_server/repositories/knowledge.py
server/backend/src/cq_server/semsearch/__init__.py
server/backend/src/cq_server/semsearch/queries.py
server/backend/tests/conftest.py
server/backend/tests/db_helpers.py
server/backend/tests/test_app.py
server/backend/tests/test_semsearch.py
server/backend/tests/test_sqlite_store.py
server/backend/tests/test_store.py

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)

server/backend/tests/test_semsearch.py (1)

129-146: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Install semsearch monkeypatches before the first store.insert() in mocked-path tests.

These tests currently seed data before patching _SEMSEARCH_ENABLED/sem_insert_unit/sem_query, so in semsearch-enabled environments they can execute real embedding calls and become flaky.

💡 Suggested fix pattern

     async def test_query_calls_sem_query_when_semsearch_enabled(...):
-        unit = _make_unit("astronomy", summary="Exoplanet transit photometry")
-        store = _make_store(db_path)
-        await store.insert(unit)
-        await store.set_review_status(unit.id, "approved", "reviewer")
-
         import cq_server.repositories.knowledge as knowledge_mod

         monkeypatch.setattr(knowledge_mod, "_SEMSEARCH_ENABLED", True)
         mock_sem_query = AsyncMock(return_value=[unit])
         monkeypatch.setattr(knowledge_mod, "sem_query", mock_sem_query)
         monkeypatch.setattr(knowledge_mod, "sem_insert_unit", AsyncMock())
+
+        unit = _make_unit("astronomy", summary="Exoplanet transit photometry")
+        store = _make_store(db_path)
+        await store.insert(unit)
+        await store.set_review_status(unit.id, "approved", "reviewer")

Apply the same ordering in the other mocked-path tests in this class.

Also applies to: 153-172, 178-193

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/tests/test_semsearch.py` around lines 129 - 146, The test
seeds data with store.insert before enabling the semsearch monkeypatches, which
can cause real embedding calls; move the monkeypatch.setattr calls that set
knowledge_mod._SEMSEARCH_ENABLED, knowledge_mod.sem_query, and
knowledge_mod.sem_insert_unit so they occur before the first call to
store.insert (and before any other operations that would trigger semsearch),
e.g. set the monkeypatches at the top of
test_query_calls_sem_query_when_semsearch_enabled (and apply the same reordering
to the other mocked-path tests in this file that seed data: the tests around
lines 153-172 and 178-193) so the mocked sem_query/sem_insert_unit are in effect
during store.insert and subsequent operations.

server/backend/src/cq_server/semsearch/queries.py (3)

73-90: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wire upsert_unit() into the repository update path as well.

This helper is added here, but the changed repository integration only imports insert_unit and combined_query. Edited summaries/details/actions will therefore keep their old embedding indefinitely.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 73 - 90, The
repository update flow currently imports insert_unit and combined_query but
never calls upsert_unit, so edited summaries/details/actions never refresh their
embeddings; update the repository code that performs unit inserts/updates to
also import and await upsert_unit(unit) after the DB write (the same places you
call insert_unit or perform updates to a KnowledgeUnit), ensuring you call the
async upsert_unit function for edits (not just creates), and wrap the call in a
short try/except to log failures without breaking the main transaction.

245-262: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The distance term is rewarding less similar vectors.

vec_distance_cosine is lower for better matches, but combine() adds the normalised distance and then sorts descending. That boosts the farthest units instead of the closest ones.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 245 - 262,
The combine() function currently adds normalized distance so lower (better)
vec_distance_cosine values reduce the combined score, which incorrectly rewards
farther vectors; change the distance contribution to reward smaller distances by
using an inverted term (e.g., 1 - (distance / total_distance) when
total_distance > 0) or subtracting the normalized distance from relevance
instead of adding it; update combine to compute normalized_distance = distance /
total_distance if total_distance > 0 else 0 and then use distance_score = 1 -
normalized_distance (or equivalently flip the sign) when weighting with
distance_weight so that closer vectors (smaller vec_distance_cosine) increase
the final score.

55-62: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Average embeddings across every requested input.

The request sends the whole wordlist, but the response handling only reads results[0]["embeddings"]. Any multi-domain query therefore embeds just the first input and silently drops the rest.

Suggested fix

         results = response.json().get("results")
         if not results:
             raise RuntimeError(f"Embedding API returned no embeddings for input: {request_data}")
-        return np.average(np.array([embedding.get("embedding") for embedding in results[0]["embeddings"]]), axis=0)
+        embeddings = [
+            embedding["embedding"]
+            for result in results
+            for embedding in result.get("embeddings", [])
+            if embedding.get("embedding") is not None
+        ]
+        if not embeddings:
+            raise RuntimeError(f"Embedding API returned no embeddings for input: {request_data}")
+        return np.average(np.array(embeddings), axis=0)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 55 - 62, The
code posts the full wordlist but only reads results[0]["embeddings"], dropping
embeddings for other inputs; update the response handling to collect embeddings
from every result entry (e.g., iterate over results and within each result over
result["embeddings"], extracting embedding.get("embedding")), validate the
combined list is non-empty, and then compute the mean across all collected
embedding vectors with np.mean(axis=0) before returning; refer to AsyncClient,
_TOKEN_EMBEDDING_URL, request_data, response, and results to locate where to
change the comprehension that currently uses results[0]["embeddings"].

server/backend/src/cq_server/repositories/knowledge.py (1)

76-83: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid returning a failure after the main insert has already committed.

_insert_sync() commits the canonical KnowledgeUnit before the semsearch write runs. If the embedding fetch or vector insert fails afterwards, the caller sees an exception even though the unit already exists, so retries can trip duplicate-key errors and the vector index is left behind.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/repositories/knowledge.py` around lines 76 - 83,
The code currently commits the canonical KnowledgeUnit in _insert_sync via await
self._db.run_sync(self._insert_sync, unit) and then runs sem_insert_unit in a
separate transaction, which can leave the DB in an inconsistent state if
semsearch fails; change the flow so sem_insert_unit runs inside the same
transaction as the main insert or at least does not raise after the main commit:
move or call sem_insert_unit from within _insert_sync (or obtain the same
transaction/connection and call sem_insert_unit before committing) so both
operations share the same transactional scope (use the same
connection/transaction object instead of self._db.engine.begin()), or if
co-transactional execution is impossible, catch/log semsearch errors and
suppress exceptions so the caller is not signaled a failure after the unit has
been committed; reference _insert_sync, sem_insert_unit, _db.run_sync and
_SEMSEARCH_ENABLED when implementing the change.

♻️ Duplicate comments (8)

server/backend/tests/test_app.py (2)
268-271: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert result count from data, not envelope keys.

len(results) == 1 validates the response object shape, not the number of returned units.
💡 Suggested fix
-        results = resp.json()
-        assert len(results) == 1
-        result = results["data"][0]
+        results = resp.json()
+        assert len(results["data"]) == 1
+        result = results["data"][0]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/tests/test_app.py` around lines 268 - 271, The test currently
asserts len(results) == 1 which checks the top-level response envelope instead
of the number of returned items; update the assertion to check the length of the
"data" list from resp.json() (i.e., assert the count using results["data"])
before accessing results["data"][0], so the test validates the number of
returned units and then asserts result["domains"] == ["astronomy"].
315-320: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Current check does not validate semantic ranking.

Comparing result["evidence"]["confidence"] between ku1 and ku2 is not a ranking assertion here; both seeded units can share the same default confidence, so this can pass without proving semantic ordering.
💡 Suggested fix
-        assert len(results["data"]) >= 1
+        assert len(results["data"]) >= 2
         result = results["data"][0]
         assert ku1["id"] in result_confidences and ku2["id"] in result_confidences
         assert result["domains"] == ["astronomy"]
-        assert result_confidences[ku1["id"]] >= result_confidences[ku2["id"]]
+        assert result["id"] == ku1["id"]
server/backend/src/cq_server/semsearch/__init__.py (3)
111-114: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Always re-disable extension loading.

If sqlite_vec.load(conn) fails here, the connection stays in enable_load_extension(True) state. Wrap the load/schema path in try/finally so later code cannot load arbitrary extensions on that pooled connection.
Suggested fix
     conn.enable_load_extension(True)
-    sqlite_vec.load(conn)
-    conn.enable_load_extension(False)
-    ensure_schema(conn)
+    try:
+        sqlite_vec.load(conn)
+        ensure_schema(conn)
+    finally:
+        conn.enable_load_extension(False)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/__init__.py` around lines 111 - 114,
The connection is left with extension loading enabled if sqlite_vec.load(conn)
raises; wrap the load and ensure_schema calls in a try/finally so
conn.enable_load_extension(False) always runs. Specifically, around the calls to
sqlite_vec.load(conn) and ensure_schema(conn) ensure you call
conn.enable_load_extension(False) in a finally block while keeping
conn.enable_load_extension(True) before the try; this guarantees extension
loading is re-disabled on the connection even if sqlite_vec.load fails.
16-17: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard SEMSEARCH_EMBEDDING_DIM parsing.

int() still runs at import time, so a non-numeric env value brings the whole server down before semsearch has a chance to stay disabled.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/__init__.py` around lines 16 - 17, The
module currently calls int(os.environ.get("SEMSEARCH_EMBEDDING_DIM", 768)) at
import time (symbol _DIM), which will raise on non-numeric env values even when
semsearch is disabled; change it to safely parse the environment variable (e.g.,
read raw value then try/except ValueError or only parse when _ENABLED is True),
falling back to 768 on parse failure or missing value, and ensure the
fallback/default is used without throwing during import.
31-34: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the install hint in the warning.

This message still tells operators to install cq-sdk[embedding], but this feature is exposed via the server's semsearch extra, so the recovery path is wrong.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/__init__.py` around lines 31 - 34, The
warning logged in semsearch/__init__.py (the logger.warning call referencing
TOKEN_EMBEDDING_URL) has the wrong recovery hint; update the message text to
tell operators to install the server package with the semsearch extra (e.g.
suggest "pip install cq-server[semsearch]" or the appropriate server package
name with the 'semsearch' extra) instead of "cq-sdk[embedding]"; keep the rest
of the message intact and only replace the install hint.
server/backend/src/cq_server/repositories/knowledge.py (1)
81-83: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep semsearch I/O off borrowed DB connections and out of the event-loop thread.

Both branches open a synchronous SQLAlchemy connection inside an async method and then await a helper that performs remote I/O before issuing sync conn.execute() calls. That blocks the event loop on DB work and keeps the pooled connection/transaction pinned across the embedding round-trip.
#!/bin/bash
set -e
rg -n -C2 'with self\._db\.engine\.(begin|connect)\(\) as conn|await sem_(insert_unit|query)\(' server/backend/src/cq_server/repositories/knowledge.py
Expected result: both insert() and query() show direct Engine usage inside async methods, bypassing Database.run_sync(...).

Also applies to: 109-113
server/backend/src/cq_server/semsearch/queries.py (2)
227-233: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the normalised domains for the query embedding too.

The SQL filter uses normalized, but the embedding still uses raw domains; case/whitespace variants can therefore search one concept and filter another.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 227 - 233,
The code normalizes domains via normalize_domains() but still passes the
original domains into _get_embeddings, causing mismatch between embedding and
SQL filter; change the call to _get_embeddings to use the normalized variable
(and propagate that normalized value into any subsequent embedding
serialization/_serialize_embedding and args construction) so both the
query_embedding and the "domains" filter use the same normalized domains (keep
references: normalize_domains, normalized, _get_embeddings,
_serialize_embedding, args).
333-336: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clamp zero ratios before taking log().

A valid 0 value makes ratio == 0 when the field mean is positive, so math.log(ratio) raises. Zero cosine distance is a normal best-match case, so this helper can crash on perfect matches.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 333 - 336,
The loop computing logit can raise on ratio==0 (perfect match); clamp the
computed ratio to a tiny positive epsilon before calling math.log. In the block
iterating over values (variables: values, mean_val, ratio, logit, invert,
field_logits, field_idx), compute ratio as before but then set clamped_ratio =
max(ratio, 1e-12) (or similar small constant) and use clamped_ratio in the
math.log call (respecting the invert flag) before assigning into
field_logits[field_idx][v].

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci-server-embeddings.yaml:
- Line 7: Update the self-trigger path filters in the GitHub Actions workflow so
they match this workflow's filename: in
.github/workflows/ci-server-embeddings.yaml replace any occurrences of
".github/workflows/ci-server.yaml" with
".github/workflows/ci-server-embeddings.yaml" (these are the path filter entries
that control self-triggering of this workflow).
- Line 25: Replace every tag-pinned GitHub Action reference (e.g., the uses:
actions/checkout@v6 entry and the other uses: lines at the noted locations) with
the corresponding full commit SHA for the action repository; for each uses:
value (actions/checkout, and the other actions referenced at lines called out)
look up the desired release tag’s commit SHA on the action's GitHub repo and
substitute `@vX` or `@main` with the full 40-character commit SHA so all uses:
entries are SHA-pinned.
- Line 25: The actions/checkout usage in the workflow currently omits disabling
credential persistence; update every checkout step that uses
"actions/checkout@v6" (all occurrences) to include a with block setting
persist-credentials: false so the step becomes actions/checkout@v6 with
persist-credentials: false to prevent token exposure to downstream steps or
artifacts.
- Around line 102-104: Start the encoder in background (the
modernbert.x86_64-unknown-linux-gnu.encoderfile serve command) then poll the
TOKEN_EMBEDDING_URL (http://127.0.0.1:8080) with a short loop and timeout to
ensure it returns a successful response (HTTP 200 or health/readiness endpoint)
before invoking make setup-server test-server; fail the job if the service
doesn't become ready within the timeout so tests don't run against an unready
encoder.

---

Outside diff comments:
In `@server/backend/src/cq_server/repositories/knowledge.py`:
- Around line 76-83: The code currently commits the canonical KnowledgeUnit in
_insert_sync via await self._db.run_sync(self._insert_sync, unit) and then runs
sem_insert_unit in a separate transaction, which can leave the DB in an
inconsistent state if semsearch fails; change the flow so sem_insert_unit runs
inside the same transaction as the main insert or at least does not raise after
the main commit: move or call sem_insert_unit from within _insert_sync (or
obtain the same transaction/connection and call sem_insert_unit before
committing) so both operations share the same transactional scope (use the same
connection/transaction object instead of self._db.engine.begin()), or if
co-transactional execution is impossible, catch/log semsearch errors and
suppress exceptions so the caller is not signaled a failure after the unit has
been committed; reference _insert_sync, sem_insert_unit, _db.run_sync and
_SEMSEARCH_ENABLED when implementing the change.

In `@server/backend/src/cq_server/semsearch/queries.py`:
- Around line 73-90: The repository update flow currently imports insert_unit
and combined_query but never calls upsert_unit, so edited
summaries/details/actions never refresh their embeddings; update the repository
code that performs unit inserts/updates to also import and await
upsert_unit(unit) after the DB write (the same places you call insert_unit or
perform updates to a KnowledgeUnit), ensuring you call the async upsert_unit
function for edits (not just creates), and wrap the call in a short try/except
to log failures without breaking the main transaction.
- Around line 245-262: The combine() function currently adds normalized distance
so lower (better) vec_distance_cosine values reduce the combined score, which
incorrectly rewards farther vectors; change the distance contribution to reward
smaller distances by using an inverted term (e.g., 1 - (distance /
total_distance) when total_distance > 0) or subtracting the normalized distance
from relevance instead of adding it; update combine to compute
normalized_distance = distance / total_distance if total_distance > 0 else 0 and
then use distance_score = 1 - normalized_distance (or equivalently flip the
sign) when weighting with distance_weight so that closer vectors (smaller
vec_distance_cosine) increase the final score.
- Around line 55-62: The code posts the full wordlist but only reads
results[0]["embeddings"], dropping embeddings for other inputs; update the
response handling to collect embeddings from every result entry (e.g., iterate
over results and within each result over result["embeddings"], extracting
embedding.get("embedding")), validate the combined list is non-empty, and then
compute the mean across all collected embedding vectors with np.mean(axis=0)
before returning; refer to AsyncClient, _TOKEN_EMBEDDING_URL, request_data,
response, and results to locate where to change the comprehension that currently
uses results[0]["embeddings"].

In `@server/backend/tests/test_semsearch.py`:
- Around line 129-146: The test seeds data with store.insert before enabling the
semsearch monkeypatches, which can cause real embedding calls; move the
monkeypatch.setattr calls that set knowledge_mod._SEMSEARCH_ENABLED,
knowledge_mod.sem_query, and knowledge_mod.sem_insert_unit so they occur before
the first call to store.insert (and before any other operations that would
trigger semsearch), e.g. set the monkeypatches at the top of
test_query_calls_sem_query_when_semsearch_enabled (and apply the same reordering
to the other mocked-path tests in this file that seed data: the tests around
lines 153-172 and 178-193) so the mocked sem_query/sem_insert_unit are in effect
during store.insert and subsequent operations.

---

Duplicate comments:
In `@server/backend/src/cq_server/semsearch/__init__.py`:
- Around line 111-114: The connection is left with extension loading enabled if
sqlite_vec.load(conn) raises; wrap the load and ensure_schema calls in a
try/finally so conn.enable_load_extension(False) always runs. Specifically,
around the calls to sqlite_vec.load(conn) and ensure_schema(conn) ensure you
call conn.enable_load_extension(False) in a finally block while keeping
conn.enable_load_extension(True) before the try; this guarantees extension
loading is re-disabled on the connection even if sqlite_vec.load fails.
- Around line 16-17: The module currently calls
int(os.environ.get("SEMSEARCH_EMBEDDING_DIM", 768)) at import time (symbol
_DIM), which will raise on non-numeric env values even when semsearch is
disabled; change it to safely parse the environment variable (e.g., read raw
value then try/except ValueError or only parse when _ENABLED is True), falling
back to 768 on parse failure or missing value, and ensure the fallback/default
is used without throwing during import.
- Around line 31-34: The warning logged in semsearch/__init__.py (the
logger.warning call referencing TOKEN_EMBEDDING_URL) has the wrong recovery
hint; update the message text to tell operators to install the server package
with the semsearch extra (e.g. suggest "pip install cq-server[semsearch]" or the
appropriate server package name with the 'semsearch' extra) instead of
"cq-sdk[embedding]"; keep the rest of the message intact and only replace the
install hint.

In `@server/backend/src/cq_server/semsearch/queries.py`:
- Around line 227-233: The code normalizes domains via normalize_domains() but
still passes the original domains into _get_embeddings, causing mismatch between
embedding and SQL filter; change the call to _get_embeddings to use the
normalized variable (and propagate that normalized value into any subsequent
embedding serialization/_serialize_embedding and args construction) so both the
query_embedding and the "domains" filter use the same normalized domains (keep
references: normalize_domains, normalized, _get_embeddings,
_serialize_embedding, args).
- Around line 333-336: The loop computing logit can raise on ratio==0 (perfect
match); clamp the computed ratio to a tiny positive epsilon before calling
math.log. In the block iterating over values (variables: values, mean_val,
ratio, logit, invert, field_logits, field_idx), compute ratio as before but then
set clamped_ratio = max(ratio, 1e-12) (or similar small constant) and use
clamped_ratio in the math.log call (respecting the invert flag) before assigning
into field_logits[field_idx][v].

In `@server/backend/tests/test_app.py`:
- Around line 268-271: The test currently asserts len(results) == 1 which checks
the top-level response envelope instead of the number of returned items; update
the assertion to check the length of the "data" list from resp.json() (i.e.,
assert the count using results["data"]) before accessing results["data"][0], so
the test validates the number of returned units and then asserts
result["domains"] == ["astronomy"].

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b2539dd6-dfb1-488f-81df-41535f850133

📥 Commits

Reviewing files that changed from the base of the PR and between 20e3360 and 3cbe64f.

📒 Files selected for processing (11)

.github/workflows/ci-server-embeddings.yaml
server/backend/src/cq_server/core/db.py
server/backend/src/cq_server/repositories/knowledge.py
server/backend/src/cq_server/semsearch/__init__.py
server/backend/src/cq_server/semsearch/queries.py
server/backend/tests/conftest.py
server/backend/tests/db_helpers.py
server/backend/tests/test_app.py
server/backend/tests/test_semsearch.py
server/backend/tests/test_sqlite_store.py
server/backend/tests/test_store.py

coderabbitai · 2026-06-11T15:33:57Z

+    name: Validate schemas
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pin all GitHub Actions to full commit SHAs.

All uses: entries are tag-based (@v*) rather than SHA-pinned. This violates the stated policy and weakens supply-chain guarantees for CI execution.

Also applies to: 27-27, 36-36, 38-38, 42-42, 46-46, 63-63, 65-65, 70-70, 74-74, 82-82

🧰 Tools

🪛 zizmor (1.25.2)

[warning] 25-25: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

[error] 25-25: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci-server-embeddings.yaml at line 25, Replace every tag-pinned GitHub Action reference (e.g., the uses: actions/checkout@v6 entry and the other uses: lines at the noted locations) with the corresponding full commit SHA for the action repository; for each uses: value (actions/checkout, and the other actions referenced at lines called out) look up the desired release tag’s commit SHA on the action's GitHub repo and substitute `@vX` or `@main` with the full 40-character commit SHA so all uses: entries are SHA-pinned.

Source: Linters/SAST tools

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disable checkout credential persistence in all jobs.

Each actions/checkout step should set persist-credentials: false to reduce token exposure risk in downstream steps/artifacts.

Suggested patch pattern

- - uses: actions/checkout@v6 + - uses: actions/checkout@v6 + with: + persist-credentials: false

Also applies to: 36-36, 63-63

🧰 Tools

🪛 zizmor (1.25.2)

[warning] 25-25: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

[error] 25-25: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci-server-embeddings.yaml at line 25, The actions/checkout usage in the workflow currently omits disabling credential persistence; update every checkout step that uses "actions/checkout@v6" (all occurrences) to include a with block setting persist-credentials: false so the step becomes actions/checkout@v6 with persist-credentials: false to prevent token exposure to downstream steps or artifacts.

Source: Linters/SAST tools

javiermtorres · 2026-06-11T17:22:20Z

Thanks @javiermtorres

I had a look at this PR but I think it's got some things that need tweaking and some things that are still missing (from previous feedback. Don't we still need an updated Docker file for cq server to provider encoderfile (and docs)?

Added a GH action to check the encoderfile. There's no need to dockerize it in the actions themselves. The user can later decide the specific deployment.

make lint/make lint-server-backend is also failing.

Every check goes smooth now :)

peteski22

Thanks for sticking with this @javiermtorres; the refactor onto core/db.py and repositories/ is looking good and resolves most of the earlier feedback (the confidence mutation is gone, the combined query now has a LIMIT, the deps are pinned, the env vars are documented, and the conditional imports fixed the ImportError-without-the-extra problem). 🙏🏼

Since the restructure moved everything around, the old inline feedback comments are hard to follow, so here's a single consolidated checklist of what's left. All of it stems from this PR, so I'd like to land it complete rather than split out ... genuine future enhancements can be tracked separately (see the footer).

Correctness/error handling around the embedding call

Domain-only fallback. combined_query now raises on an embedding or SQL error (queries.py:236-241) and the caller doesn't catch it (knowledge.py:109-113), so a transient embedding-service hiccup returns a 500 instead of degrading to the SQL-only path.
Partial failure on insert. insert() commits the unit, then embeds in a separate transaction (knowledge.py:80-83). If embedding fails, the unit is persisted but has no vector row; invisible to semantic query() (which JOINs knowledge_units_vec) with no re-index path, and the caller still gets a 500. The unit and its embedding should land or roll back together. (This also answers the :76-79 FIXME; and that comment has a couple of typos worth fixing while it's touched.)
update() doesn't refresh the embedding. It only calls _update_sync (knowledge.py:123-125); upsert_unit exists but is never wired in, so an edited unit keeps a stale vector. README.md:97 already claims insert and update write an embedding row — so docs and behaviour need reconciling.
Domain normalization mismatch. combined_query embeds the raw domains (queries.py:231) but filters on normalized (:233), unlike query() which embeds normalized. Use the normalized form for both.

Dead code (all introduced in this PR)

build_field_logits, compute_combined_relevance, and queries.query are never called outside tests — the live path uses the inline combine(). Remove them (or wire them in). Note compute_combined_relevance's docstring claims it "keeps the combined score positive," which isn't true (combined *= 1.0 + logit flips sign for logit < -1).
upsert_unit is currently unused, either wire it into update() (above) or remove it.

Tests

The new test_semsearch.py exercises the dead helpers, but the live scoring path (combined_query / combine()) only has the skip-gated e2e test, so it has no coverage in normal CI. Once the dead helpers are removed their tests should go too — please add an embedding-free test that mocks _get_embeddings and asserts the real ranking.

Cleanup (all introduced in this PR)

Remove server/backend/.DS_Store (committed) and add it to .gitignore.
Drop the per-insert logger.info (knowledge.py:75).
Fix the ImportError message at semsearch/__init__.py:34 — it says pip install cq-sdk[embedding], but the package is cq-server and the extra is semsearch (cq-server[semsearch]).
Revert the pytest logging config — log_cli = true now prints DEBUG on every run, which contradicts the "only with -vv" intent from earlier; jsonschema is also pinned twice in the tests group (pyproject.toml:44,48,65-67).

Docs

Turn the encoderfile link into a recommendation: a concrete model whose output dimension matches SEMSEARCH_EMBEDDING_DIM (default 768) and the minimal command to run it, so TOKEN_EMBEDDING_URL has something to point at.
Document (or implement) the backfill story: units inserted while semsearch was disabled have no embedding and won't appear in semantic results; there's no reindex path today.
Add TOKEN_EMBEDDING_URL and SEMSEARCH_EMBEDDING_DIM to the self-hosted env-var table in DEVELOPMENT.md; README.md:71-72 points there as the canonical reference, but they're missing.

Genuine future work: configurable candidate limit and relevance/distance weights (queries.py:258-259) is out of scope here and should be tracked in new issues that are referenced on the PR.

Happy to pair on the error-handling items if that's quicker than another round.

javiermtorres · 2026-06-15T13:16:49Z

@peteski22 Re: domain only fallback. Do you want any error to fall back, or only those directly related to the semantic query machinery? Note that the current query can potentially fail for similar reasons (e.g. db connectivity). Also, a fallback provides not a degraded, but a different service instead, since the ranking will not be respecting embeddings. A feature indicator flag in both requests and responses to mark the fallback to standard query may be needed here.

javiermtorres · 2026-06-15T13:26:25Z

@peteski22 Re: partial failure on insert. The operations are not composable now at this level, since the sql commit is not visible. To avoid modifying _insert_sync, I will probably copy the operations there into the embedding insert.

javiermtorres · 2026-06-15T13:29:41Z

@peteski22 Re: domain normalization mismatch. I will probably remove query in favour of combined_query, since domain filtering needs to be done anyway.

javiermtorres · 2026-06-15T13:33:41Z

Re: e2e tests, notice that there's an additional CI action that runs all tests enabling embeddings. I could fetch the embedding encoderfile, start it, run a specific suite, and then finish, but I supposed that you'd prefer having the embedding tests separate. Please let me know which approach you'd prefer. I will check whether there is appropriate coverage mocking the embeddings in the current setup.

peteski22 · 2026-06-15T13:57:34Z

@peteski22 Re: domain only fallback. Do you want any error to fall back, or only those directly related to the semantic query machinery? Note that the current query can potentially fail for similar reasons (e.g. db connectivity). Also, a fallback provides not a degraded, but a different service instead, since the ranking will not be respecting embeddings. A feature indicator flag in both requests and responses to mark the fallback to standard query may be needed here.

on any error to fall back:

I'd say scope the catch to the semantic path; don't blanket-catch/except everything.

on the current query can potentially fail:

Try: semantic -> fallback to standard query if semantic stuff fails ... if all that fails, surface the error

on different service :

Yeah, you're right that swapping things quietly isn't great 🤔 probably just have to log a warning as changing the wire format for things is a bigger change and impacts more things.

peteski22 · 2026-06-15T14:07:21Z

@peteski22 Re: partial failure on insert. The operations are not composable now at this level, since the sql commit is not visible. To avoid modifying _insert_sync, I will probably copy the operations there into the embedding insert.

Do you mean move the logic so you compute the embedding first (when enabled) then do the SQL in a single transaction?

javiermtorres · 2026-06-15T14:31:14Z

@peteski22 Re: partial failure on insert. The operations are not composable now at this level, since the sql commit is not visible. To avoid modifying _insert_sync, I will probably copy the operations there into the embedding insert.

Do you mean move the logic so you compute the embedding first (when enabled) then do the SQL in a single transaction?

Yep but the embeddings are not needed in the case of not-semantic search so the condition would be used in a couple of places because we also want the embeddings to be out of the sql queries. It's... awkward anyway.

peteski22 · 2026-06-15T14:44:44Z

@peteski22 Re: partial failure on insert. The operations are not composable now at this level, since the sql commit is not visible. To avoid modifying _insert_sync, I will probably copy the operations there into the embedding insert.

Do you mean move the logic so you compute the embedding first (when enabled) then do the SQL in a single transaction?

Yep but the embeddings are not needed in the case of not-semantic search so the condition would be used in a couple of places because we also want the embeddings to be out of the sql queries. It's... awkward anyway.

Do you mean like:

# semsearch owns the enabled check:
async def compute_embedding(unit) -> bytes | None # None when disabled
def write_embedding(conn, unit_id, vector) -> None # sync, in-txn

# repository.insert() — no _SEMSEARCH_ENABLED branching:
vector = await semsearch.compute_embedding(unit) # HTTP, before the txn
await run_sync(self._insert_sync, unit, vector) # one txn: unit + domains, and write_embedding if vector

javiermtorres · 2026-06-19T08:26:18Z

Do you mean like:

# semsearch owns the enabled check:
async def compute_embedding(unit) -> bytes | None # None when disabled
def write_embedding(conn, unit_id, vector) -> None # sync, in-txn

# repository.insert() — no _SEMSEARCH_ENABLED branching:
vector = await semsearch.compute_embedding(unit) # HTTP, before the txn
await run_sync(self._insert_sync, unit, vector) # one txn: unit + domains, and write_embedding if vector

Probably. But take into account that this needs to be done general enough so that it works without semantic search, and it is compatible with other kinds of search (BM25 for example), and it may need to be compatible with mechanisms in postgres (pgvector e.g.) since we've already merged #431. So it's not going to be trivial.

javiermtorres · 2026-06-22T07:48:38Z

@copilot check if the comments in #247 (review) are now addressed

Copilot started work on behalf of javiermtorres April 7, 2026 13:53 View session

Copilot AI changed the title ~~Add sqlite-vec plugin~~ Add sqlite-vec plugin to Python SDK store Apr 7, 2026

Copilot finished work on behalf of javiermtorres April 7, 2026 14:14

Copilot started work on behalf of peteski22 April 9, 2026 19:19 View session

Copilot finished work on behalf of peteski22 April 9, 2026 19:26

Copilot AI requested a review from peteski22 April 9, 2026 19:26

peteski22 added do-not-merge labels Apr 9, 2026

javiermtorres changed the title ~~Add sqlite-vec plugin to Python SDK store~~ Add sqlite-vec plugin to Python server store Apr 17, 2026

Add semantic search support to server store

588ae9a

javiermtorres force-pushed the 21-semantic-similarity-queries branch from 45f715e to 588ae9a Compare April 24, 2026 07:19

javiermtorres added 8 commits April 24, 2026 09:31

Merge branch 'main' into 21-semantic-similarity-queries

70c19b2

Remove default embeddings endpoint

9bceb0a

Fix lint

ff6f201

Make semsearch modulate current scoring

56cdba2

Merge branch 'main' into 21-semantic-similarity-queries

16d538a

Add semsearch to the dependencies in CI

7fad12d

Merge branch 'main' into 21-semantic-similarity-queries

34e2330

Fix query tests to async protocol

7900707

peteski22 requested a review from Copilot April 28, 2026 09:53

Copilot started reviewing on behalf of peteski22 April 28, 2026 09:53 View session

Copilot AI reviewed Apr 28, 2026

View reviewed changes

peteski22 mentioned this pull request Apr 28, 2026

Implement SqliteStore against Store protocol via Core helpers #308

Closed

peteski22 removed the do-not-merge label Apr 28, 2026

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread server/backend/tests/test_app.py

coderabbitai Bot and others added 2 commits June 11, 2026 07:46

Merge branch 'main' into 21-semantic-similarity-queries

20e3360

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Fix lint

3cbe64f

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Fix actions, test check

263dc7d

Merge branch 'main' into 21-semantic-similarity-queries

fca9c0d

peteski22 reviewed Jun 15, 2026

View reviewed changes

Address review comments

6ab956e

Copilot started work on behalf of javiermtorres June 22, 2026 07:49 View session

Copilot finished work on behalf of javiermtorres June 22, 2026 07:57

javiermtorres force-pushed the 21-semantic-similarity-queries branch from 1e237dd to 6ab956e Compare June 22, 2026 08:18

javiermtorres added 3 commits June 22, 2026 10:19

Merge branch 'main' into 21-semantic-similarity-queries

055e7dc

Fix lint

b8aa5b2

Fix db integrity exception class in tests

0745ad5

javiermtorres force-pushed the 21-semantic-similarity-queries branch from 79fc4ef to 0745ad5 Compare June 22, 2026 08:49

Fix integrity exception

2c14fd4

Conversation

javiermtorres commented Apr 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Semantic search (#21)

Scoring canonicalisation (#332)

Test plan

Summary by CodeRabbit

Uh oh!

javiermtorres commented Apr 7, 2026

Uh oh!

Copilot AI commented Apr 7, 2026

Uh oh!

peteski22 commented Apr 9, 2026

Uh oh!

Copilot AI commented Apr 9, 2026

Uh oh!

javiermtorres commented Apr 17, 2026

Uh oh!

javiermtorres commented Apr 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

peteski22 Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

peteski22 Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

peteski22 Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

javiermtorres commented Jun 11, 2026

Uh oh!

peteski22 left a comment • edited by javiermtorres Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Correctness/error handling around the embedding call

javiermtorres commented Apr 7, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

peteski22 left a comment •

edited by javiermtorres

Loading

javiermtorres commented Jun 15, 2026 •

edited

Loading

peteski22 commented Jun 15, 2026 •

edited

Loading

javiermtorres commented Jun 15, 2026 •

edited by peteski22

Loading