server: semantic similarity queries via sqlite-vec + canonical scoring via cq-schema#247
server: semantic similarity queries via sqlite-vec + canonical scoring via cq-schema#247javiermtorres wants to merge 34 commits into
Conversation
|
@copilot check if conflicts can be solved easily |
Conflicts resolved in f7e3464. Here's what was needed:
Also added |
|
@copilot rebase from origin/main and solve conflicts. |
Done in 45f715e. The branch is now rebased on origin/main — the diff from main contains exactly our 3 intended files:
Also fixed an unintended regression where |
|
Since the SDK and server stores were split at some point during the development of this PR, the agent tried to reconcile changes and this resulted in unwanted functionality. |
45f715e to
588ae9a
Compare
|
Removed the sdk changes and kept semsearch to the server store. |
There was a problem hiding this comment.
Pull request overview
This PR adds an optional semantic-search path to the server’s SQLite-backed RemoteStore using sqlite-vec + a remote embedding endpoint, and refactors the server store/API/tests to use async store methods so vector/HTTP work can be awaited during indexing and querying.
Changes:
- Introduces
cq_server.semsearchfor embedding generation, sqlite-vec schema management, and vector distance queries. - Refactors
RemoteStoreAPIs (and FastAPI routes + tests) to beasyncand integrates semantic upsert/query behavior. - Updates server packaging/tooling to install semsearch deps via an optional extra and updates setup script target.
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| server/backend/src/cq_server/semsearch.py | New semantic search helper module (embedding fetch + sqlite-vec integration). |
| server/backend/src/cq_server/store/init.py | Integrates semsearch into RemoteStore, converts many methods to async, and changes query scoring behavior. |
| server/backend/src/cq_server/app.py | Makes API routes async and awaits store operations. |
| server/backend/src/cq_server/review.py | Makes review endpoints async where they call async store methods. |
| server/backend/src/cq_server/auth.py | Makes auth endpoints async where they call async store methods. |
| server/backend/tests/test_app.py | Converts tests to async and adds semsearch-specific query tests/expectations. |
| server/backend/tests/test_store.py | Converts tests to async and adjusts confidence expectations based on query-time scoring. |
| server/backend/tests/test_review.py | Converts tests to async to align with async auth/store calls. |
| server/backend/tests/test_queries.py | Converts tests to async to align with async store calls. |
| server/backend/tests/test_migrations.py | Converts seeding helper to async; currently has a missing await. |
| server/backend/tests/test_auth.py | Converts tests/helpers to async for async auth/store. |
| server/backend/pyproject.toml | Adds semsearch optional extra deps; adjusts pytest logging config. |
| Makefile | Installs server backend dev deps with --extra semsearch. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| row_data_by_id.get(unit.id, ()), | ||
| field_logits, | ||
| ) | ||
| unit.evidence.confidence = combined_relevance # Update confidence to combined score for sorting |
There was a problem hiding this comment.
query() mutates the returned KnowledgeUnit by overwriting unit.evidence.confidence with a query-time relevance score. This breaks the semantics of confidence (which is validated as 0..1 in the model), can exceed bounds or become negative, and can leak into persistence if a caller later passes the mutated unit back to update() (tests here do exactly that). Keep confidence unchanged and use a separate local score for sorting (or return a (unit, score) structure at an API boundary that can represent ranking).
| unit.evidence.confidence = combined_relevance # Update confidence to combined score for sorting |
There was a problem hiding this comment.
Still valid. During the merge the mutation moved to _sqlite.py:_score_semsearch_rows (unit.evidence.confidence = combined). Same concern: blends ranking score with the model's persisted 0..1 invariant, and a caller round-tripping a query() result through update() would persist the modulated value. The new test_query_returns_best_result_with_domain_overlap depends on the mutation for relative-order assertions, so resolving requires re-shaping that test too. Tracking together with #3153159271.
| for field_idx, logit_map in field_logits.items(): | ||
| if field_idx < len(row_data): | ||
| value = row_data[field_idx] | ||
| logit = logit_map.get(value, 0.0) | ||
| # Use (1.0 + logit) to keep positive: high logits boost, low logits diminish | ||
| combined *= 1.0 + logit | ||
|
|
There was a problem hiding this comment.
combined *= 1.0 + logit can flip the score negative when logit < -1 (e.g., if v/mean_val > e with invert=True). If this score is later exposed as/derived from confidence, it can violate expected non-negativity and ordering. Consider using a strictly-positive transform (e.g., combined *= exp(logit)), clamping, or otherwise bounding the adjustment so scores remain well-behaved.
There was a problem hiding this comment.
Still valid. The same multiplicative form lives in semsearch.compute_combined_relevance (moved during the merge from the deleted monolithic store/__init__.py to semsearch.py). If logit < -1 the score flips sign; needs real embeddings to trigger but matters for ranking robustness. Tracking.
| # Query returns the unit. | ||
| resp = client.get( | ||
| "/query", | ||
| params={"domains": ["api", "payments"], "languages": ["python"]}, | ||
| ) | ||
| assert len(resp.json()) == 1 | ||
| assert resp.json()[0]["evidence"]["confidence"] == 0.5 | ||
| assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.35) | ||
|
|
||
| # Confirm boosts confidence. | ||
| resp = client.post(f"/confirm/{unit_id}") | ||
| assert resp.status_code == 200 | ||
|
|
||
| resp = client.get("/query", params={"domains": ["api", "payments"]}) | ||
| assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.6) | ||
| assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.33) | ||
|
|
||
| # Flag reduces confidence. | ||
| resp = client.post(f"/flag/{unit_id}", json={"reason": "stale"}) | ||
| assert resp.status_code == 200 | ||
|
|
||
| resp = client.get("/query", params={"domains": ["api", "payments"]}) | ||
| result = resp.json()[0] | ||
| assert result["evidence"]["confidence"] == pytest.approx(0.45) | ||
| assert result["evidence"]["confidence"] == pytest.approx(0.2475) |
There was a problem hiding this comment.
These assertions assume /query returns a modified evidence.confidence that reflects query-time ranking (0.35/0.33/0.2475). That makes the test (and API behavior) dependent on semsearch enablement and on the ranking formula, and it conflates persisted evidence confidence with retrieval score. Prefer asserting the returned unit IDs/order and keep confidence checks tied to the persisted confidence updates from confirm/flag (which should stay within 0..1).
There was a problem hiding this comment.
Still partially valid. The brittle exact-value assertions are gone, but the new test_query_returns_best_result_with_domain_overlap still asserts on result['evidence']['confidence'] returned from /query, which depends on the mutation in _sqlite.py:_score_semsearch_rows. Same root cause as #3153158840; tracked together.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
server/backend/tests/test_app.py (2)
320-320:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAssertion does not verify semantic ranking.
Line 320 checks
result_confidences[ku1["id"]] >= result_confidences[ku2["id"]], but both units are inserted with the default confidence of 0.5 (from_propose_payloadand_approve_unit). The semsearch scoring multiplies relevance by confidence but does not modify the returned unit'sevidence.confidencefield. Therefore, this assertion is0.5 >= 0.5, which is trivially true and does not verify thatku1ranks aboveku2based on semantic similarity. To properly validate semantic ranking, assert thatku1is the top-ranked result.🐛 Proposed fix — verify ku1 is the top-ranked result
assert ku1["id"] in result_confidences and ku2["id"] in result_confidences assert result["domains"] == ["astronomy"] - assert result_confidences[ku1["id"]] >= result_confidences[ku2["id"]] + assert result["id"] == ku1["id"], "Expected ku1 (exoplanet transit) to rank higher than ku2 (HII enrichment) via semantic similarity"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/tests/test_app.py` at line 320, The current assertion compares result_confidences[ku1["id"]] >= result_confidences[ku2["id"]], which is trivially true because both confidences default to 0.5; instead change the test to assert semantic ranking by verifying that ku1 is the top-ranked result (e.g., assert that the first search result's id equals ku1["id"]). Update the assertion in the test (around the existing result_confidences usage and after the search call) to check the ordering of search results directly rather than comparing evidence.confidence values produced by _propose_payload/_approve_unit and used by semsearch. Ensure you reference ku1["id"] as the expected top result.
316-318:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAssertion threshold inconsistent with subsequent checks.
Line 316 asserts
len(results['data']) >= 1, but line 318 immediately requires bothku1andku2to be present inresult_confidences, implying at least 2 results are expected. This inconsistency could mask a failure where only 1 result is returned.🧹 Proposed fix
- assert len(results['data']) >= 1 + assert len(results['data']) >= 2🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/tests/test_app.py` around lines 316 - 318, The assertion expecting at least 1 result is inconsistent with the subsequent check that both ku1 and ku2 must appear; update the test to require at least 2 results (e.g., change the length assertion to assert len(results['data']) >= 2) so it aligns with the later membership checks for ku1["id"] and ku2["id"] in result_confidences, ensuring the test fails when too few results are returned.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/backend/tests/test_app.py`:
- Around line 268-272: The test is incorrectly asserting len(results) == 1 on
the full JSON response object; update the assertion to check the number of items
in the data list instead. Replace the check with an assertion on results['data']
(e.g., assert len(results['data']) == 1 or assert len(results.get('data', []))
== 1) so that the test validates the actual result count before using result =
results['data'][0]; keep the subsequent assertions on result["domains"] and
result["insight"]["summary"] unchanged.
---
Duplicate comments:
In `@server/backend/tests/test_app.py`:
- Line 320: The current assertion compares result_confidences[ku1["id"]] >=
result_confidences[ku2["id"]], which is trivially true because both confidences
default to 0.5; instead change the test to assert semantic ranking by verifying
that ku1 is the top-ranked result (e.g., assert that the first search result's
id equals ku1["id"]). Update the assertion in the test (around the existing
result_confidences usage and after the search call) to check the ordering of
search results directly rather than comparing evidence.confidence values
produced by _propose_payload/_approve_unit and used by semsearch. Ensure you
reference ku1["id"] as the expected top result.
- Around line 316-318: The assertion expecting at least 1 result is inconsistent
with the subsequent check that both ku1 and ku2 must appear; update the test to
require at least 2 results (e.g., change the length assertion to assert
len(results['data']) >= 2) so it aligns with the later membership checks for
ku1["id"] and ku2["id"] in result_confidences, ensuring the test fails when too
few results are returned.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c57f750a-53c2-41b2-ba15-838985affffc
📒 Files selected for processing (2)
server/backend/src/cq_server/semsearch/queries.pyserver/backend/tests/test_app.py
|
Note Docstrings generation - SUCCESS |
Docstrings generation was requested by @javiermtorres. The following files were modified: * `server/backend/src/cq_server/core/db.py` * `server/backend/src/cq_server/repositories/knowledge.py` * `server/backend/src/cq_server/semsearch/__init__.py` * `server/backend/src/cq_server/semsearch/queries.py` * `server/backend/tests/conftest.py` * `server/backend/tests/db_helpers.py` * `server/backend/tests/test_semsearch.py` * `server/backend/tests/test_sqlite_store.py` * `server/backend/tests/test_store.py` These files were kept as they were: * `server/backend/tests/test_app.py` These file types are not supported: * `Makefile` * `server/backend/README.md` * `server/backend/pyproject.toml`
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
server/backend/pyproject.toml (1)
43-49:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRemove the duplicated
jsonschematest dependency entry.
testsdeclaresjsonschema>=4.23.0twice (Line 44 and Line 48). Keep one entry to avoid redundant dependency resolution noise and lock churn.Proposed fix
tests = [ "jsonschema>=4.23.0", "pytest>=9.0.3", "pytest-asyncio>=1.3.0", "httpx", - "jsonschema>=4.23.0", ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/pyproject.toml` around lines 43 - 49, In the tests dependency list (the "tests" table entry) remove the duplicate "jsonschema>=4.23.0" so the dependency appears only once; locate the tests = [...] block and delete the redundant "jsonschema>=4.23.0" string, leaving a single instance to avoid duplicate dependency entries and unnecessary lock churn.server/backend/src/cq_server/semsearch/__init__.py (1)
29-29:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAvoid logging potentially sensitive URL.
If
TOKEN_EMBEDDING_URLcontains authentication credentials (e.g.,http://user:pass@host/), they will be logged at INFO level. Consider sanitising the URL before logging or logging only the host portion.🛡️ Proposed fix
- logger.info(f"Token embedding enabled using encoderfile endpoint at {_TOKEN_EMBEDDING_URL}") + from urllib.parse import urlparse + parsed = urlparse(_TOKEN_EMBEDDING_URL) + safe_url = f"{parsed.scheme}://{parsed.hostname}" + (f":{parsed.port}" if parsed.port else "") + logger.info(f"Token embedding enabled using encoderfile endpoint at {safe_url}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/semsearch/__init__.py` at line 29, The log statement prints the full _TOKEN_EMBEDDING_URL which may contain credentials; update the logging in __init__.py to sanitize the URL before logging by parsing _TOKEN_EMBEDDING_URL (use urllib.parse.urlparse) and log only the safe parts (hostname and port or reconstructed URL without username/password) or a redacted form, replacing credentials with placeholders; change the logger.info call that references _TOKEN_EMBEDDING_URL to use the sanitized/redacted value instead.server/backend/src/cq_server/core/db.py (1)
18-19: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueBoth files import private
_ENABLEDinstead of using the publicis_enabled()API. The semsearch module exposesis_enabled()as its public interface for checking enablement status. Bothdb.pyandknowledge.pyimport the private_ENABLEDvariable directly, creating unnecessary coupling to the module's internal implementation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/core/db.py` around lines 18 - 19, Replace direct imports of the private _ENABLED symbol with the semsearch public API is_enabled() — e.g., remove "from ..semsearch import _ENABLED" and instead import and call "is_enabled" where enable checks occur (use "from ..semsearch import is_enabled as semsearch_is_enabled" or call semsearch.is_enabled()). Update both db.py (references to _ENABLED and any conditional logic around semsearch_load) and knowledge.py to call semsearch_is_enabled() (or semsearch.is_enabled()) before using semsearch_load/other features so you rely on the public API rather than the private _ENABLED variable.server/backend/src/cq_server/semsearch/queries.py (2)
108-108: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueRemove stale TODO comment.
The comment
# check type of sqlalchemy connsappears to be a leftover reminder that was previously flagged. The type annotation has been added toconn: Connection, so this comment should be removed.🧹 Proposed fix
-# check type of sqlalchemy conns async def query(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/semsearch/queries.py` at line 108, Remove the stale TODO comment "# check type of sqlalchemy conns" from the code near the Connection-typed variable (conn: Connection) in queries.py; since the type annotation is already present on conn, simply delete that comment line to avoid misleading TODOs and leave the surrounding code and the conn: Connection annotation unchanged.
53-56:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winEmpty embeddings list causes
np.averageto produce unexpected results.If
results[0]["embeddings"]is an empty list (service returns a result but with no embeddings),np.averageon an empty array will issue a warning and returnnan, which then propagates to the serialisation and SQL insertion. Consider validating that embeddings are present.🛡️ Proposed fix
results = response.json().get("results") if not results: raise RuntimeError(f"Embedding API returned no embeddings for input: {request_data}") + embeddings_list = results[0].get("embeddings", []) + if not embeddings_list: + raise RuntimeError(f"Embedding API returned empty embeddings for input: {request_data}") - return np.average(np.array([embedding.get("embedding") for embedding in results[0]["embeddings"]]), axis=0) + return np.average(np.array([embedding.get("embedding") for embedding in embeddings_list]), axis=0)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/semsearch/queries.py` around lines 53 - 56, The code currently computes np.average over results[0]["embeddings"] which can be an empty list and yield NaN; update the logic in the function that handles the API response (look for variables/results usage: response.json(), results, request_data, and the return line using np.average) to explicitly validate that results is non-empty and that results[0]["embeddings"] is a non-empty list before calling np.average; if embeddings is empty, raise a RuntimeError (including request_data and the raw response/results for debugging) so downstream serialization/SQL insertion never receives NaN.server/backend/tests/test_semsearch.py (1)
218-221:⚠️ Potential issue | 🟡 MinorNarrow the skip exception scope for
_get_embeddingsto only embedding/service failuresCatching
Exceptionhere can mask genuine semsearch regressions (e.g., unexpected response/JSON/embedding structure issues)._get_embeddingsraisesRuntimeErrorfor “not enabled”/“no embeddings”, and otherwise relies onhttpxto raise its own request/HTTP status errors—so skip only on those instead of all exceptions.import httpx try: await semsearch_queries._get_embeddings(["connectivity check"]) except (RuntimeError, httpx.RequestError, httpx.HTTPStatusError) as exc: pytest.skip(f"embedding server unavailable: {exc}")Also consider aligning
server/backend/tests/test_app.py—it currently catches built-inConnectionError/TimeoutError, which won’t matchhttpx’s exception hierarchy.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/tests/test_semsearch.py` around lines 218 - 221, Narrow the broad except in the test that calls semsearch_queries._get_embeddings so it only skips when the embedding/service is unavailable: catch RuntimeError and httpx request/HTTP status errors instead of Exception (e.g., catch RuntimeError, httpx.RequestError, httpx.HTTPStatusError). Update the related test_app.py handling (which currently catches ConnectionError/TimeoutError) to use httpx exception types so both tests correctly distinguish service availability issues from real semsearch regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/backend/tests/conftest.py`:
- Around line 77-85: Fix the mixed tab/space indentation in the fixture
docstrings in conftest.py by converting all leading tabs to spaces (or vice
versa to match project style) so indentation is consistent; specifically update
the docstrings for the fixture that provides the UserRepository (the function
accepting repos (_RepoBundle) and returning UserRepository) and the other nearby
test fixtures to use only spaces for indentation so Ruff E101 is resolved.
In `@server/backend/tests/db_helpers.py`:
- Line 157: Several parameter docstring lines (e.g., the "unit (KnowledgeUnit):
The knowledge unit with updated fields; must correspond to an existing record in
the database." entry and other long lines at the indicated ranges) exceed the
120-char limit and must be wrapped; edit the docstrings in tests/db_helpers.py
(the docstring containing the "unit (KnowledgeUnit)" parameter and the other
long docstrings at the noted ranges) to reflow text to lines under 120
characters, preserving the existing triple-quote block and parameter
formatting/indentation, breaking long sentences into wrapped lines or separate
sentences as needed, and then run the linter (ruff) to confirm E501 is resolved.
- Around line 194-203: The docstring for get_review_status contains mixed tabs
and spaces causing Ruff E101; fix by converting all leading tab characters to
spaces (use 4-space indentation per PEP8) inside the triple-quoted string for
get_review_status and any other nearby docstrings flagged (the docstring blocks
around the later helper functions referenced in the comment), ensuring
consistent spacing throughout so no tabs remain.
In `@server/backend/tests/test_semsearch.py`:
- Line 35: The docstring line describing KnowledgeUnit is too long (Ruff E501);
split the sentence into multiple shorter lines so each stays within the
project's line-length limit. Edit the docstring that contains "KnowledgeUnit: A
newly created KnowledgeUnit configured for tests (private tier, empty Context,
created_by=\"tester\")." and break it into two or more logical segments (for
example after "private tier," or "empty Context,") while preserving punctuation
and meaning so the linter no longer flags E501.
In `@server/backend/tests/test_sqlite_store.py`:
- Around line 21-25: The docstring for the helper that "Builds a KnowledgeUnit
test fixture for the given domain" in server/backend/tests/test_sqlite_store.py
is too long and triggers Ruff E501; reflow it into multiple shorter lines (wrap
sentences so each line is under the configured line-length limit) and preserve
the content (mention KnowledgeUnit, Insight payload summary="s", detail="d",
action="a", empty Context, tier Tier.PRIVATE, created_by "alice") so the helper
docstring stays descriptive but passes linting.
In `@server/backend/tests/test_store.py`:
- Around line 29-34: Wrap the long docstring line in the "Create an Insight with
sensible default values" docstring (the paragraph that describes Parameters and
the overrides (Any) argument) so no single line exceeds the line-length limit;
break the sentence into multiple lines or sentences (e.g., after "defaults." and
before "Parameters:" or within the "overrides (Any):" description) to satisfy
Ruff E501 while keeping the same wording and meaning.
---
Outside diff comments:
In `@server/backend/pyproject.toml`:
- Around line 43-49: In the tests dependency list (the "tests" table entry)
remove the duplicate "jsonschema>=4.23.0" so the dependency appears only once;
locate the tests = [...] block and delete the redundant "jsonschema>=4.23.0"
string, leaving a single instance to avoid duplicate dependency entries and
unnecessary lock churn.
In `@server/backend/src/cq_server/core/db.py`:
- Around line 18-19: Replace direct imports of the private _ENABLED symbol with
the semsearch public API is_enabled() — e.g., remove "from ..semsearch import
_ENABLED" and instead import and call "is_enabled" where enable checks occur
(use "from ..semsearch import is_enabled as semsearch_is_enabled" or call
semsearch.is_enabled()). Update both db.py (references to _ENABLED and any
conditional logic around semsearch_load) and knowledge.py to call
semsearch_is_enabled() (or semsearch.is_enabled()) before using
semsearch_load/other features so you rely on the public API rather than the
private _ENABLED variable.
In `@server/backend/src/cq_server/semsearch/__init__.py`:
- Line 29: The log statement prints the full _TOKEN_EMBEDDING_URL which may
contain credentials; update the logging in __init__.py to sanitize the URL
before logging by parsing _TOKEN_EMBEDDING_URL (use urllib.parse.urlparse) and
log only the safe parts (hostname and port or reconstructed URL without
username/password) or a redacted form, replacing credentials with placeholders;
change the logger.info call that references _TOKEN_EMBEDDING_URL to use the
sanitized/redacted value instead.
In `@server/backend/src/cq_server/semsearch/queries.py`:
- Line 108: Remove the stale TODO comment "# check type of sqlalchemy conns"
from the code near the Connection-typed variable (conn: Connection) in
queries.py; since the type annotation is already present on conn, simply delete
that comment line to avoid misleading TODOs and leave the surrounding code and
the conn: Connection annotation unchanged.
- Around line 53-56: The code currently computes np.average over
results[0]["embeddings"] which can be an empty list and yield NaN; update the
logic in the function that handles the API response (look for variables/results
usage: response.json(), results, request_data, and the return line using
np.average) to explicitly validate that results is non-empty and that
results[0]["embeddings"] is a non-empty list before calling np.average; if
embeddings is empty, raise a RuntimeError (including request_data and the raw
response/results for debugging) so downstream serialization/SQL insertion never
receives NaN.
In `@server/backend/tests/test_semsearch.py`:
- Around line 218-221: Narrow the broad except in the test that calls
semsearch_queries._get_embeddings so it only skips when the embedding/service is
unavailable: catch RuntimeError and httpx request/HTTP status errors instead of
Exception (e.g., catch RuntimeError, httpx.RequestError, httpx.HTTPStatusError).
Update the related test_app.py handling (which currently catches
ConnectionError/TimeoutError) to use httpx exception types so both tests
correctly distinguish service availability issues from real semsearch
regressions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cdbfde9b-c377-4bcb-99e4-b5e6ff37c277
⛔ Files ignored due to path filters (1)
server/backend/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (11)
server/backend/pyproject.tomlserver/backend/src/cq_server/core/db.pyserver/backend/src/cq_server/repositories/knowledge.pyserver/backend/src/cq_server/semsearch/__init__.pyserver/backend/src/cq_server/semsearch/queries.pyserver/backend/tests/conftest.pyserver/backend/tests/db_helpers.pyserver/backend/tests/test_app.pyserver/backend/tests/test_semsearch.pyserver/backend/tests/test_sqlite_store.pyserver/backend/tests/test_store.py
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
server/backend/tests/test_semsearch.py (1)
129-146:⚠️ Potential issue | 🟠 Major | ⚡ Quick winInstall semsearch monkeypatches before the first
store.insert()in mocked-path tests.These tests currently seed data before patching
_SEMSEARCH_ENABLED/sem_insert_unit/sem_query, so in semsearch-enabled environments they can execute real embedding calls and become flaky.💡 Suggested fix pattern
async def test_query_calls_sem_query_when_semsearch_enabled(...): - unit = _make_unit("astronomy", summary="Exoplanet transit photometry") - store = _make_store(db_path) - await store.insert(unit) - await store.set_review_status(unit.id, "approved", "reviewer") - import cq_server.repositories.knowledge as knowledge_mod monkeypatch.setattr(knowledge_mod, "_SEMSEARCH_ENABLED", True) mock_sem_query = AsyncMock(return_value=[unit]) monkeypatch.setattr(knowledge_mod, "sem_query", mock_sem_query) monkeypatch.setattr(knowledge_mod, "sem_insert_unit", AsyncMock()) + + unit = _make_unit("astronomy", summary="Exoplanet transit photometry") + store = _make_store(db_path) + await store.insert(unit) + await store.set_review_status(unit.id, "approved", "reviewer")Apply the same ordering in the other mocked-path tests in this class.
Also applies to: 153-172, 178-193
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/tests/test_semsearch.py` around lines 129 - 146, The test seeds data with store.insert before enabling the semsearch monkeypatches, which can cause real embedding calls; move the monkeypatch.setattr calls that set knowledge_mod._SEMSEARCH_ENABLED, knowledge_mod.sem_query, and knowledge_mod.sem_insert_unit so they occur before the first call to store.insert (and before any other operations that would trigger semsearch), e.g. set the monkeypatches at the top of test_query_calls_sem_query_when_semsearch_enabled (and apply the same reordering to the other mocked-path tests in this file that seed data: the tests around lines 153-172 and 178-193) so the mocked sem_query/sem_insert_unit are in effect during store.insert and subsequent operations.server/backend/src/cq_server/semsearch/queries.py (3)
73-90:⚠️ Potential issue | 🟠 Major | ⚡ Quick winWire
upsert_unit()into the repository update path as well.This helper is added here, but the changed repository integration only imports
insert_unitandcombined_query. Edited summaries/details/actions will therefore keep their old embedding indefinitely.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/semsearch/queries.py` around lines 73 - 90, The repository update flow currently imports insert_unit and combined_query but never calls upsert_unit, so edited summaries/details/actions never refresh their embeddings; update the repository code that performs unit inserts/updates to also import and await upsert_unit(unit) after the DB write (the same places you call insert_unit or perform updates to a KnowledgeUnit), ensuring you call the async upsert_unit function for edits (not just creates), and wrap the call in a short try/except to log failures without breaking the main transaction.
245-262:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThe distance term is rewarding less similar vectors.
vec_distance_cosineis lower for better matches, butcombine()adds the normalised distance and then sorts descending. That boosts the farthest units instead of the closest ones.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/semsearch/queries.py` around lines 245 - 262, The combine() function currently adds normalized distance so lower (better) vec_distance_cosine values reduce the combined score, which incorrectly rewards farther vectors; change the distance contribution to reward smaller distances by using an inverted term (e.g., 1 - (distance / total_distance) when total_distance > 0) or subtracting the normalized distance from relevance instead of adding it; update combine to compute normalized_distance = distance / total_distance if total_distance > 0 else 0 and then use distance_score = 1 - normalized_distance (or equivalently flip the sign) when weighting with distance_weight so that closer vectors (smaller vec_distance_cosine) increase the final score.
55-62:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAverage embeddings across every requested input.
The request sends the whole
wordlist, but the response handling only readsresults[0]["embeddings"]. Any multi-domain query therefore embeds just the first input and silently drops the rest.Suggested fix
results = response.json().get("results") if not results: raise RuntimeError(f"Embedding API returned no embeddings for input: {request_data}") - return np.average(np.array([embedding.get("embedding") for embedding in results[0]["embeddings"]]), axis=0) + embeddings = [ + embedding["embedding"] + for result in results + for embedding in result.get("embeddings", []) + if embedding.get("embedding") is not None + ] + if not embeddings: + raise RuntimeError(f"Embedding API returned no embeddings for input: {request_data}") + return np.average(np.array(embeddings), axis=0)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/semsearch/queries.py` around lines 55 - 62, The code posts the full wordlist but only reads results[0]["embeddings"], dropping embeddings for other inputs; update the response handling to collect embeddings from every result entry (e.g., iterate over results and within each result over result["embeddings"], extracting embedding.get("embedding")), validate the combined list is non-empty, and then compute the mean across all collected embedding vectors with np.mean(axis=0) before returning; refer to AsyncClient, _TOKEN_EMBEDDING_URL, request_data, response, and results to locate where to change the comprehension that currently uses results[0]["embeddings"].server/backend/src/cq_server/repositories/knowledge.py (1)
76-83:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftAvoid returning a failure after the main insert has already committed.
_insert_sync()commits the canonical KnowledgeUnit before the semsearch write runs. If the embedding fetch or vector insert fails afterwards, the caller sees an exception even though the unit already exists, so retries can trip duplicate-key errors and the vector index is left behind.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/repositories/knowledge.py` around lines 76 - 83, The code currently commits the canonical KnowledgeUnit in _insert_sync via await self._db.run_sync(self._insert_sync, unit) and then runs sem_insert_unit in a separate transaction, which can leave the DB in an inconsistent state if semsearch fails; change the flow so sem_insert_unit runs inside the same transaction as the main insert or at least does not raise after the main commit: move or call sem_insert_unit from within _insert_sync (or obtain the same transaction/connection and call sem_insert_unit before committing) so both operations share the same transactional scope (use the same connection/transaction object instead of self._db.engine.begin()), or if co-transactional execution is impossible, catch/log semsearch errors and suppress exceptions so the caller is not signaled a failure after the unit has been committed; reference _insert_sync, sem_insert_unit, _db.run_sync and _SEMSEARCH_ENABLED when implementing the change.
♻️ Duplicate comments (8)
server/backend/tests/test_app.py (2)
268-271:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAssert result count from
data, not envelope keys.
len(results) == 1validates the response object shape, not the number of returned units.💡 Suggested fix
- results = resp.json() - assert len(results) == 1 - result = results["data"][0] + results = resp.json() + assert len(results["data"]) == 1 + result = results["data"][0]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/tests/test_app.py` around lines 268 - 271, The test currently asserts len(results) == 1 which checks the top-level response envelope instead of the number of returned items; update the assertion to check the length of the "data" list from resp.json() (i.e., assert the count using results["data"]) before accessing results["data"][0], so the test validates the number of returned units and then asserts result["domains"] == ["astronomy"].
315-320:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCurrent check does not validate semantic ranking.
Comparing
result["evidence"]["confidence"]betweenku1andku2is not a ranking assertion here; both seeded units can share the same default confidence, so this can pass without proving semantic ordering.💡 Suggested fix
- assert len(results["data"]) >= 1 + assert len(results["data"]) >= 2 result = results["data"][0] assert ku1["id"] in result_confidences and ku2["id"] in result_confidences assert result["domains"] == ["astronomy"] - assert result_confidences[ku1["id"]] >= result_confidences[ku2["id"]] + assert result["id"] == ku1["id"]server/backend/src/cq_server/semsearch/__init__.py (3)
111-114:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAlways re-disable extension loading.
If
sqlite_vec.load(conn)fails here, the connection stays inenable_load_extension(True)state. Wrap the load/schema path intry/finallyso later code cannot load arbitrary extensions on that pooled connection.Suggested fix
conn.enable_load_extension(True) - sqlite_vec.load(conn) - conn.enable_load_extension(False) - ensure_schema(conn) + try: + sqlite_vec.load(conn) + ensure_schema(conn) + finally: + conn.enable_load_extension(False)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/semsearch/__init__.py` around lines 111 - 114, The connection is left with extension loading enabled if sqlite_vec.load(conn) raises; wrap the load and ensure_schema calls in a try/finally so conn.enable_load_extension(False) always runs. Specifically, around the calls to sqlite_vec.load(conn) and ensure_schema(conn) ensure you call conn.enable_load_extension(False) in a finally block while keeping conn.enable_load_extension(True) before the try; this guarantees extension loading is re-disabled on the connection even if sqlite_vec.load fails.
16-17:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard
SEMSEARCH_EMBEDDING_DIMparsing.
int()still runs at import time, so a non-numeric env value brings the whole server down before semsearch has a chance to stay disabled.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/semsearch/__init__.py` around lines 16 - 17, The module currently calls int(os.environ.get("SEMSEARCH_EMBEDDING_DIM", 768)) at import time (symbol _DIM), which will raise on non-numeric env values even when semsearch is disabled; change it to safely parse the environment variable (e.g., read raw value then try/except ValueError or only parse when _ENABLED is True), falling back to 768 on parse failure or missing value, and ensure the fallback/default is used without throwing during import.
31-34:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix the install hint in the warning.
This message still tells operators to install
cq-sdk[embedding], but this feature is exposed via the server'ssemsearchextra, so the recovery path is wrong.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/semsearch/__init__.py` around lines 31 - 34, The warning logged in semsearch/__init__.py (the logger.warning call referencing TOKEN_EMBEDDING_URL) has the wrong recovery hint; update the message text to tell operators to install the server package with the semsearch extra (e.g. suggest "pip install cq-server[semsearch]" or the appropriate server package name with the 'semsearch' extra) instead of "cq-sdk[embedding]"; keep the rest of the message intact and only replace the install hint.server/backend/src/cq_server/repositories/knowledge.py (1)
81-83:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftKeep semsearch I/O off borrowed DB connections and out of the event-loop thread.
Both branches open a synchronous SQLAlchemy connection inside an async method and then await a helper that performs remote I/O before issuing sync
conn.execute()calls. That blocks the event loop on DB work and keeps the pooled connection/transaction pinned across the embedding round-trip.#!/bin/bash set -e rg -n -C2 'with self\._db\.engine\.(begin|connect)\(\) as conn|await sem_(insert_unit|query)\(' server/backend/src/cq_server/repositories/knowledge.pyExpected result: both
insert()andquery()show directEngineusage inside async methods, bypassingDatabase.run_sync(...).Also applies to: 109-113
server/backend/src/cq_server/semsearch/queries.py (2)
227-233:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse the normalised domains for the query embedding too.
The SQL filter uses
normalized, but the embedding still uses rawdomains; case/whitespace variants can therefore search one concept and filter another.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/semsearch/queries.py` around lines 227 - 233, The code normalizes domains via normalize_domains() but still passes the original domains into _get_embeddings, causing mismatch between embedding and SQL filter; change the call to _get_embeddings to use the normalized variable (and propagate that normalized value into any subsequent embedding serialization/_serialize_embedding and args construction) so both the query_embedding and the "domains" filter use the same normalized domains (keep references: normalize_domains, normalized, _get_embeddings, _serialize_embedding, args).
333-336:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClamp zero ratios before taking
log().A valid
0value makesratio == 0when the field mean is positive, somath.log(ratio)raises. Zero cosine distance is a normal best-match case, so this helper can crash on perfect matches.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/src/cq_server/semsearch/queries.py` around lines 333 - 336, The loop computing logit can raise on ratio==0 (perfect match); clamp the computed ratio to a tiny positive epsilon before calling math.log. In the block iterating over values (variables: values, mean_val, ratio, logit, invert, field_logits, field_idx), compute ratio as before but then set clamped_ratio = max(ratio, 1e-12) (or similar small constant) and use clamped_ratio in the math.log call (respecting the invert flag) before assigning into field_logits[field_idx][v].
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/ci-server-embeddings.yaml:
- Line 7: Update the self-trigger path filters in the GitHub Actions workflow so
they match this workflow's filename: in
.github/workflows/ci-server-embeddings.yaml replace any occurrences of
".github/workflows/ci-server.yaml" with
".github/workflows/ci-server-embeddings.yaml" (these are the path filter entries
that control self-triggering of this workflow).
- Line 25: Replace every tag-pinned GitHub Action reference (e.g., the uses:
actions/checkout@v6 entry and the other uses: lines at the noted locations) with
the corresponding full commit SHA for the action repository; for each uses:
value (actions/checkout, and the other actions referenced at lines called out)
look up the desired release tag’s commit SHA on the action's GitHub repo and
substitute `@vX` or `@main` with the full 40-character commit SHA so all uses:
entries are SHA-pinned.
- Line 25: The actions/checkout usage in the workflow currently omits disabling
credential persistence; update every checkout step that uses
"actions/checkout@v6" (all occurrences) to include a with block setting
persist-credentials: false so the step becomes actions/checkout@v6 with
persist-credentials: false to prevent token exposure to downstream steps or
artifacts.
- Around line 102-104: Start the encoder in background (the
modernbert.x86_64-unknown-linux-gnu.encoderfile serve command) then poll the
TOKEN_EMBEDDING_URL (http://127.0.0.1:8080) with a short loop and timeout to
ensure it returns a successful response (HTTP 200 or health/readiness endpoint)
before invoking make setup-server test-server; fail the job if the service
doesn't become ready within the timeout so tests don't run against an unready
encoder.
---
Outside diff comments:
In `@server/backend/src/cq_server/repositories/knowledge.py`:
- Around line 76-83: The code currently commits the canonical KnowledgeUnit in
_insert_sync via await self._db.run_sync(self._insert_sync, unit) and then runs
sem_insert_unit in a separate transaction, which can leave the DB in an
inconsistent state if semsearch fails; change the flow so sem_insert_unit runs
inside the same transaction as the main insert or at least does not raise after
the main commit: move or call sem_insert_unit from within _insert_sync (or
obtain the same transaction/connection and call sem_insert_unit before
committing) so both operations share the same transactional scope (use the same
connection/transaction object instead of self._db.engine.begin()), or if
co-transactional execution is impossible, catch/log semsearch errors and
suppress exceptions so the caller is not signaled a failure after the unit has
been committed; reference _insert_sync, sem_insert_unit, _db.run_sync and
_SEMSEARCH_ENABLED when implementing the change.
In `@server/backend/src/cq_server/semsearch/queries.py`:
- Around line 73-90: The repository update flow currently imports insert_unit
and combined_query but never calls upsert_unit, so edited
summaries/details/actions never refresh their embeddings; update the repository
code that performs unit inserts/updates to also import and await
upsert_unit(unit) after the DB write (the same places you call insert_unit or
perform updates to a KnowledgeUnit), ensuring you call the async upsert_unit
function for edits (not just creates), and wrap the call in a short try/except
to log failures without breaking the main transaction.
- Around line 245-262: The combine() function currently adds normalized distance
so lower (better) vec_distance_cosine values reduce the combined score, which
incorrectly rewards farther vectors; change the distance contribution to reward
smaller distances by using an inverted term (e.g., 1 - (distance /
total_distance) when total_distance > 0) or subtracting the normalized distance
from relevance instead of adding it; update combine to compute
normalized_distance = distance / total_distance if total_distance > 0 else 0 and
then use distance_score = 1 - normalized_distance (or equivalently flip the
sign) when weighting with distance_weight so that closer vectors (smaller
vec_distance_cosine) increase the final score.
- Around line 55-62: The code posts the full wordlist but only reads
results[0]["embeddings"], dropping embeddings for other inputs; update the
response handling to collect embeddings from every result entry (e.g., iterate
over results and within each result over result["embeddings"], extracting
embedding.get("embedding")), validate the combined list is non-empty, and then
compute the mean across all collected embedding vectors with np.mean(axis=0)
before returning; refer to AsyncClient, _TOKEN_EMBEDDING_URL, request_data,
response, and results to locate where to change the comprehension that currently
uses results[0]["embeddings"].
In `@server/backend/tests/test_semsearch.py`:
- Around line 129-146: The test seeds data with store.insert before enabling the
semsearch monkeypatches, which can cause real embedding calls; move the
monkeypatch.setattr calls that set knowledge_mod._SEMSEARCH_ENABLED,
knowledge_mod.sem_query, and knowledge_mod.sem_insert_unit so they occur before
the first call to store.insert (and before any other operations that would
trigger semsearch), e.g. set the monkeypatches at the top of
test_query_calls_sem_query_when_semsearch_enabled (and apply the same reordering
to the other mocked-path tests in this file that seed data: the tests around
lines 153-172 and 178-193) so the mocked sem_query/sem_insert_unit are in effect
during store.insert and subsequent operations.
---
Duplicate comments:
In `@server/backend/src/cq_server/semsearch/__init__.py`:
- Around line 111-114: The connection is left with extension loading enabled if
sqlite_vec.load(conn) raises; wrap the load and ensure_schema calls in a
try/finally so conn.enable_load_extension(False) always runs. Specifically,
around the calls to sqlite_vec.load(conn) and ensure_schema(conn) ensure you
call conn.enable_load_extension(False) in a finally block while keeping
conn.enable_load_extension(True) before the try; this guarantees extension
loading is re-disabled on the connection even if sqlite_vec.load fails.
- Around line 16-17: The module currently calls
int(os.environ.get("SEMSEARCH_EMBEDDING_DIM", 768)) at import time (symbol
_DIM), which will raise on non-numeric env values even when semsearch is
disabled; change it to safely parse the environment variable (e.g., read raw
value then try/except ValueError or only parse when _ENABLED is True), falling
back to 768 on parse failure or missing value, and ensure the fallback/default
is used without throwing during import.
- Around line 31-34: The warning logged in semsearch/__init__.py (the
logger.warning call referencing TOKEN_EMBEDDING_URL) has the wrong recovery
hint; update the message text to tell operators to install the server package
with the semsearch extra (e.g. suggest "pip install cq-server[semsearch]" or the
appropriate server package name with the 'semsearch' extra) instead of
"cq-sdk[embedding]"; keep the rest of the message intact and only replace the
install hint.
In `@server/backend/src/cq_server/semsearch/queries.py`:
- Around line 227-233: The code normalizes domains via normalize_domains() but
still passes the original domains into _get_embeddings, causing mismatch between
embedding and SQL filter; change the call to _get_embeddings to use the
normalized variable (and propagate that normalized value into any subsequent
embedding serialization/_serialize_embedding and args construction) so both the
query_embedding and the "domains" filter use the same normalized domains (keep
references: normalize_domains, normalized, _get_embeddings,
_serialize_embedding, args).
- Around line 333-336: The loop computing logit can raise on ratio==0 (perfect
match); clamp the computed ratio to a tiny positive epsilon before calling
math.log. In the block iterating over values (variables: values, mean_val,
ratio, logit, invert, field_logits, field_idx), compute ratio as before but then
set clamped_ratio = max(ratio, 1e-12) (or similar small constant) and use
clamped_ratio in the math.log call (respecting the invert flag) before assigning
into field_logits[field_idx][v].
In `@server/backend/tests/test_app.py`:
- Around line 268-271: The test currently asserts len(results) == 1 which checks
the top-level response envelope instead of the number of returned items; update
the assertion to check the length of the "data" list from resp.json() (i.e.,
assert the count using results["data"]) before accessing results["data"][0], so
the test validates the number of returned units and then asserts
result["domains"] == ["astronomy"].
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: b2539dd6-dfb1-488f-81df-41535f850133
📒 Files selected for processing (11)
.github/workflows/ci-server-embeddings.yamlserver/backend/src/cq_server/core/db.pyserver/backend/src/cq_server/repositories/knowledge.pyserver/backend/src/cq_server/semsearch/__init__.pyserver/backend/src/cq_server/semsearch/queries.pyserver/backend/tests/conftest.pyserver/backend/tests/db_helpers.pyserver/backend/tests/test_app.pyserver/backend/tests/test_semsearch.pyserver/backend/tests/test_sqlite_store.pyserver/backend/tests/test_store.py
| name: Validate schemas | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - uses: actions/checkout@v6 |
There was a problem hiding this comment.
Pin all GitHub Actions to full commit SHAs.
All uses: entries are tag-based (@v*) rather than SHA-pinned. This violates the stated policy and weakens supply-chain guarantees for CI execution.
Also applies to: 27-27, 36-36, 38-38, 42-42, 46-46, 63-63, 65-65, 70-70, 74-74, 82-82
🧰 Tools
🪛 zizmor (1.25.2)
[warning] 25-25: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false
(artipacked)
[error] 25-25: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/ci-server-embeddings.yaml at line 25, Replace every
tag-pinned GitHub Action reference (e.g., the uses: actions/checkout@v6 entry
and the other uses: lines at the noted locations) with the corresponding full
commit SHA for the action repository; for each uses: value (actions/checkout,
and the other actions referenced at lines called out) look up the desired
release tag’s commit SHA on the action's GitHub repo and substitute `@vX` or `@main`
with the full 40-character commit SHA so all uses: entries are SHA-pinned.
Source: Linters/SAST tools
Disable checkout credential persistence in all jobs.
Each actions/checkout step should set persist-credentials: false to reduce token exposure risk in downstream steps/artifacts.
Suggested patch pattern
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v6
+ with:
+ persist-credentials: falseAlso applies to: 36-36, 63-63
🧰 Tools
🪛 zizmor (1.25.2)
[warning] 25-25: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false
(artipacked)
[error] 25-25: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/ci-server-embeddings.yaml at line 25, The actions/checkout
usage in the workflow currently omits disabling credential persistence; update
every checkout step that uses "actions/checkout@v6" (all occurrences) to include
a with block setting persist-credentials: false so the step becomes
actions/checkout@v6 with persist-credentials: false to prevent token exposure to
downstream steps or artifacts.
Source: Linters/SAST tools
Added a GH action to check the encoderfile. There's no need to dockerize it in the actions themselves. The user can later decide the specific deployment.
Every check goes smooth now :) |
There was a problem hiding this comment.
Thanks for sticking with this @javiermtorres; the refactor onto core/db.py and repositories/ is looking good and resolves most of the earlier feedback (the confidence mutation is gone, the combined query now has a LIMIT, the deps are pinned, the env vars are documented, and the conditional imports fixed the ImportError-without-the-extra problem). 🙏🏼
Since the restructure moved everything around, the old inline feedback comments are hard to follow, so here's a single consolidated checklist of what's left. All of it stems from this PR, so I'd like to land it complete rather than split out ... genuine future enhancements can be tracked separately (see the footer).
Correctness/error handling around the embedding call
- Domain-only fallback.
combined_querynow raises on an embedding or SQL error (queries.py:236-241) and the caller doesn't catch it (knowledge.py:109-113), so a transient embedding-service hiccup returns a 500 instead of degrading to the SQL-only path. - Partial failure on insert.
insert()commits the unit, then embeds in a separate transaction (knowledge.py:80-83). If embedding fails, the unit is persisted but has no vector row; invisible to semanticquery()(which JOINsknowledge_units_vec) with no re-index path, and the caller still gets a 500. The unit and its embedding should land or roll back together. (This also answers the:76-79FIXME; and that comment has a couple of typos worth fixing while it's touched.) -
update()doesn't refresh the embedding. It only calls_update_sync(knowledge.py:123-125);upsert_unitexists but is never wired in, so an edited unit keeps a stale vector.README.md:97already claims insert and update write an embedding row — so docs and behaviour need reconciling. - Domain normalization mismatch.
combined_queryembeds the rawdomains(queries.py:231) but filters onnormalized(:233), unlikequery()which embedsnormalized. Use the normalized form for both.
Dead code (all introduced in this PR)
-
build_field_logits,compute_combined_relevance, andqueries.queryare never called outside tests — the live path uses the inlinecombine(). Remove them (or wire them in). Notecompute_combined_relevance's docstring claims it "keeps the combined score positive," which isn't true (combined *= 1.0 + logitflips sign forlogit < -1). -
upsert_unitis currently unused, either wire it intoupdate()(above) or remove it.
Tests
- The new
test_semsearch.pyexercises the dead helpers, but the live scoring path (combined_query/combine()) only has the skip-gated e2e test, so it has no coverage in normal CI. Once the dead helpers are removed their tests should go too — please add an embedding-free test that mocks_get_embeddingsand asserts the real ranking.
Cleanup (all introduced in this PR)
- Remove
server/backend/.DS_Store(committed) and add it to.gitignore. - Drop the per-insert
logger.info(knowledge.py:75). - Fix the
ImportErrormessage atsemsearch/__init__.py:34— it sayspip install cq-sdk[embedding], but the package iscq-serverand the extra issemsearch(cq-server[semsearch]). - Revert the pytest logging config —
log_cli = truenow prints DEBUG on every run, which contradicts the "only with-vv" intent from earlier;jsonschemais also pinned twice in thetestsgroup (pyproject.toml:44,48,65-67).
Docs
- Turn the encoderfile link into a recommendation: a concrete model whose output dimension matches
SEMSEARCH_EMBEDDING_DIM(default768) and the minimal command to run it, soTOKEN_EMBEDDING_URLhas something to point at. - Document (or implement) the backfill story: units inserted while semsearch was disabled have no embedding and won't appear in semantic results; there's no reindex path today.
- Add
TOKEN_EMBEDDING_URLandSEMSEARCH_EMBEDDING_DIMto the self-hosted env-var table inDEVELOPMENT.md;README.md:71-72points there as the canonical reference, but they're missing.
Genuine future work: configurable candidate limit and relevance/distance weights (queries.py:258-259) is out of scope here and should be tracked in new issues that are referenced on the PR.
Happy to pair on the error-handling items if that's quicker than another round.
|
@peteski22 Re: domain only fallback. Do you want any error to fall back, or only those directly related to the semantic query machinery? Note that the current query can potentially fail for similar reasons (e.g. db connectivity). Also, a fallback provides not a degraded, but a different service instead, since the ranking will not be respecting embeddings. A feature indicator flag in both requests and responses to mark the fallback to standard query may be needed here. |
|
@peteski22 Re: partial failure on insert. The operations are not composable now at this level, since the sql commit is not visible. To avoid modifying _insert_sync, I will probably copy the operations there into the embedding insert. |
|
@peteski22 Re: domain normalization mismatch. I will probably remove |
|
Re: e2e tests, notice that there's an additional CI action that runs all tests enabling embeddings. I could fetch the embedding encoderfile, start it, run a specific suite, and then finish, but I supposed that you'd prefer having the embedding tests separate. Please let me know which approach you'd prefer. I will check whether there is appropriate coverage mocking the embeddings in the current setup. |
on I'd say scope the catch to the semantic path; don't blanket-catch/except everything. on Try: semantic -> fallback to standard query if semantic stuff fails ... if all that fails, surface the error on Yeah, you're right that swapping things quietly isn't great 🤔 probably just have to log a warning as changing the wire format for things is a bigger change and impacts more things. |
Do you mean move the logic so you compute the embedding first (when enabled) then do the SQL in a single transaction? |
Yep but the embeddings are not needed in the case of not-semantic search so the condition would be used in a couple of places because we also want the embeddings to be out of the sql queries. It's... awkward anyway. |
Do you mean like: # semsearch owns the enabled check:
async def compute_embedding(unit) -> bytes | None # None when disabled
def write_embedding(conn, unit_id, vector) -> None # sync, in-txn
# repository.insert() — no _SEMSEARCH_ENABLED branching:
vector = await semsearch.compute_embedding(unit) # HTTP, before the txn
await run_sync(self._insert_sync, unit, vector) # one txn: unit + domains, and write_embedding if vector |
Probably. But take into account that this needs to be done general enough so that it works without semantic search, and it is compatible with other kinds of search (BM25 for example), and it may need to be compatible with mechanisms in postgres (pgvector e.g.) since we've already merged #431. So it's not going to be trivial. |
|
@copilot check if the comments in #247 (review) are now addressed |
1e237dd to
6ab956e
Compare
79fc4ef to
0745ad5
Compare
Adds optional semantic similarity to the server's
/querypath using sqlite-vec and remote token embeddings, and migrates the server's scoring constants onto the canonicalcq-schemasource via the cq-sdk.Closes #21.
Closes #332.
Semantic search (#21)
When
TOKEN_EMBEDDING_URLis set and thesemsearchextra is installed (sqlite-vec,numpy,httpx),SqliteStore:connectevent;knowledge_units_vectable on schema bootstrap;insert/update;query(), fetches the per-domain candidate set joined with cosine distance viacombined_query, then ranks units byrelevance × confidence × Π(1 + logit(field))so semantically closer units boost their score and far ones diminish it.When the env var is unset (or the extra isn't installed) the path short-circuits via
semsearch.is_enabled()and behaviour matchesorigin/mainexactly. CI runs without the embedding server, so the modulation-specific test (test_query_returns_best_result_with_domain_overlap) skips by design.Scoring canonicalisation (#332)
cq-sdk 0.10.0 reads its scoring constants from cq-schema's canonical
scoring.values.json. With this change the server consumes those constants transitively via the SDK rather than holding its own copy.cq-sdkpin from~=0.9.1to~=0.10.0(pullscq-schema 0.0.1)from .scoringimports inapp.py,store/_sqlite.py,tests/test_store.pytofrom cq.scoringserver/backend/src/cq_server/scoring.pyandserver/backend/tests/test_scoring.py(the SDK'scq.scoringis now the canonical Python implementation; SDK tests cover the same behaviours)server/backend/tests/test_schema_oracle.pyto validate that server-serialised KnowledgeUnits round-trip throughcq_schema.load_schema("knowledge_unit")Behaviour note:
cq.scoring.calculate_relevanceclamps to[0, 1]; the deleted local copy did not. In practice a no-op (Jaccard ∈ [0,1] and the four weights sum to 1.0), but it removes a class of latent bug from any future weight tweak.Test plan
make test-server-backend— 302 passed, 1 skipped (embedding-server-dependent)make lint-server-backendSummary by CodeRabbit
New Features
Documentation
Tests
Chores
CI