[ZEPPELIN-6411] Semantic search for Zeppelin by kkalyan · Pull Request #5218 · apache/zeppelin

kkalyan · 2026-04-19T18:05:26Z

What is this PR for?

Added EmbeddingSearch — a new SearchService implementation that enables natural language search across Zeppelin notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2).
Disabled by default, enabled with zeppelin.search.semantic.enable = true.

The problem:
Zeppelin's built-in search uses Lucene's keyword matching, which works well for exact terms but falls short for the way analysts actually search.
A user looking for "yesterday's spending" gets zero results — even though their notebooks contain SELECT sum(cost) WHERE date = current_date -
interval '1' day. The words don't match, so Lucene can't find it.

This PR adds EmbeddingSearch, an alternative SearchService that uses sentence embeddings (all-MiniLM-L6-v2 via ONNX Runtime) to match by meaning
instead of keywords. It runs entirely in-process with no external services required.

Beyond semantic matching, EmbeddingSearch addresses other gaps in notebook search:

Indexes paragraph output — table results and text output become searchable, not just the code
Extracts SQL table names — FROM/JOIN references are extracted and used to boost related paragraphs in a two-phase ranking
Strips interpreter prefixes — %spark.sql, %python etc. are removed so they don't pollute search results
Live indexing — new or updated paragraphs are searchable immediately, no restart needed

What type of PR is it?

Feature

Todos

EmbeddingSearch core implementation (ONNX inference, mean pooling, cosine similarity)
Table name extraction from SQL (FROM/JOIN regex) with two-phase search boosting
Paragraph output indexing (TABLE, TEXT results)
Versioned binary persistence (v3 format)
Live indexing (new paragraphs searchable immediately)
Angular UI: render search results with separate code/output/tables blocks
Classic UI: same improvements
11 unit tests including semantic validation
Documentation

What is the Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-6411

How should this be tested?

Automated tests:

# Embedding search tests (requires ~86MB model download, one-time)
ZEPPELIN_EMBEDDING_TEST=true mvn test -pl zeppelin-zengine -Dtest=EmbeddingSearchTest

# Verify no regressions to existing Lucene search
mvn test -pl zeppelin-zengine -Dtest=LuceneSearchTest

Manual testing:

1. Set zeppelin.search.semantic.enable = true in zeppelin-site.xml
2. Restart Zeppelin
3. Search for natural language queries like:
  - "yesterday's spending" (Lucene: 0 results → Semantic: finds spend queries)
  - "how much do drivers earn" (finds taxi tip analysis)
  - "late deliveries" (finds shipping performance queries)
  - "airport rides" (both work — keyword match exists)

Screenshots (if appropriate)

Semantic Search with New UI

Semantic Search with Classic UI

Questions:

Does the license files need to update?
Yes — NOTICE updated with ONNX Runtime (MIT) and DJL Tokenizers (Apache 2.0) attribution.
Is there breaking changes for older versions?
No. Disabled by default. Existing LuceneSearch behavior is unchanged.

Copilot

Pull request overview

This PR introduces an optional semantic notebook search implementation (EmbeddingSearch) that uses ONNX-based sentence embeddings to match queries by meaning (plus output indexing and SQL table boosting), and updates both Classic and Angular UIs to render richer search results.

Changes:

Add EmbeddingSearch (ONNX Runtime + DJL tokenizer) with binary persistence and live indexing.
Add semantic-search config flag (zeppelin.search.semantic.enable) and wire the server to select Lucene vs semantic search.
Update Classic + Angular search result rendering (code/output/tables blocks) and adjust TypeScript typing/build settings.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 13 comments.

Show a summary per file

File	Description
zeppelin-zengine/src/main/java/org/apache/zeppelin/search/EmbeddingSearch.java	New semantic search service (model download, embedding, indexing, persistence, query ranking).
zeppelin-zengine/src/test/java/org/apache/zeppelin/search/EmbeddingSearchTest.java	New gated tests for semantic indexing/query behavior.
zeppelin-zengine/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java	Add config accessor + conf var for semantic search enablement.
zeppelin-server/src/main/java/org/apache/zeppelin/server/ZeppelinServer.java	Bind `EmbeddingSearch` when semantic search is enabled.
zeppelin-zengine/pom.xml	Add ONNX Runtime + DJL tokenizers dependencies.
zeppelin-web/src/app/search/result-list.html	Classic UI search results layout changes.
zeppelin-web/src/app/search/result-list.controller.js	Classic UI result parsing for code/output/tables + language badge.
zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.ts	Angular UI result parsing + simplified rendering (no Monaco/highlighting).
zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.html	Angular UI template to show code/output/tables and badge.
zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.less	Angular UI styling for new result layout.
zeppelin-web-angular/tsconfig.base.json	TS compiler option changes.
zeppelin-web-angular/projects/zeppelin-sdk/tsconfig.json	TS compiler option changes for SDK build.
zeppelin-web-angular/src/app/utility/get-keyword-positions.ts	Tighten type for `positions`.
zeppelin-web-angular/src/app/share/run-scripts/run-scripts.directive.ts	Type annotations / casts for script execution logic.
zeppelin-web-angular/src/app/services/save-as.service.ts	Type annotation for `binaryData`.
zeppelin-web-angular/src/app/pages/workspace/notebook/paragraph/code-editor/code-editor.component.ts	Type annotation for `newDecorations`.
zeppelin-web-angular/src/app/pages/workspace/notebook/notebook.component.ts	Safer optional chaining on permissions access.
zeppelin-web-angular/src/app/pages/workspace/credential/credential.component.ts	Type cast for destructuring credentials.
docs/embedding-search.md	New documentation for semantic search design and usage.
NOTICE	Add attributions for ONNX Runtime and DJL tokenizers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jongyoul · 2026-04-20T03:01:00Z

@kkalyan Thank you for the contribution. Could you please check the CI first and fix it? Moreover, I will review it but I have a simple question. Do we download the model when we start the server?

kkalyan · 2026-04-20T05:10:42Z

@kkalyan Thank you for the contribution. Could you please check the CI first and fix it? Moreover, I will review it but I have a simple question. Do we download the model when we start the server?

Hi @jongyoul - Yes, the model (~86MB) is downloaded on first start when zeppelin.search.semantic.enable=true (disabled by default). It's cached at {zeppelin.search.index.path}/models/all-MiniLM-L6-v2/ and reused on subsequent starts. I'll fix the CI issues — ESLint brace-style in the classic UI controller and the missing ASF license header on the docs file.

jongyoul · 2026-04-20T05:34:26Z

I think it's better to have it by default as we need to assume the environment not to download it dynamically. Moreover, don't we need to wait until it's downloaded when starting the server?

kkalyan · 2026-04-20T05:39:11Z

I think it's better to have it by default as we need to assume the environment not to download it dynamically. Moreover, don't we need to wait until it's downloaded when starting the server?

Thank you @jongyoul. You're right — downloading at startup is problematic for production/air-gapped environments.
A couple of questions so I get this right:

Should the model be bundled in the distribution (~86MB), or would a configurable local path (zeppelin.search.model.path) where admins
pre-stage the model be better?
If the model isn't found, should Zeppelin fall back to LuceneSearch with a warning, or fail fast?
Would an optional helper script (bin/install-search-model.sh) be acceptable for the download convenience?

Happy to implement whichever direction you prefer.

hyunw9 · 2026-04-22T10:56:10Z

This looks like a useful feature.
I have a couple of questions on this PR :

Model version pinning: The model URL points to resolve/main/, so the weights could silently change if the upstream model is updated. Since embedding_index.bin doesn't seem to store which model version generated the vectors, how would we detect a mismatch between an existing index and the current model? Pinning to a specific commit hash might be safer.
JNA compatibility: Looks like the DJL tokenizer's JNA dependency is excluded and the runtime falls back to zeppelin-server's JNA 4.1.0. This seems to work for now, but could this break if DJL needs to be upgraded to a version that requires a newer JNA? Just wondering if this is something we should keep in mind.

Add EmbeddingSearch — a new SearchService implementation that enables natural language search across notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2). Disabled by default, enabled with: zeppelin.search.semantic.enable = true Key improvements over keyword search: - Understands meaning, not just exact keywords - Indexes paragraph output (table data, text results) - Strips interpreter prefixes for cleaner matching - Zero external services — runs entirely in-process JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

Add EmbeddingSearch — a new SearchService implementation that enables natural language search across notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2). Disabled by default, enabled with: zeppelin.search.semantic.enable = true Key improvements over keyword search: - Understands meaning, not just exact keywords - Indexes paragraph output (table data, text results) - Extracts and boosts SQL table names (FROM/JOIN) - Two-phase search: discover relevant tables, then boost matches - Strips interpreter prefixes for cleaner matching - Zero external services — runs entirely in-process Frontend improvements (both Angular and Classic UI): - Search results show SQL code, output data, and table names in separate styled blocks instead of a single code editor - Language badges (sql/python/md) on search result cards New files: - EmbeddingSearch.java: core implementation - EmbeddingSearchTest.java: 11 tests including semantic validation - docs/embedding-search.md: architecture documentation JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

Add two-phase search, table extraction, output indexing, frontend changes, and live indexing test to documentation. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

Expand single-line if blocks in detectLang() to satisfy ESLint brace-style rule, and add ASF license header to embedding-search.md to pass Apache RAT audit. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

- Fix table boosting bug: results now re-sorted by boosted score - Add connect/read timeouts to model download (30s/60s) - Atomic index persistence: write to temp file, then rename - Strip <B> highlight tags from LuceneSearch results in both UIs - Hide language badge for unknown content types (return '' not 'text') - Remove unused SNIPPET_LENGTH constant - Share model directory across test methods to avoid 86MB re-download JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

Pin all-MiniLM-L6-v2 model and tokenizer URLs to commit c9745ed1d9f207416be6d2e6f8de32d1f16199bf instead of resolve/main/ to prevent silent model weight changes from upstream updates. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

Add bin/install-search-model.sh that downloads the ONNX model and tokenizer from a pinned HuggingFace commit (c9745ed1d9f2). Remove auto-download from EmbeddingSearch.initModel() — server now fails fast with a clear error if the model is not pre-installed. This avoids blocking server startup on network I/O and eliminates the risk of silent model version drift. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

kkalyan · 2026-04-22T15:33:53Z

@jongyoul:

Updated in the latest push:

Added bin/install-search-model.sh — downloads the model (~86MB) and tokenizer from a pinned HuggingFace commit. Run once before enabling semantic
search.
Removed all runtime download logic from EmbeddingSearch. Server now fails fast with a clear error message if the model isn't pre-installed.
No network I/O at startup.

Usage:

bin/install-search-model.sh            # uses default /tmp/zeppelin-index
bin/install-search-model.sh /my/path   # custom index path

kkalyan · 2026-04-22T15:33:59Z

Thanks for the review @hyunw9!

Model version pinning: Good catch. The URLs are now pinned to a specific HuggingFace commit (c9745ed1d9f2) in bin/install-search-model.sh, and the
runtime download logic has been removed entirely from EmbeddingSearch.java. The model is pre-installed via the helper script, so there's no risk of
silent drift. Upgrading the model becomes an explicit, reviewable change to the script.
JNA compatibility: You're right to flag this. DJL tokenizers 0.28.0 pulls JNA 5.14.0, which we exclude to avoid conflicting with Zeppelin's JNA
4.1.0 (from Hadoop). This works because DJL only uses basic Native.load() which is stable across JNA versions. If a future DJL upgrade requires newer
JNA APIs, it would fail immediately at startup (tokenizer init), not silently. At that point the fix would be upgrading Zeppelin's JNA globally —
same pattern other modules follow. Added a comment in the pom.xml exclusion to document this.

kkalyan · 2026-04-28T04:36:59Z

@jongyoul @hyunw9 checking in on this. any other feedback?

jongyoul · 2026-04-28T09:03:39Z

+    }
+    indexLock.writeLock().lock();
+    try {
+      index.entrySet().removeIf(e -> e.getKey().startsWith(noteId));


🐛 Prefix collision — if two note IDs share a prefix (e.g. 2A123 and 2A1234), this removes entries for both. Beyond the functional bug, this means a user deleting their own note can incidentally evict another user's index entries (low-grade availability impact; LuceneSearch is not affected since it uses Term-based matching).

Since the docId format is <noteId>/paragraph/<pId>, suggest:

removeIf(e -> e.getKey().equals(noteId) || e.getKey().startsWith(noteId + "/"));

jongyoul · 2026-04-28T09:03:39Z

+   * Format: [int:version=3][int:count] then for each entry:
+   *   [utf:docId] [utf:noteName] [utf:text] [utf:title] [utf:tables] [utf:output] [float[384]:embedding]
+   */
+  private void saveIndex() throws IOException {


This rewrites the entire binary index on every mutation, and it's called from every add/update/delete path (addParagraphIndex, updateParagraphIndex, addNoteIndex, updateNoteIndex, deleteNoteIndex, deleteParagraphIndex). For a 50K-paragraph deployment that's ~75MB written to disk on every paragraph save — not viable in production.

Options:

Debounced/scheduled flush (e.g. flush every N seconds if dirty)

Per-entry on-disk storage

Append-only log + periodic compaction

jongyoul · 2026-04-28T09:03:40Z

+  private void saveIndex() throws IOException {
+    Path file = indexPath.resolve("embedding_index.bin");
+    Path tmpFile = indexPath.resolve("embedding_index.bin.tmp");
+    indexLock.readLock().lock();


Holding indexLock.readLock() across disk I/O means index mutations are blocked for the entire flush duration. Query reads still proceed (correct semantics), but writers wait. Suggest serializing to a byte[] (or ByteArrayOutputStream) buffer under the lock, then writing to disk after releasing it.

jongyoul · 2026-04-28T09:03:40Z

+    }
+
+    if (zConf.isIndexRebuild()) {
+      notebook.addInitConsumer(this::addNoteIndex);


Initial indexing is only registered when isIndexRebuild() is true. If embedding_index.bin is missing or partial (first run with index.rebuild=false, or corrupted file), there's no bootstrap path — the index will simply stay empty. Suggest also bootstrapping when the index file is absent, e.g.:

if (zConf.isIndexRebuild() || !Files.exists(indexPath.resolve("embedding_index.bin"))) { notebook.addInitConsumer(this::addNoteIndex); }

jongyoul · 2026-04-28T09:03:40Z

+    super("EmbeddingSearch");
+    this.notebook = notebook;
+    this.indexPath = Paths.get(zConf.getZeppelinSearchIndexPath());
+    Files.createDirectories(indexPath);


🔒 Restrict permissions on the index directory. embedding_index.bin stores paragraph text and paragraph output (up to 1000 chars from extractOutput) in plaintext, which may include customer PII or financial values. The default index path is /tmp/zeppelin-index (world-traversable), and Files.createDirectories uses the umask default (typically 755).

This is the same threat class as the existing notebook-on-disk directory, but the new feature broadens it by including execution output. Suggest:

Files.createDirectories(indexPath); if (Files.getFileStore(indexPath).supportsFileAttributeView("posix")) { Files.setPosixFilePermissions(indexPath, PosixFilePermissions.fromString("rwx------")); } if (indexPath.toAbsolutePath().startsWith("/tmp")) { LOGGER.warn("zeppelin.search.index.path is under /tmp ({}); " + "paragraph text and output will be readable by other local users. " + "Consider setting it to a private directory.", indexPath); }

Also apply 0600 to embedding_index.bin itself in saveIndex() after the atomic move. Disk-level encryption (LUKS, FileVault) remains the operator's responsibility; this is just the in-app baseline.

jongyoul · 2026-04-28T09:03:40Z

+      System.arraycopy(attentionMask, 0, mask, 0, seqLen);
+
+      long[] shape = {1, seqLen};
+      OnnxTensor idsTensor = OnnxTensor.createTensor(ortEnv, LongBuffer.wrap(ids), shape);


🪲 Native tensor leak on partial allocation failure. Three OnnxTensor.createTensor calls happen sequentially before the try { ... } finally { close() } block; if the second or third call throws, the earlier tensors are never closed (native memory leak).

Suggest creating each tensor inside its own try-with-resources, or accumulating successfully-created handles in a list and closing all in finally.

jongyoul · 2026-04-28T09:03:40Z

+    return
+  fi
+  echo "Downloading ${url} ..."
+  curl -fSL --connect-timeout 30 --max-time 300 -o "${dest}.tmp" "${url}"


🔒 Verify SHA256 of downloaded files. The script pins a HuggingFace commit SHA, which protects against repository content drift, but it does not verify the bytes received. ORT 1.18.x has had RCE/DoS CVEs around model deserialization, so the following scenarios remain exploitable:

HuggingFace storage/CDN compromise

MITM with a valid-looking certificate (corporate proxy, captured CA)

Internal mirror/cache poisoning

Suggest hardcoding the expected SHA256 of model.onnx and tokenizer.json and verifying after download:

echo "<sha256> ${dest}" | sha256sum -c -

jongyoul · 2026-04-28T09:03:40Z

-    const matches = [];
-    let match = regexp.exec(mergedStr);
+    // snippet = SQL/code, header = tables + output
+    this.codeText = (this.result.snippet || '').replace(/<\/?B>/gi, '');


Stripping <B> tags here removes the keyword highlighting that LuceneSearch provides — so users who never enable semantic search will lose highlighting in search results too. Since EmbeddingSearch is opt-in (default off), this is a regression for the default user.

Could you preserve highlight rendering for the Lucene path? E.g. convert <B>...</B> into <mark> via sanitized innerHTML, or render it manually via a small parsing pass.

jongyoul · 2026-04-28T09:03:40Z

-    const model = this.editor?.getModel();
-    if (!model) {
-      throw new Error('Editor model is not defined.');
+    if (/select|insert|create|from|where/i.test(text)) {


/select|insert|create|from|where/i.test(text) will misclassify any paragraph containing the word "create" as SQL — including markdown like "Click Create to continue" or python comments mentioning "insert". Suggest:

Check the interpreter prefix (%spark.sql, %md, …) first — this is reliable

Fall back to the keyword heuristic only if no prefix is present

The backend IndexEntry could also include the resolved interpreter name to avoid duplicating detection in the frontend.

jongyoul · 2026-04-28T09:03:40Z

    "outDir": "./dist/out-tsc",
    "sourceMap": true,
    "strict": true,
+    "noImplicitAny": false,


These two flags relax type safety project-wide and aren't required by the semantic search feature. The accompanying as any casts in credential.component.ts, code-editor.component.ts, save-as.service.ts, run-scripts.directive.ts look unrelated too.

Could you split the tsconfig changes (and the related casts) into a separate PR? It would be much easier to review the underlying type issues independently, and avoids permanently lowering the project's type-safety bar as a side effect of a search feature.

I agree with this review. In the same vein as skipLibCheck, I think it would be better to remove these configuration changes as well.

jongyoul · 2026-04-28T09:16:58Z

Hello, I checked logic in details and found some improvement points. Some were critical issues and others were not, but I left all of them for your reference. Please check the list below:

Critical

Unrelated TypeScript strictness regression — tsconfig.base.json and zeppelin-sdk/tsconfig.json add noImplicitAny: false and skipLibCheck: true, plus several as any casts in credential.component.ts, code-editor.component.ts, save-as.service.ts, run-scripts.directive.ts. None of these are related to semantic search and they permanently lower project-wide type safety. Please split these out — ideally fix the underlying type issues in a separate PR.
deleteNoteIndex prefix collision bug (EmbeddingSearch.java:548) — startsWith(noteId) will also remove entries whose note IDs share the prefix (e.g. 2A123 vs 2A1234). Beyond the functional bug, this is also a low-grade availability concern: a user deleting their own note can incidentally remove another user's index entries (existing LuceneSearch is not affected since it uses Term-based matching). See inline.
saveIndex() rewrites the entire binary on every mutation (EmbeddingSearch.java:635) — called from every add/update/delete path. With 50K paragraphs that's ~75MB written to disk on every paragraph save. Not viable for production. See inline.
Initial indexing only registered when isIndexRebuild (EmbeddingSearch.java:137-138) — if embedding_index.bin is missing or partial, there's no bootstrap path. See inline.
UI regression for default (Lucene) users (result-item.component.ts:55) — the new UI strips <B> highlight tags, so users who never enable semantic search lose keyword highlighting in search results. See inline.

Major

LICENSE file not updated for bundled dependencies — both ONNX Runtime (MIT) and DJL Tokenizers (Apache-2.0) are Category A, so they are license-compatible (no dev@/Legal review needed; see LEGAL-230 for the precedent on JNI native binaries like Snappy-Java). However, per the ASF licensing-howto bundled dependencies should also be listed in LICENSE (not just NOTICE), with the bundled artifact path, license type, source URL, and license text or a link to it.
Index file/directory should be created with restrictive permissions (security) — EmbeddingSearch indexes paragraph output (extractOutput, up to 1000 chars) into embedding_index.bin in plaintext. Output may contain customer PII, monetary values, etc. The default index path is /tmp/zeppelin-index (world-traversable), and Files.createDirectories uses the umask default (typically 755). This widens the existing notebook-on-disk surface — same threat class as the notebook directory, but the new feature adds output text to it. Suggest creating the index dir/file with mode 0700 and warning if the configured path starts with /tmp. Disk-level encryption remains the operator's responsibility. See inline.
Model file integrity not verified after download (supply chain) — bin/install-search-model.sh pins to a HuggingFace commit SHA but doesn't verify SHA256 of the downloaded files. ONNX models can trigger RCE/DoS via deserialization (multiple CVEs in ORT 1.18.x), so MITM/cache poisoning/repository compromise scenarios should be hardened. See inline.
Model file integrity not re-verified at runtime — ortSession = ortEnv.createSession(modelFile.toString(), opts) loads whatever is on disk. Combined with the permission issue above, anyone with write access to the index dir can swap the model. A SHA256 comparison at startup would provide defense-in-depth.
Index file deserialization is unbounded (DoS) — loadIndex reads count = in.readInt() and loops count times, allocating per-entry memory. A corrupted/tampered index file with count = Integer.MAX_VALUE causes startup OOM (caught only as IOException — OutOfMemoryError is not caught). Suggest a sanity bound (e.g. 10M entries). See inline.
saveIndex performs disk I/O while holding indexLock.readLock() (EmbeddingSearch.java:638) — query reads still proceed (correct), but all index mutations block until the disk flush completes. Serialize to a buffer first, then write outside the lock.
Synchronous embedding on the notebook event thread — addParagraphIndex runs ONNX inference (~5ms) plus a full disk write on the same thread that delivers paragraph events. Consider an async worker queue with batching.
CI does not run the new tests — @EnabledIfEnvironmentVariable("ZEPPELIN_EMBEDDING_TEST", "true") means default mvn test skips them entirely. Please add a CI job (or Maven profile) that exercises EmbeddingSearchTest with the model downloaded/cached.
canIndexAndQuery doesn't actually validate semantic behavior — the assertion r.get("text").contains("all") is a substring check that keyword search would also pass. semanticSearchFindsRelatedConcepts is the only true semantic test; consider adding more cases like that.

Minor

detectInterpreter false positives (result-item.component.ts:78 and the matching detectLang in classic UI) — /select|insert|create|from|where/i will tag any markdown containing the word "create" as SQL. Suggest checking the interpreter prefix (%spark.sql, %md, …) first and falling back to heuristics.
extractTables only matches schema.table form — by design, but the limitation should be documented. Single-name tables (FROM customers) and dialect quoting are missed, so the table-boost phase is effectively a no-op for transactional-DB style queries. A proper SQL parser (JSqlParser, Calcite) would be more accurate but is a reasonable follow-up rather than a blocker.
Native tensor leak on partial allocation failure — three OnnxTensor.createTensor calls in embed() happen outside the try-with-resources block; if the second/third call throws, the earlier tensors are never closed. Wrap each tensor allocation so all successfully-created handles are released.
Magic numbers scattered through EmbeddingSearch (MIN_SIMILARITY=0.25, TABLE_BOOST=0.05, threshold 0.2, text limits 1500/2000/500/300). Consider exposing the tunables as ConfVars.
ImmutableMap.of(...) with 5 entries in query() is at Guava's positional-overload limit; a builder or Map.of(...) is safer for future field additions.
Classic UI inline styles (result-list.html) — long style="..." attributes; please move into the existing CSS file.
ONNX Runtime 1.18.0 / DJL 0.28.0 are slightly behind current stable; Java APIs are stable across minor versions, but bumping picks up security/native fixes. Worth a separate dependency-bump PR.
Test uses Files.createSymbolicLink to share the model dir across tests — fails on Windows without admin/dev mode. Cleaner solution: make EmbeddingSearch accept the model directory as a separate config so the test can point both indexDir and modelDir independently.

Suggested follow-ups (non-blocking)

Configurable model URL for air-gapped environments (already in your "Future Work" — good)

jongyoul · 2026-04-28T09:17:29Z

@tbonelee @dididy Could you please help review for the FE part?

@kkalyan Could you please check the CI as well?

voidmatcha

I left a small type-related review. I’ll continue reviewing the frontend changes as the PR progresses.

voidmatcha@7ae095f

I verified that tsc passes locally after applying the suggested changes.

voidmatcha · 2026-04-29T13:38:55Z

    }
    const text = model.getValue();
-    const newDecorations = [];
+    const newDecorations: any[] = [];


Suggested change

const newDecorations: any[] = [];

const newDecorations: MonacoEditor.IModelDeltaDecoration[] = [];

voidmatcha · 2026-04-29T13:39:09Z

      const controls = [...Object.entries(data.userCredentials)].map(e => {
        const entity = e[0];
-        const { username, password } = e[1];
+        const { username, password } = e[1] as any;


Suggested change

const { username, password } = e[1] as any;

const { username, password } = e[1] as CredentialForm;

voidmatcha · 2026-04-29T13:51:16Z

    "outDir": "./dist/out-tsc",
    "sourceMap": true,
    "strict": true,
+    "noImplicitAny": false,


I agree with this review. In the same vein as skipLibCheck, I think it would be better to remove these configuration changes as well.

voidmatcha · 2026-04-29T13:51:57Z

      return;
    }
-    this.ngZone.onStable.pipe(take(1)).subscribe(() => {
+    (this.ngZone.onStable as any).pipe(take(1)).subscribe(() => {


Suggested change

(this.ngZone.onStable as any).pipe(take(1)).subscribe(() => {

(this.ngZone.onStable).pipe(take(1)).subscribe(() => {

voidmatcha · 2026-04-29T15:02:27Z

+  <div *ngIf="tablesText" class="tables-block">
+    📊 {{ tablesText }}
+  </div>


Suggested change

<div *ngIf="tablesText" class="tables-block">

📊 {{ tablesText }}

</div>

<div *ngIf="tablesText" class="tables-block">📊 {{ tablesText }}</div>

The CI(run-playwright-e2e-tests) is failing due to lint issues, so this part needs to be fixed. Running npm run lint:fix in zeppelin-web-angular resolves the problem.

run-playwright-e2e-tests passed after applied two commits.

voidmatcha@7ae095f
voidmatcha@15fc35e

https://github.com/voidmatcha/zeppelin/actions/runs/25116480167/job/73604863894

kkalyan changed the title ~~Zeppelin 6411 semantic search~~ [feat] Semantic search for Zeppelin Apr 19, 2026

kkalyan changed the title ~~[feat] Semantic search for Zeppelin~~ [ZEPPELIN-6411] Semantic search for Zeppelin Apr 19, 2026

jongyoul requested a review from Copilot April 20, 2026 01:00

Copilot started reviewing on behalf of jongyoul April 20, 2026 01:01 View session

Copilot AI reviewed Apr 20, 2026

View reviewed changes

Kalyan Kanuri added 8 commits April 22, 2026 08:31

docs: Update embedding search documentation

5b3e18b

Add two-phase search, table extraction, output indexing, frontend changes, and live indexing test to documentation. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

chore: Revert unrelated package-lock.json change

0562c99

fix: Resolve CI failures for ESLint brace-style and RAT license check

641daa4

Expand single-line if blocks in detectLang() to satisfy ESLint brace-style rule, and add ASF license header to embedding-search.md to pass Apache RAT audit. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

kkalyan force-pushed the ZEPPELIN-6411-semantic-search branch from a9a11e3 to 6c63b5f Compare April 22, 2026 15:32

jongyoul requested changes Apr 28, 2026

View reviewed changes

voidmatcha suggested changes Apr 29, 2026

View reviewed changes

voidmatcha reviewed Apr 29, 2026

View reviewed changes

	const newDecorations: any[] = [];
	const newDecorations: MonacoEditor.IModelDeltaDecoration[] = [];

	const { username, password } = e[1] as any;
	const { username, password } = e[1] as CredentialForm;

	(this.ngZone.onStable as any).pipe(take(1)).subscribe(() => {
	(this.ngZone.onStable).pipe(take(1)).subscribe(() => {

Conversation

kkalyan commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What is this PR for?

What type of PR is it?

Todos

What is the Jira issue?

How should this be tested?

Screenshots (if appropriate)

Questions:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jongyoul commented Apr 20, 2026

Uh oh!

kkalyan commented Apr 20, 2026

Uh oh!

jongyoul commented Apr 20, 2026

Uh oh!

kkalyan commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hyunw9 commented Apr 22, 2026

Uh oh!

kkalyan commented Apr 22, 2026

Uh oh!

kkalyan commented Apr 22, 2026

Uh oh!

kkalyan commented Apr 28, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jongyoul commented Apr 28, 2026

Critical

Major

Minor

Suggested follow-ups (non-blocking)

Uh oh!

jongyoul commented Apr 28, 2026

Uh oh!

voidmatcha left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

kkalyan commented Apr 19, 2026 •

edited

Loading

kkalyan commented Apr 20, 2026 •

edited

Loading

voidmatcha left a comment •

edited

Loading

voidmatcha Apr 29, 2026 •

edited

Loading