[ZEPPELIN-6411] Semantic search for Zeppelin#5218
kkalyan wants to merge 8 commits into apache:master
Pull request overview
This PR introduces an optional semantic notebook search implementation (EmbeddingSearch) that uses ONNX-based sentence embeddings to match queries by meaning (plus output indexing and SQL table boosting), and updates both Classic and Angular UIs to render richer search results.
Changes:
- Add EmbeddingSearch (ONNX Runtime + DJL tokenizer) with binary persistence and live indexing.
- Add semantic-search config flag (zeppelin.search.semantic.enable) and wire the server to select Lucene vs semantic search.
- Update Classic + Angular search result rendering (code/output/tables blocks) and adjust TypeScript typing/build settings.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| zeppelin-zengine/src/main/java/org/apache/zeppelin/search/EmbeddingSearch.java | New semantic search service (model download, embedding, indexing, persistence, query ranking). |
| zeppelin-zengine/src/test/java/org/apache/zeppelin/search/EmbeddingSearchTest.java | New gated tests for semantic indexing/query behavior. |
| zeppelin-zengine/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java | Add config accessor + conf var for semantic search enablement. |
| zeppelin-server/src/main/java/org/apache/zeppelin/server/ZeppelinServer.java | Bind EmbeddingSearch when semantic search is enabled. |
| zeppelin-zengine/pom.xml | Add ONNX Runtime + DJL tokenizers dependencies. |
| zeppelin-web/src/app/search/result-list.html | Classic UI search results layout changes. |
| zeppelin-web/src/app/search/result-list.controller.js | Classic UI result parsing for code/output/tables + language badge. |
| zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.ts | Angular UI result parsing + simplified rendering (no Monaco/highlighting). |
| zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.html | Angular UI template to show code/output/tables and badge. |
| zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.less | Angular UI styling for new result layout. |
| zeppelin-web-angular/tsconfig.base.json | TS compiler option changes. |
| zeppelin-web-angular/projects/zeppelin-sdk/tsconfig.json | TS compiler option changes for SDK build. |
| zeppelin-web-angular/src/app/utility/get-keyword-positions.ts | Tighten type for positions. |
| zeppelin-web-angular/src/app/share/run-scripts/run-scripts.directive.ts | Type annotations / casts for script execution logic. |
| zeppelin-web-angular/src/app/services/save-as.service.ts | Type annotation for binaryData. |
| zeppelin-web-angular/src/app/pages/workspace/notebook/paragraph/code-editor/code-editor.component.ts | Type annotation for newDecorations. |
| zeppelin-web-angular/src/app/pages/workspace/notebook/notebook.component.ts | Safer optional chaining on permissions access. |
| zeppelin-web-angular/src/app/pages/workspace/credential/credential.component.ts | Type cast for destructuring credentials. |
| docs/embedding-search.md | New documentation for semantic search design and usage. |
| NOTICE | Add attributions for ONNX Runtime and DJL tokenizers. |
@kkalyan Thank you for the contribution. Could you please check the CI first and fix it? I will review the PR, but I have a simple question first: do we download the model when we start the server?
Hi @jongyoul - Yes, the model (~86MB) is downloaded on first start when |
|
I think it's better to have it by default, since we have to assume an environment that cannot download it dynamically. Moreover, don't we need to wait until it's downloaded when starting the server?
Thank you @jongyoul. You're right — downloading at startup is problematic for production/air-gapped environments.
Happy to implement whichever direction you prefer. |
|
This looks like a useful feature.
|
Add EmbeddingSearch, a new SearchService implementation that enables natural language search across notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2). Disabled by default, enabled with: zeppelin.search.semantic.enable = true
Key improvements over keyword search:
- Understands meaning, not just exact keywords
- Indexes paragraph output (table data, text results)
- Strips interpreter prefixes for cleaner matching
- Zero external services: runs entirely in-process
JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
Add EmbeddingSearch, a new SearchService implementation that enables natural language search across notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2). Disabled by default, enabled with: zeppelin.search.semantic.enable = true
Key improvements over keyword search:
- Understands meaning, not just exact keywords
- Indexes paragraph output (table data, text results)
- Extracts and boosts SQL table names (FROM/JOIN)
- Two-phase search: discover relevant tables, then boost matches
- Strips interpreter prefixes for cleaner matching
- Zero external services: runs entirely in-process
Frontend improvements (both Angular and Classic UI):
- Search results show SQL code, output data, and table names in separate styled blocks instead of a single code editor
- Language badges (sql/python/md) on search result cards
New files:
- EmbeddingSearch.java: core implementation
- EmbeddingSearchTest.java: 11 tests including semantic validation
- docs/embedding-search.md: architecture documentation
JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
Add two-phase search, table extraction, output indexing, frontend changes, and live indexing test to documentation. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
Expand single-line if blocks in detectLang() to satisfy ESLint brace-style rule, and add ASF license header to embedding-search.md to pass Apache RAT audit. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
- Fix table boosting bug: results now re-sorted by boosted score
- Add connect/read timeouts to model download (30s/60s)
- Atomic index persistence: write to temp file, then rename
- Strip <B> highlight tags from LuceneSearch results in both UIs
- Hide language badge for unknown content types (return '' not 'text')
- Remove unused SNIPPET_LENGTH constant
- Share model directory across test methods to avoid 86MB re-download
JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
Pin all-MiniLM-L6-v2 model and tokenizer URLs to commit c9745ed1d9f207416be6d2e6f8de32d1f16199bf instead of resolve/main/ to prevent silent model weight changes from upstream updates. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
Add bin/install-search-model.sh that downloads the ONNX model and tokenizer from a pinned HuggingFace commit (c9745ed1d9f2). Remove auto-download from EmbeddingSearch.initModel() — server now fails fast with a clear error if the model is not pre-installed. This avoids blocking server startup on network I/O and eliminates the risk of silent model version drift. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
a9a11e3 to 6c63b5f
|
Updated in the latest push:
Usage: bin/install-search-model.sh # uses default /tmp/zeppelin-index
bin/install-search-model.sh /my/path # custom index path
|
|
Thanks for the review @hyunw9!
|
| } | ||
| indexLock.writeLock().lock(); | ||
| try { | ||
| index.entrySet().removeIf(e -> e.getKey().startsWith(noteId)); |
🐛 Prefix collision — if two note IDs share a prefix (e.g. 2A123 and 2A1234), this removes entries for both. Beyond the functional bug, this means a user deleting their own note can incidentally evict another user's index entries (low-grade availability impact; LuceneSearch is not affected since it uses Term-based matching).
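To make the collision concrete, here is a minimal self-contained sketch. It assumes docIds are plain strings of the form <noteId>/paragraph/<pId> held in a map; the class and method names (DocIdPrefixDemo, removeByBarePrefix, removeByNoteId) are invented for illustration and are not part of the PR.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: docIds are assumed to look like "<noteId>/paragraph/<pId>". */
final class DocIdPrefixDemo {

  /** Bare prefix test: also matches any longer note ID that starts with noteId. */
  static void removeByBarePrefix(Map<String, String> index, String noteId) {
    index.keySet().removeIf(k -> k.startsWith(noteId));
  }

  /** Delimited test: matches the note ID itself or the note ID followed by "/". */
  static void removeByNoteId(Map<String, String> index, String noteId) {
    index.keySet().removeIf(k -> k.equals(noteId) || k.startsWith(noteId + "/"));
  }

  /** Two notes whose IDs share a prefix. */
  static Map<String, String> sampleIndex() {
    Map<String, String> index = new ConcurrentHashMap<>();
    index.put("2A123/paragraph/p1", "note A");
    index.put("2A1234/paragraph/p1", "note B");
    return index;
  }

  public static void main(String[] args) {
    Map<String, String> buggy = sampleIndex();
    removeByBarePrefix(buggy, "2A123");
    System.out.println("bare prefix leaves: " + buggy.keySet());

    Map<String, String> fixed = sampleIndex();
    removeByNoteId(fixed, "2A123");
    System.out.println("delimited prefix leaves: " + fixed.keySet());
  }
}
```

With the bare prefix both notes are evicted; with the delimiter only the entries of 2A123 go away.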
Since the docId format is <noteId>/paragraph/<pId>, suggest:
removeIf(e -> e.getKey().equals(noteId) || e.getKey().startsWith(noteId + "/"));
| * Format: [int:version=3][int:count] then for each entry: | ||
| * [utf:docId] [utf:noteName] [utf:text] [utf:title] [utf:tables] [utf:output] [float[384]:embedding] | ||
| */ | ||
| private void saveIndex() throws IOException { |
This rewrites the entire binary index on every mutation, and it's called from every add/update/delete path (addParagraphIndex, updateParagraphIndex, addNoteIndex, updateNoteIndex, deleteNoteIndex, deleteParagraphIndex). For a 50K-paragraph deployment that's ~75MB written to disk on every paragraph save — not viable in production.
Options:
- Debounced/scheduled flush (e.g. flush every N seconds if dirty)
- Per-entry on-disk storage
- Append-only log + periodic compaction
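As a rough illustration of the first option, a debounced flush might look like the sketch below. DebouncedFlusher, markDirty, and flushIfDirty are hypothetical names, not existing Zeppelin APIs; the flush action stands in for saveIndex(), and the flush period is an arbitrary choice.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper: coalesces many index mutations into at most one flush per period. */
final class DebouncedFlusher implements AutoCloseable {
  private final AtomicBoolean dirty = new AtomicBoolean(false);
  private final Runnable flushAction;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "index-flusher");
        t.setDaemon(true); // do not keep the JVM alive just for flushing
        return t;
      });

  DebouncedFlusher(Runnable flushAction, long period, TimeUnit unit) {
    this.flushAction = flushAction;
    scheduler.scheduleWithFixedDelay(this::flushIfDirty, period, period, unit);
  }

  /** Called from add/update/delete paths instead of rewriting the file directly. */
  void markDirty() {
    dirty.set(true);
  }

  /** Runs the flush action only if something changed since the previous flush. */
  synchronized void flushIfDirty() {
    if (!dirty.compareAndSet(true, false)) {
      return; // nothing to write
    }
    try {
      flushAction.run();
    } catch (RuntimeException e) {
      dirty.set(true); // keep the change marked as unsaved so the next tick retries
      // A real implementation would log the failure here.
    }
  }

  /** Stops the timer and performs one final flush so no mutation is lost on shutdown. */
  @Override
  public void close() {
    scheduler.shutdown();
    flushIfDirty();
  }
}
```

The trade-off is durability: mutations made within one period before a crash are lost unless they can be rebuilt from the notebooks themselves.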
| private void saveIndex() throws IOException { | ||
| Path file = indexPath.resolve("embedding_index.bin"); | ||
| Path tmpFile = indexPath.resolve("embedding_index.bin.tmp"); | ||
| indexLock.readLock().lock(); |
Holding indexLock.readLock() across disk I/O means index mutations are blocked for the entire flush duration. Query reads still proceed (correct semantics), but writers wait. Suggest serializing to a byte[] (or ByteArrayOutputStream) buffer under the lock, then writing to disk after releasing it.
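A sketch of that two-phase shape, under simplifying assumptions: the index is reduced to a String-to-String map, the binary layout is a toy one (not the PR's version-3 format), and SnapshotThenWrite, snapshot, and save are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch: serialize under the lock, touch the disk only after releasing it. */
final class SnapshotThenWrite {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> index = new LinkedHashMap<>();

  void put(String docId, String text) {
    lock.writeLock().lock();
    try {
      index.put(docId, text);
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Phase 1: build the bytes in memory while holding the read lock (no disk I/O). */
  byte[] snapshot() {
    lock.readLock().lock();
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(buf)) {
        out.writeInt(index.size());
        for (Map.Entry<String, String> e : index.entrySet()) {
          out.writeUTF(e.getKey());   // note: writeUTF is limited to 65535 encoded bytes
          out.writeUTF(e.getValue());
        }
      }
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Phase 2: write to a temp file and rename, with no index lock held. */
  void save(Path file) {
    byte[] bytes = snapshot();
    Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
    try {
      Files.write(tmp, bytes);
      Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The cost is a transient in-memory copy of the serialized index, which is the price of keeping writers unblocked during the disk write.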
| } | ||
|
|
||
| if (zConf.isIndexRebuild()) { | ||
| notebook.addInitConsumer(this::addNoteIndex); |
Initial indexing is only registered when isIndexRebuild() is true. If embedding_index.bin is missing or partial (first run with index.rebuild=false, or corrupted file), there's no bootstrap path — the index will simply stay empty. Suggest also bootstrapping when the index file is absent, e.g.:
if (zConf.isIndexRebuild() || !Files.exists(indexPath.resolve("embedding_index.bin"))) {
notebook.addInitConsumer(this::addNoteIndex);
}
| super("EmbeddingSearch"); | ||
| this.notebook = notebook; | ||
| this.indexPath = Paths.get(zConf.getZeppelinSearchIndexPath()); | ||
| Files.createDirectories(indexPath); |
🔒 Restrict permissions on the index directory. embedding_index.bin stores paragraph text and paragraph output (up to 1000 chars from extractOutput) in plaintext, which may include customer PII or financial values. The default index path is /tmp/zeppelin-index (world-traversable), and Files.createDirectories uses the umask default (typically 755).
This is the same threat class as the existing notebook-on-disk directory, but the new feature broadens it by including execution output. Suggest:
Files.createDirectories(indexPath);
if (Files.getFileStore(indexPath).supportsFileAttributeView("posix")) {
Files.setPosixFilePermissions(indexPath,
PosixFilePermissions.fromString("rwx------"));
}
if (indexPath.toAbsolutePath().startsWith("/tmp")) {
LOGGER.warn("zeppelin.search.index.path is under /tmp ({}); "
+ "paragraph text and output will be readable by other local users. "
+ "Consider setting it to a private directory.", indexPath);
}
Also apply 0600 to embedding_index.bin itself in saveIndex() after the atomic move. Disk-level encryption (LUKS, FileVault) remains the operator's responsibility; this is just the in-app baseline.
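A hedged sketch of the file-level part follows. It is a small variation on the wording above: the temp file is created with owner-only permissions before any data is written and is then renamed, so the final path never exists with a looser mode. PrivateIndexFile, writePrivately, and modeOf are hypothetical names, and on file systems without POSIX attributes the sketch simply skips the permission step.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Hypothetical helper: writes a file so that only its owner can read or write it. */
final class PrivateIndexFile {
  private static final Set<PosixFilePermission> OWNER_RW =
      PosixFilePermissions.fromString("rw-------");

  static void writePrivately(Path target, byte[] bytes) {
    Path abs = target.toAbsolutePath();
    Path tmp = abs.resolveSibling(abs.getFileName() + ".tmp");
    try {
      Files.deleteIfExists(tmp); // clear leftovers from an interrupted earlier write
      if (Files.getFileStore(abs.getParent()).supportsFileAttributeView("posix")) {
        // Create the temp file as 0600 up front, before any content exists in it.
        Files.createFile(tmp, PosixFilePermissions.asFileAttribute(OWNER_RW));
      }
      Files.write(tmp, bytes);
      // A rename within the same directory keeps the temp file's permissions.
      Files.move(tmp, abs, StandardCopyOption.REPLACE_EXISTING);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns the mode string such as "rw-------", or "" when POSIX attributes are unsupported. */
  static String modeOf(Path file) {
    try {
      if (!Files.getFileStore(file).supportsFileAttributeView("posix")) {
        return "";
      }
      return PosixFilePermissions.toString(Files.getPosixFilePermissions(file));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```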
| System.arraycopy(attentionMask, 0, mask, 0, seqLen); | ||
|
|
||
| long[] shape = {1, seqLen}; | ||
| OnnxTensor idsTensor = OnnxTensor.createTensor(ortEnv, LongBuffer.wrap(ids), shape); |
🪲 Native tensor leak on partial allocation failure. Three OnnxTensor.createTensor calls happen sequentially before the try { ... } finally { close() } block; if the second or third call throws, the earlier tensors are never closed (native memory leak).
Suggest creating each tensor inside its own try-with-resources, or accumulating successfully-created handles in a list and closing all in finally.
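The try-with-resources variant can be sketched without ONNX Runtime on the classpath. FakeTensor below is a stand-in invented for this example, on the assumption that the real tensor type is AutoCloseable; the counter only exists to show that nothing stays open when a later allocation fails.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: FakeTensor stands in for a native-backed tensor handle. */
final class TensorCleanupDemo {

  static final class FakeTensor implements AutoCloseable {
    /** Number of handles currently open; a leak would leave this above zero. */
    static final AtomicInteger OPEN = new AtomicInteger();

    FakeTensor(boolean fail) {
      if (fail) {
        throw new IllegalStateException("allocation failed");
      }
      OPEN.incrementAndGet();
    }

    @Override
    public void close() {
      OPEN.decrementAndGet();
    }
  }

  /**
   * Declares all three handles in one try-with-resources header. If the third
   * constructor throws, the first two are still closed, in reverse order.
   */
  static int run(boolean failThird) {
    try (FakeTensor ids = new FakeTensor(false);
         FakeTensor mask = new FakeTensor(false);
         FakeTensor types = new FakeTensor(failThird)) {
      return 3; // stand-in for running the model with the three inputs
    }
  }

  public static void main(String[] args) {
    try {
      run(true);
    } catch (IllegalStateException e) {
      System.out.println("open handles after failure: " + FakeTensor.OPEN.get());
    }
  }
}
```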
| return | ||
| fi | ||
| echo "Downloading ${url} ..." | ||
| curl -fSL --connect-timeout 30 --max-time 300 -o "${dest}.tmp" "${url}" |
🔒 Verify SHA256 of downloaded files. The script pins a HuggingFace commit SHA, which protects against repository content drift, but it does not verify the bytes received. ORT 1.18.x has had RCE/DoS CVEs around model deserialization, so the following scenarios remain exploitable:
- HuggingFace storage/CDN compromise
- MITM with a valid-looking certificate (corporate proxy, captured CA)
- Internal mirror/cache poisoning
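The same byte-level check could also be enforced on the Java side before the model is loaded. The sketch below is an assumption about how that might look, not something the PR contains: Sha256Check, sha256Hex, and verify are hypothetical names, and the expected digest would have to be supplied by the project.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical helper: streams a file through SHA-256 and compares it to a pinned digest. */
final class Sha256Check {

  /** Returns the lowercase hex SHA-256 of everything readable from the stream. */
  static String sha256Hex(InputStream in) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n); // stream in chunks so large models are not held in memory
      }
      StringBuilder hex = new StringBuilder(64);
      for (byte b : md.digest()) {
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 is required on every Java platform", e);
    }
  }

  /** Throws SecurityException when the content does not match the expected digest. */
  static void verify(InputStream in, String expectedHex) {
    byte[] actual = sha256Hex(in).getBytes(StandardCharsets.US_ASCII);
    byte[] expected = expectedHex.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.US_ASCII);
    if (!MessageDigest.isEqual(actual, expected)) {
      throw new SecurityException("SHA-256 mismatch: refusing to use downloaded file");
    }
  }
}
```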
Suggest hardcoding the expected SHA256 of model.onnx and tokenizer.json and verifying after download:
echo "<sha256> ${dest}" | sha256sum -c -
| const matches = []; | ||
| let match = regexp.exec(mergedStr); | ||
| // snippet = SQL/code, header = tables + output | ||
| this.codeText = (this.result.snippet || '').replace(/<\/?B>/gi, ''); |
Stripping <B> tags here removes the keyword highlighting that LuceneSearch provides — so users who never enable semantic search will lose highlighting in search results too. Since EmbeddingSearch is opt-in (default off), this is a regression for the default user.
Could you preserve highlight rendering for the Lucene path? E.g. convert <B>...</B> into <mark> via sanitized innerHTML, or render it manually via a small parsing pass.
| const model = this.editor?.getModel(); | ||
| if (!model) { | ||
| throw new Error('Editor model is not defined.'); | ||
| if (/select|insert|create|from|where/i.test(text)) { |
/select|insert|create|from|where/i.test(text) will misclassify any paragraph containing the word "create" as SQL — including markdown like "Click Create to continue" or python comments mentioning "insert". Suggest:
- Check the interpreter prefix (%spark.sql, %md, …) first; this is reliable
- Fall back to the keyword heuristic only if no prefix is present
The backend IndexEntry could also include the resolved interpreter name to avoid duplicating detection in the frontend.
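If the detection were moved to the backend, it might look roughly like this. ParagraphLang and detect are hypothetical names, the mapping from interpreter names to badges is a guess based on the sql/python/md badges mentioned in this PR, and the fallback only fires when the text starts with a SQL verb rather than merely containing one.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: resolve a badge from the interpreter prefix, keywords as fallback. */
final class ParagraphLang {
  // Leading "%interpreter" token, for example "%spark.sql" or "%md".
  private static final Pattern PREFIX = Pattern.compile("^\\s*%([\\w.]+)");
  // Fallback only: text that begins with a SQL verb as a whole word.
  private static final Pattern SQL_START =
      Pattern.compile("^\\s*(select|insert|create)\\b", Pattern.CASE_INSENSITIVE);

  /** Returns "sql", "python", "md", or "" when the language is unknown. */
  static String detect(String text) {
    if (text == null) {
      return "";
    }
    Matcher m = PREFIX.matcher(text);
    if (m.find()) {
      String interpreter = m.group(1).toLowerCase(Locale.ROOT);
      if (interpreter.equals("md") || interpreter.endsWith(".md")) {
        return "md";
      }
      if (interpreter.endsWith("sql")) {
        return "sql";
      }
      if (interpreter.contains("python") || interpreter.contains("pyspark")) {
        return "python";
      }
      return ""; // known prefix, but no badge defined for it in this sketch
    }
    return SQL_START.matcher(text).find() ? "sql" : "";
  }

  public static void main(String[] args) {
    System.out.println("[" + detect("%md Click Create to continue") + "]");
    System.out.println("[" + detect("Click Create to continue") + "]");
  }
}
```

Because an explicit prefix always wins, a markdown paragraph that mentions "Create" keeps its md badge, and prose without a prefix gets no badge at all.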
| "outDir": "./dist/out-tsc", | ||
| "sourceMap": true, | ||
| "strict": true, | ||
| "noImplicitAny": false, |
These two flags relax type safety project-wide and aren't required by the semantic search feature. The accompanying as any casts in credential.component.ts, code-editor.component.ts, save-as.service.ts, run-scripts.directive.ts look unrelated too.
Could you split the tsconfig changes (and the related casts) into a separate PR? It would be much easier to review the underlying type issues independently, and avoids permanently lowering the project's type-safety bar as a side effect of a search feature.
There was a problem hiding this comment.
I agree with this review. In the same vein as skipLibCheck, I think it would be better to remove these configuration changes as well.
|
Hello, I checked the logic in detail and found some points for improvement. Some are critical issues and others are not, but I have left all of them for your reference. Please check the list below:
Critical
Major
Minor
Suggested follow-ups (non-blocking)
|
| } | ||
| const text = model.getValue(); | ||
| const newDecorations = []; | ||
| const newDecorations: any[] = []; |
| const newDecorations: any[] = []; | |
| const newDecorations: MonacoEditor.IModelDeltaDecoration[] = []; |
| const controls = [...Object.entries(data.userCredentials)].map(e => { | ||
| const entity = e[0]; | ||
| const { username, password } = e[1]; | ||
| const { username, password } = e[1] as any; |
| const { username, password } = e[1] as any; | |
| const { username, password } = e[1] as CredentialForm; |
| return; | ||
| } | ||
| this.ngZone.onStable.pipe(take(1)).subscribe(() => { | ||
| (this.ngZone.onStable as any).pipe(take(1)).subscribe(() => { |
| (this.ngZone.onStable as any).pipe(take(1)).subscribe(() => { | |
| (this.ngZone.onStable).pipe(take(1)).subscribe(() => { |
| <div *ngIf="tablesText" class="tables-block"> | ||
| 📊 {{ tablesText }} | ||
| </div> |
| <div *ngIf="tablesText" class="tables-block"> | |
| 📊 {{ tablesText }} | |
| </div> | |
| <div *ngIf="tablesText" class="tables-block">📊 {{ tablesText }}</div> |
The CI(run-playwright-e2e-tests) is failing due to lint issues, so this part needs to be fixed. Running npm run lint:fix in zeppelin-web-angular resolves the problem.
run-playwright-e2e-tests passed after applying two commits.
voidmatcha@7ae095f
voidmatcha@15fc35e
https://github.com/voidmatcha/zeppelin/actions/runs/25116480167/job/73604863894
What is this PR for?
Added EmbeddingSearch, a new SearchService implementation that enables natural language search across Zeppelin notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2).
Disabled by default, enabled with zeppelin.search.semantic.enable = true.
The problem:
Zeppelin's built-in search uses Lucene's keyword matching, which works well for exact terms but falls short for the way analysts actually search.
A user looking for "yesterday's spending" gets zero results, even though their notebooks contain SELECT sum(cost) WHERE date = current_date - interval '1' day. The words don't match, so Lucene can't find it.
This PR adds EmbeddingSearch, an alternative SearchService that uses sentence embeddings (all-MiniLM-L6-v2 via ONNX Runtime) to match by meaning instead of keywords. It runs entirely in-process with no external services required.
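For readers unfamiliar with embedding-based matching, the ranking step can be sketched as follows. The description does not state the scoring function, so cosine similarity is assumed here as the usual choice for sentence embeddings; the vectors are toy 3-dimensional values rather than real 384-dimensional model output, and CosineRankDemo is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Illustrative only: ranks documents by cosine similarity to a query vector. */
final class CosineRankDemo {

  /** Cosine of the angle between two vectors; 0.0 if either vector is all zeros. */
  static double cosine(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same dimension");
    }
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) {
      return 0.0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Returns document IDs ordered from most to least similar to the query. */
  static List<String> rank(float[] query, Map<String, float[]> docs) {
    List<String> ids = new ArrayList<>(docs.keySet());
    ids.sort(Comparator.comparingDouble((String id) -> cosine(query, docs.get(id))).reversed());
    return ids;
  }
}
```

In this picture the query and every indexed paragraph are each turned into a vector once, and search reduces to comparing directions: paragraphs whose vectors point the same way as the query rank first even when they share no words with it.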
Beyond semantic matching, EmbeddingSearch addresses other gaps in notebook search:
What type of PR is it?
Feature
Todos
What is the Jira issue?
How should this be tested?
Automated tests:
Screenshots (if appropriate)
Semantic Search with New UI


Semantic Search with Classic UI
Questions: