Draft
80 commits
4149580
Add WAL for direct deployment state recovery
Jan 11, 2026
e7da9d9
Updated tests and enhanced kill caller with an offset
Jan 12, 2026
8fcd7eb
Updated existing tests
Jan 23, 2026
51f1974
test fixes
Feb 2, 2026
36ff7c4
Fixes
Feb 7, 2026
338ae0e
fixed tests
Feb 9, 2026
ebb16ae
updated tests
varundeepsaini Mar 24, 2026
184d4a4
dedup
varundeepsaini Mar 24, 2026
784f7c5
Update WAL corrupted entry outputs
varundeepsaini Mar 26, 2026
0241232
WIP
denik Mar 27, 2026
8b18631
Updated tests and enhanced kill caller with an offset
Jan 12, 2026
0dd57ab
Updated existing tests
Jan 23, 2026
2029281
Merge simplified WAL handling into state.go
denik Mar 27, 2026
f79fa29
fixes
denik Mar 27, 2026
bb11b78
fixes
denik Mar 27, 2026
fb00793
rm unnecessary assert
denik Mar 27, 2026
9f3d0ec
Centralize state open/close lifecycle for direct engine
denik Mar 28, 2026
adf0621
lint
denik Mar 28, 2026
cdc8e2a
fixes
denik Apr 29, 2026
9c77b42
lint
denik Apr 29, 2026
0669b83
restore test
denik Apr 30, 2026
57a43a7
Skip state file write when WAL has no resource entries
denik Apr 30, 2026
f7d6a5c
Revert per-engine test splits for no-resource deploys
denik Apr 30, 2026
8c38c72
fmt
denik Apr 30, 2026
b0bad1a
Maintain stateIDs as single source of truth for resource IDs
denik Apr 30, 2026
44f6141
Remove defer Close from processBundleRetInternal; align with main app…
denik Apr 30, 2026
13d2ae9
Rename Close to Finalize; make plan a local var in processBundleRetIn…
denik Apr 30, 2026
34403ea
Restore process.go structure to match main more closely
denik Apr 30, 2026
02cd4af
Fix migration count, remove unnecessary defer Finalize, fix errcheck
denik Apr 30, 2026
dc973e1
Fix WAL validation: lowercase suffix, partial recovery, directory cre…
denik Apr 30, 2026
74d192f
restore non-material changes: assertions and comment
denik Apr 30, 2026
d041c19
deduplicate UpgradeToWrite+defer Finalize in Deploy
denik May 1, 2026
baffcb5
update out.test.toml
denik May 1, 2026
5867999
fix compilation in configsync/variables.go
denik May 4, 2026
19c0bf9
use OpenWithData+UpgradeToWrite in migrate to avoid disk roundtrip
denik May 6, 2026
2b294b8
use OpenWithData+UpgradeToWrite in uploadStateForYamlSync
denik May 7, 2026
bcf5d31
remove redundant defer Finalize in Deploy
denik May 10, 2026
ffa5c05
move Finalize into destroyCore before files.Delete
denik May 10, 2026
10d5e68
remove noise comment from bundle_apply.go
denik May 10, 2026
c7f54e8
fix gofumpt and test output
denik May 10, 2026
57ad571
shrink chain-10-jobs to chain-3-jobs
denik May 10, 2026
3bd8efe
fix test names in state_test.go: Close -> Finalize, restore SaveFinalize
denik May 10, 2026
870d434
clean up WAL acceptance tests
denik May 10, 2026
1748036
fix crash-after-create: handle Linux exit code 1 after KillCaller
denik May 10, 2026
01a3610
update selftest
denik May 10, 2026
69dfc3e
fix WAL acceptance test hygiene
denik May 11, 2026
97bd52c
update test output after rebase
denik May 11, 2026
db53353
destroyCore: warn on Finalize failure instead of aborting
denik May 11, 2026
4ef7c16
update test output
denik May 11, 2026
47e11ad
deployCore: use Finalize return value instead of re-opening state
denik May 11, 2026
deae01f
statemgmt.Load: accept state directly instead of engine
denik May 11, 2026
f896d36
fmt
denik May 11, 2026
08b304e
deployCore: move ParseResourcesState before PushResourcesState
denik May 11, 2026
32b4988
simplify test
denik May 11, 2026
f628251
update outputs
denik May 11, 2026
2c6d40c
fix Windows replacement for process kill during deployment
denik May 11, 2026
38c1756
formatting
denik May 11, 2026
f27e4ca
clean up
denik May 11, 2026
e113406
rm unnecessarial SERIAL replacement
denik May 11, 2026
190ce16
rm noop replacement for lineage
denik May 11, 2026
eca0376
clean up
denik May 11, 2026
78ef5f1
testserver: replace KillCaller config with HTTP kill API
denik May 11, 2026
284c4db
remove blank line
denik May 11, 2026
d2362e2
wal tests: remove redundant server stubs covered by default handlers
denik May 11, 2026
22fd654
wal tests: move test.toml comments to script, remove empty test.toml …
denik May 11, 2026
853b56b
Add databricks.yml
May 11, 2026
607f657
clean up
May 11, 2026
4b4a022
add replace_ids.py
May 11, 2026
29e0ca5
clean up
May 11, 2026
c98b3dd
test more commands for validation
May 11, 2026
f151e71
remove normal-deploy test
May 11, 2026
cae1012
clean up
May 11, 2026
088ed09
test recover in plan/deploy/summary
May 11, 2026
a52fa51
clean up
May 11, 2026
16de5c1
add assert_*.py
May 11, 2026
3866db1
corrupted-wal-entry: use envsubst + template file for WAL generation
May 11, 2026
ae293dd
kill_caller selftests: move test.toml comments to script, remove empt…
May 11, 2026
202c0ac
formatting
May 12, 2026
9a1bb57
fix CI: commit missing test.tomls and fix assert_*.py permissions
May 12, 2026
2d5eea9
fix: use TOML basic strings with \n escapes in Repls to avoid CRLF on…
May 12, 2026
12 changes: 12 additions & 0 deletions acceptance/bin/assert_exists.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
#!/usr/bin/env python3
import os, sys

errors = 0

for filename in sys.argv[1:]:
    if not os.path.exists(filename):
        sys.stderr.write(f"Unexpected: {filename} does not exist.\n")
        errors += 1

if errors:
    sys.exit(1)
12 changes: 12 additions & 0 deletions acceptance/bin/assert_not_exists.py
@@ -0,0 +1,12 @@
#!/usr/bin/env python3
import os, sys

errors = 0

for filename in sys.argv[1:]:
    if os.path.exists(filename):
        sys.stderr.write(f"Unexpected: {filename} exists.\n")
        errors += 1

if errors:
    sys.exit(1)
39 changes: 39 additions & 0 deletions acceptance/bin/kill_after.py
@@ -0,0 +1,39 @@
#!/usr/bin/env python3
"""Set up a kill rule on the testserver for the current test token.

Usage: kill_after.py PATTERN OFFSET TIMES

  PATTERN  HTTP method and path, e.g. "POST /api/2.2/jobs/create"
  OFFSET   number of requests to let through before killing starts
  TIMES    number of times to kill the caller

The rule is scoped to the current DATABRICKS_TOKEN so it only affects
the test that registers it, even when tests share a server.
"""

import json
import os
import sys
import urllib.request

host = os.environ.get("DATABRICKS_HOST", "")
token = os.environ.get("DATABRICKS_TOKEN", "")

if not host:
    print("DATABRICKS_HOST not set", file=sys.stderr)
    sys.exit(1)

if len(sys.argv) != 4:
    print(f"usage: {sys.argv[0]} PATTERN OFFSET TIMES", file=sys.stderr)
    sys.exit(1)

pattern, offset, times = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])

data = json.dumps({"pattern": pattern, "offset": offset, "times": times}).encode()
req = urllib.request.Request(
    f"{host}/__testserver/kill",
    data=data,
    headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
    method="POST",
)
urllib.request.urlopen(req)
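The docstring above describes the rule as "let OFFSET requests through, then kill TIMES times". A minimal sketch of that counting logic follows. It is an illustration only: the `KillRule` class and its `should_kill` method are hypothetical names, not the testserver's actual implementation, and exact string matching on the pattern is an assumption.

```python
from dataclasses import dataclass


@dataclass
class KillRule:
    """Hypothetical model of one kill rule; not the testserver's real code."""

    pattern: str  # e.g. "POST /api/2.2/jobs/create"
    offset: int   # matching requests to let through before killing starts
    times: int    # number of matching requests that trigger a kill afterwards

    def should_kill(self, method: str, path: str) -> bool:
        """Return True if this request should kill the caller."""
        if f"{method} {path}" != self.pattern:
            return False  # non-matching requests never consume the counters
        if self.offset > 0:
            self.offset -= 1
            return False
        if self.times > 0:
            self.times -= 1
            return True
        return False


# Same arguments as `kill_after.py "POST /api/2.2/jobs/create" 2 1`.
rule = KillRule("POST /api/2.2/jobs/create", offset=2, times=1)
decisions = [rule.should_kill("POST", "/api/2.2/jobs/create") for _ in range(4)]
print(decisions)  # [False, False, True, False]
```

With offset 2 and times 1, the first two matching requests pass, the third kills the caller, and later requests pass again.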
37 changes: 37 additions & 0 deletions acceptance/bundle/deploy/wal/chain-3-jobs/databricks.yml
@@ -0,0 +1,37 @@
bundle:
  name: wal-chain-test

resources:
  jobs:
    # Linear chain: job_01 -> job_02 -> job_03
    # Execution order: job_01 first, job_03 last
    job_01:
      name: "job-01"
      description: "first in chain"
      tasks:
        - task_key: "task"
          spark_python_task:
            python_file: ./test.py
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
    job_02:
      name: "job-02"
      description: "depends on ${resources.jobs.job_01.id}"
      tasks:
        - task_key: "task"
          spark_python_task:
            python_file: ./test.py
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
    job_03:
      name: "job-03"
      description: "depends on ${resources.jobs.job_02.id}"
      tasks:
        - task_key: "task"
          spark_python_task:
            python_file: ./test.py
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
3 changes: 3 additions & 0 deletions acceptance/bundle/deploy/wal/chain-3-jobs/out.test.toml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

110 changes: 110 additions & 0 deletions acceptance/bundle/deploy/wal/chain-3-jobs/output.txt
@@ -0,0 +1,110 @@
=== First deploy (crashes on job_03) ===

>>> errcode [CLI] bundle deploy
Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default/files...
Deploying resources...
[PROCESS_KILLED]

Exit code: [KILLED]

=== WAL content after crash ===
{
  "cli_version": "[DEV_VERSION]",
  "lineage": "[UUID]",
  "serial": 1,
  "state_version": 2
}
{
  "k": "resources.jobs.job_01",
  "v": {
    "__id__": "[JOB_01_ID]",
    "state": {
      "deployment": {
        "kind": "BUNDLE",
        "metadata_file_path": "/Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default/state/metadata.json"
      },
      "description": "first in chain",
      "edit_mode": "UI_LOCKED",
      "format": "MULTI_TASK",
      "max_concurrent_runs": 1,
      "name": "job-01",
      "queue": {
        "enabled": true
      },
      "tasks": [
        {
          "new_cluster": {
            "node_type_id": "[NODE_TYPE_ID]",
            "spark_version": "15.4.x-scala2.12"
          },
          "spark_python_task": {
            "python_file": "/Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default/files/test.py"
          },
          "task_key": "task"
        }
      ]
    }
  }
}
{
  "k": "resources.jobs.job_02",
  "v": {
    "__id__": "[JOB_02_ID]",
    "depends_on": [
      {
        "label": "${resources.jobs.job_01.id}",
        "node": "resources.jobs.job_01"
      }
    ],
    "state": {
      "deployment": {
        "kind": "BUNDLE",
        "metadata_file_path": "/Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default/state/metadata.json"
      },
      "description": "depends on [JOB_01_ID]",
      "edit_mode": "UI_LOCKED",
      "format": "MULTI_TASK",
      "max_concurrent_runs": 1,
      "name": "job-02",
      "queue": {
        "enabled": true
      },
      "tasks": [
        {
          "new_cluster": {
            "node_type_id": "[NODE_TYPE_ID]",
            "spark_version": "15.4.x-scala2.12"
          },
          "spark_python_task": {
            "python_file": "/Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default/files/test.py"
          },
          "task_key": "task"
        }
      ]
    }
  }
}

=== Number of jobs saved in WAL ===
2

=== Bundle summary (reads from WAL) ===
Name: wal-chain-test
Target: default
Workspace:
  User: [USERNAME]
  Path: /Workspace/Users/[USERNAME]/.bundle/wal-chain-test/default
Resources:
  Jobs:
    job_01:
      Name: job-01
      URL: [DATABRICKS_URL]/jobs/[JOB_01_ID]?o=[NUMID]
    job_02:
      Name: job-02
      URL: [DATABRICKS_URL]/jobs/[JOB_02_ID]?o=[NUMID]
    job_03:
      Name: job-03
      URL: (not deployed)

=== WAL after successful deploy ===
WAL deleted (expected)
24 changes: 24 additions & 0 deletions acceptance/bundle/deploy/wal/chain-3-jobs/script
@@ -0,0 +1,24 @@
# Linear chain: job_01 -> job_02 -> job_03
# Let first 2 jobs/create succeed, then kill on the 3rd
kill_after.py "POST /api/2.2/jobs/create" 2 1

echo "=== First deploy (crashes on job_03) ==="
trace errcode $CLI bundle deploy

echo ""
echo "=== WAL content after crash ==="
jq -S . .databricks/bundle/default/resources.json.wal 2>/dev/null || echo "No WAL file"

echo ""
echo "=== Number of jobs saved in WAL ==="
grep -c '"k":"resources.jobs' .databricks/bundle/default/resources.json.wal 2>/dev/null || echo "0"

echo ""
echo "=== Bundle summary (reads from WAL) ==="
$CLI bundle summary

echo ""
echo "=== WAL after successful deploy ==="
cat .databricks/bundle/default/resources.json.wal 2>/dev/null || echo "WAL deleted (expected)"

replace_ids.py
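The script above inspects the WAL with `jq` and `grep`: the file is one JSON document per line, with a header line followed by entries of the form `{"k": ..., "v": ...}`. A rough sketch of reading that layout is shown here. The `read_wal` helper and the sample IDs are hypothetical, and the "later line wins for the same key" behaviour is an assumption, not something the diff confirms.

```python
import json


def read_wal(text: str):
    """Split WAL text into (header, entries); assumes every line is valid JSON."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = json.loads(lines[0])
    entries = {}
    for line in lines[1:]:
        record = json.loads(line)
        # Assumption: a later entry for the same key replaces an earlier one.
        entries[record["k"]] = record["v"]
    return header, entries


# Illustrative sample in the same shape as the WAL shown in output.txt.
sample = "\n".join([
    '{"lineage":"example-lineage","serial":1}',
    '{"k":"resources.jobs.job_01","v":{"__id__":"101","state":{"name":"job-01"}}}',
    '{"k":"resources.jobs.job_02","v":{"__id__":"102","state":{"name":"job-02"}}}',
])
header, entries = read_wal(sample)
job_count = sum(1 for key in entries if key.startswith("resources.jobs."))
print(header["serial"], job_count)  # 1 2
```

Counting keys with the `resources.jobs.` prefix mirrors the `grep -c` check in the script, which reports 2 because the process was killed before `job_03` was recorded.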
1 change: 1 addition & 0 deletions acceptance/bundle/deploy/wal/chain-3-jobs/test.py
@@ -0,0 +1 @@
print("test")
23 changes: 23 additions & 0 deletions acceptance/bundle/deploy/wal/corrupted-wal-entry/databricks.yml
@@ -0,0 +1,23 @@
bundle:
  name: wal-corrupted-test

resources:
  jobs:
    valid_job:
      name: "valid-job"
      tasks:
        - task_key: "task-a"
          spark_python_task:
            python_file: ./test.py
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
    another_valid:
      name: "another-valid"
      tasks:
        - task_key: "task-b"
          spark_python_task:
            python_file: ./test.py
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge


34 changes: 34 additions & 0 deletions acceptance/bundle/deploy/wal/corrupted-wal-entry/output.txt
@@ -0,0 +1,34 @@

>>> cat .databricks/bundle/default/resources.json.wal
{"lineage":"test-lineage-123","serial":6}
{"k":"resources.jobs.valid_job","v":{"__id__":"","state":{"name":"valid-job"}}}
{"k":"resources.jobs.another_valid","v":{"__id__":"","state":{"name":"another-valid"}}}
{"k":"resources.jobs.partial_write","v":{"__id__":"33","state":{"name":"partial-

>>> [CLI] bundle deploy
Warn: Skipping corrupted WAL entry at [TEST_TMP_DIR]/.databricks/bundle/default/resources.json.wal:4: unexpected end of JSON input
Warn: Saved 1 corrupted WAL entries to [TEST_TMP_DIR]/.databricks/bundle/default/resources.json.wal.corrupted
Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-corrupted-test/default/files...
Deploying resources...
Updating deployment state...
Deployment complete!

>>> [CLI] bundle summary
Name: wal-corrupted-test
Target: default
Workspace:
  User: [USERNAME]
  Path: /Workspace/Users/[USERNAME]/.bundle/wal-corrupted-test/default
Resources:
  Jobs:
    another_valid:
      Name: another-valid
      URL: [DATABRICKS_URL]/jobs/[NUMID]?o=[NUMID]
    valid_job:
      Name: valid-job
      URL: [DATABRICKS_URL]/jobs/[NUMID]?o=[NUMID]

>>> cat .databricks/bundle/default/resources.json.wal.corrupted
{"k":"resources.jobs.partial_write","v":{"__id__":"33","state":{"name":"partial-
=== WAL after successful deploy ===
WAL deleted (expected)
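The output above shows the recovery behaviour under test: a truncated final line is skipped with a warning, the valid entries are kept, and the bad line is saved to a `.corrupted` file. The sketch below illustrates that kind of tolerant line-by-line parse. The `recover_wal` helper is a hypothetical name, the sketch assumes the header line itself is intact, and it does not reproduce the CLI's actual Go implementation.

```python
import json


def recover_wal(text: str):
    """Return (header, entries, corrupted); corrupted holds (line_no, raw_line).

    Assumes the first non-empty line (the header) parses successfully.
    """
    header, entries, corrupted = None, {}, []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            # Keep the raw text so it can be written to a side file for inspection.
            corrupted.append((line_no, raw))
            continue
        if header is None:
            header = record
        else:
            entries[record["k"]] = record["v"]
    return header, entries, corrupted


# The four WAL lines from the test output, the last one cut off mid-string.
wal_text = "\n".join([
    '{"lineage":"test-lineage-123","serial":6}',
    '{"k":"resources.jobs.valid_job","v":{"__id__":"","state":{"name":"valid-job"}}}',
    '{"k":"resources.jobs.another_valid","v":{"__id__":"","state":{"name":"another-valid"}}}',
    '{"k":"resources.jobs.partial_write","v":{"__id__":"33","state":{"name":"partial-',
])
header, entries, corrupted = recover_wal(wal_text)
print(sorted(entries), [line_no for line_no, _ in corrupted])
```

Here the two complete entries survive and line 4 is reported as corrupted, which matches the line number in the CLI warning.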
@@ -0,0 +1,7 @@
{
"state_version": 1,
"cli_version": "0.0.0",
"lineage": "test-lineage-123",
"serial": 5,
"state": {}
}
@@ -0,0 +1,4 @@
{"lineage":"test-lineage-123","serial":6}
{"k":"resources.jobs.valid_job","v":{"__id__":"$JOB1","state":{"name":"valid-job"}}}
{"k":"resources.jobs.another_valid","v":{"__id__":"$JOB2","state":{"name":"another-valid"}}}
{"k":"resources.jobs.partial_write","v":{"__id__":"33","state":{"name":"partial-
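The template above carries `$JOB1` and `$JOB2` placeholders that `envsubst` fills in. `envsubst` reads only variables present in its environment and replaces references to unset ones with an empty string. The sketch below imitates that behaviour with Python's `string.Template`; the `render` helper and the ID `101` are made up for illustration.

```python
from collections import defaultdict
from string import Template

line = '{"k":"resources.jobs.valid_job","v":{"__id__":"$JOB1","state":{"name":"valid-job"}}}'


def render(template_text: str, env: dict) -> str:
    """Substitute $NAME references; names missing from env become empty strings."""
    return Template(template_text).substitute(defaultdict(str, env))


print(render(line, {"JOB1": "101"}))  # "__id__" becomes "101"
print(render(line, {}))               # "__id__" becomes ""
```

The second call produces an empty `__id__`, the same shape as the WAL lines in the recorded output. One possible explanation for those empty values is that `JOB1` and `JOB2` are shell variables that were not exported before `envsubst` ran.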
22 changes: 22 additions & 0 deletions acceptance/bundle/deploy/wal/corrupted-wal-entry/script
@@ -0,0 +1,22 @@
# Create pre-existing jobs in the testserver so WAL recovery triggers DoUpdate (reset) instead of DoCreate
JOB1=$($CLI jobs create --json '{"name":"valid-job"}' | jq -r '.job_id')
JOB2=$($CLI jobs create --json '{"name":"another-valid"}' | jq -r '.job_id')
echo "$JOB1:JOB1_ID" >> ACC_REPLS
echo "$JOB2:JOB2_ID" >> ACC_REPLS

mkdir -p .databricks/bundle/default
cp resources.json .databricks/bundle/default/

envsubst < resources.json.wal.tmpl > .databricks/bundle/default/resources.json.wal

trace cat .databricks/bundle/default/resources.json.wal
trace $CLI bundle deploy
trace $CLI bundle summary
trace cat .databricks/bundle/default/resources.json.wal.corrupted

printf "\n=== WAL after successful deploy ===\n"
if [ -f ".databricks/bundle/default/resources.json.wal" ]; then
    echo "WAL exists (unexpected)"
else
    echo "WAL deleted (expected)"
fi
1 change: 1 addition & 0 deletions acceptance/bundle/deploy/wal/corrupted-wal-entry/test.py
@@ -0,0 +1 @@
print("test")
25 changes: 25 additions & 0 deletions acceptance/bundle/deploy/wal/crash-after-create/databricks.yml
@@ -0,0 +1,25 @@
bundle:
  name: wal-crash-test

resources:
  jobs:
    job_a:
      name: "test-job-a"
      description: "first job"
      tasks:
        - task_key: "task-a"
          spark_python_task:
            python_file: ./test.py
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
    job_b:
      name: "test-job-b"
      description: "depends on ${resources.jobs.job_a.id}"
      tasks:
        - task_key: "task-b"
          spark_python_task:
            python_file: ./test.py
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
4 changes: 4 additions & 0 deletions acceptance/bundle/deploy/wal/crash-after-create/out.test.toml
