Implement Phase 1 ingestion workflow: init/scan/status with SQLite ledger and hash-based deduplication #1
Merged
Copilot created this pull request from a session on behalf of The-TechLab on June 4, 2026, 20:00.
Copilot (AI) changed the title from "Bootstrap ScribeFlow with public-facing docs and initial CLI/project scaffolding" to "Implement Phase 1 ingestion workflow: init/scan/status with SQLite ledger and hash-based deduplication" on Jun 4, 2026.
The-TechLab approved these changes on Jun 4, 2026.
Pull request overview
This PR introduces ScribeFlow’s Phase 1 local ingestion foundation: a Typer-based CLI to initialize a workspace, scan inboxes for MP3/MP4 files, and track them in a local SQLite ledger with SHA-256 hash-based deduplication.
Changes:
- Added `scribeflow init/scan/status/version` commands with Rich-formatted summaries and tables.
- Implemented a SQLite-backed ledger with a `file_hash` uniqueness constraint and basic aggregate/pending queries.
- Added Phase 1 tests plus initial docs/config scaffolding for the workflow and workspace layout.
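As a rough illustration of the ledger item above, here is a minimal sketch of a SQLite table with a `file_hash` uniqueness constraint plus an aggregate status query. The table name `files`, the helper names, and the use of `INSERT OR IGNORE` are assumptions made for this sketch; the PR's actual `ledger.py` schema and API may differ.

```python
import sqlite3

# Hypothetical schema: only source_path, file_hash, and the status values
# come from the PR text; the table name and layout are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY,
    source_path TEXT NOT NULL,
    file_hash TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL DEFAULT 'pending'
)
"""


def open_ledger(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the ledger and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def register_file(conn: sqlite3.Connection, source_path: str, file_hash: str) -> bool:
    """Insert a pending row; return False when the hash is already known."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO files (source_path, file_hash) VALUES (?, ?)",
        (source_path, file_hash),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint caused the insert to be ignored.
    return cur.rowcount == 1


def status_counts(conn: sqlite3.Connection) -> dict:
    """Aggregate row counts per status value."""
    rows = conn.execute("SELECT status, COUNT(*) FROM files GROUP BY status")
    return dict(rows.fetchall())
```

Letting the database enforce uniqueness keeps the dedupe rule in one place: a second file with the same content hash is rejected even if it arrives under a different filename.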
Reviewed changes
Copilot reviewed 18 out of 46 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_smoke.py | Adds a minimal package/version smoke test. |
| tests/test_phase1.py | Adds CLI + ledger + hashing tests for Phase 1 behavior. |
| `tests/__init__.py` | Test package marker (empty). |
| `src/scribeflow/__init__.py` | Defines package version export. |
| `src/scribeflow/__main__.py` | Adds `python -m scribeflow` entrypoint. |
| src/scribeflow/cli.py | Implements Typer CLI commands and Rich output. |
| src/scribeflow/config.py | Defines workspace layout constants and supported extensions. |
| src/scribeflow/utils.py | Adds directory creation, filename normalization, and relative path helpers. |
| src/scribeflow/hashing.py | Adds chunked SHA-256 file hashing utility. |
| src/scribeflow/ledger.py | Implements SQLite schema + insert/dedupe + status queries. |
| src/scribeflow/scanner.py | Scans inbox folders, hashes media, registers pending ledger entries. |
| src/scribeflow/status.py | Loads aggregate counts and pending rows for CLI status. |
| `src/scribeflow/commands/__init__.py` | Placeholder for future command modules (empty). |
| src/scribeflow/commands/init.py | Placeholder (empty). |
| src/scribeflow/commands/scan.py | Placeholder (empty). |
| src/scribeflow/commands/status.py | Placeholder (empty). |
| src/scribeflow/commands/process.py | Placeholder (empty). |
| src/scribeflow/commands/retry.py | Placeholder (empty). |
| src/scribeflow/commands/reprocess.py | Placeholder (empty). |
| src/scribeflow/commands/clean.py | Placeholder (empty). |
| `src/scribeflow/core/__init__.py` | Placeholder for future “core” modules (empty). |
| src/scribeflow/core/config.py | Placeholder (empty). |
| src/scribeflow/core/ledger.py | Placeholder (empty). |
| src/scribeflow/core/models.py | Placeholder (empty). |
| `src/scribeflow/pipeline/__init__.py` | Placeholder for future pipeline modules (empty). |
| src/scribeflow/pipeline/scanner.py | Placeholder (empty). |
| src/scribeflow/pipeline/media.py | Placeholder (empty). |
| src/scribeflow/pipeline/transcriber.py | Placeholder (empty). |
| src/scribeflow/pipeline/formatter.py | Placeholder (empty). |
| pyproject.toml | Defines packaging metadata, deps, and console script entrypoint. |
| scripts/bootstrap.sh | Adds dev bootstrap script for venv + editable install. |
| README.md | Documents Phase 1 workflow, CLI usage, and workspace structure. |
| docs/cli.md | Adds Phase 1 CLI documentation. |
| docs/ARCHITECTURE.md | Adds placeholder architecture doc. |
| docs/CONFIGURATION.md | Adds placeholder configuration doc. |
| config/scribeflow.example.toml | Adds example config scaffold including ledger path. |
| .gitignore | Ignores runtime state (ledger, outputs, logs) while keeping .gitkeep files. |
| inbox/mp3/.gitkeep | Keeps inbox directory in repo. |
| inbox/mp4/.gitkeep | Keeps inbox directory in repo. |
| output/markdown/.gitkeep | Keeps output directory in repo. |
| output/raw_json/.gitkeep | Keeps output directory in repo. |
| output/subtitles/.gitkeep | Keeps output directory in repo. |
| archive/completed/.gitkeep | Keeps archive directory in repo. |
| archive/failed/.gitkeep | Keeps archive directory in repo. |
| logs/.gitkeep | Keeps logs directory in repo. |
| .scribeflow/.gitkeep | Keeps local state directory in repo without committing the DB. |
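The table lists `src/scribeflow/hashing.py` as a chunked SHA-256 file hashing utility. A minimal sketch of that technique, assuming a hypothetical function name `sha256_file` and a 1 MiB chunk size (neither is confirmed by the PR):

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large media never loads fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # read() returns b"" at end of file, which ends the loop.
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is identical to hashing the whole file at once; chunking only bounds peak memory use, which matters for large MP4 inputs.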
Comment on lines +18 to +20

    def to_relative_posix(path: Path, root: Path) -> str:
        """Return a workspace-relative POSIX path string."""
        return path.resolve().relative_to(root.resolve()).as_posix()
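As a sketch of how this helper behaves, the example below repeats the function from the diff and adds a hypothetical wrapper, `try_relative_posix`, that is not part of the PR. It illustrates one property worth knowing: `Path.relative_to` raises `ValueError` when the resolved path does not sit under the resolved root.

```python
from pathlib import Path
from typing import Optional


def to_relative_posix(path: Path, root: Path) -> str:
    """Return a workspace-relative POSIX path string."""
    return path.resolve().relative_to(root.resolve()).as_posix()


def try_relative_posix(path: Path, root: Path) -> Optional[str]:
    """Hypothetical wrapper: return None instead of raising for outside paths."""
    try:
        return to_relative_posix(path, root)
    except ValueError:
        # Raised by Path.relative_to when path is not located under root.
        return None
```

Whether callers should raise or skip in that case is a design choice this sketch does not settle.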
Comment on lines +47 to +54

    ├── inbox/
    │   ├── mp3/
    │   └── mp4/
    ├── logs/
    ├── output/
    │   ├── markdown/
    │   ├── raw_json/
    │   └── subtitles/
Comment on lines +72 to +79

    ## Installation Requirements
    - Python 3.11+
    - FFmpeg available on PATH
    - OS: macOS, Linux, or Windows (WSL recommended on Windows)

    ## FFmpeg Requirement
    ScribeFlow depends on FFmpeg for media extraction and normalization.
Comment on lines +29 to +31

    result = runner.invoke(app, ["init"])

    assert result.exit_code == 0
This PR establishes ScribeFlow’s Phase 1 local ingestion foundation: initialize a workspace, scan MP3/MP4 inboxes, and track files in a local SQLite ledger with duplicate prevention by content hash. It intentionally stops at tracking (no transcription or FFmpeg processing behavior yet).
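The scan-and-dedupe flow described above could look roughly like the following sketch. It assumes an in-memory set of known hashes standing in for the SQLite ledger, whole-file hashing for brevity, and hypothetical names (`ScanSummary`, `scan_workspace`); the inbox paths and supported extensions come from the PR description, while the exact meaning of each counter is an assumption.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp3", ".mp4"}
INBOX_DIRS = ("inbox/mp3", "inbox/mp4")


@dataclass
class ScanSummary:
    scanned: int = 0      # regular, non-hidden files examined
    registered: int = 0   # new content hashes recorded as pending
    duplicates: int = 0   # files whose hash was already known
    unsupported: int = 0  # files with an extension outside the supported set


def scan_workspace(root: Path, known_hashes: set) -> ScanSummary:
    """Walk the inbox folders and register each new content hash once."""
    summary = ScanSummary()
    for inbox in INBOX_DIRS:
        folder = root / inbox
        if not folder.is_dir():
            continue
        for path in sorted(folder.iterdir()):
            # Assumption: hidden files such as .gitkeep are skipped silently.
            if not path.is_file() or path.name.startswith("."):
                continue
            summary.scanned += 1
            if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
                summary.unsupported += 1
                continue
            # Whole-file read for brevity; the PR describes chunked hashing.
            file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
            if file_hash in known_hashes:
                summary.duplicates += 1
                continue
            known_hashes.add(file_hash)
            summary.registered += 1
    return summary
```

Because the key is the content hash rather than the filename, a renamed copy of an already-registered file is counted as a duplicate instead of being registered again.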
CLI surface (Typer)
- `scribeflow version`, `scribeflow init`, `scribeflow scan`, `scribeflow status`.
- `init` is idempotent and creates required runtime directories plus `.scribeflow/ledger.sqlite`.
- `scan` emits a compact Rich summary: scanned, newly registered, duplicate hashes skipped, unsupported ignored.
- `status` reports total/pending/completed/failed and renders pending-file rows (filename, type, size, discovered_at).

Ledger + data model
- […] (`source_path`, `original_filename`, `normalized_filename`, `file_type`, `file_size`, `file_hash`, timestamps, outputs, `error_message`, `retry_count`, etc.).
- `file_hash` uniqueness for content-level dedupe.
- […] `pending | completed | failed` and exposes aggregate/status queries.

Hashing + scanner behavior
- […] `inbox/mp3` and `inbox/mp4`, accepts only `.mp3`/`.mp4`, registers new files as `pending`, and skips already-known hashes even under different filenames.

Project organization + docs
- `config`, `utils`, `hashing`, `ledger`, `scanner`, and `status` under `src/scribeflow/`.
- `docs/cli.md` to document implemented commands and current Phase 1 behavior.
- […] `.scribeflow/ledger.sqlite`).

Phase 1 test coverage