Implement Phase 1 ingestion workflow: init/scan/status with SQLite ledger and hash-based deduplication #1

Merged
The-TechLab merged 6 commits into main from copilot/initial-repo-docs-structure
Jun 4, 2026

Copilot AI commented Jun 4, 2026

This PR establishes ScribeFlow’s Phase 1 local ingestion foundation: initialize a workspace, scan MP3/MP4 inboxes, and track files in a local SQLite ledger with duplicate prevention by content hash. It intentionally stops at tracking (no transcription or FFmpeg processing behavior yet).

  • CLI surface (Typer)

    • Implements scribeflow version, scribeflow init, scribeflow scan, scribeflow status.
    • init is idempotent and creates required runtime directories plus .scribeflow/ledger.sqlite.
    • scan emits a compact Rich summary: scanned, newly registered, duplicate hashes skipped, unsupported ignored.
    • status reports total/pending/completed/failed and renders pending-file rows (filename, type, size, discovered_at).
  • Ledger + data model

    • Adds a concrete SQLite ledger layer with the requested schema fields (source_path, original_filename, normalized_filename, file_type, file_size, file_hash, timestamps, outputs, error_message, retry_count, etc.).
    • Enforces file_hash uniqueness for content-level dedupe.
    • Constrains current statuses to pending | completed | failed and exposes aggregate/status queries.
  • Hashing + scanner behavior

    • Adds chunked SHA-256 file hashing over file contents (large-file safe).
    • Scanner walks inbox/mp3 and inbox/mp4, accepts only .mp3/.mp4, registers new files as pending, and skips already-known hashes even under different filenames.
  • Project organization + docs

    • Introduces dedicated modules for config, utils, hashing, ledger, scanner, and status under src/scribeflow/.
    • Updates README and adds docs/cli.md to document implemented commands and current Phase 1 behavior.
    • Aligns example config ledger path with implementation (.scribeflow/ledger.sqlite).
  • Phase 1 test coverage

    • Adds focused tests for idempotent init, ledger creation, hashing determinism, scan registration, hash-based dedupe, and status count retrieval.
# hash-based dedupe on content, not filename
file_hash = sha256_file(file_path)
inserted = ledger.register_pending(
    LedgerEntry(
        source_path=rel_path,
        original_filename=file_path.name,
        normalized_filename=normalize_filename(file_path.name),
        file_type=file_path.suffix.lstrip(".").lower(),
        file_size=file_path.stat().st_size,
        file_hash=file_hash,
        status="pending",
        discovered_at=datetime.now(UTC).isoformat(),
    )
)
# inserted == False when the same bytes were already seen under any name
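
As a rough illustration of how the hashing and ledger pieces described above could fit together, the sketch below pairs a chunked SHA-256 helper with a `UNIQUE` constraint on `file_hash`. It is not the PR's actual `hashing.py` or `ledger.py`: the reduced `files` table, the `open_ledger` helper, the simplified `register_pending` signature, and the use of `INSERT OR IGNORE` are all assumptions made for this example.

```python
import hashlib
import sqlite3
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash file contents chunk by chunk so large media is never read in one go."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def open_ledger(db_path: str) -> sqlite3.Connection:
    """Create a reduced, hypothetical ledger table (the real schema has more columns)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS files (
            id INTEGER PRIMARY KEY,
            source_path TEXT NOT NULL,
            file_hash TEXT NOT NULL UNIQUE,
            status TEXT NOT NULL DEFAULT 'pending'
                CHECK (status IN ('pending', 'completed', 'failed'))
        )
        """
    )
    return conn


def register_pending(conn: sqlite3.Connection, source_path: str, file_hash: str) -> bool:
    """Insert a pending row; return False when the hash is already in the ledger."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO files (source_path, file_hash) VALUES (?, ?)",
        (source_path, file_hash),
    )
    conn.commit()
    # rowcount is 1 for a new row and 0 when the UNIQUE constraint made SQLite skip it.
    return cursor.rowcount == 1
```

Letting the database enforce uniqueness keeps the dedupe check and the insert in a single statement, so two files with identical bytes cannot both be registered even if they arrive under different names.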

Copilot AI changed the title from "Bootstrap ScribeFlow with public-facing docs and initial CLI/project scaffolding" to "Implement Phase 1 ingestion workflow: init/scan/status with SQLite ledger and hash-based deduplication" Jun 4, 2026
Copilot AI requested a review from The-TechLab June 4, 2026 20:09
@The-TechLab

push by Will

@The-TechLab The-TechLab marked this pull request as ready for review June 4, 2026 20:15
Copilot AI review requested due to automatic review settings June 4, 2026 20:15
@The-TechLab The-TechLab merged commit 3680bae into main Jun 4, 2026
2 checks passed

Copilot AI left a comment

Pull request overview

This PR introduces ScribeFlow’s Phase 1 local ingestion foundation: a Typer-based CLI to initialize a workspace, scan inboxes for MP3/MP4 files, and track them in a local SQLite ledger with SHA-256 hash-based deduplication.
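
The inbox scan summarized above can be sketched roughly as follows. This is a minimal illustration, not the contents of `scanner.py`: the constant names, the `find_candidates` function, the recursive walk, and the decision to skip `.gitkeep` marker files are assumptions.

```python
from pathlib import Path

# Assumed values mirroring the PR description; the real constants live in config.py.
INBOX_DIRS = ("inbox/mp3", "inbox/mp4")
SUPPORTED_EXTENSIONS = {".mp3", ".mp4"}


def find_candidates(workspace: Path) -> tuple[list[Path], list[Path]]:
    """Return (supported, unsupported) files found under the inbox folders."""
    supported: list[Path] = []
    unsupported: list[Path] = []
    for inbox in INBOX_DIRS:
        root = workspace / inbox
        if not root.is_dir():
            continue
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            # Assumption: repository marker files are not counted as unsupported media.
            if path.name == ".gitkeep":
                continue
            # Compare case-insensitively so "TALK.MP3" is accepted too.
            if path.suffix.lower() in SUPPORTED_EXTENSIONS:
                supported.append(path)
            else:
                unsupported.append(path)
    return supported, unsupported
```

The supported list would then be hashed and registered, while the length of the unsupported list would feed the "unsupported ignored" figure in the scan summary.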

Changes:

  • Added scribeflow init/scan/status/version commands with Rich-formatted summaries and tables.
  • Implemented a SQLite-backed ledger with a file_hash uniqueness constraint and basic aggregate/pending queries.
  • Added Phase 1 tests plus initial docs/config scaffolding for the workflow and workspace layout.
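
For readers who want a concrete picture of the aggregate and pending queries mentioned in the list above, here is a hedged sketch. The `files` table, its reduced column set, and the `load_counts` / `load_pending` names are hypothetical and chosen only for illustration; the PR's `ledger.py` and `status.py` may be organized differently.

```python
import sqlite3

STATUSES = ("pending", "completed", "failed")

# Hypothetical, reduced table used only for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY,
    original_filename TEXT NOT NULL,
    file_type TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    discovered_at TEXT NOT NULL,
    status TEXT NOT NULL CHECK (status IN ('pending', 'completed', 'failed'))
)
"""


def load_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Return per-status counts plus a total, with zeros for statuses that have no rows."""
    counts = dict.fromkeys(STATUSES, 0)
    for status, count in conn.execute("SELECT status, COUNT(*) FROM files GROUP BY status"):
        counts[status] = count
    counts["total"] = sum(counts[status] for status in STATUSES)
    return counts


def load_pending(conn: sqlite3.Connection) -> list[tuple]:
    """Return (filename, type, size, discovered_at) rows for pending files, oldest first."""
    return conn.execute(
        "SELECT original_filename, file_type, file_size, discovered_at "
        "FROM files WHERE status = 'pending' ORDER BY discovered_at"
    ).fetchall()
```

Pre-filling the dictionary with zeros means a status with no rows still shows up in the summary, which is what a total/pending/completed/failed report needs.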

Reviewed changes

Copilot reviewed 18 out of 46 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `tests/test_smoke.py` | Adds a minimal package/version smoke test. |
| `tests/test_phase1.py` | Adds CLI + ledger + hashing tests for Phase 1 behavior. |
| `tests/__init__.py` | Test package marker (empty). |
| `src/scribeflow/__init__.py` | Defines package version export. |
| `src/scribeflow/__main__.py` | Adds `python -m scribeflow` entrypoint. |
| `src/scribeflow/cli.py` | Implements Typer CLI commands and Rich output. |
| `src/scribeflow/config.py` | Defines workspace layout constants and supported extensions. |
| `src/scribeflow/utils.py` | Adds directory creation, filename normalization, and relative path helpers. |
| `src/scribeflow/hashing.py` | Adds chunked SHA-256 file hashing utility. |
| `src/scribeflow/ledger.py` | Implements SQLite schema + insert/dedupe + status queries. |
| `src/scribeflow/scanner.py` | Scans inbox folders, hashes media, registers pending ledger entries. |
| `src/scribeflow/status.py` | Loads aggregate counts and pending rows for CLI status. |
| `src/scribeflow/commands/__init__.py` | Placeholder for future command modules (empty). |
| `src/scribeflow/commands/init.py` | Placeholder (empty). |
| `src/scribeflow/commands/scan.py` | Placeholder (empty). |
| `src/scribeflow/commands/status.py` | Placeholder (empty). |
| `src/scribeflow/commands/process.py` | Placeholder (empty). |
| `src/scribeflow/commands/retry.py` | Placeholder (empty). |
| `src/scribeflow/commands/reprocess.py` | Placeholder (empty). |
| `src/scribeflow/commands/clean.py` | Placeholder (empty). |
| `src/scribeflow/core/__init__.py` | Placeholder for future “core” modules (empty). |
| `src/scribeflow/core/config.py` | Placeholder (empty). |
| `src/scribeflow/core/ledger.py` | Placeholder (empty). |
| `src/scribeflow/core/models.py` | Placeholder (empty). |
| `src/scribeflow/pipeline/__init__.py` | Placeholder for future pipeline modules (empty). |
| `src/scribeflow/pipeline/scanner.py` | Placeholder (empty). |
| `src/scribeflow/pipeline/media.py` | Placeholder (empty). |
| `src/scribeflow/pipeline/transcriber.py` | Placeholder (empty). |
| `src/scribeflow/pipeline/formatter.py` | Placeholder (empty). |
| `pyproject.toml` | Defines packaging metadata, deps, and console script entrypoint. |
| `scripts/bootstrap.sh` | Adds dev bootstrap script for venv + editable install. |
| `README.md` | Documents Phase 1 workflow, CLI usage, and workspace structure. |
| `docs/cli.md` | Adds Phase 1 CLI documentation. |
| `docs/ARCHITECTURE.md` | Adds placeholder architecture doc. |
| `docs/CONFIGURATION.md` | Adds placeholder configuration doc. |
| `config/scribeflow.example.toml` | Adds example config scaffold including ledger path. |
| `.gitignore` | Ignores runtime state (ledger, outputs, logs) while keeping `.gitkeep` files. |
| `inbox/mp3/.gitkeep` | Keeps inbox directory in repo. |
| `inbox/mp4/.gitkeep` | Keeps inbox directory in repo. |
| `output/markdown/.gitkeep` | Keeps output directory in repo. |
| `output/raw_json/.gitkeep` | Keeps output directory in repo. |
| `output/subtitles/.gitkeep` | Keeps output directory in repo. |
| `archive/completed/.gitkeep` | Keeps archive directory in repo. |
| `archive/failed/.gitkeep` | Keeps archive directory in repo. |
| `logs/.gitkeep` | Keeps logs directory in repo. |
| `.scribeflow/.gitkeep` | Keeps local state directory in repo without committing the DB. |

Comment thread src/scribeflow/utils.py
Comment on lines +18 to +20
def to_relative_posix(path: Path, root: Path) -> str:
    """Return a workspace-relative POSIX path string."""
    return path.resolve().relative_to(root.resolve()).as_posix()
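
The following sketch exercises the helper quoted above with hypothetical paths, to show what it returns for a file inside the workspace and how `Path.relative_to` reacts to one outside it. The `workspace` directory and file names are made up for the example.

```python
from pathlib import Path


def to_relative_posix(path: Path, root: Path) -> str:
    """Return a workspace-relative POSIX path string."""
    return path.resolve().relative_to(root.resolve()).as_posix()


# Hypothetical workspace layout; the paths do not need to exist on disk.
workspace = Path("workspace")
inside = workspace / "inbox" / "mp3" / "talk.mp3"
print(to_relative_posix(inside, workspace))  # inbox/mp3/talk.mp3

try:
    # A path that resolves outside the workspace root cannot be made relative.
    to_relative_posix(Path("elsewhere.mp3"), workspace)
except ValueError:
    print("path is outside the workspace")
```

Because both arguments are resolved first, a file reached through a symlink that points outside the workspace would also raise `ValueError`, which callers may want to handle explicitly.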
Comment thread README.md
Comment on lines +47 to +54
├── inbox/
│   ├── mp3/
│   └── mp4/
├── logs/
├── output/
│   ├── markdown/
│   ├── raw_json/
│   └── subtitles/
Comment thread README.md
Comment on lines +72 to +79
## Installation Requirements
- Python 3.11+
- FFmpeg available on PATH
- OS: macOS, Linux, or Windows (WSL recommended on Windows)

## FFmpeg Requirement
ScribeFlow depends on FFmpeg for media extraction and normalization.
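
A requirement like "FFmpeg available on PATH" can be verified up front with the standard library. The snippet below is only a sketch of such a check; the `ffmpeg_available` name is hypothetical, and nothing in this PR indicates that ScribeFlow performs this check in Phase 1.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True when the named executable can be located on PATH."""
    return shutil.which(executable) is not None


if not ffmpeg_available():
    print("FFmpeg was not found on PATH; media extraction would not be possible.")
```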

Comment thread tests/test_phase1.py
Comment on lines +29 to +31
result = runner.invoke(app, ["init"])

assert result.exit_code == 0
@The-TechLab The-TechLab deleted the copilot/initial-repo-docs-structure branch June 4, 2026 20:24
