The-QAI-Lab · The-TechLab · Jun 4, 2026
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,30 @@
+# Python
+__pycache__/
+*.py[cod]
+*.so
+.pytest_cache/
+
+# Virtual environments
+.venv/
+
+# Local ledger/runtime state
+.scribeflow/*
+!.scribeflow/.gitkeep
+
+# Local outputs
+output/markdown/*
+!output/markdown/.gitkeep
+output/raw_json/*
+!output/raw_json/.gitkeep
+output/subtitles/*
+!output/subtitles/.gitkeep
+logs/*
+!logs/.gitkeep
+
+# OS/editor
+.DS_Store
+.vscode/
+.idea/
+
+# Packaging artifacts
+*.egg-info/
diff --git a/.scribeflow/.gitkeep b/.scribeflow/.gitkeep
diff --git a/README.md b/README.md
@@ -1,2 +1,239 @@
 # ScribeFlow
-CLI tool for converting MP4 and MP3 lectures into clean, timestamped Markdown transcripts using local speech-to-text.
+
+> Local-first CLI for converting MP4/MP3 lectures into timestamped Markdown transcripts.
+
+## Project Overview
+ScribeFlow is an open-source, local-first transcription workflow for people who want structured notes from recorded audio or video.
+
+It is designed for students, researchers, educators, and builders who need:
+- reproducible transcript generation,
+- clear file tracking,
+- clean Markdown outputs suitable for study, search, and AI/RAG workflows,
+- a privacy-friendly workflow that can run fully on local machines.
+
+## Why ScribeFlow Exists
+Most transcription workflows are fragmented: one tool for conversion, another for transcription, another for cleanup, and no reliable ledger to track what was processed.
+
+ScribeFlow solves this by combining ingestion, normalization, transcription, and Markdown formatting in one CLI workflow with a local SQLite ledger to prevent duplicate work.
+
+## Core Features (Phase 1)
+- Local-first ingestion pipeline for MP4 and MP3 files
+- Inbox-based workflow (`inbox/mp4/` and `inbox/mp3/`)
+- SHA-256 file hashing with chunked reads for large files
+- Duplicate prevention by content hash (not filename)
+- SQLite ledger for file tracking and status counts
+- Idempotent workspace initialization (`scribeflow init`)
+- Rich terminal summaries for scan and status commands
+
+## Example Workflow (Current)
+1. Run `scribeflow init` to create workspace folders and ledger
+2. Add media files to `inbox/mp4/` and/or `inbox/mp3/`
+3. Run `scribeflow scan`
+4. ScribeFlow scans inbox files, hashes content, and skips duplicate hashes
+5. New files are registered in SQLite with status `pending`
+6. Run `scribeflow status` to see totals and pending queue
+
+## Folder Structure
+```text
+ScribeFlow/
+├── archive/
+│   ├── completed/
+│   └── failed/
+├── config/
+│   └── scribeflow.example.toml
+├── docs/
+│   ├── ARCHITECTURE.md
+│   └── CONFIGURATION.md
+├── inbox/
+│   ├── mp3/
+│   └── mp4/
+├── logs/
+├── output/
+│   ├── markdown/
+│   ├── raw_json/
+│   └── subtitles/
+├── scripts/
+│   └── bootstrap.sh
+├── src/
+│   └── scribeflow/
+│       ├── __main__.py
+│       ├── cli.py
+│       ├── commands/
+│       ├── core/
+│       └── pipeline/
+├── tests/
+├── .scribeflow/
+│   └── (sqlite ledger lives here)
+├── pyproject.toml
+├── LICENSE
+└── README.md
+```
+
+## Installation Requirements
+- Python 3.11+
+- FFmpeg available on PATH
+- OS: macOS, Linux, or Windows (WSL recommended on Windows)
+
+## FFmpeg Requirement
+ScribeFlow depends on FFmpeg for media extraction and normalization.
+
+Check installation:
+```bash
+ffmpeg -version
+```
+
+Install examples:
+- macOS (Homebrew): `brew install ffmpeg`
+- Ubuntu/Debian: `sudo apt-get install ffmpeg`
+- Windows (choco): `choco install ffmpeg`
+
+## Python Setup
+```bash
+python3.11 -m venv .venv
+source .venv/bin/activate
+pip install --upgrade pip
+pip install -e ".[dev]"
+```
+
+## Basic Usage
+```bash
+scribeflow version
+scribeflow init
+scribeflow scan
+scribeflow status
+```
+
+## CLI Commands
+- `scribeflow version` — print installed version
+- `scribeflow init` — initialize folders and local SQLite ledger
+- `scribeflow scan` — scan inbox folders and register new files as pending
+- `scribeflow status` — show tracked totals and pending files table
+
+Planned later:
+- `scribeflow process`
+- `scribeflow retry`
+- `scribeflow reprocess --file <filename>`
+- `scribeflow clean`
+
+## Configuration
+Configuration is expected to be file-based. TOML and YAML support are planned; the example config uses TOML.
+
+Suggested config surface:
+- input/output directories
+- archive behavior
+- transcription model + device settings
+- subtitle output toggle (SRT/VTT)
+- hashing and duplicate strategy
+- retries and failure policy
+- logging verbosity
+
+See `/config/scribeflow.example.toml` and `/docs/CONFIGURATION.md`.
+
+## How the SQLite Ledger Works
+ScribeFlow keeps a local SQLite database to persist processing state.
+
+Recommended ledger responsibilities:
+- track canonical file path and content hash
+- track each file's status (`pending`, `completed`, or `failed` in Phase 1)
+- store attempt count and timestamps
+- record output artifact locations
+- avoid duplicate processing by hash match
+
+Current location: `.scribeflow/ledger.sqlite`
+
+## File Status Lifecycle
+Current statuses implemented in Phase 1:
+1. `pending`
+2. `completed`
+3. `failed`
+
+In Phase 1, `scribeflow scan` registers new files as `pending` and `scribeflow status` reports counts by status.
+
+## Markdown Output Format
+Markdown transcript rendering is planned for a later phase.
+The output folders already exist (`output/markdown`, `output/raw_json`, `output/subtitles`) but are not written in Phase 1.
+
+## Example Command Output
+```text
+$ scribeflow scan
+            Scan Summary            
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ Metric                     ┃ Count ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ Files scanned              │     3 │
+│ New files registered       │     2 │
+│ Duplicates skipped         │     1 │
+│ Unsupported files ignored  │     0 │
+└────────────────────────────┴───────┘
+```
+
+```text
+$ scribeflow status
+         Ledger Status         
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ Metric                ┃ Count ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ Total files tracked   │     2 │
+│ Pending               │     2 │
+│ Completed             │     0 │
+│ Failed                │     0 │
+└───────────────────────┴───────┘
+```
+
+## Roadmap
+Near-term:
+- robust `init/scan/process/status/retry/reprocess/clean` command implementation
+- stable SQLite schema with migrations
+- better failure diagnostics and retry policies
+
+Planned future commands:
+- `scribeflow process`
+- `scribeflow retry`
+- `scribeflow reprocess --file <filename>`
+- `scribeflow clean`
+- `scribeflow watch`
+- `scribeflow summarize`
+- `scribeflow quiz`
+- `scribeflow terms`
+- `scribeflow index`
+- `scribeflow search`
+
+Long-term:
+- additional STT backends (`whisper.cpp`, hosted APIs)
+- semantic indexing and retrieval support
+- plugin architecture for custom post-processing
+
+## Development Setup
+```bash
+git clone https://github.com/The-QAI-Lab/ScribeFlow.git
+cd ScribeFlow
+python3.11 -m venv .venv
+source .venv/bin/activate
+pip install -e ".[dev]"
+```
+
+## Testing
+```bash
+pytest -q
+```
+
+## Contributing
+Contributions are welcome.
+
+Suggested flow:
+1. Fork and create a feature branch
+2. Add tests for behavior changes
+3. Run `pytest`
+4. Open a PR with clear context and examples
+
+Please keep changes focused, documented, and reproducible.
+
+## License Placeholder
+ScribeFlow is currently released under the MIT License (see `/LICENSE`).
+
+If licensing strategy changes before 1.0, this section will be updated with migration guidance.
+
+## Disclaimer
+Transcription quality depends on audio quality, speaker clarity, domain vocabulary, and model selection.
+
+ScribeFlow may produce errors and should be reviewed before use in academic, legal, medical, or business-critical contexts.
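Reviewer note on the README: the chunked SHA-256 hashing, hash-based duplicate prevention, and per-status counts it describes could look roughly like the sketch below. This is an illustration only, not the code in this PR. The `files` table, its columns, the 1 MiB chunk size, and the function names `sha256_of_file`, `open_ledger`, `register_file`, and `status_counts` are all assumptions.

```python
import hashlib
import sqlite3
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an assumed value, not taken from the PR


def sha256_of_file(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file in fixed-size chunks so large media is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def open_ledger(db_path: str) -> sqlite3.Connection:
    """Open (or create) a ledger using a hypothetical single-table schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS files (
            id INTEGER PRIMARY KEY,
            path TEXT NOT NULL,
            sha256 TEXT NOT NULL UNIQUE,
            status TEXT NOT NULL DEFAULT 'pending'
        )
        """
    )
    return conn


def register_file(conn: sqlite3.Connection, path: Path) -> bool:
    """Insert the file as 'pending'; return False if its hash is already tracked."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO files (path, sha256) VALUES (?, ?)",
        (str(path), sha256_of_file(path)),
    )
    conn.commit()
    return cursor.rowcount == 1


def status_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Count tracked files per status, reporting zero for unused statuses."""
    counts = {"pending": 0, "completed": 0, "failed": 0}
    query = "SELECT status, COUNT(*) FROM files GROUP BY status"
    for status, count in conn.execute(query):
        counts[status] = count
    return counts
```

With a `UNIQUE` constraint on the hash column, SQLite itself rejects duplicates, so two inbox files with different names but identical bytes would produce a single `pending` row.
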
diff --git a/archive/completed/.gitkeep b/archive/completed/.gitkeep
diff --git a/archive/failed/.gitkeep b/archive/failed/.gitkeep
diff --git a/config/scribeflow.example.toml b/config/scribeflow.example.toml
@@ -0,0 +1,22 @@
+# ScribeFlow example config (placeholder)
+
+[paths]
+inbox_mp4 = "inbox/mp4"
+inbox_mp3 = "inbox/mp3"
+output_markdown = "output/markdown"
+output_raw_json = "output/raw_json"
+output_subtitles = "output/subtitles"
+archive_completed = "archive/completed"
+
+[transcription]
+backend = "faster-whisper"
+model = "base"
+language = "auto"
+
+[processing]
+write_subtitles = true
+auto_archive_completed = false
+max_retries = 3
+
+[ledger]
+path = ".scribeflow/ledger.sqlite"
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
@@ -0,0 +1,3 @@
+# ScribeFlow Architecture (Placeholder)
+
+This document will describe module boundaries, processing flow, and extension points.
diff --git a/docs/CONFIGURATION.md b/docs/CONFIGURATION.md
@@ -0,0 +1,3 @@
+# ScribeFlow Configuration (Placeholder)
+
+This document will define supported keys, defaults, and environment overrides.
diff --git a/docs/cli.md b/docs/cli.md
@@ -0,0 +1,93 @@
+# ScribeFlow CLI (Phase 1)
+
+This page documents the currently implemented commands:
+
+- `scribeflow version`
+- `scribeflow init`
+- `scribeflow scan`
+- `scribeflow status`
+
+## `scribeflow version`
+Print the installed version.
+
+```bash
+scribeflow version
+```
+
+Example output:
+```text
+ScribeFlow 0.1.0
+```
+
+## `scribeflow init`
+Creates required workspace folders and initializes the local SQLite ledger.
+
+Creates directories if missing:
+- `inbox/mp4/`
+- `inbox/mp3/`
+- `working/audio/`
+- `working/temp/`
+- `working/logs/`
+- `output/markdown/`
+- `output/raw_json/`
+- `output/subtitles/`
+- `archive/completed/`
+- `archive/failed/`
+- `.scribeflow/`
+
+Creates ledger database:
+- `.scribeflow/ledger.sqlite`
+
+Safe to run multiple times.
+
+```bash
+scribeflow init
+```
+
+Example output:
+```text
+Workspace ready.
+Ledger: .scribeflow/ledger.sqlite
+```
+
+## `scribeflow scan`
+Scans `inbox/mp3/` and `inbox/mp4/`, considers only `.mp3` and `.mp4` files, hashes file contents with SHA-256, and registers new files as `pending`.
+
+Duplicate content hashes are skipped even when filenames differ.
+
+```bash
+scribeflow scan
+```
+
+Example output:
+```text
+            Scan Summary            
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ Metric                     ┃ Count ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ Files scanned              │     3 │
+│ New files registered       │     2 │
+│ Duplicates skipped         │     1 │
+│ Unsupported files ignored  │     0 │
+└────────────────────────────┴───────┘
+```
+
+## `scribeflow status`
+Shows aggregate ledger counts and a pending-files table.
+
+```bash
+scribeflow status
+```
+
+Example output:
+```text
+         Ledger Status         
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ Metric                ┃ Count ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ Total files tracked   │     2 │
+│ Pending               │     2 │
+│ Completed             │     0 │
+│ Failed                │     0 │
+└───────────────────────┴───────┘
+```
diff --git a/inbox/mp3/.gitkeep b/inbox/mp3/.gitkeep
diff --git a/inbox/mp4/.gitkeep b/inbox/mp4/.gitkeep
diff --git a/logs/.gitkeep b/logs/.gitkeep
diff --git a/output/markdown/.gitkeep b/output/markdown/.gitkeep
diff --git a/output/raw_json/.gitkeep b/output/raw_json/.gitkeep
diff --git a/output/subtitles/.gitkeep b/output/subtitles/.gitkeep