Transforming movie and TV subtitles into structured language learning data using LLMs.
This project builds an end-to-end data pipeline that converts unstructured subtitle data into structured, personalized language learning content.
It leverages LLMs (Gemini) to extract meaningful vocabulary and expressions, and transforms them into a learning experience enhanced with text-to-speech (TTS).
Unstructured subtitles → Structured learning data → Personalized language learning experience
- Data Ingestion
- TMDB API (movie/TV metadata)
- OpenSubtitles (subtitle extraction)
- Data Processing
- Subtitle cleaning & alignment
- Sentence segmentation
- LLM Processing
- Gemini-based extraction of:
- Key expressions
- Vocabulary
- Contextual meaning
- Output Layer
- Structured learning dataset
- TTS-based sentence playback
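As a concrete illustration of the Data Processing stage above, here is a minimal sketch of SRT cleaning and sentence segmentation. Function names and regexes are illustrative, not the project's actual code:

```python
import re

def clean_srt(raw: str) -> list[str]:
    """Strip SRT cue indices, timestamps, and formatting tags, keeping dialogue lines."""
    lines = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue  # skip blank lines and cue indices like "1", "2"
        if "-->" in line:
            continue  # skip timestamp lines like 00:00:01,000 --> 00:00:03,500
        line = re.sub(r"<[^>]+>", "", line)  # drop <i>, <b> style tags
        lines.append(line)
    return lines

def segment_sentences(lines: list[str]) -> list[str]:
    """Join cue fragments into running text, then split on sentence-ending punctuation."""
    text = " ".join(lines)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

raw = """1
00:00:01,000 --> 00:00:03,500
<i>I can't believe</i> you did that.

2
00:00:04,000 --> 00:00:06,000
Neither can I! Let's go.
"""
print(segment_sentences(clean_srt(raw)))
# ["I can't believe you did that.", "Neither can I!", "Let's go."]
```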
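The LLM Processing stage can be sketched as a prompt-plus-parse pair. The prompt wording and JSON schema below are assumptions for illustration; the commented-out call shows the rough shape of a `google-generativeai` request, with the model name also being an assumption:

```python
import json

# Illustrative prompt; the project's real prompt is not shown in this README.
EXTRACTION_PROMPT = (
    "From the subtitle sentences below, extract items useful for language learning.\n"
    'Respond with JSON only: {"expressions": [...], '
    '"vocabulary": [{"word": "...", "meaning": "..."}]}\n\n'
    "Subtitles:\n"
)

def build_prompt(sentences: list[str]) -> str:
    return EXTRACTION_PROMPT + "\n".join(sentences)

def parse_extraction(response_text: str) -> dict:
    """LLM responses often wrap JSON in a markdown fence; strip it before parsing."""
    text = response_text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

# With the google-generativeai SDK, the call would look roughly like:
#   model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
#   data = parse_extraction(model.generate_content(build_prompt(sentences)).text)
```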
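One possible shape for the Output Layer's structured learning dataset — the field names here are an illustrative schema, not the project's exact format:

```python
from dataclasses import dataclass, asdict

@dataclass
class LearningItem:
    """One row in the learning dataset (illustrative schema)."""
    expression: str        # the extracted phrase or word
    meaning: str           # contextual meaning from the LLM
    example_sentence: str  # the subtitle sentence it came from
    source_title: str      # show/movie the sentence belongs to
    difficulty: str        # e.g. a CEFR-style band such as "B1"

item = LearningItem(
    expression="hit the road",
    meaning="to leave / start a journey",
    example_sentence="We should hit the road before traffic gets bad.",
    source_title="(example show)",
    difficulty="B1",
)
print(asdict(item))
```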
- Large-scale subtitle data ingestion pipeline
- LLM-based extraction of language learning content
- Adaptive vocabulary structuring by difficulty level
- Natural speech playback using TTS
- End-to-end pipeline from raw data → user-ready content
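The adaptive difficulty structuring listed above could, for example, bucket the words found in a subtitle corpus against a reference frequency list. The thresholds and band names here are illustrative assumptions, not the project's actual levels:

```python
import re

def bucket_vocabulary(sentences: list[str], freq_rank: dict[str, int]) -> dict[str, list[str]]:
    """Group words from the subtitles into rough difficulty bands using a
    reference frequency rank (word -> rank in a large corpus).
    Thresholds (1000 / 5000) are illustrative."""
    words = {w for s in sentences for w in re.findall(r"[a-z']+", s.lower())}
    buckets = {"beginner": [], "intermediate": [], "advanced": []}
    for w in sorted(words):
        rank = freq_rank.get(w)
        if rank is None or rank > 5000:
            buckets["advanced"].append(w)   # rare or unknown words
        elif rank > 1000:
            buckets["intermediate"].append(w)
        else:
            buckets["beginner"].append(w)   # very common words
    return buckets
```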
- Python
- LLM (Gemini)
- TMDB API
- OpenSubtitles API
- Text-to-Speech (TTS)
- Data Pipeline Design
Most subtitle data is unstructured and not directly usable for learning.
This system explores how LLMs can transform raw, noisy text into structured educational data that can be used for personalized learning experiences.
- LLM-based data transformation
- End-to-end pipeline design
- Real-world unstructured → structured data problems
- Product thinking for data systems
Scene Note is a SwiftUI subtitle-learning app core packaged as a Swift Package Manager module (`SubtitleCore`).
It helps learners browse TV titles, save shows, analyze subtitles with Gemini, collect expressions/words, and review with study flows.
- Browse local library and remote TMDB search results
- Save shows/episodes and manage favorites
- Parse and clean SRT subtitles
- Analyze transcript chunks with Gemini models
- Build expression/word lists and study progress tracking
- Voice settings for text-to-speech preview (voice, rate, pitch, volume)
- Data backup/import for local app data (non-secret settings and files)
- API keys stored in system Keychain (Gemini/TMDB/OpenSubtitles)
- Swift 5.9
- SwiftUI
- Swift Package Manager
- Platforms: iOS 17+, macOS 14+
- `Sources/SubtitleCore`: app core views, models, services, persistence
- `Tests/SubtitleCoreTests`: unit tests for parsing, formatting, analysis pipeline, and settings security behavior
- `Package.swift`: SPM manifest
Open the package folder in Xcode and use the package product:
- Product name: `SubtitleCore`
- Main root view: `ContentView`

Build and test from the command line:

```
swift build
swift test
```

Configure API values in Settings > API inside the app:
- Gemini API key and model
- TMDB API key
- OpenSubtitles API key (+ optional User-Agent)
Notes:
- Secret API keys are stored in Keychain.
- Legacy keys previously in `UserDefaults` are migrated automatically on access.
Settings > Backup exports/imports local learning data and caches.
- Included: library, subtitles, episode analyses, poster cache, TMDB season cache, study progress, and non-secret settings
- Excluded: secret API keys (remain on-device in Keychain)
- Keep UI strings and product copy aligned with current app language strategy.
- Run `swift test` before committing.
- If you modify persistence formats, keep backup compatibility in mind.
Add screenshots to a docs/images/ folder and link them here.
Example: `![Screenshot](docs/images/screenshot.png)`
- Improve subtitle import UX (better validation and error guidance)
- Expand study analytics and progress visualization
- Add more robust offline-first behavior for metadata/search cache
- Strengthen backup/restore version migration handling
- Increase test coverage around UI state transitions and persistence edges
- Create a feature branch from `develop`
- Make focused commits with clear messages
- Run local checks: `swift build` and `swift test`
- Open a pull request to `develop` with:
  - change summary
  - test plan
  - screenshots for UI changes (if applicable)