Transforming movie and TV subtitles into structured language learning data using LLMs.
This project builds an end-to-end data pipeline that converts unstructured subtitle data into structured, personalized language learning content.
It leverages LLMs (Gemini) to extract meaningful vocabulary and expressions, and transforms them into a learning experience enhanced with text-to-speech (TTS).
Unstructured subtitles → Structured learning data → Personalized language learning experience
- Data Ingestion
- TMDB API (movie/TV metadata)
- OpenSubtitles (subtitle extraction)
- Data Processing
- Subtitle cleaning & alignment
- Sentence segmentation
- LLM Processing
- Gemini-based extraction of:
- Key expressions
- Vocabulary
- Contextual meaning
- Output Layer
- Structured learning dataset
- TTS-based sentence playback
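As a concrete illustration of the Data Processing stage above, here is a minimal sketch of SRT cleaning and sentence segmentation. Function names and regexes are illustrative, not the project's actual code:

```python
import re

def clean_srt(raw: str) -> list[str]:
    """Strip SRT cue indices, timestamps, and formatting tags, keeping dialogue lines."""
    lines = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue  # skip blank lines and cue indices like "1", "2"
        if "-->" in line:
            continue  # skip timestamp lines like 00:00:01,000 --> 00:00:03,500
        line = re.sub(r"<[^>]+>", "", line)  # drop <i>, <b> style tags
        lines.append(line)
    return lines

def segment_sentences(lines: list[str]) -> list[str]:
    """Join cue fragments into running text, then split on sentence-ending punctuation."""
    text = " ".join(lines)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

raw = """1
00:00:01,000 --> 00:00:03,500
<i>I can't believe</i> you did that.

2
00:00:04,000 --> 00:00:06,000
Neither can I! Let's go.
"""
print(segment_sentences(clean_srt(raw)))
# ["I can't believe you did that.", "Neither can I!", "Let's go."]
```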
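The LLM Processing stage can be sketched as a prompt-plus-parse pair. The prompt wording and JSON schema below are assumptions for illustration; the commented-out call shows the rough shape of a `google-generativeai` request, with the model name also being an assumption:

```python
import json

# Illustrative prompt; the project's real prompt is not shown in this README.
EXTRACTION_PROMPT = (
    "From the subtitle sentences below, extract items useful for language learning.\n"
    'Respond with JSON only: {"expressions": [...], '
    '"vocabulary": [{"word": "...", "meaning": "..."}]}\n\n'
    "Subtitles:\n"
)

def build_prompt(sentences: list[str]) -> str:
    return EXTRACTION_PROMPT + "\n".join(sentences)

def parse_extraction(response_text: str) -> dict:
    """LLM responses often wrap JSON in a markdown fence; strip it before parsing."""
    text = response_text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

# With the google-generativeai SDK, the call would look roughly like:
#   model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
#   data = parse_extraction(model.generate_content(build_prompt(sentences)).text)
```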
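One possible shape for the Output Layer's structured learning dataset — the field names here are an illustrative schema, not the project's exact format:

```python
from dataclasses import dataclass, asdict

@dataclass
class LearningItem:
    """One row in the learning dataset (illustrative schema)."""
    expression: str        # the extracted phrase or word
    meaning: str           # contextual meaning from the LLM
    example_sentence: str  # the subtitle sentence it came from
    source_title: str      # show/movie the sentence belongs to
    difficulty: str        # e.g. a CEFR-style band such as "B1"

item = LearningItem(
    expression="hit the road",
    meaning="to leave / start a journey",
    example_sentence="We should hit the road before traffic gets bad.",
    source_title="(example show)",
    difficulty="B1",
)
print(asdict(item))
```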
- Large-scale subtitle data ingestion pipeline
- LLM-based extraction of language learning content
- Adaptive vocabulary structuring by difficulty level
- Natural speech playback using TTS
- End-to-end pipeline from raw data → user-ready content
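The adaptive difficulty structuring listed above could, for example, bucket the words found in a subtitle corpus against a reference frequency list. The thresholds and band names here are illustrative assumptions, not the project's actual levels:

```python
import re

def bucket_vocabulary(sentences: list[str], freq_rank: dict[str, int]) -> dict[str, list[str]]:
    """Group words from the subtitles into rough difficulty bands using a
    reference frequency rank (word -> rank in a large corpus).
    Thresholds (1000 / 5000) are illustrative."""
    words = {w for s in sentences for w in re.findall(r"[a-z']+", s.lower())}
    buckets = {"beginner": [], "intermediate": [], "advanced": []}
    for w in sorted(words):
        rank = freq_rank.get(w)
        if rank is None or rank > 5000:
            buckets["advanced"].append(w)   # rare or unknown words
        elif rank > 1000:
            buckets["intermediate"].append(w)
        else:
            buckets["beginner"].append(w)   # very common words
    return buckets
```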
- Python
- LLM (Gemini)
- TMDB API
- OpenSubtitles API
- Text-to-Speech (TTS)
- Data Pipeline Design
Most subtitle data is unstructured and not directly usable for learning.
This system explores how LLMs can transform raw, noisy text into structured educational data that can be used for personalized learning experiences.
- LLM-based data transformation
- End-to-end pipeline design
- Real-world unstructured → structured data problems
- Product thinking for data systems
Scene Note is a SwiftUI subtitle-learning app core packaged as a Swift Package Manager module (`SubtitleCore`).
It helps learners browse TV titles, save shows, analyze subtitles with Gemini, collect expressions/words, and review with study flows.
- Browse local library and remote TMDB search results
- Save shows/episodes and manage favorites
- Parse and clean SRT subtitles
- Analyze transcript chunks with Gemini models
- Build expression/word lists and study progress tracking
- Voice settings for text-to-speech preview (voice, rate, pitch, volume)
- Data backup/import for local app data (non-secret settings and files)
- API keys stored in system Keychain (Gemini/TMDB/OpenSubtitles)
- Swift 5.9
- SwiftUI
- Swift Package Manager
- Platforms: iOS 17+, macOS 14+
- `Sources/SubtitleCore`: app core views, models, services, persistence
- `Tests/SubtitleCoreTests`: unit tests for parsing, formatting, analysis pipeline, and settings security behavior
- `Package.swift`: SPM manifest
Open the package folder in Xcode and use the package product:
- Product name: `SubtitleCore`
- Main root view: `ContentView`

Build and test from the command line:

```
swift build
swift test
```

Configure API values in Settings > API inside the app:
- Gemini API key and model
- TMDB API key
- OpenSubtitles API key (+ optional User-Agent)
Notes:
- Secret API keys are stored in Keychain.
- Legacy keys previously in `UserDefaults` are migrated automatically on access.
Settings > Backup exports/imports local learning data and caches.
- Included: library, subtitles, episode analyses, poster cache, TMDB season cache, study progress, and non-secret settings
- Excluded: secret API keys (remain on-device in Keychain)
- Keep UI strings and product copy aligned with current app language strategy.
- Run `swift test` before committing.
- If you modify persistence formats, keep backup compatibility in mind.
Add screenshots to a docs/images/ folder and link them here.
Example: `![Screenshot](docs/images/screenshot.png)`
- Improve subtitle import UX (better validation and error guidance)
- Expand study analytics and progress visualization
- Add more robust offline-first behavior for metadata/search cache
- Strengthen backup/restore version migration handling
- Increase test coverage around UI state transitions and persistence edges
- Create a feature branch from `develop`
- Make focused commits with clear messages
- Run local checks: `swift build` and `swift test`
- Open a pull request to `develop` with:
  - change summary
  - test plan
  - screenshots for UI changes (if applicable)