→ Benchmark results — compatibility and speed for 14 tested models on Apple M2 Max.
RustyLLM is an educational GGUF inference runner for developers who want to understand how a local language-model runtime works. You do not need AI experience to read the project: the code is organized as ordinary file parsing, arrays, math kernels, state management, HTTP routing, and optional browser/WASM experiments.
At a high level, RustyLLM reads a .gguf model file, converts input text into
integer token IDs, repeatedly runs those IDs through model weights to predict
the next token, and converts the chosen output tokens back into text. The
project is deliberately small enough to read end to end, while still showing the
complete path from model-file parsing to a command-line tool and a minimal
OpenAI-compatible HTTP API.
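The loop described above can be sketched with a toy model. This is an illustration only: `VOCAB`, `encode`, `forward`, and `generate` are hypothetical names, the "model" is a lookup rule rather than a transformer, and none of this is RustyLLM's actual API.

```rust
/// Toy vocabulary: a token ID is an index into this table (ID 0 ends generation).
const VOCAB: [&str; 5] = ["<eos>", "the", "cat", "sat", "down"];

/// Hypothetical tokenizer: maps whitespace-separated words to token IDs.
fn encode(text: &str) -> Vec<usize> {
    text.split_whitespace()
        .filter_map(|word| VOCAB.iter().position(|v| *v == word))
        .collect()
}

/// Stand-in for a transformer forward pass: one score (logit) per vocabulary
/// entry. This toy simply favors the ID that follows the last token.
fn forward(tokens: &[usize]) -> Vec<f32> {
    let last = tokens.last().copied().unwrap_or(0);
    let favored = (last + 1) % VOCAB.len();
    (0..VOCAB.len())
        .map(|id| if id == favored { 1.0 } else { 0.0 })
        .collect()
}

/// Greedy choice: index of the highest logit.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &value) in logits.iter().enumerate() {
        if value > logits[best] {
            best = i;
        }
    }
    best
}

/// Text -> token IDs -> repeated next-token prediction -> text.
fn generate(prompt: &str, max_tokens: usize) -> String {
    let mut tokens = encode(prompt);
    let mut words = Vec::new();
    for _ in 0..max_tokens {
        let next = argmax(&forward(&tokens));
        if next == 0 {
            break; // end-of-sequence token
        }
        tokens.push(next);
        words.push(VOCAB[next]);
    }
    words.join(" ")
}

fn main() {
    println!("{}", generate("the cat", 8));
}
```

A real runtime replaces `forward` with the transformer layers and `argmax` with a configurable sampler, but the control flow stays the same.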
The runner loads model weights directly from disk, keeps quantized tensors in memory-mapped storage on native targets, and exposes the same core through a CLI, a small HTTP server, LM Studio-compatible aliases, Ollama-compatible routes, and a Rust library API.
RustyLLM is best treated as learning-oriented infrastructure: practical enough to run small local models, intentionally dependency-light, and transparent enough for studying how local inference systems are assembled without first adopting a large production runtime.
If the AI terms are new, start with AI inference for non-AI developers. It explains the core vocabulary used by the codebase before the module-by-module architecture guide.
If you are reading the code to understand local inference, use this order:
1. `src/gguf.rs` parses the GGUF container, tensor directory, and metadata.
2. `src/tokenizer.rs` turns text into token IDs and decodes generated tokens.
3. `src/simd.rs` implements scalar, NEON, AVX2/FMA, and quantized math kernels.
4. `src/model.rs` loads tensors and runs transformer forward passes.
5. `src/runtime.rs` wraps the model with generation, chat templates, embeddings, benchmark helpers, and optional session reuse.
6. `src/server.rs` maps the runner onto the native, OpenAI-compatible, LM Studio-compatible, and Ollama-compatible HTTP routes.
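As a taste of what the first step involves, here is a minimal sketch of reading the fixed-size GGUF header. It assumes the version 2 and later layout (4-byte magic `GGUF`, a little-endian `u32` version, then 64-bit tensor and metadata counts). The `GgufHeader` type and `parse_header` function are hypothetical names for this example, not the ones in `src/gguf.rs`.

```rust
#[derive(Debug, PartialEq)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Reads a little-endian u32 starting at byte offset `at`.
fn read_u32(bytes: &[u8], at: usize) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&bytes[at..at + 4]);
    u32::from_le_bytes(buf)
}

/// Reads a little-endian u64 starting at byte offset `at`.
fn read_u64(bytes: &[u8], at: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[at..at + 8]);
    u64::from_le_bytes(buf)
}

/// Parses the 24-byte header that precedes the metadata key/value section.
fn parse_header(bytes: &[u8]) -> Result<GgufHeader, String> {
    if bytes.len() < 24 {
        return Err("too short for a GGUF header".to_string());
    }
    if &bytes[0..4] != b"GGUF" {
        return Err("missing GGUF magic".to_string());
    }
    Ok(GgufHeader {
        version: read_u32(bytes, 4),
        tensor_count: read_u64(bytes, 8),
        metadata_kv_count: read_u64(bytes, 16),
    })
}

fn main() {
    // Build a header in memory instead of reading a real model file.
    let mut bytes = Vec::new();
    bytes.extend_from_slice(b"GGUF");
    bytes.extend_from_slice(&3u32.to_le_bytes());
    bytes.extend_from_slice(&291u64.to_le_bytes());
    bytes.extend_from_slice(&24u64.to_le_bytes());
    println!("{:?}", parse_header(&bytes));
}
```

The metadata entries and the tensor directory follow this header; parsing them is where most of the real work happens.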
Additional documentation:
- AI inference for non-AI developers explains the vocabulary and mental model used by the project.
- Architecture guide explains the inference pipeline and module responsibilities.
- MTP usage guide explains greedy assistant-based speculative decoding, benchmark comparison, and troubleshooting.
- Function reference documents every non-test Rust function under `src/`.
- Native GGUF loading with zero-copy memory mapping on macOS and Linux.
- GGUF metadata inspection, model discovery, model selection, and tensor listing.
- Tokenizer support for SentencePiece-style and GPT-2-style metadata.
- Quantized inference paths for `Q8_0`, `Q4_0`, `Q4_K`, `Q6_K`, and `MXFP4` tensors.
- SIMD kernels for Apple Silicon NEON and x86_64 AVX2/FMA, with scalar fallback.
- Metal acceleration for Q4_K/Q6_K matrix-vector work on macOS, enabled by default when the Objective-C shim builds and the GPU backend is available. Set `RUSTY_LLM_METAL=0` to force the CPU path.
- One-shot generation, interactive REPL mode, benchmark mode, JSON benchmark output, and append-only chat history logging.
- Prompt-selected `SKILL.md` loading via `--skills-dir`, with per-session de-duplication so long chats do not inject the same skill repeatedly.
- OpenAI-compatible `/v1/models`, `/v1/completions`, `/v1/chat/completions`, `/v1/responses`, and `/v1/embeddings` routes.
- LM Studio-style `/api/v0/*` aliases and Ollama-style `/api/*` compatibility routes.
- Server-Sent Events streaming for OpenAI-compatible completions and chat completions, plus Responses API streaming and Ollama-style NDJSON streaming.
- OpenAPI 3.1 document at `/openapi.json` with Swagger UI at `/docs`.
- Model Context Protocol stdio mode with `generate`, `chat`, `embed`, and `models` tools.
- Text embeddings via `Runner::embed`, mean-pooled over the last transformer layer and L2-normalized for cosine similarity.
- Minimal browser chat UI served from `/chat`, an expert UI from `/chat?expert`, and a GGUF explorer from `/explorer`.
- Library API for embedding RustyLLM in other Rust applications.
- `wasm32-unknown-unknown` check support for the no-default-features WASM build.
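To make "quantized tensors" concrete, the following is a scalar reference sketch of `Q8_0` dequantization as the format is laid out in GGUF files: each block stores one half-precision scale followed by 32 signed 8-bit values, and every weight is `scale * value`. The function names are hypothetical, and this is not RustyLLM's SIMD kernel, which would fuse this step into the dot product instead of materializing floats.

```rust
const QK8_0: usize = 32; // values per Q8_0 block
const BLOCK_BYTES: usize = 2 + QK8_0; // f16 scale + 32 x i8

/// Converts IEEE 754 half-precision bits to f32 (stable Rust has no f16 type).
fn f16_to_f32(bits: u16) -> f32 {
    let negative = (bits >> 15) & 1 == 1;
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x3ff) as f32;
    let magnitude = match exp {
        0 => frac * 2f32.powi(-24), // zero and subnormals
        31 => {
            if frac == 0.0 {
                f32::INFINITY
            } else {
                f32::NAN
            }
        }
        _ => (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    };
    if negative {
        -magnitude
    } else {
        magnitude
    }
}

/// Expands raw Q8_0 blocks into f32 weights. Trailing partial blocks are ignored.
fn dequantize_q8_0(data: &[u8]) -> Vec<f32> {
    let mut out = Vec::with_capacity(data.len() / BLOCK_BYTES * QK8_0);
    for block in data.chunks_exact(BLOCK_BYTES) {
        let scale = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
        for &q in &block[2..] {
            out.push(scale * (q as i8) as f32);
        }
    }
    out
}

fn main() {
    // One block with scale 0.5 (f16 bits 0x3800) and quantized values 4 and -2.
    let mut block = vec![0u8; BLOCK_BYTES];
    block[0] = 0x00;
    block[1] = 0x38;
    block[2] = 4;
    block[3] = (-2i8) as u8;
    let weights = dequantize_q8_0(&block);
    println!("{:?}", &weights[..2]); // [2.0, -1.0]
}
```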
RustyLLM accepts GGUF files whose general.architecture metadata matches one of
the supported architecture identifiers:
llama, llama2, llama3, mistral, mistral3, mixtral, ministral,
qwen2, qwen3, gpt-oss, gemma, gemma2, gemma3, gemma4,
gemma4n, gemma4-assistant, granite, granite3, granite4,
deepseek, deepseek-v2, deepseek2, nemotron, hermes, phi, phi2,
phi3, phi4, falcon, falcon3, stablelm, starcoder2, command-r,
cohere, internlm2, olmo, olmo2, exaone, solar, yi, arctic,
nomic-bert, nomic-embed, and
text-embedding-nomic-embed-text.
Support still depends on the tensors present in a specific GGUF file. Use
--inspect before loading an unfamiliar model to verify architecture, tensor
types, tokenizer metadata, and API compatibility.
Gemma-family GGUFs use the dedicated Gemma loader and native
<start_of_turn> chat formatting when the tokenizer template exposes it.
Q4_0 QAT GGUFs such as google/gemma-4-12B-it-qat-q4_0-gguf are supported by
the same path and use fused CPU Q/K/V and Gate/Up projection jobs when Metal is
not selected for those projections.
Mistral 3 / Ministral 3 instruction GGUFs use their native
[SYSTEM_PROMPT]...[/SYSTEM_PROMPT] and [INST]...[/INST] chat formatting when
the tokenizer template exposes those markers. The startup optimization summary
prints the selected renderer, for example chat-template=mistral3-inst, so you
can quickly spot whether a model uses a native template or the plain fallback.
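The difference between a native template and the plain fallback can be sketched as two small renderers. This is an assumption-heavy illustration: the marker names come from the paragraph above, but the exact whitespace, BOS/EOS handling, and the `Message`, `render_mistral3`, and `render_plain` names are hypothetical and may differ from what RustyLLM emits.

```rust
struct Message<'a> {
    role: &'a str,
    content: &'a str,
}

/// Mistral 3 style: system and user turns are wrapped in bracketed markers.
fn render_mistral3(messages: &[Message]) -> String {
    let mut out = String::new();
    for m in messages {
        match m.role {
            "system" => {
                out.push_str("[SYSTEM_PROMPT]");
                out.push_str(m.content);
                out.push_str("[/SYSTEM_PROMPT]");
            }
            "user" => {
                out.push_str("[INST]");
                out.push_str(m.content);
                out.push_str("[/INST]");
            }
            // Earlier assistant turns are appended verbatim (an assumption).
            _ => out.push_str(m.content),
        }
    }
    out
}

/// Plain fallback: a labeled transcript that ends with an open assistant turn.
fn render_plain(messages: &[Message]) -> String {
    let mut out = String::new();
    for m in messages {
        let label = match m.role {
            "system" => "System",
            "user" => "User",
            _ => "Assistant",
        };
        out.push_str(&format!("{}: {}\n", label, m.content));
    }
    out.push_str("Assistant:");
    out
}

fn main() {
    let chat = [
        Message { role: "system", content: "You are concise." },
        Message { role: "user", content: "What is GGUF?" },
    ];
    println!("{}", render_mistral3(&chat));
    println!("{}", render_plain(&chat));
}
```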
- Rust 1.95 or newer. The repository pins `1.95.0` in rust-toolchain.toml.
- A GGUF model file. The runner does not download models.
- macOS or Linux for native memory-mapped execution.
- Optional for WebAssembly experiments: `wasm-pack` and the `wasm32-unknown-unknown` target.
- Optional for macOS Metal experiments: Xcode command line tools with `xcrun`, `clang`, and `ar`.
cargo build --release
The release binary is written to:
target/release/rusty-llm
For local performance work, build for the native CPU:
RUSTFLAGS="-C target-cpu=native" cargo build --release
The Makefile wraps the common commands:
make help
make release
make release-max
make run MODEL=/path/to/model.gguf PROMPT="Explain GGUF in one paragraph"
make repl MODEL=/path/to/model.gguf
make serve MODEL=/path/to/model.gguf ADDR=127.0.0.1:8080 CHAT=1
make bench MODEL=/path/to/model.gguf BENCH_RUNS=5 PROMPT="Explain SIMD briefly"
make release uses the default release profile with ThinLTO for faster rebuilds.
Use make release-max only when you explicitly want the slower FatLTO profile
for final A/B measurements.
Run one prompt:
./target/release/rusty-llm ./models/model.gguf \
--prompt "Explain rotary embeddings in two sentences." \
--max-tokens 128
Start a chat REPL:
./target/release/rusty-llm ./models/model.gguf --repl
Start the HTTP API:
./target/release/rusty-llm ./models/model.gguf --serve 127.0.0.1:8080
Start the HTTP API with the built-in chat UI:
./target/release/rusty-llm ./models/model.gguf --serve 127.0.0.1:8080 --chat
Then open:
- http://127.0.0.1:8080/chat
- http://127.0.0.1:8080/chat?expert
- http://127.0.0.1:8080/explorer
The explorer shows GGUF metadata, tokenizer output, token-embedding vectors,
nearest vocabulary neighbors, the tensor directory, and the model catalog
discovered from the configured --model-dir.
The Chat and Expert views expose a model-independent Thinking toggle. When enabled, RustyLLM first asks the loaded model to rewrite the latest user prompt with a compact meta-prompt, then uses that rewritten prompt for the final answer.
Optional Skills can be enabled by pointing RustyLLM at a directory tree that
contains SKILL.md files:
./target/release/rusty-llm ./models/model.gguf \
--skills-dir ./skills/default \
--repl
RustyLLM indexes skill names and descriptions, loads only the matching
SKILL.md bodies for each prompt, and remembers loaded skill paths inside the
REPL or server session. The repository includes self-contained example Skills
under skills/default:
- rust-code-review
- local-llm-troubleshooting
- skill-authoring
- german-technical-writing
RustyLLM does not execute skill scripts or lazily read references/ files; keep
Skills self-contained unless the user explicitly provides additional context.
When serving an exact .gguf file path, RustyLLM skips the recursive startup
catalog scan so the server can start loading the requested model immediately.
The general CLI form is:
rusty-llm [model.gguf|model-name|model-dir] [options]
You can pass an exact .gguf file:
rusty-llm ./models/model.gguf --prompt "Hello"
You can also select a model from a directory:
rusty-llm --model-dir ./models --list-models
rusty-llm --model-dir ./models --model phi-4 --prompt "Write a Rust enum example"
When --model is an exact .gguf file name or a path relative to
--model-dir, RustyLLM resolves it with a lightweight file scan before falling
back to full metadata discovery.
If no model directory is provided, RustyLLM uses:
1. `RUSTY_LLM_MODEL_DIR`, when set and non-empty.
2. The default local LM Studio community model cache.
For benchmark automation, bench_models.sh and the Makefile's default
MODEL_DIR scan additional common LM Studio, Ollama, GPT4All, Jan, ~/models,
and project-local model paths.
Model selection is intentionally lenient: --model can match repository names,
file names, relative IDs, or GGUF metadata names. If a selector matches multiple
models, RustyLLM prints the matching choices and asks for a more specific value.
Projector files such as mmproj-*.gguf are ignored for text model selection.
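The lenient matching described above can be sketched as a case-insensitive substring search over candidate paths. This is a simplified assumption: it looks only at file paths (not GGUF metadata names), and `Selection` and `select_model` are hypothetical names rather than RustyLLM's resolver.

```rust
#[derive(Debug, PartialEq)]
enum Selection {
    Found(String),
    Ambiguous(Vec<String>),
    NotFound,
}

/// Matches `selector` against candidate paths, skipping projector files.
fn select_model(candidates: &[&str], selector: &str) -> Selection {
    let needle = selector.to_lowercase();
    let matches: Vec<String> = candidates
        .iter()
        .filter(|path| {
            let lower = path.to_lowercase();
            let file = lower.rsplit('/').next().unwrap_or("");
            file.ends_with(".gguf") && !file.starts_with("mmproj-") && lower.contains(&needle)
        })
        .map(|path| path.to_string())
        .collect();
    match matches.len() {
        0 => Selection::NotFound,
        1 => Selection::Found(matches[0].clone()),
        // The caller can print these and ask for a more specific selector.
        _ => Selection::Ambiguous(matches),
    }
}

fn main() {
    let models = [
        "models/phi-4-Q4_K_M.gguf",
        "models/phi-3-mini.gguf",
        "models/mmproj-phi-4.gguf",
    ];
    println!("{:?}", select_model(&models, "phi-4"));
    println!("{:?}", select_model(&models, "phi"));
}
```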
rusty-llm [model.gguf|model-name|model-dir] [options]
Model and inspection options:
- `--model <name>` selects a GGUF from `--model-dir`.
- `--model-dir <path>` recursively scans a directory for `.gguf` files.
- `--list-models` lists discovered models and exits.
- `--inspect` prints a JSON compatibility report without loading weights.
- `--list-tensors` loads the model and prints tensor names, dtypes, and shapes.
- `--verbose` or `-v` prints startup timing details, including model resolution, mmap open, GGUF parsing, tokenizer build, and weight setup.
Execution modes:
- `--prompt <text>` or `-p <text>` runs one-shot generation.
- `--repl` starts an interactive chat session.
- `--serve <addr>` starts the HTTP(S) server, for example `127.0.0.1:8080`.
- `--mcp` starts a Model Context Protocol stdio server for the loaded model.
- `--chat` enables the built-in web UIs at `/chat`, `/chat?expert`, and `/explorer`.
- `--embed` embeds `--prompt` and prints the embedding vector.
- `--bench` runs a non-streaming generation benchmark.
- `--bench-json` runs benchmark mode and emits a machine-readable JSON report.
Generation options:
- `--max-tokens <N>` or `-n <N>` sets the maximum number of generated tokens.
- `--temp <F>` or `-t <F>` sets temperature; `0` uses greedy decoding.
- `--top-p <F>` sets nucleus sampling in the range `(0, 1]`.
- `--top-k <N>` sets top-k filtering.
- `--repeat-penalty <F>` applies a repetition penalty to recent tokens.
- `--seed <N>` sets the RNG seed. `0` uses the default time-based behavior.
- `--system-prompt <text>` overrides the default chat system prompt.
- `--thinking` rewrites each prompt with the built-in Thinking meta-prompt before answering.
- `--thinking-prompt <text>` overrides the Thinking meta-prompt.
- `--thinking-max-tokens <N>` caps the internal Thinking rewrite.
- `--skills-dir <path>` enables prompt-selected Skills from a directory tree of `SKILL.md` files. Use `--skills-dir skills/default` to try the bundled examples.
- `--max-skills <N>` limits new Skills loaded for one prompt. The default is `3`.
- `--skill-max-bytes <N>` caps the loaded bytes per `SKILL.md`. The default is `16384`.
- `--stop <text>` stops generation when the text appears. The flag can be repeated.
- `--threads <N>` overrides the SIMD worker thread count.
- `--threads-batch <N>` overrides the SIMD worker thread count only during prompt/prefill processing, mirroring llama.cpp's split between generation threads and batch/prompt threads.
- `--ubatch <N>` sets the logical prefill chunk size. This is the runtime planning layer for llama.cpp-style microbatch prefill; current kernels still evaluate tokens sequentially inside each chunk.
- `--no-auto-batch-threads` disables automatic widening of prefill worker threads when decode was configured with fewer workers than the machine offers.
- `--poll <N>` controls how many spin iterations SIMD worker threads use while waiting for the next micro-job before sleeping. This ports llama.cpp's threadpool polling idea; use `--poll 0` for the lowest idle CPU use.
- Large CPU matvec jobs use llama.cpp-style dynamic row chunks: workers take 64-row chunks from a shared atomic counter and fall back to static row ranges when a matrix is too small for chunking to pay off.
- Fused K-quant Metal projections also keep tiny shapes on CPU, matching llama.cpp's practice of shape-gating GPU dispatch instead of paying command buffer overhead for every small kernel.
- `--cpu-affinity` enables best-effort SIMD worker affinity on supported operating systems.
- `--mlock` asks the OS to keep mapped model pages resident in RAM. This is best-effort and can be limited by user or system lock limits.
- `--backend <name>` selects runtime dispatch policy: `auto`, `cpu`, `metal`, or `metal-ultra`.
- `--bench-threads <LIST>` runs the same benchmark across comma-separated SIMD worker counts, for example `--bench-threads 1,2,4,8`. This follows the llama.cpp tuning practice of measuring thread oversubscription instead of assuming that all logical CPUs are fastest.
- `--profile <name>` selects runtime planning: `auto`, `mistral`, `mistral-ultra`, or `gemma`. `mistral-ultra` is the aggressive Metal mode for Mistral/Ministral-style GGUFs; it lowers Metal dispatch thresholds for Q4_K/Q6_K projections and attention scans, with native SIMD fallback for kernels that still run on CPU. With `RUSTY_LLM_METAL=1`, `auto` enables this backend for Ministral 3 models.
- `--mtp-assistant <path>` loads a smaller assistant GGUF for greedy speculative decoding.
- `--mtp-tokens <N>` sets the maximum speculative draft tokens.
- `--mtp-min-accept-rate <F>` disables MTP when the acceptance rate drops below this threshold. The default is `0.5`.
- `--no-mtp-adaptive` keeps the MTP draft length fixed instead of adapting it.
- `--no-speculative` disables MTP/speculative decoding.
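To show how the temperature, top-k, and top-p options interact, here is a minimal sketch of the filtering step that runs before a token is drawn. It assumes the common order (temperature-scaled softmax, then top-k, then top-p); `filter_candidates` is a hypothetical name, and RustyLLM's sampler may order or implement these steps differently.

```rust
use std::cmp::Ordering;

/// Returns the surviving (token ID, probability) pairs, most likely first,
/// with probabilities renormalized to sum to 1.
fn filter_candidates(logits: &[f32], temp: f32, top_k: usize, top_p: f32) -> Vec<(usize, f32)> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    if temp <= 0.0 {
        // Temperature 0 means greedy decoding: keep only the best token.
        let best = logits.iter().position(|&l| l == max).unwrap_or(0);
        return vec![(best, 1.0)];
    }
    // Softmax with temperature; subtracting the max keeps exp() stable.
    let mut probs: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(id, &l)| (id, ((l - max) / temp).exp()))
        .collect();
    let sum: f32 = probs.iter().map(|p| p.1).sum();
    for p in probs.iter_mut() {
        p.1 /= sum;
    }
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    // Top-k: keep the k most likely tokens.
    probs.truncate(top_k.max(1));
    // Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    let mut cumulative = 0.0;
    let mut keep = probs.len();
    for (n, p) in probs.iter().enumerate() {
        cumulative += p.1;
        if cumulative >= top_p {
            keep = n + 1;
            break;
        }
    }
    probs.truncate(keep);
    let total: f32 = probs.iter().map(|p| p.1).sum();
    for p in probs.iter_mut() {
        p.1 /= total;
    }
    probs
}

fn main() {
    let logits = [1.0, 3.0, 2.0];
    println!("{:?}", filter_candidates(&logits, 0.7, 40, 0.9));
    println!("{:?}", filter_candidates(&logits, 0.0, 40, 0.9));
}
```

The final step, drawing one token from the surviving candidates with a seeded random number generator, is omitted because the standard library has no RNG.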
Server options:
- `--tls-cert <path>` enables HTTPS with a PEM certificate.
- `--tls-key <path>` enables HTTPS with a PEM private key.
- `--max-connections <N>` caps concurrent server connections. The default is `max(16, available_threads * 8)`.
- `--chat-history <path>` or `--chat-log <path>` appends CLI and server turns to a JSON file.
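The default connection cap is simple arithmetic. A small sketch, assuming `available_threads` is what the standard library reports through `std::thread::available_parallelism` (the README does not say how RustyLLM obtains it), with `default_max_connections` as a hypothetical name:

```rust
use std::thread;

/// Default cap from the formula above: max(16, available_threads * 8).
fn default_max_connections(available_threads: usize) -> usize {
    std::cmp::max(16, available_threads * 8)
}

fn main() {
    // Fall back to one thread if the platform cannot report parallelism.
    let threads = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!(
        "{} threads -> {} connections",
        threads,
        default_max_connections(threads)
    );
}
```

Machines with one or two hardware threads still get 16 connections; an 8-thread machine gets 64.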
One-shot generation:
rusty-llm ./models/model.gguf \
--prompt "Name three practical uses for local embeddings." \
--max-tokens 96 \
--temp 0.7 \
--top-p 0.9
Read a prompt from stdin:
printf "Summarize grouped-query attention." | rusty-llm ./models/model.gguf
Stop at a custom delimiter:
rusty-llm ./models/model.gguf \
--prompt "Name three fruits:" \
--stop "\n" \
--max-tokens 32
Use the LM Studio community cache:
rusty-llm --list-models
rusty-llm --model phi-4 --prompt "Write a concise Rust trait example"
Run a local HTTPS server:
rusty-llm ./models/model.gguf \
--serve 127.0.0.1:8443 \
--tls-cert cert.pem \
--tls-key key.pem
Write chat history:
rusty-llm ./models/model.gguf \
--repl \
--chat-history ./runs/chat-history.json
Start the server:
rusty-llm ./models/model.gguf --serve 127.0.0.1:8080
Health and metadata routes:
- `GET /`, `GET /health`, `GET /healthz`, `GET /ready`
- `GET /api/version`
- `GET /openapi.json`, `GET /swagger.json`, `GET /docs`
- `GET /v1/models`
- `GET /api/v0/models`
- `GET /api/tags`
- `GET /api/explorer/model` (loaded model metadata, tensor inventory, and discovered model catalog)
Generation and embedding routes:
- `POST /generate`
- `POST /v1/completions`
- `POST /v1/chat/completions`
- `POST /v1/responses`
- `POST /v1/embeddings`
- `POST /api/v0/completions`
- `POST /api/v0/chat/completions`
- `POST /api/v0/embeddings`
- `POST /api/generate`
- `POST /api/chat`
- `POST /api/embeddings`
- `POST /api/embed`
- `POST /api/explorer/tokenize`
- `POST /api/explorer/vector`
- `POST /api/explorer/neighbors`
All POST routes require Content-Type: application/json. Requests are bounded
by header and body limits and a per-connection I/O timeout. CORS headers are
included on responses.
Prompt input:
curl -X POST http://127.0.0.1:8080/generate \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Summarize grouped-query attention in two sentences.",
"max_tokens": 80,
"temp": 0.7,
"top_p": 0.9,
"top_k": 40,
"repeat_penalty": 1.1,
"stop": ["</s>", "\n\n"]
}'
Chat input:
curl -X POST http://127.0.0.1:8080/generate \
-H 'Content-Type: application/json' \
-d '{
"messages": [
{"role": "system", "content": "You are concise."},
{"role": "user", "content": "What is GGUF?"}
],
"max_tokens": 64
}'
Response:
{
"text": "...",
"prompt_tokens": 123,
"generated_tokens": 64,
"prefill_ms": 42,
"decode_ms": 180,
"total_ms": 223
}
Non-streaming:
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3",
"messages": [
{"role": "system", "content": "You are concise."},
{"role": "user", "content": "What is GGUF?"}
],
"max_tokens": 64,
"temperature": 0.7,
"stop": ["</answer>"]
}'
Streaming SSE:
curl -N -X POST http://127.0.0.1:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3",
"messages": [{"role": "user", "content": "Tell me a joke."}],
"max_completion_tokens": 128,
"stream": true
}'
Each chunk is emitted as:
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":...,"model":"llama3","choices":[{"index":0,"delta":{"content":"..."},"finish_reason":null}]}
The final event is:
data: [DONE]
max_completion_tokens is accepted as an alias for max_tokens.
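On the client side, the stream can be consumed by collecting `data:` payloads until the `[DONE]` sentinel. A minimal sketch, assuming the whole response body is already in a string; `sse_payloads` is a hypothetical helper, and a real client would read the socket incrementally and decode each payload with a JSON parser.

```rust
/// Returns the JSON payload of every `data:` event before `data: [DONE]`.
fn sse_payloads(body: &str) -> Vec<&str> {
    let mut payloads = Vec::new();
    for line in body.lines() {
        // Blank separator lines and any non-data fields are skipped.
        if let Some(payload) = line.strip_prefix("data: ") {
            if payload == "[DONE]" {
                break;
            }
            payloads.push(payload);
        }
    }
    payloads
}

fn main() {
    let body = "data: {\"choices\":[{\"delta\":{\"content\":\"Hel\"}}]}\n\n\
                data: {\"choices\":[{\"delta\":{\"content\":\"lo\"}}]}\n\n\
                data: [DONE]\n\n";
    for payload in sse_payloads(body) {
        println!("{}", payload);
    }
}
```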
curl -X POST http://127.0.0.1:8080/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "local-model",
"prompt": "Complete this sentence: Rust is",
"max_tokens": 48,
"temperature": 0.5
}'
Streaming is also supported on /v1/completions and /api/v0/completions with
"stream": true.
curl -X POST http://127.0.0.1:8080/v1/responses \
-H 'Content-Type: application/json' \
-d '{
"model": "local-model",
"instructions": "You are concise.",
"input": "Explain GGUF in one sentence.",
"max_output_tokens": 64
}'
Streaming is supported with "stream": true and emits Responses-style SSE
events ending in data: [DONE].
RustyLLM accepts response_format, text.format, tools, and tool_choice
fields for OpenAI client compatibility. JSON response formats add a prompt-level
instruction to produce valid JSON; RustyLLM does not yet enforce JSON schemas
token by token, and tool definitions are not executed by the model server.
RustyLLM accepts OpenAI-style multimodal content arrays on chat routes:
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe what you see:"},
{"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
]
}
],
"max_tokens": 128
}'
Image references are converted into text placeholders such as
[image: https://example.com/photo.jpg] or [image: base64 data]. RustyLLM does
not currently run a vision encoder.
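The placeholder substitution can be sketched as follows. The two placeholder strings come from the paragraph above; the `Part` type, the `data:` prefix check for inline images, and joining parts with newlines are assumptions made for this example.

```rust
/// One element of an OpenAI-style content array (hypothetical type).
enum Part {
    Text(String),
    ImageUrl(String),
}

/// Inline base64 images get a generic label; remote images keep their URL.
fn image_placeholder(url: &str) -> String {
    if url.starts_with("data:") {
        "[image: base64 data]".to_string()
    } else {
        format!("[image: {}]", url)
    }
}

/// Flattens a content array into the plain text that is fed to the tokenizer.
fn flatten(parts: &[Part]) -> String {
    parts
        .iter()
        .map(|part| match part {
            Part::Text(text) => text.clone(),
            Part::ImageUrl(url) => image_placeholder(url),
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let content = [
        Part::Text("Describe what you see:".to_string()),
        Part::ImageUrl("https://example.com/photo.jpg".to_string()),
    ];
    println!("{}", flatten(&content));
}
```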
OpenAI-compatible single input:
curl -X POST http://127.0.0.1:8080/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{
"model": "nomic-embed",
"input": "The quick brown fox jumps over the lazy dog"
}'
OpenAI-compatible batch input:
curl -X POST http://127.0.0.1:8080/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{
"model": "nomic-embed",
"input": ["First sentence.", "Second sentence."]
}'
Response shape:
{
"object": "list",
"data": [
{"object": "embedding", "embedding": [0.012, -0.034], "index": 0}
],
"model": "nomic-embed",
"usage": {"prompt_tokens": 9, "total_tokens": 9}
}
Ollama-style embeddings:
curl -X POST http://127.0.0.1:8080/api/embeddings \
-H 'Content-Type: application/json' \
-d '{"model": "nomic-embed", "prompt": "The quick brown fox"}'
List tags:
curl http://127.0.0.1:8080/api/tags
Generate:
curl -X POST http://127.0.0.1:8080/api/generate \
-H 'Content-Type: application/json' \
-d '{
"model": "local",
"prompt": "Why are GGUF models convenient?",
"options": {
"num_predict": 80,
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"repeat_penalty": 1.1
}
}'
Chat:
curl -X POST http://127.0.0.1:8080/api/chat \
-H 'Content-Type: application/json' \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "What is memory mapping?"}]
}'
Ollama stream: true requests on /api/generate and /api/chat return
newline-delimited JSON chunks with a final done: true object:
curl -N -X POST http://127.0.0.1:8080/api/generate \
-H 'Content-Type: application/json' \
-d '{"model": "local", "prompt": "Say hello.", "stream": true}'
The server exposes a machine-readable OpenAPI document and a browser docs page:
- `GET /openapi.json`
- `GET /swagger.json`
- `GET /docs`
/docs loads Swagger UI from a CDN and uses /openapi.json as its spec source.
Start a stdio MCP server for the loaded model:
rusty-llm ./models/model.gguf --mcp
The MCP server exposes generate, chat, embed, and models tools. It uses
newline-delimited JSON-RPC on stdin/stdout and writes boot logs to stderr.
Run a text benchmark:
rusty-llm ./models/model.gguf \
--bench \
--bench-runs 5 \
--max-tokens 64 \
--threads 8 \
--threads-batch 12 \
--prompt "Explain local LLM inference performance in one concise paragraph."
Thread tuning sweep:
rusty-llm ./models/model.gguf \
--bench \
--bench-runs 3 \
--bench-threads 1,2,4,8 \
--max-tokens 128 \
--temp 0 \
--seed 42 \
--prompt "Explain local LLM inference performance in one concise paragraph."
Use the best measured decode throughput for that model, backend, prompt shape,
and machine. For long prompts, compare --threads-batch separately because the
fastest prefill setting can differ from the fastest decode setting.
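Thread count matters because large matrix-vector products are split across workers. The dynamic row chunking mentioned in the generation options (workers claiming 64-row chunks from a shared atomic counter) can be sketched like this. It is a plain `f32` illustration with hypothetical names, not RustyLLM's quantized kernel, and it omits the static-range fallback for small matrices.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

const CHUNK_ROWS: usize = 64;

/// Computes y = M * x for a row-major matrix with `cols` columns.
fn matvec_chunked(matrix: &[f32], cols: usize, x: &[f32], workers: usize) -> Vec<f32> {
    let rows = matrix.len() / cols;
    let next_row = AtomicUsize::new(0);
    let mut partials: Vec<Vec<(usize, f32)>> = Vec::new();
    thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..workers.max(1) {
            handles.push(s.spawn(|| {
                let mut local = Vec::new();
                loop {
                    // Claim the next chunk; faster workers simply claim more.
                    let start = next_row.fetch_add(CHUNK_ROWS, Ordering::Relaxed);
                    if start >= rows {
                        break;
                    }
                    let end = (start + CHUNK_ROWS).min(rows);
                    for r in start..end {
                        let row = &matrix[r * cols..(r + 1) * cols];
                        let dot: f32 = row.iter().zip(x).map(|(a, b)| a * b).sum();
                        local.push((r, dot));
                    }
                }
                local
            }));
        }
        for handle in handles {
            partials.push(handle.join().unwrap());
        }
    });
    // Scatter each worker's results back into row order.
    let mut y = vec![0.0; rows];
    for (r, value) in partials.into_iter().flatten() {
        y[r] = value;
    }
    y
}

fn main() {
    // 130 rows of [r, 1] times x = [1, 2] gives y[r] = r + 2.
    let matrix: Vec<f32> = (0..130).flat_map(|r| vec![r as f32, 1.0]).collect();
    let y = matvec_chunked(&matrix, 2, &[1.0, 2.0], 4);
    println!("{} {} {}", y[0], y[64], y[129]);
}
```

Because chunks are handed out on demand, a worker that is descheduled by the OS does not stall the others, which is one reason the fastest thread count has to be measured rather than assumed.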
Emit JSON for scripts or CI artifacts:
rusty-llm ./models/model.gguf \
--bench-json \
--bench-runs 5 \
--max-tokens 64 \
--prompt "Explain SIMD briefly" > benchmark.json
Benchmark output includes prompt tokens, generated tokens, prefill time, decode time, wall time, and aggregate throughput. Use the same model, prompt, temperature, seed, thread count, and build flags when comparing changes.
Ollama and LM Studio are often faster on macOS because they use heavily tuned llama.cpp kernels and GPU paths. RustyLLM benchmark numbers are most useful for tracking RustyLLM changes against itself.
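When comparing runs, it helps to turn the raw counts and timings into tokens per second for prefill and decode separately. A minimal sketch, assuming throughput is simply tokens divided by elapsed time; the exact definition RustyLLM uses for its aggregate figure is not stated here, and `tokens_per_second` is a hypothetical helper.

```rust
/// Tokens per second from a token count and a duration in milliseconds.
fn tokens_per_second(tokens: u64, millis: u64) -> f64 {
    if millis == 0 {
        return 0.0; // avoid dividing by zero on very short runs
    }
    tokens as f64 * 1000.0 / millis as f64
}

fn main() {
    // Values in the shape of the /generate response fields shown earlier.
    let (prompt_tokens, prefill_ms) = (123, 42);
    let (generated_tokens, decode_ms) = (64, 180);
    println!("prefill: {:.1} tok/s", tokens_per_second(prompt_tokens, prefill_ms));
    println!("decode:  {:.1} tok/s", tokens_per_second(generated_tokens, decode_ms));
}
```

Prefill and decode usually differ by an order of magnitude, so a single combined number hides which phase a change actually affected.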
Inspect compatibility without loading model weights:
rusty-llm ./models/model.gguf --inspect
List tensors through the main binary:
rusty-llm ./models/model.gguf --list-tensors
Run utility binaries:
cargo run --release --bin list_tensors -- ./models/model.gguf
cargo run --release --bin analyze_gguf -- ./models/model.gguf
analyze_gguf is currently focused on Gemma-style layer structure analysis.
The embedding demo computes embedding vectors and compares cosine similarity:
cargo run --release --bin embedding_demo -- \
./models/embed.gguf \
"Albert Einstein was a physicist." \
"Einstein developed the theory of relativity." \
"A banana is a tropical fruit."
You can also call the CLI embedding mode directly:
rusty-llm ./models/embed.gguf --embed --prompt "The quick brown fox"
use rusty_llm::runtime::{GenerationOptions, Runner};
fn main() -> Result<(), String> {
    let (runner, _) = Runner::from_path("./models/model.gguf")?;
    let result = runner.generate("Hello", &GenerationOptions::default())?;
    println!("{}", result.text);
    let emb = runner.embed("The quick brown fox")?;
    println!("dim={} tokens={}", emb.embedding.len(), emb.token_count);
    Ok(())
}
Chat generation:
use rusty_llm::runtime::{ChatMessage, GenerationOptions, Runner};
fn main() -> Result<(), String> {
    let (runner, _) = Runner::from_path("./models/model.gguf")?;
    let messages = vec![
        ChatMessage::user("Explain GGUF in one sentence."),
    ];
    let result = runner.generate_chat(&messages, &GenerationOptions::default())?;
    println!("{}", result.text);
    Ok(())
}
Cosine similarity:
use rusty_llm::runtime::{cosine_similarity, Runner};
fn main() -> Result<(), String> {
    let (runner, _) = Runner::from_path("./models/embed.gguf")?;
    let a = runner.embed("Einstein developed relativity.")?;
    let b = runner.embed("Relativity was developed by Einstein.")?;
    println!("{:.4}", cosine_similarity(&a.embedding, &b.embedding)?);
    Ok(())
}
Default Cargo features:
- `full`: enables the default native application feature set.
- `cli`: builds the CLI binaries and enables JSON helpers for command-line tools.
- `server`: enables the HTTP server.
- `tls`: enables HTTPS serving through `rustls`.
- `metal`: compiles the optional macOS Metal backend when Xcode command line tools are available. When compiled and the GPU backend is available, it is used by default at runtime (set `RUSTY_LLM_METAL=0` to opt out).
Mistral Ultra mode:
RUSTY_LLM_METAL=1 rusty-llm --model-dir ./models --model Ministral \
--profile mistral-ultra --prompt "Explain metal inference briefly."
With RUSTY_LLM_METAL=1, --backend auto uses the standard Metal route for
Ministral 3 models. That is the measured faster default for the current 3B
Q4_K_M benchmark. Use --backend metal-ultra or --profile mistral-ultra only
when explicitly comparing the more aggressive dispatch thresholds on your Mac.
Use --backend cpu for CPU-only A/B checks without changing environment
variables. --backend metal-ultra enables the same aggressive per-thread Metal
routing as --profile mistral-ultra while leaving the model profile explicit.
For repeatable checks, use make bench-model-ultra MODEL=... or
make kernel-bench-ultra MODEL=.... Tune the aggressive routing thresholds with
RUSTY_LLM_METAL_ULTRA_Q4K_MIN_ROWS, RUSTY_LLM_METAL_ULTRA_Q6K_MIN_ROWS, and
RUSTY_LLM_METAL_ULTRA_ATTENTION_MIN_TOKENS; all default to 512. Metal
matvec and attention calls use reusable copy buffers by default, which is faster
on the current Ministral 3B Q4_K_M benchmark than Shared/NoCopy wrapping. Set
RUSTY_LLM_METAL_NOCOPY=1 only when benchmarking the no-copy path on your Mac.
Immutable Metal weight buffers use private storage by default for better decode
throughput; set RUSTY_LLM_METAL_PRIVATE_WEIGHTS=0 if memory pressure matters
more than speed. Mistral-style Q4_K/Q4_K/Q6_K FFN blocks and post-attention FFN
blocks are fused into fewer Metal command buffers by default; set
RUSTY_LLM_METAL_FUSED_FFN=0 or RUSTY_LLM_METAL_POST_FFN=0 to compare against
the older split-dispatch paths. Set RUSTY_LLM_METAL_PROFILE=1 to print aggregate
Metal command-buffer, dispatch, transfer, allocation, CPU encode, and GPU timing
counters at process exit.
Optional feature:
- `wasm`: enables the wasm-bindgen interface and is intended for `wasm32-unknown-unknown` builds without default native features.
Examples:
cargo build --release --features full
cargo check --no-default-features --features cli,server,tls
cargo check --no-default-features --features wasm --target wasm32-unknown-unknown --lib
make wasm
The release profile uses opt-level = 3, fat LTO, one codegen unit, stripping,
and panic = "abort". The bench profile mirrors the release optimizer while
keeping line-table debug info for profiler output. For smaller WebAssembly
artifacts, cargo build --profile wasm-release --no-default-features --features wasm --target wasm32-unknown-unknown --lib
uses size-oriented optimization.
The browser demo lives in demo/wasm/index.html. Generated wasm-bindgen output
is intentionally ignored via demo/wasm/pkg/ and should not be committed to the
main branch.
The Deploy WASM demo GitHub Actions workflow builds the WASM package in CI,
assembles a temporary Pages artifact, and deploys it with GitHub Pages. To use
it, configure the repository's Pages source to GitHub Actions in the GitHub
repository settings. The deployed page contains:
- `index.html` from `demo/wasm/index.html`
- generated `pkg/rusty_llm.js`
- generated `pkg/rusty_llm_bg.wasm`
- generated TypeScript declaration files
No generated WASM binaries are written back to the repository branch.
- `RUSTY_LLM_MODEL_DIR`: default directory used by model discovery.
- `RUSTY_LLM_FAST_ATTN`: enables the approximate fast attention path when set.
- `RUSTY_LLM_METAL`: controls the macOS Metal Q4_0/Q8_0/Q4_K/Q6_K and long-context attention GPU paths. When the binary was built with the `metal` feature and the backend compiled and is available, Metal is used by default. Set `RUSTY_LLM_METAL=0` to force the CPU path; `RUSTY_LLM_METAL=1` keeps it explicit.
Useful checks:
cargo fmt --check
cargo clippy --all-targets --features full -- -D warnings
cargo test --features full
cargo check --no-default-features --features wasm --target wasm32-unknown-unknown --lib
The CI workflow runs the full native check set and the no-default-features WASM
library check on Ubuntu. Local GitHub Actions runs are supported with act:
act pull_request
The repository includes .actrc runner mappings and skips GitHub-hosted-only
deployment/cache steps when ACT=true.
Focused embedding tests:
cargo test runtime::tests
- Native builds use memory mapping; WASM builds load GGUF bytes from memory.
- Generation calls are serialized inside a `Runner` to protect shared inference state.
- The HTTP parser is intentionally small and expects HTTP/1.1 requests with `Content-Length` for JSON `POST` bodies.
- Server requests have bounded header and body sizes, per-connection timeouts, and a configurable concurrency cap.
- Some GGUF chat templates are mapped into internal prompt renderers; unsupported templates fall back to a plain `System/User/Assistant` transcript.
- SSE responses do not include `Content-Length`; the stream ends with `data: [DONE]` and the socket closes.
- Embeddings are mean-pooled over input token positions and L2-normalized.
- Unknown model IDs sent to the API are accepted and mapped to the loaded model, which helps existing OpenAI, LM Studio, and RAG clients work without knowing the exact local model name.
- Multimodal request bodies are accepted for API compatibility, but images are represented as text placeholders rather than processed by a vision encoder.
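The embedding post-processing named in these notes (mean pooling over token positions, then L2 normalization) can be sketched in a few lines. The function names are hypothetical and the per-token vectors are hand-written stand-ins for real hidden states.

```rust
/// Averages per-token vectors into a single vector of the same dimension.
fn mean_pool(token_vectors: &[Vec<f32>]) -> Vec<f32> {
    let dim = token_vectors.first().map_or(0, |v| v.len());
    let mut pooled = vec![0.0f32; dim];
    for vector in token_vectors {
        for (sum, value) in pooled.iter_mut().zip(vector) {
            *sum += *value;
        }
    }
    let count = token_vectors.len().max(1) as f32;
    for sum in pooled.iter_mut() {
        *sum /= count;
    }
    pooled
}

/// Scales a vector to unit length; an all-zero vector is left unchanged.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Dot product; for unit-length vectors this equals cosine similarity.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    // Two token positions with two-dimensional hidden states.
    let mut embedding = mean_pool(&[vec![3.0, 0.0], vec![3.0, 8.0]]);
    l2_normalize(&mut embedding);
    println!("{:?} self-similarity={:.3}", embedding, dot(&embedding, &embedding));
}
```

Normalizing once at embedding time is what lets downstream code compare vectors with a plain dot product.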
RustyLLM is intentionally small and learning-oriented. If you need production throughput, GPU offloading, a polished GUI, or broader model and quantization coverage, one of the following projects is likely a better fit.
| Project | Language | Focus |
|---|---|---|
| llama.cpp | C/C++ | Reference implementation for GGUF inference; the origin of the GGUF format, quantization schemes, and most SIMD/GPU kernels used across the ecosystem. Highest raw throughput for CPU and GPU inference. |
| Ollama | Go + llama.cpp | User-friendly CLI and REST API wrapping llama.cpp; pulls models automatically and exposes the same /api/ routes that RustyLLM emulates. Best choice when you want a local model running in one command. |
| LM Studio | Electron + llama.cpp | Desktop GUI for discovering, downloading, and chatting with local GGUF models; includes an OpenAI-compatible local server. Best for non-developers or when a visual interface matters. |
| mistral.rs | Rust | Production-grade Rust inference engine with CUDA/Metal GPU support, speculative decoding, vision models, and a Python/HTTP API. The Rust alternative to RustyLLM for real workloads. |
| candle | Rust | Hugging Face's minimalist Rust ML framework. Runs many model families from Safetensors or GGUF; designed as a library rather than a standalone runner. |
| llamafile | C/C++ | Packages a model and the llama.cpp runtime into a single cross-platform executable. Useful when you want to distribute a self-contained model binary. |
| GPT4All | C++ + Qt | Cross-platform desktop application with a chat UI and a local model store; targets end users rather than developers. |
| koboldcpp | Python + llama.cpp | llama.cpp frontend focused on creative writing and role-play; includes a web UI and KoboldAI-compatible API routes. |