Steering and Interpreting AI-Like Discourse Directions in Hidden States
BeyondStyle is the codebase for studying whether the recognisable "AI-generated" feel of LLM prose is represented as a direction in model hidden states. The project builds controlled human/AI continuation pairs, fits layerwise source-contrast directions in an open-weight target model, and tests whether those directions can separate, steer, and interpret AI-like discourse.
The central claim is intentionally scoped: the learned direction is not a universal detector or a fully disentangled authorship axis. It is an operational AI-like discourse direction under controlled continuation settings.
The pipeline starts from a shared prefix and compares two continuations: the original human continuation and an AI continuation generated under a coarse content sketch. The sketch controls topic and entities, but it does not prescribe wording, ordering, discourse markers, or rhetorical framing. This makes the paired corpus suitable for studying discourse realisation rather than only topic mismatch.
Figure: paired continuation construction, layerwise source-direction fitting, inference-time steering, and linguistic readouts.
For each layer
Each sample is a tuple
For every inspected layer, the repository compares three estimators:
| Estimator | What it fits |
|---|---|
| `raw` | Normalised human/AI mean difference. |
| `nullspace` | Mean difference after removing leading within-class variation. |
| `fisher` | Ridge-regularised Fisher LDA direction. |
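To make the three estimators in the table concrete, here is a minimal NumPy sketch on synthetic data. It is not the repository's implementation: the function name `fit_directions`, the number of removed components `k`, and the `ridge` value are illustrative assumptions, and "leading within-class variation" is read here as the top principal directions of the class-centred hidden states.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def fit_directions(H_human, H_ai, k=2, ridge=1e-2):
    """Fit three candidate directions from (n_samples, d) hidden-state arrays.

    k and ridge are illustrative hyperparameters, not values from the paper.
    """
    mu_h, mu_a = H_human.mean(axis=0), H_ai.mean(axis=0)
    diff = mu_a - mu_h  # points from the human mean toward the AI mean

    # raw: normalised mean difference
    raw = unit(diff)

    # Within-class residuals: each sample minus its own class mean
    resid = np.vstack([H_human - mu_h, H_ai - mu_a])

    # nullspace: project the top-k within-class principal directions out of diff
    _, _, Vt = np.linalg.svd(resid, full_matrices=False)
    top = Vt[:k]  # (k, d), orthonormal rows
    nullspace = unit(diff - top.T @ (top @ diff))

    # fisher: ridge-regularised LDA, (S_w + ridge * I)^-1 (mu_a - mu_h)
    S_w = resid.T @ resid / len(resid)
    fisher = unit(np.linalg.solve(S_w + ridge * np.eye(S_w.shape[0]), diff))

    return {"raw": raw, "nullspace": nullspace, "fisher": fisher}

# Synthetic stand-in for one layer's hidden states (assumption: d = 16)
rng = np.random.default_rng(0)
d = 16
shift = np.zeros(d)
shift[0] = 2.0
H_human = rng.normal(size=(200, d))
H_ai = rng.normal(size=(200, d)) + shift
directions = fit_directions(H_human, H_ai)
```

All three directions are returned with unit norm, so projection scores from different estimators are on a comparable scale.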
The direction is oriented so that AI continuations have larger projection scores than human continuations. Separability is then measured by projecting held-out hidden states onto this one-dimensional direction and computing ROC-AUC.
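The orientation and separability steps can be sketched as follows. This is a hedged illustration on synthetic data, not the code in `scripts/eval_heldout_separability.py`; the helper names `orient`, `roc_auc`, and `heldout_auc` are hypothetical, and ROC-AUC is computed directly as the probability that an AI sample outscores a human sample.

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """ROC-AUC as P(positive outscores negative), counting ties as 0.5."""
    pos = np.asarray(scores_pos)[:, None]
    neg = np.asarray(scores_neg)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def orient(direction, H_human_train, H_ai_train):
    """Flip the direction if needed so AI training states score higher on average."""
    if (H_ai_train @ direction).mean() < (H_human_train @ direction).mean():
        return -direction
    return direction

def heldout_auc(direction, H_human_test, H_ai_test):
    """Project held-out states onto the direction and score AI as the positive class."""
    return roc_auc(H_ai_test @ direction, H_human_test @ direction)

# Synthetic hidden states: the AI class is shifted along the first axis
rng = np.random.default_rng(0)
d = 8
shift = np.zeros(d)
shift[0] = 3.0
H_human = rng.normal(size=(200, d))
H_ai = rng.normal(size=(200, d)) + shift

flipped = -np.eye(d)[0]  # deliberately points the wrong way
oriented = orient(flipped, H_human[:100], H_ai[:100])
auc = heldout_auc(oriented, H_human[100:], H_ai[100:])
```

Orientation uses only the training split, so the held-out AUC is not influenced by the sign choice.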
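Steering is a lightweight inference-time intervention, not fine-tuning of the LLM weights. Once a direction is fitted from hidden states, a forward hook adds it to the hidden states of the chosen layer during decoding, scaled by a signed steering strength (the values passed to --alphas).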
The same prompts and decoding settings are used across steering strengths, so differences are attributable to the sign and magnitude of the hidden-state intervention.
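A minimal PyTorch sketch of this kind of intervention is shown below, using `nn.Module.register_forward_hook` on a toy linear layer that stands in for a transformer block. The helper name `add_steering_hook` is hypothetical, and the repository's actual hook placement and scaling may differ.

```python
import torch
from torch import nn

def add_steering_hook(layer, direction, alpha):
    """Register a forward hook that adds alpha * unit(direction) to the layer output."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Many decoder-block implementations return a tuple whose first element
        # is the hidden state, so both output shapes are handled here.
        if isinstance(output, tuple):
            steered = output[0] + alpha * direction.to(output[0].dtype)
            return (steered,) + output[1:]
        return output + alpha * direction.to(output.dtype)

    # Returning a value from a forward hook replaces the module's output.
    return layer.register_forward_hook(hook)

# Toy stand-in for one transformer layer (assumption: hidden size 4)
torch.manual_seed(0)
layer = nn.Linear(4, 4)
x = torch.randn(2, 3, 4)  # (batch, sequence, hidden)
direction = torch.tensor([1.0, 0.0, 0.0, 0.0])

baseline = layer(x)
handle = add_steering_hook(layer, direction, alpha=2.0)
steered = layer(x)
handle.remove()  # detach the hook so later calls are unsteered
restored = layer(x)
```

A positive `alpha` pushes hidden states toward the AI side of the direction and a negative `alpha` toward the human side; removing the handle restores the unmodified model.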
The paper audits whether the direction is reducible to sketch leakage, sketch adherence, length, shallow surface features, lexical n-grams, or generator-specific fingerprints. These checks support interpreting the direction as a source-associated discourse-realisation contrast rather than a simple artefact of the paired construction.
Figure: construct-validity checks show that the paired-corpus signal is not explained by sketch-only prediction or simple sketch-following behaviour.
Hidden states contain a strong human/AI contrast
On 500 paired CNN/DailyMail continuations with Llama-3.1-8B-Instruct, source directions separate held-out human and AI continuations with near-ceiling ROC-AUC in middle layers. At layer 15, raw and nullspace both reach 0.998 held-out AUC, while fisher reaches 1.000.
Figure: held-out human/AI separability is strongest in middle-to-late layers, where the main steering experiments are performed.
The direction is not confined to the original news corpus. Without refitting on the target domain, the layer-15 nullspace direction trained on the main English corpus reaches 0.975 AUC on OpenWebText; the reverse OpenWebText -> original transfer reaches 0.994 AUC.
Figure: OpenWebText generator-held-out evaluation remains strong across a disjoint generator pool, supporting transfer beyond a single generator fingerprint.
At layer 15, steering along the fitted direction changes generated continuations in the expected signed direction. For the nullspace direction, Winston AI-likelihood moves from 0.555 at
The main downstream evaluation uses Llama-3.1-8B-Instruct with lm-evaluation-harness on nine accuracy tasks plus WikiText perplexity. The clearest positive pattern appears under mild AI-side nullspace steering. At
Figure: task-level response surface for nullspace steering. Mild AI-side steering gives the most consistent positive accuracy pattern, while human-side steering is often harmful.
The AI side is not a single keyword list. English token, phrase, and construction readouts associate it with explicit explanation, salience marking, continuity, institutional/evaluative wording, and overt causal, purposive, concessive, or contrastive linking. The Chinese interpretability check with Qwen3-4B shows compatible discourse-level evidence: explicit progression, evaluative emphasis, broad explanatory framing, and purposive or contrastive linking.
The checked-in paired datasets can be used directly for open-weight target-model fitting and evaluation. API-backed generators are only needed if you want to build new paired continuations.
pip install -r requirements.txt
cp .env.example .env
# Set LOCAL_MODEL_PATH in .env, for example:
# LOCAL_MODEL_PATH=/path/to/Llama-3.1-8B-Instruct
python -c "from src.config import print_config; print_config()"

python scripts/steering_pipeline.py \
--dataset data/paired/en/paired_mixed_500.json \
--multi \
--denoise raw nullspace fisher \
--alphas -2 -1 0 1 2

This extracts hidden states from the local target model, fits the three source-direction estimators, and writes steered generations under outputs/en/03_steering_vectors/.
# Held-out human/AI separability
python scripts/eval_heldout_separability.py --language en
# Leave-one-generator-out robustness
python scripts/eval_generator_holdout.py \
--dataset data/paired/en/paired_mixed_500.json
# lm-evaluation-harness steering grid
python scripts/eval_lm_benchmark.py --mode grid --language en

The paper's open-model benchmark grid covers HellaSwag, ARC-Easy, ARC-Challenge, OpenBookQA, WinoGrande, PIQA, BoolQ, RACE, TruthfulQA-MC2, and WikiText.
python scripts/steering_pipeline.py \
--dataset data/paired/en/paired_openwebtext_mixed_300.json \
--multi \
--denoise raw nullspace \
--skip_generation
python scripts/eval_cross_domain_transfer.py
python scripts/eval_cross_domain_transfer.py --reverse

python scripts/steering_pipeline.py \
--dataset data/paired/zh/paired_mixed_500.json \
--multi \
--denoise raw nullspace fisher
python scripts/run_interpretability.py --language zh

The Chinese analysis is intended as cross-linguistic interpretability evidence, not as a full replication of the English separability, steering, and downstream experiments.
src/                  Python package: config, data, extraction, steering, eval, visualization
scripts/              Pipeline entry points; each script supports --help
data/
  prompts/{en,zh}/    held-out generation prompts
  paired/{en,zh}/     paired human/AI continuation datasets
outputs/              result tree; artefacts are git-ignored except layout READMEs
paper/                paper source and figures
assets/               README figures copied from paper/figs
Outputs are organised as:
outputs/{lang}/{section}/{model_id}/{dataset_tag}/...
For example, mixed_500 and openwebtext_mixed_300 are kept separate by dataset tag.
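As a small illustration of this layout, the sketch below builds output directories with `pathlib`. The helper `output_dir` is hypothetical, and using `Llama-3.1-8B-Instruct` as the `model_id` directory name is an assumption; the repository may derive the identifier differently.

```python
from pathlib import Path

def output_dir(lang, section, model_id, dataset_tag, root="outputs"):
    """Compose outputs/{lang}/{section}/{model_id}/{dataset_tag}."""
    return Path(root) / lang / section / model_id / dataset_tag

# Same language, section, and model; only the dataset tag differs.
a = output_dir("en", "03_steering_vectors", "Llama-3.1-8B-Instruct", "mixed_500")
b = output_dir("en", "03_steering_vectors", "Llama-3.1-8B-Instruct", "openwebtext_mixed_300")
```

Because the dataset tag is the last path component, results for the two corpora sit side by side under the same model directory without overwriting each other.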




