
Ignore invisible whitespace in text matching #147

Open
passsy wants to merge 1 commit into main from normalize-visible-text

passsy (Owner) commented Jun 11, 2026

Fixes #138

spotText and the AnyText text selectors now match what a user sees: invisible characters and characters that render as a space are ignored, so tests can use plain characters instead of escapes.

// Given Text('foo\u{200B}bar')  → ZWSP inserted for line breaks
spotText('foobar').existsOnce();

// Given Text('12\u{202F}345\u{00A0}€')  → narrow + non-breaking spaces
spotText('12 345 €').existsOnce();

Normalization rule

normalizeVisibleText(String) applies a clear, two-part rule:

  • Removed (invisible, no meaning): Zero Width Space U+200B, Soft Hyphen U+00AD, Word Joiner U+2060, BOM U+FEFF
  • Folded to a regular space: every Unicode space separator (Zs) — Non-Breaking Space U+00A0, Narrow No-Break Space U+202F, thin/em/figure spaces, ideographic space, …

Deliberately left untouched, because they carry meaning or structure: Zero Width Joiner / Non-Joiner (emoji sequences, Indic/Arabic shaping), bidi controls, the U+FFFC WidgetSpan placeholder, tabs and line breaks.
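
To make the two-part rule concrete, here is a minimal sketch in plain Dart. It is an illustrative re-implementation under stated assumptions, not the code from this PR: the name normalizeVisibleTextSketch and the helpers _removedCodePoints and _isSpaceSeparator are hypothetical, and the Zs check simply lists the code points in the Unicode space-separator category.

```dart
// Hypothetical sketch of the rule described above; not the package's source.

/// Code points that are dropped: ZWSP, soft hyphen, word joiner, BOM.
const Set<int> _removedCodePoints = {0x200B, 0x00AD, 0x2060, 0xFEFF};

/// Returns true for code points in Unicode general category Zs.
bool _isSpaceSeparator(int rune) =>
    rune == 0x0020 || // regular space
    rune == 0x00A0 || // non-breaking space
    rune == 0x1680 || // ogham space mark
    (rune >= 0x2000 && rune <= 0x200A) || // en quad .. hair space
    rune == 0x202F || // narrow no-break space
    rune == 0x205F || // medium mathematical space
    rune == 0x3000; // ideographic space

/// Removes the invisible characters and folds every Zs character to ' '.
/// Everything else (ZWJ/ZWNJ, bidi controls, U+FFFC, tabs, line breaks)
/// is copied through unchanged.
String normalizeVisibleTextSketch(String input) {
  final buffer = StringBuffer();
  for (final rune in input.runes) {
    if (_removedCodePoints.contains(rune)) {
      continue;
    }
    if (_isSpaceSeparator(rune)) {
      buffer.write(' ');
    } else {
      buffer.writeCharCode(rune);
    }
  }
  return buffer.toString();
}
```

With this sketch, 'foo\u{200B}bar' becomes 'foobar' and '12\u{202F}345\u{00A0}€' becomes '12 345 €', while a string containing a Zero Width Joiner, a tab, or a line break passes through unchanged. Iterating over runes rather than UTF-16 code units keeps emoji and other characters outside the Basic Multilingual Plane intact.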

Matching the exact characters

When a test needs to match the exact characters, AnyText now exposes a rawText property alongside the normalized text, which generates a raw selector family:

// Text('foo\u{00A0}bar')
spotText('foo').withText('foo bar').existsOnce();           // normalized
spotText('foo').withRawText('foo\u{00A0}bar').existsOnce(); // exact

New: whereRawText, withRawText, hasRawText, getRawText, hasRawTextWhere on WidgetSelector<AnyText>, plus public normalizeVisibleText, extractTextContent(Element), the TextContent holder (raw/normalized), and zeroWidthSpace/nonBreakingSpace constants.

Scope

The concrete spot<Text>() / spot<SelectableText>() / spot<EditableText>() selectors read their native diagnostic props and are unchanged. Use spotText / spot<AnyText> for the normalized behavior.

Builds on the idea from #139 (thx @MichaelTamm), extended to a complete character set with a raw escape hatch.

spotText and the AnyText text selectors now ignore characters that are
invisible or visually indistinguishable from a regular space, so tests
can be written with plain characters instead of escapes.

AnyText.normalizeVisibleText removes meaningless invisibles (ZWSP, soft
hyphen, word joiner, BOM) and folds every Unicode space separator (Zs)
to a regular space. Meaningful invisibles (ZWJ, bidi controls, the
U+FFFC WidgetSpan placeholder) and line structure (tabs, line breaks)
are kept.

Text extraction is consolidated into AnyText.extractText(), which returns
both the raw and normalized text via AnyTextContent. AnyText exposes the
normalized value as the 'text' diagnostic prop and the literal as a new
'rawText' prop, generating the whereRawText/withRawText/hasRawText family
to match exact characters when needed.

Closes #138
passsy force-pushed the normalize-visible-text branch from 22329ae to c23470f on June 11, 2026 15:41
Development

Successfully merging this pull request may close these issues.

Improve support for ZWSP and NBSP characters
