Page holdings checker improvements #59

Open
eumalin wants to merge 2 commits into main from ETT-1559-ETT-1563-holdings-checker


eumalin commented Jun 15, 2026

Contributor

This includes tickets ETT-1507, ETT-1508, ETT-1559, ETT-1560, ETT-1561, and ETT-1563.

You can check it out in dev at https://dev-3.www.hathitrust.org/print-holdings-checker/

ETT-1559 - Code quality

I addressed security in the previous PR for this ticket.
In this PR I used Promise.all instead of a manual counter for parallel file processing. There are a few smaller cleanups too; for example, the file extension check is now case-insensitive, so .TSV works.
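
For reviewers who have not opened the diff, here is a minimal sketch of the pattern described above. It is not the code in this PR: `hasTsvExtension`, `processFiles`, and the `checkFile` callback are hypothetical names, and the result shape is an assumption made for illustration.

```javascript
// Hypothetical helper: accepts ".tsv", ".TSV", ".Tsv", and so on.
function hasTsvExtension(filename) {
  return filename.toLowerCase().endsWith('.tsv');
}

// Process every file in parallel and resolve once all of them are done.
// `checkFile` stands in for the real per-file validation.
async function processFiles(files, checkFile) {
  return Promise.all(
    files.map(async (file) => {
      if (!hasTsvExtension(file.name)) {
        return { name: file.name, ok: false, errors: ['Not a .tsv file'] };
      }
      try {
        return await checkFile(file);
      } catch (err) {
        // Turn a failure into a result so one bad file does not reject the whole batch.
        const message = err && err.message ? err.message : String(err);
        return { name: file.name, ok: false, errors: [message] };
      }
    })
  );
}
```

Promise.all preserves input order in its result array, so the results line up with the dropped files without any counter bookkeeping.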

Page loaded and valid .TSV accepted

image

ETT-1560 - Filename validation

The filename check covers the file extension, member ID, holding type, update type, and date. An update type of partial is reported as an error since the backend never implemented it. The date has to be in YYYYMMDD format and be a real calendar date. The member ID is validated against the HathiTrust institutions list, which is fetched on page load.
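
As an illustration of the "real calendar date" rule only, here is a small sketch. The function name `isValidYyyymmdd` is hypothetical, and the approach (round-tripping through a UTC Date) is one common technique, not necessarily what the PR does.

```javascript
// Returns true only for an 8-digit YYYYMMDD string that names a real calendar date.
function isValidYyyymmdd(text) {
  const match = /^(\d{4})(\d{2})(\d{2})$/.exec(text);
  if (!match) return false;
  const [year, month, day] = match.slice(1).map(Number);
  // Date.UTC normalizes out-of-range values (Feb 30 becomes Mar 1),
  // so a date is real only if it survives the round trip unchanged.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}
```

Under this sketch, 20240229 passes (leap year) while 20230229 and 20241301 are rejected. Years below 0100 are also rejected, because Date.UTC maps two-digit years into the 1900s.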

The institutions fetch is blocked by CORS everywhere except www.hathitrust.org itself, so the member ID check only runs in production. Elsewhere the fetch fails silently and the check is skipped rather than blocking the other checks, so we can only test it fully once it's in production.
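
A rough sketch of the fail-silent behavior, under stated assumptions: `loadMemberIds` and `checkMemberId` are hypothetical names, the injectable `fetchImpl` parameter exists only so the sketch can run without a network, and treating the first tab-separated column as the member ID is a guess about the file layout.

```javascript
// Resolves to a Set of member IDs, or null when the list cannot be fetched.
async function loadMemberIds(url, fetchImpl = fetch) {
  try {
    const response = await fetchImpl(url);
    if (!response.ok) return null;
    const text = await response.text();
    // Assumption: the member ID is the first tab-separated column of each line.
    const ids = text
      .split(/\r?\n/)
      .map((line) => line.split('\t')[0].trim())
      .filter((id) => id !== '');
    return new Set(ids);
  } catch {
    // A CORS or network failure rejects the fetch; report "unknown" instead of an error.
    return null;
  }
}

// Returns an error message, or null when the ID is known or the check must be skipped.
function checkMemberId(memberId, memberIds) {
  if (memberIds === null) return null; // list unavailable: skip the check
  return memberIds.has(memberId) ? null : `Unrecognized member ID: ${memberId}`;
}
```

Returning null from both functions keeps the "list unavailable" case indistinguishable from "no problem found" for the caller, which matches the non-blocking behavior described above.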

Bad date in filename

image

Partial update type rejected

image

Unrecognized member ID

image

ETT-1561 - Loading states and accessibility

Each file gets a spinner icon when it is dropped, shown until processing finishes. Added aria-live="polite" to the results container. VoiceOver on Safari doesn't reliably announce live region updates on a container that starts as display:none, so we also set aria-label to "Processing N files..." and focus the container; the label is then announced on focus. A Clear results button appears after processing. The spinner respects prefers-reduced-motion.
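
To make the VoiceOver workaround concrete, here is a sketch of the label-and-focus step. The names `processingLabel` and `announceProcessing` are hypothetical, the singular/plural wording is an assumption, and setting tabindex="-1" is my assumption about how the container is made focusable.

```javascript
// Builds the label text; the singular form for one file is an assumption.
function processingLabel(count) {
  return `Processing ${count} ${count === 1 ? 'file' : 'files'}...`;
}

// `container` is any element-like object that has setAttribute() and focus().
function announceProcessing(container, count) {
  container.setAttribute('aria-live', 'polite');
  container.setAttribute('aria-label', processingLabel(count));
  // A non-interactive element needs a tabindex before script can focus it.
  container.setAttribute('tabindex', '-1');
  container.focus();
}
```

In the page this would be called with the results container, for example `announceProcessing(document.getElementById('results'), files.length)`, where the `results` id is made up for the example.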

Spinner loading state

image

Multiple files - one pass, one error

image

ETT-1507, ETT-1508, ETT-1563 - Data row validation

Checks the first 1,000 rows and notes on the card when a file was too large to check fully.
image
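
A minimal sketch of the row cap, assuming the limit is applied to lines of the decoded text. `selectRowsToCheck` and its return shape are hypothetical, and whether a header line counts toward the 1,000 is not stated here, so the sketch counts every line.

```javascript
const MAX_ROWS_CHECKED = 1000; // limit stated in the PR description

// Splits the text into lines and returns the slice to validate plus a truncation flag.
function selectRowsToCheck(text, maxRows = MAX_ROWS_CHECKED) {
  const lines = text.split(/\r?\n/);
  // A trailing newline yields one empty final element; drop it so it is not counted as a row.
  if (lines.length > 0 && lines[lines.length - 1] === '') lines.pop();
  return {
    rows: lines.slice(0, maxRows),
    totalRows: lines.length,
    truncated: lines.length > maxRows,
  };
}
```

The `truncated` flag is what would drive the note on the card that the file was too large to check fully.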

If there were a lot of invalid status values:
image

  • Checks for empty lines, rows separated by commas instead of tabs (with a hint to re-export as TSV), and wrong column counts, reporting line numbers for each
  • Flags OCNs in scientific notation (Excel reformats large integers), Alma MMS IDs (which start with 99 and have more than 15 digits, with a hint to use OCLC numbers instead), and other non-integer values (an OCN must be a positive integer)
  • Enforces that status is CH, LM, WD, or empty; condition is BRT or empty; and govdoc is 0, 1, or empty
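
The field rules in the list above can be sketched as follows. This is an illustration under assumptions, not the PR's code: `classifyOcn` and `checkEnumField` are hypothetical names, the category strings are made up, and exact case-sensitive matching of the allowed values is assumed.

```javascript
// Classifies a single OCN cell according to the rules listed above.
function classifyOcn(value) {
  const v = value.trim();
  if (/^\d+$/.test(v)) {
    // Alma MMS IDs: start with 99 and have more than 15 digits.
    if (v.startsWith('99') && v.length > 15) return 'alma-mms-id';
    // All zeros is an integer but not a positive one.
    if (/^0+$/.test(v)) return 'not-positive-integer';
    return 'ok';
  }
  // Excel-style scientific notation such as 1.23457E+11.
  if (/^\d+(\.\d+)?e[+-]?\d+$/i.test(v)) return 'scientific-notation';
  return 'not-positive-integer';
}

// Allowed values per field; the empty string means the field was left blank.
const ALLOWED_VALUES = {
  status: new Set(['CH', 'LM', 'WD', '']),
  condition: new Set(['BRT', '']),
  govdoc: new Set(['0', '1', '']),
};

// Returns an error message, or null when the value is allowed.
function checkEnumField(field, value) {
  return ALLOWED_VALUES[field].has(value) ? null : `Invalid ${field}: "${value}"`;
}
```

Checking for the Alma pattern before the generic integer test matters, since an MMS ID is itself a valid positive integer and would otherwise pass.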

Data row errors

image

If there were a lot of rows with similar errors:
image

Encoding detection

I used TextDecoder with fatal: true to detect non-UTF-8 files. If it throws, we warn the user and fall back to lenient decoding so the rest of the checks still run.
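
A short sketch of the detect-then-fall-back approach described above. `decodeWithFallback` and its return shape are hypothetical; the TextDecoder behavior it relies on is standard.

```javascript
// Decodes bytes as UTF-8 and reports whether the input was strictly valid.
function decodeWithFallback(bytes) {
  try {
    // With fatal: true, decode() throws a TypeError on malformed UTF-8.
    const text = new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return { text, validUtf8: true };
  } catch (err) {
    // Lenient decoding replaces malformed sequences with U+FFFD so later checks can still run.
    const text = new TextDecoder('utf-8').decode(bytes);
    return { text, validUtf8: false };
  }
}
```

For example, the Latin-1 bytes for "café" (0x63 0x61 0x66 0xE9) are not valid UTF-8, so the strict pass throws and the lenient pass yields "caf" followed by a replacement character.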

Non-UTF-8 encoding warning

image

Rejected files

Files without a .tsv extension get an inline error.

CSV rejected

image

Created with the help of AI

eumalin added 2 commits June 12, 2026 14:38
ETT-1560 filename format validation

ETT-1561 loading states and clear button
accept .tsv regardless of extension case

focus output container on load, fix VoiceOver announcement

add encoding detection, member ID validation, and pluralization fix

Improve error message clarity and consistency in holdings checker

Fix meta display, add clarifying comments, fix indentation
eumalin marked this pull request as ready for review June 15, 2026 13:38
eumalin requested review from carylwyatt and moseshll June 15, 2026 15:56
import { buildCard, buildPendingCard, showError } from './ui.js';

// Fetch is CORS-blocked outside www.hathitrust.org (local dev, test); catch returns null and member ID check is silently skipped.
const memberIdsPromise = fetch('https://www.hathitrust.org/files/ht_institutions.tsv')

Contributor Author


Because of the CORS restriction, we can only test the member ID check fully once it's in production.


moseshll commented Jun 16, 2026

Contributor

@eumalin Not sure if this is out of scope, but I found a pattern of false positives in the recently submitted file from usu (see ETT-1604). They submit multiple OCNs per line delimited by semicolons (the spec, page 10, says semicolon or comma are fine; Scrub::Common is a bit overly generous [IMHO] with OCN_SPLIT_DELIM = /[,:;|\/ ]+/).

Example line:

in00015445124	(OCoLC)742453;(OCoLC)784466;ocm00742453

holdings-backend's Scrub::Ocn and Scrub::OcnCandidate might provide a framework for this effort. I think it's safe to say that split must be done on this field. Alas, it may make for more tedious error reporting.

I think it would be fair to make a new ticket to address ocn parsing if it is indeed out of scope. It's also fair to say that Scrub::MaxOcn would be inappropriate for this tool.
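
If it helps scope a follow-up ticket, here is a rough JavaScript sketch of splitting the OCN field before validating each candidate. It is not a port of Scrub::Ocn or Scrub::OcnCandidate: `splitOcnField` and `normalizeOcnCandidate` are hypothetical names, the split is limited to semicolon and comma, and the prefix list ("(OCoLC)", "ocm", "ocn", "on") is an assumption based on the example line and common OCLC prefixes.

```javascript
// Splits one OCN cell into candidates on semicolon or comma.
function splitOcnField(field) {
  return field
    .split(/[;,]/)
    .map((part) => part.trim())
    .filter((part) => part !== '');
}

// Strips an assumed set of OCLC prefixes and leading zeros.
// Returns the digit string, or null when what remains is not all digits.
function normalizeOcnCandidate(candidate) {
  const stripped = candidate
    .replace(/^\(OCoLC\)/i, '')
    .replace(/^(ocm|ocn|on)/i, '');
  if (!/^\d+$/.test(stripped)) return null;
  // Drop leading zeros but keep a single digit if the value is all zeros.
  return stripped.replace(/^0+(?=\d)/, '');
}
```

Applied to the OCN column of the example line, `splitOcnField` gives three candidates and `normalizeOcnCandidate` reduces them to 742453, 784466, and 742453. Error reporting would then need to name the candidate as well as the line number.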

