# Source-agnostic ETL pipeline for Bibliometrix-Python (Extract → Transform → Validate + live API) #9

Open
mattiadenicola02 wants to merge 10 commits into PRAISELab-PicusLab:main from mattiadenicola02:etl-source-agnostic
@mattiadenicola02
Summary

This PR makes Bibliometrix-Python source-agnostic, replicating the conceptual robustness of R bibliometrix's convert2df(). It introduces a centralized ETL pipeline that turns heterogeneous bibliographic exports (Web of Science, Scopus, Dimensions, PubMed, Lens, Cochrane) — plus live API queries (OpenAlex, PubMed) — into a single, strictly-typed Web of Science schema that the dashboard and analytical functions can consume without crashing.

Implementation level: Advanced (API retrieval with pagination, rate-limiting and retries, reusing the same transformation pipeline as the file-based path).

Problems in the current implementation that this PR addresses

  • No single entry point like convert2df() → added BibliometrixETL.run() / run_api().
  • Scattered transformation logic → centralized in one transform() method without creating a monolith: Extract, Transform and Validate remain separate, independently testable methods.
  • Weak type enforcement → explicit type contracts (PY as 4-digit string, TC as int, multi-value fields as list[str]).
  • Poor null handling → NaN/None systematically replaced with "" (scalars) or [] (multi-value).
  • Implicit WoS dependency / incomplete column mapping → declarative per-source mapping dictionaries.
  • Non-standard reference/citation parsing → source-specific delimiters (e.g. newline for WoS CR).
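
As a rough illustration of the mapping and type-contract items above, the sketch below renames source columns through a declarative table and then coerces each value. It works on plain Python dicts rather than the DataFrame the pipeline operates on, and the mapping excerpt, the `MULTI_VALUE_TAGS` subset, the `;` default delimiter and the choice of 0 for a missing TC are all assumptions made for the example, not the contents of column_mappings.py or etl_pipeline.py.

```python
import math
import re
from typing import Any, Dict

# Illustrative excerpt only; the real tables live in www/services/column_mappings.py.
COLUMN_MAPPINGS: Dict[str, Dict[str, str]] = {
    "scopus": {"Authors": "AU", "Title": "TI", "Year": "PY", "Cited by": "TC"},
}
MULTI_VALUE_TAGS = {"AU", "CR"}  # assumed subset of the multi-value fields


def is_null(value: Any) -> bool:
    """True for None and float NaN, the two null forms named in the PR."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def coerce(tag: str, value: Any, delimiter: str = ";") -> Any:
    """Apply the type contract for one WoS tag."""
    if tag in MULTI_VALUE_TAGS:
        if is_null(value):
            return []  # multi-value nulls become an empty list
        parts = value if isinstance(value, list) else str(value).split(delimiter)
        return [str(p).strip() for p in parts if not is_null(p) and str(p).strip()]
    if tag == "TC":
        try:
            return int(float(value))
        except (TypeError, ValueError, OverflowError):
            return 0  # assumption: missing or unparseable counts become 0
    if is_null(value):
        return ""  # scalar nulls become an empty string
    if tag == "PY":
        match = re.search(r"\b(\d{4})\b", str(value))
        return match.group(1) if match else ""
    return str(value)


def transform(record: Dict[str, Any], source: str) -> Dict[str, Any]:
    """Rename source columns to WoS tags, then enforce the type contracts."""
    mapping = COLUMN_MAPPINGS[source]
    renamed = {mapping.get(col, col): value for col, value in record.items()}
    return {tag: coerce(tag, value) for tag, value in renamed.items()}
```

Under these assumptions, supporting another database only means adding one more entry to `COLUMN_MAPPINGS`; `transform` itself does not change.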

Architecture

1. Dispatcher: extract() in www/services/etl_pipeline.py routes each (source, file_type) pair to the right parser (reusing the existing www/services/parsers.py), raising clear ValueError/FileNotFoundError/ImportError instead of failing silently.

2. Mapping dictionaries: www/services/column_mappings.py holds one declarative {source_column: WoS_tag} table per database. Adding a new source = appending one sub-dictionary, no other module changes.

3. Type contracts: transform() enforces the schema in 7 documented phases: pre-processing (e.g. Dimensions affiliation extraction, pagination split into BP/EP), SR computation reusing the existing format_functions.format_sr_column (per the brief: SR is not rewritten from scratch), column rename, duplicate-column resolution, mandatory-column presence, type coercion, null cleaning.

4. Validation: www/services/validator.py programmatically verifies: all mandatory columns present, no NaN/None remaining, multi-value columns are list[str].

5. Live API (Advanced), in www/services/api_retriever.py:

  • OpenAlex: paginated /works, exponential backoff on 429/5xx (1-2-4-8-16s, cap 30s), per-page retry budget, abstract reconstruction from the inverted index; already-fetched rows are never dropped on error.
  • PubMed: ESearch + EFetch, MEDLINE written to a race-free tempfile.mkstemp cleaned in finally, then reusing the existing parse_pubmed_data (no duplicated logic).
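
The OpenAlex behaviour described above can be sketched without any network access. The block below is a minimal sketch, not the code in api_retriever.py: the function names, the injected `fetch` callable returning `(status, payload)` and the default retry budget of 5 are assumptions for illustration. Only the delay schedule (1-2-4-8-16s, cap 30s), the retryable statuses (429/5xx) and the word-to-positions shape of OpenAlex's `abstract_inverted_index` come from the description and the public API.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Tuple


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay before retry number `attempt` (0-based): 1, 2, 4, 8, 16, then capped."""
    return min(cap, base * 2 ** attempt)


def fetch_page(
    fetch: Callable[[], Tuple[int, Any]],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Any]:
    """Call `fetch` until it succeeds or the per-page retry budget runs out.

    Returns None when the budget is exhausted, so the caller can stop paging
    and keep the rows it has already collected.
    """
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status == 200:
            return payload
        if status != 429 and not 500 <= status < 600:
            raise RuntimeError(f"non-retryable HTTP status {status}")
        if attempt < max_retries:
            sleep(backoff_delay(attempt))
    return None


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> str:
    """Turn a {word: [positions]} index back into running text."""
    if not inverted_index:
        return ""
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    return " ".join(word for _, word in sorted(positioned))
```

Injecting `fetch` and `sleep` keeps the retry logic testable without real HTTP calls or real waiting.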

Files

New ETL modules (~1,631 lines):

  • www/services/etl_pipeline.py (768) — orchestrator
  • www/services/api_retriever.py (376) — OpenAlex + PubMed clients
  • www/services/column_mappings.py (176) — per-source mapping tables
  • www/services/validator.py (135) — schema validator
  • test_etl_pipeline.py (176) — end-to-end execution evidence harness
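
For orientation, the three checks attributed to validator.py (mandatory columns present, no NaN/None remaining, multi-value columns as list[str]) might look like the sketch below. It validates plain dict records, and the `MANDATORY` and `MULTI_VALUE` lists are assumed examples rather than the schema the PR enforces.

```python
import math
from typing import Any, Dict, List

MANDATORY = ["AU", "TI", "PY", "TC", "SR"]  # assumed subset of the schema
MULTI_VALUE = ["AU", "CR"]                  # assumed multi-value tags


def validate(records: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable violations; an empty list means the data passes."""
    errors: List[str] = []
    for i, rec in enumerate(records):
        for tag in MANDATORY:
            if tag not in rec:
                errors.append(f"row {i}: missing mandatory column {tag}")
        for tag, value in rec.items():
            if value is None or (isinstance(value, float) and math.isnan(value)):
                errors.append(f"row {i}: null value in {tag}")
            elif tag in MULTI_VALUE and not (
                isinstance(value, list) and all(isinstance(v, str) for v in value)
            ):
                errors.append(f"row {i}: {tag} is not list[str]")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a caller report every violation in one pass.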

Debugging / patches applied to existing analytical & service functions (to make them work with non-WoS data instead of assuming WoS-only formats):

  • functions/get_annualproduction.py — robust PY handling across sources
  • functions/get_worldmapcollaboration.py
  • www/services/format_functions.py, histnetwork.py, biblionetwork.py, parsers.py, utils.py
  • requirements.txt — made pywin32 Windows-only (sys_platform == "win32") so pip install no longer fails on macOS/Linux (on Python 3.9–3.12; the pinned scipy/numpy versions have no prebuilt wheels for Python 3.13 yet); pinned kaleido==0.2.1 (the version compatible with the existing plotly to_image/write_image calls).

Execution evidence

End-to-end harness (python test_etl_pipeline.py) over four real source files:

| Source | Rows | PY filled | Assigned functions |
|---|---|---|---|
| Scopus (CSV) | 1000 | 100% | annual_production ✅ · co_citation ✅ · clustering_coupling ✅ |
| Dimensions (XLSX) | 500 | 100% | annual_production ✅ · co_citation N/A* · coupling N/A* |
| PubMed (TXT) | 10000 | 100% | annual_production ✅ · co_citation N/A* · coupling N/A* |
| Web of Science (TXT) | 500 | 100% | annual_production ✅ · co_citation ✅ · clustering_coupling ✅ |

Result: PASS=8 N/A=4 FAIL=0 — all assigned functions run on every source.

* Co-citation and bibliographic coupling are computed from cited references (CR). Dimensions and PubMed exports do not include a reference list, so these networks cannot be built — marked N/A rather than FAIL, consistent with the brief ("assuming the raw data contains the necessary underlying information").
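
The N/A rule in the footnote can be expressed as a small classifier. This is a hypothetical sketch of how a harness could label each (source, function) cell, not the logic in test_etl_pipeline.py; `run_check` and its `requires` parameter are names invented for the example.

```python
from typing import Any, Callable, Dict, List


def run_check(
    records: List[Dict[str, Any]],
    func: Callable[[List[Dict[str, Any]]], Any],
    requires: str = "",
) -> str:
    """Return 'PASS', 'FAIL' or 'N/A' for one (source, function) cell."""
    # If the function needs a field (e.g. CR) that no record carries,
    # the analysis cannot be built at all, so it is not counted as a failure.
    if requires and not any(rec.get(requires) for rec in records):
        return "N/A"
    try:
        func(records)
    except Exception:
        return "FAIL"
    return "PASS"
```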

Dashboard demonstration

The Shiny dashboard (shiny run app.py) starts cleanly and serves HTTP 200, and the standardized DataFrame produced by the ETL allows non-WoS data (e.g. Scopus CSV) to be loaded and analyzed through the UI.

…do==0.2.1

pywin32 is not imported by any module and made pip install fail on
macOS/Linux. Added the PEP 508 marker sys_platform=='win32'.
kaleido (used by plotly to_image/write_image) is pinned to 0.2.1,
the last version with a synchronous API compatible with the existing code.
…non-WoS

Fixes crashes that appeared when importing non-WoS CSVs (Scopus/Dimensions/Lens/
PubMed) and using filters/overview/most-cited.

- utils.normalize_dashboard_types(): coerces PY->nullable Int64 and TC->numeric
  at the data boundary (called in get_data.py, and in app.py for the sample and
  API paths). Eliminates 'unsupported operand type(s) for -: str and str' in
  Main information, Filters, Average citations, Most cited documents, Sources
  production.
- get_filters: NA-safe filter mask (fillna(False)) for missing years.
- get_localcitedsources/references/authors: early-return guard with a placeholder
  when the source has no citations (empty CR/CR_SO). Previously int(NaN) raised
  an exception inside the Shiny render and froze the whole session.
- get_sourcesproduction: drops missing years via a local wrapper (no mutation
  of the shared reactive) before the astype(int) round-trip.
- etl_pipeline: clearer 'Unsupported file type' messages, listing the accepted
  formats for each source (e.g. PubMed only .txt MEDLINE).

Does not touch the ETL/validator/serialization logic: test_etl_pipeline.py remains
PASS=8 N/A=4 FAIL=0. Verified on Scopus, PubMed, WoS, Dimensions, Lens.
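
One way a dispatcher can produce 'Unsupported file type' messages that list the accepted formats per source is sketched below. The registry contents, the `parse_stub` helper and the exact wording are assumptions for illustration; the real routing and parsers live in etl_pipeline.py and parsers.py.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Parser = Callable[[Path], List[Record]]


def parse_stub(path: Path) -> List[Record]:
    """Hypothetical stand-in for a real parser in www/services/parsers.py."""
    return [{"source_file": path.name}]


# Hypothetical registry: (source, file_type) -> parser.
PARSERS: Dict[Tuple[str, str], Parser] = {
    ("scopus", "csv"): parse_stub,
    ("wos", "txt"): parse_stub,
    ("pubmed", "txt"): parse_stub,
}


def extract(source: str, path: str) -> List[Record]:
    """Route a (source, file_type) pair to its parser, failing loudly otherwise."""
    file_path = Path(path)
    file_type = file_path.suffix.lstrip(".").lower()
    key = (source.lower(), file_type)
    if key not in PARSERS:
        accepted = sorted(ft for (src, ft) in PARSERS if src == key[0])
        if not accepted:
            raise ValueError(f"Unsupported source: {source!r}")
        raise ValueError(
            f"Unsupported file type {file_type!r} for {source!r}; "
            f"accepted: {', '.join('.' + ft for ft in accepted)}"
        )
    if not file_path.is_file():
        raise FileNotFoundError(f"No such file: {file_path}")
    return PARSERS[key](file_path)
```

Deriving the accepted formats from the registry keeps the error message in step with whatever parsers are actually registered.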
… poi)

Audit of all analysis functions on 5 sources (Scopus/PubMed/WoS/
Dimensions/Lens): 28 failures -> 0. Causes and fixes:

- get_affiliationproductionovertime: read AU_UN without ever extracting it
  (KeyError on ALL sources). Now extracts AU_UN on demand via
  metaTagExtraction, aligns the years to the retained rows and shows a
  placeholder if C1 is missing.
- get_localcitedauthors / get_localciteddocuments: histNetwork returns
  None for sources without CR (PubMed/Dimensions/Lens) -> 'NoneType not
  subscriptable' crash that froze the UI. Guard if H is None -> placeholder.
- get_referencesspectroscopy: on sources without CR, range(NaN,NaN) raised
  'float cannot be interpreted as integer'. Filters valid years + placeholder.
- get_citedcountries / get_citeddocuments: x/x.max() with max=0 (PubMed TC=0,
  Lens without countries) -> NaN marker sizes. Safe divisor + placeholder if
  the table is empty.
- get_wordcloud / get_trendtopics: keyword field (ID) empty on PubMed/
  Dimensions/Lens -> IndexError / cocMatrix None. Guards -> placeholder.
- get_wordfrequency: called term_extraction for ID/DE too (CountVectorizer
  'empty vocabulary'); now only for TI/AB. keyword_growth: dropna on the years
  and int cast (PY is nullable Int64) + early return if the field is empty.

Environment note: the NLTK corpora (stopwords/punkt/wordnet) were not installed
in the venv -> word_frequency failed everywhere. Downloaded.

Verified: audit of 155 cells = 0 FAIL; test_etl_pipeline.py PASS=8 N/A=4 FAIL=0.
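
The "safe divisor + placeholder" fix for marker sizes can be illustrated in plain Python. This is a sketch under assumptions, not the patched get_citedcountries code: `marker_sizes`, `max_size` and the use of None to signal "render a placeholder" are invented for the example. With pandas, 0/0 yields NaN, while plain Python raises ZeroDivisionError; the guard covers both cases.

```python
from typing import List, Optional, Sequence


def marker_sizes(
    values: Sequence[float], max_size: float = 40.0
) -> Optional[List[float]]:
    """Scale values to marker sizes; None signals 'render a placeholder'."""
    if not values:
        return None  # empty table: nothing to plot
    peak = max(values)
    divisor = peak if peak > 0 else 1  # safe divisor: avoids 0/0
    return [max_size * v / divisor for v in values]
```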
The text-mining tabs (Most Frequent Words, Word Cloud, Tree Map, Word
Frequency, Trend Topics, Thematic Map/Evolution) require the NLTK
stopwords and wordnet corpora, which 'pip install nltk' does NOT download.
On a clean environment those tabs crashed with LookupError.

- www/services/utils._ensure_nltk_resources(): on import, downloads the
  missing resources (stopwords, wordnet, omw-1.4, punkt), unpacks them if
  NLTK leaves only the .zip, and is a no-op if they are already present.
  Fails softly (warning, no crash) if the machine is offline.
- README: note on the NLTK requirement + manual command for offline environments.

Verified from a clean machine (corpora and zips absent) -> the bootstrap downloads
and unpacks correctly; the dashboard starts with HTTP 200.
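
The soft-fail bootstrap pattern can be sketched independently of NLTK by injecting the lookup and download callables. This is a simplified illustration, not `_ensure_nltk_resources()` itself: `ensure_resources` and its parameters are hypothetical, and the unzip step mentioned in the commit is left out.

```python
import warnings
from typing import Any, Callable, Dict, List


def ensure_resources(
    resources: Dict[str, str],
    find: Callable[[str], Any],
    download: Callable[[str], Any],
) -> List[str]:
    """Download whatever `find` cannot locate; return the names still missing.

    `resources` maps a download name to the path `find` should look up.
    """
    still_missing: List[str] = []
    for name, path in resources.items():
        try:
            find(path)
            continue  # already present: nothing to do
        except LookupError:
            pass
        try:
            ok = bool(download(name))
        except OSError:
            ok = False  # e.g. the machine is offline
        if not ok:
            warnings.warn(f"resource {name!r} unavailable; text-mining tabs may fail")
            still_missing.append(name)
    return still_missing
```

With NLTK installed, one could pass `nltk.data.find` as `find` and `lambda name: nltk.download(name, quiet=True)` as `download`, with a mapping such as `{"stopwords": "corpora/stopwords", "wordnet": "corpora/wordnet"}`.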