# Source-agnostic ETL pipeline for Bibliometrix-Python (Extract → Transform → Validate + live API) by mattiadenicola02 · Pull Request #9 · PRAISELab-PicusLab/bibliometrix-python

mattiadenicola02 · 2026-06-02T09:52:48Z

Summary

This PR makes Bibliometrix-Python source-agnostic, replicating the conceptual robustness of R bibliometrix's convert2df(). It introduces a centralized ETL pipeline that turns heterogeneous bibliographic exports (Web of Science, Scopus, Dimensions, PubMed, Lens, Cochrane) — plus live API queries (OpenAlex, PubMed) — into a single, strictly-typed Web of Science schema that the dashboard and analytical functions can consume without crashing.

Implementation level: Advanced (API retrieval with pagination, rate-limiting and retries, reusing the same transformation pipeline as the file-based path).

Problems in the current implementation that this PR addresses

No single entry point like convert2df() → added BibliometrixETL.run() / run_api().
Scattered transformation logic → centralized in one transform() method without creating a monolith: Extract, Transform and Validate remain separate, independently testable methods.
Weak type enforcement → explicit type contracts (PY as 4-digit string, TC as int, multi-value fields as list[str]).
Poor null handling → NaN/None systematically replaced with "" (scalars) or [] (multi-value).
Implicit WoS dependency / incomplete column mapping → declarative per-source mapping dictionaries.
Non-standard reference/citation parsing → source-specific delimiters (e.g. newline for WoS CR).

Architecture

1. Dispatcher — extract() in www/services/etl_pipeline.py routes each (source, file_type) pair to the right parser (reusing the existing www/services/parsers.py), raising clear ValueError/FileNotFoundError/ImportError instead of failing silently.
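The routing idea can be sketched as a lookup table keyed by (source, file_type). This is a minimal illustration, not the code in etl_pipeline.py: the two parser functions are hypothetical stand-ins for the real ones in parsers.py, and the error wording is assumed.

```python
from pathlib import Path
from typing import Callable


# Hypothetical stand-ins for the real parsers in www/services/parsers.py.
def parse_scopus_csv(path: Path) -> list:
    return [{"source": "scopus", "path": str(path)}]


def parse_pubmed_txt(path: Path) -> list:
    return [{"source": "pubmed", "path": str(path)}]


# One entry per supported (source, file_type) pair (illustrative subset).
PARSERS: dict = {
    ("scopus", "csv"): parse_scopus_csv,
    ("pubmed", "txt"): parse_pubmed_txt,
}


def extract(source: str, file_path: str) -> list:
    """Route a file to its parser, failing loudly on unsupported input."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")

    key = (source.lower(), path.suffix.lstrip(".").lower())
    parser: Callable = PARSERS.get(key)
    if parser is None:
        accepted = sorted(ext for src, ext in PARSERS if src == key[0])
        if not accepted:
            raise ValueError(f"Unsupported source: {source!r}")
        raise ValueError(
            f"Unsupported file type {key[1]!r} for {source!r}; accepted: {accepted}"
        )
    return parser(path)
```

With this shape, an unsupported pair such as ("pubmed", "csv") produces a ValueError that names the accepted formats instead of a silent failure.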

2. Mapping dictionaries — www/services/column_mappings.py holds one declarative {source_column: WoS_tag} table per database. Adding a new source = appending one sub-dictionary, no other module changes.
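As an illustration of the declarative approach, a mapping table and a rename helper might look like the sketch below. The column names are assumed examples of Scopus and Dimensions export headers, not a copy of column_mappings.py, and rename_columns is a hypothetical helper operating on plain dicts rather than a DataFrame.

```python
# Illustrative excerpt only; the real tables live in www/services/column_mappings.py.
COLUMN_MAPPINGS = {
    "scopus": {
        "Authors": "AU",
        "Title": "TI",
        "Year": "PY",
        "Source title": "SO",
        "Cited by": "TC",
        "DOI": "DI",
    },
    "dimensions": {
        "Title": "TI",
        "PubYear": "PY",
        "Source title": "SO",
        "Times cited": "TC",
        "DOI": "DI",
    },
}


def rename_columns(record: dict, source: str) -> dict:
    """Rename source-specific keys to WoS tags; unmapped keys pass through (an assumption)."""
    mapping = COLUMN_MAPPINGS[source]
    return {mapping.get(column, column): value for column, value in record.items()}
```

Supporting another database in this sketch means adding one more sub-dictionary to COLUMN_MAPPINGS; rename_columns itself does not change.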

3. Type contracts — transform() enforces the schema in 7 documented phases: pre-processing (e.g. Dimensions affiliation extraction, pagination split into BP/EP), SR computation reusing the existing format_functions.format_sr_column (per the brief: SR is not rewritten from scratch), column rename, duplicate-column resolution, mandatory-column presence, type coercion, null cleaning.
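The type-coercion and null-cleaning phases can be illustrated on a single record. This is a simplified sketch on plain dicts rather than the DataFrame the pipeline works with; the set of multi-value tags, the ";" separator and the default of 0 for a missing TC are assumptions made for the example.

```python
import math
import re

# Assumed subset of multi-value WoS tags, for illustration only.
MULTI_VALUE_TAGS = {"AU", "DE", "ID", "CR"}


def is_missing(value) -> bool:
    return value is None or (isinstance(value, float) and math.isnan(value))


def to_year(value) -> str:
    """Return the first 4-digit run as a string, or "" when none is found."""
    if is_missing(value):
        return ""
    match = re.search(r"\d{4}", str(value))
    return match.group(0) if match else ""


def to_int(value, default: int = 0) -> int:
    if is_missing(value):
        return default
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return default


def to_list(value, sep: str = ";") -> list:
    if isinstance(value, list):
        items = value
    elif is_missing(value):
        return []
    else:
        items = str(value).split(sep)
    return [str(i).strip() for i in items if not is_missing(i) and str(i).strip()]


def coerce_record(record: dict) -> dict:
    """Apply the contracts: PY as 4-digit string, TC as int, multi-value as list[str]."""
    out = {}
    for tag, value in record.items():
        if tag in MULTI_VALUE_TAGS:
            out[tag] = to_list(value)
        elif tag == "PY":
            out[tag] = to_year(value)
        elif tag == "TC":
            out[tag] = to_int(value)
        else:
            out[tag] = "" if is_missing(value) else str(value)
    return out
```

For example, a record with PY read as the float 2020.0 and a NaN keyword field comes out with PY equal to "2020" and an empty keyword list.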

4. Validation — www/services/validator.py programmatically verifies: all mandatory columns present, no NaN/None remaining, multi-value columns are list[str].
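The three checks can be sketched as a function that returns a list of problems, where an empty list means the data passes. The mandatory and multi-value column sets below are assumptions for the example; validator.py defines the real ones and works on a DataFrame.

```python
import math

# Assumed column sets, for illustration only.
MANDATORY = ("AU", "TI", "PY", "SO", "TC", "CR")
MULTI_VALUE = ("AU", "CR")


def validate(records: list) -> list:
    """Return human-readable schema violations; an empty list means valid."""
    errors = []
    for i, rec in enumerate(records):
        # Check 1: all mandatory columns present.
        missing = [col for col in MANDATORY if col not in rec]
        if missing:
            errors.append(f"row {i}: missing columns {missing}")

        # Check 2: no NaN/None remaining.
        for col, value in rec.items():
            if value is None or (isinstance(value, float) and math.isnan(value)):
                errors.append(f"row {i}: null value in {col}")

        # Check 3: multi-value columns are list[str].
        for col in MULTI_VALUE:
            if col not in rec:
                continue
            value = rec[col]
            if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
                errors.append(f"row {i}: {col} is not list[str]")
    return errors
```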

5. Live API (Advanced) — www/services/api_retriever.py:

OpenAlex: paginated /works, exponential backoff on 429/5xx (1-2-4-8-16s, cap 30s), per-page retry budget, abstract reconstruction from the inverted index; already-fetched rows are never dropped on error.
PubMed: ESearch + EFetch, MEDLINE written to a race-free tempfile.mkstemp cleaned in finally, then reusing the existing parse_pubmed_data (no duplicated logic).
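Two of the OpenAlex behaviours above lend themselves to a short sketch: the capped backoff schedule with a per-page retry budget, and rebuilding an abstract from the abstract_inverted_index field (a mapping from each word to its positions). The function names are hypothetical, and the fetch callable is injected so the example does not depend on any HTTP library; api_retriever.py may be organized differently.

```python
import time
from typing import Callable, Optional


def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 30.0) -> list:
    """Delays of 1, 2, 4, 8, 16 seconds, never exceeding the cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def is_retryable(status: int) -> bool:
    return status == 429 or 500 <= status < 600


def fetch_with_retry(
    fetch: Callable[[], tuple],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() -> (status, payload), retrying on 429/5xx.

    Returns None when the budget is exhausted or the error is not retryable,
    so the caller can stop paging and keep the rows fetched so far.
    """
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status == 200:
            return payload
        if not is_retryable(status) or attempt == max_retries:
            return None
        sleep(delays[attempt])
    return None


def reconstruct_abstract(inverted_index: Optional[dict]) -> str:
    """Rebuild plain text from a {word: [positions]} inverted index."""
    if not inverted_index:
        return ""
    placed = [(pos, word) for word, positions in inverted_index.items() for pos in positions]
    return " ".join(word for _, word in sorted(placed))
```

Returning None rather than raising is one way to honour the "already-fetched rows are never dropped" guarantee: the paging loop can break and return what it has collected.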

Files

New ETL modules (~1,631 lines):

www/services/etl_pipeline.py (768) — orchestrator
www/services/api_retriever.py (376) — OpenAlex + PubMed clients
www/services/column_mappings.py (176) — per-source mapping tables
www/services/validator.py (135) — schema validator
test_etl_pipeline.py (176) — end-to-end execution evidence harness

Debugging / patches applied to existing analytical & service functions (to make them work with non-WoS data instead of assuming WoS-only formats):

functions/get_annualproduction.py — robust PY handling across sources
functions/get_worldmapcollaboration.py
www/services/format_functions.py, histnetwork.py, biblionetwork.py, parsers.py, utils.py
requirements.txt — made pywin32 Windows-only (sys_platform == "win32") so that pip install no longer fails on macOS/Linux, and pinned kaleido==0.2.1 (the version compatible with the existing plotly to_image/write_image calls). Installation is expected to work on Python 3.9–3.12; the pinned scipy/numpy versions have no prebuilt wheels for Python 3.13 yet.

Execution evidence

End-to-end harness (python test_etl_pipeline.py) over four real source files:

| Source | Rows | PY filled | Assigned functions |
|---|---|---|---|
| Scopus (CSV) | 1000 | 100% | annual_production ✅ · co_citation ✅ · clustering_coupling ✅ |
| Dimensions (XLSX) | 500 | 100% | annual_production ✅ · co_citation N/A* · coupling N/A* |
| PubMed (TXT) | 10000 | 100% | annual_production ✅ · co_citation N/A* · coupling N/A* |
| Web of Science (TXT) | 500 | 100% | annual_production ✅ · co_citation ✅ · clustering_coupling ✅ |

Result: PASS=8 N/A=4 FAIL=0. Every assigned function runs on every source whose data can support it.

* Co-citation and bibliographic coupling are computed from cited references (CR). Dimensions and PubMed exports do not include a reference list, so these networks cannot be built — marked N/A rather than FAIL, consistent with the brief ("assuming the raw data contains the necessary underlying information").
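The PASS / N/A / FAIL distinction can be expressed as a small guard. This is a hypothetical sketch of how a harness might classify a reference-based analysis, assuming records are dicts whose CR field is a list of cited-reference strings; it is not taken from test_etl_pipeline.py.

```python
from typing import Callable


def has_cited_references(records: list) -> bool:
    """True when at least one record carries a non-empty CR list."""
    return any(rec.get("CR") for rec in records)


def run_reference_analysis(func: Callable, records: list) -> str:
    """Classify a CR-dependent analysis as PASS, N/A or FAIL."""
    if not has_cited_references(records):
        # The raw data lacks the underlying information, so this is not a failure.
        return "N/A"
    try:
        func(records)
    except Exception:
        return "FAIL"
    return "PASS"
```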

Dashboard demonstration

The Shiny dashboard (shiny run app.py) starts cleanly and serves HTTP 200, and the standardized DataFrame produced by the ETL allows non-WoS data (e.g. Scopus CSV) to be loaded and analyzed through the UI.

…penAlex

…do==0.2.1

pywin32 is not imported by any module, and on macOS/Linux it made pip install fail. Added the PEP 508 marker sys_platform=='win32'. kaleido (used by plotly to_image/write_image) is pinned to 0.2.1, the last version with a synchronous API compatible with the existing code.

…non-WoS

Fixes crashes that appeared when importing non-WoS CSV files (Scopus/Dimensions/Lens/PubMed) and using filters/overview/most-cited.
- utils.normalize_dashboard_types(): coerces PY to nullable Int64 and TC to numeric at the data boundary (called in get_data.py, and in app.py for the sample and API paths). Eliminates 'unsupported operand type(s) for -: str and str' in Main information, Filters, Average citations, Most cited documents, Sources production.
- get_filters: NA-safe filter mask (fillna(False)) for missing years.
- get_localcitedsources/references/authors: early-return guard with a placeholder when the source has no citations (CR/CR_SO empty). Previously int(NaN) raised an exception inside the Shiny render and froze the whole session.
- get_sourcesproduction: drops missing years through a local wrapper (no mutation of the shared reactive) before the astype(int) round-trip.
- etl_pipeline: clearer 'Unsupported file type' messages that list the accepted formats for each source (e.g. PubMed accepts only .txt MEDLINE).
Does not touch the ETL/validator/serialization logic: test_etl_pipeline.py remains PASS=8 N/A=4 FAIL=0. Verified on Scopus, PubMed, WoS, Dimensions, Lens.

… poi)

Audit of all analysis functions on 5 sources (Scopus/PubMed/WoS/Dimensions/Lens): 28 failures -> 0. Causes and fixes:
- get_affiliationproductionovertime: read AU_UN without ever extracting it (KeyError on ALL sources). It now extracts AU_UN on demand via metaTagExtraction, aligns the years with the retained rows, and shows a placeholder if C1 is missing.
- get_localcitedauthors / get_localciteddocuments: histNetwork returns None for sources without CR (PubMed/Dimensions/Lens) -> 'NoneType not subscriptable' crash that froze the UI. Guard: if H is None -> placeholder.
- get_referencesspectroscopy: on sources without CR, range(NaN, NaN) raised 'float cannot be interpreted as integer'. Filters valid years + placeholder.
- get_citedcountries / get_citeddocuments: x/x.max() with max=0 (PubMed TC=0, Lens without countries) -> NaN marker sizes. Safe divisor + placeholder if the table is empty.
- get_wordcloud / get_trendtopics: empty keyword field (ID) on PubMed/Dimensions/Lens -> IndexError / cocMatrix None. Guards -> placeholder.
- get_wordfrequency: called term_extraction for ID/DE as well (CountVectorizer 'empty vocabulary'); now only for TI/AB. keyword_growth: dropna on years and int cast (PY is nullable Int64) + early return if the field is empty.
Environment note: the NLTK corpora (stopwords/punkt/wordnet) were not installed in the venv -> word_frequency failed everywhere. Downloaded.
Verified: 155-cell audit = 0 FAIL; test_etl_pipeline.py PASS=8 N/A=4 FAIL=0.
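The safe-divisor fix for marker sizes (x/x.max() with max=0) can be illustrated in isolation. This is a hypothetical sketch on plain lists; the function name and the size range are assumptions, and the patched functions operate on pandas objects instead.

```python
def scale_marker_sizes(values: list, min_size: float = 4.0, max_size: float = 40.0) -> list:
    """Scale counts to marker sizes without dividing by zero.

    An empty input returns an empty list so the caller can show a placeholder.
    """
    if not values:
        return []
    peak = max(values)
    # When every value is 0 (e.g. TC=0 everywhere), divide by 1 instead of 0.
    divisor = peak if peak > 0 else 1
    return [min_size + (max_size - min_size) * v / divisor for v in values]
```

With all-zero citation counts, every marker gets the minimum size rather than NaN.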

…new, unused)

…rted

The text-mining tabs (Most Frequent Words, Word Cloud, Tree Map, Word Frequency, Trend Topics, Thematic Map/Evolution) require the NLTK stopwords and wordnet corpora, which 'pip install nltk' does NOT download. On a clean environment those tabs crashed with LookupError.
- www/services/utils._ensure_nltk_resources(): at import time it downloads the missing resources (stopwords, wordnet, omw-1.4, punkt), unpacks them if NLTK leaves only the .zip, and is a no-op if they are already present. It fails softly (warning, not crash) if the machine is offline.
- README: note on the NLTK requirement + manual command for offline environments.
Verified from a clean machine (corpora and zips absent) -> the bootstrap downloads and unpacks correctly; the dashboard starts with HTTP 200.
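A bootstrap of this kind might look like the sketch below. It is an approximation under stated assumptions, not the body of _ensure_nltk_resources(): it relies on nltk.data.find raising LookupError for a missing resource and on nltk.download returning a boolean, it omits the unzip step, and the find/download parameters are hypothetical hooks added so the logic can be exercised offline.

```python
import warnings

# Resource name -> path checked with nltk.data.find.
RESOURCES = {
    "stopwords": "corpora/stopwords",
    "wordnet": "corpora/wordnet",
    "omw-1.4": "corpora/omw-1.4",
    "punkt": "tokenizers/punkt",
}


def ensure_nltk_resources(find=None, download=None) -> list:
    """Download missing NLTK resources; warn instead of raising on failure.

    Returns the names that were downloaded during this call.
    """
    if find is None or download is None:
        try:
            import nltk
        except ImportError:
            warnings.warn("nltk is not installed; text-mining tabs will not work")
            return []
        find = find or nltk.data.find
        download = download or (lambda name: nltk.download(name, quiet=True))

    fetched = []
    for name, path in RESOURCES.items():
        try:
            find(path)
            continue  # already present: no-op
        except LookupError:
            pass
        try:
            if download(name):
                fetched.append(name)
            else:
                warnings.warn(f"Could not download NLTK resource {name!r}")
        except Exception as exc:  # e.g. offline machine
            warnings.warn(f"Could not download NLTK resource {name!r}: {exc}")
    return fetched
```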

mattiadenicola02 added 10 commits May 30, 2026 15:14

Add source-agnostic ETL pipeline: Scopus, Dimensions, PubMed, Lens, O…

9e46520

…penAlex

fix

c794c86

chore: remove working files (requirements backup and datasets sources/…

dc011fb

…new, unused)

docs(readme): mark Scopus/PubMed/Dimensions/Lens/Cochrane as suppo…

09077c1

…rted

PR body updated

d756834

app bug fixes

6f52326

mattiadenicola02 wants to merge 10 commits into PRAISELab-PicusLab:main from mattiadenicola02:etl-source-agnostic
