# Source-agnostic ETL pipeline for Bibliometrix-Python (Extract → Transform → Validate + live API) #9
Open
mattiadenicola02 wants to merge 10 commits into
…do==0.2.1 pywin32 is not imported by any module, and on macOS/Linux it made pip install fail. Added the PEP 508 marker sys_platform=='win32'. kaleido (used by plotly to_image/write_image) is pinned to 0.2.1, the last version with a synchronous API compatible with the existing code.
…non-WoS
Fixes crashes that appeared when importing non-WoS CSVs (Scopus/Dimensions/Lens/PubMed) and using filters/overview/most-cited.
- utils.normalize_dashboard_types(): coerces PY to nullable Int64 and TC to numeric at the data boundary (called in get_data.py, and in app.py for the sample and API paths). Eliminates 'unsupported operand type(s) for -: str and str' in Main information, Filters, Average citations, Most cited documents, Sources production.
- get_filters: NA-safe filter mask (fillna(False)) for missing years.
- get_localcitedsources/references/authors: early-return guard with a placeholder when the source has no citations (empty CR/CR_SO). Previously int(NaN) raised an exception inside the Shiny render and froze the whole session.
- get_sourcesproduction: drops missing years via a local wrapper (no mutation of the shared reactive) before the astype(int) round trip.
- etl_pipeline: clearer 'Unsupported file type' messages that list the accepted formats for each source (e.g. PubMed accepts only MEDLINE .txt).
Does not touch the ETL/validator/serialization logic: test_etl_pipeline.py still reports PASS=8 N/A=4 FAIL=0. Verified on Scopus, PubMed, WoS, Dimensions, Lens.
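The boundary coercion and NA-safe year filter described in this commit can be illustrated without pandas. The following is a minimal sketch on plain dicts, not the code of utils.normalize_dashboard_types(): the helper names coerce_number, coerce_year, normalize_record and filter_by_year are hypothetical, and the actual change operates on a DataFrame with a nullable Int64 PY column.

```python
import math
from typing import Dict, List, Optional


def coerce_number(value) -> Optional[float]:
    """Return value as a float, or None when it is missing or unparseable."""
    if value is None:
        return None
    try:
        number = float(str(value).strip())
    except ValueError:
        return None
    if math.isnan(number) or math.isinf(number):
        return None
    return number


def coerce_year(value) -> Optional[int]:
    """Return the publication year as an int, or None when it is missing."""
    number = coerce_number(value)
    return None if number is None else int(number)


def normalize_record(record: Dict[str, object]) -> Dict[str, object]:
    """Coerce PY and TC once, at the data boundary (returns a new dict)."""
    return dict(record,
                PY=coerce_year(record.get("PY")),
                TC=coerce_number(record.get("TC")))


def filter_by_year(records: List[dict], start: int, end: int) -> List[dict]:
    """NA-safe filter: a record with a missing year never matches."""
    return [r for r in records
            if r["PY"] is not None and start <= r["PY"] <= end]
```

With the types fixed at the boundary, arithmetic such as `max(years) - min(years)` operates on integers rather than strings, which is the class of error the commit message quotes.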
… poi)
Audit of all analysis functions across 5 sources (Scopus/PubMed/WoS/Dimensions/Lens): 28 failures -> 0. Causes and fixes:
- get_affiliationproductionovertime: read AU_UN without ever extracting it (KeyError on ALL sources). It now extracts AU_UN on demand via metaTagExtraction, aligns the years to the retained rows, and shows a placeholder if C1 is missing.
- get_localcitedauthors / get_localciteddocuments: histNetwork returns None for sources without CR (PubMed/Dimensions/Lens) -> 'NoneType not subscriptable' crash that froze the UI. Guard: if H is None -> placeholder.
- get_referencesspectroscopy: on sources without CR, range(NaN, NaN) raised 'float cannot be interpreted as integer'. Filters valid years + placeholder.
- get_citedcountries / get_citeddocuments: x/x.max() with max=0 (PubMed TC=0, Lens without countries) -> NaN marker sizes. Safe divisor + placeholder if the table is empty.
- get_wordcloud / get_trendtopics: keyword field (ID) empty on PubMed/Dimensions/Lens -> IndexError / cocMatrix None. Guards -> placeholder.
- get_wordfrequency: called term_extraction for ID/DE as well (CountVectorizer 'empty vocabulary'); now only for TI/AB. keyword_growth: dropna on the years and int cast (PY is nullable Int64) + early return if the field is empty.
Environment note: the NLTK corpora (stopwords/punkt/wordnet) were not installed in the venv -> word_frequency failed everywhere. Downloaded.
Verified: audit of 155 cells = 0 FAIL; test_etl_pipeline.py PASS=8 N/A=4 FAIL=0.
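Two of the guard patterns listed in this commit, the None check before rendering and the safe divisor for marker sizes, might look roughly like the sketch below. It is an illustration only: PLACEHOLDER, render_or_placeholder and marker_sizes are hypothetical names, and the real functions work on DataFrames and Plotly figures rather than plain lists.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical placeholder message shown instead of a plot or table.
PLACEHOLDER = "No data available for this source"


def render_or_placeholder(result: Optional[object],
                          render: Callable[[object], object]) -> object:
    """Return a placeholder instead of subscripting a None result."""
    if result is None:
        return PLACEHOLDER
    return render(result)


def marker_sizes(values: Sequence[float], max_size: float = 40.0) -> List[float]:
    """Scale values to marker sizes, avoiding 0/0 when every value is zero."""
    if not values:
        return []
    divisor = max(values) or 1.0  # safe divisor: fall back to 1 when max is 0
    return [max_size * value / divisor for value in values]
```

When every citation count is zero the sizes come out as zeros instead of NaN, so the plot can still be drawn or replaced by the placeholder.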
The text-mining tabs (Most Frequent Words, Word Cloud, Tree Map, Word Frequency, Trend Topics, Thematic Map/Evolution) require the NLTK stopwords and wordnet corpora, which 'pip install nltk' does NOT download. On a clean environment those tabs crashed with LookupError.
- www/services/utils._ensure_nltk_resources(): at import time it downloads the missing resources (stopwords, wordnet, omw-1.4, punkt), unpacks them if NLTK leaves only the .zip, and is a no-op if they are already present. It fails softly (warning, not crash) if the machine is offline.
- README: note on the NLTK requirement + manual command for offline environments.
Verified from a clean machine (corpus and zip absent) -> the bootstrap downloads and unpacks correctly; the dashboard starts with HTTP 200.
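A soft-failing bootstrap of this kind could be sketched as follows. This is not the body of _ensure_nltk_resources(): the helper ensure_resources and its injectable find/download parameters are assumptions made so the logic can be exercised without a network, and the .zip unpacking step is omitted. Only nltk.data.find (which raises LookupError) and nltk.download (which returns a bool) are real NLTK calls.

```python
import warnings
from typing import Callable, Dict, List

# Resource name -> path understood by nltk.data.find().
NLTK_RESOURCES: Dict[str, str] = {
    "stopwords": "corpora/stopwords",
    "wordnet": "corpora/wordnet",
    "omw-1.4": "corpora/omw-1.4",
    "punkt": "tokenizers/punkt",
}


def ensure_resources(find: Callable[[str], object],
                     download: Callable[[str], bool],
                     resources: Dict[str, str] = NLTK_RESOURCES) -> List[str]:
    """Download what `find` cannot locate; return the names still missing."""
    still_missing: List[str] = []
    for name, path in resources.items():
        try:
            find(path)  # already installed: no-op for this resource
            continue
        except LookupError:
            pass
        try:
            ok = bool(download(name))
        except Exception:  # e.g. offline machine: fail softly
            ok = False
        if not ok:
            still_missing.append(name)
    if still_missing:
        warnings.warn("NLTK resources unavailable: " + ", ".join(still_missing))
    return still_missing


def ensure_nltk_resources() -> List[str]:
    """Wire the helper to NLTK itself (requires the nltk package)."""
    import nltk
    return ensure_resources(nltk.data.find,
                            lambda name: nltk.download(name, quiet=True))
```

Calling ensure_nltk_resources() once at import time would give the no-op-when-present and warn-when-offline behaviour the commit describes.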
Summary
This PR makes Bibliometrix-Python source-agnostic, replicating the conceptual robustness of R bibliometrix's convert2df(). It introduces a centralized ETL pipeline that turns heterogeneous bibliographic exports (Web of Science, Scopus, Dimensions, PubMed, Lens, Cochrane), plus live API queries (OpenAlex, PubMed), into a single, strictly-typed Web of Science schema that the dashboard and analytical functions can consume without crashing.
Implementation level: Advanced (API retrieval with pagination, rate-limiting and retries, reusing the same transformation pipeline as the file-based path).
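The retry behaviour this PR describes for the live API path (exponential backoff on 429/5xx, 1-2-4-8-16 s with a 30 s cap, a per-page retry budget, and no loss of already-fetched rows) can be sketched as below. This is a hedged illustration, not the contents of api_retriever.py: backoff_delays, fetch_page and fetch_all are hypothetical names, the request callables stand in for real HTTP calls, and the "results" key is an assumption about the page payload.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple

Request = Callable[[], Tuple[int, Optional[dict]]]  # returns (status, payload)


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Delays of base * 2**attempt seconds, each capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def fetch_page(request: Request, retries: int = 5,
               sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Call `request` until it succeeds or the per-page retry budget is spent."""
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        status, payload = request()
        if status == 200:
            return payload
        retryable = status == 429 or 500 <= status < 600
        if not retryable or attempt == retries:
            return None  # give up on this page
        sleep(delays[attempt])
    return None


def fetch_all(page_requests: Iterable[Request],
              sleep: Callable[[float], None] = time.sleep) -> List[object]:
    """Collect rows page by page; rows fetched before a failure are kept."""
    rows: List[object] = []
    for request in page_requests:
        payload = fetch_page(request, sleep=sleep)
        if payload is None:
            break  # stop paginating, but return what was already fetched
        rows.extend(payload.get("results", []))
    return rows
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production client would leave the default time.sleep in place.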
Problems in the current implementation that this PR addresses
- … convert2df() → added BibliometrixETL.run() / run_api().
- … transform() method, no monolith (Extract / Transform / Validate are separate, independently testable methods).
- … list[str]).
- … "" (scalars) or [] (multi-value).
- … CR).
Architecture
1. Dispatcher: extract() in www/services/etl_pipeline.py routes each (source, file_type) pair to the right parser (reusing the existing www/services/parsers.py), raising clear ValueError / FileNotFoundError / ImportError instead of failing silently.
2. Mapping dictionaries: www/services/column_mappings.py holds one declarative {source_column: WoS_tag} table per database. Adding a new source = appending one sub-dictionary, no other module changes.
3. Type contracts: transform() enforces the schema in 7 documented phases: pre-processing (e.g. Dimensions affiliation extraction, pagination split into BP/EP), SR computation reusing the existing format_functions.format_sr_column (per the brief: SR is not rewritten from scratch), column rename, duplicate-column resolution, mandatory-column presence, type coercion, null cleaning.
4. Validation: www/services/validator.py programmatically verifies: all mandatory columns present, no NaN/None remaining, multi-value columns are list[str].
5. Live API (Advanced): www/services/api_retriever.py:
- … /works, exponential backoff on 429/5xx (1-2-4-8-16s, cap 30s), per-page retry budget, abstract reconstruction from the inverted index; already-fetched rows are never dropped on error.
- … tempfile.mkstemp cleaned in finally, then reusing the existing parse_pubmed_data (no duplicated logic).
Files
New ETL modules (~1,631 lines):
- www/services/etl_pipeline.py (768): orchestrator
- www/services/api_retriever.py (376): OpenAlex + PubMed clients
- www/services/column_mappings.py (176): per-source mapping tables
- www/services/validator.py (135): schema validator
- test_etl_pipeline.py (176): end-to-end execution evidence harness
Debugging / patches applied to existing analytical & service functions (to make them work with non-WoS data instead of assuming WoS-only formats):
- functions/get_annualproduction.py: robust PY handling across sources
- functions/get_worldmapcollaboration.py
- www/services/format_functions.py, histnetwork.py, biblionetwork.py, parsers.py, utils.py
- requirements.txt: made pywin32 Windows-only (sys_platform == "win32") so pip install no longer fails on macOS/Linux (on Python 3.9–3.12; the pinned scipy/numpy versions have no prebuilt wheels for Python 3.13 yet); pinned kaleido==0.2.1 (the version compatible with the existing plotly to_image/write_image calls).
Execution evidence
End-to-end harness (python test_etl_pipeline.py) over four real source files:
Result: PASS=8 N/A=4 FAIL=0; all assigned functions run on every source.
* Co-citation and bibliographic coupling are computed from cited references (CR). Dimensions and PubMed exports do not include a reference list, so these networks cannot be built; they are marked N/A rather than FAIL, consistent with the brief ("assuming the raw data contains the necessary underlying information").
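The PASS / N/A / FAIL distinction can be made explicit in a harness along the following lines. This is a sketch under assumptions, not the code of test_etl_pipeline.py: run_cell, has_cited_references and tally are hypothetical helpers, and records are modelled as plain dicts whose CR field is a list of strings.

```python
from typing import Callable, Dict, List


def has_cited_references(records: List[dict]) -> bool:
    """True when at least one record carries a non-empty CR list."""
    return any(record.get("CR") for record in records)


def run_cell(func: Callable[[List[dict]], object],
             records: List[dict],
             needs_cr: bool = False) -> str:
    """Classify one (function, source) cell as PASS, N/A or FAIL."""
    if needs_cr and not has_cited_references(records):
        return "N/A"  # the raw data lacks the underlying information
    try:
        func(records)
    except Exception:
        return "FAIL"
    return "PASS"


def tally(outcomes: List[str]) -> Dict[str, int]:
    """Count outcomes per label, in the order used by the report line."""
    return {label: outcomes.count(label) for label in ("PASS", "N/A", "FAIL")}
```

Checking for CR before calling the function keeps a missing reference list from being reported as a defect in the analysis code.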
Dashboard demonstration
The Shiny dashboard (shiny run app.py) starts cleanly and serves HTTP 200, and the standardized DataFrame produced by the ETL allows non-WoS data (e.g. Scopus CSV) to be loaded and analyzed through the UI.
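To make the standardization step concrete, here is a small pandas-free sketch of how one Scopus-style row might be renamed to WoS tags, null-cleaned and validated. Everything in it is an assumption for illustration: the SCOPUS_TO_WOS excerpt, the ";" author separator and the transform/validate helpers are hypothetical and much simpler than the 7-phase transform() and the validator in this PR, and scalar values are kept as strings rather than being fully typed.

```python
from typing import Dict, List

# Hypothetical excerpt of a {source_column: WoS_tag} mapping table.
SCOPUS_TO_WOS: Dict[str, str] = {
    "Authors": "AU",
    "Title": "TI",
    "Year": "PY",
    "Cited by": "TC",
    "Source title": "SO",
}
MULTI_VALUE = {"AU"}  # columns that must end up as list[str]
MANDATORY = ["AU", "TI", "PY", "TC", "SO"]


def transform(row: Dict[str, object]) -> Dict[str, object]:
    """Rename columns, then replace nulls with "" (scalars) or [] (multi-value)."""
    out: Dict[str, object] = {}
    for source_column, tag in SCOPUS_TO_WOS.items():
        value = row.get(source_column)
        if tag in MULTI_VALUE:
            parts = str(value).split(";") if value else []
            out[tag] = [part.strip() for part in parts if part.strip()]
        else:
            out[tag] = "" if value is None else str(value).strip()
    return out


def validate(record: Dict[str, object]) -> List[str]:
    """Return a list of schema violations; an empty list means valid."""
    errors: List[str] = []
    for tag in MANDATORY:
        if tag not in record:
            errors.append(f"missing column {tag}")
            continue
        value = record[tag]
        if value is None:
            errors.append(f"null in {tag}")
        elif tag in MULTI_VALUE and not (
            isinstance(value, list) and all(isinstance(v, str) for v in value)
        ):
            errors.append(f"{tag} is not list[str]")
    return errors
```

A record that passes validate() has every mandatory tag present, no nulls, and multi-value fields as lists of strings, which are the three properties the PR's validator is described as checking.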