Our Methodology
Transparent documentation of how we map global biomedical research networks using PubMed data
Built on Trusted Data Sources
All our data comes from PubMed, the U.S. National Library of Medicine's authoritative database containing over 36 million biomedical citations. We do not modify or editorialize the research data; we only organize and visualize it geographically.
1. Data Sources
Primary Source: PubMed
PubMed is maintained by the U.S. National Library of Medicine (NLM) and provides free access to MEDLINE, the premier bibliographic database of life sciences and biomedical information.
Coverage Scope
- Biomedical & Life Sciences
- Medicine & Clinical Research
- Neuroscience & Psychology
- Pharmacology & Toxicology
- Public Health & Epidemiology
Time Range
We analyze publications from 2020 to 2025 to ensure data reflects current research activity and institutional affiliations.
2. Data Collection Method
Query Construction
For each research field (e.g., "CRISPR Gene Editing"), we construct targeted search queries using field-specific keywords and MeSH (Medical Subject Headings) terms.
Example query for CRISPR research:
(CRISPR OR "CRISPR-Cas9" OR "gene editing" OR "genome editing") AND ("2020"[Date - Publication] : "2025"[Date - Publication])
Paper Retrieval
Using the official PubMed E-utilities API, we retrieve all matching publications including metadata such as title, authors, affiliations, publication date, and journal information.
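As a rough illustration of the query construction and retrieval steps above, the following sketch assembles a query string in the same shape as the CRISPR example and builds an ESearch request URL for it. The helper names (`build_query`, `build_esearch_url`) and the quoting rule are assumptions made for this example, not a description of our production code. The endpoint and the `db`, `term`, `retmode`, `retmax`, and `usehistory` parameters are standard E-utilities parameters. No network request is made here.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(keywords, start_year, end_year):
    """Join keywords with OR and restrict results to a publication-date range.

    Hypothetical helper: keywords containing a space or hyphen are quoted so
    that PubMed treats them as phrases.
    """
    terms = " OR ".join(
        f'"{k}"' if (" " in k or "-" in k) else k for k in keywords
    )
    date_filter = (
        f'"{start_year}"[Date - Publication] : "{end_year}"[Date - Publication]'
    )
    return f"({terms}) AND ({date_filter})"


def build_esearch_url(query, retmax=100):
    """Return an ESearch URL for the query (hypothetical helper, no I/O)."""
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": query,       # the search expression
        "retmode": "json",   # ask for a JSON response
        "retmax": retmax,    # number of IDs returned per request
        "usehistory": "y",   # keep results on the history server for paging
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = build_query(
    ["CRISPR", "CRISPR-Cas9", "gene editing", "genome editing"], 2020, 2025
)
url = build_esearch_url(query)
```

The IDs returned by ESearch would then be passed to EFetch to download the full records. Large result sets are normally paged rather than fetched in one call.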
Author Identification
From each paper's metadata, we extract all listed authors and their institutional affiliations as provided in the publication. We count each unique author-affiliation combination as a research contributor.
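One way to picture the author-affiliation counting described above is the sketch below, which reads author and affiliation elements from PubMed-style XML and collects unique (author, affiliation) pairs. The sample records, names, and PMIDs are made up for illustration, and `extract_contributors` is a hypothetical helper. The element nesting (`AuthorList` / `Author` / `AffiliationInfo` / `Affiliation`) follows the structure of PubMed's XML export.

```python
import xml.etree.ElementTree as ET

# Made-up records that mimic the nesting of PubMed XML; all values are placeholders.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID><Article><AuthorList>
    <Author><LastName>Doe</LastName><ForeName>Jane</ForeName>
      <AffiliationInfo><Affiliation>Department of Biology, Harvard University, Cambridge, MA, USA</Affiliation></AffiliationInfo>
    </Author>
    <Author><LastName>Roe</LastName><ForeName>John</ForeName>
      <AffiliationInfo><Affiliation>Example Institute, Example City, Exampleland</Affiliation></AffiliationInfo>
    </Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID><Article><AuthorList>
    <Author><LastName>Doe</LastName><ForeName>Jane</ForeName>
      <AffiliationInfo><Affiliation>Department of Biology, Harvard University, Cambridge, MA, USA</Affiliation></AffiliationInfo>
    </Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>
"""


def extract_contributors(xml_text):
    """Return the set of unique (author name, affiliation) pairs in the XML."""
    root = ET.fromstring(xml_text.strip())
    contributors = set()
    for author in root.iter("Author"):
        last = author.findtext("LastName")
        if not last:
            continue  # skip entries without a personal name, e.g. group authors
        fore = author.findtext("ForeName", default="")
        name = f"{fore} {last}".strip()
        for aff in author.findall("AffiliationInfo/Affiliation"):
            text = (aff.text or "").strip()
            if text:
                contributors.add((name, text))
    return contributors


contributors = extract_contributors(SAMPLE_XML)
```

Because a set is used, the same author with the same affiliation string is counted once even when it appears on several papers.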
3. Geographic Information Extraction
Institution Identification
We parse author affiliation strings (e.g., "Department of Biology, Harvard University, Cambridge, MA, USA") to extract institution names, departments, and location information.
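A simplified sketch of this kind of parsing is shown below. It splits the affiliation on commas and uses keyword hints to label each part. The hint lists and the `parse_affiliation` function are assumptions for illustration only. Real affiliation strings are far messier than this heuristic can handle.

```python
# Hypothetical keyword hints; a real parser would need a much richer vocabulary.
DEPARTMENT_HINTS = ("department", "dept", "division", "laboratory")
INSTITUTION_HINTS = ("university", "institute", "hospital", "college", "school")


def parse_affiliation(affiliation):
    """Split a comma-separated affiliation into labelled parts (heuristic)."""
    parts = [p.strip() for p in affiliation.strip().rstrip(".").split(",")]
    parts = [p for p in parts if p]
    result = {"department": None, "institution": None, "location": [], "country": None}
    if not parts:
        return result
    if len(parts) > 1:
        # Assumption: the final comma-separated part names the country.
        result["country"] = parts.pop()
    for part in parts:
        lowered = part.lower()
        if result["department"] is None and lowered.startswith(DEPARTMENT_HINTS):
            result["department"] = part
        elif result["institution"] is None and any(h in lowered for h in INSTITUTION_HINTS):
            result["institution"] = part
        else:
            # Anything else is kept as location text (city, state, postcode).
            result["location"].append(part)
    return result


parsed = parse_affiliation(
    "Department of Biology, Harvard University, Cambridge, MA, USA"
)
```

For the example string, this yields the department "Department of Biology", the institution "Harvard University", the location parts "Cambridge" and "MA", and the country "USA".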
Geographic Coding
Extracted institutions are geocoded to determine their city and country. This process involves:
- Parsing location information from affiliation strings
- Standardizing institution names (e.g., "MIT" → "Massachusetts Institute of Technology")
- Mapping institutions to cities using geographic databases
- Resolving ambiguous cases through multiple data sources
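The standardization and city-mapping steps in the list above might look like the following minimal sketch. The alias table and the two-entry gazetteer are tiny hypothetical stand-ins for the much larger lookup data a real pipeline would need, and `standardize` and `geocode` are illustrative names.

```python
# Hypothetical alias table: normalized short form -> canonical institution name.
ALIASES = {
    "mit": "Massachusetts Institute of Technology",
}

# Hypothetical gazetteer: canonical institution name -> (city, country).
GAZETTEER = {
    "Massachusetts Institute of Technology": ("Cambridge", "United States"),
    "Harvard University": ("Cambridge", "United States"),
}


def standardize(name):
    """Map a raw institution name to its canonical form when an alias is known."""
    key = " ".join(name.lower().replace(".", "").split())
    return ALIASES.get(key, name.strip())


def geocode(name):
    """Return (canonical name, (city, country)), or None for the location if unknown."""
    canonical = standardize(name)
    return canonical, GAZETTEER.get(canonical)


known = geocode("M.I.T.")
unknown = geocode("Unknown College")  # location is None, so it needs another source
```

Institutions that come back without a location are the ambiguous cases that have to be resolved through additional data sources.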
Validation Mechanism
We cross-reference institution locations using multiple sources including institutional databases, GeoNames, and manual verification for high-volume institutions. Ambiguous cases are flagged for review.
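To illustrate the cross-referencing idea, the sketch below compares the location reported for one institution by several sources and flags the record for review unless at least two sources agree and none disagree. The agreement rule, the source labels, and the `reconcile` function are assumptions chosen for this example, not a statement of the exact rule we apply.

```python
from collections import Counter


def reconcile(candidates):
    """Compare per-source locations for one institution.

    `candidates` maps a source label to a (city, country) tuple, or to None
    when that source has no answer. Returns (location, needs_review).
    """
    votes = Counter(loc for loc in candidates.values() if loc is not None)
    if not votes:
        return None, True  # no source could place the institution
    top, count = votes.most_common(1)[0]
    # Flag for review when sources disagree or only one source answered.
    needs_review = len(votes) > 1 or count < 2
    return top, needs_review


agreed = reconcile({
    "institutional_db": ("Cambridge", "United States"),
    "geonames": ("Cambridge", "United States"),
})
conflict = reconcile({
    "institutional_db": ("Cambridge", "United States"),
    "geonames": ("Cambridge", "United Kingdom"),
})
```

In this sketch the two Cambridges disagree, so the second record is flagged rather than resolved automatically.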
4. Data Processing Flow
Our data pipeline processes publications through the following stages:
- PubMed Query: Search for publications by field and date range
- Paper Retrieval: Fetch publication metadata via API
- Author Extraction: Parse author names and affiliations
- Institution Recognition: Identify and standardize institution names
- Geographic Coding: Map institutions to cities and countries
- Aggregation & Statistics: Count researchers by location and field
- Quality Validation: Verify data accuracy and completeness
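The aggregation stage, for instance, could be sketched as follows. It counts distinct researchers per field and location from already geocoded records. The record layout and the `count_researchers` function are assumptions for illustration, and the sample rows are made up.

```python
from collections import defaultdict


def count_researchers(records):
    """Count distinct researchers per (field, city, country).

    `records` is an iterable of (researcher_id, field, city, country) tuples.
    """
    seen = defaultdict(set)
    for researcher_id, field, city, country in records:
        seen[(field, city, country)].add(researcher_id)
    return {key: len(people) for key, people in seen.items()}


# Made-up sample rows; the same researcher on two papers is counted once.
records = [
    ("jane-doe", "CRISPR Gene Editing", "Cambridge", "United States"),
    ("jane-doe", "CRISPR Gene Editing", "Cambridge", "United States"),
    ("john-roe", "CRISPR Gene Editing", "Cambridge", "United States"),
    ("ann-poe", "CRISPR Gene Editing", "Heidelberg", "Germany"),
]
counts = count_researchers(records)
```

Collecting researcher identifiers in a set per location keeps repeat publications from inflating the totals.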
5. Update Frequency
Data Synchronization
Our data stays synchronized with PubMed. As new publications are indexed in PubMed, they are incorporated into our analysis pipeline to ensure you see the most current research landscape.
6. Coverage and Limitations
✅ What We Cover
- Medicine & Clinical Research
- Biology & Life Sciences
- Neuroscience & Cognitive Science
- Pharmacology & Drug Development
- Public Health & Epidemiology
- Genetics & Genomics
- Immunology & Microbiology
- Biotechnology Applications
❌ What We Don't Include
- Economics, Business, or Finance research
- Social Sciences (unless health-related)
- Engineering (except biomedical engineering)
- Computer Science (unless bioinformatics/health informatics)
- Physics, Chemistry, or Mathematics (unless applied to biological systems)
Important Limitations
- Publication Lag: Researchers without recent publications (2020 to 2025) will not appear in our data.
- Early Career Researchers: PhD students or postdocs who haven't yet published may be underrepresented.
- Clinical-Only Positions: Clinicians focused solely on patient care without research publications won't be included.
- Affiliation Accuracy: We rely on author-provided affiliation information, which may be incomplete or outdated.
7. Data Quality Assurance
We implement multiple quality control measures to ensure data accuracy:
Deduplication
We identify and merge duplicate researchers based on name, institution, and publication patterns.
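A minimal sketch of name-and-institution deduplication is shown below. It normalizes accents and punctuation, then groups records by last name, first initial, and institution, merging their publication IDs. The matching key is an assumption for this example, and the names and IDs are made up. Matching on a first initial can merge distinct people, which is one reason publication patterns (not modelled here) are also taken into account.

```python
import unicodedata


def _norm(text):
    """Lowercase, strip accents and periods, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().replace(".", " ").split())


def dedup_key(fore_name, last_name, institution):
    """Hypothetical matching key: (last name, first initial, institution)."""
    return (_norm(last_name), _norm(fore_name)[:1], _norm(institution))


def merge_duplicates(records):
    """Group records that share a key and union their publication IDs."""
    merged = {}
    for rec in records:
        key = dedup_key(rec["fore_name"], rec["last_name"], rec["institution"])
        merged.setdefault(key, set()).update(rec["pmids"])
    return merged


# Made-up records: two spellings of the same person at one institution.
records = [
    {"fore_name": "José", "last_name": "García",
     "institution": "Example University", "pmids": {"1"}},
    {"fore_name": "J.", "last_name": "Garcia",
     "institution": "Example University", "pmids": {"2"}},
]
merged = merge_duplicates(records)
```

Here both spellings collapse to a single researcher whose record carries both publication IDs.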
Anomaly Detection
Statistical algorithms flag unusual patterns such as implausible researcher counts or geographic mismatches.
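As one example of how implausible researcher counts can be flagged, the sketch below uses a modified z-score based on the median and the median absolute deviation (MAD). This is a common robust outlier test chosen here for illustration. It is an assumption, not necessarily the algorithm used in our pipeline, and the threshold of 3.5 and the sample counts are likewise illustrative.

```python
from statistics import median


def flag_outliers(counts, threshold=3.5):
    """Return the keys whose count is an outlier by modified z-score.

    `counts` maps a location (or institution) to its researcher count.
    """
    values = list(counts.values())
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [
        name for name, v in counts.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up counts: one location is far outside the range of the others.
sample = {"A": 10, "B": 12, "C": 11, "D": 9, "E": 500}
flagged = flag_outliers(sample)
```

A median-based score is used instead of the mean and standard deviation because a single extreme value would otherwise inflate the spread and hide itself.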
Manual Spot Checks
We manually verify a sample of high-volume institutions and popular research fields to ensure geocoding accuracy.
Continuous Improvement
We welcome user feedback to improve our data quality. If you notice any inaccuracies, please contact us with specific details so we can investigate and correct them.
Ready to Explore?
Now that you understand our methodology, start discovering research opportunities in your field.