Our Methodology

Transparent documentation of how we map global biomedical research networks using PubMed data

Built on Trusted Data Sources

All our data comes from PubMed, the U.S. National Library of Medicine's authoritative database containing over 36 million biomedical citations. We do not modify or editorialize the research data; we only organize and visualize it geographically.

1. Data Sources

Primary Source: PubMed

PubMed is maintained by the U.S. National Library of Medicine (NLM) and provides free access to MEDLINE, the premier bibliographic database of life sciences and biomedical information.

Coverage Scope

• Biomedical & Life Sciences
• Medicine & Clinical Research
• Neuroscience & Psychology
• Pharmacology & Toxicology
• Public Health & Epidemiology

Time Range

We analyze publications from 2020 to 2025 to ensure data reflects current research activity and institutional affiliations.

2. Data Collection Method

Query Construction

For each research field (e.g., "CRISPR Gene Editing"), we construct targeted search queries using field-specific keywords and MeSH (Medical Subject Headings) terms.

Example query for CRISPR research:

(CRISPR OR "CRISPR-Cas9" OR "gene editing" OR "genome editing") AND ("2020"[Date - Publication] : "2025"[Date - Publication])
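A query of this shape can be assembled from a keyword list. The following is a minimal sketch under stated assumptions, not necessarily the production code: the helper name `build_query` and its quoting rule (quote any term that contains a space or a hyphen) are hypothetical and chosen only to reproduce the example above.

```python
def build_query(keywords, start_year, end_year):
    """Join keywords with OR and restrict results to a publication-year range."""

    def quote(term):
        # Assumption: phrases and hyphenated terms are quoted so that
        # PubMed searches them as a single unit.
        return f'"{term}"' if (" " in term or "-" in term) else term

    terms = " OR ".join(quote(k) for k in keywords)
    dates = (
        f'("{start_year}"[Date - Publication] : '
        f'"{end_year}"[Date - Publication])'
    )
    return f"({terms}) AND {dates}"


query = build_query(
    ["CRISPR", "CRISPR-Cas9", "gene editing", "genome editing"], 2020, 2025
)
print(query)
```

Called with the four CRISPR keywords and the 2020 to 2025 range, the sketch yields the example query shown above.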

Paper Retrieval

Using the official PubMed E-utilities API, we retrieve all matching publications including metadata such as title, authors, affiliations, publication date, and journal information.
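As an illustration of the retrieval step, the sketch below builds request URLs for two E-utilities endpoints: ESearch, which returns the PubMed IDs matching a query, and EFetch, which returns full records for a list of IDs. The function names and default values are assumptions made for this example; only the endpoint paths and the `db`, `term`, `retmax`, `retstart`, `retmode`, and `id` parameters come from the E-utilities interface itself.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=100, retstart=0):
    """URL that asks ESearch for PubMed IDs matching `term`, as JSON."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,      # page size
        "retstart": retstart,  # offset of the first result in this page
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """URL that asks EFetch for full records (XML) for the given PubMed IDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "retmode": "xml",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"
```

The URLs can then be requested with any HTTP client, for example `urllib.request.urlopen`. A real client would also need to page through large result sets and respect NCBI's request-rate limits.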

Author Identification

From each paper's metadata, we extract all listed authors and their institutional affiliations as provided in the publication. We count each unique author-affiliation combination as a research contributor.
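To make the counting rule concrete, here is a small sketch that collects unique author-affiliation pairs from PubMed XML using Python's standard library. The element names (`Author`, `LastName`, `ForeName`, `AffiliationInfo`, `Affiliation`) follow the PubMed XML format; the function name and the choice to skip group authors without a `LastName` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def author_affiliations(xml_text):
    """Return the set of unique (author name, affiliation) pairs in the XML."""
    root = ET.fromstring(xml_text)
    pairs = set()
    for author in root.iter("Author"):
        last = author.findtext("LastName")
        if last is None:
            continue  # assumption: skip collective (group) authors
        fore = author.findtext("ForeName", default="")
        name = f"{fore} {last}".strip()
        # An author may list several affiliations; each one forms its own pair.
        for aff in author.findall("AffiliationInfo/Affiliation"):
            if aff.text and aff.text.strip():
                pairs.add((name, aff.text.strip()))
    return pairs
```

Because the result is a set, an author who appears on several papers with the same affiliation string is counted once, while the same author listed under two different affiliations yields two pairs.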

3. Geographic Information Extraction

Institution Identification

We parse author affiliation strings (e.g., "Department of Biology, Harvard University, Cambridge, MA, USA") to extract institution names, departments, and location information.
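A simple comma-based heuristic shows the idea. This is a sketch only: real affiliation strings vary widely, and the keyword list, field names, and function name below are assumptions rather than a description of the actual parser.

```python
# Assumption: words that usually mark the institution part of an affiliation.
INSTITUTION_HINTS = ("university", "institute", "hospital", "college")


def parse_affiliation(affiliation):
    """Split an affiliation string into department, institution, locality, country."""
    parts = [p.strip() for p in affiliation.rstrip(". ").split(",") if p.strip()]
    department = next(
        (p for p in parts if p.lower().startswith("department")), None
    )
    inst_index = next(
        (i for i, p in enumerate(parts)
         if any(hint in p.lower() for hint in INSTITUTION_HINTS)),
        None,
    )
    institution = parts[inst_index] if inst_index is not None else None
    # Assumption: the last comma-separated part is the country, unless that
    # part is the institution itself or the string has only one part.
    has_country = len(parts) > 1 and inst_index != len(parts) - 1
    country = parts[-1] if has_country else None
    # Whatever sits between the institution and the country is treated as locality.
    locality = parts[inst_index + 1:-1] if inst_index is not None else []
    return {
        "department": department,
        "institution": institution,
        "locality": locality,
        "country": country,
    }
```

Applied to the example string above, this returns the department "Department of Biology", the institution "Harvard University", the locality parts "Cambridge" and "MA", and the country "USA".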

Geographic Coding

Extracted institutions are geocoded to determine their city and country. This process involves:

• Parsing location information from affiliation strings
• Standardizing institution names (e.g., "MIT" → "Massachusetts Institute of Technology")
• Mapping institutions to cities using geographic databases
• Resolving ambiguous cases through multiple data sources
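The standardization and mapping steps can be pictured as two lookups. The sketch below uses tiny in-memory tables as stand-ins for the alias list and geographic database; `ALIASES`, `GAZETTEER`, and both function names are hypothetical.

```python
# Hypothetical alias table: lowercase short form -> canonical institution name.
ALIASES = {"mit": "Massachusetts Institute of Technology"}

# Hypothetical gazetteer: canonical institution name -> (city, country).
GAZETTEER = {
    "Massachusetts Institute of Technology": ("Cambridge", "United States"),
    "Harvard University": ("Cambridge", "United States"),
}


def standardize(name):
    """Collapse whitespace and expand known aliases to the canonical name."""
    cleaned = " ".join(name.split())
    return ALIASES.get(cleaned.lower(), cleaned)


def locate(name):
    """Return (canonical name, (city, country)); location is None if unknown."""
    canonical = standardize(name)
    return canonical, GAZETTEER.get(canonical)
```

In this sketch a `None` location marks an institution that the lookup could not resolve and that would need another data source or manual review.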

Validation Mechanism

We cross-reference institution locations using multiple sources including institutional databases, GeoNames, and manual verification for high-volume institutions. Ambiguous cases are flagged for review.
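One way to picture the cross-referencing step: accept a location only when every source that returned a result agrees, and flag everything else. The acceptance rule, the source labels, and the function name below are assumptions for illustration; the actual criteria may differ.

```python
def reconcile_locations(candidates):
    """Compare locations reported by several sources.

    `candidates` maps a source label to a (city, country) tuple, or to None
    when that source had no match.
    """
    found = {loc for loc in candidates.values() if loc is not None}
    if len(found) == 1:
        # All sources that answered agree on one location.
        return {"location": found.pop(), "needs_review": False}
    # No source answered, or the sources disagree: leave it for review.
    return {"location": None, "needs_review": True}
```

For example, if one source places an institution in Cambridge, United States and another in Cambridge, United Kingdom, the sketch returns no location and sets `needs_review` to true.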

4. Data Processing Flow

Our data pipeline processes publications through the following stages, in order:

• PubMed Query: Search for publications by field and date range
• Paper Retrieval: Fetch publication metadata via API
• Author Extraction: Parse author names and affiliations
• Institution Recognition: Identify and standardize institution names
• Geographic Coding: Map institutions to cities and countries
• Aggregation & Statistics: Count researchers by location and field
• Quality Validation: Verify data accuracy and completeness
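The aggregation stage reduces to counting distinct researchers per field and location. The sketch below assumes that earlier stages emit one record per researcher, field, city, and country; the record layout and function name are hypothetical.

```python
from collections import Counter


def count_researchers(records):
    """Count distinct researchers per (field, city, country).

    `records` is an iterable of (researcher_id, field, city, country) tuples.
    """
    # A researcher with several papers in one field and city counts once.
    unique = set(records)
    return Counter(
        (field, city, country) for _, field, city, country in unique
    )
```

Converting the records to a set first means that repeated rows for the same researcher do not inflate the totals.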

5. Update Frequency

Data Synchronization

Our data stays synchronized with PubMed. As new publications are indexed in PubMed, they are incorporated into our analysis pipeline to ensure you see the most current research landscape.
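One possible way to pick up newly indexed records is to restrict an ESearch request to a recent window. The sketch below uses the E-utilities `reldate` and `datetype` parameters for that purpose; the function name, the choice of the Entrez date (`edat`), and the window length are assumptions, not a statement of how our synchronization is scheduled.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def recent_records_url(term, days):
    """URL for records matching `term` that were added in the last `days` days."""
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": days,     # look back this many days from today
        "datetype": "edat",  # assumption: filter on the Entrez date
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"
```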

6. Coverage and Limitations

✅ What We Cover

• Medicine & Clinical Research
• Biology & Life Sciences
• Neuroscience & Cognitive Science
• Pharmacology & Drug Development
• Public Health & Epidemiology
• Genetics & Genomics
• Immunology & Microbiology
• Biotechnology Applications

❌ What We Don't Include

• Economics, Business, or Finance research
• Social Sciences (unless health-related)
• Engineering (except biomedical engineering)
• Computer Science (unless bioinformatics/health informatics)
• Physics, Chemistry, or Mathematics (unless applied to biological systems)

Important Limitations

• Publication Lag: Researchers with no publications in the 2020-2025 window do not appear in our data.
• Early Career Researchers: PhD students and postdocs who have not yet published are absent, so early-career researchers as a group are underrepresented.
• Clinical-Only Positions: Clinicians focused solely on patient care, with no research publications, are not included.
• Affiliation Accuracy: We rely on author-provided affiliation information, which may be incomplete or outdated.

7. Data Quality Assurance

We implement multiple quality control measures to ensure data accuracy:

Deduplication

We identify and merge duplicate researchers based on name, institution, and publication patterns.
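A simplified sketch of name-and-institution matching is shown below. It normalizes names by removing accents, punctuation, and case, then merges records that share the same normalized name and institution. Matching on publication patterns is not modeled here, and the record layout and function names are assumptions.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name):
    """Lowercase a name and strip accents, punctuation, and extra spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(kept.lower().split())


def merge_duplicates(records):
    """Merge records that share a normalized name and an institution.

    `records` is an iterable of dicts with 'name', 'institution', and 'pmids'
    keys. Returns a dict mapping (normalized name, institution) to the union
    of PubMed IDs.
    """
    merged = defaultdict(set)
    for rec in records:
        key = (normalize_name(rec["name"]), rec["institution"])
        merged[key].update(rec["pmids"])
    return dict(merged)
```

Exact matching on normalized names is deliberately conservative: it merges "José García" with "Jose Garcia" at the same institution, but it does not merge initials with full names, which would need additional evidence.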

Anomaly Detection

Statistical algorithms flag unusual patterns such as implausible researcher counts or geographic mismatches.
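As one example of such a check, the sketch below flags locations whose researcher count is far from the median, using a modified z-score based on the median absolute deviation (MAD). The choice of this statistic, the 3.5 threshold, and the function name are assumptions; the geographic-mismatch checks are not modeled.

```python
from statistics import median


def flag_outliers(counts, threshold=3.5):
    """Return keys whose count is an outlier by the modified z-score.

    `counts` maps a location (or any label) to a researcher count.
    """
    values = list(counts.values())
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against; nothing is flagged
    return sorted(
        key for key, value in counts.items()
        if 0.6745 * abs(value - med) / mad > threshold
    )
```

A median-based score is used in this sketch because a single extreme count would distort a mean and standard deviation computed over the same data.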

Manual Spot Checks

We manually verify a sample of high-volume institutions and popular research fields to ensure geocoding accuracy.

Continuous Improvement

We welcome user feedback to improve our data quality. If you notice any inaccuracies, please contact us with specific details so we can investigate and correct them.
