How Our Data Is Collected and Validated
Transparency is central to the Arkansas Research Expertise Database. This page explains exactly where our data comes from, how it is validated, and the quality measures we apply to ensure researcher profiles are accurate and up to date.
Data Sources
Every researcher profile is assembled from multiple authoritative, independent sources. No single source is used in isolation. Cross-referencing across sources increases confidence and helps detect errors.
OpenAlex
Primary source
An open catalog of the global research system maintained by OurResearch. Provides publication records, citation counts, h-index, research topics, and institutional affiliations for over 200 million scholarly works and 90 million authors.
ORCID
Identity verification
The Open Researcher and Contributor ID registry. Used to verify researcher identities, confirm institutional affiliations, and extract verified email addresses. Cross-referenced with OpenAlex to link profiles to their canonical records.
UAMS TRI Profiles
Institutional data
Detailed clinical and research profiles from the University of Arkansas for Medical Sciences Translational Research Institute. Provides researcher biographies, profile images, grants, and clinical specialties for UAMS faculty.
NIH RePORTER
Federal grants
The NIH Research Portfolio Online Reporting Tools database. Provides active and historical NIH grant data including project titles, award amounts, funding institute, and principal investigator roles.
NSF Awards API
Federal grants
The National Science Foundation awards database. Provides NSF grant data including award titles, amounts, directorates, programs, and investigator roles for federally funded research projects.
Web Enrichment
Supplementary discovery
Automated web search (via Tavily) to discover Google Scholar profiles, lab websites, research group pages, and personal academic websites. Results are validated before inclusion.
Gemini AI
AI-generated content
Google's Gemini 3 Flash model generates research narrative summaries and derives research topic labels for each profile. Narratives are based on verified publications, grants, collaborator networks, and metrics. Topics are derived from actual publication data rather than relying on external metadata alone. AI-generated content is clearly labeled on all profiles.
Vertex AI Search
Search infrastructure
Google's Vertex AI Search (Discovery Engine) powers the semantic search functionality, enabling natural-language queries that understand meaning and context rather than simple keyword matching.
Validation Pipeline
Every researcher record passes through a multi-stage validation pipeline before appearing in search results. This process ensures that profiles represent real, currently active researchers at Arkansas institutions.
Harvest
Researcher records are collected from OpenAlex for 17 Arkansas institutions. We query by institutional affiliation to build an initial roster of all researchers with published work linked to each institution.
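The harvest query can be sketched as a paged request against the OpenAlex authors endpoint. The filter name follows the public OpenAlex API documentation; the institution ID shown is a placeholder, and the real pipeline's request parameters are assumptions:

```python
import urllib.parse

OPENALEX_AUTHORS = "https://api.openalex.org/authors"

def harvest_url(institution_id: str, cursor: str = "*") -> str:
    """Build one page of an OpenAlex authors query for an institution.

    institution_id is an OpenAlex institution ID (e.g. 'I137773608' --
    a placeholder here, not necessarily an Arkansas institution).
    Cursor paging ('cursor=*' to start) walks the full roster."""
    params = {
        "filter": f"affiliations.institution.id:{institution_id}",
        "per-page": "200",
        "cursor": cursor,
    }
    return f"{OPENALEX_AUTHORS}?{urllib.parse.urlencode(params)}"
```

Each response page returns a new cursor, which is fed back in until the roster for that institution is exhausted.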
ORCID Validation
Each researcher is cross-referenced against the ORCID registry. We verify that their ORCID profile confirms affiliation with the claimed Arkansas institution and extract verified email addresses when available.
Works Validation
Publication history is verified through OpenAlex works data. We confirm that the researcher has publications with the claimed institution in the author affiliations, and compute citation metrics including h-index.
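The h-index mentioned above has a simple definition: the largest h such that the researcher has h papers with at least h citations each. A minimal sketch:

```python
def h_index(citation_counts: list[int]) -> int:
    """Largest h such that h papers each have at least h citations."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(counts, start=1):
        if citations >= rank:
            h = rank          # the rank-th paper still has >= rank citations
        else:
            break
    return h
```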
Composite Scoring
A composite confidence score (0-10) is calculated based on multiple signals: ORCID verification, institutional email confirmation, publication volume, recency of activity, and affiliation evidence strength. Researchers are classified as "confirmed," "probable," or "needs review."
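The scoring logic above can be sketched as a capped weighted sum over the listed signals. The weights and classification thresholds below are illustrative assumptions, not the published production values:

```python
def composite_score(signals: dict) -> tuple[float, str]:
    """Combine validation signals into a 0-10 confidence score.

    Weights and cutoffs are hypothetical -- the real pipeline's
    weighting is not published."""
    score = 0.0
    if signals.get("orcid_verified"):
        score += 3.0                                       # identity evidence
    if signals.get("institutional_email"):
        score += 2.0                                       # email confirmation
    score += min(signals.get("works_count", 0) / 20, 2.0)  # publication volume, capped
    if signals.get("published_last_3_years"):
        score += 1.5                                       # recency of activity
    score += min(signals.get("affiliation_evidence", 0), 1.5)
    score = min(score, 10.0)

    if score >= 7:
        label = "confirmed"
    elif score >= 4:
        label = "probable"
    else:
        label = "needs review"
    return score, label
```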
Metric Snapshots
After scoring, the pipeline captures a monthly snapshot of each researcher's key metrics (h-index, publication count, citation count). These snapshots are stored as a time series, enabling longitudinal tracking of researcher growth. Profile pages display sparkline charts and year-over-year growth indicators once sufficient history accumulates.
Deduplication
An automated deduplication pipeline detects and merges duplicate researcher records caused by name variations, institutional transfers, or source-level entity resolution errors. Merge candidates are scored by name similarity, ORCID overlap, and co-authorship evidence, and resolved using topological ordering to handle multi-way merge chains correctly.
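The chain-resolution step can be sketched with Python's standard-library `graphlib`. Representing accepted merge candidates as (duplicate, canonical) edges is an assumption about how they are stored; the point is that processing canonicals before their duplicates makes a chain A→B→C resolve A all the way to C:

```python
from graphlib import TopologicalSorter

def resolve_merges(merge_edges: list[tuple[str, str]]) -> dict[str, str]:
    """Map each duplicate record id to its final canonical id.

    merge_edges: (duplicate, canonical) pairs already accepted by the
    scoring stage (one canonical per duplicate assumed)."""
    target = dict(merge_edges)                            # duplicate -> immediate canonical
    graph = {dup: {canon} for dup, canon in merge_edges}  # dup depends on its canonical
    order = TopologicalSorter(graph).static_order()       # canonicals come first
    final: dict[str, str] = {}
    for node in order:
        if node in target:
            canon = target[node]
            final[node] = final.get(canon, canon)  # follow an already-resolved chain
    return final
```

A cycle in the merge edges (A→B and B→A) would raise `graphlib.CycleError`, which is the desired behavior: contradictory merge candidates should fail loudly rather than merge arbitrarily.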
Enrichment
Validated profiles are enriched with additional data discovered through web search: Google Scholar profiles, lab websites, research group pages, and federal grant records from NIH RePORTER and NSF Awards. Grant data includes project titles, award amounts, and investigator roles.
Narrative & Topic Generation
For confirmed and probable researchers, Gemini 3 Flash generates both a research narrative and a set of research topic labels in a single pass. The model uses a three-tier depth system: data-rich profiles (grant PIs, high h-index) receive detailed 200-300 word narratives, moderate profiles receive 120-180 words, and sparse profiles receive concise 60-80 word summaries. Topics are derived from verified publication data and standardized against a database-wide vocabulary for consistency across profiles. All AI-generated content is clearly labeled.
Collaboration & Similarity
Co-author networks are computed from shared publications to identify each researcher's key collaborators. A similarity index compares researchers by topic overlap and departmental alignment, enabling discovery of researchers with related expertise across different institutions.
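One plausible form for such an index is a Jaccard overlap of topic labels with a boost for departmental alignment. The weights below are illustrative assumptions, not the published formula:

```python
def similarity(a_topics: set[str], b_topics: set[str],
               same_department: bool) -> float:
    """Hypothetical similarity index: Jaccard overlap of topic labels
    plus a small bonus for departmental alignment."""
    union = a_topics | b_topics
    jaccard = len(a_topics & b_topics) / len(union) if union else 0.0
    score = 0.8 * jaccard + (0.2 if same_department else 0.0)
    return round(score, 3)
```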
Search Index Export
Validated researcher profiles are exported to Vertex AI Search, which powers the semantic search functionality. The index is updated incrementally, ensuring that search results reflect the latest data from all upstream pipeline stages.
Data Quality Measures
We apply several safeguards to maintain data accuracy and prevent errors from propagating through the system.
Verification Tiers
Researchers are assigned a verification tier (0-3) based on the strength of evidence for their identity and affiliation. Tier 0 indicates a profile that has not yet passed automated checks. Tier 1 indicates automated validation through ORCID and publications. Tier 2 is granted when a researcher claims and verifies their own profile via institutional email. Tier 3 is reserved for manual administrative verification.
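Because each tier subsumes the evidence of the tiers below it, assignment reduces to taking the strongest signal present. A minimal sketch:

```python
def verification_tier(orcid_and_works_ok: bool,
                      self_claimed_via_email: bool,
                      admin_verified: bool) -> int:
    """Return the highest verification tier (0-3) the evidence supports."""
    if admin_verified:
        return 3           # manual administrative verification
    if self_claimed_via_email:
        return 2           # researcher claimed profile via institutional email
    if orcid_and_works_ok:
        return 1           # automated ORCID + publication validation
    return 0               # no verification yet
```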
Multi-Source Cross-Referencing
No profile is built from a single source. OpenAlex records are cross-checked against ORCID, institutional directories, and publication affiliations. Discrepancies trigger manual review rather than automatic inclusion.
Topic Quality Controls
Research topic labels are derived by AI from each researcher's verified publications, grants, and biography rather than relying solely on external metadata. Topics are generated against a standardized vocabulary of labels already used across the database, ensuring consistent naming for similarity matching. Each profile displays a maximum of 10 topics, ordered by relevance.
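The capping and vocabulary standardization might look like the following sketch, which uses a standard-library fuzzy string match as an illustrative stand-in for the real vocabulary matcher:

```python
from difflib import get_close_matches

def standardize_topics(raw_topics: list[str], vocabulary: list[str],
                       max_topics: int = 10) -> list[str]:
    """Snap model-proposed labels onto the database-wide vocabulary,
    drop duplicates, and cap at max_topics.

    raw_topics is assumed to arrive ordered by relevance; fuzzy matching
    here is only a stand-in for the production matcher."""
    out: list[str] = []
    for topic in raw_topics:
        match = get_close_matches(topic, vocabulary, n=1, cutoff=0.8)
        label = match[0] if match else topic   # fall back to the raw label
        if label not in out:
            out.append(label)
    return out[:max_topics]
```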
Activity Detection
Profiles are flagged as inactive when a researcher has no publications in recent years. Inactive profiles are hidden from default search and directory views but remain accessible if explicitly requested, ensuring the database reflects the current research landscape.
Designations & Career Stage
Researchers are automatically classified with designations based on objective criteria: Federal Grant PI (active NIH or NSF principal investigator), High Impact (top-percentile citation metrics), and ARA Academy Member (verified through the Arkansas Research Alliance). Career stage is inferred from publication history length and metrics.
Community Flagging
Every profile includes a "Report an Error" option that allows anyone to flag inaccurate information. When multiple independent reports are received for the same profile, the system automatically moves it to "needs review" status for administrative attention.
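The trigger condition can be sketched as counting distinct reporters against a threshold. The threshold of two is an assumption standing in for "multiple independent reports":

```python
REPORT_THRESHOLD = 2  # illustrative reading of "multiple independent reports"

def should_flag_for_review(reporter_ids: list[str]) -> bool:
    """Move a profile to 'needs review' once reports from distinct
    reporters reach the threshold (repeat reports from one reporter
    count once)."""
    return len(set(reporter_ids)) >= REPORT_THRESHOLD
```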
Researcher Self-Service
Researchers can claim their profiles by verifying their institutional email address. Once claimed, they can update their title, department, research biography, lab website, and Google Scholar link. They can also opt out of the database entirely.
Update Frequency
The database is maintained through a combination of automated pipelines and periodic manual review.
- Incremental harvest of new and updated researcher records from OpenAlex, followed by ORCID employment verification; only changes from the past 30 days are processed.
- Validation scores are recomputed, metric snapshots are captured, duplicate records are detected and merged, and NIH/NSF grant data is refreshed.
- AI narratives are regenerated for researchers whose data has changed, the collaboration network is recomputed, and the Vertex AI Search index is rebuilt.
- Researcher metrics (h-index, publications, citations) are captured as longitudinal snapshots, enabling growth tracking and trend visualization on profile pages.
- Researcher-submitted profile updates take effect immediately upon verification.
- Community-submitted error reports are logged immediately and reviewed by the administrative team.
Known Limitations
We believe in being transparent about what the database does well and where it has limitations.
- OpenAlex disambiguation. OpenAlex uses automated author disambiguation, which can occasionally assign publications to the wrong researcher, especially for common names. We mitigate this through ORCID cross-referencing, fuzzy name matching, and deduplication. Research topics are derived from verified publication data rather than relying on OpenAlex author-level topic metadata, which can be affected by misattributed works.
- Institutional affiliation lag. When a researcher moves between institutions, there may be a delay before the updated affiliation appears in OpenAlex. Researchers who have recently joined or left an Arkansas institution may be temporarily misclassified.
- AI-generated content. Research narratives and topic labels are generated by AI (Gemini 3 Flash) and may contain inaccuracies or oversimplifications. Narratives are constrained to be strictly factual and grounded in provided data, but occasional errors are possible. All AI-generated content is clearly labeled and is not a substitute for reading the researcher's actual publications.
- Coverage gaps. Researchers who do not publish in indexed venues, or whose work falls outside traditional academic publishing, may be underrepresented. Clinical faculty, practitioners, and industry-facing researchers may have sparser profiles.
- Grant data completeness. Federal grant records cover NIH and NSF awards. State grants, foundation grants, and industry-sponsored research are not currently captured.
See Something Wrong?
If you notice inaccurate information on any researcher profile, you can report it directly from the profile page using the "Report an Error" link. You can also contact our team for questions about data accuracy.
Questions About Our Data
For questions about data sources, methodology, or how a specific profile was generated, contact the Arkansas Research Alliance.
(501) 450-7818