(1) Overview
Introduction
Systematic reviews are widely recognized as a standard method for evidence synthesis in fields ranging from medicine to social policy [1]. However, conducting a comprehensive review is labor-intensive. A single study often requires researchers to manually screen thousands of titles and abstracts, review hundreds of full-text articles, and extract data into standardized formats. This workflow is often slow, taking months, and is prone to human error [2], particularly as researcher fatigue sets in.
Historically, software tools for systematic reviews (e.g., Rayyan [26] and EndNote [20]) have relied on keyword searches and basic pattern matching [3]. While useful for organizing references, these ‘rule-based’ systems can struggle with the nuances of language. They often miss relevant studies that use different terminology to describe the same concept or include irrelevant studies that happen to contain specific keywords.
The recent advent of large language models (LLMs) offers a technical solution [4]. Unlike simple keyword searches, LLMs can read and understand context, allowing them to interpret the scientific reasoning behind a text. However, integrating these models into rigorous academic research is difficult. Proprietary tools are often expensive ‘black boxes’ that offer no insight into how decisions are made. Furthermore, LLMs are prone to ‘hallucinations’ (confidently stating incorrect information) and formatting inconsistencies [5].
Several existing AI-assisted tools have emerged to address these challenges. For example, ASReview [9] utilizes active learning to prioritize screening, and Sysrev [10] offers a collaborative platform with integrated extraction features. However, many of these solutions either lock users into specific proprietary models or cloud environments, or they focus primarily on screening rather than end-to-end data extraction. ReviewAid [7] differentiates itself through a vendor-agnostic, multi-provider architecture. It allows users to switch between high-performance cloud models (e.g., GPT-4o [11] and Claude [12]) and local models (via Ollama [30]) to address privacy concerns, a feature less common in purely cloud-based competitors. Additionally, ReviewAid implements specific engineering innovations, such as the ‘Bulletproof Parsing Pipeline’, to handle LLM formatting errors and a hierarchical confidence scoring system to quantify uncertainty. While established platforms like Covidence [19] offer broader project management and team integration features, ReviewAid focuses on technical flexibility, robustness in the AI interaction layer, and local privacy options.
Crucially, ReviewAid is explicitly designed as a decision-support tool and does not replace human judgment. It was developed to bridge the gap between the reasoning capabilities of modern AI and the strict, error-averse requirements of academic research. It addresses the ‘black box’ problem by visualizing the AI’s internal ‘thought process’ via a system terminal. Furthermore, it tackles the issue of formatting errors through a parsing pipeline that is designed to correct AI mistakes automatically. The motivation behind ReviewAid was not to replace the researcher but to act as a supplementary ‘aid’, a ‘third reference’ to minimize manual errors and ensure no potential papers are missed.
Implementation and architecture
ReviewAid is a web-based application designed for maximum usability and accessibility. It is fully open-source, allowing researchers to inspect, modify, and deploy the code on their own computers. The software is built using Python 3.12+ [23] and utilizes Streamlit [22] for the frontend, providing a reactive, modern interface that updates in real time without needing to refresh the page.
ReviewAid follows a modular client-server architecture that decouples the application logic from the AI inference engine. The client, built with a Streamlit frontend, manages user inputs and file uploads while providing a clean, responsive UI and a real-time ‘system terminal’ for operational transparency. The server handles backend logic and communication with various AI providers, supporting a wide range of models; rigorous testing has been conducted specifically on OpenAI (GPT-4o) [11], Anthropic (Claude-Sonnet-4–20250514) [12], DeepSeek (DeepSeek-Chat) [13], Cohere (Command-A-03–2025) [14], Z.ai (GLM-4.6V-Flash, GLM-4.5V-Flash) [6], and Ollama (Llama3) [15]. While these models are officially supported, the architecture is model-agnostic and designed to accommodate future AI models. Users can utilize new models by entering the specific name in the configuration, provided the API endpoint adheres to standard OpenAI or Anthropic protocols. Crucially, ReviewAid supports local execution via Ollama, allowing researchers to run models like Llama3 entirely on their hardware. This ensures that no data leaves the local machine, a critical feature for sensitive patient data, and API keys are never stored by the application to protect privacy.
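To illustrate the model-agnostic design, the sketch below shows how a provider registry might map each backend to an endpoint and wire protocol, building OpenAI-style or Anthropic-style request bodies accordingly. The registry contents, `build_request` helper, and field choices are illustrative assumptions, not ReviewAid's actual implementation; the URLs are the providers' publicly documented defaults (and Ollama's local OpenAI-compatible endpoint).

```python
from dataclasses import dataclass

@dataclass
class ProviderConfig:
    base_url: str   # root of the provider's HTTP API
    protocol: str   # wire format: "openai" or "anthropic"

# Illustrative registry of supported backends.
PROVIDERS = {
    "openai":    ProviderConfig("https://api.openai.com/v1", "openai"),
    "deepseek":  ProviderConfig("https://api.deepseek.com/v1", "openai"),
    "ollama":    ProviderConfig("http://localhost:11434/v1", "openai"),
    "anthropic": ProviderConfig("https://api.anthropic.com/v1", "anthropic"),
}

def build_request(provider: str, model: str, prompt: str) -> dict:
    """Assemble the URL and JSON body for a single chat request."""
    cfg = PROVIDERS[provider]
    if cfg.protocol == "openai":
        return {
            "url": f"{cfg.base_url}/chat/completions",
            "body": {"model": model,
                     "messages": [{"role": "user", "content": prompt}]},
        }
    # Anthropic's Messages API requires an explicit max_tokens field.
    return {
        "url": f"{cfg.base_url}/messages",
        "body": {"model": model, "max_tokens": 1024,
                 "messages": [{"role": "user", "content": prompt}]},
    }
```

Because every OpenAI-protocol backend shares one request shape, adding a new provider reduces to adding one registry entry, which is what makes user-supplied model names workable.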
The backend uses PyMuPDF (v1.26.3) [24] for initial text extraction, while the AI models perform visual and semantic understanding to handle scanned documents or complex layouts. Data manipulation is handled by pandas [25], with exports to Word and Excel facilitated by python-docx [27] and xlsxwriter [28]. Supplementary Figure 1 illustrates the basic conceptual architecture of ReviewAid.
A. Hierarchical four-tier confidence scoring system to support reliability
ReviewAid does not treat AI decisions as simple ‘yes/no’ answers. Instead, it treats certainty as a measurable metric using a four-tier system. The logic prioritizes deterministic rule-based decisions before progressively falling back to algorithmic and heuristic estimation only when necessary.
Tier 1 (Deterministic Rule-Based): The system performs a preliminary scan for specific exclusion and inclusion keywords (e.g., ‘rats’ for a human study). If exclusion keywords are found without inclusion keywords, the paper is automatically rejected with high confidence. If both are present, the system ignores these rules to avoid false positives and lets the AI decide.
Tier 2 (LLM Self-Assessment): The AI is explicitly instructed to evaluate its own screening or extraction decision. It assigns a confidence score between 0.0 and 1.0 based on the explicit evidence it found in the text. This captures context that simple rules might miss.
Tier 3 (Heuristic Estimation): If the AI fails to provide a confidence score (due to a formatting error), the system calculates a ‘best guess’ based on keyword density. For screening, it matches inclusion/exclusion criteria against the full text; for extraction, it compares extracted data against the full text.
Tier 4 (Low-Confidence Default): If data extraction fails completely, a low score (e.g., 0.2) is assigned, and the result is flagged for mandatory human review. This design choice aims to prevent the system from failing silently (see Figure 1).
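The four-tier fallback described above can be sketched in Python as follows. The argument names, exact thresholds, and the mapping of keyword density into the low band of Table 1 are assumptions for illustration, not ReviewAid's internal code.

```python
def assign_confidence(rule_decision, llm_score, text, criteria_keywords):
    """Fall through tiers 1-4 and return a confidence in [0.0, 1.0]."""
    # Tier 1: deterministic rule hit -> full confidence, no AI needed.
    if rule_decision in ("include", "exclude"):
        return 1.0
    # Tier 2: trust the LLM's self-assessed score when one was parsed.
    if llm_score is not None:
        return max(0.0, min(1.0, float(llm_score)))
    # Tier 3: heuristic estimate from keyword density in the full text.
    if criteria_keywords:
        lowered = text.lower()
        density = (sum(k.lower() in lowered for k in criteria_keywords)
                   / len(criteria_keywords))
        return 0.1 + 0.29 * density  # assumed mapping into the "low" band
    # Tier 4: everything failed -> low default, flagged for human review.
    return 0.2
```

The ordering matters: the cheap deterministic check runs first, and each later tier is reached only when every earlier source of certainty is unavailable, so the score always records *how* it was produced.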

Figure 1
Multi-tier screening and confidence assignment framework.
The interpretation of these scores is detailed in Table 1 below.
Table 1
Confidence Score Interpretation.
| CONFIDENCE SCORE | CLASSIFICATION | DESCRIPTION | IMPLICATION |
|---|---|---|---|
| 1.0 (100%) | Definitive Match | Rule-based classification/No ambiguity. | Fully automated decision. |
| 0.8–1.0 | Very High | AI strongly validates the decision using explicit evidence. | Safe to accept. |
| 0.6–0.79 | High | Criteria appear satisfied based on standard academic content. | Review is optional. |
| 0.4–0.59 | Moderate | Ambiguous context or loosely met criteria. | Manual verification recommended. |
| 0.1–0.39 | Low | Based mainly on keyword estimation. | High risk of error. |
| <0.1 | Unreliable | Derived from failed extraction methods. | Mandatory manual review. |
B. Bulletproof parsing pipeline
A common failure point in AI software is that models often return data in JSON (a structured data format) that is technically broken, for instance, by missing a closing bracket or containing a trailing comma. This causes standard software to crash. ReviewAid addresses this with a Bulletproof Parsing Pipeline that attempts to fix errors in six sequential steps:
Null Check: If the result is empty, immediately try a basic text search using Regex.
Sanitization: Remove common formatting errors like Markdown code block markers (e.g., ```json).
Standard JSON Parsing: Attempt to read the data using standard Python tools (json.loads).
JSON5 Parsing: If standard reading fails, use a more tolerant library (JSON5 [29]) that can handle missing punctuation.
AI Repair: If that fails, send the broken text back to the AI with a prompt asking it to fix its own formatting mistakes.
Regex Fallback: As a last resort, use pattern matching to hunt for specific data keys in the raw text.
This pipeline is designed to prevent system crashes caused by malformed outputs and aims to recover usable data from potentially corrupted AI responses, thereby mitigating the risk of silent failures (see Figure 2).
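A minimal sketch of the six-step chain follows, assuming a caller-supplied repair callable for step 5; the function signature and key names are illustrative, not ReviewAid's internal API.

```python
import json
import re

def parse_response(raw, expected_keys, repair_fn=None):
    """Attempt six increasingly tolerant strategies to read a JSON reply."""
    # Step 1: null check -- an empty reply leaves nothing even for the
    # regex fallback to find, so bail out early.
    if not raw or not raw.strip():
        return None
    # Step 2: sanitize Markdown code fences such as ```json ... ```
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    # Step 3: strict standard-library parsing.
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass
    # Step 4: tolerant parsing via the third-party json5 package, which
    # accepts trailing commas, single quotes, and unquoted keys.
    try:
        import json5
        return json5.loads(cleaned)
    except Exception:
        pass
    # Step 5: ask the model to repair its own output (injected callable).
    if repair_fn is not None:
        try:
            return json.loads(repair_fn(cleaned))
        except Exception:
            pass
    # Step 6: regex fallback -- scrape "key": "value" pairs from raw text.
    pairs = re.findall(r'"([^"]+)"\s*:\s*"([^"]*)"', cleaned)
    found = {k: v for k, v in pairs if k in expected_keys}
    return found or None
```

Each stage only runs when every cheaper stage has failed, so well-formed responses pay no extra cost and the expensive AI-repair round trip is reserved for genuinely broken output.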

Figure 2
Robust API response parsing and recovery workflow.
C. Execution logic and pipelines
The application implements distinct logical flows for its two primary modes: screening and extraction.
Screener execution logic
The screener module begins by accepting user-defined inclusion and exclusion criteria (PICO framework). It processes uploaded PDFs by first converting them to text via PyMuPDF [24]. The system then applies the hierarchical confidence scoring (tiers 1–4). If the paper passes the initial deterministic checks, it is sent to the LLM for semantic evaluation. The LLM returns a classification (include, exclude, or maybe) along with a justification and a confidence score. This data is then parsed via the bulletproof pipeline and presented to the user. Supplementary Figure 2 illustrates the full-text paper screener’s execution logic.
AI data extraction pipeline
The Extractor module begins by accepting user-defined data fields (e.g., ‘Sample Size’, ’P-value’). It iterates through the uploaded papers, sending the full text and the requested fields to the LLM. The LLM is prompted to return the results in a structured JSON format. Upon receiving the response, the system engages the bulletproof parsing pipeline to ensure the JSON is readable. Once parsed, the extracted fields are populated into a data frame for user review and export. Supplementary Figure 3 illustrates the full-text AI data extraction pipeline.
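A prompt of this kind might be constructed as in the following sketch; the wording and the `build_extraction_prompt` helper are illustrative assumptions, not ReviewAid's actual prompt.

```python
def build_extraction_prompt(fields, paper_text):
    """Assemble an instruction that pins the reply to one JSON object."""
    # Render the requested fields as a JSON schema hint, e.g.
    # {"Sample Size": "...", "P-value": "..."}
    schema = ", ".join(f'"{f}": "..."' for f in fields)
    return (
        "Extract the following fields from the paper below. Reply with "
        "ONLY a JSON object of the form {" + schema + "}. Use the string "
        '"Not Found" for any field absent from the paper, and add a '
        '"confidence" key with a value between 0.0 and 1.0.\n\n'
        "PAPER:\n" + paper_text
    )
```

Asking for an explicit `"Not Found"` string (as seen in the validation output of Table 5) keeps absent fields distinguishable from parsing failures downstream.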
D. Operational robustness and common issues
ReviewAid includes several mechanisms to handle real-world operational errors. To manage API rate limits, the system retries a throttled request up to three times; if failures persist, it issues a fresh API request for the affected paper. Users should be aware that the AI may not understand domain-specific abbreviations (e.g., ‘SD’ for standard deviation) without context, so they are advised to expand these abbreviations in the extractor configuration (e.g., ‘Intervention_SD: standard deviation…’). Additionally, because the web version is hosted on community cloud resources, users may experience a wait of up to 30 seconds during initialization (a cold start). Finally, to ensure stability on shared resources, submissions are limited to a maximum of 20 articles per session.
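The retry behaviour can be sketched as follows. The exponential backoff schedule and the use of `RuntimeError` as a stand-in for a provider's rate-limit exception are assumptions for illustration; the text above specifies only the three-attempt limit.

```python
import time

def call_with_retry(request_fn, max_retries=3, base_delay=2.0):
    """Retry a throttled request up to max_retries times."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RuntimeError:  # stand-in for a provider RateLimitError
            if attempt == max_retries - 1:
                raise  # persistent failure: caller issues a fresh request
            # Wait before the next attempt (assumed exponential schedule).
            time.sleep(base_delay * (2 ** attempt))
```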
Quality control
A comprehensive validation strategy was employed to assess ReviewAid (v2.1.0), involving both stress testing on large datasets and a detailed precision analysis using a gold-standard reference. Although our internal testing used larger datasets (100+ papers), the tool strictly enforces a maximum of 20 studies per batch to ensure operational stability. The validation studies were conducted by the authors (V.S. and M.B.), researchers at Georgian National University SEU, who verified the outputs through technical text comparison against the gold-standard reference [8]. The validation artifacts, including source PDFs and detailed results for extraction accuracy and confidence, are publicly available in the ReviewAid Validation Repository. Additionally, to support transparency, the repository includes the scripts and configuration files necessary to replicate the validation tests described below.
Extraction accuracy and confidence validation
To evaluate extraction precision, a targeted study was conducted on the first 19 articles from the supplementary material of a high-impact systematic review by Bakker et al. [8]. This selection method was chosen to provide a consistent benchmark derived from an established peer-reviewed study. This study utilized the default GLM-4.6V-Flash model [6] to extract specific fields: journal, DOI, category, therapeutic area, medical condition, concept of interest, outcome assessment, comparator measure, technology, sensor(s), make and model, wear location, algorithm/analysis software, sample size, participant age, and participant gender.
Methodology: Papers were matched by title. A ‘hit’ was counted if information was semantically equivalent (e.g., ‘Wrist’ matching ‘Non-dominant wrist’). A ‘miss’ was counted for contradictory, missing, or significantly different data.
Validation results: The overall average accuracy across all fields was 78.19% (see Table 2). Notably, this figure was obtained with the default GLM-4.6V-Flash, a free-tier model. Switching the tool’s configuration to a high-precision model such as Claude Sonnet 4.1 [12] or DeepSeek-Chat [13] would be expected to improve precision. The focus should therefore be on the architecture that enables such model flexibility rather than on the precision of any single default model.
Confidence score validation
To validate the predictive validity of the hierarchical four-tier confidence scoring system, records were grouped into high (0.8–1.0), medium (0.6–0.79), and low (<0.6) confidence buckets. The hypothesis tested was whether higher confidence scores correlate with higher extraction accuracy.
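The bucketing used for this analysis can be sketched as below; the helper names are illustrative, but the thresholds follow the bands stated above.

```python
def bucket(confidence):
    """Map a confidence score to the validation buckets."""
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.6:
        return "medium"
    return "low"

def accuracy_by_bucket(records):
    """records: iterable of (confidence, was_correct) pairs."""
    totals, hits = {}, {}
    for conf, correct in records:
        b = bucket(conf)
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + int(correct)
    # Per-bucket accuracy: fraction of correct extractions in each band.
    return {b: hits[b] / totals[b] for b in totals}
```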
Software validation results demonstrated that high-confidence scores were strongly predictive of accuracy (93.4%) (see Table 3). Interestingly, the medium-confidence group showed a dip in accuracy (57.1%) compared to the low-confidence group (71.4%), suggesting that when the AI is unsure, this is usually because details are genuinely missing from the paper (low confidence) rather than because extraction failed outright (the medium-confidence anomaly).
Multi-provider architecture validation
To verify the modular multi-provider architecture, testing was conducted by two researchers (V.S. and M.B.) across all supported AI providers (OpenAI, Anthropic, DeepSeek, Cohere, Z.ai, and Ollama) using batches of 20 papers.
Extractor results (N = 20)
The extractor was tested on fields including conclusion, type of study, population, intervention, comparison, outcome, and results. All providers processed the full batch without failure (see Table 4).
Table 4
Software validation results – extractor performance by AI provider.
| PROVIDER | MODEL | TIME (MM:SS) | ERROR RATE | TESTER NOTES |
|---|---|---|---|---|
| Default/GLM | GLM-4.6V-Flash [6] | 4:39 | 0% | Fast with no errors |
| OpenAI | gpt-4o [11] | 3:52 | 0% | Fast, no errors |
| Anthropic | Claude-Sonnet-4–20250514 [12] | 4:13 | 0% | Fast, good for extraction |
| Cohere | Command-A-03–2025 [14] | 1:53 | 0% | Very fast and accurate |
| Deepseek | DeepSeek-Chat [13] | 4:57 | 0% | Accurate, slower speed |
The full-text extraction result obtained with the DeepSeek configuration (model: DeepSeek-Chat) is shown in Table 5.
Table 5
Software validation results – full-text AI data extraction result.
| FILENAME | CONFIDENCE | PAPER TITLE | CONCLUSION | TYPE OF STUDY | POPULATION | INTERVENTION | COMPARISON | OUTCOME | RESULTS |
|---|---|---|---|---|---|---|---|---|---|
| 1-s2.0-S0022510X21003166-main.pdf | 0.95 | Wearing-off symptoms during standard and extended natalizumab dosing intervals: Experiences from the COVID-19 pandemic [16] | Our observations support the need to study the effect of EID on wearing-off symptoms in randomized controlled trials. | Observational study | 30 relapsing-remitting multiple sclerosis (RRMS) patients over 18 years of age receiving natalizumab at the Department of Neurology, Haukeland University Hospital | Extended interval dosing (EID) of natalizumab every six weeks | Standard interval dosing (SID) of natalizumab every four weeks | Change in prevalence or intensity of wearing-off symptoms | 50% (15/30) reported new or increased wearing-off symptoms during EID. Symptom increase was more frequent among patients with pre-existing wearing-off symptoms during SID compared to patients without such pre-existing symptoms [p = 0.0005]. None had decreased symptoms or signs of clinical relapse. |
| 1-s2.0-S221103482100612X-main.pdf | 0.9 | Safety of Natalizumab infusion in multiple sclerosis patients during active SARS-CoV-2 infection [17] | Natalizumab redosing in people with multiple sclerosis during active SARS-CoV-2 infection is not associated with worsening of COVID-19 symptoms or recovery delay and is reasonably safe. The data supports the safety of NTZ redosing in these circumstances and suggests not to delay retreatment to minimize the risk of MS rebound. | Retrospective observational case series/cohort study | 18 relapsing-remitting people with Multiple Sclerosis (pwMS) under Natalizumab treatment, infected by SARS-CoV-2 between October 2020 and May 2021, from 6 Italian MS centers. All had mild COVID-19. | Natalizumab (NTZ) reinfusion (retreatment/redosing) during confirmed active SARS-CoV-2 infection (before achieving a negative swab). | Not Found (The study is a single-arm observational study with no explicit comparison group. Implicit comparison is to general population recovery times.) | Safety outcomes: worsening of SARS-CoV-2 infection symptoms, recovery delay, development of new neurological symptoms suggestive of CNS invasion, time to full recovery, and interval from positive to negative swab. | No patient reported worsening of SARS-CoV-2 symptoms or developed new neurological symptoms after redosing. Mean time to full recovery after NTZ for symptomatic patients was 10 ± 12 days. For the whole cohort, mean interval from first symptom to full recovery was 13 ± 9 days. Mean interval from first positive to first negative swab was 32 ± 15 days. No patient required oxygen support or hospitalization. |
Screener Results (N = 20)
The screener was configured with PICO criteria: Population (Adults/Children), Intervention (Natalizumab, SID, EID/Ocrelizumab, Fingolimod), and Outcome (Multiple Sclerosis) (see Table 6).
Table 6
Software validation results – screener performance by AI provider.
| PROVIDER | MODEL | TIME (MM:SS) | ERROR RATE | TESTER NOTES |
|---|---|---|---|---|
| Default/GLM | GLM-4.6V-Flash [6] | 2:49 | 0% | Fast; every field found |
| OpenAI | gpt-4o [11] | 3:46 | 0% | Fast, no errors |
| Anthropic | Claude-Sonnet-4–20250514 [12] | 3:46 | 0% | Good; better for screening than OpenAI |
| Cohere | Command-A-03–2025 [14] | 1:33 | 0% | Screened everything; fastest |
| Deepseek | DeepSeek-Chat [13] | 2:58 | 0% | Fast |
The full-text screening result obtained with the default configuration (model: GLM-4.6V-Flash, Z.ai [6]) is shown in Table 7.
Table 7
Software validation results – full-text screening result.
| FILENAME | TITLE | AUTHOR | YEAR | CONFIDENCE | REASON FOR INCLUSION |
|---|---|---|---|---|---|
| 1-s2.0-S0022510X21003166-main.pdf | Wearing-off symptoms during standard and extended natalizumab dosing intervals: Experiences from the COVID-19 pandemic [16] | Gerd Haga Bringeland | 2021 | 0.95 | The paper studies adults (population) receiving natalizumab (intervention) with outcomes related to multiple sclerosis (MS) regarding wearing-off symptoms during standard (SID) and extended (EID) dosing intervals. It meets all inclusion criteria for population, intervention, and outcomes. |
| 1-s2.0-S221103482100612X-main.pdf | Safety of Natalizumab infusion in multiple sclerosis patients during active SARS-CoV-2 infection [17] | Landi D | 2021 | 0.95 | The paper focuses on natalizumab (intervention) in adults with multiple sclerosis (population) during active SARS-CoV-2 infection, which meets all inclusion criteria (population: adults; intervention: natalizumab, SID, EID; outcomes: MS). No exclusion criteria are violated. |
| 1-s2.0-S1878747924000370-main.pdf | Commentary extended interval dosing of natalizumab: More evidence in support [18] | Karlo Toljan | 2024 | 0.95 | The paper evaluates extended interval dosing (EID) of natalizumab versus standard interval dosing (SID) in adults with multiple sclerosis (MS), assessing outcomes such as disease activity, relapse rates, and safety (including PML risk). It meets all PICO criteria: population (adults), intervention (natalizumab, SID/EID), comparison (SID vs EID), and outcomes (MS). |
System performance and robustness
In addition to the software validation precision tests, a broader stress test was conducted using a corpus of 100+ research articles on multiple sclerosis treatments (validation for v2.0.0).
Screener software validation: The tool processed 100 papers in approximately 15 minutes, averaging nine seconds per paper. The error rate was 0%. The hierarchical confidence scoring proved effective, with 0 papers flagged as ‘low confidence’ (<0.6) in the screener mode.
Extractor software validation: Extraction speed averaged ~15 seconds per paper. In a stress test of 100 papers, the bulletproof parsing pipeline successfully recovered data from API responses that would have crashed standard parsers. Only 11 out of 100 papers failed, all due to Streamlit timeouts (a limit of cloud hosting), confirming that local deployment (running the app directly via `streamlit run`) is viable for heavy workloads.
Users can monitor the software’s health via the real-time system terminal logs, which display granular events (e.g., ‘Parser: JSON5 fallback successfully parsed response’).
(2) Availability
Operating system: Platform Independent (web-based/local deployment).
Minimum OS requirements: Any OS supporting Python 3.12+ (Windows, macOS, Linux).
Programming language
Python 3.12+
Additional system requirements
Memory: 1GB RAM recommended.
Disk Space: 250MB for application and dependencies.
Processor: Modern dual-core processor or better.
Network: Active internet connection for API calls (unless using Ollama locally).
Dependencies
streamlit==1.49.0
PyMuPDF==1.26.3
fpdf==1.7.2
zai-sdk==0.1.0
json5==0.12.0
plotly==6.2.0
pandas==2.3.1
python-dotenv==1.1.1
cryptography==45.0.5
streamlit-lottie==0.0.5
python-docx==1.2.0
xlsxwriter==3.2.5
sniffio==1.3.1
firebase-admin==7.1.0
openai==2.14.0
anthropic==0.75.0
cohere==5.16.1
ollama==0.6.1
(See requirements.txt in the repository for a complete and up-to-date list.)
List of contributors
Vihaan Sahu – Lead Developer, Architecture Design, Implementation
Mohith Balakrishnan – Validation, Testing, Error Analysis
Software location
Archive (e.g., institutional repository, general repository)
Name: Zenodo
Persistent identifier: https://doi.org/10.5281/zenodo.18060972
License: Apache 2.0
Publisher: Vihaan Sahu
Version published: 2.1.0
Date published: 26/12/2025
Code repository (e.g., SourceForge, GitHub, etc.)
Name: GitHub
Identifier: https://github.com/aurumz-rgb/ReviewAid
License: Apache 2.0
Date published: 26/07/2025
Validation repository
Name: GitHub
Identifier: https://github.com/ReviewAid/Validation
License: Apache 2.0
Emulation environment (if appropriate): Not applicable. The software runs natively and does not require emulation for legacy systems.
Language
English
(3) Reuse Potential
ReviewAid is designed for high reuse potential across any field that relies on systematic reviews and evidence synthesis.
Primary use cases
Systematic reviews in healthcare and social sciences: Researchers conducting reviews in medicine, psychology, education, and public policy can use ReviewAid to accelerate the screening and data extraction phases. Its PICO-based screening is directly applicable to the vast majority of clinical systematic reviews.
Rapid evidence assessments: For situations where a full systematic review is not feasible due to time constraints, ReviewAid enables a faster preliminary assessment of the literature.
Living systematic reviews: The tool can be integrated into workflows where reviews are updated continuously, as new studies can be screened and extracted with minimal effort.
Privacy-centric research: By utilizing the Ollama integration, researchers dealing with sensitive or confidential data can run the entire pipeline locally without data ever leaving their machine.
Methodological research: Researchers studying systematic review methodology itself can use ReviewAid as a platform to test and compare new screening or extraction algorithms.
Modification and extension potential
Custom model backends: The architecture allows for swapping the underlying AI model. Developers can integrate new LLMs as they are released, offering researchers flexibility in cost and performance.
Domain-specific adaptation: The confidence scoring tiers and parsing pipeline are fully configurable. They can be tuned for specific domains where the criteria for inclusion/exclusion or the structure of documents differ (e.g., patent reviews or historical document analysis).
New functionality modules: The modular design permits adding new features. For instance, a module for automated risk of bias assessment, or a module to extract data from figures and tables using the underlying visual model, could be developed.
Integration with other tools: ReviewAid could be integrated as a component within larger research software ecosystems, such as citation managers (e.g., Zotero [21] and EndNote [20]) or systematic review management platforms (e.g., Covidence [19]).
Support mechanisms
To ensure effective support, comprehensive documentation is available in the GitHub repository, including a detailed README, installation instructions, and user guides for the Screener and Extractor modules. Users can report bugs, request features, or ask questions via the GitHub Issues tracker. For more complex inquiries, direct email support is provided by the developer (vsahu@seu.edu.ge). Furthermore, the project is open to community contributions under the Apache 2.0 license, and developers are encouraged to submit pull requests for bug fixes, new features, or documentation improvements.
Additional File
The additional file for this article can be found as follows:
Acknowledgements
We gratefully acknowledge the developers of GLM-4.6V-Flash (Z.ai) [6] for providing the default AI model used in ReviewAid. We also thank the open-source community for the libraries (Streamlit [22], PyMuPDF [24], and Pandas [25]) that made this project possible.
Competing Interests
The authors have no competing interests to declare.
