metaScreener: A Plugin-Based Desktop Application for Human-in-the-Loop Systematic Literature Screening

Alejandro Reyes-Consuelo; Jocelyne Kiss; Julien Voisin

doi:10.5334/jors.730

(1) Overview

Introduction

Systematic literature reviews are a cornerstone methodology in evidence synthesis across health sciences, engineering, and social research. The screening phase—in which a potentially large corpus of candidate citations is progressively filtered against explicit inclusion and exclusion criteria—is widely recognized as the most time-intensive component of the review workflow [1]. For reviews targeting multiple bibliographic databases, initial corpora routinely comprise several hundred to several thousand records, and manual screening may require hundreds of person-hours before full-text eligibility assessment can begin [2].

Large language models (LLMs) have recently demonstrated capacity for zero-shot and few-shot classification tasks that closely parallel systematic review inclusion/exclusion screening [3, 4]. Empirical evaluations have yielded encouraging but mixed results: Khraisha et al. [5] found that GPT-4 achieved accuracy on par with human performance on some screening tasks, though performance varied substantially with dataset characteristics and prompt reliability. Matsui et al. [6] reported human-comparable sensitivity using a multi-layered GPT-4 screening strategy, while a large-scale evaluation of 18 LLMs across three clinical systematic reviews demonstrated that workload reductions between 33% and 93% are achievable, provided inclusion and exclusion criteria are carefully formulated [7]. A recent scoping review covering 37 studies confirmed that LLM approaches have been applied to the majority of defined systematic review steps, but concluded that fully validated (i.e., benchmarked against independent human screening decisions), production-ready applications remain scarce [8].

Importantly, the value of automation in evidence synthesis lies not in replacing human judgment but in structuring the interaction between algorithmic efficiency and domain expertise. Kwok et al. [9] found that overlapping systematic reviews on the same topic frequently fail to meet key methodological standards, suggesting that increasing the volume of reviews does not by itself ensure quality. A screening pipeline that preserves deliberate points of human decision-making—rather than delegating the entire process to an opaque model—can help ensure that the resulting reviews remain methodologically sound.

Several semi-automated tools have been proposed to capitalize on this capability; however, existing implementations typically operate as cloud-hosted web services, require manual dataset export between stages, or lack the transparent decision logging that structured evidence synthesis methodology demands [10].

Concurrent with the present work, an unrelated tool that also carries the name MetaScreener has been released by an Oxford-based author group [11]; that tool pursues a web-hosted, multi-provider prompt-engineered approach to title-and-abstract screening. The metaScreener described in this paper was developed independently at Laval University and differs in four substantive design choices: local desktop execution rather than browser hosting, a deterministic-first staged pipeline rather than a single end-to-end LLM call, SHA-256-verified bundle archives providing an end-to-end decision-level audit trail, and a plugin architecture allowing third-party extension of individual screening stages. We acknowledge the naming collision and, for namespace disambiguation, distribute the present tool on PyPI under the package name metascreener-lars-ulaval.

metaScreener addresses these limitations through a locally executable, plugin-based desktop application that chains deterministic and LLM-based filters within a reproducible, auditable pipeline. Each stage operates on a structured bundle file format—a ZIP archive preserving input records, configuration parameters, and decision outputs under a SHA-256 integrity hash—enabling full retrospective audit of every screening decision. The software is designed for single-researcher use on a local machine, requiring no dedicated server infrastructure, and targets any OpenAI-compatible Application Programming Interface (API) endpoint for its LLM inference stages.

Implementation and architecture

metaScreener is implemented in Python 3.x with a Tkinter graphical user interface. The application is structured around a plugin registry in which each screening stage is an independently executable module conforming to a shared interface contract. Plugins consume and produce bundle files that are exchanged sequentially across pipeline stages. This design allows individual stages to be re-run, swapped, or extended without modifying the core application codebase.

The pipeline comprises seven plugins organized into four functional groups: (i) corpus ingestion (Reference Markers, References-of-X AI), (ii) criteria structuring (Criteria Parser), (iii) deterministic heuristic-based filtering (EH, IH), and (iv) LLM-assisted filtering (EL, IL). Figure 1 illustrates the full pipeline architecture and the data flow between groups.

metaScreener pipeline architecture. Plugins are grouped by functional role; arrows indicate bundle data flow between stages.

Table 1 summarizes the seven plugins, their functions, and their operational methods.

Table 1

Plugin inventory for metaScreener version 3. T = inference temperature. Plugin 03 produces a structured criteria_harmonized.csv file consumed by all four downstream filtering plugins.

#	PLUGIN	FUNCTION	METHOD
01	Reference Markers (experimental)	Extracts visually-present reference markers (e.g., `[1]`) from images supplied as PDF or PNG; not designed for standard PRISMA flow diagrams	GPT-4o vision API
02	References-of-X AI	Resolves and enriches bibliographic references via federated API queries	OpenAlex, Crossref, Semantic Scholar
03	Criteria Parser	Converts free-text inclusion/exclusion criteria into a structured, machine-readable criteria file	Rule-based inference, optional LLM refinement
04	EH (Exclusion by Heuristic)	Removes records matching any exclusion criterion at title/abstract level	Deterministic keyword/regex
05	IH (Inclusion by Heuristic)	Retains only records matching at least one inclusion criterion at title/abstract level	Deterministic keyword/regex
06	EL (Exclusion by LLM)	Applies LLM-based eligibility adjudication against exclusion criteria over full record text	OpenAI-compatible endpoint, T=0.0
07	IL (Inclusion by LLM)	Applies LLM-based eligibility adjudication against inclusion criteria over full record text	OpenAI-compatible endpoint, T=0.0

Figure 3 shows the desktop interface for the Criteria Parser plugin (Plugin 03), the stage at which free-text inclusion and exclusion criteria are transformed into the structured criteria_harmonized.csv consumed by all four downstream filtering plugins. The same visual style—a left-hand input panel, a right-hand structured output panel, and a bottom-most log panel showing per-step actions—is used by all seven plugins; differences between plugins are confined to the contents of the input and output panels and the controls in the settings region.

Criteria Parser module

The Criteria Parser (Plugin 03) accepts free-text eligibility criteria as input and produces a structured criteria_harmonized.csv file that serves as the single authoritative specification consumed by all four downstream filtering plugins. This ensures that criterion definitions remain consistent across pipeline stages and that any change to the criteria propagates automatically through the entire workflow.

The module operates in three phases, detailed in Algorithm 1. In the first phase, the free-text input is parsed into individual criterion records using pattern matching: the parser recognizes explicit inclusion criteria (IC)/exclusion criteria (EC) identifiers (e.g., “IC-1 — …”), section headers followed by bulleted lists, and several numbering formats, assigning each criterion a unique identifier, a polarity (inclusion or exclusion), and the original text.

In the second phase, each parsed criterion is submitted to a rule-based inference engine that determines its pipeline stage assignment (Exclusion by Heuristic (EH), Inclusion by Heuristic (IH), Exclusion by LLM (EL), or Inclusion by LLM (IL)) and matching operator. The engine evaluates six pattern categories in sequence: language constraints (mapped to the lang column with an equals operator), publication year thresholds (mapped to the year column with comparison operators), document type filters, venue or journal name matches, DOI presence checks, and keyword-in-text searches. Criteria matching any of these deterministic categories are assigned to the heuristic stages (EH or IH). Criteria that do not match any deterministic pattern default to the LLM-assisted stages (EL or IL) with operator=llm and a configurable confidence threshold (default 0.6). Target field names are validated against the actual columns present in the input corpus to prevent downstream lookup failures.

An optional third phase invokes an LLM endpoint to refine the harmonized output. The model receives the parsed criteria table alongside the list of available corpus columns and is instructed to adjust stage assignments, operators, and operand values. The prompt specifies the four screening stages, the available operators, and the output JavaScript Object Notation (JSON) schema, while enforcing two structural guardrails: the number of output rows must equal the number of input rows, and criterion identifiers and polarities must remain unchanged. If either guardrail is violated, the refinement is rejected and the unmodified rule-based output is preserved. The structured output should be reviewed and confirmed by the researcher before proceeding to the filtering stages, as errors in criterion operationalization at this stage would propagate through all downstream screening decisions.

Deterministic screening modules

Plugins EH and IH implement rule-based screening at the title and abstract level. EH evaluates each record against a set of exclusion criteria expressed as keyword lists or regular expressions; records matching any exclusion criterion are removed from the active corpus and logged with the triggering rule identifier. IH subsequently retains only those records matching at least one inclusion criterion, with the same decision-logging semantics.

Both plugins execute without LLM inference and impose no token cost, making them suitable for high-volume pre-filtering. In practice, the deterministic stages are expected to account for the majority of exclusions in a well-specified review, with the LLM stages providing fine-grained adjudication over the residual candidate set.

LLM-assisted screening modules

Plugins EL and IL extend the deterministic pipeline with LLM-based eligibility assessment at the full record level. Both modules submit structured prompts to any OpenAI-compatible API endpoint with inference temperature set to 0.0 to minimize output variability across repeated runs, although strict determinism cannot be guaranteed by temperature setting alone due to hardware-level floating-point non-determinism in model inference. Responses are parsed as JSON arrays in which each element carries a record identifier, an eligibility decision token, and a supporting verbatim quotation extracted from the record text. A prompt versioning field (PROMPT_VERSION) is recorded in the bundle manifest to support reproducibility across model updates.

A response-level validity flag (valid_quote) is assigned by checking whether the supporting quotation returned by the model is traceable to the source record text. This flag provides a lightweight signal of response quality without requiring human adjudication: records with valid_quote = False are not automatically excluded but are surfaced for reviewer attention. All LLM responses are persisted in a local cache keyed by record content hash, so that re-running a stage over an unchanged corpus does not incur additional API cost. Records receiving an uncertain decision at any LLM stage are not excluded from the pipeline but are assigned to a flagged queue for human adjudication, ensuring that ambiguous classifications do not produce silent false negatives.

Algorithms 2 and 3 present pseudocode for EL and IL, respectively. The two algorithms share an identical structural pattern; they differ only in the criterion type supplied to the prompt (exclusion versus inclusion) and in the resulting partition logic applied to the output.

Reproducibility and audit trail

Each plugin in the metaScreener pipeline produces a bundle ZIP archive containing three core artifacts: (i) a records comma-separated values (CSV) file carrying the full record set with per-record decision annotations appended at the current stage; (ii) a manifest JSON file recording pipeline configuration including the criteria file hash, prompt version string, model identifier, and Coordinated Universal Time (UTC) execution timestamp; and (iii) a decision log CSV providing a flat record of all eligibility decisions with their supporting evidence.

Bundle files are integrity-verified using SHA-256 hashes at ingestion and export, ensuring that any modification to the record set or configuration between pipeline stages is detectable. The LLM response cache in plugins EL and IL preserves all API responses keyed by record content hash, enabling exact reproduction of screening decisions given the same model, prompt version, and temperature setting. Together, these mechanisms produce an audit trail that satisfies the traceability requirements expected in the systematic review methodology documentation.

Quality control

The metaScreener codebase has been verified through functional testing of individual plugins and an end-to-end feasibility demonstration on a real-world corpus. Each plugin can be executed independently against known input bundles, and the resulting output bundles can be compared against expected decision outcomes.

The demonstration use case described below illustrates the pipeline’s operational behavior and throughput on a representative corpus; a complementary measurement of human-versus-LLM agreement on a subset of the same corpus is reported in the Human validation subsection that follows. The pipeline was executed end-to-end on a corpus of 776 candidate records examining head-mounted display (HMD) virtual reality (VR) interventions. The pipeline was configured with six eligibility criteria operationalized via the Criteria Parser: two exclusion criteria targeting non-English language publications (EC-1) and conference proceedings (EC-4), and four inclusion criteria targeting English-language publications (IC-1), empirical studies (IC-2), publications from 2018 onwards (IC-3), and studies involving HMD VR as a primary intervention (IC-4). The full set of input bundles, output bundles, the harmonized criteria file, and the LLM response caches used to produce the funnel reported below is committed under docs/data/ in the project repository, providing a persistent archive for independent replication.

Table 2 presents the complete screening funnel; Figure 2 displays the same data as a sequential flow diagram.

Table 2

Sequential screening funnel for the demonstration use case (initial corpus $N = 776$ ).

STAGE	INPUT	SURVIVORS	EXCLUDED	PRIMARY EXCLUSION REASON
Initial corpus	776	776	—	752 English, 14 French; years 1962–2025
EH (Exclusion by Heuristic)	776	651	125	Conference proceedings ( $n = 112$ ); non-English ( $n = 13$ )
IH (Inclusion by Heuristic)	651	85	566	Publication year < 2018 ( $n = 556$ ); non-English ( $n = 10$ )
EL (Exclusion by LLM)	85	85	0	No records met exclusion criteria
IL (Inclusion by LLM)	85	73	12	Did not meet HMD VR inclusion criterion (IC-4)
Final review corpus	—	73	703	90.6% reduction from initial corpus

Sequential screening funnel for the demonstration use case. Excluded records are shown with exclusion counts at each stage transition.

metaScreener desktop interface, shown on the Criteria Parser plugin (Plugin 03). The left panel accepts free-text inclusion and exclusion criteria; the right panel displays the structured harmonized table, with each row’s pipeline-stage assignment (EH/IH/EL/IL) and matching operator determined by the rule-based inference engine described in Algorithm 1. The log panel at the bottom shows the harmonizer parsing eight criteria and applying optional LLM refinement.

The deterministic pre-filtering stages (EH and IH) accounted for 691 of the 703 total exclusions (98.3%) without incurring any LLM inference cost. The EL stage correctly identified zero records meeting the configured exclusion criteria, consistent with the narrow scope of the exclusion criteria relative to the post-IH corpus composition. The IL stage excluded 12 records that did not satisfy the HMD VR inclusion criterion, yielding a final review corpus of 73 records. LLM response caching across 170 EL cache entries demonstrated stable decision behavior under temperature = 0.0, with the persistent cache providing an additional reproducibility safeguard by replaying stored decisions on re-runs independently of model-level determinism. Of the 170 cached responses, 169 reached a valid_quote=True classification. The single record classified as valid_quote=False was not automatically excluded; consistent with the evidence-gating protocol described above, it was surfaced for human review, ensuring that no record is removed from the pipeline without a verified supporting quotation.

Bundle integrity verification was confirmed by re-loading each stage output and verifying the SHA-256 hash against the manifest. Users can replicate this integrity check by running the pipeline on the sample corpus included in the project repository and comparing the resulting bundle contents against the reference outputs documented in the repository’s README.

Human validation of LLM-stage decisions

In response to the reviewer’s compulsory request for a clearer evaluation of LLM decision quality, we conducted a structured human-versus-LLM agreement study on the three LLM-adjudicated criteria from the demonstration corpus: IC-1 (inclusion: the paper considers immersive virtual reality OR a virtual simulation using a head-mounted display), EC-2 (exclusion: the paper’s primary focus is spatial navigation in a virtual maze), and EC-3 (exclusion: the paper’s primary focus is the rubber hand illusion paradigm). The five deterministic criteria in the same corpus (language detection, year threshold, document-type filter, keyword presence) are out of scope for this comparison, since deterministic filters always agree with themselves and validation against human judgment would be a category error. The full methodology, raw human and LLM decisions, confusion matrices, and disagreement subsets are committed under docs/llm-evaluation.md and docs/data/ in the project repository.

The evaluation used a multi-annotator design with stratified overlap. Three raters (the paper’s authors) adjudicated 254 LLM-screening decisions in total. A fixed 15-record overlap per stage was rated by all three raters and supports the inter-rater Fleiss’ kappa; the remaining records were partitioned disjointly across raters with load-balanced sampling under a fixed seed, and contribute additional pairs to the human-versus-LLM Cohen’s kappa. Raters were blinded to the LLM’s per-record decision: the grid generator deliberately strips the bundle’s LLM-evidence columns before producing the rater workbooks, and a unit test guards this invariant. Raters worked from the same abstract-level evidence (title, abstract, keywords, venue, DOI) seen by the LLM, with the option to consult full text if the abstract was considered genuinely insufficient. Dropdown options presented to each rater quoted the criterion text verbatim, so that no rater needed to translate from natural-language judgment to operator vocabulary or reason about polarity. To compare humans and LLM on a single canonical scale, each LLM status value (MET, FAILED, UNCERTAIN) was mapped to a polarity-aware canonical decision (yes the criterion’s claim holds, no it does not, unsure); this mapping inverts for exclusion criteria, since MET on an exclusion criterion means the paper passes screening (i.e., the criterion’s claim does not hold).

Table 3 reports the per-criterion agreement metrics. Confusion matrices, raw decisions, and the 88-row disagreement subset are available in the repository.

Table 3

Human-versus-LLM agreement on the three LLM-adjudicated criteria from the demonstration corpus. Cohen’s $κ$ is computed between the human aggregate decision and the LLM canonical decision; Fleiss’ $κ$ is computed across the three raters on the 15-record overlap subset per stage. $P_{obs}$ is the percent observed agreement (human vs. LLM) over the same N.

STAGE	CRITERION	COHEN’S $κ$	FLEISS’ $κ$	$P_{obs}$	N
EL	EC-2 (spatial-navigation focus)	–0.05	–0.13	83.5%	85
EL	EC-3 (rubber-hand-illusion focus)	0.10	–0.05	87.1%	85
IL	IC-1 (HMD VR/virtual simulation)	0.28	0.26	56.0%	84

On the two exclusion-stage criteria (EC-2 and EC-3), the human aggregate and the LLM agree on 83.5% and 87.1% of records, respectively, while Cohen’s $κ$ is –0.05 and 0.10. This apparent contradiction reflects the prevalence paradox in agreement measurement [12, 13]: when one decision category dominates the marginal distribution—here, approximately 95% of records are judged as not about spatial navigation in a maze (EC-2) and approximately 96% as not about the rubber hand illusion paradigm (EC-3) by both human aggregate and LLM—expected agreement under chance is already near ceiling, and Cohen’s formula divides a small numerator by a small denominator. Kappa becomes informationally degenerate in this regime. The substantive finding is that, in the dominant decision category, LLM behavior on exclusion-stage criteria tracks human reviewer behavior at high observed agreement; reporting observed agreement alongside kappa avoids both under-reporting actual concordance and encouraging post-hoc claims about LLM weakness that the data does not support.

The inclusion-stage criterion (IL/IC-1) shows fair agreement on both metrics ( $κ = 0.28$ Cohen, $κ = 0.26$ Fleiss; both in Landis and Koch’s fair band [14]), with observed agreement at 56.0%. The Fleiss’ $κ$ is itself informative about the criterion’s difficulty: even among human raters working from abstracts alone, inter-rater agreement on the IC-1 question is fair rather than substantial. The dominant disagreement pattern is asymmetric: of the 52 disagreements between the human aggregate and the LLM on IC-1, 29 are records on which the human aggregate is decisive (yes or no) and the LLM has elected UNCERTAIN, while only 4 are records on which the LLM is decisive against a human-aggregate unsure. This pattern is consistent with the conservative-by-design intent stated in the Introduction: rather than auto-deciding marginal cases, the inclusion stage propagates uncertainty to a human-in-the-loop review queue. We do not claim that this asymmetry generalizes beyond the present corpus; $N = 52$ disagreements is a small sample, and validating the pattern across diverse corpora and criteria sets is future work.

Limitations

The findings reported in this subsection are bounded by several conditions. First, the sample is small: 254 LLM-screening decisions across three criteria, with $N = 15$ items per stage contributing to the inter-rater computation. Individual confusion-matrix cells with low counts carry high relative uncertainty; reading the matrices, not just the scalar kappa, is essential. Second, the raters are the paper authors—a known in-group bias. Truly independent multi-disciplinary adjudication (e.g., a panel of domain experts unconnected to metaScreener’s development) would be a stronger validation design. The present exercise responds to the reviewer’s compulsory item but should not be over-interpreted as a population-level estimate of LLM-versus-human agreement. Third, the agreement metrics are reported on a single corpus and a single set of criteria; generalization across review domains is future work. Fourth, all raters worked from abstract-level evidence to match what the LLM sees; this is methodologically appropriate for the title-and-abstract screening stage but does not validate what either humans or LLMs would conclude after full-text review. Fifth, decisions were taken at the LLM’s default temperature setting for the bundle ( $T = 0.0$ ); run-stability under repeated sampling at non-zero temperature has not been measured. Finally, agreement is reported in aggregate and is not broken down by record characteristics such as publication year, venue, or source database; a longer corpus would justify such a subgroup analysis.

An automated test suite comprising 166 unit tests is provided in the tests/ directory and can be executed via pytest. The suite covers five areas: (i) criteria parsing (_parse_free_text_criteria and _infer_criterion_details), verifying that free-text eligibility statements are correctly decomposed into structured criterion records with appropriate stage assignments and operators; (ii) bundle integrity, verifying SHA-256 hash computation and manifest schema conformance; (iii) deterministic filter logic (EH/IH), testing the _eval_criterion function against known records and criteria for both exclusion and inclusion semantics; (iv) evidence-gating utilities (EL/IL), testing quote-in-text validation and LLM cache key construction without requiring API credentials; and (v) the agreement-ingestion pipeline supporting the human-versus-LLM validation reported in the Human validation subsection, including pure-Python Cohen’s and Fleiss’ kappa implementations exercised against textbook reference values. All tests execute in under one second and require no OpenAI API key, network access, or graphical display server.

(2) Availability

Operating system

Windows, macOS, Linux (any platform supporting Python 3.x and Tkinter). GUI functionality was developed and tested on Windows 10; the automated test suite was additionally executed on Ubuntu 24.04 (headless, via Docker) to confirm cross-platform module compatibility.

Programming language

Python 3.x (tested with Python 3.10+).

Additional system requirements

No special hardware requirements. An internet connection is required for LLM inference stages (Plugins 03, 06, 07) and for bibliographic API queries (Plugin 02). Corpus size is limited only by available system memory.

Dependencies

Tkinter (Python standard library), openai (OpenAI-compatible API client), pymupdf, Pillow, pytesseract, rapidfuzz, requests, pandas, openpyxl, langdetect. A complete list of pinned dependencies is maintained in the project repository. The software can be installed via PyPI: pip install metascreener-lars-ulaval.

List of contributors

Alejandro Reyes-Consuelo — Conceived, designed, and implemented the software, designed the demonstration use case, and wrote the manuscript.

Jocelyne Kiss — Provided supervisory feedback on the project and the manuscript.

Julien Voisin — Provided methodological and technical guidance and revised the manuscript.

Software location

Archive

Name: Zenodo
Persistent identifier: 10.5281/zenodo.19360124
Licence: MIT
Publisher: Zenodo
Version published: 3.1.0
Date published: May 2026

Code repository

Name: metaScreener
Identifier: https://github.com/lars-ulaval/metaScreener
Licence: MIT
Date published: May 2026

Language

English.

(3) Reuse Potential

metaScreener is designed for reuse across any research domain in which systematic literature screening is conducted. The plugin architecture allows researchers to adapt the pipeline to discipline-specific workflows by reconfiguring criteria definitions, substituting LLM endpoints, or developing new plugins that conform to the shared interface contract. Because the software targets any OpenAI-compatible API endpoint, it can be used with commercial services (e.g., OpenAI, Azure OpenAI) as well as locally hosted open-weight models served through compatible inference frameworks (e.g., Ollama, llama.cpp, vLLM), giving researchers flexibility to balance cost, privacy, and performance considerations. Switching between providers requires only setting the OPENAI_BASE_URL environment variable; no code changes are necessary.

Extending the pipeline with a new screening stage is intentionally straightforward. Each plugin is a self-contained module under plugins/ that implements three responsibilities: consuming an input bundle and validating its manifest, applying its stage-specific logic, and writing an output bundle with an updated manifest and decision log. The shared plugins/_common/ package provides reusable helpers for bundle ingestion and export, prompt formatting, LLM cache management, and evidence gating, so that new plugins can be written without reimplementing the audit-trail infrastructure. Examples of natural extension points include a duplicate-record detector ingesting a corpus bundle and emitting a deduplicated bundle, a domain-specific keyword expander producing enriched search terms before EH/IH, and an inter-rater agreement adjudicator consuming two parallel decision logs.

External data integration is supported at two points in the pipeline. The References-of-X plugin (Plugin 02) issues federated queries to OpenAlex, Crossref, and Semantic Scholar to resolve and enrich bibliographic records ingested from PRISMA flow images or reference lists. Researchers can replace this plugin or add additional bibliographic sources by implementing the same input/output bundle contract. At the criteria stage, the Criteria Parser supports column-name validation against any arbitrary corpus schema, so corpora produced by reference management tools (e.g., Zotero, EndNote, Mendeley) or downloaded from bibliographic databases (e.g., PubMed, Scopus, Web of Science) can be ingested without preprocessing as long as their column names are referenced in the eligibility criteria.

The structured bundle format provides a reusable data exchange mechanism: bundles produced by metaScreener can be consumed by downstream analysis scripts or archival workflows independently of the application itself. Although the current version is designed for single-researcher use, the plugin architecture could be extended in future versions to support multi-reviewer workflows, for example, through additional plugins implementing inter-rater agreement computation or consensus adjudication stages.

Future development directions include support for batch inference APIs to reduce per-record latency, UI exposure of the per-stage confidence threshold currently configurable through the harmonized criteria file, an integrated inter-rater agreement module to support dual-reviewer validation workflows required in high-stakes systematic reviews, and a per-stage running-time estimator for the LLM stages to help researchers plan large screening runs.

Community contributions are welcomed via the public GitHub repository (https://github.com/lars-ulaval/metaScreener). Bug reports and feature requests can be submitted through the issue tracker; pull requests conforming to the existing code style are encouraged. Support is provided through the repository’s issue tracker.

Acknowledgements

The authors acknowledge the support of the Center of Interdisciplinary Research in Rehabilitation and Social Integration (CIRRIS), Laval University, Québec, Canada, and the International Observatory on the Societal Impacts of AI and Digital Technologies (OBVIA).

Author Contributions

A.R.C. - conceived, designed, and implemented the software, designed the demonstration use case, and wrote the manuscript. J.K. - provided supervisory feedback on the project and reviewed the manuscript. J.V. - provided methodological and technical guidance, and reviewed and revised the manuscript.