Data Management in a Community-Based Birth Cohort: What the SEMILLA Study Teaches Us

Nataly Cadena; Fadya Orozco; Stephanie Montenegro; Fabián Muñoz; Alexis J. Handal

doi:10.5334/dsj-2026-004

Data Science Journal

Volume 25 (2026): Issue 1

Open Access

Feb 2026

Figures & Tables

Figure 1

Organic structure of data management in the SEMILLA research study (2018–2024).

Figure 2

Workflow of data management in the SEMILLA research study (2018–2024).

Table 1

Possible combinations of Permanence and compliance with Assessments.

PERMANENCE	ASSESSMENT: YES	ASSESSMENT: NO
Yes	Participant continued in the study and complied with study activities	Participant continued in the study but did not comply with the activities; the reason is detailed in Observation
No	⦰	Participant did not continue in the study; the reason is detailed in Observation
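
The rule in Table 1 can be expressed as a small validation check. What follows is a minimal sketch in Python, not the SEMILLA team's implementation. The function name `check_tracking_record` and the field names `permanence`, `assessment`, and `observation` are hypothetical, and the sketch assumes that every combination other than Yes/Yes needs a non-empty Observation.

```python
from typing import List, Optional


def check_tracking_record(permanence: bool, assessment: bool,
                          observation: Optional[str] = None) -> List[str]:
    """Return the problems found in one tracking record (empty list if none)."""
    problems = []
    # Treat a missing or whitespace-only Observation as "no reason given".
    has_reason = bool(observation and observation.strip())

    if not permanence and assessment:
        # Table 1 marks this cell with the empty symbol: a participant who
        # left the study cannot have complied with the assessment.
        problems.append("invalid combination: Permanence=No with Assessment=Yes")
    elif not (permanence and assessment) and not has_reason:
        # Assumption: the two remaining non-compliant cells both require
        # a reason to be recorded in Observation.
        problems.append("missing reason in Observation")

    return problems
```

A check of this kind could run on each row of a tracking sheet, so that records with an impossible combination or a missing justification are flagged for the field team.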

Table 2

Key lessons for future cohort studies, based on challenges identified in SEMILLA.

COMPONENT	CHALLENGES IDENTIFIED	LESSON
Planning	The absence of a data management protocol in the early stages of data collection required redesigning instruments and training interviewers while fieldwork was already in progress. This led to initial data entry errors, and the initial choice of ODK as the capture system made it difficult to validate entries promptly.	-Develop a data management protocol to define resources, timelines, and effective procedures for data generation. -Involve the data manager from the instrument design stage to anticipate critical requirements for data collection, such as the appropriate software based on instrument complexity and the workflow needed to guarantee data quality.
Instrument construction and refinement	Long questionnaires caused participant fatigue; some participants memorized the questions and responded mechanically. Additionally, certain concepts were misunderstood (e.g., paid work, marital status, ‘household members’), which required rewording and additional field instructions.	-Validate each instrument not only for content but also for length and usability, evaluating the degree of fatigue of both the participant and the interviewer. -Avoid redundant items and adjust the language to the participants’ sociocultural context and the interviewers’ training level to ensure that each question yields high-quality responses.
Data collection procedures	Omissions of questionnaires, incomplete activities, and typographical errors in idmadre were observed. The Tracking Planner, implemented from the start, allowed weekly monitoring and required clear justifications. Later, migration to CSPro further strengthened this control by incorporating automatic validations and skip checks during data entry.	Implement a monitoring protocol with periodic data entry validations for each instrument to ensure timely correction of inconsistencies and improve data accuracy.
Staff and training	Some interviewers struggled to build rapport and to correctly apply skip patterns or specialized activities.	-Develop a training manual and a checklist of best practices for interviewers, complemented by continuous feedback. -Provide additional support where interviewers need it to establish rapport with respondents and to apply skip patterns or specialized activities accurately.
Instrument coding	At the beginning of the study, both the field and data management teams were still becoming familiar with the coding rules. This learning phase required ongoing supervision to ensure consistent application of the criteria, which initially resulted in some inconsistencies in variable naming and delays in data cleaning. Once the rules were fully standardized and consolidated, errors could be identified and resolved more efficiently.	-Share coding rules with the field team to streamline data cleaning and, if necessary, to facilitate re-interviews. -Automate double-entry procedures wherever possible and run checks for early detection of systematic errors.
Software programming	The initial use of ODK generated multiple technical limitations (e.g., handling complex skip patterns, ensuring longitudinal follow-up). Detecting these problems and migrating to CSPro was a key decision that improved data quality without affecting the fieldwork calendar.	Maintain flexibility in choosing the data collection platform; be open to system migration, even mid-operation, if technical limitations arise. In SEMILLA, transitioning from ODK to CSPro allowed us to resolve operational challenges without disrupting the fieldwork schedule.
Documentation	Technical documentation was prepared only at the end of data collection, so problems such as recall bias in information about pesticide application or last pregnancies were not identified in time and could not be rectified retrospectively.	Prepare as many manuals, protocols, and field reports as possible before starting data collection, as each record is an essential resource for reproducing the workflow, guaranteeing traceability, and facilitating reuse of the data by other researchers.
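
The double-entry check recommended under Instrument coding can be illustrated with a short comparison routine. This is a sketch under stated assumptions rather than the procedure used in SEMILLA. It assumes two independent entry passes of the same questionnaire, each held as a dictionary keyed by the participant identifier (the source names this field idmadre). The function name `compare_double_entry` and the `"<record>"` marker are hypothetical.

```python
from typing import Dict, List, Tuple

# One questionnaire record: variable name -> entered value.
Record = Dict[str, str]


def compare_double_entry(first: Dict[str, Record],
                         second: Dict[str, Record]
                         ) -> List[Tuple[str, str, str, str]]:
    """List discrepancies as (id, variable, value_in_first, value_in_second)."""
    discrepancies = []
    for record_id in sorted(set(first) | set(second)):
        a = first.get(record_id)
        b = second.get(record_id)
        if a is None or b is None:
            # The identifier appears in only one pass, which may point to a
            # typographical error in the identifier itself.
            discrepancies.append((record_id, "<record>",
                                  "present" if a is not None else "missing",
                                  "present" if b is not None else "missing"))
            continue
        for variable in sorted(set(a) | set(b)):
            value_a = a.get(variable, "")
            value_b = b.get(variable, "")
            if value_a != value_b:
                discrepancies.append((record_id, variable, value_a, value_b))
    return discrepancies
```

Running such a comparison after each entry batch, rather than once at the end, would surface mistyped identifiers and systematic coding differences while re-interviews are still feasible.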

DOI: https://doi.org/10.5334/dsj-2026-004 | Journal eISSN: 1683-1470

Language: English

Page range: 4 - 4

Submitted on: Jul 1, 2025

Accepted on: Jan 13, 2026

Published on: Feb 6, 2026

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: data management process, lessons learned, cohort studies

© 2026 Nataly Cadena, Fadya Orozco, Stephanie Montenegro, Fabián Muñoz, Alexis J. Handal, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
