Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

Figure 6.

Figure 7.

Figure 8.

Expanded key studies on bovine vocalization analysis
| No. | Reference | Context | Recording setup | Algorithm applied | Data volume (calls/hours) | Performance metric(s) | Major insight/Key finding |
|---|---|---|---|---|---|---|---|
| 1 | Mac et al., 2022 | Calf distress at weaning | 3 chest-high mics, 44.1 kHz, indoor pen | k-NN on MFCC mean ± SD | 600 calls | 94% accuracy | High-pitched, long calls reliably indicated distress |
| 2 | Sharma and Kadyan, 2023 | Dairy estrus detection | Neck collar mic, 16 kHz | SVM, RF comparison | 2000 calls | SVM 95% accuracy | Estrus vocalization has signature harmonic pattern |
| 3 | Vidana-Vila et al., 2023 | Continuous barn monitoring | 12 ceiling mics, 8 kHz | MobileNet CNN detector | 25 h audio | AUROC 0.93 | Real-time detection feasible on edge device |
| 4 | Patil et al., 2024 | Hunger vs. cough vs. estrus | Hand-held recorder, 48 kHz | 7-layer CNN | 5200 clips | 0.97 accuracy | Deep CNN discriminates four intent categories |
| 5 | Ferrero et al., 2023 | 6-class health dataset | Static barn mic array | CNN-LSTM hybrid | 7800 segments | 0.80 macro-F1 | Temporal context boosts recall on rare classes |
| 6 | Röttgen et al., 2020 | Individual ID in group | Dual-mic collar (airborne + structure-borne) | CNN | ∼2171 events | 87% correct cow ID | Wearable sensors enable individual vocal detection |
| 7 | Hagiwara, 2023 | Self-supervised AVES | Mixed-species archive, cow subset | Transformer encoder | 160 h unlabeled + 800 labeled | +7 pp F1 vs. CNN | SSL cuts annotation cost, improves few-shot performance |
| 8 | Martinez-Rau et al., 2023a | Chew detection collar | Collar mic + accel | RF on chewing spectra | 4 h per cow × 20 | 92% chew vs. rumination | Detects feeding bouts for intake estimation |
| 9 | Gavojdian et al., 2024 | Stress isolation study | Lav-mics, 22 kHz | Bi-LSTM | 3000 sequences | 0.91 F1 | Sequence model spots stress more reliably |
| 10 | Sattar, 2022 | Multi-intent cough/food/estrus | 6 mics, 48 kHz | Spectrogram CNN | 4400 clips | 0.82 macro-F1 | Combined dataset demonstrates multi-class viability |
| 11 | Peng et al., 2024 | Behavior fusion EdgeNeXt | Audio + ACC | EdgeNeXt + fusion | 220 h | 95% behavior acc. | Multimodal fusion > single modality |
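
Several rows above report macro-F1 rather than plain accuracy. The following is a minimal sketch, using made-up labels that do not come from any of the cited studies, of why the two metrics can diverge when a welfare-critical class is rare. The function name `macro_f1` and the toy label lists are assumptions introduced for illustration only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (each class counts equally)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical data): 9 normal calls, 1 pain call,
# and a classifier that always predicts "normal".
y_true = ["normal"] * 9 + ["pain"]
y_pred = ["normal"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, macro-F1={score:.2f}")
```

In this toy case the classifier misses the only pain call, yet accuracy stays high while macro-F1 falls below 0.5. This is one reason to compare accuracy and macro-F1 figures from different studies with caution.
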
Comparative analysis of AI methods in bovine vocalization classification
| No. | AI approach/Architecture | Typical training data volume | Key input representation | Reported best accuracy/F1 | Strengths in reviewed studies | Main limitations/Failure modes | Representative use-case(s) |
|---|---|---|---|---|---|---|---|
| 1 | Random Forest (RF) | ≈500–3000 labeled calls | Hand-crafted MFCC + temporal stats | 88–93% F1 (distress vs. non-distress) | Robust to noise, interpretable feature importance | Needs manual feature engineering, weak on temporal context | Estrus-call detection (Sharma and Kadyan, 2023) |
| 2 | Support Vector Machine (SVM) | 200–2000 calls | MFCC mean ± SD, fundamental frequency (F0) | 86–95% accuracy (estrus vs. baseline) | Performs well on small datasets, strong margins | Sensitive to parameter tuning, scales poorly with >10 k samples | Early estrus detection wearables (Peng et al., 2023) |
| 3 | k-Nearest Neighbour (k-NN) | 600 calls | Spectral centroid, duration, energy | 94% accuracy for open- vs. closed-mouth calls | Simple, no training time | Storage heavy, cannot model sequence | Call-type classifier in Japanese Black cattle (Peng et al., 2023) |
| 4 | CNN (2-D spectrogram) | ≥5000 call segments | Mel-spectrogram images (128 bins) | 97% accuracy, 0.96 F1 (multi-class-4) | Learns spectral patterns, no manual features | Needs GPU and large data, poor temporal memory alone | Multi-intent classifier (hunger, cough, estrus, normal) (Patil et al., 2024) |
| 5 | Lightweight CNN (MobileNet) | 25 h continuous barn audio | 64-bin log-mel | AUROC 0.93 at 1 s stride | Fast edge inference (<20 ms), low power | Precision drops in heavy machinery noise | Real-time call detection collar (Vidana-Vila et al., 2023) |
| 6 | LSTM/Bi-LSTM | 3000 labeled sequences | Per-frame MFCC + delta MFCC (time series) | 91% F1 (calf isolation vs. contact) | Captures temporal dynamics, good on sequences | Over-fitting on short clips, GPU-heavy | Isolation stress monitor (Martinez-Rau et al., 2025) |
| 7 | Hybrid CNN + LSTM | 7800 segments (6 classes) | CNN spectrogram embedding → LSTM | 80% overall F1, +6 pp over CNN-only on rare classes | Combines spectrum + sequence info | Needs >10 k samples to beat pure CNN | Multi-class health event detector (Ferrero et al., 2023) |
| 8 | Transformer Audio Encoder (AVES) | 160 h unlabeled pre-training + 800 labels fine-tuning | Raw 16 kHz waveform | +3–7 pp F1 over baseline CNN | Self-supervised, strong few-shot performance, domain adaptable | Needs GPU for pre-training, complex | Few-shot call classification after self-supervised pre-training (Hagiwara, 2023) |
| 9 | EdgeNeXt Multi-Sensor Fusion | 220 cow-hours (ACC, audio) | Spectrogram + 6-DoF inertial images | 95% accuracy behavior classification | Multimodal, noise-robust | Needs synchronized sensors, heavy preprocessing | Social licking vs. ruminating (Peng et al., 2024) |
| 10 | Explainable AutoML DT/Rule set | 1200 calls | 24 acoustic stats features | 90% accuracy, full rule trace | Human-readable decision paths | 3–4 pp lower F1 vs. deep nets | White-box distress detection |
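
The classical rows above (RF, SVM, k-NN) share one input pattern: per-frame features are collapsed into a fixed-length "mean ± SD" vector before classification. The sketch below illustrates that pattern with a plain k-NN vote. It assumes the per-frame features have already been extracted; the numbers are made-up toy values, not real MFCCs, and the names `summarize` and `knn_predict` are hypothetical helpers, not taken from any reviewed study.

```python
import math
from collections import Counter


def summarize(frames):
    """Collapse per-frame feature vectors into one [means..., SDs...] vector."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    sds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return means + sds


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy frame-level features (2 coefficients per frame); values are invented.
train_calls = [
    ([[1.0, 0.2], [1.2, 0.1], [0.9, 0.3]], "contact"),
    ([[1.1, 0.2], [1.0, 0.2], [1.3, 0.1]], "contact"),
    ([[3.0, 1.5], [3.4, 1.1], [2.8, 1.9]], "distress"),
    ([[3.2, 1.4], [2.9, 1.6], [3.5, 1.0]], "distress"),
]
train = [(summarize(frames), label) for frames, label in train_calls]

query = summarize([[3.1, 1.3], [3.3, 1.5], [2.7, 1.2]])
predicted = knn_predict(train, query, k=3)
print(predicted)
```

Because the summary vector discards frame order, this kind of pipeline cannot model sequence, which matches the limitation listed for k-NN and RF in the table.
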
Acoustic characteristics and contextual interpretation of bovine vocalizations
| Vocalization type | Dominant frequency (Hz) | Typical duration (s) | Typical mouth/posture | Principal behavioral context | Practical welfare interpretation |
|---|---|---|---|---|---|
| Maternal contact (lowing/closed-mouth call) (Green et al., 2021) | F0 ∼120–280 Hz (mean ∼180 Hz) | ∼0.8–2.5 s | Closed or partially open, head lowered toward calf | Cow-calf proximity, gentle bonding, reassurance | Indicates calm social contact and maternal bonding, normally a positive welfare cue |
| Calf isolation distress call (Mac et al., 2022) | F0 ∼450–780 Hz | ∼1–4 s (modal ∼2 s) | Open-mouth, elevated head, often repeated bouts | Calf separated from dam/herd | Signals acute distress, should trigger rapid reunion or comfort |
| Adult distress/pain call (Martinez-Rau et al., 2025) | F0 ∼600–1200 Hz | >2 s (mean ∼3.1 s) | Fully open mouth, tense neck | Pain (e.g., lameness, injury) or extreme fear | High-urgency alert, immediate welfare check required |
| Hunger/feed-anticipation call (Sattar, 2022) | F0 ∼220–380 Hz | ∼0.5–2.0 s | Open-mouth, pacing near feed-gate | Imminent feeding, empty trough | Indicates motivational state (feed expectation) |
| Estrus (heat) call (Sharma and Kadyan, 2023) | F0 ∼160–320 Hz (rich harmonic stack) | ∼0.8–3 s | Extended vocal tract, head raised | Reproductive behavior, seeking mates | Reliable cue for breeding/AI scheduling, positive management indicator |
| Social affiliative call (Schnaider et al., 2022a) | F0 ∼110–260 Hz | ∼0.4–1.2 s | Closed-mouth, nasal | Group re-joining, mild excitement | Normal herd cohesion signal, neutral/positive welfare |
| Alarm/novel object call (Miron et al., 2025) | F0 ∼650–1100 Hz | – | Sudden, sharp, head-up stance | Perceived predator, startling event | Short-term fear, monitor environment and animal safety |
| Cough/respiratory (Sattar, 2022) | Broadband burst 200–1200 Hz | ∼0.12–0.35 s | Forced exhalation, closed glottis | Respiratory irritation or disease onset | Early health-risk indicator (e.g., BRD), triggers clinical exam |
| Pain-related moan (low-frequency) (Volkmann et al., 2021) | F0 ∼90–190 Hz | ∼1.5–5 s | Mouth partially open, minimal movement | Chronic discomfort (lameness, parturition) | Persistent occurrence warrants veterinary assessment |
| Play/excitement call (Vogt et al., 2025) | F0 ∼260–450 Hz | ∼0.3–0.9 s | Short bursts during running/bucking | Calf play, social excitement | Positive affect indicates good welfare environment |
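
The F0 and duration ranges in the table above overlap considerably. As a rough illustration, the sketch below encodes those ranges as a lookup and returns every call type consistent with a measured F0 and duration. It is an assumption-laden toy, not a validated classifier: the cough row is omitted because it is described as a broadband burst rather than an F0 range, the "> 2 s" entry is treated as an inclusive lower bound, and `CALL_RANGES` and `candidate_call_types` are hypothetical names.

```python
# (label, F0 min Hz, F0 max Hz, duration min s, duration max s)
# None means the table gives no bound for that field.
CALL_RANGES = [
    ("maternal contact", 120, 280, 0.8, 2.5),
    ("calf isolation distress", 450, 780, 1.0, 4.0),
    ("adult distress/pain", 600, 1200, 2.0, None),
    ("hunger/feed anticipation", 220, 380, 0.5, 2.0),
    ("estrus", 160, 320, 0.8, 3.0),
    ("social affiliative", 110, 260, 0.4, 1.2),
    ("alarm/novel object", 650, 1100, None, None),
    ("pain-related moan", 90, 190, 1.5, 5.0),
    ("play/excitement", 260, 450, 0.3, 0.9),
]


def candidate_call_types(f0_hz, duration_s):
    """Return all call types whose tabulated ranges contain the measurement."""
    matches = []
    for label, f_lo, f_hi, d_lo, d_hi in CALL_RANGES:
        if not (f_lo <= f0_hz <= f_hi):
            continue
        if d_lo is not None and duration_s < d_lo:
            continue
        if d_hi is not None and duration_s > d_hi:
            continue
        matches.append(label)
    return matches


low_call = candidate_call_types(180, 1.0)
high_call = candidate_call_types(700, 3.0)
print(low_call)
print(high_call)
```

Both example measurements match three call types each, which suggests that F0 and duration alone cannot separate the contexts and that additional spectral, temporal, or behavioral cues are needed.
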
Key public (or semi-public) datasets and benchmarks for bovine bioacoustics – scope, best-fit models, and limitations
| No. | Dataset/Benchmark | Scope and modality snapshot | Best-fit models and intended task | Strengths for model development | Main limitation |
|---|---|---|---|---|---|
| 1 | CowVox-2023 Mini (Sharma and Kadyan, 2023) | 8 h audio, 10k labeled calls, 2 Holstein farms | SVM/RF for estrus-call detection | Clean labels, free download (CC-BY) | Narrow breed and low noise |
| 2 | DeepSound26 Archive (Ferrero et al., 2023) | 120 h audio + 5 h collar IMU, 4 farms/3 breeds | CNN-LSTM fusion for multimodal health events | Synchronized streams; individual IDs | Non-standard file names; requires resync |
| 3 | BEANS bovine subset (Hagiwara et al., 2023) | 6 h cow audio inside 35 h multi-species corpus | Wav2Vec 2.0 or AVES SSL encoder for zero-/few-shot stress detection | Noise-rich clips; ready for SSL pre-train | Sparse bovine labels; class imbalance |
| 4 | Agri-LLM Pilot Set (Chen et al., 2024) | 200 paired “call → English tag” clips, Jersey herd | AudioGPT/GPT-J prompt-tuning for captioning | Paired acoustic–semantic examples | Tiny; heavy text bias |
| 5 | SmartFarm Open-Noise (Martinez-Rau et al., 2025) | 40 h barn ambience (negative class), 5 barn layouts | Spec-BERT masking or NRFAR denoiser pre-train | Diverse negative class for contrastive learning | No positive calls; must be combined with other sets |
Compact overview of sensor types that can be fused with barn-acoustic streams on low-power edge devices
| No. | Sensor (mount) | Signal + Edge load* | Audio-synergy example (welfare alert) | Field limitation |
|---|---|---|---|---|
| 1 | Tri-axial ACC (collar/ear) (Martinez-Rau et al., 2023a; Peng et al., 2024) | 100 Hz; low (3 kbps) | Chew rate high + high-F0 “feed call” → early feeding cue | Battery life; collar fit |
| 2 | UWB/RFID (tag grid) (Wang et al., 2022) | Distance events; low | >10 m isolation + distress bawl → weaning-stress alert | Antenna cost; metal interference |
| 3 | Thermal cam (fixed) (Slob et al., 2021) | 5–15 fps; med (0.5 Mbps) | Eye-temp high + panting sound → heat-stress risk | Night IR; occlusion |
| 4 | Top view RGB cam (Arazo et al., 2022) | 25 fps; high unless pruned | Limp posture + low-F moan → lameness warning | Bandwidth; privacy; dirt |
| 5 | Directional mic (collar) (Röttgen et al., 2020) | 16 kHz; low | High-F0 estrus call detected → identify cycling cow | Collar fit; battery life |
| 6 | NH3/CO2 gas (wall) (Pérez-Granados and Schuchmann, 2023) | 1 Hz; low | Gas spike + drop in calling → respiratory risk | Sensor drift |
| 7 | Water trough pressure mat (Shi et al., 2024) | Sip events; low | Few sips + thirst call → blocked drinker alert | Hardware wear |
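
The "audio-synergy" column above pairs an acoustic detection with a second sensor cue before raising an alert. One way such a rule could be sketched is shown below. Everything here is illustrative: the event identifiers, the per-cow matching, and the 60 s coincidence window are assumptions made for the example and are not specified in the reviewed studies.

```python
from dataclasses import dataclass


@dataclass
class Event:
    cow_id: str
    kind: str           # hypothetical identifier, e.g. "distress_bawl"
    timestamp_s: float  # seconds since start of recording


# (audio cue, sensor cue) -> alert, paraphrasing three rows of the table.
FUSION_RULES = {
    ("distress_bawl", "isolation_gt_10m"): "weaning-stress alert",
    ("panting_sound", "eye_temp_high"): "heat-stress risk",
    ("low_f_moan", "limp_posture"): "lameness warning",
}


def fuse(audio_events, sensor_events, window_s=60.0):
    """Raise an alert when a matching audio and sensor cue co-occur for one cow."""
    alerts = []
    for a in audio_events:
        for s in sensor_events:
            alert = FUSION_RULES.get((a.kind, s.kind))
            same_cow = a.cow_id == s.cow_id
            close_in_time = abs(a.timestamp_s - s.timestamp_s) <= window_s
            if alert and same_cow and close_in_time:
                alerts.append((a.cow_id, alert))
    return alerts


audio = [
    Event("cow-17", "distress_bawl", 100.0),
    Event("cow-03", "low_f_moan", 500.0),
]
sensors = [
    Event("cow-17", "isolation_gt_10m", 130.0),
    Event("cow-03", "limp_posture", 900.0),  # too far apart in time
]
alerts = fuse(audio, sensors)
print(alerts)
```

Requiring both cues is meant to reduce false alerts from either stream alone. In practice the window length and the way fixed sensors are attributed to individual animals would need to be tuned per site.
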
Key technical and ethical challenges in AI-driven bovine bioacoustics vs. proposed solutions
| No. | Challenge/Pain-point | Underlying cause(s) and typical manifestation | Impact on research/farm adoption | Proposed technical/operational solutions |
|---|---|---|---|---|
| 1 | Data scarcity and class imbalance | Costly, time-consuming manual labeling of calls; rare yet critical events (e.g., pain bawls, calving distress) under-represented; farm privacy limits data sharing | Over-fitting, poor generalization; models ignore rare but critical classes | Large open acoustic repositories (multi-farm, multi-breed) |
| 2 | Cross-farm variability and domain shift | Differences in barn acoustics, microphone type, breed dialects, management routines | Performance drop when models deployed outside training site, farmer distrust | Domain-adversarial training, feature-space alignment |
| 3 | Background noise and multi-speaker overlap | Machinery, wind, multiple cows calling simultaneously | High false positives/negatives, missed welfare events | Beam-forming or multi-mic arrays for source separation |
| 4 | Limited interpretability (AI) | Deep nets learn latent features not visible to users | Farmers hesitant to trust alerts, regulators demand transparency | SHAP/LIME heatmaps on spectrograms |
| 5 | Sparse contextual labeling (why a call occurred?) | Audio often logged without behavioral or physiological context | Misclassification of benign calls as distress (or vice versa) | Multimodal fusion: synchronize audio with accelerometer and video |
| 6 | Real-time processing on resource-constrained edge devices | GPU-heavy models vs. limited power/connectivity in barns | Latency or dropout; costly cloud fees | Lightweight architectures (MobileNet, DistilBERT-audio) |
| 7 | Ethical risk of anthropomorphism and over-interpretation | AI may project human emotion labels inaccurately; farmers may act on unverified alerts | Questionable welfare interventions, misleading claims | Cross-validation against physiological stress markers |
| 8 | Farmer adoption and usability barriers | Alert fatigue, complex interfaces, unclear ROI | System ignored despite accuracy; missed welfare benefit | Tiered alerting (red/high vs. yellow/medium) |
| 9 | Data privacy and ownership concerns | Audio streams may reveal proprietary operations | Reluctance to share data, slows collaborative progress | Federated or encrypted model updates |
| 10 | Regulatory alignment and standardization gaps | No harmonized acoustic welfare metrics yet | Hard to benchmark systems; variable certification hurdles | Develop ISO-style standards for recording and annotation |
NLP/LLM techniques applied to bovine vocalizations
| No. | Technique/Model | Up-stream pre-training base | Fine-tuning data (bovine) | Key output/Capability | Demonstrated advantage | Current limitations |
|---|---|---|---|---|---|---|
| 1 | Wav2Vec 2.0 (SSL) | 960 h LibriSpeech human speech | 2 h labeled cow calls | 768-dim latent embeddings → downstream classifier | Cuts labeled data need by ≈70% (Hagiwara, 2023) | Requires long GPU pre-training, bovine prosody differs, latent units not explainable |
| 2 | HuBERT-style Audio LM | 60 k h YouTube-Audio8M | 5 h cow distress calls | Discrete token stream for LLM conditioning | Self-supervised tokens improve LLM prompt ability | – |
| 3 | Whisper (large-v2) | 680 k h multilingual speech | Zero-shot (no cow data) | “Transcript” string + log-prob | Noise-robust segmentation, auto-timestamp | Tokenizer trained on words → outputs nonsense on raw moos; needs post-filter |
| 4 | AudioGPT Controller | GPT-4 (text) + plug-in ASR/encoders | 50 labeled prompts (few-shot) | Multi-step reasoning over acoustic embeddings | Flexible zero-shot Q&A about herd sounds | Heavy compute, pipeline latency, still prototype |
| 5 | CNN Encoder + GPT-2 Decoder | ImageNet CNN weights | 7 k spectrograms with text tags | Generates sentence caption (e.g., “hungry calf call”) | Early end-to-end audio-caption success | Needs dataset of paired call + explanation, currently small |
| 6 | Prompt-Tuned GPT-J | 6 B-param code GPT | 400 synthetic “call→meaning” pairs | Rapid adaptation to cow vocabulary (<1 epoch) | Works with minimal GPU | Synthetic pairs risk bias; real validation pending |
| 7 | Spec-BERT | 100 h farm audio (masked) | 800 labeled segments | Predicts masked time-frequency patches, improves downstream F1 +4 pp | Learns robust representations under barn noise | Mask strategy sensitivity; limited to short clips |