
Through the Citizen Scientists’ Eyes: Insights into Using Citizen Science with Machine Learning for Effective Identification of Unknown-Unknowns in Big Data

Open Access | Dec 2024

Figures & Tables

Figure 1

An example collage of images from the Galaxy Zoo: Weird & Wonderful (GZ:W&W) project that have been discussed in the Talk boards and their corresponding volunteer-provided tags.

Figure 2

Left panel: The anomaly score versus the fraction of times volunteers identified a subject as interesting in Galaxy Zoo: Weird & Wonderful (GZ:W&W) (the "volunteer chosen fraction"), with selections upweighted for experienced volunteers who have substantial participation in GZ:W&W; the subset of those subjects discussed in the Talk boards is shown as red points. Right panel: The feature score versus image score for our entire ~200,000-subject GZ:W&W sample, color-coded by the experience-weighted chosen fraction.

Figure 3

The probability distribution of the feature score (left) and image score (right) for our entire sample (black bars; 99th percentile marked by the black dashed line), along with the subset with weighted chosen fraction >0.5 (green bars).

Figure 4

A visualization of the Galaxy Zoo: Weird & Wonderful (GZ:W&W) subjects in the three most prominent Uniform Manifold Approximation and Projection (UMAP) dimensions, color coded by different quantitative metrics: anomaly score (panel a), image score (panel b), and feature score (panel c). We also show the subjects that were #tagged in the Talk discussion boards (panel d; see legend; e.g., WD: #white_dwarf, SN: #supernova_candidates).

Figure 5

The Uniform Manifold Approximation and Projection (UMAP) distribution of a subset of images validated against our logistic regression decision boundary (panel a), and the images that satisfy the decision criteria (black points; panel b). Panel c shows the precision (P), recall (R), and F1 score (2PR/(P+R)) as a function of an applied lower limit on the feature score, S_feature. Panel d shows precision versus recall for various logistic regression decision boundaries in which each of three parameters is incrementally thresholded: the weighted chosen fraction (red points), the feature score (blue crosses), and the product of feature score and weighted chosen fraction (green triangles).
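The precision/recall/F1 sweep shown in panel c can be sketched as follows. This is a minimal illustration with simulated scores and labels (not project data); the variable names `s_feature` and `labels` are hypothetical stand-ins for a subject's feature score and its "interesting" label.

```python
import numpy as np

# Simulated stand-in data: ~10% of subjects are "interesting",
# and interesting subjects skew toward higher feature scores.
rng = np.random.default_rng(0)
n = 1000
labels = rng.random(n) < 0.1
s_feature = np.clip(rng.normal(0.3 + 0.3 * labels, 0.15), 0.0, 1.0)

def precision_recall_f1(scores, truth, threshold):
    """Binarize scores at `threshold`, then compute P, R, and F1 = 2PR/(P+R)."""
    pred = scores >= threshold
    tp = np.sum(pred & truth)
    p = tp / max(np.sum(pred), 1)   # precision: TP / predicted positives
    r = tp / max(np.sum(truth), 1)  # recall: TP / actual positives
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# Sweep a lower limit on the feature score, as in panel c.
for t in [0.2, 0.4, 0.6]:
    p, r, f1 = precision_recall_f1(s_feature, labels, t)
    print(f"threshold={t:.1f}  P={p:.2f}  R={r:.2f}  F1={f1:.2f}")
```

Raising the threshold typically trades recall for precision; the F1 curve summarizes where that trade-off balances.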

Figure 6

The precision and recall of the logistic regression decision boundary as the binarizing threshold Σ is varied for two different scores: the weighted chosen fraction (left panel; Σ_CF) and the product of feature score and weighted chosen fraction (right panel; Σ_CF×Feature). In each panel, we also show the overall fraction of a new sample of images that requires visual inspection as a function of Σ.
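The "fraction requiring visual inspection" curve in Figure 6 can be sketched with simulated scores. The skewed score distribution below is an assumption for illustration only, not the paper's data.

```python
import numpy as np

# Simulated per-image scores in [0, 1], skewed toward low values
# (most images are not flagged as interesting).
rng = np.random.default_rng(1)
scores = rng.beta(1, 5, size=50_000)

# Sweep the binarizing threshold Sigma: images at or above it
# would be sent for visual inspection.
for sigma in [0.1, 0.3, 0.5]:
    frac = (scores >= sigma).mean()
    print(f"Sigma={sigma:.1f}  inspect {frac:.1%} of the sample")
```

As the threshold rises, the inspection fraction falls monotonically, which is the workload-versus-completeness trade-off the figure illustrates.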

DOI: https://doi.org/10.5334/cstp.740 | Journal eISSN: 2057-4991
Language: English
Submitted on: Feb 16, 2024
Accepted on: Sep 23, 2024
Published on: Dec 9, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Kameswara Bharadwaj Mantha, Hayley Roberts, Lucy Fortson, Chris Lintott, Hugh Dickinson, William Keel, Ramanakumar Sankar, Coleman Krawczyk, Brooke Simmons, Mike Walmsley, Izzy Garland, Jason Shingirai Makechemu, Laura Trouille, Clifford Johnson, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.