A Bootstrapping-based Method to Automatically Identify Data-usage Statements in Publications

By: Qiuzi Zhang, Qikai Cheng, Yong Huang and Wei Lu
Open Access | Sep 2017

Figures & Tables

Figure 1

Bootstrapping framework for extraction.

Figure 2

Extensibility of pattern changes over the process of iteration under COM-SEED. COM-SEED refers to the strategy of selecting both the names of a few well-known datasets and a category of general indicative words as seed words.

Figure 3

Extensibility of pattern changes over the process of iteration under GEN-SEED. GEN-SEED refers to the strategy of selecting a category of general indicative words as seed words.

Figure 4

Extensibility of patterns in the form of “Predicate + Object” changes over the process of iteration. COM-SEED refers to the strategy of selecting both the names of a few well-known datasets and a category of general indicative words as seed words. GEN-SEED refers to the strategy of selecting a category of general indicative words as seed words.

Figure 5

Extensibility of patterns in the form of “Subject + Predicate” changes over the process of iteration. COM-SEED refers to the strategy of selecting both the names of a few well-known datasets and a category of general indicative words as seed words. GEN-SEED refers to the strategy of selecting a category of general indicative words as seed words.

Figure 6

Extensibility of pattern changes over the process of iteration under an optimum combination of seed-selection strategy and pattern-construction strategy.

DUS (data-usage statement) examples

| Statement | Positive (☑) or negative (☒) |
| --- | --- |
| In our experiments, the experimental subset contains 1,552 images selected from the GT database and the FERET databases. | ☑ The name, source, and compositions of data |
| The large-scale database contains 93,638 images captured from 9,668 palms of 4,834 individuals, in which 4–10 images are collected for each palm. | ☑ The source and compositions of data |
| Consequently, both of the two experimental subsets contain 1,200 samples for training and 1,200 samples for testing. | ☑ Data compositions and application |
| In order to show the robustness over short noisy intervals and satisfy the two defined semantics R1 and R2, we generate two completely separated clusters, C1 and C2, using two disjoint interval sequences, Q1 and Q2, and add the synthetically generated short noisy intervals marked in red. | ☒ Algorithm description |
| Each group contains 10 subjects. | ☒ Experiment participants |
| The average training time of the repeated random sub-sampling validation is 1.83 × 30 = 54.9 s, and that of the CBE cross-validation is 1.84 × 5 = 9.2 s. | ☒ Experiment process |

Elementary statistics on extraction results

| Seed-selection strategy | Pattern | Seed number | Pattern number | Statement number |
| --- | --- | --- | --- | --- |
| COM-SEED | Predicate + Object | 14,000 | 670 | 29,722 |
| COM-SEED | Subject + Predicate | 5,105 | 596 | 11,869 |
| GEN-SEED | Predicate + Object | 18,235 | 404 | 35,711 |
| GEN-SEED | Subject + Predicate | 5,530 | 334 | 11,247 |

Exemplifications of pattern construction

| Pattern | Sentences covered by this pattern and the extracted data_clue words |
| --- | --- |
| Consists of # samples | The breast cancer set consists of 569 samples with 357 benign and 212 malignant. Dataset 1 is referred to as Char250, which has 250 samples per category for lower and upper cases, respectively; dataset 2 is referred to as Char1000, which has 1,000 samples per category for lower and upper cases, respectively. (Please note this pattern occurs twice here.) |
| We perform experiments | To assess the ability of the proposed clustering algorithm for classifying the shape classes, we perform experiments on an increasing number of shapes in the two Aslan and Tari datasets. We perform our experiments on a real-estate system with real-life house dataset used in. |
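The way a pattern such as "consists of # samples" selects sentences can be illustrated with a small sketch. It assumes that `#` stands for any number and that every other token must match literally; the page does not spell out the matching rules, and the original method may be looser (for example, tolerating intervening words, as in "We perform our experiments"). The names `NUMBER`, `pattern_to_regex`, and `matching_sentences` are hypothetical helpers introduced only for this illustration.

```python
import re

# Assumption: '#' is a placeholder for a number, optionally with thousands
# separators or a decimal part. All other tokens are matched literally.
NUMBER = r"\d[\d,]*(?:\.\d+)?"


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a whitespace-separated pattern into a case-insensitive regex."""
    tokens = pattern.lower().split()
    parts = [NUMBER if tok == "#" else re.escape(tok) for tok in tokens]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


def matching_sentences(pattern: str, sentences: list[str]) -> list[str]:
    """Return the sentences in which the pattern occurs at least once."""
    rx = pattern_to_regex(pattern)
    return [s for s in sentences if rx.search(s)]


sentences = [
    "The breast cancer set consists of 569 samples with 357 benign and 212 malignant.",
    "Each group contains 10 subjects.",
]
hits = matching_sentences("consists of # samples", sentences)
for sentence in hits:
    print(sentence)
```

Strict literal matching keeps the sketch short; a bootstrapping system would normally also score patterns before accepting the sentences they cover.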

Initial seed words

| Seed-selection strategy | COM-SEED | GEN-SEED |
| --- | --- | --- |
| Initial seed words | trec # | data |
| | kdd cup | dataset |
| | trec | corpus |
| | wall street journal | data set |
| | the # kdd cup | |
| | dataset | |
| | corpus | |

Precision of statement extraction from CSExperiment-triple (2000–2013)

| Seed-selection strategy | Pattern | Precision (%) |
| --- | --- | --- |
| COM-SEED | Predicate + Object | 96.34 |
| COM-SEED | Subject + Predicate | 69.67 |
| COM-SEED | Overall | 83.01 |
| GEN-SEED | Predicate + Object | 95.34 |
| GEN-SEED | Subject + Predicate | 37.00 |
| GEN-SEED | Overall | 66.17 |
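The Overall figures in the precision table above appear consistent with an unweighted mean of the two per-pattern precisions for each strategy. This is an observation drawn from the numbers, not something the table states, and the short check below is only a sketch of that arithmetic; `unweighted_mean` is a hypothetical helper.

```python
# Per-pattern precision values (%) copied from the table above, in row order.
precision = {
    "COM-SEED": [96.34, 69.67],
    "GEN-SEED": [95.34, 37.00],
}
# Overall precision (%) as reported in the same table.
reported_overall = {"COM-SEED": 83.01, "GEN-SEED": 66.17}


def unweighted_mean(values: list[float]) -> float:
    """Arithmetic mean that ignores how many statements each pattern extracted."""
    return sum(values) / len(values)


for strategy, values in precision.items():
    mean = unweighted_mean(values)
    gap = abs(mean - reported_overall[strategy])
    print(f"{strategy}: mean={mean:.3f} reported={reported_overall[strategy]} gap={gap:.3f}")
```

If the overall precision had been weighted by the number of extracted statements per pattern, the result would differ, since Predicate + Object yields far more statements than Subject + Predicate under both strategies.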
DOI: https://doi.org/10.20309/jdis.201606 | Journal eISSN: 2543-683X | Journal ISSN: 2096-157X
Language: English
Page range: 69 - 85
Submitted on: Jan 21, 2016
Accepted on: Feb 26, 2016
Published on: Sep 1, 2017
Published by: Chinese Academy of Sciences, National Science Library
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2017 Qiuzi Zhang, Qikai Cheng, Yong Huang, Wei Lu, published by Chinese Academy of Sciences, National Science Library
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.