ROCP: A rapid ontology construction platform from unstructured data

Open Access | Sep 2018

Figures & Tables

Figure 1

The QA process between ROCP and domain experts.

Figure 2

The overview of our approach.

Table 1

The algorithm for domain document word segmentation.

Algorithm 1.1 Domain document word segmentation
Input: Domain documents list D;
Output: The segmented words W;
1. List W;
2. for each i in D
3.    List W[i] = D[i].WordSegmentationByLucene();
4.    for each j in W[i]
5.       W[i][j].stemming();
6.       if W[i][j] in stopwordlist
7.          W[i].remove(W[i][j]);
8.       end if
9.    end for
10.   W.add(W[i]);
11. end for
12. return W;
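As a rough illustration of Algorithm 1.1, the following Python sketch tokenizes each document, stems every token, and drops stop words. It does not use Lucene: the regular-expression tokenizer, the toy `naive_stem` function, and the `STOP_WORDS` set are all assumptions made for this example and are not taken from ROCP.

```python
import re

# Hypothetical stop-word list; the source does not give the one ROCP uses.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}


def naive_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def segment_documents(documents):
    """Return one list of stemmed, stop-word-free tokens per document."""
    segmented = []
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())  # simple tokenizer
        stems = [naive_stem(token) for token in tokens]
        segmented.append([s for s in stems if s not in STOP_WORDS])
    return segmented


docs = ["The satellites are orbiting", "Mitigation guidelines in orbits"]
words = segment_documents(docs)
print(words)
```

As in the pseudocode, stemming happens before the stop-word check, so the stop-word list is matched against stemmed forms.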
Table 2

The algorithm for constructing the vector space model (VSM).

Algorithm 1.2 The construction of VSM
Input: The segmented words W, the number of high-frequency words N;
Output: The vector space model of each document VSM;
1. List HFW;
2. List VSM;
3. for each i in W
4.    HFW[i] = W[i].findHighFrequencyWords(N);
5. end for
6. List WA = HFW.allHighFrequencyWords();
7. for each j in W
8.    for each k in WA
9.       VSM[j][k] = WA[k].appearedTimesIn(W[j]);
10.   end for
11. end for
12. return VSM;
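The VSM construction in Algorithm 1.2 might be sketched in Python as follows. The function name `build_vsm` and the parameter `top_n` are hypothetical. The sketch assumes that the shared vocabulary is the union of each document's `top_n` most frequent words and that each vector entry is a raw occurrence count.

```python
from collections import Counter


def build_vsm(segmented_docs, top_n):
    """Build one term-count vector per document over a shared vocabulary."""
    # Union of every document's top_n most frequent words, in first-seen order.
    vocabulary = []
    for words in segmented_docs:
        for word, _count in Counter(words).most_common(top_n):
            if word not in vocabulary:
                vocabulary.append(word)

    # Count how often each vocabulary word appears in each document.
    vectors = []
    for words in segmented_docs:
        counts = Counter(words)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = [["orbit", "debris", "orbit"], ["debris", "launch", "debris"]]
vocab, vsm = build_vsm(docs, top_n=1)
print(vocab, vsm)
```

Because every document is projected onto the same vocabulary, the resulting vectors have equal length and can be compared directly.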
Figure 3

Document validation.

Figure 4

The cosine similarity algorithm to locate invalid documents.

Table 3

The algorithm to remove invalid documents.

Algorithm 1.3 Remove invalid documents
Input: The vector space model VSM, the cosine similarity threshold CT, the domain documents D;
Output: The valid domain documents D;
1. n = VSM.size(); sumcos2 = 0;
2. for each i in VSM
3.    sumcos1 = 0;
4.    for each j in VSM
5.       cosSim[i][j] = VSM[i].computeCosSimilarityWith(VSM[j]);
6.       sumcos1 += cosSim[i][j];
7.    end for
8.    avgCosSim[i] = sumcos1 / n;
9.    sumcos2 += avgCosSim[i];
10. end for
11. totalAvgCosSim = sumcos2 / n;
12. for each i in avgCosSim
13.    if Math.abs(totalAvgCosSim - avgCosSim[i]) > CT
14.       D.removeDocumentByItsVSMIndex(i);
15.    end if
16. end for
17. return D;
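A minimal Python sketch of the filtering idea in Algorithm 1.3 is given below. The names `cosine` and `remove_invalid` are hypothetical. As in the pseudocode, each document's average similarity includes its similarity with itself, and a document is dropped when its average deviates from the overall average by more than the threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def remove_invalid(documents, vectors, threshold):
    """Keep documents whose average similarity stays close to the overall mean."""
    n = len(vectors)
    if n == 0:
        return []
    # Average similarity of each document to all documents (itself included).
    avg = [sum(cosine(vi, vj) for vj in vectors) / n for vi in vectors]
    total_avg = sum(avg) / n
    return [d for d, a in zip(documents, avg) if abs(total_avg - a) <= threshold]


docs = ["d1", "d2", "d3", "d4"]
vsm = [[1, 0], [1, 0], [1, 0], [0, 1]]
kept = remove_invalid(docs, vsm, threshold=0.2)
print(kept)
```

In this toy input the first three vectors are identical and the fourth is orthogonal to them, so only the fourth document falls outside the threshold.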
Figure 5

The 3-layer taxonomy.

Figure 6

Part of the selection presented to domain experts to build the 3-layer taxonomy.

Figure 7

Object property creation.

Figure 8

Ontology assembly.

Table 4

The algorithm for constructing hyponymy among ontology classes.

Algorithm 2 The construction of ontology classes hyponymy
Input: List NodesPool;
Output: List OntTree;
1. List rootNodes = SelectRootNodesByExperts(NodesPool);
2. OntTree[0] = rootNodes;
3. NodesPool.remove(rootNodes);
4. int n = 0;
5. while (NodesPool.hasElement())
6.    tempnodes = SelectNodesByExperts(NodesPool);
7.    OntTree[n].addsubnodes(tempnodes);
8.    OntTree[n+1].add(tempnodes);
9.    NodesPool.remove(tempnodes);
10.   n++;
11. end while
12. return OntTree;
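The layer-by-layer loop of Algorithm 2 could look roughly like the Python sketch below. The expert interaction is simulated by two callbacks, `select_roots` and `select_children`, which are hypothetical stand-ins for the question-and-answer step. The sample class names and the guard that stops when the experts select nothing are likewise assumptions of this example.

```python
def build_hyponymy(nodes_pool, select_roots, select_children):
    """Build taxonomy layers by repeatedly asking experts for the next layer."""
    pool = set(nodes_pool)
    roots = list(select_roots(pool))
    layers = [roots]          # layers[n] holds the class names at depth n
    parent_of = {}            # child class -> parent class
    pool -= set(roots)
    while pool:
        # Experts map some remaining nodes to parents in the current layer.
        chosen = select_children(layers[-1], pool)
        if not chosen:
            break             # nothing selected: stop instead of looping forever
        parent_of.update(chosen)
        layers.append(sorted(chosen))
        pool -= set(chosen)
    return layers, parent_of


# Simulated expert answers (illustrative class names only).
answers = {"satellite": "spacecraft", "rocket": "spacecraft", "cubesat": "satellite"}


def pick_roots(pool):
    return ["spacecraft"]


def pick_children(parents, pool):
    return {c: p for c, p in answers.items() if c in pool and p in parents}


layers, parent_of = build_hyponymy(
    ["spacecraft", "satellite", "rocket", "cubesat"], pick_roots, pick_children
)
print(layers)
```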
Figure 9

The tag cloud for the space debris mitigation domain.

Figure 10

The main part of the ontology for the space debris mitigation domain.

Table 5

The detailed information of the corpus and experimental data.

| Documents | The corpus | Domain documents set 1 | Domain documents set 2 |
|---|---|---|---|
| Source | China Daily | Space debris mitigation | Astronautics fundamentals |
| Number of documents | 1000 | 20 | 50 |
| Total number of words | 1777763 | 54619 | 145628 |
| Average number of words | 1778 | 2731 | 2513 |
Table 6

The statistics of the extracted terminologies.

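| Data set and method | Total valid words (TW) | Total terminologies (TT) | Number of extraction (NE) | Number of correct words (NC) |
|---|---|---|---|---|
| DS1-MPVW | 2617 | 129 | 155 | 123 |
| DS1-TF-IDF | 2617 | 129 | 155 | 81 |
| DS2-MPVW | 4126 | 288 | 346 | 254 |
| DS2-TF-IDF | 4126 | 288 | 346 | 209 |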
Table 7

The recall, precision, and F1-measure results.

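| Data set and method | Recall | Precision | F1-measure |
|---|---|---|---|
| DS1-MPVW | 95.3% | 79.4% | 86.6% |
| DS1-TF-IDF | 62.8% | 52.3% | 57.1% |
| DS2-MPVW | 88.1% | 73.4% | 80.1% |
| DS2-TF-IDF | 72.6% | 60.4% | 65.9% |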
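The percentages in Table 7 appear to follow from the counts in Table 6 if recall is taken as NC/TT and precision as NC/NE, with F1 as their harmonic mean. These definitions are an assumption of this sketch, as they are not stated in the tables themselves. Under that assumption, the recomputed values agree with the reported ones to within about 0.1 percentage point.

```python
def evaluate(total_terms, extracted, correct):
    """Return (recall, precision, f1) as fractions between 0 and 1."""
    recall = correct / total_terms      # assumed definition: NC / TT
    precision = correct / extracted     # assumed definition: NC / NE
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# (TT, NE, NC) counts copied from Table 6.
table6 = {
    "DS1-MPVW": (129, 155, 123),
    "DS1-TF-IDF": (129, 155, 81),
    "DS2-MPVW": (288, 346, 254),
    "DS2-TF-IDF": (288, 346, 209),
}

for name, (tt, ne, nc) in table6.items():
    r, p, f = evaluate(tt, ne, nc)
    print(f"{name}: recall={r:.1%} precision={p:.1%} F1={f:.1%}")
```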
Figure 11

The accuracy comparison of the MPVW and TF-IDF algorithms.

Table 8

The time cost of each phase of the manual operation.

| Data sets | DS3 | DS4 | DS5 | DS6 |
|---|---|---|---|---|
| Number of terminologies | 85 | 123 | 171 | 254 |
| 3-layers taxonomy | 382 s | 579 s | 856 s | 1366 s |
| Hyponymy construction | 415 s | 695 s | 1056 s | 1690 s |
| Properties and instances link | 236 s | 346 s | 491 s | 747 s |
| ROCP total time | 1033 s (12.15 s/word) | 1620 s (13.17 s/word) | 2403 s (14.05 s/word) | 3803 s (14.97 s/word) |
| Protégé total time | 1787 s (21.02 s/word) | 2867 s (23.31 s/word) | 4602 s (26.91 s/word) | 8708 s (30.28 s/word) |
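The ROCP totals in Table 8 can be reproduced by summing the three manual phases and dividing by the number of terminologies. The short sketch below does this for the ROCP rows only. The function name `rocp_summary` is hypothetical, and the per-word rate is rounded to two decimals to match the table.

```python
def rocp_summary(terminologies, phase_seconds):
    """Return (total seconds, seconds per terminology) for one data set."""
    total = sum(phase_seconds)
    return total, round(total / terminologies, 2)


# Terminology counts and per-phase times (taxonomy, hyponymy, properties/instances).
table8 = {
    "DS3": (85, [382, 415, 236]),
    "DS4": (123, [579, 695, 346]),
    "DS5": (171, [856, 1056, 491]),
    "DS6": (254, [1366, 1690, 747]),
}

for name, (terms, phases) in table8.items():
    total, per_word = rocp_summary(terms, phases)
    print(f"{name}: {total} s total, {per_word} s/word")
```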
Figure 12

Time comparison of ontology construction with ROCP and manual construction with Protégé.

Language: English
Submitted on: Feb 14, 2017
Accepted on: May 2, 2017
Published on: Sep 25, 2018
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2018 Chongchong Zhao, Chao Dong, Xiaoming Zhang, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.