ROCP: A rapid ontology construction platform from unstructured data

Chongchong Zhao; Chao Dong; Xiaoming Zhang

doi:10.5334/dsj-2018-023

Figures & Tables

The QA process between ROCP and domain experts.

Table 1

The algorithm for domain document word segmentation.

Algorithm 1.1 Domain document word segmentation
Input: Domain documents list D;
Output: The segmented words W;
List W;
for each i in D
    List W_i = D_i.WordSegmentationByLucene();
    for each j in W_i
        W_ij.stemming();
        if W_ij in stopwordlist
            W_i.remove(W_ij);
        end if
    end for
    W.add(W_i);
end for
return W;
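
The steps of Algorithm 1.1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular-expression tokenizer and the suffix-stripping `stem` function are simple stand-ins for the Lucene segmentation and stemming used by ROCP, and the `STOPWORDS` set is hypothetical, since the stop-word list is not given here.

```python
import re

# Hypothetical stop-word list; the actual list used by ROCP is not shown.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Stand-in for Lucene word segmentation: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def segment_documents(documents):
    """Return one list of stemmed, stop-word-free tokens per document."""
    segmented = []
    for doc in documents:
        stems = (stem(token) for token in tokenize(doc))
        segmented.append([s for s in stems if s not in STOPWORDS])
    return segmented
```

As in the algorithm, each token is stemmed first and then checked against the stop-word list, and the per-document lists are collected into the returned result.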

Table 2

The algorithm for the construction of the VSM.

Algorithm 1.2 The construction of VSM
Input: The segmented words W, the number of high-frequency words per document N;
Output: The vector space model of each document VSM;
List HFW;
List VSM;
for each i in W
    HFW_i = W_i.findHighFrequencyWords(N);
end for
List WA = HFW.allHighFrequencyWords();
for each j in W
    for each k in WA
        VSM_jk = WA_k.appearedTimesIn(W_j);
    end for
end for
return VSM;
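
A minimal Python sketch of Algorithm 1.2 is given below, assuming that the shared vocabulary is the union of each document's N most frequent words and that each vector entry is a raw occurrence count. The function name `build_vsm` and the sorted vocabulary order are choices made for this illustration, not part of ROCP.

```python
from collections import Counter


def build_vsm(segmented_docs, n):
    """Build one term-count vector per document over a shared vocabulary."""
    counts = [Counter(words) for words in segmented_docs]
    # Union of every document's n most frequent words; sorted so that the
    # vector axes have a stable, reproducible order.
    vocabulary = sorted({w for c in counts for w, _ in c.most_common(n)})
    # Entry k of a document's vector is how often vocabulary[k] occurs in it.
    vectors = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, vectors
```

For two toy documents `["orbit", "orbit", "debris"]` and `["debris", "debris", "laser"]` with `n=1`, the vocabulary is `["debris", "orbit"]` and the vectors are `[1, 2]` and `[2, 0]`.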

The cosine similarity algorithm to locate invalid documents.

Table 3

The algorithm to remove invalid documents.

Algorithm 1.3 Remove invalid documents
Input: The vector space model VSM, the cosSimilarity threshold CT, the domain documents D;
Output: The valid domain documents D;
sumcos₂ = 0;
for each i in VSM
    sumcos₁ = 0;
    for each j in VSM
        cosSim_ij = VSM_i.computeCosSimilarityWith(VSM_j);
        sumcos₁ += cosSim_ij;
    end for
    avgCosSim_i = sumcos₁ / VSM.size();
    sumcos₂ += avgCosSim_i;
end for
totalAvgCosSim = sumcos₂ / VSM.size();
for each i in avgCosSim
    if Math.abs(totalAvgCosSim - avgCosSim_i) > CT
        D.removeDocumentByItsVSMIndex(i);
    end if
end for
return D;
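
The filtering in Algorithm 1.3 can be illustrated with the following Python sketch. It assumes that each document's average similarity is taken over all vectors, including the document itself, as the nested loops suggest, and that a zero vector has similarity 0 to everything. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def remove_invalid_documents(vectors, documents, threshold):
    """Keep documents whose average similarity stays close to the overall mean."""
    if not vectors:
        return []
    n = len(vectors)
    # Average similarity of each document to every document (itself included).
    avg = [sum(cosine(u, v) for v in vectors) / n for u in vectors]
    total_avg = sum(avg) / n
    # A document is dropped when its average deviates from the overall
    # average by more than the threshold.
    return [d for d, a in zip(documents, avg) if abs(total_avg - a) <= threshold]
```

With three identical vectors `[1, 0]` and one orthogonal vector `[0, 1]`, the averages are 0.75, 0.75, 0.75 and 0.25, the overall average is 0.625, and a threshold of 0.2 removes only the fourth document.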

A part of the selection for domain experts to achieve 3-layers taxonomy.

Table 4

The algorithm for ontology classes hyponymy construction.

Algorithm 2 The construction of ontology classes hyponymy
Input: list NodesPool;
Output: list OntTree;
List rootNodes = SelectRootNodesByExperts(NodesPool);
OntTree₀ = rootNodes;
NodesPool.remove(rootNodes);
int n = 0;
while (NodesPool.hasElement())
    tempnodes = SelectNodesByExperts(NodesPool);
    OntTree_n.addsubnodes(tempnodes);
    OntTree_(n+1).add(tempnodes);
    NodesPool.remove(tempnodes);
    n++;
end while
return OntTree;
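
The layer-by-layer loop of Algorithm 2 can be sketched in Python as below. The expert interaction is replaced by two hypothetical callbacks, `select_roots` and `select_children`, and the sketch assumes that each selected node is attached to a parent in the previous layer. How ROCP actually records the parent of each node is not specified here.

```python
def build_taxonomy(nodes_pool, select_roots, select_children):
    """Build the class hierarchy one layer at a time.

    select_roots(pool) returns the root nodes; select_children(pool, parents)
    returns a dict mapping each chosen node to its parent. Both callbacks are
    stand-ins for the answers given interactively by domain experts.
    """
    pool = set(nodes_pool)
    roots = list(select_roots(pool))
    pool -= set(roots)
    layers = [roots]
    parent_of = {}
    while pool:
        chosen = select_children(pool, layers[-1])
        if not chosen:  # nothing selected: stop instead of looping forever
            break
        parent_of.update(chosen)
        layers.append(sorted(chosen))
        pool -= set(chosen)
    return layers, parent_of


# Toy usage: the "experts" answer from a fixed, made-up hierarchy.
hierarchy = {"Debris": None, "Fragment": "Debris",
             "Rocket body": "Debris", "Paint flake": "Fragment"}
layers, parent_of = build_taxonomy(
    hierarchy,
    lambda pool: [n for n in sorted(pool) if hierarchy[n] is None],
    lambda pool, parents: {n: hierarchy[n] for n in pool if hierarchy[n] in parents},
)
```

The early exit when no node is selected is an addition of this sketch; the pseudocode loops until the pool is empty.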

The tag cloud in the space debris mitigation domain.

The main part of the ontology in the space debris mitigation domain.

Table 5

The detailed information of the corpus and experimental data.

Documents	The Corpus	Domain documents set 1	Domain documents set 2
Source	China Daily	Space debris mitigation	Astronautics fundamentals
Number of documents	1000	20	50
Total number of words	1777763	54619	145628
Average number of words	1778	2731	2513

Table 6

The statistics of the extracted terminologies.

	Total valid words (TW)	Total terminologies (TT)	Number of extractions (NE)	Number of correct words (NC)
DS1-MPVW	2617	129	155	123
DS1-TF-IDF	2617	129	155	81
DS2-MPVW	4126	288	346	254
DS2-TF-IDF	4126	288	346	209

Table 7

The result of the recall, precision and F1-Measure.

	Recall	Precision	F1-Measure
DS1-MPVW	95.3%	79.4%	86.6%
DS1-TF-IDF	62.8%	52.3%	57.1%
DS2-MPVW	88.1%	73.4%	80.1%
DS2-TF-IDF	72.6%	60.4%	65.9%
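
The values in Table 7 are consistent with the usual definitions applied to the columns of Table 6, namely recall = NC/TT, precision = NC/NE, and F1 as their harmonic mean. The Python sketch below assumes these definitions; the function name is hypothetical.

```python
def extraction_metrics(total_terms, extracted, correct):
    """Return (recall, precision, F1) for a terminology extraction run."""
    recall = correct / total_terms      # NC / TT
    precision = correct / extracted     # NC / NE
    if precision + recall == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# DS1-MPVW row of Table 6: TT = 129, NE = 155, NC = 123.
recall, precision, f1 = extraction_metrics(129, 155, 123)
print(f"{recall:.1%} {precision:.1%} {f1:.1%}")  # prints "95.3% 79.4% 86.6%"
```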

The accuracy comparison of algorithm MPVW and TF-IDF.

Table 8

The time cost of each period of the manual operation.

Data sets	DS3	DS4	DS5	DS6
Number of terminologies	85	123	171	254
3-layers taxonomy	382 s	579 s	856 s	1366 s
Hyponymy construction	415 s	695 s	1056 s	1690 s
Properties and instances link	236 s	346 s	491 s	747 s
ROCP total time	1033 s (12.15 s/word)	1620 s (13.17 s/word)	2403 s (14.05 s/word)	3803 s (14.97 s/word)
Protégé total time	1787 s (21.02 s/word)	2867 s (23.31 s/word)	4602 s (26.91 s/word)	8708 s (30.28 s/word)

The time test of ontology construction by ROCP and manual work by Protégé.

