RDFAdaptor: Efficient ETL Plugins for RDF Data Process

Li, Jiao; Xian, Guojian; Zhao, Ruixue; Huang, Yongwen; Kou, Yuantao; Luo, Tingting; Sun, Tan

doi:10.2478/jdis-2021-0020

Figures & Tables

Workflow of RDF data generation with RDFizer.

Configuration template of RDFTranslatorAndLoader.

Configuration template of SPARQLIn and SPARQLUpdate.

Dump all AGROVOC RDF triples from SPARQL endpoint to local files.

Parameters defined in RDFTranslatorAndLoader.

Parameter		Description
Input	Source	RDF triples to be converted or loaded
	Source Type	type of data source, such as local file system, remote URL, or string stream
	Source RDF Format	format of the input RDF data; the common RDF formats are fully supported
	Large Input Triples	a selector indicating whether the input is large; if it is, the output step cannot count, merge, or split the triples
Advance	BaseIRI	resolve against a Base IRI if RDF data contains relative IRIs
	BNode	a selector for preserving BNode IDs
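	Verify URI syntax	a selector for checking URI syntax; a failure log is returned when the corresponding errors occur
	Verify relative URIs	a selector for checking relative URIs; a failure log is returned when the corresponding errors occur
	Verify language tags	a selector for checking language tags; a failure log is returned when the corresponding errors occur
	Verify datatypes	a selector for checking datatypes; a failure log is returned when the corresponding errors occur
	Language tags	a selector for language tag handling, including failing the parse if a language tag is not recognised and normalizing recognised language tag values
	Datatype	a selector for datatype handling, including failing the parse if a datatype is not recognised and normalizing recognised datatype values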
Output	Target RDF Format	RDF format of the converted output
	Commit or Split Size	number of RDF triples written to each RDF file or submitted to the store per batch; the default value is 0, which means all the input data are processed at one time
	Local File Setting	options for file system storage, including three selectors, “Save to File System”, “Keep Source FileName”, and “Merge to Single File” (which takes precedence over “Commit or Split Size”), plus the file name and location
	TripleStore Setting	options for the RDF store, including a selector for “Save to Store”, Triple Store, Server URL, Database/RepositoryID/NameSpace (the database identifier, which differs by triple store), UserName, Password, and Graph URI
	Stream setting	options for the string stream used for further data transfer, including a selector for “Save to Stream” and Result Field
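
The “Commit or Split Size” behaviour in the table above can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's own code: the function name `split_into_batches` is hypothetical, and the input is assumed to be an iterable of N-Triples lines. A size of 0 yields everything as a single batch, as the table describes.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_into_batches(triples: Iterable[str], split_size: int = 0) -> Iterator[List[str]]:
    """Yield lists of triples, each holding at most split_size items.

    A split_size of 0 (the default named in the table) yields all input
    as one batch. The function name and signature are hypothetical.
    """
    if split_size < 0:
        raise ValueError("split_size must be >= 0")
    iterator = iter(triples)
    if split_size == 0:
        batch = list(iterator)
        if batch:
            yield batch
        return
    while True:
        # Take the next split_size items; an empty list means the input is exhausted.
        batch = list(islice(iterator, split_size))
        if not batch:
            return
        yield batch
```

Each yielded batch could then be written to its own file or committed to a store in one transaction, which is the trade-off the parameter exposes.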

Parameters defined in SparqlUpdate.

Parameter		Description
SPARQL Setting	Query Endpoint Url From Field?	checkbox; if checked, the URL of the SPARQL Query Endpoint comes from a previous Kettle step and its value is read from the “Query Endpoint Url Field”
	Query Endpoint Url Field	drop-down list of input fields; used only when the option “Query Endpoint Url From Field” is selected
	Query Endpoint Url	URL of the Query Endpoint; used when “Query Endpoint Url From Field” is unchecked
	Update Endpoint Url From Field?	checkbox; if checked, the URL of the SPARQL Update Endpoint comes from a previous Kettle step and its value is read from the “Update Endpoint Url Field”
	Update Endpoint Url Field	drop-down list of input fields; used only when the option “Update Endpoint Url From Field” is selected
	Update Endpoint Url	URL of the Update Endpoint; used when “Update Endpoint Url From Field” is unchecked
	Query From Field?	checkbox; if checked, the SPARQL Update Query comes from a previous Kettle step and its value is read from the “Query Field Name”
	Query Field Name	used only when the option “Query From Field” is selected
	Base URI	resolve against a Base IRI if RDF data contains relative IRIs
	SPARQL Update Query	JavaScript programming for graph update; used only when the option “Query From Field” is disabled
Output Setting	Result Field Name	field specified for file saving
Http Auth	HTTP UserID	user ID of SPARQL endpoint if any
Http Auth	HTTP Password	password of SPARQL endpoint if UserID exists
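
The parameters above follow a recurring pattern: a value is read either from a field of the incoming row or from a fixed setting, and the resulting update is sent to the endpoint with optional HTTP credentials. The sketch below illustrates that pattern with the Python standard library only. It assumes the endpoint accepts the SPARQL 1.1 Protocol form-encoded `update` parameter and HTTP Basic authentication; the helper names `resolve_setting` and `build_update_request` are hypothetical and are not part of the plugin.

```python
import base64
from typing import Mapping, Optional
from urllib.parse import urlencode
from urllib.request import Request


def resolve_setting(row: Mapping[str, str], from_field: bool,
                    field_name: str, fixed_value: str) -> str:
    """Mimic a "... From Field?" checkbox: use the row field when checked."""
    return row[field_name] if from_field else fixed_value


def build_update_request(update_endpoint: str, update_query: str,
                         user_id: Optional[str] = None,
                         password: Optional[str] = None) -> Request:
    """Build (but do not send) a form-encoded SPARQL update POST request."""
    body = urlencode({"update": update_query}).encode("utf-8")
    request = Request(update_endpoint, data=body, method="POST")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    if user_id:
        # Assumption: the endpoint uses HTTP Basic authentication.
        credentials = f"{user_id}:{password or ''}".encode("utf-8")
        token = base64.b64encode(credentials).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    return request
```

Passing the returned object to `urllib.request.urlopen` would send the update; the sketch stops short of that so it can be inspected without a running endpoint.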

RDF data generation/translation and loading.

Data Source	Data Format	Number of Records	Number of Mapped Fields	Number of RDF Triples Generated	Total Time Consumed
MongoDB	JSON	1,948,268	17	37,038,563	32min18s
SqlServer	RDB	336,831	5	1,159,687	38.6s
SqlServer	RDB	798,389	9	7,521,876	5min4s
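
The figures in the table above can be turned into a rough throughput estimate. The snippet below is a small arithmetic sketch, assuming the "generated" column counts RDF triples and that the reported time covers the whole run; the `parse_duration` helper is hypothetical and handles only the two duration formats that appear in the table.

```python
import re


def parse_duration(text: str) -> float:
    """Convert strings such as '32min18s' or '38.6s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)min)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


# (data source, generated count, total time) copied from the table rows.
runs = [
    ("MongoDB", 37_038_563, "32min18s"),
    ("SqlServer", 1_159_687, "38.6s"),
    ("SqlServer", 7_521_876, "5min4s"),
]

for source, generated, elapsed in runs:
    rate = generated / parse_duration(elapsed)
    print(f"{source}: {rate:,.0f} per second")
```

Under those assumptions the three runs work out to roughly 19,000, 30,000, and 25,000 triples per second respectively.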

Parameters defined in RDFizer.

Parameter		Description
Namespace	Prefix	different prefixes depending on the required namespaces
Namespace	Namespace	collections of names identified by URI references
Mapping Setting	Subject URI	HTTP URI template for the Subject/Resource; the placeholder {sid} is replaced by the UniqueKey
	Class Types	the classes to which the resource belongs; multiple class types are supported (separated by semicolons), such as skos:Concept; foaf:Person
	UniqueKey	the unique and stable primary key of the resource, which forms part of the Subject URI
	Fields Mapping Parameters	a list of field mappings from the selected data source to the target RDF schema, including the input Stream Field, Predicates, Object URIs, Multi-Values Separator, Data Type, and Lang Tag
Dataset Metadata	Meta Subject URI	URI pattern of generated dataset
	Meta Class Types	the classes to which the resource belongs
	Parameters	a list of descriptions of generated dataset, including PropertyType, Predicates, Object Values, DataType, Lang Tag
Output Setting	File system setting	option for file system storage, including Filename and RDF format
Output Setting	RDF store setting	option for RDF store, including triple store name, server URL, Repository ID, Username (if any), Password, Graph URI
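
The mapping parameters above can be illustrated with a small sketch that turns one input record into N-Triples lines. It is a simplified illustration, not the RDFizer implementation: the function names and the dictionary keys (`field`, `predicate`, `separator`, `lang`) are hypothetical, only literal objects are produced, and Object URIs and Data Type handling are omitted.

```python
from typing import Dict, List, Mapping

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def expand(curie: str, namespaces: Mapping[str, str]) -> str:
    """Expand a prefixed name such as skos:Concept into a full URI."""
    prefix, _, local = curie.partition(":")
    return namespaces[prefix] + local


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples string literals require."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def rdfize(record: Mapping[str, str], subject_template: str, unique_key: str,
           class_types: str, field_mappings: List[Dict[str, str]],
           namespaces: Mapping[str, str]) -> List[str]:
    """Generate N-Triples lines for one record (hypothetical sketch)."""
    # The {sid} placeholder is replaced by the value of the unique key.
    subject = subject_template.replace("{sid}", str(record[unique_key]))
    lines = []
    # Class types are separated by semicolons, as described in the table.
    for class_type in class_types.split(";"):
        class_type = class_type.strip()
        if class_type:
            lines.append(f"<{subject}> <{RDF_TYPE}> <{expand(class_type, namespaces)}> .")
    for mapping in field_mappings:
        raw = record.get(mapping["field"])
        if raw is None or raw == "":
            continue
        separator = mapping.get("separator")
        values = str(raw).split(separator) if separator else [str(raw)]
        predicate = expand(mapping["predicate"], namespaces)
        lang = mapping.get("lang")
        for value in values:
            value = value.strip()
            if not value:
                continue
            literal = f'"{escape_literal(value)}"' + (f"@{lang}" if lang else "")
            lines.append(f"<{subject}> <{predicate}> {literal} .")
    return lines
```

For example, a record with a multi-valued label field such as `"maize|corn"`, a `|` separator, and the language tag `en` would produce one triple per label, each attached to the same subject URI.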

Parameters defined in SparqlIn.

Parameter		Description
SPARQL Setting	Accept URL from field	checkbox; if checked, the URL of the SPARQL Endpoint comes from a previous Kettle step and its value is read from the “URL field name”
	URL field name	drop-down list of input fields; used only when the option “Accept URL from field” is selected
	SPARQL Endpoint URL	endpoint URL queried when “Accept URL from field” is disabled
	Query Type	query type, with two options: Graph query or Tuple query
	SPARQL Query	SPARQL query forms: SELECT or CONSTRUCT
	Limit	limit on the amount of data to be processed, if necessary
	Offset	the starting position of data processing
Output Setting	Result Field Name	field specified for file saving
	RDF Format	target local data format: JSON, XML, CSV, or TSV for a SELECT query; an RDF format only for a CONSTRUCT query
	Max Rows	maximum size of the output file; empty or 0 means all the triples are retrieved
Http Auth	HTTP UserID	user ID of SPARQL endpoint if any
Http Auth	HTTP Password	password of SPARQL endpoint if UserID exists
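
The Limit and Offset parameters above suggest paging through a large result set. The sketch below shows one way such a paged request could be built with the Python standard library. It assumes the endpoint follows the SPARQL 1.1 Protocol (`query` parameter over GET) and that the supplied query does not already contain LIMIT or OFFSET clauses; the function name, the `tuple`/`graph` keys, and the chosen Accept media types are illustrative assumptions, not the plugin's behaviour.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

# Assumed media types: SPARQL JSON results for tuple (SELECT) queries,
# Turtle for graph (CONSTRUCT) queries.
ACCEPT_BY_QUERY_TYPE = {
    "tuple": "application/sparql-results+json",
    "graph": "text/turtle",
}


def build_query_request(endpoint_url: str, sparql_query: str,
                        query_type: str = "tuple",
                        limit: Optional[int] = None,
                        offset: Optional[int] = None) -> Request:
    """Build (but do not send) a GET request for one page of results."""
    query = sparql_query.strip()
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    if offset is not None:
        query += f"\nOFFSET {int(offset)}"
    joiner = "&" if "?" in endpoint_url else "?"
    url = endpoint_url + joiner + urlencode({"query": query})
    request = Request(url, method="GET")
    request.add_header("Accept", ACCEPT_BY_QUERY_TYPE[query_type])
    return request
```

Calling this repeatedly with an increasing offset, and stopping when a page comes back empty, is one way to dump a whole dataset from an endpoint to local files in manageable pieces.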
