A Network View on the Big Exchange Project: Integrating and Analysing Heterogeneous Datasets

Mattis thor Straten; Steffen Strohm; Johanna Hilpert; Benjamin Serbe; Tim Kerig; Matthias Renz

doi:10.5334/jcaa.262

Full Article

1. Introduction

This work introduces a novel approach to integrating archaeological data within a heterogeneous information network (HIN) and performing similarity search to reveal patterns across datasets (see Shi et al. 2017 for an overview). This takes place within the Big Exchange project (Kerig et al. 2023), which investigates long-distance exchange systems across Eurasia and Africa from 8000 to 1 BCE, focusing on archaeological objects of raw materials with known origin (‘sourced finds’). Rather than reconstructing entire exchange or social networks, the project uses network analysis and machine learning techniques to analyse the direct connections between materials, their sources, and related find locations. By modelling these relationships as networks, key structural parameters describing network characteristics are identified, and shifts in material distribution are traced over time and space. This approach reveals patterns of resource access, economic interdependence, and underlying social inequalities, while encouraging broad collaboration on data and methods.

By mid-2025, the Big Exchange platform aggregates 38 published datasets derived from the analysis of several hundred thousand artefacts. These datasets record the geographical distribution of 22 raw materials from over 8,000 archaeological sites. This study draws on a deliberately limited subset of these data: 14 datasets from the Big Exchange collection, considering 11 different materials. It reflects a specific stage in the project, during which the datasets were acquired in parallel with the development of an HIN to integrate them. By explicitly representing a variable number of object and relationship types, HINs enable more meaningful interpretations in archaeological contexts by preserving the heterogeneous semantic information of archaeological data.

Failing to consider a variable degree of heterogeneity – for example, by using homogeneous or bimodal networks – can result in valuable contextual information being lost. The 14 datasets provide a clear example of the feasibility of integrating archaeological data into an HIN, demonstrating the potential and analytical power of HINs.

This work aims to reveal cross-dataset knowledge and facilitate domain-specific reasoning by analysing the integrated HIN (e.g., Patel, Paraskevopoulos & Renz 2018a; 2018b). Cross-dataset connections are considered by deriving pairwise material similarities based on spatial paths in the network. This exploratory feasibility assessment is designed to evaluate the potential for integrating archaeological data into HINs and to identify potential avenues for future applications.

2. Related Work and Methodology

Network analysis, which is based on mathematical graph theory, has found its way into archaeology primarily through social network analysis (SNA). It has since established itself as an important analytical framework (for current overviews, see Ahnert et al. 2020, pp. 43–51; Brughmans 2013; Brughmans & Peeples 2018; Collar et al. 2015; Crabtree & Borck 2019; Peeples 2019). In recent decades, the number and diversity of archaeological SNA studies have increased significantly, encompassing a wide range of thematic and methodological approaches (e.g., Aragon 2023; Kroon 2024; Notarian 2024; Pereira, Manen & Rigaud 2023; Tambs 2025).

General network science (e.g., Barthelemy 2014; Wasserman & Faust 1994) and current archaeological network analysis methods (overview in Brughmans et al. 2023; Brughmans & Peeples 2023) provide the conceptual and analytical basis for isolating and quantitatively describing network attributes. When the spatial and chronological distributions of raw materials are combined, patterns of interchangeability and economic side effects become apparent. However, this project does not aim to reconstruct complete exchange systems with individual actors – the archaeological datasets under consideration are too sparse and too heterogeneous for such reconstructions.

HINs are a computer science-based methodological approach for representing and analysing large, multi-typed networks. They are well-suited to integrating the diverse archaeological information from the Big Exchange project, as they preserve the semantic information of the data within an intuitive network structure, enabling the simultaneous analysis of diverse relationships (e.g., between sites and artefact types). This approach aligns with recent calls in archaeological network research to move beyond uniform graph structures and towards more complex, information-rich representations of past interaction systems (e.g. Ahnert et al. 2020).

To the best of our knowledge, no previous work has applied HIN analysis to an archaeological use case. This chapter begins by defining HINs and introducing the Big Exchange data and the created HIN, as well as the PathSim similarity measure and its currently developed spatial extension.

2.1 Heterogeneous Information Network (HIN)

The formal definitions adhere to standard notation in HIN research. Figures 1, 2, and 3 show concrete Big Exchange examples to ensure accessibility for readers without a mathematical background.

Formally, an HIN is a directed graph $G = (V, E, ϕ, ψ)$, where $V$ is the set of nodes and $E$ is the set of edges, representing different objects and their pairwise relationships, respectively.

The functions $ϕ : V \to A$ and $ψ : E \to R$ assign exactly one type to each node and edge, respectively. Here, $A$ denotes the set of node types and $R$ denotes the set of edge types, with $| A | > 1$ or $| R | > 1$. In the scope of this work, there exists at most one edge type for each pair of node types, so the type of an edge is directly determined by the types of its incident nodes and is therefore not mentioned explicitly.

The network schema $T_{G} = (A, R)$ of an HIN $G = (V, E, ϕ, ψ)$ is a directed graph that captures the network’s high-level structure by representing each node type in $A$ as a node and each edge type in $R$ as an edge that defines a possible relation between a pair of node types (Sun et al. 2011).
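The definitions above can be illustrated with a minimal sketch of a typed directed graph. The class name `HIN`, its methods, and the example node labels are hypothetical and are not taken from the published code base; edge types are stored as pairs of node types, mirroring the simplification that at most one edge type exists per pair of node types.

```python
from dataclasses import dataclass, field


@dataclass
class HIN:
    """Minimal typed directed graph G = (V, E, phi, psi); illustrative only."""
    node_type: dict = field(default_factory=dict)  # phi: node -> node type
    edge_type: dict = field(default_factory=dict)  # psi: (source, target) -> edge type

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, source, target):
        # Assumption: the edge type follows from the incident node types,
        # so it is stored as a (source type, target type) pair.
        self.edge_type[(source, target)] = (
            self.node_type[source],
            self.node_type[target],
        )

    def schema(self):
        """Return the network schema T_G = (A, R) as two sets."""
        return set(self.node_type.values()), set(self.edge_type.values())

    def is_heterogeneous(self):
        """Check the condition |A| > 1 or |R| > 1."""
        node_types, edge_types = self.schema()
        return len(node_types) > 1 or len(edge_types) > 1


# Hypothetical example: one find linked to its material and its site.
hin = HIN()
hin.add_node("find_1", "find")
hin.add_node("Rijckholt Flint", "material")
hin.add_node("site_A", "site")
hin.add_edge("find_1", "Rijckholt Flint")
hin.add_edge("find_1", "site_A")
node_types, edge_types = hin.schema()
```

In this toy graph, the schema consists of three node types and two edge types, so the heterogeneity condition is satisfied.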

2.2 Big Exchange Data

This study uses 14 archaeological datasets covering 11 raw materials. The datasets are detailed in a forthcoming volume (Kerig and Hilpert (eds.), in press), and the complete corpus of collected data within the Big Exchange Project will be released in an open-access format once the volume appears. The considered datasets are inherently heterogeneous, reflecting the varied workflows involved in their collection and compilation. They differ in scope, structure, and source material. However, they consistently include site information with assigned geographic coordinates (called geolocations here), albeit with varying degrees of precision. Certain collections focus on a single raw material over extended chronological spans, while others are temporally restricted but encompass a variety of materials. Temporal and spatial overlaps between datasets are frequent, motivating the integration of the data into an HIN.

Rijckholt Flint [LBK]. The dataset is based on the study by Zimmermann (1995) on the distribution of Rijckholt Flint, a characteristic lithic raw material in the northwestern part of the Early Neolithic Linearbandkeramik (LBK; ca. 5400 – 4900 BCE) distribution area.

Actinolite-Hornblende Schist [LBK]. Adzes are commonly found in LBK graves and settlements and are often made of amphibolite (Actinolite-Hornblende Schist – AHS). This work uses the compilation of Nowak (2008), which documents all known LBK Jistebsko-AHS adzes.

Abensberg-Arnhofen Chert [LBK]. The dataset is derived from the compilation by Roth (2008) documenting the frequencies of Abensberg-Arnhofen Chert in LBK settlements.

Site Compilation NW Bavaria [LBK]. Scharl (2010; 2015) examined the procurement and exchange of lithic raw materials in Western Franconia, documenting the relative presence of raw materials from sources such as Rijckholt and Abensberg-Arnhofen at Early and Middle Neolithic sites.

Kraków Jurassic Silicite [LBK]. This study draws on a subset of the comprehensive database compiled by Mateiciucová (2008), restricted to LBK sites where Kraków Jurassic Silicite was recorded. This flint was a key lithic resource within the Early Neolithic of Poland and the Czech Republic.

Site Compilation Polish Lowlands [LBK]. In their investigation of procurement and exchange patterns across the Polish Lowlands, Pyzel and Wąs (2018) quantified the proportions of various raw material types at 41 Polish LBK sites, including the percentage of Kraków Jurassic Silicite in the assemblages.

Burials [LBK]. This collection is based on the extensive collection of LBK burials and grave goods in the Lifeways database (Bickle & Whittle 2013).

Spondylus Artefacts [ca. 6500 – 3800 BCE]. The dataset documents Spondylus gaederopus artefacts from 490 sites, 111 of which date to 5500 – 4900 BCE and 95 of which belong to Early Neolithic LBK contexts (Windler 2018).

Alpine Greenstone Axes [ca. 5th–4th millennium BCE]. Large, polished jade axes were circulated widely across Western Europe during the 5th–4th millennium BCE. They are made of jadeitite, a green-coloured rock for which there are sources in the western Italian Alps (Monte Viso quarries, 1500–2400 m above sea level). This study incorporates the continuously updated JADE project database, which is supplemented with central coordinates for the recorded ‘parishes’.¹

Lousberg Adzes [Late Neolithic]. This study builds upon the research conducted by Schyle (2010) into the systematic open-air extraction and distribution of Lousberg Flint near Aachen (c. 3800–3000 BCE).

African Ivory Iberian Peninsula [Chalcolithic – Early Bronze Age]. This work incorporates the distribution data of over 1,200 ivory objects from 152 sites on the Iberian Peninsula dating from the 3rd–2nd millennium BCE, collected by Schuhmacher (2012).

Lapis Lazuli Objects [4th–3rd millennium BCE]. The blue Lapis Lazuli was sourced only from two deposits in Afghanistan and Tajikistan from the 4th millennium BCE onwards. This work uses the spatial data on Lapis Lazuli objects from Rahmstorf (2022), who examined cultural interactions from the Aegean to western India.

Ivory and Lapis Lazuli Objects of Upper Mesopotamian-Anatolian Region [ca. 3200–1600 BCE]. Massa and Palmisano (2018) investigated long-distance exchange networks to identify patterns of change and continuity. A subset of their dataset is used here.

Oxhide Ingots [2nd millennium BCE]. Oxhide ingots (OHI) are standardised metal bars that resemble oxhides and originate mainly in Cyprus (Sabatini 2016a). This dataset compiles information from various sources (Kaiser 2013; Kassianidou 2008; Lo Schiavo 2018; Sabatini 2016a; 2016b; Sabatini & Lo Schiavo 2020).

2.3 Big Exchange HIN

Each dataset contributes to the integrated HIN by providing find data for a specific material type, related contextual data, and spatio-temporal information. In the resulting HIN, each find corresponds to a node. All concepts linked to that find (e.g., material or site) are represented as separate nodes of the respective types. This leads to a node instance for each unique value combination defining this concept, e.g., latitude and longitude column information defining a geolocation in a row of the pre-existing relational dataset. Edges are created to connect objects that occur in the same row of the original dataset. This ensures that the HIN structure accurately reflects the integrated input data. Spatial and temporal node types act as connectors between the diverse datasets, linking different materials through associated finds with a similar spatial or temporal origin.
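The row-wise construction described above can be sketched as follows. This is a simplified illustration, not the published HIN code: the column names and example rows are invented, only three edge types are created, and edges are stored as an undirected adjacency for brevity.

```python
from collections import defaultdict

# Hypothetical rows of a relational dataset; column names are assumptions.
rows = [
    {"find": "f1", "material": "Rijckholt Flint", "site": "Site A", "lat": 50.9, "lon": 6.4},
    {"find": "f2", "material": "AHS", "site": "Site A", "lat": 50.9, "lon": 6.4},
    {"find": "f3", "material": "AHS", "site": "Site B", "lat": 49.1, "lon": 11.8},
]

nodes = set()              # nodes are (node type, value) pairs
edges = defaultdict(set)   # adjacency sets, undirected for brevity


def link(a, b):
    """Register both nodes and connect them."""
    nodes.update((a, b))
    edges[a].add(b)
    edges[b].add(a)


for row in rows:
    find = ("find", row["find"])
    material = ("material", row["material"])
    site = ("site", row["site"])
    # One geolocation node per unique (latitude, longitude) combination.
    geolocation = ("geolocation", (row["lat"], row["lon"]))
    link(find, material)
    link(find, site)
    link(site, geolocation)

geolocation_nodes = [n for n in nodes if n[0] == "geolocation"]
```

Because nodes are keyed by their type and value, the two finds at Site A share one site node and one geolocation node, which is what allows spatial node types to act as connectors between datasets.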

Figure 1 shows the network schema of the Big Exchange HIN. The node type labels are derived from the original datasets in order to preserve the authors’ nomenclature as far as possible. Note that edge types linking transitively connected node types – such as find–site and site–geolocation, which together cover the direct find–geolocation relation – are included in the network schema but are omitted from Figure 1 for readability and structural clarity. For more details, see the schema information in the published HIN code base. A find is the central node type of this HIN. It is related to a material type, which has a material creation rank and a material processing rank established by domain experts. These ranks assess the technological complexity involved in extracting or collecting material (creation), and the technical knowledge required to further shape it (processing).

A find also has a provenance indicating its ‘source’, which is located at a geolocation. A find may also be part of a feature – for example, a grave – which is found at a site. The site is also related to a geolocation, as well as the hierarchically ordered spatial concepts of parish, district, country subdivision, and country.

The country subdivision is also connected to cultural innovation, referring to systemically relevant inventions (Lechterbeck & Kerig 2024), such as the first general occurrence of cultivated plants and animals. According to Bocquet-Appel (2008), the earliest date of such cultural innovations is set to zero and is called the delta-value. Data from Kohler et al. (2025) is used to describe the effects of innovations over time and to recognise time lags in the process of evolutionary fitting (Kerig et al. 2025).

The site is also related to a time period, providing absolute temporal information, and a dating, which creates the concept of the presence of an archaeological phase or culture at a specific time and location.

2.4 Similarity Search in HINs

The HIN integrates diverse datasets into a unified structure, providing a semantically enriched foundation for discovering cross-dataset patterns. Similarity measures quantify the relatedness of two objects of the same type on a scale from 0 to 1, where values near 1 indicate strong similarity. Similarity search then ranks objects based on their similarity to a query object (He, Bailey & Zhang 2014).

The link-based PathSim similarity measure (Sun et al. 2011) considers paths, which are sequences of nodes representing higher-order relationships. The schema of a path (meta path), which is defined by its constituent node types, characterises a specific semantic relationship (see Figure 2). Formally, a meta path $P$ of length $l \in ℕ$ on the network schema $T_{G} = (A, R)$ is defined as $P = A_{1} \overset{R_{1}}{\to} A_{2} \overset{R_{2}}{\to} \dots \overset{R_{l}}{\to} A_{l + 1}$, where $A_{i} \in A$ and $R_{j} \in R$ for $i \in [1, \dots, l + 1]$ and $j \in [1, \dots, l]$. A meta path $P = P_{k} \cdot P_{k}^{- 1}$, where $P_{k}^{- 1}$ is the inverse path of $P_{k}$, is called a round-trip meta path; in Figure 2, $P_{k}$ is material-find-site-geolocation. A path in the HIN that follows a specific meta path $P$ is called a path instance of $P$ (Sun et al. 2011).

Figure 2: The round-trip meta path, which connects two materials if finds of these materials occur at sites with the exact same geolocation.

PathSim measures the similarity between two objects along a user-selected meta path by counting the path instances connecting them and normalising this count by the number of path instances linking each object to itself. This mitigates any bias towards highly visible objects. Formally, the PathSim score between two objects $x, y \in V$ using a round-trip meta path $P = P_{k} \cdot P_{k}^{- 1}$ is computed as follows:

$$\mathrm{PathSim}(x, y) = \frac{2 \cdot | \{ p_{x ⇝ y} : p_{x ⇝ y} \in P \} |}{| \{ p_{x ⇝ x} : p_{x ⇝ x} \in P \} | + | \{ p_{y ⇝ y} : p_{y ⇝ y} \in P \} |}$$

where $p_{x ⇝ y} \in P$ is a path instance of $P$ between $x$ and $y$, and $p_{x ⇝ x} \in P$ is one between $x$ and itself.
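The PathSim computation for the round-trip meta path material-find-site-geolocation-site-find-material can be sketched in a few lines. This is a minimal illustration with invented finds, not the project's implementation; it assumes that every find is linked to exactly one site and every site to exactly one geolocation, so that each find contributes one path instance from its material to a geolocation.

```python
from collections import Counter


def half_path_counts(finds):
    """Count material-find-site-geolocation path instances per material.

    `finds` is a list of (material, site, geolocation) tuples, one per find.
    """
    counts = {}
    for material, _site, geolocation in finds:
        counts.setdefault(material, Counter())[geolocation] += 1
    return counts


def round_trip_count(counts, x, y):
    """Number of round-trip path instances between materials x and y."""
    cx, cy = counts.get(x, {}), counts.get(y, {})
    return sum(n * cy[g] for g, n in cx.items() if g in cy)


def pathsim(counts, x, y):
    """PathSim score: twice the connecting paths over the two self-path counts."""
    denominator = round_trip_count(counts, x, x) + round_trip_count(counts, y, y)
    if denominator == 0:
        return 0.0
    return 2 * round_trip_count(counts, x, y) / denominator


# Hypothetical finds: (material, site, geolocation)
finds = [
    ("flint", "s1", "g1"),
    ("flint", "s1", "g1"),
    ("flint", "s2", "g2"),
    ("ahs", "s1", "g1"),
    ("ahs", "s3", "g3"),
]
counts = half_path_counts(finds)
score = pathsim(counts, "flint", "ahs")  # 2 * 2 / (5 + 2) = 4/7
```

In this toy example, the two materials meet only at geolocation g1, which yields 2 connecting path instances against 5 and 2 self-connecting ones. The normalisation therefore penalises the more frequently recorded material.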

PathSim is established and well-suited to analyse archaeological HINs because it can naturally represent the diverse semantic link information incorporated. The user-selected meta path reveals a story hidden in the data by considering user-defined semantics. Counting and normalising path instances – instead of counting the number of different reachable nodes – explicitly captures multiple occurrences of this story in the HIN, and relates to the story’s importance in the network. PathSim can be applied to the Big Exchange HIN to assess similarity of spatial regions, similar to the case study by Kerig et al. (2023, Figure 4) comparing profiles of different geographical regions based on the absence and presence of raw materials, where regional similarity is proportional to the number of different shared materials. PathSim can measure both qualitative and quantitative connection information in a network, where not only the number of different materials but also the number of shared instances of the same material compared to the network visibility of the considered spatial region influences the similarity score. Comparing PathSim scores relative to each other provides insight into regional commonalities and differences.

The meta path in Figure 2 assumes that two materials are transitively related to the exact same geolocation. However, this is unlikely to be reflected in real datasets due to varying spatial precision and the vast number of different locations on the Earth’s surface. Therefore, this work serves as a proof of concept for the currently developed PathSim extension, which considers spatially extended meta paths that incorporate spatial proximity between objects. Figure 3 presents a spatially extended meta path based on Figure 2.

Figure 3: The spatially extended meta path, which connects two materials if finds of these materials occur at sites whose geolocations are spatially similar.
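Since the technical details of the spatially extended PathSim are not given here, the following is only one possible reading of the idea and not the authors' implementation. It assumes that two geolocations count as spatially similar if their planar Euclidean distance does not exceed a threshold `max_km`, and that coordinates are already available in a projected system measured in kilometres; all names and values are hypothetical.

```python
import math
from collections import Counter


def spatial_pathsim(finds, coords, x, y, max_km):
    """PathSim over material-find-site-geolocation ~ geolocation-site-find-material.

    finds:  list of (material, geolocation id) pairs, one per find
    coords: geolocation id -> (x_km, y_km) in a planar coordinate system
    """
    counts = {}
    for material, geo in finds:
        if geo in coords:  # geolocations without coordinates are skipped
            counts.setdefault(material, Counter())[geo] += 1

    def near(g, h):
        return math.dist(coords[g], coords[h]) <= max_km

    def paths(a, b):
        # Every pair of nearby geolocations contributes the product of the
        # path counts that reach them from materials a and b.
        return sum(
            n * m
            for g, n in counts.get(a, {}).items()
            for h, m in counts.get(b, {}).items()
            if near(g, h)
        )

    denominator = paths(x, x) + paths(y, y)
    return 2 * paths(x, y) / denominator if denominator else 0.0


# Hypothetical data: three geolocations on a line, distances in km.
coords = {"g1": (0.0, 0.0), "g2": (5.0, 0.0), "g3": (300.0, 0.0)}
finds = [("flint", "g1"), ("flint", "g1"), ("ahs", "g2"), ("ahs", "g3")]

exact = spatial_pathsim(finds, coords, "flint", "ahs", 0.0)
on_site = spatial_pathsim(finds, coords, "flint", "ahs", 6.25)
continental = spatial_pathsim(finds, coords, "flint", "ahs", 700.0)
```

With a threshold of 0 km, the sketch reduces to the exact-match meta path of Figure 2 and the two toy materials remain unconnected. Larger thresholds admit additional path instances, both between the materials and from each material to itself.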

3. Application & Interpretation

The spatially extended meta path, illustrated in Figure 3, is applied throughout the experiments in this chapter without repeated reference. The spatial similarity of geolocations follows the (Euclidean) distance classification for lithic raw material procurement zones proposed by Kerig and Shennan (as cited by Mateiciucová & Trnka 2015, p. 8). The distance classes are defined by a minimum distance of 0 km and the following maximum distances:

On Site: 6.25 km
Local I: 12.5 km
Local II: 25 km
Regional I: 50 km
Regional II: 100 km
Interregional: 200 km
Long Distance: 400 km
Continental: 700 km
Transcontinental: ∞
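The classification above can be expressed as a simple lookup. The sketch below assumes that the upper bounds are inclusive, which the text does not specify; the function names are hypothetical.

```python
import math

# Maximum distances in km as listed above; every class starts at 0 km.
DISTANCE_CLASSES = [
    ("On Site", 6.25),
    ("Local I", 12.5),
    ("Local II", 25.0),
    ("Regional I", 50.0),
    ("Regional II", 100.0),
    ("Interregional", 200.0),
    ("Long Distance", 400.0),
    ("Continental", 700.0),
    ("Transcontinental", math.inf),
]


def smallest_class(distance_km):
    """Return the first class whose maximum distance covers the given distance."""
    if not distance_km >= 0:  # also rejects NaN
        raise ValueError("distance must be a non-negative number")
    for name, max_km in DISTANCE_CLASSES:
        if distance_km <= max_km:  # assumption: upper bounds are inclusive
            return name


def classes_containing(distance_km):
    """Return all classes that contain the distance (the classes are nested)."""
    if not distance_km >= 0:
        raise ValueError("distance must be a non-negative number")
    return [name for name, max_km in DISTANCE_CLASSES if distance_km <= max_km]
```

Because every class has a minimum distance of 0 km, the classes are nested: a pair of geolocations that is similar in the On Site class is also similar in every larger class.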

Figure 4 shows the actual spatial distributions of the considered geolocations. Geolocations without valid coordinate values (especially missing or non-numerical) are excluded from the spatial proximity computation. In the present dataset, this applies to only a single geolocation. In general, this affects only the spatial similarity step: the corresponding nodes remain part of the HIN but are not considered in spatially extended meta paths because calculating spatial proximity requires coordinates. Materials that occur exclusively at sites without coordinates would therefore contribute only through conventional meta paths.

Figure 4: The spatial distribution of all 5,905 considered geolocations.

Figure 5 shows the tables containing the PathSim scores for each material, considering the On Site and Continental distance classes. Note that the tables are symmetrical due to the symmetry of the PathSim measure. Each row contains the PathSim similarity scores of a material compared to all other materials, indicated by the corresponding letter in the column header. It should be noted that Shells corresponds to the query for the material ‘Shells’ within the dataset compiled by Bickle and Whittle (2013). Importantly, sites containing burials for which no information regarding potential grave goods was available were also deliberately included. Consequently, all sites contained in the database are considered, providing a comprehensive overview of the distribution of LBK burials. This procedure was necessitated by the semantics of the material-based query.

Figure 5: Pairwise similarity scores for all materials considering the On Site (a) and Continental (b) distance classes.

Each material has the maximum PathSim similarity score of 1 with itself by definition of PathSim. 50 of the 110 PathSim scores between different materials in Figure 5a are 0 (empty cells), indicating that there is not a single path instance connecting the corresponding materials when using the On Site distance. Only four 0-values appear when considering the Continental distance in Figure 5b.

Figure 5a also has 28 PathSim scores that are less than 0.001, indicating a large number of very small similarity scores in the geospatial distribution of materials for the On Site distance class. Figure 5b has only 14 such small PathSim scores. The majority of materials receive a higher similarity score when the larger Continental distance class is considered. However, note that the score does not necessarily rise every time the distance is increased: to increase the PathSim score, the number of path instances connecting the two compared materials must grow relatively more than the number of path instances connecting each material to itself (e.g., via connections to other materials; see PathSim formula).
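To illustrate why a larger distance class does not always raise the score, consider a purely hypothetical pair of materials $x$ and $y$; the counts are invented for illustration and are not taken from Figure 5. Suppose that a small distance class yields 2 path instances between $x$ and $y$ and 4 path instances from each material to itself, whereas a larger distance class yields 3 path instances between $x$ and $y$, 10 from $x$ to itself, and 6 from $y$ to itself:

$$\mathrm{PathSim}_{\text{small}}(x, y) = \frac{2 \cdot 2}{4 + 4} = 0.5, \qquad \mathrm{PathSim}_{\text{large}}(x, y) = \frac{2 \cdot 3}{10 + 6} = 0.375$$

Although the number of connecting path instances increases, the self-connecting path instances grow faster, so the score decreases.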

To evaluate the archaeological interpretation of PathSim, a subset of the material data was compared with the aforementioned case study (Kerig et al. 2023), which examined the circulation of raw materials in the Early Neolithic LBK. The focus was on the distribution of AHS adzes, Abensberg-Arnhofen Chert, Kraków Jurassic Silicite, Rijckholt Flint, Shells, and Spondylus artefacts. Adopting a network perspective, the case study identifies regional clusters defined by shared distributions of raw materials. It is demonstrated that the northwestern group of the LBK, largely structured around the Rijckholt distribution, was framed by exchange sub-systems in which it played a minimal or negligible role (cf. Jeunesse 1996). Sub-systems such as this are interpreted as reflecting processes of LBK expansion, assuming that raw material exchange mostly followed established contacts in accordance with the social lineage of the initial spread.

Building on these results, this work conducts an exploratory analysis of the PathSim scores for the corresponding materials. These include lithic raw materials from the LBK (Rijckholt Flint, Amphibolite [AHS], Abensberg-Arnhofen Chert [AA], Kraków Jurassic Silicite), as well as the ‘Shells’ query to represent LBK graves and the full Neolithic Spondylus dataset. Figure 6 displays the distributions of PathSim similarity scores across the different distance classes considering all materials given one fixed query material.

Figure 6: PathSim distributions for each LBK material compared with all other LBK materials across distance classes.

In this context, PathSim values can be interpreted as indicators of potential contact zones. High similarity scores within small distance classes suggest that the spatial distributions of finds consisting of the two materials under analysis (their respective ‘raw material spheres’) overlap significantly and share a comparable visibility in the network. In a mobility context, this overlap indicates greater opportunities for communities using these materials to come into contact within the close surroundings of their settlement areas. Conversely, low similarity scores in these small distance classes imply limited chances of such interaction, given the distribution of geolocations shown in Figure 4. Increasing similarity scores for larger distance classes point to a higher likelihood of contact within broader activity zones that extend beyond the local to regional scale. Figure 6 illustrates how likely users of different raw materials were to encounter one another within the distance classes, relative to activity zones around, e.g., sites. Assuming that the formation of raw material spheres reflects the social relationships underlying the expansion of the LBK, these similarity profiles can be interpreted as inverse indices of social distance.

Some specific observations relating to the aforementioned LBK case study are highlighted here. Notably, the distribution of Rijckholt Flint consistently exhibits high similarity scores with AHS across all distance classes. At the same time, the PathSim profiles of Rijckholt Flint and AHS display an almost identical pattern (see Figures 6a and 6b). This can be interpreted as follows: a) a large number of sites with Rijckholt Flint and AHS are located in close spatial proximity, and b) both materials exhibit a very similar structure in their overall spatial distribution. Consequently, there is a high probability that communities associated with these two raw material spheres came into contact across all spatial scales, implying low social distance between the two groups.

By contrast, AHS and AA exhibit low similarity values across all distance classes, even though the AA distribution area largely overlaps with that of AHS. This results from the PathSim score being normalised with respect to the visibility of each material, based on the size and structure of each individual material network. While potential contact zones do exist in areas where the two raw material distributions overlap, the overall similarity remains low because there are potential contact zones with other materials that are not involved in the similarity between AHS and AA. In a simplified mobility model, most AHS users were unlikely to encounter AA users locally, indicating low connectedness between the two spheres and potentially greater social distance, despite some possible contacts.

The spatial seclusion of Kraków Jurassic Silicite is clearly evident in its PathSim profile, with increasing similarity values only appearing in large, continental distance classes. In the limited scope of the raw material distribution considered here, this implies a high social distance and few opportunities for contact with the other raw material spheres.

In the network representation of Kerig et al. (2023), the AHS sphere connects the larger complexes of different lithic raw materials, linking the otherwise disconnected Rijckholt Flint and Spondylus spheres. In the present study, however, this bridging is less apparent. Instead, a high similarity emerges between AHS and Rijckholt, suggesting close social proximity. This is particularly notable given the materials were used in different contexts – Rijckholt Flint for the production of ‘everyday’ tools and AHS in burials – and were derived from distinct geographical sources. They move along the exchange network in different directions, possibly following established connections that reflect the lineage behind the initial dispersion process (cf. Kerig et al. 2023, Figure 5). While the raw materials themselves cannot be directly compared, the spatial patterns indicated by the PathSim scores reveal consistent similarities between these two spheres.

Beyond insights into potential contact networks and social distance, PathSim scores also illuminate spatial relationships. For Spondylus and Rijckholt Flint, it remains unclear whether the observed pattern reflects genuine regionalisation or results from differential taphonomic processes. The apparent absence of Spondylus in the core Rijckholt area has previously been linked to a presumed scarcity of graves, given that organic remains are thought to preserve poorly in this region (cf. Eckmeier, Altemeier & Gerlach 2014, p. 151).

To test this, the distributions of LBK burials (‘Shells’) and the complete Spondylus dataset are each compared to lithic raw materials (see Figures 6e and 6f). The PathSim profile of the ‘Shells’ dataset shows nearly identical similarity value distributions (Figure 6e) with three of the four lithic materials. If preservation bias were responsible for the low Rijckholt-burial connectivity, its profile would diverge from that of the other lithics. However, the observed alignment indicates that such bias is unlikely to account for the pattern. Accordingly, the absence of Spondylus, primarily recorded in burials, cannot be explained by a systematic lack of graves due to poor preservation in the Rijckholt distribution area.

In this respect, the spatial relationship of the overall Spondylus distribution to lithic artefacts and LBK burials is assumed to be particularly revealing. At a local to regional scale, one would expect the PathSim scores between Spondylus and Rijckholt Flint to be markedly lower than those between Spondylus and other LBK raw materials. Instead, similarity scores with all lithics and burials remain consistently low, particularly in smaller spatial neighbourhoods. This outcome is largely driven by the broad geographical spread and comparatively high frequency of Spondylus finds. Through normalisation, these finds suppress local variation in PathSim scores. This highlights a methodological limitation: without incorporating spatial boundaries on the studied area, PathSim may flatten localised patterns under certain conditions.

These preliminary results suggest considerable potential, which should be examined in more detail, considering additional raw materials and extended datasets.

In summary, the spatially extended PathSim approach is used to evaluate potential contact zones and social distances among Early Neolithic LBK communities based on the distribution of lithic raw materials. This complements the insights gained from a previous network analysis. High similarity at small spatial scales indicates overlapping material spheres and likely local interaction, whereas low values suggest limited contact and greater social distance. The strong spatial and social proximity between Rijckholt Flint and AHS hints at patterns hidden in the data.

By generating a proxy for LBK burial distributions by applying a broad material query (‘Shells’), the spatial relationship between burials as a site type and materials could also be examined. Analysis of the spatial correspondence between burials and lithic raw materials suggests that the absence of Spondylus in certain regions is not necessarily indicative of a genuine lack of graves. This highlights the method’s ability to differentiate between patterns of material circulation and potential taphonomic bias.

4. Conclusion and Future Work

This exploratory work demonstrates the potential of HINs for the integration and analysis of archaeological data. It provides proof of concept for the application of spatially extended PathSim. HIN-based approaches preserve data heterogeneity, providing a powerful framework for uncovering structural and semantic patterns in material distributions and complementing and extending established analytical methods.

By condensing complex distributions into similarity scores, spatially extended PathSim reveals hidden spatial patterns in the data, providing archaeologists with an additional layer of information. These scores offer a dynamic measure of when and at what scales raw material spheres began to overlap, providing a proxy for changing patterns of social proximity and interaction within Early Neolithic Europe. This helps to overcome archaeological bias by providing key figures based on network characteristics, contributing to a deeper understanding of network development and supporting domain-specific reasoning. Therefore, PathSim should be considered a complementary tool that enhances the perception of relational patterns. However, its results must be contextualised with archaeological expertise to distinguish meaningful connections from methodological artefacts.

Despite promising results, challenges remain – particularly in aligning and integrating heterogeneous datasets into a global, interoperable network. These limitations, and the need for methodological adaptations to archaeological research questions, highlight important directions for future work.

Building on this proof of concept, future research should aim to expand the underlying dataset by using additional raw materials, burial contexts, settlement information, and temporal resolution, in order to capture the full complexity of prehistoric societies. This could provide more nuanced insights into exchange networks, contact zones, and patterns of social connectivity in prehistory. Another important direction is mapping the Big Exchange HIN to a standardised semantic model, such as CIDOC CRM (Doerr 2005), to improve interoperability (e.g., thor Straten et al. 2025). Improvements to the performance and functionality of spatially extended PathSim are currently being developed, with the aim of publishing the methodology and codebase.

Data Accessibility Statement

The HIN code base is available online (DOI: https://doi.org/10.5281/zenodo.17019270; thor Straten 2025a). To examine the schema, please combine the information from node_types and edge_types with the schema file (also described in the README file). The detailed code base for the similarity search using spatially extended meta paths will be published once a publication introducing the technical details has been released, and will then be available online (DOI: https://doi.org/10.5281/zenodo.16686023; thor Straten 2025b).

Notes

[1] As of 2024: https://jade.univ-fcomte.fr/.

Competing Interests

The authors have no competing interests to declare.
