
Simulation and the building performance gap

By: Michael Donn  
Open Access | September 2025


1. Introduction

An internet search for a concept as apparently simple as the ‘performance gap’ in buildings reveals the pattern of concern in building science research. In the past two decades the number of papers on this subject has risen sharply. The majority of these papers are not in building simulation journals or conferences: there is an apparent split in the publishing field, where the media with a simulation focus are far less concerned about trusting simulation. This editorial examines, and looks to reconcile, that split. It is conventional at this point to paraphrase authors such as Galvin (2014: 516–518), who differentiates the ‘discrepancy between simulated and actual building performance’ from the ‘classic rebound (Jevons) effect’ and the ‘energy savings deficit’. Are simulationists naïve disciples of a flawed approach? Is the gap a reason for concern? The underlying question in the call for papers arose from George Box’s aphorism (Box 1976, 1979): ‘all models are wrong, some are useful’.

In the view of the simulationist, too often this notion is taken as a reason to say models are not to be trusted. To those who rely on the results of simulation, the performance gap becomes evidence for this lack of trust. As Box (1976, 1979) notes, models are always an approximation of reality, so why is a direct match between model and reality expected? By focusing on the gap, is the field following Box’s original suggestion that good science arises from understanding the reasons a model does not fit reality?

the discrepancy between what tentative theory suggests should be so and what practice says is so—that can produce learning.

(Box 1976: 791)

The credibility gap places great pressure on the simulation community when asked to produce tools to address the global drive towards a sustainable built environment, which is leading to increasingly stringent building energy and daylighting codes (Evans et al. 2017). Recognising the limitations of purely prescriptive approaches, nations and regions are embracing performance-based compliance pathways (Spataro et al. 2011). This shift necessitates the use of sophisticated building performance simulation (BPS) tools, enabling designers to demonstrate that their designs meet or exceed code targets through predicted performance rather than by an adherence to fixed specifications (Rosenberg & Hart 2014). Merely setting a target heat loss without looking at the overall drivers of energy flows is recognised as problematic (Foroushani et al. 2022; Goulden et al. 2020). The increasing focus on overheating protection in temperate climates requires all to think carefully about how to optimise a design for cold winters without compromising health and comfort in summer (Alrasheed & Mourshed 2023; Aste et al. 2022; Baba et al. 2022; Escandón et al. 2022; López-Cabeza et al. 2022).

This special collection and many other papers are symptomatic of the concerns in the modelling community about the accuracy or relevance of inputs to ‘prediction’ tools. The call for papers attracted 18 abstracts, which yielded five papers accepted for publication (Table 1).

Table 1

Articles in this special collection ‘Trusting Building Performance Simulation’, Buildings & Cities (2026), 6(1), guest editor Michael Donn

AUTHORS | TITLE | DOI

M. Donn | Simulation and the building performance gap [Editorial] | https://doi.org/10.5334/bc.688
M. Bohm | Building performance simulation for sense-making in architectural pedagogy | https://doi.org/10.5334/bc.554
H. Järvelä, A. Lehto, T. Pirilä & M. Kuittinen | Metrics for building component disassembly potential: a practical framework | https://doi.org/10.5334/bc.556
D. Glew, F. Thomas, D. Miles-Shenton & J. Parker | Quantifying inter-dwelling air exchanges during fan pressurisation tests | https://doi.org/10.5334/bc.557
M. Baborska-Narożny, K. Bandurski & M. Grudzińska | Understanding shading through home-use experience, measurement and modelling | https://doi.org/10.5334/bc.568
K. Cebrat, J. Narożny, M. Baborska-Narożny & M. Smektała | Modelling site-specific outdoor temperature for buildings in urban environments | https://doi.org/10.5334/bc.576

2. Background

The question is: can a truly integrated BPS platform be constructed that is able to represent real world complexity in a manner that supports user understanding? Or will the problems derived from partial, over-simplified tools continue to be pushed on to unsuspecting users? Only time will tell.

(Clarke & Hensen 2015: 305)

This special collection asks: in the succeeding decade, has there been progress?

The call for papers sought original research:

reporting on quality assurance measures, case studies, user studies that address the development of trust in the performance calculations of designers.

Papers addressing all aspects of design performance analysis (‘indoor air flows, daylight, energy, overheating and acoustics’) were welcome. The intent was to focus on the building industry’s use of all these phenomena. There are interesting papers examining the ‘reality’ of acoustic models (Katz et al. 2018; Postma & Katz 2017), of daylight (Mardaljevic et al. 2025), and of the effect of airflow and urban heat islands on performance (Li 2021). However, all accepted papers used the energy performance gap as their principal rationale.

The ‘credibility gap’ (Bordass et al. 2004) has many suggested sources. A series of keyword searches for analyses of the gap between model and reality in acoustics, lighting, thermal performance, indoor air quality, natural ventilation and wind–building interactions in cities revealed a remarkably similar set of issues affecting reliability and trust across all these fields. They echo the trust concerns recently raised by Zheng et al. (2024) in their own review of the literature:

  1. The quality of the input data, including: geometry simplifications; the relevance of the variable data describing the external environment (weather data; noise levels); material and appliance properties.

  2. Simulation tool user expertise and modelling practices.

  3. The accuracy and reliability of the algorithms that simplify the real world sufficiently that it can be represented in a set of mathematical relationships.

  4. Occupant behaviour, though technically a subset of the first item, is often dealt with separately.

All papers in this special collection start from the premise that BPS plays a critical role in building design, renovation and the development of an understanding of the drivers of building performance.

Glew et al. illustrate trust concern 1, showing that fan pressurisation tests in attached homes often overestimate air leakage due to inter-dwelling air flow not previously accounted for.

Bohm takes an innovative approach to trust concern 2, describing the use of BPS as a tool for nurturing ‘spatial-data sensemaking’ in architectural pedagogy: helping students construct understanding from complex information, deepen their comprehension of building–energy relationships, and refine their intuition about interrelationships and trade-offs.

Trust concern 3 is not addressed by the papers in this collection. Laboratories studying and developing the algorithms eventually picked up and used in BPS rarely start with the whole-building simulation. They focus—unlike the papers here—on a very specific measured phenomenon and on particular building material or design properties.

Baborska-Narożny et al. address trust concern 4, noting that many studies of shading and energy performance exist for offices, but few for residences. Using measured data and a calibrated simulation model, they found that inhabitants behave very differently from what the standard behaviour models assume.

Whilst all users and developers of simulation tools are concerned about the four issues listed, item 3 is mostly the focus of the simulation tool developer and is not addressed further in this special collection.

3. A fragile cornerstone

Increasingly, simulation tools are used routinely to assess issues of sustainability and resilience for code compliance. Modellers have long held onto the ‘doctrine’ that while there might be differences in absolute performance data from different simulation programs, the relative change reported by each program will be similar. It is hard to find data to confirm this view, but some exist (Raslan 2010; Raslan et al. 2009; Schwartz & Raslan 2013). The evidence is ambivalent: values of relative change can be inconsistent, yet often the overall performance indices do not change.

Writers of building codes have an interest in achieving fixed targets. If one were to compare absolute numbers for embodied energy or carbon with lifetime energy or carbon calculations, then perhaps the long-known acceptance that each simulation code will produce different numbers (Judkoff 1985) should be questioned. Or perhaps other ways need to be chosen to demonstrate compliance?

BPS software offers the potential to transcend the limitations of traditional design methods. Designers can explore a vast array of design parameters and calculate performance outcomes using virtual twins of unprecedented detail. Early-stage optimisation, cost reduction and the realisation of ambitious sustainability goals are held out as the tantalising reward.

Before the arrival of these ‘simulation tools’, centuries of designers’ heuristics existed in acoustics, lighting, air quality, thermal performance and quality urban environments (Hawkes 1975; Lewis 2017). From the 1958 Oxford Conference (Zamarian 2021) into the 1970s, the building professions developed increasingly sophisticated means of modelling how a building might perform when it was eventually constructed. Only in the field of the urban impacts of buildings on wind is the computer not the dominant means of design analysis (Blocken et al. 2016; RWDI 2019). The issue of reliability and trust in performance simulation has been a significant concern since the 1980s (de Wilde 2014; Donn 1999; Judkoff 1985; Maamari et al. 2005). Performance simulations of all types are the cornerstones of design and code compliance practice.

BPS relies on the mathematical representation of physical phenomena. These models, while apparently sophisticated, are still simplifications of reality. They rely on assumptions, empirical correlations and numerical algorithms, all of which introduce potential sources of error. The level of complexity of the energy performance calculations alone is best illustrated by the size of the document that is ‘organized to give you the best possible look into the EnergyPlus calculations’: the EnergyPlus Engineering Reference (US DoE 2024), comprising more than 1800 pages. Oddly, the simplified tools regularly reported at simulation conferences (https://ibpsa.org/conferences/) often focus on speeding up the design calculation, yet the input of the dimensions and physical properties of the building remains as complex as in the supposedly ‘more complex’ full-simulation programs, as this author found when transitioning from three decades of the latter to using spreadsheet-based calculators such as ALF (Stoecklein & Bassett 2000) and PHPP (Feist & Schnieders 2009).

Bohm argues that the potential for false or inaccurate assumptions leads to ‘epistemic uncertainty’. This is the kind of uncertainty that could, in theory, be reduced by gathering more data. However, Bohm argues for framing the questions asked of the BPS and improving the BPS users’ understanding of the uncertainty, providing design indications rather than directives. This is much more in the spirit of Box’s aphorism, and has a logical place in design ideation. The key question that remains is: if the underlying conclusions are based on such fragile data, should simulation be used for code compliance?

4. Beyond the black box: input data and user expertise

Even in wind tunnel modelling of building performance, where it might be expected that the physical model makes the design results more real to the average building design team, the complexity of the process and the need for significant airflow modelling expertise distance the designers from the analysis. For example, the otherwise extremely readable City of London wind microclimate guidelines (RWDI 2019) include a section on how to model the wind climate which lists ‘Weibull parameters’ for each season and at 10° direction intervals. The wind tunneller can be expected to understand these. But if the developer or architect is to explore the impact of building form on urban safety and comfort, then they need tools with a user focus on environmental and human impacts, not the statistical distribution of wind speeds over time.
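To make concrete what a Weibull description of the wind climate encodes, and how it could be translated into the human-impact terms argued for here, consider the sketch below. The shape and scale values and the 15 m/s threshold are illustrative assumptions, not figures from the RWDI (2019) guidelines.

```python
import math

def exceedance_probability(v: float, k: float, c: float) -> float:
    """Probability that wind speed exceeds v (m/s) for a Weibull
    wind climate with shape k and scale c (m/s):
    P(V > v) = exp(-(v / c) ** k)."""
    return math.exp(-((v / c) ** k))

# Illustrative (assumed) parameters for one season and one 10-degree sector.
k_shape, c_scale = 1.8, 6.5   # dimensionless shape; scale in m/s
threshold = 15.0              # an assumed comfort/safety wind speed

p = exceedance_probability(threshold, k_shape, c_scale)
print(f"P(V > {threshold} m/s) = {p:.2%}  (~{p * 8760:.0f} h/year)")
```

A designer-facing tool would report the final line, the hours per year a comfort or safety criterion is exceeded, rather than the k and c parameters themselves.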

When it comes to computer simulation codes, the black box effect arises from the lack of understanding of the underlying algorithms and the default values in the building description that are incorporated when the program user does not specify a value. This lack of simulation literacy is referenced in several studies by Beausoleil-Morrison (2019). A simple reading of this and other papers (Fernandez-Antolin et al. 2022; Imam et al. 2017) would initially suggest the ‘credibility gap’ is unsurprising.

There seem to be two approaches to the literacy issue: educate the user in the fundamentals and influence of the underlying algorithms (Beausoleil-Morrison 2019), or use simulation within a design programme in order to ‘teach critical thinking’ about building performance (Bohm; Soebarto 2019). Bohm argues that their approach to the use of BPS in a graduate class for architects enabled the students to describe themselves as having a ‘deeper understanding of [the] interrelationship of buildings and energy’. In and of itself, increasing self-belief might be viewed as potentially self-delusional. However, what gives great confidence in the results is the observation that:

many students parsed their descriptions of BPS outputs carefully, describing clues to the behaviour of buildings, rather than as ‘truth’ as such. Also, students frequently described their BPS work with the open-ended language of experimentation, play or exploration or even a more directional improvement and optimisation, rather than the absolute language like ‘pass’ or ‘fail’.

(Bohm: 460)

This indicates models being used with intelligence—appropriately viewed as idealisations of reality. There may still not be the ability to critique the model itself that Beausoleil-Morrison (2019) seeks, but it is also not a blind acceptance of the ‘truth’ of the ‘prediction’ of performance that underlies much of the rhetoric around the ‘performance gap’.

Elsewhere, even when provided with a relatively simple set of template input files, architecture students reported ‘challenges’ in assessing outdoor comfort (Berger et al. 2024). However, once these were overcome, the authors reported ‘occasionally a mistaken sensation of expertise when students master’ a simulation tool. They suggest there is a strong need to counter the ‘culture of unquestioningly trusting simulation results’. Developing a quality assurance process therefore requires developing what these authors see as necessary: a counter to the observed ‘shallow or even misleading understanding of the phenomena or system being simulated’. This paper also references another paper reporting the increasing use of BPS in architectural education (Kalpkirmaz Rizaoglu & Karsten 2020).

It is worthwhile examining the nature of the black box. In thermal performance simulation, several computer codes exist for detailed modelling of thermal performance (US DoE 2025; EQUA Simulation 2025; IES 2025). There are several interfaces to the US DoE-financed EnergyPlus program. All these programs and interfaces attempt to calculate an energy balance between indoors and outdoors by taking account of the weather and the combinations of orientation, glass, mass and insulation in a building (Donn & Thomas 2010). Given their apparent complexity, it is easy to forget that many other algorithms ‘model’ a building’s thermal performance.

An R-value is a relatively simple idealisation of heat flow. It is in this sense a simulation. It is quite easy to understand that R = 6 m2K/W is twice as good as R = 3 m2K/W. But this apparently elegant representation of ‘measured’ properties of materials has flaws: it assumes heat flow is one-dimensional, perpendicular to the plane of the roof, wall or floor being studied. It also assumes the R-value does not change over time, or through the depth of each material or across the face of the material.
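A worked instance of that idealisation, assuming a steady 18 K indoor–outdoor temperature difference, shows why the R-value reads as elegant: under the one-dimensional, time-invariant assumptions just listed, the heat flux density is simply

```latex
q = \frac{\Delta T}{R}, \qquad
q\big|_{R=3} = \frac{18\ \mathrm{K}}{3\ \mathrm{m^2K/W}} = 6\ \mathrm{W/m^2}, \qquad
q\big|_{R=6} = \frac{18\ \mathrm{K}}{6\ \mathrm{m^2K/W}} = 3\ \mathrm{W/m^2}
```

Doubling R exactly halves the flux only because the model freezes the temperature difference in time and forbids lateral heat flow; the moment either assumption is relaxed, the tidy ratio dissolves.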

One of the early incentives for the more complex thermal environment models that are today described as BPS is that R-values create a singular focus on instantaneous heat gain or loss, with no account taken of the heat capacity of materials and the dynamics of the weather. Computer analyses that look at the impact of the weather hour by hour for a year were introduced to deal with this situation. Similarly, in lighting, traditional static daylight factor calculations for a limited number of days per year are insufficient for analysis of glare from daylight (Pierson et al. 2018; Reinhart & Walkenhorst 2001; Walkenhorst et al. 2002). In urban wind, flows are characterised by the frequency of occurrence of wind speeds over a typical year (RWDI 2019). And in acoustics, through L10 or Leq measures of what constitutes a nuisance decibel level, the statistical variation of outdoor noise and the types of noise source (Naish et al. 2011) are analysed.
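For readers from outside acoustics, the sketch below shows how those two statistical noise measures are computed from a time series of sound pressure levels; the levels themselves are invented for illustration.

```python
import numpy as np

# Assumed: one hour of A-weighted sound pressure levels, dB(A), one per second.
rng = np.random.default_rng(0)
levels = 55 + 8 * rng.random(3600)  # invented data for illustration

# Leq: the energy-equivalent continuous level. Convert each dB reading to
# relative sound energy, average, then convert back to decibels.
leq = 10 * np.log10(np.mean(10 ** (levels / 10)))

# L10: the level exceeded 10% of the time, i.e. the 90th percentile.
l10 = np.percentile(levels, 90)

print(f"Leq = {leq:.1f} dB(A), L10 = {l10:.1f} dB(A)")
```

Both measures deliberately compress a fluctuating signal into a single statistic, which is exactly the simplification-versus-reality trade-off at issue throughout this editorial.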

The desire to produce a model that looks like the real building, and is even representative of its urban context, drives the creation of ever-more-complex models. This in turn multiplies the potential for error. In a paper arguing for ‘a new professionalism’ with regard to the building design professions, Bordass & Leaman (2013) argue strongly for improved public documentation of data on the performance of buildings. Perhaps there is a potential solution here to the most difficult problem outlined by Berger et al. (2024: 12): ‘the relation between simulation results and phenomenological experiences’. If there were databases of input data and performance data (Hand et al. 2008), where the provenance could be tracked reliably, perhaps this issue could be reduced in importance. It seems likely in the age of artificial intelligence (AI) that such databases could be developed and provide the much-longed-for intelligent agent assisting the design process (Donn 2004).

It is remarkable, even in the face of the strong desire to make models more real, that the difference between standard external conditions for BPS and reality is regularly so large that the differences between design options are much smaller than the likely difference driven by these weather differences. The most obvious such influence is microclimate. Li (2021) showed differences of up to 25% in energy performance between normal models and models that examined the importance of the lapse rate of temperature with height; the airflow between and above buildings; and the influence of the urban heat island.

5. Input data

The approaches of Glew et al. and Baborska-Narożny et al. are to improve the reliability of the input data to the BPS. Glew et al. point out the potential flaws in the apparently sensible concept of using measured data from fan pressurisation tests as input about the leakiness of a building: the measured leakage is shown to be potentially anywhere from 9% to 51% in error. This is consistent with the findings of others, such as Marshall et al. (2017), who have examined the performance gap by comparing measured building property values with the standard values used in an energy model.

Although their paper focuses on the effect of shading on residential energy performance, Baborska-Narożny et al. pay almost as much attention to the importance of reliable modelling of airflow ‘between spatially connected zones’. They reference Hensen & Djunaedy (2005) for an ‘algorithm for modelling a decision on how detailed an airflow model should be’ (Baborska-Narożny et al.: 484). The authors argue that, in order to understand shading, they needed to explicitly model ‘air mixing between the lower and upper parts of the double-height living room’ (Baborska-Narożny et al.: 471). The strength of this study is the combination of interviews, regular walk-through observations and complex simulation modelling. It raises serious questions about the need for this approach in BPS for design analysis and code compliance. The importance of accounting for ‘sub-optimal user behaviours’ when modelling energy performance is revealed.

In the early 2000s, and using the 1983 definition of ‘daylight coefficients’ (Tregenza & Waters 1983), daylight simulations overcame the inherent issue of tracing millions of light rays bouncing around building interiors (Reinhart & Walkenhorst 2001). Iversen (2013) notes that the most-often-used analytical model of sky luminance is the so-called ‘Perez model’ (Perez et al. 1993, 2002). The concern that Iversen studied was in part whether a weather data file of hourly sunlight and daylight was sufficiently sensitive to the rapid fluctuations of light on a minute-by-minute basis. In a study comparing measured daylight availability at five different world locations, Walkenhorst et al. (2002: 393) demonstrated differences of up to:

27% on the simulations based on 1-h means instead of 1-min means of measured beam and diffuse irradiances.

Iversen (2013) reported little reduction of potential error from generating minute-by-minute light variations from the hourly readings using the ‘stochastic Skartveit–Olseth model’: greater detail in the weather data is unnecessary. This is reassuring for the practitioner–user of calculation software. However, as evidenced by recent critiques of the effect of the origin model for the illuminance maps in ‘typical’ weather files (Brembilla et al. 2019; Mardaljevic et al. 2025), there are still serious researcher concerns.

The papers in this special collection give the impression that improved databases of measured data could inform better BPS models. However, the fundamental flaw in this approach to the ‘black box’ is the sheer impossibility of measuring all possible inputs. To take two examples from building aerodynamics: an ‘expert system’ based on wind tunnel tests of standard building forms; and ‘machine-learning’ algorithms analysing computational fluid dynamics (CFD) simulations of 34 different tall building shapes (Stathopoulos et al. 1992; Wu et al. 2025). The resulting design guidance relies on the assumption that the proposed building is like one of the underlying ‘typical’ buildings. The user of the data may well be unaware of the limited range of prototypical buildings analysed. None of the underlying studies can possibly look at the range of urban configurations around the site of the proposed building.

Lighting and acoustics offer a seductive alternative for assuring trust in BPS: visualisations and auralisations, respectively; representations that look or sound like reality. Arguing that a standard numerical means of reporting glare and light distribution in a space ‘creates a barrier for collaboration between design teams and clients’, Natephra et al. (2017: 2) propose a virtual reality immersive environment. From a health and safety point of view it is likely always to be impractical to produce an image that reproduces the glare from the sun, so this realism is limited. More concerning is the existence of other computer programs that market themselves as ‘physically based’ and produce apparently realistic images that carry no real illuminance data. Quality assessment tools to assure the user that ‘realistic’ is real are missing.

Wang & Kang (2022) examined 20 papers published between 1979 and 2020 and suggested that acoustics accuracy—as measured by numerical performance criteria—has improved in that time frame. Their focus was the peak of acoustician achievement: the performance space. And they reviewed only papers that compared digital simulation with measured data. However, they also noted ‘overly detailed models have been shown in many studies to be detrimental to accuracy […]’ (Wang & Kang 2022: 1742). They suggested that there are clear indications from this particular performance gap of a need for ‘future algorithm improvements’. From a different point of view, Katz et al. (2018) showed good agreement between user perceptions of auralisations of four buildings and user perceptions of recordings made in the same spaces. Quality assurance processes for a listener to a piece of anechoic music played ‘through’ a building model remain apparently unpublished.

The question users of simulation models for design and code compliance face can be summarised by a slight rephrasing of another of Box’s aphorisms (Box 1976): ‘It is inappropriate to be concerned about mice when there are tigers abroad.’ In design analysis, faced with the plethora of data on potential flaws in modelling data, how does the practitioner identify the ‘tigers’: the important drivers of performance? If they were that well known, surely BPS would be unnecessary?

6. It’s the building user!

Baborska-Narożny et al., with their reference to ‘sub-optimal user behaviours’, add to the body of papers in the BPS field that study how to model the user.

With nearly 600 citations, the paper entitled ‘A generalised stochastic model for the simulation of occupant presence’ (Page et al. 2008) looks to be a valued resource in this field. The papers referenced on modelling user behaviour when simulating building performance date from the 1980s–90s (Fritsch et al. 1990; Hunt 1980). Building behaviour models on statistical representations of observations in real buildings has been seen as the solution ever since.
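The flavour of such statistical models can be conveyed in a few lines. The sketch below is not Page et al.’s (2008) published algorithm: it is a minimal two-state Markov chain with invented hourly transition probabilities, showing how occupant presence is generated stochastically rather than read from a fixed schedule.

```python
import random

# Assumed hourly probabilities for a dwelling (invented for illustration):
# p_arrive[h]: probability an absent occupant arrives during hour h;
# p_leave[h]:  probability a present occupant leaves during hour h.
p_arrive = [0.05] * 7 + [0.1] * 2 + [0.05] * 8 + [0.6] * 4 + [0.2] * 3
p_leave  = [0.02] * 7 + [0.7] * 2 + [0.3] * 8 + [0.05] * 4 + [0.1] * 3

def simulate_presence(days: int, seed: int = 1) -> list[int]:
    """Generate an hourly presence series (1 = present) via a
    two-state Markov chain over the 24 hourly probabilities above."""
    rng = random.Random(seed)
    present, series = 1, []
    for _ in range(days):
        for h in range(24):
            # Probability of being present in the next hour, given the state.
            p = p_arrive[h] if not present else 1 - p_leave[h]
            present = 1 if rng.random() < p else 0
            series.append(present)
    return series

week = simulate_presence(days=7)
print(f"Occupied fraction over a week: {sum(week) / len(week):.0%}")
```

Two runs with different seeds give different presence series, so the simulation question shifts from ‘what will the energy use be?’ to ‘what is the distribution of likely energy use?’.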

Two recent papers (Zheng et al. 2024; Zou et al. 2018) attempted to summarise the research on the performance gap in building performance analysis. Zou et al. (2018) focused on the building energy performance gap (BEPG) in 227 papers mostly published after Page et al. (2008). Zou et al. (2018: 178) conclude:

(1) occupant behavior in operation stage attracted most attention from researchers; (2) technology and method are the main strategy for bridging BEPG; and (3) information and energy related stakeholders play an important role in BEPG.

Zheng et al. (2024) analysed a similarly large number of papers. Their contribution suggests that the size of the BEPG varies between building types, and even between countries. Zheng et al. (2024: 6) report that the literature they reviewed suggests:

deep-rooted drivers of parameter setting errors, which were summarized in seven aspects from the literature reviewed, respectively, inappropriate design, uncertainty not considered, improper construction, poor management, lack of communication, lack of accountability, and conflict of interest, which occurred in the design, construction, and operation stages of the building lifecycle.

The mismatch between measured and simulated performance can be a result of many things not currently considered during design, such as poor construction practice and operation.

All these papers hint at a perception issue raised by Molina et al. (2023: 422), who ‘aimed at understanding comfort without the constraints of it being quantifiable’. They observed that when people talk of being warm and cosy, they often fold in other considerations, such as acoustics. Supporting this are Postma and Katz’s auralisation studies (Katz et al. 2018; Postma & Katz 2017). Postma & Katz (2017: 1) note that ‘judgments of several room acoustic attributes are influenced by the presence of visuals’.

Nowhere is the difficulty of modelling users more obvious than in Khalid et al.’s (2024) attempts to match a thermal simulation model’s outputs to measured temperatures. Using temperature, relative humidity and CO2 measurements, it was shown to be essential that the presence of people and moisture-generating activities is simulated on a sub-hourly basis! This is not a sound basis for design analysis, but it is necessary for representing a particular building in a model. At this level of detail, the extensive quality assurance advice, such as the coefficient of variation of the root mean square error (CVRMSE) or the mean bias error, used to confirm a match of simulation to reality (ASHRAE 2023; Coakley et al. 2014; Raftery et al. 2011), has little relevance: the focus of these indices on energy totals per month or year smooths out these user details. The standard suggestions about modelling users (Mylonas et al. 2024) focus on the match to reality. What is missing is how these user behaviour models might influence code compliance or design calculations.
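For concreteness, those two calibration indices are commonly computed as below: a sketch following the usual ASHRAE Guideline 14 style formulations (the guideline also subtracts the number of model parameters from n in the denominators, omitted here), with a monthly series invented for illustration.

```python
import numpy as np

def nmbe(measured: np.ndarray, simulated: np.ndarray) -> float:
    """Normalised mean bias error (%): systematic over- or under-prediction."""
    return 100 * np.sum(measured - simulated) / (len(measured) * np.mean(measured))

def cv_rmse(measured: np.ndarray, simulated: np.ndarray) -> float:
    """Coefficient of variation of the RMSE (%): scatter of the residuals."""
    rmse = np.sqrt(np.mean((measured - simulated) ** 2))
    return 100 * rmse / np.mean(measured)

# Invented monthly energy totals (kWh), measured vs simulated.
m = np.array([820, 760, 650, 500, 380, 300, 280, 310, 420, 560, 700, 800])
s = np.array([790, 780, 620, 520, 400, 290, 300, 330, 400, 540, 730, 770])
print(f"NMBE = {nmbe(m, s):.1f}%, CV(RMSE) = {cv_rmse(m, s):.1f}%")
```

Because both indices operate on monthly or annual totals, a model can pass the commonly cited tolerances while entirely misrepresenting the sub-hourly occupant behaviour that Khalid et al. (2024) found essential.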

In a related doctoral thesis, Bourgeois (2005) reports on the development of a sub-hourly model of occupant behaviour in a thermal simulation program. Again, the effect is significant. Iversen (2013) notes that Bourgeois (2005) shows that automatic control of lighting uses less electric energy than manual control. The overall focus of Bourgeois is on how the simulated overall energy performance is affected by occupancy. The basis of the modelling of openable windows for cooling reveals the modelling difficulties. Bourgeois defines ‘active users’ as the proportion of the building’s population who will likely operate the windows. Note is made of the temperatures at which the windows might be operated. How cold does it need to be outside before the windows will not be opened? How hot does it need to be outside or inside before the windows are closed and the air-conditioning is operated? Again, how do the results improve the designers’ performance simulation for code compliance, or for exploration of resilience?

7. Asking relevant questions

The performance gap is not merely a problem of inaccurate models; it seems to be a result of misguided expectations of what BPS tools can predict. As BPS outputs are the result of models that are approximations of reality, building codes need more appropriate performance metrics that test the robustness or reliability of performance, not compliance with some mythical ‘typical’ performance number.

The increasingly common reaction to the performance gap is to invoke a version of the first half of the Box aphorism: ‘All models are wrong.’ From this it is but a short step to proposals to scrap modelling for code compliance. No thought is given to the nonsensical concept of performing a minimum of six energy balance calculations per hour for 8760 hours in a year (more than 52,000 calculations), or a minimum of four daylight ray traces per hour for around 4000 daylit hours (some 16,000 traces), and amalgamating the answers into a single annual figure.

As can be seen in building standards related to condensation and mould risk on surfaces inside the rooms of a building, apparently precise simplified calculation methods only work if a building is heated above a temperature threshold and kept below a relative humidity threshold for every hour of the year. Cherrill (2024) notes this certainly does not reflect real-world scenarios, where temperatures and humidities fluctuate indoors in response to fluctuations in outdoor climate and to people’s occupation of the building.

Any code that genuinely focuses on the minimisation of the risk of indoor mould growth must define the risk scenarios. A building code is required that uses the power of simulation to provide answers about risk scenarios related to whole-building performance in relation to outdoor weather scenarios and in the face of specific indoor moisture-production scenarios. Building code developers could then define reasonable indoor moisture-generation extremes and outdoor weather conditions with which a building should be able to cope.

Throwing out the model because ‘all models are wrong’ has led to an ongoing search for alternatives. For at least 16 years (Cohen & Bordass 2015; Eley et al. 2009; Foroushani et al. 2022; Spataro et al. 2011) the BEPG has been characterised as a result of the ‘reference building approach’ to the use of BPS. Eley et al. (2011) suggest that setting a performance standard that seeks a building that is better than a similar building meeting a prescriptive building energy code is confusing because ‘codes frequently change’. A building that was 30% better than code a decade ago will therefore no longer be better than code today. Eley et al. (2011) proposed a performance scale where net zero is a baseline best score and actual energy use measured in the year 2000 is the worst score. However, such a definition is in itself not immutable as the discussion about regenerative design and regenerative sustainability reveals (Robinson & Cole 2015).

Another serious issue noted by the National Institute of Building Sciences (2017: 3) is that many reference building codes focus on a very small subset of what is important in the operation of a building: they note factors such as ‘building orientation, plug loads, operations and maintenance practices, quality of installation, and systems-level interactions’ are not normally addressed by a building energy code.

This type of thinking has influenced many to propose codes based on variations of Eley et al.’s (2009) ‘target’. Some such as Foroushani et al. (2022: 1) propose what is essentially a focused variant of the original reference building approach:

the Passive House energy performance level (15 kWh/m2/a thermal energy demand) as ‘net-zero energy ready’.

This ‘target’ seems in philosophy to be no different from quoting a prescriptive standard for the R-values of the roof, walls, floors and windows as the baseline for the reference building that these papers critique as the source of the performance gap. This focus purely on building skin heat transfer was already viewed as limiting in the 1980s as BPS became commonplace. Users of BPS quickly discover that modelling just the outer skin of even a small stand-alone house will produce unrealistic performance predictions. Just modelling the heat losses and gains through the external fabric ensures that heat from the sun will magically be transferred from the sunny to the cold side of the house, even though there are doors and rooms in between.

In the early days of BPS research, the focus of highly influential design publications such as Design with Climate (Olgyay & Olgyay 1963) on aspects of design such as compactness of form and orientation to the sun was identified as a trivialisation of the issues. For all buildings, depth of penetration of daylight and access to fresh air from easily understood and operated windows were necessary additions to a thoughtful design approach. By the late 20th century, the lessons from post-occupancy evaluation (Bordass et al. 2001) suggested a more nuanced view. People were found to appreciate the control over their environment that design features such as access to openable windows provide. But they also want an air-conditioner if it is too hot. BPS-informed designs with less compact, shallower plan forms allowing access to windows for light and air were shown also to require systems that are easy to understand and to maintain in good working order. BPS optimism about users, equipment and construction quality needs to be tempered by these real-world lessons. Building codes’ focus on single performance figures still ignores them and therefore inevitably creates a performance gap.

The continuation of basing building code compliance on the apparent simplicity of climate analysis was shown by Cory et al. (2011) to trivialise performance. Geographers’ classifications of climates as temperate or cold, etc. ignore the reality of buildings. In a single climate, a single-family home may face heating challenges, while an office largely requires cooling. Critiquing a reference building approach by suggesting targets based purely on a building’s heating and cooling needs, without accounting for all the other energy flows, risks perverse outcomes.

A far more robust approach, first proposed in the 1990s to the New South Wales state government in Australia (Pears et al. 1999), has morphed today into the National Australian Built Environment Rating System (NABERS) schemes in Australia, New Zealand and the UK, amongst others (Precious 2022). These are real performance targets for the holistic performance of a building, tested by post-occupancy measurement. Aspects of performance such as maintenance and commissioning are crucial to this measurement.

Understandably, standards authorities still seek to create rules that do not require a priesthood of analysts, whom they perceive as a potential barrier to design innovation, advising developers and builders what they cannot do. Often, the resulting ‘design tool’ was based on a research laboratory running many thousands of calculations and publishing the observed correlations (Balcomb & Jones 1980/82; Stoecklein & Bassett 2000). The resulting simple formulae produce what are argued to be near-instant ‘simulation’ results. Their interfaces typically require the same building design inputs as full BPS. However, they hide the most important issues: the assumptions that formed the foundation of the simulations that underlie the basic correlations.

The current fashion for examining the role that AI might play in BPS seems a child of that increasingly irrelevant correlational approach. If AI mines the completed BPS results, it seems likely it will have the same weakness: the user has limited or no access to the underlying assumptions—or the default values incorporated in the BPS results being used.

However, if AI focuses on allowing people to mine the data submitted to a BPS in order to improve the quality assurance of new BPS simulations, there are fewer risks. It offers the potential that the work on uncertainty in the data fed into the simulation tools (Aly & William 2018; Woloszyn & Beausoleil-Morrison 2017) may facilitate access to measured data of the type in the studies of both Baborska-Narożny et al. and Glew et al.

8. The future?

The challenge facing research is how to inform standards developers on the appropriate use of building performance simulation (BPS). This means moving away from seeking ‘truth’ or absolute ‘pass/fail’ results towards the Box ideal of BPS as a tool for exploration, experimentation, and deepening the understanding of building interrelationships and trade-offs.

Performance indices are needed that test real-world scenarios. Performance standards must not reduce human needs and desires to some simplistic index. The consideration of the performance of a building clearly needs to test whether it could perform well given certain assumptions about its operation. If language is framed in this way, then the focus shifts onto those assumptions.

When implementing BPS in building codes, the question arises about the meaning of ‘perform well’. Is it really a problem that an advanced building built in 2025 is no longer advanced in 2035? Is a performance scale, set in the manner suggested by Eley et al. (2011) where best means net zero, sufficient? How might that type of sustainable certification actually measure or encourage regenerative design (Robinson & Cole 2015)? And where in this calculation might there be considerations of encouraging buildings to be grid-friendly (Hu et al. 2024; Zhang et al. 2022)? Is the National Australian Built Environment Rating System (NABERS) approach, based on measured data, a sufficient definition of performing well? Can benchmarks based on existing building performance encourage advanced design?

When implementing BPS in building codes, the question arises concerning which tools can be trusted to provide this rich risk analysis. Many papers still being published demonstrate that there is a wide variation in BPS absolute numbers (Bianco et al. 2023; Del Ama Gonzalo et al. 2023; Serag et al. 2024). This is an issue that is at least four decades old (Judkoff 1985). More concerning is that it is still rare to find papers that test the common assumption that while individual calculation tools typically produce different absolute numbers, the underlying common physics ensures the relative differences between two designs will be consistent. Raslan (2010), Raslan et al. (2009) and Schwartz & Raslan (2013) have examined this issue with ambivalent results.

The key to consistency of tool output is still argued to be the consistency of input (Hand et al. 2008; Zheng et al. 2024). Environmental product declarations to ensure the accuracy of inputs to life cycle analyses and quality assurance of these data (Olanrewaju 2025) are an ongoing concern, requiring the use of trusted, measured data, with its provenance guaranteed. Other building performance standards require a similar rigour. Training informed (sceptical) users on the software comes after there are trusted sources of input data and relevant, rich definitions of adequate performance.

To create a model that Box would define as useful for the provision of evidence of compliance with a standard, that standard needs to define the following:

  • Measures of performance that relate to reasonable expectations of users’ perceptions and desires, e.g.:

    • definitions of acoustic comfort that relate to time of day and activity, not a single construction acoustic transmission value and which are robust under different external noise and operation scenarios

    • definitions of light comfort that relate to users’ ability to move away from glare, that relate to time of day and activity, and that account for light distribution quality as well as quantity (Sullivan 2021), and define robustness under different weather and operation scenarios

    • definitions of thermal comfort that relate as much to time of day and activity as they do to annual energy use and define robustness under different weather and operation scenarios.

  • Modes of standard operation: a typical, a low-risk and a high-risk mode.

  • Approved processes for documenting the source of the measured data used to describe the materials and material assemblies in the building model.

  • For daylight and energy performance calculations, regulated weather files that have, wherever possible, used satellite data, not cloud cover, to estimate solar radiation and with a minimum of two regulated scenarios: a hot summer year and a potential climate change 50 years into the future.

Acknowledgements

The author acknowledges the support and inspiration of his doctoral and master’s student colleagues whose work he has ‘supervised’ over the years. In addition, this work would not exist without the publications and debates that are part of the International Building Performance Simulation Association (IBPSA) conferences and publications, which have provided the forum for his personal focus on quality assurance in building performance simulation (BPS) in all its forms over three decades. Finally, whilst endorsing the recognition of the limitations of models as predictions of the future, it must be recognised that the near-constant research investigation of the reasons for the performance gap has enabled a better understanding of how best to apply BPS.

Competing interests

The author has no competing interests to declare. He is a member of the journal’s editorial board and guest editor of this special collection. However, he was not involved in the editorial processes and decisions concerning this manuscript.

DOI: https://doi.org/10.5334/bc.688 | Journal eISSN: 2632-6655
Language: English
Submitted on: Aug 19, 2025 | Accepted on: Aug 19, 2025 | Published on: Sep 12, 2025
Published by: Ubiquity Press

© 2025 Michael Donn, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.