29 The Hedonic Housing Price Model
Many data science tutorials use the example of predicting housing sales prices in regression modeling exercises. Many spatial analysis tutorials also use housing price models to help illustrate the concept of spatial autocorrelation and the use of formal models to investigate spatial structure. Nearly all of the examples in modern curricula, however, focus almost exclusively on the modeling frameworks themselves, with little background on the theory of why we apply these models in the first place (save that they can accommodate spatial features). This is understandable, because urban economic theory and the statistical analysis of spatial data together make for a lot of ground to cover. But it is also unfortunate, because the lack of theoretical background can lead to confusion regarding which modeling strategies are appropriate in different situations.
Housing price models are ubiquitous in regression modeling curricula because of their classical use in urban economics and regional science to help uncover the determinants of housing market pricing. That may sound like a tautology, but unlike many modern examples, the goal of housing price modeling in the social sciences is almost never to predict the selling price of a home. Rather, it is to understand how public policies or other exogenous shocks to a housing market may affect consumers’ willingness to pay for certain features of the urban fabric, like access to jobs, clean air, or the benefits of place-based policies (Harrison & Rubinfeld, 1978; Neumark & Kolko, 2010; Neumark & Simpson, 2015; Reynolds & Rohlin, 2014; Won Kim et al., 2003).
Thus, although estimating them as such requires considerable data and forethought into the modeling structure, housing price models in urban economics were conceived in a causal inference framework. As such, the application of spatial econometric regression modeling is not simply to treat geography as an additional “feature” that helps predict variation in prices, but to formalize explicitly and explore the process(es) of spatial spillover, and to ensure that any unobserved spatial process does not bias estimates of the “true” effect under study.
This is an underemphasized point in the era of Kaggle competitions and deep learning chatbots, because the goals of data science/ML and social science are, quite often, fundamentally distinct. Data science in industry is more commonly focused on prediction: a real estate company is more interested in predicting the selling price of a home than in whether the price is more affected by proximity to restaurants or a regional minimum-wage law, because the former is valuable to users of its website. The reverse is generally true in policy analysis and the social sciences.
A major reason for estimating hedonic prices and inverse demand functions is to be able to measure the benefits of changes in the level of environmental amenities. Briefly, a household’s marginal benefit for a small improvement in amenities is its marginal willingness to pay-as estimated by the marginal implicit price it faces. For a non-marginal change the benefit is approximated by the area under the inverse demand curve for the change in question. And aggregate benefits for an urban area are found by summing the relevant household measures across all households
– Freeman III (1979)
The logic of housing price regression models originates from Rosen (1974) and the theory of hedonic pricing in implicit markets.¹ The concept holds that a housing unit is a bundle of goods, rather than a single item; that is, a housing unit represents several consumption choices at once: a location decision that defines access to education systems, employment markets, environmental externalities and so on, as well as a piece of architecture, with size, quality, amenities, and maintenance characteristics. The selling price of a home is a combination of all these attributes that we can only observe in aggregate. To recover the implicit prices of a housing unit’s constituent components, we can follow Rosen’s approach as extended by Harrison & Rubinfeld (1978). In their formulation, the hedonic price model assumes consumers maximize a utility function defined as
\[U(x,h)\]
subject to the budget constraint \[y = x + p(h)\] where
- \(x\) is the quantity of composite private goods, whose price is set equal to one,
- \(h = (h_1,\dots,h_n)\) is a bundle of housing attributes, including accessibility, structure and neighborhood characteristics, and air pollution concentrations,
- \(y\) is annual money income, and
- \(p(h)\) is the housing (or hedonic) price function.
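As a brief sketch of why the price function \(p(h)\) is informative about preferences, the first-order conditions of this constrained problem can be written out. The multiplier \(\mu\) is notation introduced here for illustration, and the sketch assumes an interior solution with differentiable \(U\) and \(p\):

\[ \mathcal{L} = U(x, h) + \mu \left[ y - x - p(h) \right] \]

\[ \frac{\partial U}{\partial x} = \mu, \qquad \frac{\partial U}{\partial h_i} = \mu \frac{\partial p}{\partial h_i} \quad \Longrightarrow \quad \frac{\partial p}{\partial h_i} = \frac{\partial U / \partial h_i}{\partial U / \partial x}, \qquad i = 1, \dots, n. \]

That is, at the optimum the marginal implicit price of attribute \(h_i\) equals the household’s marginal rate of substitution between that attribute and the composite good, which is its marginal willingness to pay for a little more of \(h_i\).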
Then, we can use regression models of various forms to approximate \(p(h)\), according to the following assumptions:
- All consumers accurately perceive the characteristics represented by the vector \(h\) at every location.
- There is sufficient variation in \(h\) so that the function \(p(h)\) is continuous, with continuous first and second partial derivatives.
- The market is in short-run equilibrium.
- Spatial variations in housing characteristics are capitalized into differentials in housing prices.
The first assumption is a big one, and it’s false in many cases. Real estate agents, prior inequalities, and other market forces absolutely ensure that some groups of consumers have better knowledge of \(h\) than others (and this knowledge will vary imperfectly across space as well). Large real estate firms that buy and flip homes nationwide certainly have better knowledge of the housing market than a first-time homebuyer with a median income. Like most economic assumptions about perfect markets, we can swallow this large grain of salt and proceed, but it is worth remembering that because everyone’s utility function is different, the hedonic approach is designed to approximate the consumption of the ‘marginal consumer’. Exactly who the ‘marginal consumer’ represents is an open question (Follain & Jimenez, 1985).
The second assumption is essentially the law of large numbers; we need to observe a good amount of variation before the signal is reliably detectable. The third assumption follows loosely from Tiebout (1956), another seminal contribution in urban economics and location theory, holding that people “vote with their feet” by sorting into neighborhoods that provide the greatest utility to them. This assumption is subject to the same critiques as the first. Perfect freedom of mobility is not a realistic assumption, especially for lower income strata, and the correlation between housing unit quality and neighborhood attributes means that it’s unlikely that the housing market provides a perfect range of options that allow consumers to optimize perfectly. In the country’s best school district, it’s unlikely to find a housing unit that is small and cheaply constructed, even if there are consumers who would happily trade size and perceived quality for access to the best schools (DeLuca et al., 2024). Our view of the optimization is limited because swaths of the population may remain “stuck in place” (Sharkey, 2008, 2013; Wilson, 1987).
Nonetheless, empirical evidence for the Tiebout hypothesis is strong, suggesting that ‘short run equilibrium’ can be achieved as long as location sorting remains viable for some part of the housing market. There are parallels here to Schelling’s work showing that small preferences can yield large changes in regional patterns. The geography of opportunity may well be defined by a Tiebout equilibrium set according to the neighborhood tastes of the affluent, which ties back to Sampson’s work on the importance of concentrated affluence, as opposed to the historical focus on concentrated poverty.
“hedonic price theories of housing are most useful in providing a conceptual basis for the determination of a temporary equilibrium in which supply is fixed.”
– Arnott (1987)
The fourth assumption is the fundamental logic behind the hedonic model; the combination of physical qualities, location attributes, and other externalities affecting a housing unit is capitalized into its price (note that unlike Harrison & Rubinfeld (1978), in the formulation above we assume transportation costs are also part of the housing function rather than a separate term). An interesting question at the frontier is when the capitalization signal for certain amenities can be detected (for example, if a new transit station is planned, when do rents in the nearby apartments rise? Does the station need to be open, or does speculation cause rents to rise when the station is announced?). If the amenity has not been realized yet, will renters pay the premium for an anticipated good?
The hedonic approach was conceived as a structural model, designed to uncover the determinants of housing demand based on principles of economic theory (Koopmans, 1949; Nevo & Whinston, 2010). Rosen (1974) postulated that under equilibrium conditions, the bundle of housing goods can be decomposed into its constituent parts. The structural components rely on two critical assumptions: (1) that the housing market is in short-run equilibrium (i.e. a Tieboutian process is at work) and, relatedly, (2) that each consumer individually maximizes her utility function. Under those two assumptions, the market clears and we can observe revealed preferences and willingness to pay for different housing attributes. In practice, many scholars want to estimate hedonic models to understand the effect of some policy change on home prices, but in these cases identification issues are rife, especially if home prices can affect one another.
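To make the idea of approximating \(p(h)\) with a regression concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the attribute names (`sqft`, `dist_jobs`, `pollution`), the semi-log functional form, and the "true" coefficients are invented, and the data are simulated without any spatial dependence, so plain least squares is adequate here in a way it often will not be with real sales data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical housing attributes (the vector h in the text)
sqft = rng.uniform(50, 250, n)       # floor area
dist_jobs = rng.uniform(1, 30, n)    # distance to employment centers
pollution = rng.uniform(0, 10, n)    # air pollution index

# Assumed "true" semi-log hedonic price function: log p(h) = H @ true_beta + noise
true_beta = np.array([11.0, 0.004, -0.010, -0.020])
H = np.column_stack([np.ones(n), sqft, dist_jobs, pollution])
log_price = H @ true_beta + rng.normal(0, 0.1, n)

# Ordinary least squares estimate of the hedonic coefficients
beta_hat, *_ = np.linalg.lstsq(H, log_price, rcond=None)

# In a semi-log model dp/dh_i = beta_i * p, so the marginal implicit price
# of pollution (in price units) can be evaluated at the mean sale price
mean_price = np.exp(log_price).mean()
implicit_price_pollution = beta_hat[3] * mean_price

print("estimated coefficients:", beta_hat.round(4))
print("implicit price of one unit of pollution:", round(implicit_price_pollution, 2))
```

The coefficient on `pollution` plays the role of the marginal implicit price discussed by Freeman III (1979): it is the quantity a policy analyst cares about, whereas a purely predictive exercise would only care about the fitted prices.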
29.1 Spatial Hedonics
Spatial data help solve some inherent difficulties in identifying the hedonic model. First, as Won Kim et al. (2003) describe, the traditional hedonic model is biased in the presence of spatial processes such as land use regulations or environmental quality:
“The traditional hedonic property value model does not capture these induced [spatial] effects. It is customary in traditional models to include exogenous variables in the neighborhood characteristics category that try to explain why some neighborhoods tend to have higher or lower housing prices than other neighborhoods. Such a specification cannot capture spatial price effects that are generated by a change in a neighborhood’s housing characteristics (such as a shift in environmental quality). Therefore, a traditional hedonic property value model may lead to a biased or at least imprecise estimate of the benefits of a housing characteristic change if these induced effects are present”
– Won Kim et al. (2003)
Second, as Bartik (1987) describes, OLS estimates will be biased because quantity and consumption are endogenous, so an instrumental variable approach is necessary to provide accurate estimates. Modern spatial econometrics helps solve both of these problems simultaneously. “Spatial lag models,” which include endogenous spillovers in the \(y\) variable, use instrumental variable regression (two-stage least squares) with \(WX\) variables as instruments to identify the parameters of interest (Anselin, 2011; Arraiz et al., 2009; Kelejian & Prucha, 1999). This follows Bartik’s logic of using cities or neighborhoods as plausible instruments while simultaneously providing insight into spillovers and capturing unobserved spatial effects.
Finally, the housing market is a clear example where spatial autoregressive processes are both conceptually intuitive and substantively interesting. Houses are sold at different times and their value is often unclear until the point of sale. As such, each sale in the market sends a signal about the value of homes in that neighborhood; there is a clear process of spatial spillover, as other nearby homes are revalued according to the nearby sale.²
In this specification of HPF, the value of a house at any location is dependent on its counterparts at nearby locations in addition to its structural and neighborhood attributes. The hypothesized spatial dependence among residential structures is determined by W which is specified in an a priori fashion. The coefficient \(\rho\) measures the absolute price impact of nearby houses on the price of a particular house. As argued in Can (1990), this conceptualization corresponds more closely with the actual workings of the real estate institution in urban housing markets. A realtor will appraise a house given the price history of houses in the immediate vicinity in addition to other substantive characteristics. At the same time, home owners will initiate or forego certain improvements based on the anticipated return on their investment considering housing prices in the immediate area.
– Can (1992)
Spatial econometric methods have been developed explicitly for use in contexts such as hedonic price modeling because they provide for efficient and unbiased estimation of coefficients when property sales are spatially interactive (Anselin & Lozano-Gracia, 2008; Can, 1992; Comber & Arribas-Bel, 2017; Diao, 2015; Diao et al., 2017; Dubé et al., 2014; Dubin, 1992; Kim et al., 2020; Steimetz, 2010; Won Kim et al., 2003). For example, spatial econometric models provide an avenue for studying the relationship between housing characteristics and sales prices, even when nearby home sales have an endogenous influence on prices. These same properties provide an ideal opportunity for understanding how to value land and its features in a hedonic modeling framework when land and improvement characteristics would otherwise be difficult to separate.
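The instrumental-variable logic described above for spatial lag models can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the estimator from any particular paper or package: the data are simulated on a hypothetical 30 by 30 grid with rook contiguity, the parameter values are invented, and the instrument set \([X, WX, W^2X]\) is one common choice. A real analysis would rely on a dedicated spatial econometrics library, which also provides correct standard errors.

```python
import numpy as np

rng = np.random.default_rng(42)
side = 30
n = side * side

# Row-standardized rook-contiguity weights on a regular grid (illustrative W)
W = np.zeros((n, n))
for r in range(side):
    for c in range(side):
        i = r * side + c
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < side and 0 <= cc < side:
                W[i, rr * side + cc] = 1.0
W /= W.sum(axis=1, keepdims=True)

# Simulate a spatial lag process: y = rho*W*y + X*beta + eps
rho_true = 0.5
beta_true = np.array([1.0, 2.0, -1.0])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
eps = rng.normal(scale=0.5, size=n)
y = np.linalg.solve(np.eye(n) - rho_true * W, X @ beta_true + eps)

# Wy is endogenous; instrument it with spatial lags of the exogenous covariates
Wy = W @ y
Z = np.column_stack([X, Wy])            # regressors: X and the spatial lag of y
WX = W @ X[:, 1:]                       # lag of non-constant columns only
Q = np.column_stack([X, WX, W @ WX])    # instruments: X, WX, W^2 X

# Two-stage least squares: project Z onto the instruments, then regress y on the projection
Z_hat = Q @ np.linalg.lstsq(Q, Z, rcond=None)[0]
delta_hat = np.linalg.lstsq(Z_hat, y, rcond=None)[0]
beta_hat, rho_hat = delta_hat[:3], delta_hat[3]

print("beta_hat:", beta_hat.round(3), "rho_hat:", round(rho_hat, 3))
```

The constant is left out of the lagged instruments because, with a row-standardized \(W\), the spatial lag of a constant is the same constant and adds no information.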
Spatial econometrics offers important solutions to some identification problems, but also occupies a nebulous space in the econometric modeling world. Technically, spatial econometric models are structural because they require a priori specification of a spatial weights matrix that defines the feedback structure among observations, and they should be guided by a strong theoretical basis for which data-generating process (DGP) is at play (Gibbons & Overman, 2012; McMillen, 2012). But they are not fully structural because there is little theory guiding the appropriate specification of \(W\) (Corrado & Fingleton, 2012), and it is often hard to specify with certainty whether residual autocorrelation is caused by unobserved variables or processes of spatial spillover. The true spillover process can never be known, and differentiating spillover from spatially autocorrelated errors is exceedingly difficult in practice, so the inclusion of a global spillover parameter needs to be governed by theory (LeSage, 2014; LeSage & Pace, 2014). That is, we might alternatively view the autoregressive term as a structural parameter.
“The main point is that the model should correspond to the workings of the spatial economies investigated, and that the consequences for their ulterior use – or uses – should be explicitly considered.”
– Paelinck (2007)
29.2 Estimating Land Value
Many questions in urban studies revolve specifically around land value. This is a critical issue for things like the appraisal of land in a public trust, or ensuring low income households do not pay a disproportionate share of the tax burden relative to high income landowners (Berry & Bednarz, 1975).
29.2.1 Exogenous and Endogenous Spillovers
If we take the spatial Durbin model (SDM) as our point of departure, then let’s assume we have two broad categories of variables related to land (\(L\)) and improvements (\(I\)). Then we can stylize the SDM slightly using the decomposition approach from the land-value literature. Here, the goal is to fit a hedonic model where we differentiate the land components from the built-improvement components. Spatial econometric models provide a unique opportunity to parse these components because the endogenous relationship between land and improvement values can be decomposed with greater clarity (Can & Megbolugbe, 1997). Given a decomposed spatial Durbin model:
\[ y = \alpha + \rho Wy + \beta L + \delta I + \theta WL_j + \gamma WI_j+ \epsilon, \] this specification holds that for a given parcel of land, the selling price \(y\) is a function of
- \(\alpha\) an intercept
- \(\rho Wy\) the selling prices of nearby parcels
- \(\beta L\) the land characteristics of the parcel itself
- \(\delta I\) the improvement characteristics of the parcel itself
- \(\theta WL_j\) the land characteristics of nearby parcels
- \(\gamma WI_j\) the improvement characteristics of nearby parcels
- and \(\epsilon\) a random error term
Fitting this model to all observations in our dataset (including developed and undeveloped), should be a conceptual match for the DGP we expect to govern the hedonic pricing model (in my opinion, anyway). That is, if you were to use this model to predict the price of an undeveloped unit (out of sample), the only term that gets zeroed out is \(\delta I\) because there is no development on the parcel itself. But through the rest of the variables, we should have captured both the exogenous and endogenous spillovers inherent in the pricing model because we’re still accounting for features of the nearby developed and undeveloped parcels (as well as the general spillover in land prices). This view is consistent with classic contributions to hedonic price modeling by Can (1992, p. 456) who defines a generic hedonic function where
“the value of a house at any location is dependent on its counterparts at nearby locations in addition to its structural and neighborhood attributes. The hypothesized spatial dependence among residential structures is determined by W which is specified in an a priori fashion. The coefficient \(\rho\) measures the absolute price impact of nearby houses on the price of a particular house. As argued in Can (1990), this conceptualization corresponds more closely with the actual workings of the real estate institution in urban housing markets. A realtor will appraise a house given the price history of houses in the immediate vicinity in addition to other substantive characteristics. At the same time, home owners will initiate or forego certain improvements based on the anticipated return on their investment considering housing prices in the immediate area.”
Obviously, the \(\rho Wy\) term cannot be decomposed into land and improvement values; it’s just the effect of nearby sales prices. In the context of land value modeling, I think the decomposition of that term is actually immaterial. What \(\rho Wy\) does, effectively, is change the way a change in a covariate at one observation propagates through the system to affect others. For example, adding a new brick facade to my house may raise my neighbor’s property value directly, because they are now next to a new amenity. My neighbor also gets a small boost simply because they now sit next to a more expensive property.
That is, my neighbor benefits from two positive externalities: the aesthetic improvement and the more expensive property next door (Anselin, 2003). Then, because of price spillovers, I have indirectly increased the price of my neighbor’s neighbor’s house (my second-order neighbor), albeit by a smaller magnitude, because my second-order neighbor now abuts a more expensive property (my neighbor). Those spillover effects propagate through the system via spatial interaction, i.e. proximity to one another, which is ultimately a land (and infrastructure connectivity) effect.
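The brick facade story can be traced numerically through the reduced form of the spatial Durbin model, \(y = (I_n - \rho W)^{-1}(\alpha + \beta L + \delta I + \theta WL_j + \gamma WI_j + \epsilon)\), where \(I_n\) denotes the identity matrix (notation added here to avoid a clash with the improvement variable \(I\)). The sketch below assumes a hypothetical street of five houses in a row, with invented values for \(\rho\), \(\delta\), and \(\gamma\); it computes how a one-unit improvement at the first house changes the price of every house.

```python
import numpy as np

n = 5  # five houses along one street: 0 - 1 - 2 - 3 - 4

# Adjacent houses are neighbors; rows are standardized to sum to one
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
W = A / A.sum(axis=1, keepdims=True)

# Illustrative (assumed) parameter values
rho = 0.4     # endogenous price spillover
delta = 10.0  # effect of an improvement on the improved house itself
gamma = 2.0   # exogenous effect of a neighbor's improvement

# Matrix of partial derivatives dy/dI: (I_n - rho*W)^{-1} (delta*I_n + gamma*W)
S = np.linalg.solve(np.eye(n) - rho * W, delta * np.eye(n) + gamma * W)

# Column 0 holds the effect on every house of improving house 0
effects = S[:, 0]

# Split the next-door neighbor's gain into the two externalities in the text
exogenous_on_neighbor = gamma * W[1, 0]                     # living next to the new facade
feedback_on_neighbor = effects[1] - exogenous_on_neighbor   # living next to pricier homes

for house, effect in enumerate(effects):
    print(f"house {house}: price change {effect:.3f}")
print("neighbor: exogenous", exogenous_on_neighbor, "feedback", round(feedback_on_neighbor, 3))
```

The effect is largest at the improved house, smaller next door, and keeps shrinking with each additional step along the street, while remaining positive even for houses that share no boundary with the improved one.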
Here it is reasonable to swap “land” for “neighborhood”. Since we have so many vacant observations in the dataset, including a dummy for “developed” also means that the intercept refers to the average price of undeveloped land. And since we’ve already accounted for both endogenous and exogenous spillovers through the \(\rho Wy\), \(\theta WL_j\), and \(\gamma WI_j\) terms, the intercept is no longer contaminated by endogeneity issues about the average cost with improvements and land mixed together. Here, \(\alpha\) is an unbiased estimate (exclusively) of the average cost of an undeveloped parcel.
29.2.2 Exogenous Spillovers and Endogenous Error
Alternatively, we could take the decomposed spatial Durbin error model (SDEM) as a point of departure:
\[ \begin{gathered} y = \alpha + \beta L + \delta I + \theta WL_j + \gamma WI_j + u, \\ u = \lambda Wu + \epsilon, \end{gathered} \] which holds that, for a given parcel of land, the selling price \(y\) is a function of
- \(\alpha\) an intercept
- \(\beta L\) the land characteristics of the parcel itself
- \(\delta I\) the improvement characteristics of the parcel itself
- \(\theta WL_j\) the land characteristics of nearby parcels
- \(\gamma WI_j\) the improvement characteristics of nearby parcels
- \(\lambda Wu\) a spatially correlated error term
- and \(\epsilon\) a random error term
This model should also yield a “conceptually accurate” valuation of undeveloped parcels, because the intercept should (again) refer exclusively to average land value (assuming we include a ‘developed’ dummy), and the characteristics of nearby developed and undeveloped parcels are taken into account. Our expectation with this model is that the endogenous relationship between land and improvement values is partially disentangled by allowing the characteristics of nearby developments (and undeveloped land) to impact the price of an undeveloped parcel. There is no need to calculate marginal effects for this model and the coefficients are interpretable as usual.
In such a case, the value for land is estimated by
\[\text{land}_i = \alpha + \beta L + \theta WL_j + \gamma WI_j + u\]
That is, the only set of variables to ignore is the improvement characteristics of the parcel itself (i.e. we set \(\delta=0\)). The price of the land at location \(i\) is still influenced by the improvement characteristics of other parcels nearby.
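A rough sketch of this land value calculation on simulated parcels follows. Several simplifications are assumed and should be read as such: the parcel data, variable names (`lot`, `developed`, `bldg`), and parameter values are invented; \(W\) is built from five nearest neighbors of random coordinates; and the coefficients are estimated by ordinary least squares, which ignores the spatial error structure (point estimates remain unbiased when the covariates are exogenous, but they are not efficient and the usual standard errors would be unreliable). The predicted land value also omits the error term \(u\), which is unobserved.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 600, 5

# Row-standardized k-nearest-neighbor weights from random parcel coordinates
coords = rng.uniform(size=(n, 2))
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
np.fill_diagonal(d, np.inf)
knn = np.argsort(d, axis=1)[:, :k]
W = np.zeros((n, n))
W[np.arange(n)[:, None], knn] = 1.0 / k

# Hypothetical land (L) and improvement (I) characteristics
lot = rng.uniform(0.1, 2.0, n)              # lot size
developed = rng.random(n) < 0.7             # 'developed' dummy
bldg = developed * rng.uniform(1.0, 4.0, n) # building size, zero if vacant
I = np.column_stack([developed.astype(float), bldg])

# Simulate prices with spatially correlated errors: u = lambda*W*u + eps
alpha, beta, theta, lam = 50.0, 30.0, 10.0, 0.4
delta, gamma = np.array([40.0, 25.0]), np.array([5.0, 3.0])
u = np.linalg.solve(np.eye(n) - lam * W, rng.normal(scale=5.0, size=n))
y = alpha + beta * lot + I @ delta + theta * (W @ lot) + (W @ I) @ gamma + u

# Design matrix columns: [1, L, I (2 columns), WL, WI (2 columns)]
D = np.column_stack([np.ones(n), lot, I, W @ lot, W @ I])
coef = np.linalg.lstsq(D, y, rcond=None)[0]

# Land value: full fitted price minus the parcel's own improvement term (delta set to 0)
fitted = D @ coef
land_hat = fitted - I @ coef[2:4]

print("average estimated land value:", land_hat.mean().round(2))
print("average fitted price, developed parcels:", fitted[developed].mean().round(2))
```

For vacant parcels the land value and the fitted price coincide, because their own improvement characteristics are zero; for developed parcels the land value is what remains after the parcel’s own improvements are netted out, while the improvements on neighboring parcels still contribute through \(\gamma WI_j\).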
¹ Interestingly, the concept of the hedonic model was arguably foreshadowed by Hansen (1959) in his seminal work on accessibility and land use: “The immediate value of the relationships described in this paper is that it will be possible to isolate and examine empirically the effect of other factors on land development, such as income, zoning, taxes, and land costs. The results of such studies would provide the planner with a clearer understanding of the metropolitan community and of the effectiveness of land controls.”
² Gibbons & Overman (2012) argue this is a critical logic error in the application of spatial lag models because it allows future events to influence the past; however, LeSage & Pace (2009) argue that the ‘simultaneity’ in the models has an implicit temporal dimension. You could also make a ‘rational expectations’ argument that the future can influence the past in some housing contexts (Can, 1992; Lowry, 1960).