Les données de transcations immobilières utilisées ici (données DVF, pour Demandes de Valeurs Foncières), disponibles sous licence ouverte depuis 2019, sont réputées globalement fiables et exhaustives, mais nécessitent néanmoins des précautions d’utilisation (voir une description de leur qualité ici : https://www.groupe-dvf.fr/vademecum-fiche-n3-precautions-techniques-et-qualite-des-donnees-dvf/). La base utilisée ci-dessous a été retravaillée par etalab, un département de la direction interministérielle du numérique.
L’utilisation de ce jeu de données est ici réalisé dans un but exclusivement pédagogique, et non académique. Par conséquent, la qualité de la donnée n’a pas été re-vérifiée de façon approfondie, et il convient de rester prudent quant aux inférences issues de ces analyses.
L’objectif de cet atelier est de mettre en pratique, dans R, certains des concepts et méthodes fréquemment employés en analyse spatiale, avec un focus ici sur l’hétérogénéité spatiale et la régression géographiquement pondérée (GWR).
L’hétérogénéité spatiale fait partie de deux principaux concepts de l’analyse spatiale (aux côtés de la dépendance spatiale). Cette hétérogénéité est ubiquiste et systématique, et a pour conséquence qu’aucun lieu sur terre ne saurait être représentatif de tous les autres. Elle a un rôle fondamental dans les analyses géographiques, en particulier quantitatives (elle est indissociable des problèmes de représentativité spatiale, de résolution et d’échelle d’analyse).
D’un point de vue statistique, la traduction de l’hétérogénéité spatiale est la non-stationnarité spatiale. Celle-ci désigne l’instabilité, dans l’espace, des moyennes, des variances et des covariances. La non-stationnarité spatiale des covariances fait référence à l’hétérogénéité spatiale des relations statistiques. C’est cet élément qui fait l’objet de la démonstration ci-dessous.
L’hétérogénéité spatiale des relations statistiques est très fréquente, dans de nombreux domaines (écologie, santé, économie, etc.), et son ignorance peut induire des erreurs d’interprétation importantes et des conséquences économiques et sanitaires (par exemple) non négligeables.
Dans le domaine des marchés immobiliers, cette hétérogénéité spatiale des relations est habituelle. Par exemple, on sait que l’effet marginal d’1 m\(^2\) supplémentaire sur le prix de vente est variable selon la localisation (typiquement plus élevé en zone dense, dans les centres-villes). Cela a des conséquences sur l’estimation des marchés, mais aussi en modélisation hédonique, sur l’estimation de la valeur de différentes caractéristiques sur la base de ces relations.
Cela a poussé des auteurs à segmenter les marchés immobiliers en sous-marchés homogènes, au sein desquels les déterminants du prix seraient stationnaires (voir par exemple Goodman et Thibodeau (1998) ou Helbich et al. (2013)). La question est de savoir comment délimiter ces sous-marchés de façon pertinente.
La méthode proposée dans le cadre de cet atelier, la GWR, est une méthode de régression locale qui permet précisément d’explorer la non-stationarité spatiale à travers des cartes de relations (Fotheringham et al., 2003), et de servir alors de base pour une régionalisation. Nous allons préciser la façon dont ce type de modèle est calibré, estimé et interprété.
library(tidyverse)
library(sf)
library(tmap)
library(plotly)
library(gtsummary)
library(GGally)
library(GWmodel)
library(spdep)
La base est disponible et décrite ici : https://www.data.gouv.fr/fr/datasets/demandes-de-valeurs-foncieres-geolocalisees/
### Import de la base brute
On importe directement le csv dans R (pour le département 17) :
data <- read_csv("https://files.data.gouv.fr/geo-dvf/latest/csv/2020/departements/17.csv.gz")
head(data) #Pour vérifier que l'import est correct
## # A tibble: 6 x 40
## id_mutation date_mutation numero_disposition nature_mutation valeur_fonciere
## <chr> <date> <chr> <chr> <dbl>
## 1 2020-135780 2020-01-03 000002 Vente 200000
## 2 2020-135780 2020-01-03 000002 Vente 200000
## 3 2020-135780 2020-01-03 000002 Vente 200000
## 4 2020-135780 2020-01-03 000002 Vente 200000
## 5 2020-135780 2020-01-03 000002 Vente 200000
## 6 2020-135781 2020-01-03 000001 Vente 310300
## # … with 35 more variables: adresse_numero <dbl>, adresse_suffixe <chr>,
## # adresse_nom_voie <chr>, adresse_code_voie <chr>, code_postal <dbl>,
## # code_commune <dbl>, nom_commune <chr>, code_departement <dbl>,
## # ancien_code_commune <lgl>, ancien_nom_commune <lgl>, id_parcelle <chr>,
## # ancien_id_parcelle <lgl>, numero_volume <lgl>, lot1_numero <chr>,
## # lot1_surface_carrez <dbl>, lot2_numero <chr>, lot2_surface_carrez <dbl>,
## # lot3_numero <dbl>, lot3_surface_carrez <lgl>, lot4_numero <dbl>,
## # lot4_surface_carrez <lgl>, lot5_numero <lgl>, lot5_surface_carrez <lgl>,
## # nombre_lots <dbl>, code_type_local <dbl>, type_local <chr>,
## # surface_reelle_bati <dbl>, nombre_pieces_principales <dbl>,
## # code_nature_culture <chr>, nature_culture <chr>,
## # code_nature_culture_speciale <chr>, nature_culture_speciale <chr>,
## # surface_terrain <dbl>, longitude <dbl>, latitude <dbl>
dataOleron <- data %>%
filter(nom_commune %in% c("Dolus-d'Oléron",
"La Brée-les-Bains",
"Le Château-d'Oléron",
"Le Grand-Village-Plage",
"Saint-Denis-d'Oléron",
"Saint-Georges-d'Oléron",
"Saint-Pierre-d'Oléron",
"Saint-Trojan-les-Bains") &
nature_mutation == "Vente" &
type_local == "Maison" &
!is.na(longitude) &
!is.na(surface_terrain) &
!is.na(valeur_fonciere))
dataSf <- dataOleron %>%
st_as_sf(coords = c("longitude","latitude"),
crs = 4326) # WGS84
plot(st_geometry(dataSf))
oleron <- st_read("oleron.shp")
## Reading layer `oleron' from data source `/home/tim/Documents/prz/gwr/oleron.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: -1.413073 ymin: 45.79904 xmax: -1.187825 ymax: 46.04829
## Geodetic CRS: WGS 84
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(oleron) +
tm_lines(col = "black") +
tm_shape(dataSf) +
tm_dots(col = "red")
dataSf$dist_litt <- st_distance(dataSf, oleron) %>%
as.numeric()
plot_ly(dataSf, x = ~valeur_fonciere) %>% add_histogram()
dataSf <- dataSf %>% filter(valeur_fonciere > 1000)
plot_ly(dataSf, x = ~log(valeur_fonciere)) %>% add_histogram()
a <- plot_ly(dataSf, x = ~log(dist_litt)) %>% add_histogram()
b <- plot_ly(dataSf, x = ~log(surface_reelle_bati)) %>% add_histogram()
c <- plot_ly(dataSf, x = ~log(surface_terrain)) %>% add_histogram()
subplot(a,b,c)
# Suppression des maisons vraisemblablement trop petites
dataSf <- dataSf %>% filter(surface_reelle_bati > 10)
# Création des variables log (pour faciliter la carto par la suite)
dataSf$log_valeur_fonciere <- log(dataSf$valeur_fonciere)
dataSf$log_dist_litt <- log(dataSf$dist_litt)
dataSf$log_surface_reelle_bati <- log(dataSf$surface_reelle_bati)
dataSf$log_surface_terrain <- log(dataSf$surface_terrain)
ggplot(dataSf, aes(x=log(dist_litt), y=log(valeur_fonciere))) +
geom_point() + geom_smooth()
ggplot(dataSf, aes(x=log(surface_reelle_bati), y=log(valeur_fonciere))) +
geom_point() + geom_smooth()
ggplot(dataSf, aes(x=log(surface_terrain), y=log(valeur_fonciere))) +
geom_point() + geom_smooth()
mco <- lm(log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati + log_surface_terrain, data = dataSf)
mco %>%
tbl_regression(intercept = TRUE) %>%
add_vif()
Characteristic | Beta | 95% CI1 | p-value | VIF1 |
---|---|---|---|---|
(Intercept) | 9.6 | 9.1, 10 | <0.001 | |
log_dist_litt | -0.08 | -0.12, -0.04 | <0.001 | 1.0 |
log_surface_reelle_bati | 0.46 | 0.37, 0.55 | <0.001 | 1.1 |
log_surface_terrain | 0.22 | 0.18, 0.26 | <0.001 | 1.1 |
1
CI = Confidence Interval, VIF = Variance Inflation Factor
|
ggcoef_model(mco)
dataSf$resMco <- mco$residuals
tm_shape(dataSf) + tm_dots(col = "resMco", style = "quantile")
dataSp <- as_Spatial(dataSf) # le package GWmodel n'est pas compatible avec 'sf'
matDist <- gw.dist(dp.locat = coordinates(dataSp))
# Exponential
nNeigh.exp <- bw.gwr(data = dataSp, approach = "AICc",
kernel = "exponential",
adaptive = TRUE,
dMat = matDist,
formula = log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati + log_surface_terrain)
## Adaptive bandwidth (number of nearest neighbours): 283 AICc value: 493.5324
## Adaptive bandwidth (number of nearest neighbours): 183 AICc value: 490.555
## Adaptive bandwidth (number of nearest neighbours): 120 AICc value: 487.6843
## Adaptive bandwidth (number of nearest neighbours): 82 AICc value: 477.5322
## Adaptive bandwidth (number of nearest neighbours): 57 AICc value: 473.0692
## Adaptive bandwidth (number of nearest neighbours): 43 AICc value: 471.0132
## Adaptive bandwidth (number of nearest neighbours): 33 AICc value: 469.6697
## Adaptive bandwidth (number of nearest neighbours): 28 AICc value: 465.9644
## Adaptive bandwidth (number of nearest neighbours): 23 AICc value: 463.5791
## Adaptive bandwidth (number of nearest neighbours): 22 AICc value: 464.9729
## Adaptive bandwidth (number of nearest neighbours): 25 AICc value: 462.0606
## Adaptive bandwidth (number of nearest neighbours): 25 AICc value: 462.0606
# Bisquare
nNeigh.bisq <- bw.gwr(data = dataSp, approach = "AICc",
kernel = "bisquare",
adaptive = TRUE,
dMat = matDist,
formula = log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati + log_surface_terrain)
## Adaptive bandwidth (number of nearest neighbours): 283 AICc value: 493.3797
## Adaptive bandwidth (number of nearest neighbours): 183 AICc value: 488.1249
## Adaptive bandwidth (number of nearest neighbours): 120 AICc value: 490.2119
## Adaptive bandwidth (number of nearest neighbours): 221 AICc value: 490.1729
## Adaptive bandwidth (number of nearest neighbours): 158 AICc value: 488.1154
## Adaptive bandwidth (number of nearest neighbours): 144 AICc value: 487.7541
## Adaptive bandwidth (number of nearest neighbours): 134 AICc value: 488.5333
## Adaptive bandwidth (number of nearest neighbours): 149 AICc value: 487.5514
## Adaptive bandwidth (number of nearest neighbours): 153 AICc value: 487.7659
## Adaptive bandwidth (number of nearest neighbours): 147 AICc value: 487.9004
## Adaptive bandwidth (number of nearest neighbours): 150 AICc value: 487.3307
## Adaptive bandwidth (number of nearest neighbours): 151 AICc value: 487.2637
## Adaptive bandwidth (number of nearest neighbours): 151 AICc value: 487.2637
# Avec pondération exponential
GWR.exp <- gwr.basic(data = dataSp, bw = nNeigh.exp, kernel = "exponential", adaptive = TRUE, dMat = matDist, formula = log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati + log_surface_terrain)
# Avec pondération bisquare
GWR.bisq <- gwr.basic(data = dataSp, bw = nNeigh.bisq, kernel = "bisquare",
adaptive = TRUE, dMat = matDist,
formula = log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati + log_surface_terrain)
diagGwr <- cbind(
rbind(nNeigh.exp,nNeigh.bisq),
rbind(GWR.exp$GW.diagnostic$gw.R2,GWR.bisq$GW.diagnostic$gw.R2),
rbind(GWR.exp$GW.diagnostic$AIC,GWR.bisq$GW.diagnostic$AIC)) %>%
`colnames<-`(c("Nb Voisins","R2","AIC")) %>%
`rownames<-`(c("EXPONENTIAL","BISQUARE"))
diagGwr
## Nb Voisins R2 AIC
## EXPONENTIAL 25 0.5801772 401.1731
## BISQUARE 151 0.5031266 455.4523
GWR.exp
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2021-06-29 17:29:12
## Call:
## gwr.basic(formula = log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati +
## log_surface_terrain, data = dataSp, bw = nNeigh.exp, kernel = "exponential",
## adaptive = TRUE, dMat = matDist)
##
## Dependent (y) variable: log_valeur_fonciere
## Independent variables: log_dist_litt log_surface_reelle_bati log_surface_terrain
## Number of data points: 446
## ***********************************************************************
## * Results of Global Regression *
## ***********************************************************************
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.08899 -0.18479 0.01833 0.19689 2.97012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.60848 0.24208 39.692 < 2e-16 ***
## log_dist_litt -0.07647 0.02016 -3.793 0.000169 ***
## log_surface_reelle_bati 0.45742 0.04668 9.799 < 2e-16 ***
## log_surface_terrain 0.21706 0.02076 10.458 < 2e-16 ***
##
## ---Significance stars
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.4202 on 442 degrees of freedom
## Multiple R-squared: 0.433
## Adjusted R-squared: 0.4291
## F-statistic: 112.5 on 3 and 442 DF, p-value: < 2.2e-16
## ***Extra Diagnostic information
## Residual sum of squares: 78.02509
## Sigma(hat): 0.4192042
## AIC: 498.1865
## AICc: 498.3228
## BIC: 103.1897
## ***********************************************************************
## * Results of Geographically Weighted Regression *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: exponential
## Adaptive bandwidth: 25 (number of nearest neighbours)
## Regression points: the same locations as observations are used.
## Distance metric: A distance matrix is specified for this model calibration.
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median 3rd Qu. Max.
## Intercept 8.246966 8.953660 9.359738 10.132842 11.4319
## log_dist_litt -0.223555 -0.094306 -0.068669 -0.033344 0.1244
## log_surface_reelle_bati 0.177656 0.352127 0.480940 0.564959 0.6983
## log_surface_terrain -0.059955 0.174898 0.237955 0.254789 0.4053
## ************************Diagnostic information*************************
## Number of data points: 446
## Effective number of parameters (2trace(S) - trace(S'S)): 69.53251
## Effective degrees of freedom (n-2trace(S) + trace(S'S)): 376.4675
## AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 462.0606
## AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 401.1731
## BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 195.0179
## Residual sum of squares: 57.77146
## R-square value: 0.5801772
## Adjusted R-square value: 0.5024306
##
## ***********************************************************************
## Program stops at: 2021-06-29 17:29:12
Il faut maintenant cartographier les betas de chaque variable pour décrire cette non-stationnarité spatiale.
# Fonction de cartographie automatique des coefficients GWR
mapGWR <- function(spdf,var,var_TV,legend.title = "betas GWR",main.title, dot.size = 0.3) {
tv <- spdf[abs(var_TV)>1.96,]
tm_shape(spdf) +
tm_dots(var, title = legend.title, size = dot.size) +
tm_shape(oleron) + tm_lines() +
tm_shape(tv) + tm_dots(col="grey40") +
tm_layout(title = main.title, legend.title.size =0.9, inner.margins = .15)
}
# Planche cartographique des 3 variables
tmap_mode("plot")
a <- mapGWR(GWR.exp$SDF, var = "log_dist_litt",var_TV = GWR.exp$SDF$log_dist_litt_TV,
main.title = "Distance au littoral")
b <- mapGWR(GWR.exp$SDF, var = "log_surface_reelle_bati",var_TV = GWR.exp$SDF$log_surface_reelle_bati_TV,
main.title = "Surface bâtie")
c <- mapGWR(GWR.exp$SDF, var = "log_surface_terrain",var_TV = GWR.exp$SDF$log_surface_terrain_TV,
main.title = "Surface terrain")
tmap_arrange(a,b,c)
L’interprétation des cartes GWR est la partie la plus délicate, mais aussi la plus intéressante. Nous présentons d’abord une clé de lecture théorique et générique, puis une application à nos données.
Il n’y a pas de raison de penser que tous les prédicteurs agissent sur le prix à la même échelle (c’est-à-dire selon un même schéma de voisinage). Certains processus peuvent être locaux, d’autres globaux. Récemment, une extension de la GWR a été proposée, permettant de relâcher cette hypothèse d’égalité des échelles : la GWR multiscalaire (MGWR, Fotheringham et al., 2017). Le principe est simple : un algorithme optimise le choix de la bandwidth pour chaque prédicteur, en fonction des autres. Il en résulte un modèle souvent mixte.
source("gwr.multiscale_T.r")
# On lance la MGWR
MGWR <- gwr.multiscale(formula = log_valeur_fonciere ~ log_dist_litt +
log_surface_reelle_bati + log_surface_terrain,
data = dataSp, kernel = "exponential",
predictor.centered=rep(T, 3), # centrage des prédicteurs
adaptive = TRUE,
bws0 = rep(1,4)) # BW minimum pour l'optimisation
## ------ Calculate the initial beta0 from the above bandwidths ------
## ------ The end for calculating the initial beta0 ------
## ------ Select the optimum bandwidths for each independent variable via AIC aproach ------
## ***** The back-fitting process for model calibration with bandwiths selected *****
## Iteration 1 :
## Now select an optimum bandwidth for the variable: Intercept
## The newly selected bandwidth for variable Intercept is: 143
## The bandwidth used in the last ieration is: 1 and the difference between these two bandwidths is: 142
## The bandwidth for variable Intercept will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable: log_dist_litt
## The newly selected bandwidth for variable log_dist_litt is: 62
## The bandwidth used in the last ieration is: 1 and the difference between these two bandwidths is: 61
## The bandwidth for variable log_dist_litt will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable: log_surface_reelle_bati
## The newly selected bandwidth for variable log_surface_reelle_bati is: 64
## The bandwidth used in the last ieration is: 1 and the difference between these two bandwidths is: 63
## The bandwidth for variable log_surface_reelle_bati will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable: log_surface_terrain
## The newly selected bandwidth for variable log_surface_terrain is: 23
## The bandwidth used in the last ieration is: 1 and the difference between these two bandwidths is: 22
## The bandwidth for variable log_surface_terrain will be continually selected in the next ieration.
## Ieration 1 the differential change value of RSS (dCVR) is: 0.3100205
## ----------End of Iteration 1 ----------
## Iteration 2 :
## Now select an optimum bandwidth for the variable: Intercept
## The newly selected bandwidth for variable Intercept is: 384
## The bandwidth used in the last ieration is: 143 and the difference between these two bandwidths is: 241
## The bandwidth for variable Intercept will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable: log_dist_litt
## The newly selected bandwidth for variable log_dist_litt is: 73
## The bandwidth used in the last ieration is: 62 and the difference between these two bandwidths is: 11
## The bandwidth for variable log_dist_litt will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable: log_surface_reelle_bati
## The newly selected bandwidth for variable log_surface_reelle_bati is: 77
## The bandwidth used in the last ieration is: 64 and the difference between these two bandwidths is: 13
## The bandwidth for variable log_surface_reelle_bati will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable: log_surface_terrain
## The newly selected bandwidth for variable log_surface_terrain is: 23
## The bandwidth used in the last ieration is: 23 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_surface_terrain seems to be converged for 1 times.It will be continually optimized in the next 4 times
## Ieration 2 the differential change value of RSS (dCVR) is: 0.1040459
## ----------End of Iteration 2 ----------
## Iteration 3 :
## Now select an optimum bandwidth for the variable: Intercept
## The newly selected bandwidth for variable Intercept is: 432
## The bandwidth used in the last ieration is: 384 and the difference between these two bandwidths is: 48
## The bandwidth for variable Intercept will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable: log_dist_litt
## The newly selected bandwidth for variable log_dist_litt is: 77
## The bandwidth used in the last ieration is: 73 and the difference between these two bandwidths is: 4
## The bandwidth for variable log_dist_litt will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable: log_surface_reelle_bati
## The newly selected bandwidth for variable log_surface_reelle_bati is: 368
## The bandwidth used in the last ieration is: 77 and the difference between these two bandwidths is: 291
## The bandwidth for variable log_surface_reelle_bati will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable: log_surface_terrain
## The newly selected bandwidth for variable log_surface_terrain is: 23
## The bandwidth used in the last ieration is: 23 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_surface_terrain seems to be converged for 2 times.It will be continually optimized in the next 3 times
## Ieration 3 the differential change value of RSS (dCVR) is: 0.133204
## ----------End of Iteration 3 ----------
## Iteration 4 :
## Now select an optimum bandwidth for the variable: Intercept
## The newly selected bandwidth for variable Intercept is: 433
## The bandwidth used in the last ieration is: 432 and the difference between these two bandwidths is: 1
## The bandwidth for variable Intercept will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable: log_dist_litt
## The newly selected bandwidth for variable log_dist_litt is: 77
## The bandwidth used in the last ieration is: 77 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_dist_litt seems to be converged for 1 times.It will be continually optimized in the next 4 times
## Now select an optimum bandwidth for the variable: log_surface_reelle_bati
## The newly selected bandwidth for variable log_surface_reelle_bati is: 368
## The bandwidth used in the last ieration is: 368 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_surface_reelle_bati seems to be converged for 1 times.It will be continually optimized in the next 4 times
## Now select an optimum bandwidth for the variable: log_surface_terrain
## The newly selected bandwidth for variable log_surface_terrain is: 23
## The bandwidth used in the last ieration is: 23 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_surface_terrain seems to be converged for 3 times.It will be continually optimized in the next 2 times
## Ieration 4 the differential change value of RSS (dCVR) is: 0.004474833
## ----------End of Iteration 4 ----------
## Iteration 5 :
## Now select an optimum bandwidth for the variable: Intercept
## The newly selected bandwidth for variable Intercept is: 441
## The bandwidth used in the last ieration is: 433 and the difference between these two bandwidths is: 8
## The bandwidth for variable Intercept will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable: log_dist_litt
## The newly selected bandwidth for variable log_dist_litt is: 77
## The bandwidth used in the last ieration is: 77 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_dist_litt seems to be converged for 2 times.It will be continually optimized in the next 3 times
## Now select an optimum bandwidth for the variable: log_surface_reelle_bati
## The newly selected bandwidth for variable log_surface_reelle_bati is: 368
## The bandwidth used in the last ieration is: 368 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_surface_reelle_bati seems to be converged for 2 times.It will be continually optimized in the next 3 times
## Now select an optimum bandwidth for the variable: log_surface_terrain
## The newly selected bandwidth for variable log_surface_terrain is: 23
## The bandwidth used in the last ieration is: 23 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_surface_terrain seems to be converged for 4 times.It will be continually optimized in the next 1 times
## Ieration 5 the differential change value of RSS (dCVR) is: 0.0116311
## ----------End of Iteration 5 ----------
## Iteration 6 :
## Now select an optimum bandwidth for the variable: Intercept
## The newly selected bandwidth for variable Intercept is: 441
## The bandwidth used in the last ieration is: 441 and the difference between these two bandwidths is: 0
## The bandwidth for variable Intercept seems to be converged for 1 times.It will be continually optimized in the next 4 times
## Now select an optimum bandwidth for the variable: log_dist_litt
## The newly selected bandwidth for variable log_dist_litt is: 77
## The bandwidth used in the last ieration is: 77 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_dist_litt seems to be converged for 3 times.It will be continually optimized in the next 2 times
## Now select an optimum bandwidth for the variable: log_surface_reelle_bati
## The newly selected bandwidth for variable log_surface_reelle_bati is: 368
## The bandwidth used in the last ieration is: 368 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_surface_reelle_bati seems to be converged for 3 times.It will be continually optimized in the next 2 times
## Now select an optimum bandwidth for the variable: log_surface_terrain
## The newly selected bandwidth for variable log_surface_terrain is: 23
## The bandwidth used in the last ieration is: 23 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_surface_terrain seems to be converged and will be kept the same in the following ierations.
## Ieration 6 the differential change value of RSS (dCVR) is: 0.00309254
## ----------End of Iteration 6 ----------
## Iteration 7 :
## Now select an optimum bandwidth for the variable: Intercept
## The newly selected bandwidth for variable Intercept is: 441
## The bandwidth used in the last ieration is: 441 and the difference between these two bandwidths is: 0
## The bandwidth for variable Intercept seems to be converged for 2 times.It will be continually optimized in the next 3 times
## Now select an optimum bandwidth for the variable: log_dist_litt
## The newly selected bandwidth for variable log_dist_litt is: 77
## The bandwidth used in the last ieration is: 77 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_dist_litt seems to be converged for 4 times.It will be continually optimized in the next 1 times
## Now select an optimum bandwidth for the variable: log_surface_reelle_bati
## The newly selected bandwidth for variable log_surface_reelle_bati is: 368
## The bandwidth used in the last ieration is: 368 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_surface_reelle_bati seems to be converged for 4 times.It will be continually optimized in the next 1 times
## Ieration 7 the differential change value of RSS (dCVR) is: 0.001383194
## ----------End of Iteration 7 ----------
## Iteration 8 :
## Now select an optimum bandwidth for the variable: Intercept
## The newly selected bandwidth for variable Intercept is: 441
## The bandwidth used in the last ieration is: 441 and the difference between these two bandwidths is: 0
## The bandwidth for variable Intercept seems to be converged for 3 times.It will be continually optimized in the next 2 times
## Now select an optimum bandwidth for the variable: log_dist_litt
## The newly selected bandwidth for variable log_dist_litt is: 77
## The bandwidth used in the last ieration is: 77 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_dist_litt seems to be converged and will be kept the same in the following ierations.
## Now select an optimum bandwidth for the variable: log_surface_reelle_bati
## The newly selected bandwidth for variable log_surface_reelle_bati is: 368
## The bandwidth used in the last ieration is: 368 and the difference between these two bandwidths is: 0
## The bandwidth for variable log_surface_reelle_bati seems to be converged and will be kept the same in the following ierations.
## Ieration 8 the differential change value of RSS (dCVR) is: 0.0006396762
## ----------End of Iteration 8 ----------
## Iteration 9 :
## Now select an optimum bandwidth for the variable: Intercept
## The newly selected bandwidth for variable Intercept is: 441
## The bandwidth used in the last ieration is: 441 and the difference between these two bandwidths is: 0
## The bandwidth for variable Intercept seems to be converged for 4 times.It will be continually optimized in the next 1 times
## Ieration 9 the differential change value of RSS (dCVR) is: 0.0002963825
## ----------End of Iteration 9 ----------
## Iteration 10 :
## Now select an optimum bandwidth for the variable: Intercept
## The newly selected bandwidth for variable Intercept is: 441
## The bandwidth used in the last ieration is: 441 and the difference between these two bandwidths is: 0
## The bandwidth for variable Intercept seems to be converged and will be kept the same in the following ierations.
## Ieration 10 the differential change value of RSS (dCVR) is: 0.0001375759
## ----------End of Iteration 10 ----------
## Iteration 11 :
## Ieration 11 the differential change value of RSS (dCVR) is: 6.401949e-05
## ----------End of Iteration 11 ----------
## Iteration 12 :
## Ieration 12 the differential change value of RSS (dCVR) is: 2.986701e-05
## ----------End of Iteration 12 ----------
## Iteration 13 :
## Ieration 13 the differential change value of RSS (dCVR) is: 1.396718e-05
## ----------End of Iteration 13 ----------
## Iteration 14 :
## Ieration 14 the differential change value of RSS (dCVR) is: 6.546073e-06
## ----------End of Iteration 14 ----------
mgwr.bw <- round(MGWR[[2]]$bws,1) # Nombre de voisins pour chaque prédicteur
mgwr.bw
## [1] 441 77 368 23
# Exploration des résultats statistiques
print(MGWR)
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2021-06-29 17:29:15
## Call:
## gwr.multiscale(formula = log_valeur_fonciere ~ log_dist_litt +
## log_surface_reelle_bati + log_surface_terrain, data = dataSp,
## kernel = "exponential", adaptive = TRUE, bws0 = rep(1, 4),
## predictor.centered = rep(T, 3))
##
## Dependent (y) variable: log_valeur_fonciere
## Independent variables: log_dist_litt log_surface_reelle_bati log_surface_terrain
## Number of data points: 446
## ***********************************************************************
## * GWR with Parameter-Specific Distance Metrics *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: exponential
## Adaptive bandwidths for each coefficient(number of nearest neighbours):
## (Intercept) log_dist_litt log_surface_reelle_bati
## Bandwidth 441 77 368
## log_surface_terrain
## Bandwidth 23
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median 3rd Qu. Max.
## Intercept 8.201685 9.253211 9.481919 9.940775 10.9172
## log_dist_litt -0.176545 -0.093417 -0.060333 -0.049206 -0.0229
## log_surface_reelle_bati 0.456252 0.461429 0.465485 0.474299 0.4794
## log_surface_terrain -0.042735 0.154269 0.225274 0.264967 0.3943
## ************************Diagnostic information*************************
## Residual sum of squares: 65.78948
## R-square value: 0.5219106
## Adjusted R-square value: 0.4833002
## AICc value: 462.1182
##
## ***********************************************************************
## Program stops at: 2021-06-29 17:31:00
On constate que si l’effet de la surface bâtie sur les prix agit de façon assez globale à l’échelle de l’île, les deux autres prédicteurs agissent de manière beaucoup plus locales. Ainsi par exemple, l’effet de la distance au littoral relève d’un processus très localisé.
a <- mapGWR(MGWR$SDF, var = "log_dist_litt",var_TV = MGWR$SDF$log_dist_litt_TV,
main.title = "Distance au littoral (bw = 77)")
b <- mapGWR(MGWR$SDF, var = "log_surface_reelle_bati",var_TV = MGWR$SDF$log_surface_reelle_bati_TV,
main.title = "Surface bâtie (bw = 368)")
c <- mapGWR(MGWR$SDF, var = "log_surface_terrain",var_TV = MGWR$SDF$log_surface_terrain_TV,
main.title = "Surface terrain (bw = 23)")
tmap_arrange(a,b,c)
L’objectif est maintenant de délimiter des sous-marchés immobiliers, sur la base des coefficients de la GWR. Ainsi, on recherchera l’homogénéité des prédicteurs dans chaque sous-marché.
Ce processus de découpage de l’espace en sous-régions homogènes se nomme la régionalisation. C’est une extension de la classification classique : on y ajoute un critère de contiguité spatiale. La régionalisation est donc une classification spatiale.
Il existe plusieurs méthodes de régionalisation. Un des principes les plus répandus consiste à établir une classification à la fois sur la base de la ressemblance entre les observations, et sur leur proximité dans l’espace géographique.
Nous allons ici utiliser l’algorithme SKATER (Spatial Klustering Analysis by Tree Edge Removal), méthode proposée par Assunçao et al. (2006) et déjà appliqué dans un contexte de recherche similaire au notre par Helbich et al. (2013). Par ailleurs, une description très pédagogique de la méthode est disponible ici : http://www.jms-insee.fr/2018/S08_5_ACTE_ROUSSEZ_JMS2018.pdf
L’algorithme SKATER comporte 4 étapes (cf. doc cité ci-dessus) :
# Centrage-réduction pour rendre les coefficients comparables
gwrB.scaled <- GWR.exp$SDF %>%
as.data.frame() %>%
select(1:4) %>%
mutate_all(~scale(.)) %>%
rename_with(~paste(.x, "b", sep = "_"))
knn <- knearneigh(GWR.exp$SDF, k=50)
nb <- knn2nb(knn)
plot(nb, coords = coordinates(GWR.exp$SDF), col="blue")
costs <- nbcosts(nb, data = gwrB.scaled)
costsW <- nb2listw(nb, costs, style="B")
costsTree <- mstree(costsW)
plot(costsTree, coords = coordinates(GWR.exp$SDF), col="blue", main = "Arbre portant minimal")
clus6 <- skater(edges = costsTree[,1:2], data = gwrB.scaled, ncuts = 5)
### Troisième étape : analyse des résultats
#### Cartographie des clusters
dataClus <- dataSf %>%
mutate(clus = as.factor(clus6$groups)) %>%
bind_cols(gwrB.scaled)
tmap_mode(mode = "view")
tm_shape(dataClus) +
tm_symbols(col="clus", size=.8, palette = "Set1") +
tm_layout(title = "Classification en 6 groupes")
nomVar <- c("log_dist_litt_b","log_surface_reelle_bati_b","log_surface_terrain_b","Intercept_b")
clusProfile <- dataClus[, c(nomVar, "clus")] %>%
group_by(clus) %>%
summarise_each(funs(mean)) %>%
st_drop_geometry()
clusLong <- reshape2::melt(clusProfile, id.vars = "clus")
profilePlot <- ggplot(clusLong) +
geom_bar(aes(x = variable, y = value),
fill = "grey25",
position = "identity",
stat = "identity") +
scale_x_discrete("Effet") +
scale_y_continuous("Valeur moyenne par classe") +
facet_wrap(~ clus) +
coord_flip() +
theme(strip.background = element_rect(fill="grey25"),
strip.text = element_text(colour = "grey85", face = "bold"))
profilePlot
Assunção, R. M., Neves, M. C., Câmara, G., & da Costa Freitas, C. (2006). Efficient regionalization techniques for socio‐economic geographical units using minimum spanning trees. International Journal of Geographical Information Science, 20(7), 797-811.
Feuillet, T., Commenges, H., Menai, M., Salze, P., Perchoux, C., Reuillon, R., … & Oppert, J. M. (2018). A massive geographically weighted regression model of walking-environment relationships. Journal of transport geography, 68, 118-129.
Feuillet, T., Cossart, E., & Commenges, H. (2019). Manuel de géographie quantitative: concepts, outils, méthodes. Armand Colin.
Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2003). Geographically weighted regression: the analysis of spatially varying relationships. John Wiley & Sons.
Fotheringham, A. S., Yang, W., & Kang, W. (2017). Multiscale geographically weighted regression (MGWR). Annals of the American Association of Geographers, 107(6), 1247-1265.
Goodman, A. C., & Thibodeau, T. G. (1998). Housing market segmentation. Journal of housing economics, 7(2), 121-143.
Helbich, M., Brunauer, W., Hagenauer, J., & Leitner, M. (2013). Data-driven regionalization of housing markets. Annals of the Association of American Geographers, 103(4), 871-889.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 10 (buster)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] spdep_1.1-8 GWmodel_2.2-6 spatialreg_1.1-8 Matrix_1.3-3
## [5] spData_0.3.10 Rcpp_1.0.6 robustbase_0.93-8 maptools_1.1-1
## [9] sp_1.4-5 GGally_2.1.2 gtsummary_1.4.1 plotly_4.9.4.1
## [13] tmap_3.3-2 sf_1.0-1 forcats_0.5.1 stringr_1.4.0
## [17] dplyr_1.0.7 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
## [21] tibble_3.1.2 ggplot2_3.3.4 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.2.1 lwgeom_0.2-6
## [4] plyr_1.8.6 lazyeval_0.2.2 splines_4.1.0
## [7] crosstalk_1.1.1 leaflet_2.0.4.1 digest_0.6.27
## [10] htmltools_0.5.1.1 gdata_2.18.0 leaflet.providers_1.9.0
## [13] fansi_0.5.0 checkmate_2.0.0 magrittr_2.0.1
## [16] openxlsx_4.2.4 modelr_0.1.8 gmodels_2.18.1
## [19] xts_0.12.1 colorspace_2.0-2 rvest_1.0.0
## [22] rgdal_1.5-23 haven_2.4.1 xfun_0.24
## [25] leafem_0.1.6 crayon_1.4.1 jsonlite_1.7.2
## [28] survival_3.2-11 zoo_1.8-9 glue_1.4.2
## [31] stars_0.5-3 gtable_0.3.0 car_3.0-10
## [34] DEoptimR_1.0-9 abind_1.4-5 scales_1.1.1
## [37] DBI_1.1.1 viridisLite_0.4.0 units_0.7-2
## [40] foreign_0.8-81 proxy_0.4-26 intervals_0.15.2
## [43] htmlwidgets_1.5.3 httr_1.4.2 FNN_1.1.3
## [46] RColorBrewer_1.1-2 wk_0.4.1 ellipsis_0.3.2
## [49] pkgconfig_2.0.3 reshape_0.8.8 XML_3.99-0.6
## [52] farver_2.1.0 sass_0.4.0 dbplyr_2.1.1
## [55] deldir_0.2-10 utf8_1.2.1 reshape2_1.4.4
## [58] tidyselect_1.1.1 labeling_0.4.2 rlang_0.4.11
## [61] tmaptools_3.1-1 munsell_0.5.0 cellranger_1.1.0
## [64] tools_4.1.0 cli_2.5.0 generics_0.1.0
## [67] broom_0.7.8 evaluate_0.14 yaml_2.2.1
## [70] leafsync_0.1.0 knitr_1.33 fs_1.5.0
## [73] zip_2.2.0 s2_1.0.6 nlme_3.1-152
## [76] xml2_1.3.2 compiler_4.1.0 rstudioapi_0.13
## [79] curl_4.3.2 png_0.1-7 e1071_1.7-7
## [82] gt_0.3.0 reprex_2.0.0 spacetime_1.2-5
## [85] broom.helpers_1.3.0 bslib_0.2.5.1 stringi_1.6.2
## [88] highr_0.9 lattice_0.20-44 commonmark_1.7
## [91] classInt_0.4-3 vctrs_0.3.8 pillar_1.6.1
## [94] LearnBayes_2.15.1 lifecycle_1.0.0 jquerylib_0.1.4
## [97] data.table_1.14.0 raster_3.4-13 R6_2.5.0
## [100] rio_0.5.27 KernSmooth_2.23-20 spDataLarge_0.5.4
## [103] codetools_0.2-18 dichromat_2.0-0 boot_1.3-28
## [106] MASS_7.3-54 gtools_3.9.2 assertthat_0.2.1
## [109] withr_2.4.2 mgcv_1.8-35 expm_0.999-6
## [112] parallel_4.1.0 hms_1.1.0 labelled_2.8.0
## [115] grid_4.1.0 coda_0.19-4 class_7.3-19
## [118] rmarkdown_2.9 carData_3.0-4 lubridate_1.7.10
## [121] base64enc_0.1-3