Avertissement

Les données de transcations immobilières utilisées ici (données DVF, pour Demandes de Valeurs Foncières), disponibles sous licence ouverte depuis 2019, sont réputées globalement fiables et exhaustives, mais nécessitent néanmoins des précautions d’utilisation (voir une description de leur qualité ici : https://www.groupe-dvf.fr/vademecum-fiche-n3-precautions-techniques-et-qualite-des-donnees-dvf/). La base utilisée ci-dessous a été retravaillée par etalab, un département de la direction interministérielle du numérique.

L’utilisation de ce jeu de données est ici réalisé dans un but exclusivement pédagogique, et non académique. Par conséquent, la qualité de la donnée n’a pas été re-vérifiée de façon approfondie, et il convient de rester prudent quant aux inférences issues de ces analyses.


 

Objectif de cet atelier

L’objectif de cet atelier est de mettre en pratique, dans R, certains des concepts et méthodes fréquemment employés en analyse spatiale, avec un focus ici sur l’hétérogénéité spatiale et la régression géographiquement pondérée (GWR).

L’hétérogénéité spatiale fait partie de deux principaux concepts de l’analyse spatiale (aux côtés de la dépendance spatiale). Cette hétérogénéité est ubiquiste et systématique, et a pour conséquence qu’aucun lieu sur terre ne saurait être représentatif de tous les autres. Elle a un rôle fondamental dans les analyses géographiques, en particulier quantitatives (elle est indissociable des problèmes de représentativité spatiale, de résolution et d’échelle d’analyse).

D’un point de vue statistique, la traduction de l’hétérogénéité spatiale est la non-stationnarité spatiale. Celle-ci désigne l’instabilité, dans l’espace, des moyennes, des variances et des covariances. La non-stationnarité spatiale des covariances fait référence à l’hétérogénéité spatiale des relations statistiques. C’est cet élément qui fait l’objet de la démonstration ci-dessous.

L’hétérogénéité spatiale des relations statistiques est très fréquente, dans de nombreux domaines (écologie, santé, économie, etc.), et son ignorance peut induire des erreurs d’interprétation importantes et des conséquences économiques et sanitaires (par exemple) non négligeables.

Dans le domaine des marchés immobiliers, cette hétérogénéité spatiale des relations est habituelle. Par exemple, on sait que l’effet marginal d’1 m\(^2\) supplémentaire sur le prix de vente est variable selon la localisation (typiquement plus élevé en zone dense, dans les centres-villes). Cela a des conséquences sur l’estimation des marchés, mais aussi en modélisation hédonique, sur l’estimation de la valeur de différentes caractéristiques sur la base de ces relations.

Cela a poussé des auteurs à segmenter les marchés immobiliers en sous-marchés homogènes, au sein desquels les déterminants du prix seraient stationnaires (voir par exemple Goodman et Thibodeau (1998) ou Helbich et al. (2013)). La question est de savoir comment délimiter ces sous-marchés de façon pertinente.

La méthode proposée dans le cadre de cet atelier, la GWR, est une méthode de régression locale qui permet précisément d’explorer la non-stationarité spatiale à travers des cartes de relations (Fotheringham et al., 2003), et de servir alors de base pour une régionalisation. Nous allons préciser la façon dont ce type de modèle est calibré, estimé et interprété.


 

Chargement des librairies

library(tidyverse)
library(sf)
library(tmap)
library(plotly)
library(gtsummary)
library(GGally)
library(GWmodel)
library(spdep)

 

Import et préparation de la base

La base est disponible et décrite ici : https://www.data.gouv.fr/fr/datasets/demandes-de-valeurs-foncieres-geolocalisees/

### Import de la base brute

On importe directement le csv dans R (pour le département 17) :

data <- read_csv("https://files.data.gouv.fr/geo-dvf/latest/csv/2020/departements/17.csv.gz")
head(data) #Pour vérifier que l'import est correct
## # A tibble: 6 x 40
##   id_mutation date_mutation numero_disposition nature_mutation valeur_fonciere
##   <chr>       <date>        <chr>              <chr>                     <dbl>
## 1 2020-135780 2020-01-03    000002             Vente                    200000
## 2 2020-135780 2020-01-03    000002             Vente                    200000
## 3 2020-135780 2020-01-03    000002             Vente                    200000
## 4 2020-135780 2020-01-03    000002             Vente                    200000
## 5 2020-135780 2020-01-03    000002             Vente                    200000
## 6 2020-135781 2020-01-03    000001             Vente                    310300
## # … with 35 more variables: adresse_numero <dbl>, adresse_suffixe <chr>,
## #   adresse_nom_voie <chr>, adresse_code_voie <chr>, code_postal <dbl>,
## #   code_commune <dbl>, nom_commune <chr>, code_departement <dbl>,
## #   ancien_code_commune <lgl>, ancien_nom_commune <lgl>, id_parcelle <chr>,
## #   ancien_id_parcelle <lgl>, numero_volume <lgl>, lot1_numero <chr>,
## #   lot1_surface_carrez <dbl>, lot2_numero <chr>, lot2_surface_carrez <dbl>,
## #   lot3_numero <dbl>, lot3_surface_carrez <lgl>, lot4_numero <dbl>,
## #   lot4_surface_carrez <lgl>, lot5_numero <lgl>, lot5_surface_carrez <lgl>,
## #   nombre_lots <dbl>, code_type_local <dbl>, type_local <chr>,
## #   surface_reelle_bati <dbl>, nombre_pieces_principales <dbl>,
## #   code_nature_culture <chr>, nature_culture <chr>,
## #   code_nature_culture_speciale <chr>, nature_culture_speciale <chr>,
## #   surface_terrain <dbl>, longitude <dbl>, latitude <dbl>

Filtre sur les ventes de maisons à Oléron avec coordonnées géographiques

dataOleron <- data %>% 
  filter(nom_commune %in% c("Dolus-d'Oléron",
                            "La Brée-les-Bains",
                            "Le Château-d'Oléron",
                            "Le Grand-Village-Plage",
                            "Saint-Denis-d'Oléron",
                            "Saint-Georges-d'Oléron",
                            "Saint-Pierre-d'Oléron",
                            "Saint-Trojan-les-Bains") & 
           nature_mutation == "Vente" & 
           type_local == "Maison" &
           !is.na(longitude) & 
           !is.na(surface_terrain) &
           !is.na(valeur_fonciere))

Conversion en sf

dataSf <- dataOleron %>% 
  st_as_sf(coords = c("longitude","latitude"), 
           crs = 4326) # WGS84

plot(st_geometry(dataSf))


 

Import du fond de carte en shapefile

oleron <- st_read("oleron.shp")
## Reading layer `oleron' from data source `/home/tim/Documents/prz/gwr/oleron.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -1.413073 ymin: 45.79904 xmax: -1.187825 ymax: 46.04829
## Geodetic CRS:  WGS 84

Cartographie

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(oleron) + 
  tm_lines(col = "black") + 
  tm_shape(dataSf) + 
  tm_dots(col = "red")

Ajout d’une variable contextuelle : distance au littoral

dataSf$dist_litt <- st_distance(dataSf, oleron) %>% 
  as.numeric()

 

Exploration des variables

Distribution de la variable dépendante (prix de vente)

plot_ly(dataSf, x = ~valeur_fonciere) %>% add_histogram()

Suppression d’une valeur aberrante

dataSf <- dataSf %>% filter(valeur_fonciere > 1000)

Distribution très dissymétrique

plot_ly(dataSf, x = ~log(valeur_fonciere)) %>% add_histogram()

C’est mieux !

Distribution des variables indépendantes

a <- plot_ly(dataSf, x = ~log(dist_litt)) %>% add_histogram()
b <- plot_ly(dataSf, x = ~log(surface_reelle_bati)) %>% add_histogram()
c <- plot_ly(dataSf, x = ~log(surface_terrain)) %>% add_histogram()
subplot(a,b,c)
# Suppression des maisons vraisemblablement trop petites
dataSf <- dataSf %>% filter(surface_reelle_bati > 10)

# Création des variables log (pour faciliter la carto par la suite)
dataSf$log_valeur_fonciere <- log(dataSf$valeur_fonciere)
dataSf$log_dist_litt <- log(dataSf$dist_litt)
dataSf$log_surface_reelle_bati <- log(dataSf$surface_reelle_bati)
dataSf$log_surface_terrain <- log(dataSf$surface_terrain)

 

Relations bivariées - formes fonctionelles

ggplot(dataSf, aes(x=log(dist_litt), y=log(valeur_fonciere))) + 
  geom_point() + geom_smooth()

ggplot(dataSf, aes(x=log(surface_reelle_bati), y=log(valeur_fonciere))) + 
  geom_point() + geom_smooth()

ggplot(dataSf, aes(x=log(surface_terrain), y=log(valeur_fonciere))) + 
  geom_point() + geom_smooth()


 

Modèle log-log global (MCO)

mco <- lm(log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati + log_surface_terrain, data = dataSf)
mco %>%
  tbl_regression(intercept = TRUE) %>% 
  add_vif()
Characteristic Beta 95% CI1 p-value VIF1
(Intercept) 9.6 9.1, 10 <0.001
log_dist_litt -0.08 -0.12, -0.04 <0.001 1.0
log_surface_reelle_bati 0.46 0.37, 0.55 <0.001 1.1
log_surface_terrain 0.22 0.18, 0.26 <0.001 1.1

1 CI = Confidence Interval, VIF = Variance Inflation Factor

ggcoef_model(mco)

Cartographie des résidus

dataSf$resMco <- mco$residuals

tm_shape(dataSf) + tm_dots(col = "resMco", style = "quantile")

 

Modèle GWR


 

Principes généraux


 

Calibration du modèle : définition et pondération du voisinage


 

Première chose à faire : convertir l’objet sf en objet sp

dataSp <- as_Spatial(dataSf) # le package GWmodel n'est pas compatible avec 'sf'

 

Construction de la matrice de distances

matDist <- gw.dist(dp.locat = coordinates(dataSp))

 

Optimisation du voisinage (h)


 

Comparaison de deux pondérations spatiales : exponentielle et bicarrée :

# Exponential
nNeigh.exp <- bw.gwr(data = dataSp, approach = "AICc",
                     kernel = "exponential",
                     adaptive = TRUE,
                     dMat = matDist,
                     formula = log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati + log_surface_terrain)
## Adaptive bandwidth (number of nearest neighbours): 283 AICc value: 493.5324 
## Adaptive bandwidth (number of nearest neighbours): 183 AICc value: 490.555 
## Adaptive bandwidth (number of nearest neighbours): 120 AICc value: 487.6843 
## Adaptive bandwidth (number of nearest neighbours): 82 AICc value: 477.5322 
## Adaptive bandwidth (number of nearest neighbours): 57 AICc value: 473.0692 
## Adaptive bandwidth (number of nearest neighbours): 43 AICc value: 471.0132 
## Adaptive bandwidth (number of nearest neighbours): 33 AICc value: 469.6697 
## Adaptive bandwidth (number of nearest neighbours): 28 AICc value: 465.9644 
## Adaptive bandwidth (number of nearest neighbours): 23 AICc value: 463.5791 
## Adaptive bandwidth (number of nearest neighbours): 22 AICc value: 464.9729 
## Adaptive bandwidth (number of nearest neighbours): 25 AICc value: 462.0606 
## Adaptive bandwidth (number of nearest neighbours): 25 AICc value: 462.0606
# Bisquare
nNeigh.bisq <- bw.gwr(data = dataSp, approach = "AICc", 
                      kernel = "bisquare", 
                      adaptive = TRUE, 
                      dMat = matDist, 
                      formula = log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati + log_surface_terrain)
## Adaptive bandwidth (number of nearest neighbours): 283 AICc value: 493.3797 
## Adaptive bandwidth (number of nearest neighbours): 183 AICc value: 488.1249 
## Adaptive bandwidth (number of nearest neighbours): 120 AICc value: 490.2119 
## Adaptive bandwidth (number of nearest neighbours): 221 AICc value: 490.1729 
## Adaptive bandwidth (number of nearest neighbours): 158 AICc value: 488.1154 
## Adaptive bandwidth (number of nearest neighbours): 144 AICc value: 487.7541 
## Adaptive bandwidth (number of nearest neighbours): 134 AICc value: 488.5333 
## Adaptive bandwidth (number of nearest neighbours): 149 AICc value: 487.5514 
## Adaptive bandwidth (number of nearest neighbours): 153 AICc value: 487.7659 
## Adaptive bandwidth (number of nearest neighbours): 147 AICc value: 487.9004 
## Adaptive bandwidth (number of nearest neighbours): 150 AICc value: 487.3307 
## Adaptive bandwidth (number of nearest neighbours): 151 AICc value: 487.2637 
## Adaptive bandwidth (number of nearest neighbours): 151 AICc value: 487.2637

Estimation de la GWR

# Avec pondération exponential
GWR.exp <- gwr.basic(data = dataSp, bw = nNeigh.exp, kernel = "exponential", adaptive = TRUE,  dMat = matDist, formula = log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati + log_surface_terrain)

# Avec pondération bisquare
GWR.bisq <- gwr.basic(data = dataSp, bw = nNeigh.bisq, kernel = "bisquare", 
                      adaptive = TRUE,  dMat = matDist, 
                      formula = log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati + log_surface_terrain)

Comparaison des deux calibrations :

diagGwr <- cbind(
  rbind(nNeigh.exp,nNeigh.bisq),
  rbind(GWR.exp$GW.diagnostic$gw.R2,GWR.bisq$GW.diagnostic$gw.R2),
  rbind(GWR.exp$GW.diagnostic$AIC,GWR.bisq$GW.diagnostic$AIC)) %>% 
  `colnames<-`(c("Nb Voisins","R2","AIC")) %>% 
  `rownames<-`(c("EXPONENTIAL","BISQUARE"))
diagGwr
##             Nb Voisins        R2      AIC
## EXPONENTIAL         25 0.5801772 401.1731
## BISQUARE           151 0.5031266 455.4523

La GWR avec pondération exponentielle est la plus performante


 

Interprétation des résultats bruts de la GWR :

GWR.exp
##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2021-06-29 17:29:12 
##    Call:
##    gwr.basic(formula = log_valeur_fonciere ~ log_dist_litt + log_surface_reelle_bati + 
##     log_surface_terrain, data = dataSp, bw = nNeigh.exp, kernel = "exponential", 
##     adaptive = TRUE, dMat = matDist)
## 
##    Dependent (y) variable:  log_valeur_fonciere
##    Independent variables:  log_dist_litt log_surface_reelle_bati log_surface_terrain
##    Number of data points: 446
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.08899 -0.18479  0.01833  0.19689  2.97012 
## 
##    Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
##    (Intercept)              9.60848    0.24208  39.692  < 2e-16 ***
##    log_dist_litt           -0.07647    0.02016  -3.793 0.000169 ***
##    log_surface_reelle_bati  0.45742    0.04668   9.799  < 2e-16 ***
##    log_surface_terrain      0.21706    0.02076  10.458  < 2e-16 ***
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 0.4202 on 442 degrees of freedom
##    Multiple R-squared: 0.433
##    Adjusted R-squared: 0.4291 
##    F-statistic: 112.5 on 3 and 442 DF,  p-value: < 2.2e-16 
##    ***Extra Diagnostic information
##    Residual sum of squares: 78.02509
##    Sigma(hat): 0.4192042
##    AIC:  498.1865
##    AICc:  498.3228
##    BIC:  103.1897
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: exponential 
##    Adaptive bandwidth: 25 (number of nearest neighbours)
##    Regression points: the same locations as observations are used.
##    Distance metric: A distance matrix is specified for this model calibration.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                                 Min.   1st Qu.    Median   3rd Qu.    Max.
##    Intercept                8.246966  8.953660  9.359738 10.132842 11.4319
##    log_dist_litt           -0.223555 -0.094306 -0.068669 -0.033344  0.1244
##    log_surface_reelle_bati  0.177656  0.352127  0.480940  0.564959  0.6983
##    log_surface_terrain     -0.059955  0.174898  0.237955  0.254789  0.4053
##    ************************Diagnostic information*************************
##    Number of data points: 446 
##    Effective number of parameters (2trace(S) - trace(S'S)): 69.53251 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 376.4675 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 462.0606 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 401.1731 
##    BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 195.0179 
##    Residual sum of squares: 57.77146 
##    R-square value:  0.5801772 
##    Adjusted R-square value:  0.5024306 
## 
##    ***********************************************************************
##    Program stops at: 2021-06-29 17:29:12

 

  • Première information : le R2 du modèle GWR est > au R2 du modèle MCO (plus grande capacité à expliquer la variance de Y).
  • Seconde information : il semble exister une non-stationnarité spatiale, et même des inversions de signes pour la variable distance au littoral

Il faut maintenant cartographier les betas de chaque variable pour décrire cette non-stationnarité spatiale.

# Fonction de cartographie automatique des coefficients GWR
mapGWR <- function(spdf,var,var_TV,legend.title = "betas GWR",main.title, dot.size = 0.3) {
  tv <- spdf[abs(var_TV)>1.96,]
  tm_shape(spdf) +
    tm_dots(var, title = legend.title, size = dot.size) +
    tm_shape(oleron) + tm_lines() +
    tm_shape(tv) + tm_dots(col="grey40") +
    tm_layout(title = main.title, legend.title.size =0.9, inner.margins = .15) 
}
# Planche cartographique des 3 variables
tmap_mode("plot")
a <- mapGWR(GWR.exp$SDF, var = "log_dist_litt",var_TV = GWR.exp$SDF$log_dist_litt_TV,
            main.title = "Distance au littoral")
b <- mapGWR(GWR.exp$SDF, var = "log_surface_reelle_bati",var_TV = GWR.exp$SDF$log_surface_reelle_bati_TV,
            main.title = "Surface bâtie")
c <- mapGWR(GWR.exp$SDF, var = "log_surface_terrain",var_TV = GWR.exp$SDF$log_surface_terrain_TV,
            main.title = "Surface terrain")

tmap_arrange(a,b,c)


 

Interprétation des cartes de betas GWR

L’interprétation des cartes GWR est la partie la plus délicate, mais aussi la plus intéressante. Nous présentons d’abord une clé de lecture théorique et générique, puis une application à nos données.

Interprétation théorique : le poids des contextes locaux


 

Interprétation empirique : recherche de spécificités locales inobservées


 

Pour aller plus loin

La GWR multiscalaire (multiscale GWR)

Il n’y a pas de raison de penser que tous les prédicteurs agissent sur le prix à la même échelle (c’est-à-dire selon un même schéma de voisinage). Certains processus peuvent être locaux, d’autres globaux. Récemment, une extension de la GWR a été proposée, permettant de relâcher cette hypothèse d’égalité des échelles : la GWR multiscalaire (MGWR, Fotheringham et al., 2017). Le principe est simple : un algorithme optimise le choix de la bandwidth pour chaque prédicteur, en fonction des autres. Il en résulte un modèle souvent mixte.


 

source("gwr.multiscale_T.r") 

# On lance la MGWR
MGWR <- gwr.multiscale(formula = log_valeur_fonciere ~ log_dist_litt + 
                         log_surface_reelle_bati + log_surface_terrain,
                       data = dataSp, kernel = "exponential",
                       predictor.centered=rep(T, 3), # centrage des prédicteurs
                       adaptive = TRUE,
                       bws0 = rep(1,4)) # BW minimum pour l'optimisation
## ------   Calculate the initial beta0 from the above bandwidths    ------
## ------            The end for calculating the initial beta0              ------
## ------ Select the optimum bandwidths for each independent variable via  AIC  aproach ------
## *****  The back-fitting process for model calibration with bandwiths selected *****
##     Iteration  1 :
## Now select an optimum bandwidth for the variable:  Intercept 
## The newly selected bandwidth for variable  Intercept is:  143 
## The bandwidth used in the last ieration is:  1 and the difference between these two bandwidths is:  142 
## The bandwidth for variable  Intercept will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable:  log_dist_litt 
## The newly selected bandwidth for variable  log_dist_litt is:  62 
## The bandwidth used in the last ieration is:  1 and the difference between these two bandwidths is:  61 
## The bandwidth for variable  log_dist_litt will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable:  log_surface_reelle_bati 
## The newly selected bandwidth for variable  log_surface_reelle_bati is:  64 
## The bandwidth used in the last ieration is:  1 and the difference between these two bandwidths is:  63 
## The bandwidth for variable  log_surface_reelle_bati will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable:  log_surface_terrain 
## The newly selected bandwidth for variable  log_surface_terrain is:  23 
## The bandwidth used in the last ieration is:  1 and the difference between these two bandwidths is:  22 
## The bandwidth for variable  log_surface_terrain will be continually selected in the next ieration.
##     Ieration  1 the differential change value of RSS (dCVR) is:  0.3100205 
## ----------End of    Iteration  1 ----------
##     Iteration  2 :
## Now select an optimum bandwidth for the variable:  Intercept 
## The newly selected bandwidth for variable  Intercept is:  384 
## The bandwidth used in the last ieration is:  143 and the difference between these two bandwidths is:  241 
## The bandwidth for variable  Intercept will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable:  log_dist_litt 
## The newly selected bandwidth for variable  log_dist_litt is:  73 
## The bandwidth used in the last ieration is:  62 and the difference between these two bandwidths is:  11 
## The bandwidth for variable  log_dist_litt will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable:  log_surface_reelle_bati 
## The newly selected bandwidth for variable  log_surface_reelle_bati is:  77 
## The bandwidth used in the last ieration is:  64 and the difference between these two bandwidths is:  13 
## The bandwidth for variable  log_surface_reelle_bati will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable:  log_surface_terrain 
## The newly selected bandwidth for variable  log_surface_terrain is:  23 
## The bandwidth used in the last ieration is:  23 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_surface_terrain seems to be converged for  1  times.It will be continually optimized in the next  4  times
##     Ieration  2 the differential change value of RSS (dCVR) is:  0.1040459 
## ----------End of    Iteration  2 ----------
##     Iteration  3 :
## Now select an optimum bandwidth for the variable:  Intercept 
## The newly selected bandwidth for variable  Intercept is:  432 
## The bandwidth used in the last ieration is:  384 and the difference between these two bandwidths is:  48 
## The bandwidth for variable  Intercept will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable:  log_dist_litt 
## The newly selected bandwidth for variable  log_dist_litt is:  77 
## The bandwidth used in the last ieration is:  73 and the difference between these two bandwidths is:  4 
## The bandwidth for variable  log_dist_litt will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable:  log_surface_reelle_bati 
## The newly selected bandwidth for variable  log_surface_reelle_bati is:  368 
## The bandwidth used in the last ieration is:  77 and the difference between these two bandwidths is:  291 
## The bandwidth for variable  log_surface_reelle_bati will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable:  log_surface_terrain 
## The newly selected bandwidth for variable  log_surface_terrain is:  23 
## The bandwidth used in the last ieration is:  23 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_surface_terrain seems to be converged for  2  times.It will be continually optimized in the next  3  times
##     Ieration  3 the differential change value of RSS (dCVR) is:  0.133204 
## ----------End of    Iteration  3 ----------
##     Iteration  4 :
## Now select an optimum bandwidth for the variable:  Intercept 
## The newly selected bandwidth for variable  Intercept is:  433 
## The bandwidth used in the last ieration is:  432 and the difference between these two bandwidths is:  1 
## The bandwidth for variable  Intercept will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable:  log_dist_litt 
## The newly selected bandwidth for variable  log_dist_litt is:  77 
## The bandwidth used in the last ieration is:  77 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_dist_litt seems to be converged for  1  times.It will be continually optimized in the next  4  times
## Now select an optimum bandwidth for the variable:  log_surface_reelle_bati 
## The newly selected bandwidth for variable  log_surface_reelle_bati is:  368 
## The bandwidth used in the last ieration is:  368 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_surface_reelle_bati seems to be converged for  1  times.It will be continually optimized in the next  4  times
## Now select an optimum bandwidth for the variable:  log_surface_terrain 
## The newly selected bandwidth for variable  log_surface_terrain is:  23 
## The bandwidth used in the last ieration is:  23 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_surface_terrain seems to be converged for  3  times.It will be continually optimized in the next  2  times
##     Ieration  4 the differential change value of RSS (dCVR) is:  0.004474833 
## ----------End of    Iteration  4 ----------
##     Iteration  5 :
## Now select an optimum bandwidth for the variable:  Intercept 
## The newly selected bandwidth for variable  Intercept is:  441 
## The bandwidth used in the last ieration is:  433 and the difference between these two bandwidths is:  8 
## The bandwidth for variable  Intercept will be continually selected in the next ieration.
## Now select an optimum bandwidth for the variable:  log_dist_litt 
## The newly selected bandwidth for variable  log_dist_litt is:  77 
## The bandwidth used in the last ieration is:  77 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_dist_litt seems to be converged for  2  times.It will be continually optimized in the next  3  times
## Now select an optimum bandwidth for the variable:  log_surface_reelle_bati 
## The newly selected bandwidth for variable  log_surface_reelle_bati is:  368 
## The bandwidth used in the last ieration is:  368 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_surface_reelle_bati seems to be converged for  2  times.It will be continually optimized in the next  3  times
## Now select an optimum bandwidth for the variable:  log_surface_terrain 
## The newly selected bandwidth for variable  log_surface_terrain is:  23 
## The bandwidth used in the last ieration is:  23 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_surface_terrain seems to be converged for  4  times.It will be continually optimized in the next  1  times
##     Ieration  5 the differential change value of RSS (dCVR) is:  0.0116311 
## ----------End of    Iteration  5 ----------
##     Iteration  6 :
## Now select an optimum bandwidth for the variable:  Intercept 
## The newly selected bandwidth for variable  Intercept is:  441 
## The bandwidth used in the last ieration is:  441 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  Intercept seems to be converged for  1  times.It will be continually optimized in the next  4  times
## Now select an optimum bandwidth for the variable:  log_dist_litt 
## The newly selected bandwidth for variable  log_dist_litt is:  77 
## The bandwidth used in the last ieration is:  77 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_dist_litt seems to be converged for  3  times.It will be continually optimized in the next  2  times
## Now select an optimum bandwidth for the variable:  log_surface_reelle_bati 
## The newly selected bandwidth for variable  log_surface_reelle_bati is:  368 
## The bandwidth used in the last ieration is:  368 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_surface_reelle_bati seems to be converged for  3  times.It will be continually optimized in the next  2  times
## Now select an optimum bandwidth for the variable:  log_surface_terrain 
## The newly selected bandwidth for variable  log_surface_terrain is:  23 
## The bandwidth used in the last ieration is:  23 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_surface_terrain seems to be converged and will be kept the same in the following ierations.
##     Ieration  6 the differential change value of RSS (dCVR) is:  0.00309254 
## ----------End of    Iteration  6 ----------
##     Iteration  7 :
## Now select an optimum bandwidth for the variable:  Intercept 
## The newly selected bandwidth for variable  Intercept is:  441 
## The bandwidth used in the last ieration is:  441 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  Intercept seems to be converged for  2  times.It will be continually optimized in the next  3  times
## Now select an optimum bandwidth for the variable:  log_dist_litt 
## The newly selected bandwidth for variable  log_dist_litt is:  77 
## The bandwidth used in the last ieration is:  77 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_dist_litt seems to be converged for  4  times.It will be continually optimized in the next  1  times
## Now select an optimum bandwidth for the variable:  log_surface_reelle_bati 
## The newly selected bandwidth for variable  log_surface_reelle_bati is:  368 
## The bandwidth used in the last ieration is:  368 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_surface_reelle_bati seems to be converged for  4  times.It will be continually optimized in the next  1  times
##     Ieration  7 the differential change value of RSS (dCVR) is:  0.001383194 
## ----------End of    Iteration  7 ----------
##     Iteration  8 :
## Now select an optimum bandwidth for the variable:  Intercept 
## The newly selected bandwidth for variable  Intercept is:  441 
## The bandwidth used in the last ieration is:  441 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  Intercept seems to be converged for  3  times.It will be continually optimized in the next  2  times
## Now select an optimum bandwidth for the variable:  log_dist_litt 
## The newly selected bandwidth for variable  log_dist_litt is:  77 
## The bandwidth used in the last ieration is:  77 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_dist_litt seems to be converged and will be kept the same in the following ierations.
## Now select an optimum bandwidth for the variable:  log_surface_reelle_bati 
## The newly selected bandwidth for variable  log_surface_reelle_bati is:  368 
## The bandwidth used in the last ieration is:  368 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  log_surface_reelle_bati seems to be converged and will be kept the same in the following ierations.
##     Ieration  8 the differential change value of RSS (dCVR) is:  0.0006396762 
## ----------End of    Iteration  8 ----------
##     Iteration  9 :
## Now select an optimum bandwidth for the variable:  Intercept 
## The newly selected bandwidth for variable  Intercept is:  441 
## The bandwidth used in the last ieration is:  441 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  Intercept seems to be converged for  4  times.It will be continually optimized in the next  1  times
##     Ieration  9 the differential change value of RSS (dCVR) is:  0.0002963825 
## ----------End of    Iteration  9 ----------
##     Iteration  10 :
## Now select an optimum bandwidth for the variable:  Intercept 
## The newly selected bandwidth for variable  Intercept is:  441 
## The bandwidth used in the last ieration is:  441 and the difference between these two bandwidths is:  0 
## The bandwidth for variable  Intercept seems to be converged and will be kept the same in the following ierations.
##     Ieration  10 the differential change value of RSS (dCVR) is:  0.0001375759 
## ----------End of    Iteration  10 ----------
##     Iteration  11 :
##     Ieration  11 the differential change value of RSS (dCVR) is:  6.401949e-05 
## ----------End of    Iteration  11 ----------
##     Iteration  12 :
##     Ieration  12 the differential change value of RSS (dCVR) is:  2.986701e-05 
## ----------End of    Iteration  12 ----------
##     Iteration  13 :
##     Ieration  13 the differential change value of RSS (dCVR) is:  1.396718e-05 
## ----------End of    Iteration  13 ----------
##     Iteration  14 :
##     Ieration  14 the differential change value of RSS (dCVR) is:  6.546073e-06 
## ----------End of    Iteration  14 ----------
mgwr.bw  <- round(MGWR[[2]]$bws,1) # Nombre de voisins pour chaque prédicteur
mgwr.bw
## [1] 441  77 368  23

 

# Exploration des résultats statistiques
print(MGWR)
##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2021-06-29 17:29:15 
##    Call:
##    gwr.multiscale(formula = log_valeur_fonciere ~ log_dist_litt + 
##     log_surface_reelle_bati + log_surface_terrain, data = dataSp, 
##     kernel = "exponential", adaptive = TRUE, bws0 = rep(1, 4), 
##     predictor.centered = rep(T, 3))
## 
##    Dependent (y) variable:  log_valeur_fonciere
##    Independent variables:  log_dist_litt log_surface_reelle_bati log_surface_terrain
##    Number of data points: 446
##    ***********************************************************************
##    *             GWR with Parameter-Specific Distance Metrics            *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: exponential 
##    Adaptive bandwidths for each coefficient(number of nearest neighbours): 
##               (Intercept) log_dist_litt log_surface_reelle_bati
##    Bandwidth          441            77                     368
##               log_surface_terrain
##    Bandwidth                   23
## 
##    ****************Summary of GWR coefficient estimates:******************
##                                 Min.   1st Qu.    Median   3rd Qu.    Max.
##    Intercept                8.201685  9.253211  9.481919  9.940775 10.9172
##    log_dist_litt           -0.176545 -0.093417 -0.060333 -0.049206 -0.0229
##    log_surface_reelle_bati  0.456252  0.461429  0.465485  0.474299  0.4794
##    log_surface_terrain     -0.042735  0.154269  0.225274  0.264967  0.3943
##    ************************Diagnostic information*************************
##    Residual sum of squares:  65.78948 
##    R-square value:  0.5219106 
##    Adjusted R-square value:  0.4833002 
##    AICc value:  462.1182 
## 
##    ***********************************************************************
##    Program stops at: 2021-06-29 17:31:00

 

On constate que si l’effet de la surface bâtie sur les prix agit de façon assez globale à l’échelle de l’île, les deux autres prédicteurs agissent de manière beaucoup plus locales. Ainsi par exemple, l’effet de la distance au littoral relève d’un processus très localisé.

Cartographie des résultats

a <- mapGWR(MGWR$SDF, var = "log_dist_litt",var_TV = MGWR$SDF$log_dist_litt_TV,
       main.title = "Distance au littoral (bw = 77)")
b <- mapGWR(MGWR$SDF, var = "log_surface_reelle_bati",var_TV = MGWR$SDF$log_surface_reelle_bati_TV,
       main.title = "Surface bâtie (bw = 368)")
c <- mapGWR(MGWR$SDF, var = "log_surface_terrain",var_TV = MGWR$SDF$log_surface_terrain_TV,
       main.title = "Surface terrain (bw = 23)")

tmap_arrange(a,b,c)


 

Régionalisation des sous-marchés immobiliers

L’objectif est maintenant de délimiter des sous-marchés immobiliers, sur la base des coefficients de la GWR. Ainsi, on recherchera l’homogénéité des prédicteurs dans chaque sous-marché.

Ce processus de découpage de l’espace en sous-régions homogènes se nomme la régionalisation. C’est une extension de la classification classique : on y ajoute un critère de contiguité spatiale. La régionalisation est donc une classification spatiale.

Il existe plusieurs méthodes de régionalisation. Un des principes les plus répandus consiste à établir une classification à la fois sur la base de la ressemblance entre les observations, et sur leur proximité dans l’espace géographique.

Nous allons ici utiliser l’algorithme SKATER (Spatial Klustering Analysis by Tree Edge Removal), méthode proposée par Assunçao et al. (2006) et déjà appliqué dans un contexte de recherche similaire au notre par Helbich et al. (2013). Par ailleurs, une description très pédagogique de la méthode est disponible ici : http://www.jms-insee.fr/2018/S08_5_ACTE_ROUSSEZ_JMS2018.pdf

L’algorithme SKATER comporte 4 étapes (cf. doc cité ci-dessus) :

  1. Constuction d’un graphe de voisinage (contiguité ou knn)
  2. Pondération des liens du graphe à partir de la matrice de dissimilarité
  3. Construction de l’arbre portant minimal, en retenant le lien avec le voisin le plus ressemblant pour chaque noeud
  4. Elagage de l’arbre maximisant la variance inter-classes des sous-graphes

 

Première étape : préparation de la table des coefficients GWR

# Centrage-réduction pour rendre les coefficients comparables
gwrB.scaled <- GWR.exp$SDF %>% 
  as.data.frame() %>% 
  select(1:4) %>% 
  mutate_all(~scale(.)) %>% 
  rename_with(~paste(.x, "b", sep = "_"))

Deuxième étape : computation de l’algorithme SKATER

Définition du voisinage de chaque bien

knn <- knearneigh(GWR.exp$SDF, k=50)
nb <- knn2nb(knn)
plot(nb, coords = coordinates(GWR.exp$SDF), col="blue")

Calibrage du coût des arêtes et de la pondération spatiale

costs <- nbcosts(nb, data = gwrB.scaled)
costsW <- nb2listw(nb, costs, style="B")

Minimisation de l’arbre et classification

costsTree <- mstree(costsW)
plot(costsTree, coords = coordinates(GWR.exp$SDF), col="blue", main = "Arbre portant minimal")

clus6 <- skater(edges = costsTree[,1:2], data = gwrB.scaled, ncuts = 5)

### Troisième étape : analyse des résultats

#### Cartographie des clusters

dataClus <- dataSf %>% 
mutate(clus = as.factor(clus6$groups)) %>% 
bind_cols(gwrB.scaled)

tmap_mode(mode = "view")
tm_shape(dataClus) + 
tm_symbols(col="clus", size=.8, palette = "Set1") +
tm_layout(title = "Classification en 6 groupes") 

Caractérisation des clusters

nomVar <- c("log_dist_litt_b","log_surface_reelle_bati_b","log_surface_terrain_b","Intercept_b")

clusProfile <- dataClus[, c(nomVar, "clus")] %>% 
  group_by(clus) %>% 
  summarise_each(funs(mean)) %>% 
  st_drop_geometry()

clusLong <- reshape2::melt(clusProfile, id.vars = "clus")

profilePlot <- ggplot(clusLong) +
  geom_bar(aes(x = variable, y = value), 
           fill = "grey25", 
           position = "identity", 
           stat = "identity") +
  scale_x_discrete("Effet") + 
  scale_y_continuous("Valeur moyenne par classe") +
  facet_wrap(~ clus) + 
  coord_flip() + 
  theme(strip.background = element_rect(fill="grey25"),
        strip.text = element_text(colour = "grey85", face = "bold"))

profilePlot


 

Bibliographie


 

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 10 (buster)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] spdep_1.1-8       GWmodel_2.2-6     spatialreg_1.1-8  Matrix_1.3-3     
##  [5] spData_0.3.10     Rcpp_1.0.6        robustbase_0.93-8 maptools_1.1-1   
##  [9] sp_1.4-5          GGally_2.1.2      gtsummary_1.4.1   plotly_4.9.4.1   
## [13] tmap_3.3-2        sf_1.0-1          forcats_0.5.1     stringr_1.4.0    
## [17] dplyr_1.0.7       purrr_0.3.4       readr_1.4.0       tidyr_1.1.3      
## [21] tibble_3.1.2      ggplot2_3.3.4     tidyverse_1.3.1  
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1            backports_1.2.1         lwgeom_0.2-6           
##   [4] plyr_1.8.6              lazyeval_0.2.2          splines_4.1.0          
##   [7] crosstalk_1.1.1         leaflet_2.0.4.1         digest_0.6.27          
##  [10] htmltools_0.5.1.1       gdata_2.18.0            leaflet.providers_1.9.0
##  [13] fansi_0.5.0             checkmate_2.0.0         magrittr_2.0.1         
##  [16] openxlsx_4.2.4          modelr_0.1.8            gmodels_2.18.1         
##  [19] xts_0.12.1              colorspace_2.0-2        rvest_1.0.0            
##  [22] rgdal_1.5-23            haven_2.4.1             xfun_0.24              
##  [25] leafem_0.1.6            crayon_1.4.1            jsonlite_1.7.2         
##  [28] survival_3.2-11         zoo_1.8-9               glue_1.4.2             
##  [31] stars_0.5-3             gtable_0.3.0            car_3.0-10             
##  [34] DEoptimR_1.0-9          abind_1.4-5             scales_1.1.1           
##  [37] DBI_1.1.1               viridisLite_0.4.0       units_0.7-2            
##  [40] foreign_0.8-81          proxy_0.4-26            intervals_0.15.2       
##  [43] htmlwidgets_1.5.3       httr_1.4.2              FNN_1.1.3              
##  [46] RColorBrewer_1.1-2      wk_0.4.1                ellipsis_0.3.2         
##  [49] pkgconfig_2.0.3         reshape_0.8.8           XML_3.99-0.6           
##  [52] farver_2.1.0            sass_0.4.0              dbplyr_2.1.1           
##  [55] deldir_0.2-10           utf8_1.2.1              reshape2_1.4.4         
##  [58] tidyselect_1.1.1        labeling_0.4.2          rlang_0.4.11           
##  [61] tmaptools_3.1-1         munsell_0.5.0           cellranger_1.1.0       
##  [64] tools_4.1.0             cli_2.5.0               generics_0.1.0         
##  [67] broom_0.7.8             evaluate_0.14           yaml_2.2.1             
##  [70] leafsync_0.1.0          knitr_1.33              fs_1.5.0               
##  [73] zip_2.2.0               s2_1.0.6                nlme_3.1-152           
##  [76] xml2_1.3.2              compiler_4.1.0          rstudioapi_0.13        
##  [79] curl_4.3.2              png_0.1-7               e1071_1.7-7            
##  [82] gt_0.3.0                reprex_2.0.0            spacetime_1.2-5        
##  [85] broom.helpers_1.3.0     bslib_0.2.5.1           stringi_1.6.2          
##  [88] highr_0.9               lattice_0.20-44         commonmark_1.7         
##  [91] classInt_0.4-3          vctrs_0.3.8             pillar_1.6.1           
##  [94] LearnBayes_2.15.1       lifecycle_1.0.0         jquerylib_0.1.4        
##  [97] data.table_1.14.0       raster_3.4-13           R6_2.5.0               
## [100] rio_0.5.27              KernSmooth_2.23-20      spDataLarge_0.5.4      
## [103] codetools_0.2-18        dichromat_2.0-0         boot_1.3-28            
## [106] MASS_7.3-54             gtools_3.9.2            assertthat_0.2.1       
## [109] withr_2.4.2             mgcv_1.8-35             expm_0.999-6           
## [112] parallel_4.1.0          hms_1.1.0               labelled_2.8.0         
## [115] grid_4.1.0              coda_0.19-4             class_7.3-19           
## [118] rmarkdown_2.9           carData_3.0-4           lubridate_1.7.10       
## [121] base64enc_0.1-3