Despite their many advantages, accessing nature and green spaces is getting increasingly difficult in highly urbanized areas. Some fear that underserved communities are more exposed to these issues. Here, I propose a data-driven way to explore this.
In particular, I pose an urban development question that has lately been gaining interest across professional circles and local governments — now as green equality. This concept refers to the disparities in people accessing green spaces in different parts of a particular city. Here, I explore its financial dimension and see if there are any clear relationships between the available green area per capita and the economic level of that same urban unit.
I will explore two different spatial resolutions of the city — districts and census districts using Esri Shapefiles provided by the Austrian Government’s Open Data Portal. I will also incorporate tabular statistical data (population and income) into the georeferenced administrative areas. Then, I overlay the administrative areas with an official green area dataset, recording the location of each green space in a geospatial format. Then, I combine this information and quantify each urban district’s total green space per capita size. Finally, I relate each area’s financial status, captured by annual net income, to the green area per capita ratio to see if any patterns emerge.
Let’s take a look at the Austrian government’s Open Data Portal here.
When I was writing this article, the website’s English translation wasn’t really working, so instead of relying on my long-forgotten 12 years of German classes, I used DeepL to navigate across the subpages and thousands of datasets.
Then, I collected a couple of data files, both georeferenced (Esri shapefiles) and simple tabular data, which I will use for the later analysis. The data I collected:
Boundaries — the administrative boundaries of the following spatial units in Vienna:
Land-use — information about the location of green spaces and built-in areas:
Statistics — data on population and income corresponding to the socio-economical level of an area:
- Population per district, annually recorded since 2002, and stored split based on 5-year age groups, gender, and original nationality
- Population per census district, annually recorded since 2008 and stored split based on three irregular age groups, gender, and origin
- Average net income since 2002 in the districts of Vienna, expressed in Euros per employee per annum
Additionally, I stored the downloaded data files in a local folder called data.
2.1 Administrative boundaries
First, read and visualize the different shape files containing each administrative boundary level to have a closer grip on the city at hand:
folder = 'data'
admin_city = gpd.read_file(folder + '/LANDESGRENZEOGD')
admin_district = gpd.read_file(folder + '/BEZIRKSGRENZEOGD')
admin_census = gpd.read_file(folder + '/ZAEHLBEZIRKOGD')display(admin_city.head(1))
display(admin_district.head(1))
display(admin_census.head(1))
Here we make a note that the column names BEZNR and ZBEZ, correspond to the District ID and the Census district ID, respectively. Unexpectedly, they are stored/parsed in different formats, numpy.float64 and str:
print(type(admin_district.BEZNR.iloc[0]))
print(type(admin_census.ZBEZ.iloc[0]))pyth
Making sure we indeed have 23 districts and 250 census districts as the data files documentation claimed:
print(len(set(admin_district.BEZNR)))
print(len(set(admin_census.ZBEZ)))
Now visualize the boundaries — first the city, then its districts, and then the even smaller census districts.
f, ax = plt.subplots(1,3,figsize=(15,5))admin_city.plot(ax=ax[0],
edgecolor = 'k',
linewidth = 0.5,
alpha = 0.9,
cmap = 'Reds')
admin_district.plot(ax=ax[1],
edgecolor = 'k',
linewidth = 0.5,
alpha = 0.9,
cmap = 'Blues')
admin_census.plot(ax=ax[2],
edgecolor = 'k',
linewidth = 0.5,
alpha = 0.9,
cmap = 'Purples')
ax[0].set_title('City boundaries')
ax[1].set_title('District boundaries')
ax[2].set_title('Census distrcit boundaries')
This code outputs the following visuals of Vienna:
2.2 Green areas
Now, also take a look at the green space distribution:
gdf_green = gpd.read_file(folder + '/GRUENFREIFLOGD_GRUENGEWOGD')
display(gdf_green.head(3))
Here, one may notice that there is no direct way to link green areas (e.g., no district id-s added) to neighborhoods — so later on, we will do so by manipulating the geometries to find overlaps.
Now visualize this:
f, ax = plt.subplots(1,1,figsize=(7,5))gdf_green.plot(ax=ax,
edgecolor = 'k',
linewidth = 0.5,
alpha = 0.9,
cmap = 'Greens')
ax.set_title('Green areas in Vienna')
This code shows where the green areas are within Vienna:
We may note that forestry segments are still within the admin boundary, implying that not every part of the city is urbanized and significantly populated. Later on, we will get back to this when evaluating the per-capital green area.
2.3 Statistical data — population, income
Finally, let’s take a look at the statistical data files. The first major difference is that these are not georeferenced but simple csv tables:
df_pop_distr = pd.read_csv('vie-bez-pop-sex-age5-stk-ori-geo4-2002f.csv',
sep = ';',
encoding='unicode_escape',
skiprows = 1)df_pop_cens = pd.read_csv('vie-zbz-pop-sex-agr3-stk-ori-geo2-2008f.csv',
sep = ';',
encoding='unicode_escape',
skiprows = 1)
df_inc_distr = pd.read_csv('vie-bez-biz-ecn-inc-sex-2002f.csv',
sep = ';',
encoding='unicode_escape',
skiprows = 1)
display(df_pop_distr.head(1))
display(df_pop_cens.head(1))
display(df_inc_distr.head(1))
3.1. Preparing the statistical data files
The previous subsection shows that the statistical data tables use different naming conventions — they have DISTRICT_CODE and SUB_DISTRICT_CODE identifiers instead of things like BEZNR and ZBEZ. However, after reading each data set’s documentation, it becomes clear that it’s easy to transform from one to another, for which I present two short functions in the next cell. I will simultaneously process data on the level of districts and census districts.
Additionally, I will only be interested in the (latest) aggregated values and data points of the statistical information, such as the total population at the newest snapshot. So, let’s clean up these data files and keep the columns I will use later.
# these functions convert the district and census district ids to be compatbile with the ones found in the shapefiles
def transform_district_id(x):
return int(str(x)[1:3])def transform_census_district_id(x):
return int(str(x)[1:5])
# select the latest year of the data set
df_pop_distr_2 = df_pop_distr[df_pop_distr.REF_YEAR \
==max(df_pop_distr.REF_YEAR)]
df_pop_cens_2 = df_pop_cens[df_pop_cens.REF_YEAR \
==max(df_pop_cens.REF_YEAR)]
df_inc_distr_2 = df_inc_distr[df_inc_distr.REF_YEAR \
==max(df_inc_distr.REF_YEAR)]
# convert district ids
df_pop_distr_2['district_id'] = \
df_pop_distr_2.DISTRICT_CODE.apply(transform_district_id)
df_pop_cens_2['census_district_id'] = \
df_pop_cens_2.SUB_DISTRICT_CODE.apply(transform_census_district_id)
df_inc_distr_2['district_id'] = \
df_inc_distr_2.DISTRICT_CODE.apply(transform_district_id)
# aggregate population values
df_pop_distr_2 = df_pop_distr_2.groupby(by = 'district_id').sum()
df_pop_distr_2['district_population'] = df_pop_distr_2.AUT + \
df_pop_distr_2.EEA + df_pop_distr_2.REU + df_pop_distr_2.TCN
df_pop_distr_2 = df_pop_distr_2[['district_population']]
df_pop_cens_2 = df_pop_cens_2.groupby(by = 'census_district_id').sum()
df_pop_cens_2['census_district_population'] = df_pop_cens_2.AUT \
+ df_pop_cens_2.FOR
df_pop_cens_2 = df_pop_cens_2[['census_district_population']]
df_inc_distr_2['district_average_income'] = \
1000*df_inc_distr_2[['INC_TOT_VALUE']]
df_inc_distr_2 = \
df_inc_distr_2.set_index('district_id')[['district_average_income']]
# display the finalized tables
display(df_pop_distr_2.head(3))
display(df_pop_cens_2.head(3))
display(df_inc_distr_2.head(3))
# and unifying the naming conventions
admin_district['district_id'] = admin_district.BEZNR.astype(int)
admin_census['census_district_id'] = admin_census.ZBEZ.astype(int)
print(len(set(admin_census.ZBEZ)))
Double-check the computed total population values at the two levels of aggregations:
print(sum(df_pop_distr_2.district_population))
print(sum(df_pop_cens_2.census_district_population))
These two should both provide the same result — 1931593 people.
3.1. Preparing the geospatial data files
Now that we are done with the essential data preparation of the statistical files, it’s time to match the green area polygons to the administrative area polygons. Then, let’s compute each admin area’s total green area coverage. Additionally, I will add each admin area’s relative green area coverage out of curiosity.
To obtain areas expressed in SI units, we need to switch to a so-called local CRS, which in the case of Vienna is EPSG:31282. You more read more on this topic, map projection and coordinate reference systems here and here.
# converting all GeoDataFrames into the loca crs
admin_district_2 = \
admin_district[['district_id', 'geometry']].to_crs(31282)admin_census_2 = \
admin_census[['census_district_id', 'geometry']].to_crs(31282)
gdf_green_2 = gdf_green.to_crs(31282)
Compute the administrative unit’s area measured in SI units:
admin_district_2['admin_area'] = \
admin_district_2.geometry.apply(lambda g: g.area)admin_census_2['admin_area'] = \
admin_census_2.geometry.apply(lambda g: g.area)
display(admin_district_2.head(1))
display(admin_census_2.head(1))
4.1 Compute the green area coverage in each administrative unit
I will use GeoPandas’ overlay function to overlay these two administrative boundary GeoDataFrames with the GeoDataFrame containing the green area polygons. Then, I compute the area of each green area section falling into different administrative regions. Next, I sum up these areas to the level of each admin area, both districts and census districts. In the final step, at each resolution unit, I add the administrative previously computed official unit areas and calculate the total area to green area ratio for each district and census district.
gdf_green_mapped_distr = gpd.overlay(gdf_green_2, admin_district_2)gdf_green_mapped_distr['green_area'] = \
gdf_green_mapped_distr.geometry.apply(lambda g: g.area)
gdf_green_mapped_distr = \
gdf_green_mapped_distr.groupby(by = 'district_id').sum()[['green_area']]
gdf_green_mapped_distr = \
gpd.GeoDataFrame(admin_district_2.merge(gdf_green_mapped_distr, left_on = 'district_id', right_index = True))
gdf_green_mapped_distr['green_ratio'] = \
gdf_green_mapped_distr.green_area / gdf_green_mapped_distr.admin_area
gdf_green_mapped_distr.head(3)
gdf_green_mapped_cens = gpd.overlay(gdf_green_2, admin_census_2)
gdf_green_mapped_cens['green_area'] = \
gdf_green_mapped_cens.geometry.apply(lambda g: g.area)gdf_green_mapped_cens = \
gdf_green_mapped_cens.groupby(by = 'census_district_id').sum()[['green_area']]
gdf_green_mapped_cens = \
gpd.GeoDataFrame(admin_census_2.merge(gdf_green_mapped_cens, left_on = 'census_district_id', right_index = True))
gdf_green_mapped_cens['green_ratio'] = gdf_green_mapped_cens.green_area / gdf_green_mapped_cens.admin_area
gdf_green_mapped_cens.head(3)
Finally, visualize the green ratio per district and census district! The results seem to make a lot of sense, with a high level of greenery on the outer parts and much lower in the central areas. Also, the 250 census districts clearly show a more detailed, fine-grained picture of the different neighborhood’s characteristics, offering more profound and more localized insights for urban planners. On the other hand, the district-level information, with ten times fewer spatial units, instead shows grand averages.
f, ax = plt.subplots(1,2,figsize=(17,5))gdf_green_mapped_distr.plot(ax = ax[0],
column = 'green_ratio',
edgecolor = 'k',
linewidth = 0.5,
alpha = 0.9,
legend = True,
cmap = 'Greens')
gdf_green_mapped_cens.plot(ax = ax[1],
column = 'green_ratio',
edgecolor = 'k',
linewidth = 0.5,
alpha = 0.9,
legend = True,
cmap = 'Greens')
This block of code outputs the following maps:
4.2 Add population and income information for each administrative unit
In the final step of this section, let’s map the statistical data into administrative areas. Reminder: We have population data on both the level of districts and the level of census districts. However, I could only find income (socioeconomic level indicator) on the level of districts. This is a usual trade-off in geospatial data science. While one dimension (greenery) is much more insightful at the higher resolution (census districts), data constraints may force us to use the lower resolution anyway.
display(admin_census_2.head(2))
display(df_pop_cens_2.head(2))
gdf_pop_mapped_distr = admin_district_2.merge(df_pop_distr_2, \
left_on = 'district_id', right_index = True)gdf_pop_mapped_cens = admin_census_2.merge(df_pop_cens_2, \
left_on = 'census_district_id', right_index = True)
gdf_inc_mapped_distr = admin_district_2.merge(df_inc_distr_2, \
left_on = 'district_id', right_index = True)
f, ax = plt.subplots(1,3,figsize=(15,5))
gdf_pop_mapped_distr.plot(column = 'district_population', ax=ax[0], \
edgecolor = 'k', linewidth = 0.5, alpha = 0.9, cmap = 'Blues')
gdf_pop_mapped_cens.plot(column = 'census_district_population', ax=ax[1], \
edgecolor = 'k', linewidth = 0.5, alpha = 0.9, cmap = 'Blues')
gdf_inc_mapped_distr.plot(column = 'district_average_income', ax=ax[2], \
edgecolor = 'k', linewidth = 0.5, alpha = 0.9, cmap = 'Purples')
ax[0].set_title('district_population')
ax[1].set_title('census_district_population')
ax[2].set_title('district_average_incomee')
This block of codes results in the following figure:
4.3. Green area-per-capita computation
Let’s sum up what we have now, all integrated into decent shapefiles corresponding to the districts and census districts of Vienna:
On the level of districts, we have green area ratio, population and income data
On the level of census districts, we have a green area ratio and population data
To capture green equality simply, I merge the information on the green area’s absolute size and the population in districts and census districts and compute the total amount of green area per capita.
Let’s take a look at our input — green coverage and population:
# a plot for the disticts
f, ax = plt.subplots(1,2,figsize=(10,5))gdf_green_mapped_distr.plot(
ax = ax[0],
column = 'green_ratio',
edgecolor = 'k',
linewidth = 0.5,
alpha = 0.9,
cmap = 'Greens')
gdf_pop_mapped_distr.plot(
ax = ax[1],
column = 'district_population',
edgecolor = 'k',
linewidth = 0.5,
alpha = 0.9,
cmap = 'Reds')
ax[0].set_title('green_ratio')
ax[1].set_title('district_population')
# a plot for the census disticts
f, ax = plt.subplots(1,2,figsize=(10,5))
gdf_green_mapped_cens.plot(
ax = ax[0],
column = 'green_ratio',
edgecolor = 'k',
linewidth = 0.5,
alpha = 0.9,
cmap = 'Greens')
gdf_pop_mapped_cens.plot(
ax = ax[1],
column = 'census_district_population',
edgecolor = 'k',
linewidth = 0.5,
alpha = 0.9,
cmap = 'Reds')
ax[0].set_title('green_ratio')
ax[1].set_title('district_population')
This block of codes results in the following figure:
To compute the green area per capita, I will first merge the greenery and population data frames in the following steps. I will do so via the example of census districts because its higher spatial resolution allows us to observe better patterns (if any) emerging. Make sure we do not divide by zero and also follow common sense; let’s drop those areas that are unpopulated.
gdf_green_pop_cens = \
gdf_green_mapped_cens.merge(gdf_pop_mapped_cens.drop( \
columns = ['geometry', 'admin_area']), left_on = 'census_district_id',\
right_on = 'census_district_id')[['census_district_id', \
'green_area', 'census_district_population', 'geometry']]gdf_green_pop_cens['green_area_per_capita'] = \
gdf_green_pop_cens['green_area'] / \
gdf_green_pop_cens['census_district_population']
gdf_green_pop_cens = \
gdf_green_pop_cens[gdf_green_pop_cens['census_district_population']>0]
f, ax = plt.subplots(1,1,figsize=(10,7))
gdf_green_pop_cens.plot(
column = 'green_area_per_capita',
ax=ax,
cmap = 'RdYlGn',
edgecolor = 'k',
linewidth = 0.5)
admin_district.to_crs(31282).plot(\
ax=ax, color = 'none', edgecolor = 'k', linewidth = 2.5)
This block of codes results in the following figure:
Let’s tweak the visualization a little:
f, ax = plt.subplots(1,1,figsize=(11,7))ax.set_title('Per-capita green area in\nthe census districts of Vienna',
fontsize = 18, pad = 30)
gdf_green_pop_cens.plot(
column = 'green_area_per_capita',
ax=ax,
cmap = 'RdYlGn',
edgecolor = 'k',
linewidth = 0.5,
legend=True,
norm=matplotlib.colors.LogNorm(\
vmin=gdf_green_pop_cens.green_area_per_capita.min(), \
vmax=gdf_green_pop_cens.green_area_per_capita.max()), )
admin_district.to_crs(31282).plot(
ax=ax, color = 'none', edgecolor = 'k', linewidth = 2.5)
This block of codes results in the following figure:
And the same for districts:
# compute the per-capita green area scores
gdf_green_pop_distr = \
gdf_green_mapped_distr.merge(gdf_pop_mapped_distr.drop(columns = \
['geometry', 'admin_area']), left_on = 'district_id', right_on = \
'district_id')[['district_id', 'green_area', 'district_population', \
'geometry']]gdf_green_popdistr = \
gdf_green_pop_distr[gdf_green_pop_distr.district_population>0]
gdf_green_pop_distr['green_area_per_capita'] = \
gdf_green_pop_distr['green_area'] / \
gdf_green_pop_distr['district_population']
# visualize the district-level map
f, ax = plt.subplots(1,1,figsize=(10,8))
ax.set_title('Per-capita green area in the districts of Vienna', \
fontsize = 18, pad = 26)
gdf_green_pop_distr.plot(column = 'green_area_per_capita', ax=ax, \
cmap = 'RdYlGn', edgecolor = 'k', linewidth = 0.5, legend=True, \
norm=matplotlib.colors.LogNorm(vmin=\
gdf_green_pop_cens.green_area_per_capita.min(), \
vmax=gdf_green_pop_cens.green_area_per_capita.max()), )
admin_city.to_crs(31282).plot(ax=ax, \
color = 'none', edgecolor = 'k', linewidth = 2.5)
This block of codes results in the following figure:
While the significant trends are clear — outer rim, more greenspace for everyone, built-in downtown, reversed. Still, these two plots, especially the more detailed one on the level of census districts, clearly show a variance in the amount of green space people enjoy in the different areas. Further research and incorporating additional data sources, for instance, on land use, could help explain better why those areas are higher in green area or population. For now, let’s enjoy this map and hope everybody finds the right amount of greenery in their home!
# merging the greenery, population and financial data
gdf_district_green_pip_inc = \
gdf_green_pop_distr.merge(gdf_inc_mapped_distr.drop(columns = \
['geometry']))
Visualize the relationship between the financial and the greenery dimensions:
f, ax = plt.subplots(1,1,figsize=(6,4))ax.plot(gdf_district_green_pip_inc.district_average_income, \
gdf_district_green_pip_inc.green_area_per_capita, 'o')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('district_average_income')
ax.set_ylabel('green_area_per_capita')
The result of this code block is the following scatter plot:
At first glance, the scatterplot doesn’t particularly set a strong case for the financials determining people’s access to green spaces. Honestly, I am a bit surprised by these results — however, in light of Vienna’s conscious, long-standing efforts in greening up their city, it may be why we do not see any major trend here. To confirm, I also checked the correlations between these two variables:
print(spearmanr(gdf_district_green_pip_inc.district_average_income, gdf_district_green_pip_inc.green_area_per_capita))print(pearsonr(gdf_district_green_pip_inc.district_average_income, gdf_district_green_pip_inc.green_area_per_capita))
Due to the heavy-tailed distribution of the financial data, I would take the Spearman (0.13) correlation more seriously here, but even the Pearson correlation (0.30) implies a relatively weak trend, aligning with my previous observations.