dd-l3-lagos-overture-buildings
Map (reference ← swipe → agent)
0:00
Updating the flood-risk model before rainy season. Need every building footprint exceeding 1000 m² across Lagos State (Nigeria) from Overture.
Also need a per-LGA roll-up: for each Lagos Local Government Area, the building count, total footprint area, count of buildings with a non-null Overture height, and the median height where available (null otherwise).
Write two output files. `lagos_buildings.geoparquet` should be EPSG:4326 GeoParquet with columns `id`, `height`, `footprint_area_m2`, `lga`, and `geometry`. `lagos_building_summary.parquet` should be a plain Parquet with no geometry column and columns `lga`, `n_buildings`, `total_footprint_m2`, `n_with_height`, and `p50_height_m`.
0:00
System info
{
"adapter": "openrouter",
"model": "deepseek/deepseek-v4-flash",
"base_url": "https://openrouter.ai/api/v1"
}
0:16
Let me break down the task: 1. Get building footprints > 1000 m² across Lagos State, Nigeria from Overture Maps 2. Create two output files: - `lagos_buildings.geoparquet` (EPSG:4326) with columns: id, height, footprint_area_m2, lga, geometry - `lagos_building_summary.parquet` (plain Parquet) with columns: lga, n_buildings, total_footprint_m2, n_with_height, p50_height_m Let me start by understanding what Lagos State looks like and use Overture Maps. First, let me check if there are any input files in the working directory, and then write the script.
0:16
0:16
Bash
ls /work/
(no output)
0:16
Bash
pip list 2>/dev/null | grep -iE "overture|geopandas|pyarrow|duckdb|shapely"
duckdb 1.5.3 geopandas 1.1.3 overturemaps 1.0.0 pyarrow 24.0.0 shapely 2.1.2
0:32
Good — all needed libraries are available. The working directory is empty (no input files). I need to: 1. Use Overture Maps to get all buildings > 1000 m² in Lagos State, Nigeria. 2. Get the LGA boundaries for Lagos State (Local Government Areas). 3. Compute building footprint areas, join to LGAs, and filter. 4. Aggregate per LGA. Let me write a script. I'll use `overturemaps` Python library to download buildings, and I'll need LGA boundaries. For LGA boundaries, I could use Overture's divisions (administrative boundaries), or I could look for a shapefile of Nigerian LGAs. Let me think about the best approach. Overture has a `divisions` type that includes administrative boundaries. The `subdivision` type includes regions, and `locality` includes cities/towns. For LGAs (which are admin level 3 in Nigeria), Overture might have them under `divisions` with `subtype: 'admin_region'` or similar. Actually, let me check Overture's division types. In Overture, administrative divisions are: - `continent`, `nation`, `region` (state/province), `county` (equivalent to LGA in some countries), `locality` (city/town), `macroregion`, `macrocounty` In Nigeria, the hierarchy is: Nation → State → Local Government Area (LGA). Overture has `region` for states and `county` for LGAs (or `locality`). Let me check by querying. Alternatively, I can use the `geopandas.datasets` or HDX or Natural Earth for LGA boundaries... but actually, I think Overture's divisions with `subtype = 'county'` might be LGAs in Nigeria. Let me try a DuckDB approach or the overturemaps library. Actually, let me use a simpler approach. Overturemaps has a Python client that can query by bounding box and type. I'll: 1. Find Lagos State's geometry first (using Overture divisions with `subtype='region'` and name='Lagos'). 2. Use that to get all buildings in Lagos. 3. For LGAs, I can also get divisions with `subtype='county'` or `subtype='locality'` that are within Lagos. 4. Spatial join buildings to LGAs. Let me write the script step by step. Let me first check what division subtypes are available for Nigeria/Overture.
0:32
Bash
python3 -c "
python3 -c "
import overturemaps
# Check what division types exist
print('Available types:', overturemaps.available_types())
" 2>&1 | head -20
Traceback (most recent call last):
File "<string>", line 4, in <module>
print('Available types:', overturemaps.available_types())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'overturemaps' has no attribute 'available_types'
0:40
Let me check the overturemaps API properly.
0:40
Bash
python3 -c "
python3 -c " import overturemaps print(dir(overturemaps)) "
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'core', 'geodataframe', 'get_all_overture_types', 'models', 'record_batch_reader']
0:41
Bash
python3 -c "
python3 -c " from overturemaps import core print(dir(core)) " 2>&1 | head -30
['ALL_RELEASES', 'BBox', 'GeoDataFrame', 'HAS_GEOPANDAS', 'List', 'Optional', 'STAC_CATALOG_URL', 'Tuple', '_ReleasesProxy', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_binary_search_manifest', '_cached_stac_catalog', '_coerce_bbox', '_dataset_path', '_get_all_releases', '_get_files_from_stac', '_get_stac_catalog', '_prepare_query', '_record_batch_reader_from_dataset', 'count_rows', 'ds', 'fs', 'geoarrow_schema_adapter', 'geodataframe', 'get_all_overture_types', 'get_available_releases', 'get_latest_release', 'io', 'json', 'pa', 'pc', 'pq', 'query_gers_registry', 'record_batch_reader', 'record_batch_reader_from_gers', 'sys', 'type_theme_map', 'urlopen']
0:54
Let me look at how to use overturemaps to query buildings data for Lagos. I'll use the `record_batch_reader` or `geodataframe` functions with a bounding box and type. I need to figure out the bounding box for Lagos State, Nigeria. Let me look up approximate coordinates. Lagos State, Nigeria is roughly: - Latitude: ~6.2°N to ~6.7°N - Longitude: ~2.7°E to ~4.5°E Let me get the Lagos LGA boundaries. Actually, I need to also get the LGA boundaries. I can either: 1. Query Overture divisions for Lagos LGA boundaries 2. Use a known boundaries file Let me first try to get Lagos LGA boundaries from Overture divisions, then use those to clip buildings. Actually, the approach should be: 1. Get all buildings in Lagos State area using Overture 2. Get LGA boundaries for Lagos from Overture divisions 3. Intersect buildings with LGAs 4. Filter buildings > 1000 m² 5. Compute area and output Let me start by getting the extent for Lagos.
0:54
Bash
python3 -c "
python3 -c "
from overturemaps.core import record_batch_reader, geodataframe
# Get the latest release
print('Latest release:', __import__('overturemaps').core.get_latest_release())
"
Latest release: 2026-05-20.0
1:13
1:13
Bash
python3 -c "
python3 -c "
from overturemaps.core import record_batch_reader, geodataframe
# Check what types are available
types = __import__('overturemaps').get_all_overture_types()
print(types)
"
['address', 'bathymetry', 'building', 'building_part', 'division', 'division_area', 'division_boundary', 'place', 'segment', 'connector', 'infrastructure', 'land', 'land_cover', 'land_use', 'water']
1:18
1:18
Bash
python3 -c "
python3 -c "
from overturemaps.core import record_batch_reader
import pyarrow as pa
# Get a small sample of building data to see schema
reader = record_batch_reader('building', release='2026-05-20.0',
bbox=(-180, -90, 180, 90), # global bbox, limit to 1 batch
limit=5)
for batch in reader:
print(batch.schema)
print(batch.to_pandas().columns.tolist())
break
"
Traceback (most recent call last):
File "<string>", line 6, in <module>
reader = record_batch_reader('building', release='2026-05-20.0',
bbox=(-180, -90, 180, 90), # global bbox, limit to 1 batch
limit=5)
TypeError: record_batch_reader() got an unexpected keyword argument 'limit'
1:22
Let me check the correct API for record_batch_reader and understand how to query buildings with a bounding box.
1:22
Bash
python3 -c "
python3 -c " from overturemaps.core import record_batch_reader import inspect sig = inspect.signature(record_batch_reader) print(sig) "
(overture_type, bbox=None, release=None, connect_timeout=None, request_timeout=None, stac=False) -> pyarrow.lib.RecordBatchReader | None
1:27
I can't use `limit` with `record_batch_reader`. Let me get the schema in another way - by reading just a few records.
1:27
Bash
python3 -c "
python3 -c "
from overturemaps.core import record_batch_reader
import pyarrow as pa
# Get a small sample of building data to see schema - use a very small bbox
reader = record_batch_reader('building', release='2026-05-20.0',
bbox=(-180, -90, 180, 90))
if reader is None:
print('None reader')
else:
for batch in reader:
print('Schema:', batch.schema)
print('Columns:', batch.to_pandas().columns.tolist())
print('First row:', batch.to_pandas().iloc[0].to_dict())
break
"
Schema: id: string
names: struct<primary: string, common: map<string, string ('common')>, rules: list<element: struct<variant: (... 159 chars omitted)
child 0, primary: string
child 1, common: map<string, string ('common')>
child 0, common: struct<key: string not null, value: string> not null
child 0, key: string not null
child 1, value: string
child 2, rules: list<element: struct<variant: string, language: string, perspectives: struct<mode: string, countries (... 87 chars omitted)
child 0, element: struct<variant: string, language: string, perspectives: struct<mode: string, countries: list<element (... 72 chars omitted)
child 0, variant: string
child 1, language: string
child 2, perspectives: struct<mode: string, countries: list<element: string>>
child 0, mode: string
child 1, countries: list<element: string>
child 0, element: string
child 3, value: string
child 4, between: list<element: double>
child 0, element: double
child 5, side: string
sources: list<element: struct<property: string, dataset: string, license: string, record_id: string, update_t (... 65 chars omitted)
child 0, element: struct<property: string, dataset: string, license: string, record_id: string, update_time: string, c (... 50 chars omitted)
child 0, property: string
child 1, dataset: string
child 2, license: string
child 3, record_id: string
child 4, update_time: string
child 5, confidence: double
child 6, between: list<element: double>
child 0, element: double
level: int32
height: double
min_height: double
is_underground: bool
num_floors: int32
num_floors_underground: int32
min_floor: int32
subtype: string
class: string
facade_color: string
facade_material: string
roof_material: string
roof_shape: string
roof_direction: double
roof_orientation: string
roof_color: string
roof_height: double
geometry: binary
has_parts: bool
version: int32
bbox: struct<xmin: double, xmax: double, ymin: double, ymax: double>
child 0, xmin: double
child 1, xmax: double
child 2, ymin: double
child 3, ymax: double
-- schema metadata --
geo: '{"version": "1.1.0", "primary_column": "geometry", "columns": {"geo' + 282
Columns: ['id', 'names', 'sources', 'level', 'height', 'min_height', 'is_underground', 'num_floors', 'num_floors_underground', 'min_floor', 'subtype', 'class', 'facade_color', 'facade_material', 'roof_material', 'roof_shape', 'roof_direction', 'roof_orientation', 'roof_color', 'roof_height', 'geometry', 'has_parts', 'version', 'bbox']
First row: {'id': '76c4a544-9ad3-4da0-be7d-22892dcbeb58', 'names': {'primary': "St. Paul's Burgers", 'common': None, 'rules': None}, 'sources': array([{'property': '', 'dataset': 'OpenStreetMap', 'license': 'ODbL-1.0', 'record_id': 'w1417351285@1', 'update_time': '2025-07-25T18:07:43.000Z', 'confidence': None, 'between': None}],
dtype=object), 'level': nan, 'height': nan, 'min_height': nan, 'is_underground': False, 'num_floors': nan, 'num_floors_underground': nan, 'min_floor': nan, 'subtype': 'religious', 'class': nan, 'facade_color': nan, 'facade_material': nan, 'roof_material': nan, 'roof_shape': nan, 'roof_direction': nan, 'roof_orientation': nan, 'roof_color': nan, 'roof_height': nan, 'geometry': b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x06\x00\x00\x00\x86k\xff\x14\xd8~f\xc0\x0e@\xb4\xa0A}S\xc0\x81\x0b\r\x1f\xec~f\xc0\xf1/\xdd\x7f?}S\xc0x"-:\xfe~f\xc0\x8f\xb6;\xb5D}S\xc0[\x93\x13\x92\xf1~f\xc09oR\x87K}S\xc0\xfao\x14\x0f\xde~f\xc0"a\xce"J}S\xc0\x86k\xff\x14\xd8~f\xc0\x0e@\xb4\xa0A}S\xc0', 'has_parts': False, 'version': 1, 'bbox': {'xmin': -179.96853637695312, 'xmax': -179.96385192871094, 'ymin': -77.9577407836914, 'ymax': -77.95699310302734}}
2:20
Good. The building data has `id`, `height`, `geometry`, and a `bbox` field. I need to constrain to Lagos State, Nigeria. Let me find the bounding box for Lagos State. I can use the `division` type from Overture to find the bbox of Lagos State.
2:20
2:20
Bash
python3 -c "
python3 -c "
from overturemaps.core import record_batch_reader, geodataframe
import pyarrow as pa
# Get Lagos State division area
# First let's query division_area for Nigeria
reader = record_batch_reader('division_area', release='2026-05-20.0',
bbox=(-180, -90, 180, 90))
if reader is None:
print('None reader')
else:
# Collect all divisions and find Lagos
import pyarrow as pa
for batch in reader:
df = batch.to_pandas()
# Filter for Nigeria
mask = df['country'] == 'NG'
df_ng = df[mask]
if len(df_ng) > 0:
print('Columns:', df_ng.columns.tolist())
print(df_ng[['id', 'country', 'region', 'subregion', 'locality', 'subtype']].head(10).to_string())
# Look for Lagos
lagos_mask = df_ng['region'].str.contains('Lagos|lagos', na=False) | df_ng['subregion'].str.contains('Lagos|lagos', na=False) | df_ng['locality'].str.contains('Lagos|lagos', na=False)
print()
print('Lagos matches:')
print(df_ng[lagos_mask][['id', 'region', 'subregion', 'locality', 'subtype']].to_string())
" 2>&1 | head -80
Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox']
Traceback (most recent call last):
File "<string>", line 22, in <module>
print(df_ng[['id', 'country', 'region', 'subregion', 'locality', 'subtype']].head(10).to_string())
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.14/site-packages/pandas/core/frame.py", line 4384, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.14/site-packages/pandas/core/indexes/base.py", line 6302, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.14/site-packages/pandas/core/indexes/base.py", line 6355, in _raise_if_missing
raise KeyError(f"{not_found} not in index")
KeyError: "['subregion', 'locality'] not in index"
4:04
The columns are: id, geometry, country, sources, subtype, admin_level, class, names, is_land, is_territorial, region, division_id, version, bbox Let me look for Lagos state in the region column.
4:04
Bash
python3 -c "
python3 -c "
from overturemaps.core import record_batch_reader
import pyarrow as pa
reader = record_batch_reader('division_area', release='2026-05-20.0',
bbox=(-180, -90, 180, 90))
for batch in reader:
df = batch.to_pandas()
# Filter for Nigeria
ng = df[df['country'] == 'NG']
if len(ng) > 0:
print('Columns:', ng.columns.tolist())
print()
# Look for Lagos in region
lagos = ng[ng['region'].str.contains('Lagos', na=False)]
print('Lagos region entries:')
print(lagos[['id', 'region', 'subtype', 'admin_level', 'class']].to_string())
print()
# Also check subtypes
print('Subtypes:', ng['subtype'].unique())
print()
print('Admin levels:', ng['admin_level'].unique())
" 2>&1 | head -60
Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox'] Lagos region entries: Empty DataFrame Columns: [id, region, subtype, admin_level, class] Index: [] Subtypes: <ArrowStringArray> ['county', 'locality'] Length: 2, dtype: str Admin levels: [ 2. nan] Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox'] Lagos region entries: Empty DataFrame Columns: [id, region, subtype, admin_level, class] Index: [] Subtypes: <ArrowStringArray> ['locality', 'county', 'region', 'macrohood', 'microhood', 'neighborhood'] Length: 6, dtype: str Admin levels: [nan 2. 1.] Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox'] Lagos region entries: Empty DataFrame Columns: [id, region, subtype, admin_level, class] Index: [] Subtypes: <ArrowStringArray> ['locality', 'county'] Length: 2, dtype: str Admin levels: [nan 2.] Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox'] Lagos region entries: Empty DataFrame Columns: [id, region, subtype, admin_level, class] Index: [] Subtypes: <ArrowStringArray> ['locality', 'county', 'region', 'country'] Length: 4, dtype: str Admin levels: [nan 2. 1. 0.] Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox'] Lagos region entries: Empty DataFrame Columns: [id, region, subtype, admin_level, class] Index: [] Subtypes: <ArrowStringArray> ['locality', 'county', 'region'] Length: 3, dtype: str Admin levels: [nan 2. 1.]
5:55
The division_area data is partitioned and the batches are iterating. Let me collect all Nigeria divisions more efficiently. Let me check if the 'region' column has 'Lagos' or if it's stored differently.
5:55
Bash
python3 -c "
python3 -c "
from overturemaps.core import record_batch_reader
import pyarrow as pa
reader = record_batch_reader('division_area', release='2026-05-20.0',
bbox=(-180, -90, 180, 90))
all_ng = []
for batch in reader:
df = batch.to_pandas()
ng = df[df['country'] == 'NG']
if len(ng) > 0:
all_ng.append(ng)
if all_ng:
import pandas as pd
df_all = pd.concat(all_ng, ignore_index=True)
print('Total NG divisions:', len(df_all))
print()
print('Region values sample:', df_all['region'].unique()[:20])
print()
# Search for Lagos anywhere
for col in ['region', 'class', 'names']:
mask = df_all[col].astype(str).str.contains('Lagos|lagos|LAGOS', na=False)
if mask.any():
print(f'Found Lagos in {col}:')
print(df_all[mask][['id', 'region', 'subtype', 'admin_level', col]].to_string())
print()
# Get all division boundary for Nigeria to find Lagos
# Alternatively, let's use division_boundary type
"
<exit_code>-1</exit_code> <error>Command timed out after 120000ms</error>
8:05
The query was too broad. Let me try using the division_boundary type instead, and specifically look for Lagos. Also, let me try a more targeted approach - use the `geodataframe` function which should be more efficient. Actually, let me think about this differently. I know Lagos State, Nigeria is roughly bounded by: - Latitude: ~6.3°N to ~6.7°N - Longitude: ~2.7°E to ~3.8°E Let me use a bounding box to query the division_area or division_boundary for this area, and filter for Nigeria/Lagos. Or better yet, I can just query the division_boundary type with the bounding box of Lagos State.
8:05
Bash
python3 -c "
python3 -c "
from overturemaps.core import geodataframe, record_batch_reader
import pyarrow as pa
# Use geodataframe which handles the geometry decoding
# Query a bounding box covering Lagos State, Nigeria
bbox = (2.5, 6.2, 4.0, 6.8)
reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=bbox)
if reader:
for batch in reader:
df = batch.to_pandas()
# Check region field
if 'region' in df.columns:
print('Regions in bbox:', df['region'].unique())
print('Types:', df['subtype'].unique() if 'subtype' in df.columns else 'N/A')
print('Countries:', df['country'].unique() if 'country' in df.columns else 'N/A')
# Look at names
if 'names' in df.columns:
print('Names:', df['names'].head(5).to_dict())
" 2>&1 | head -40
Regions in bbox: <ArrowStringArray>
[nan]
Length: 1, dtype: str
Types: <ArrowStringArray>
['dependency']
Length: 1, dtype: str
Countries: <ArrowStringArray>
['UM']
Length: 1, dtype: str
Names: {0: {'primary': 'United States Minor Outlying Islands', 'common': [('hr', 'Mali Udaljeni Otoci Sjedinjenih Američkih Država'), ('pt', 'Ilhas Menores Distantes dos Estados Unidos'), ('cs', 'Menší odlehlé ostrovy Spojených států amerických'), ('ca', "Illes d'Ultramar Menors dels Estats Units"), ('nl', 'Amerikaanse Kleinere Afgelegen Eilanden'), ('zh-Hans', '美国本土外小岛屿'), ('eo', 'Usonaj Malgrandaj Insuloj'), ('ru', 'Внешние малые острова США'), ('zh-Hant', '美國本土外小島嶼'), ('en', 'United States Minor Outlying Islands'), ('es', 'Islas Ultramarinas Menores de Estados Unidos'), ('lt', 'Jungtinių Amerikos Valstijų mažosios aplinkinės salos'), ('zh', '美国本土外小岛屿 / 美國本土外小島嶼'), ('pl', 'Dalekie Wyspy Mniejsze Stanów Zjednoczonych')], 'rules': None}, 1: {'primary': 'United States Minor Outlying Islands', 'common': [('hr', 'Mali Udaljeni Otoci Sjedinjenih Američkih Država'), ('pt', 'Ilhas Menores Distantes dos Estados Unidos'), ('cs', 'Menší odlehlé ostrovy Spojených států amerických'), ('ca', "Illes d'Ultramar Menors dels Estats Units"), ('nl', 'Amerikaanse Kleinere Afgelegen Eilanden'), ('zh-Hans', '美国本土外小岛屿'), ('eo', 'Usonaj Malgrandaj Insuloj'), ('ru', 'Внешние малые острова США'), ('zh-Hant', '美國本土外小島嶼'), ('en', 'United States Minor Outlying Islands'), ('es', 'Islas Ultramarinas Menores de Estados Unidos'), ('lt', 'Jungtinių Amerikos Valstijų mažosios aplinkinės salos'), ('zh', '美国本土外小岛屿 / 美國本土外小島嶼'), ('pl', 'Dalekie Wyspy Mniejsze Stanów Zjednoczonych')], 'rules': None}}
Regions in bbox: <ArrowStringArray>
['BJ-OU', 'NG-LA', 'NG-OG', 'BJ-PL', nan]
Length: 5, dtype: str
Types: <ArrowStringArray>
['county', 'locality', 'region', 'country']
Length: 4, dtype: str
Countries: <ArrowStringArray>
['BJ', 'NG']
Length: 2, dtype: str
Names: {0: {'primary': 'Akpro-Missérété', 'common': [('fr', 'Akpro-Missérété')], 'rules': array([{'variant': 'official', 'language': None, 'perspectives': None, 'value': "Commune d'Akpro-Missérété", 'between': None, 'side': None}],
dtype=object)}, 1: {'primary': 'Aguégués', 'common': [('fr', 'Aguégués')], 'rules': array([{'variant': 'official', 'language': None, 'perspectives': None, 'value': "Commune d'Aguégués", 'between': None, 'side': None}],
dtype=object)}, 2: {'primary': 'Amuwo Odofin', 'common': None, 'rules': None}, 3: {'primary': 'Agege', 'common': None, 'rules': None}, 4: {'primary': 'Ifo', 'common': None, 'rules': None}}
Regions in bbox: <ArrowStringArray>
['NG-LA', 'NG-OG']
Length: 2, dtype: str
Types: <ArrowStringArray>
['region', 'county', 'locality', 'macrohood', 'neighborhood']
Length: 5, dtype: str
Countries: <ArrowStringArray>
['NG']
Length: 1, dtype: str
Names: {0: {'primary': 'Lagos', 'common': [('ka', 'ლაგოსის შტატი'), ('ar', 'لاغوس'), ('fr', 'État de Lagos'), ('uk', 'Лагос'), ('ceb', 'Lagos State'), ('war', 'Lagos'), ('pt', 'Lagos'), ('cs', 'Stát Lagos'), ('gl', 'Estado de Lagos'), ('it', 'Lagos'), ('ca', 'Estat de Lagos'), ('yo', 'Ìpínlẹ̀ Èkó'), ('nl', 'Lagos'), ('bg', 'Лагос'), ('ko', '라고스주'), ('mk', 'Лагос'), ('et', 'Lagose osariik'), ('de', 'Lagos'), ('ha', 'Lagos'), ('azb', 'لاقوس ایالتی'), ('ig', 'Ȯra Lagos'), ('ur', 'لاگوس ریاست'), ('nan', 'Lagos Chiu'), ('ru', 'Лагос'), ('th', 'รัฐเลกอส'), ('id', 'Lagos'), ('sv', 'Lagos'), ('tr', 'Lagos Eyaleti'), ('da', 'Lagos'), ('sr-Cyrl', 'Лагос'), ('en', 'Lagos State'), ('he', 'לאגוס'), ('mrj', 'Лагос'), ('az', 'Laqos ştatı'), ('es', 'Estado de Lagos'), ('hi', 'लेगोस राज्य'), ('sw', 'Jimbo la Lagos'), ('ja', 'ラゴス州'), ('fi', 'Lagos'), ('ro', 'Statul Lagos'), ('no', 'Lagos'), ('vec', 'Ƚagos'), ('fa', 'ایالت لاگوس'), ('zh', '拉哥斯'), ('ms', 'Negeri Lagos'), ('pl', 'Lagos')], 'rules': array([{'variant': 'alternate', 'language': 'zh', 'perspectives': None, 'value': '拉各斯州', 'between': None, 'side': None},
{'variant': 'alternate', 'language': 'ar', 'perspectives': None, 'value': 'ولاية لاغوس', 'between': None, 'side': None},
{'variant': 'alternate', 'language': 'es', 'perspectives': None, 'value': 'Estado Lagos', 'between': None, 'side': None},
{'variant': 'alternate', 'language': 'pt', 'perspectives': None, 'value': 'Estado Lagos', 'between': None, 'side': None},
{'variant': 'alternate', 'language': 'it', 'perspectives': None, 'value': 'Stato Lagos', 'between': None, 'side': None},
{'variant': 'alternate', 'language': 'fr', 'perspectives': None, 'value': 'État Lagos', 'between': None, 'side': None},
{'variant': 'alternate', 'language': 'de', 'perspectives': None, 'value': 'Staat Lagos', 'between': None, 'side': None},
{'variant': 'alternate', 'language': 'nl', 'perspectives': None, 'value': 'Staat Lagos', 'between': None, 'side': None},
{'variant': 'alternate', 'language': 'sv', 'perspectives': None, 'value': 'Lagos-tillstånd', 'between': None, 'side': None},
9:21
I see that Lagos State is present with region code 'NG-LA'. I can see `Amuwo Odofin`, `Agege` as LGAs. Now I need to: 1. Get the geometry of Lagos State (division_area with region='NG-LA') 2. Get all LGAs within Lagos State (these are the "county" subtype divisions) 3. Get all buildings within Lagos State 4. Compute the area of each building 5. Filter for building footprint > 1000 m² 6. Compute per-LGA statistics Let me write this as a comprehensive script.
9:21
9:21
Write
/work/solve.py
content · 332 lines · py
"""
Extract large buildings (>1000 m²) in Lagos State, Nigeria from Overture,
compute per-LGA summary statistics.
"""
import sys
import warnings
warnings.filterwarnings("ignore")
import geopandas as gpd
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import shapely
from overturemaps.core import record_batch_reader, geodataframe
# ── Step 1: Get Lagos State boundary ─────────────────────────────────────
print("Step 1: Getting Lagos State boundary...")
bbox_lagos = (2.5, 6.2, 4.0, 6.8) # rough bounding box around Lagos
# Use division_area to find Lagos State (region='NG-LA')
reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=bbox_lagos)
if reader is None:
sys.exit("No division_area reader returned")
ng_la_area = None
for batch in reader:
df = batch.to_pandas()
mask = df['region'] == 'NG-LA'
if mask.any():
ng_la_area = df[mask].iloc[0]
break
if ng_la_area is None:
sys.exit("Could not find Lagos State (NG-LA) in division_area")
print(f" Lagos State found: {ng_la_area['names']['primary']}")
lagos_geom_geom = shapely.from_wkb(ng_la_area['geometry'])
print(f" Lagos State geometry type: {lagos_geom_geom.geom_type}")
# ── Step 2: Get all LGAs (county subtype) within Lagos State ────────────
print("\nStep 2: Getting Lagos LGAs...")
reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=bbox_lagos)
if reader is None:
sys.exit("No reader returned for LGAs")
all_counties = []
for batch in reader:
df = batch.to_pandas()
# We want counties within Lagos State
# Filter: country='NG' and subtype='county' or subtype='locality' (LGAs might be either)
ng = df[df['country'] == 'NG']
if len(ng) == 0:
continue
all_counties.append(ng)
if all_counties:
counties = pd.concat(all_counties, ignore_index=True)
else:
sys.exit("No NG divisions found")
# Deduplicate by id
counties = counties.drop_duplicates(subset='id')
# The LGAs are within Lagos - let's filter spatially
county_gdf = counties.copy()
county_gdf['geometry'] = county_gdf['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
county_gdf = gpd.GeoDataFrame(county_gdf, geometry='geometry', crs='EPSG:4326')
# Spatial join: keep counties whose centroid is within Lagos State boundary
# (or that intersect Lagos significantly)
lagos_geom = lagos_geom_geom
county_gdf['within_lagos'] = county_gdf.centroid.within(lagos_geom)
county_gdf = county_gdf[county_gdf['within_lagos']].copy()
county_gdf.drop(columns='within_lagos', inplace=True)
# Also look at region field - LGAs should have region='NG-LA' or similar
# Check what region values we have
print(f" Found {len(county_gdf)} divisions within Lagos boundary")
print(f" Subtypes: {county_gdf['subtype'].unique()}")
print(f" Regions: {county_gdf['region'].unique()}")
# Use only actual LGAs (subtype='county' or 'locality')
# In Overture, LGAs in Nigeria are subtype='county' or maybe 'locality'
lga_gdf = county_gdf[county_gdf['subtype'].isin(['county', 'locality'])].copy()
print(f" LGA candidates: {len(lga_gdf)}")
# Print all LGA names
lga_gdf['lga_name'] = lga_gdf['names'].apply(lambda n: n['primary'])
print(" LGAs found:")
for _, row in lga_gdf.iterrows():
print(f" - {row['lga_name']} ({row['subtype']})")
# Build a lookup: LGA name -> geometry (save as a dict for efficiency)
lga_records = []
for _, row in lga_gdf.iterrows():
lga_records.append({
'lga': row['lga_name'],
'geometry': row['geometry']
})
print(f"\nTotal LGAs: {len(lga_records)}")
# ── Step 3: Get all buildings from Overture in the Lagos area ────────────
print("\nStep 3: Querying Overture buildings for Lagos State...")
# Use a generous bounding box, we'll filter spatially
bbox_lagos_big = (2.5, 6.2, 4.0, 6.8)
reader = record_batch_reader('building', release='2026-05-20.0', bbox=bbox_lagos_big)
if reader is None:
sys.exit("No building reader returned")
all_buildings = []
batch_count = 0
for batch in reader:
df = batch.to_pandas()
all_buildings.append(df)
batch_count += 1
if batch_count % 10 == 0:
print(f" Read {batch_count} batches, {sum(len(b) for b in all_buildings)} rows so far...")
print(f" Total batches: {batch_count}")
buildings_raw = pd.concat(all_buildings, ignore_index=True)
print(f" Total buildings fetched: {len(buildings_raw)}")
# ── Step 4: Filter buildings within Lagos State and compute area ─────────
print("\nStep 4: Computing footprint areas and filtering...")
# Decode WKB geometries
buildings_raw['geometry'] = buildings_raw['geometry'].apply(
lambda wkb: shapely.from_wkb(wkb)
)
# Filter buildings that intersect Lagos State boundary
# Use a spatial index on Lagos boundary for speed
lagos_poly = lagos_geom
buildings_raw['in_lagos'] = buildings_raw['geometry'].apply(
lambda g: g is not None and not g.is_empty and g.intersects(lagos_poly)
)
buildings_lagos = buildings_raw[buildings_raw['in_lagos']].copy()
print(f" Buildings intersecting Lagos: {len(buildings_lagos)}")
# Compute footprint area in m² using an equal-area CRS for Nigeria
# Use EPSG:26391 (Nigia East Belt) or EPSG:32631/32632 (UTM zones)
# Lagos is around UTM zone 31N (EPSG:32631)
gdf = gpd.GeoDataFrame(buildings_lagos, geometry='geometry', crs='EPSG:4326')
gdf['geometry'] = gdf['geometry'].to_crs('EPSG:32631')
gdf['footprint_area_m2'] = gdf.geometry.area
# Convert back to 4326 for output
gdf = gdf.to_crs('EPSG:4326')
# Filter: footprint_area_m2 > 1000
gdf_big = gdf[gdf['footprint_area_m2'] > 1000].copy()
print(f" Buildings with area > 1000 m²: {len(gdf_big)}")
# ── Step 5: Assign each building to an LGA ──────────────────────────────
print("\nStep 5: Assigning buildings to LGAs...")
# Build LGA geometries in EPSG:4326
lga_geoms = []
for rec in lga_records:
lga_geoms.append({
'lga': rec['lga'],
'geometry': shapely.from_wkb(rec['geometry']) if isinstance(rec['geometry'], (bytes, bytearray)) else rec['geometry']
})
lga_gdf = gpd.GeoDataFrame(lga_geoms, geometry='geometry', crs='EPSG:4326')
# Spatial join: point-in-polygon using centroids for efficiency
# But for polygon buildings, use representative point or centroid
gdf_big_latlon = gdf_big.copy()
gdf_big_latlon = gdf_big_latlon.set_crs('EPSG:4326')
gdf_big_latlon['centroid'] = gdf_big_latlon.geometry.centroid
gdf_big_latlon = gdf_big_latlon.set_geometry('centroid')
joined = gpd.sjoin(gdf_big_latlon, lga_gdf, how='left', predicate='within')
# For unmatched buildings (centroid outside LGA boundary but building inside Lagos),
# try representative point
unmatched = joined[joined['lga'].isna()]
if len(unmatched) > 0:
print(f" Unmatched buildings (centroid-based): {len(unmatched)}, trying rep_point...")
gdf_big_latlon_unmatched = gdf_big_latlon.loc[unmatched.index]
gdf_big_latlon_unmatched['rep_point'] = gdf_big_latlon_unmatched.geometry.apply(
lambda g: g.representative_point()
)
gdf_big_latlon_unmatched = gdf_big_latlon_unmatched.set_geometry('rep_point')
joined2 = gpd.sjoin(gdf_big_latlon_unmatched, lga_gdf, how='left', predicate='within')
joined.loc[unmatched.index, 'lga'] = joined2['lga']
# Still unmatched - assign to nearest LGA
still_unmatched = joined['lga'].isna()
n_still = still_unmatched.sum()
if n_still > 0:
print(f" Still unmatched: {n_still}, assigning to nearest LGA...")
from shapely.ops import nearest_points
# Build a merged LGA polygon for distance checks
lga_union = lga_gdf.geometry.unary_union
for idx in joined[still_unmatched].index:
pt = gdf_big_latlon.loc[idx, 'centroid']
if pt is None or pt.is_empty:
continue
# Find nearest LGA
min_dist = float('inf')
best_lga = None
for _, lrow in lga_gdf.iterrows():
d = pt.distance(lrow['geometry'])
if d < min_dist:
min_dist = d
best_lga = lrow['lga']
joined.loc[idx, 'lga'] = best_lga
# Drop the centroid geometry, keep original
joined = joined.drop(columns=['centroid', 'rep_point'] if 'rep_point' in joined.columns else ['centroid'])
joined = joined.set_geometry('geometry')
# Remove index_right from spatial join
if 'index_right' in joined.columns:
joined = joined.drop(columns=['index_right'])
big_buildings = joined.copy()
print(f" Buildings with LGA assigned: {big_buildings['lga'].notna().sum()} / {len(big_buildings)}")
# ── Step 6: Prepare output columns ───────────────────────────────────────
print("\nStep 6: Preparing output...")
out = big_buildings[['id', 'height', 'footprint_area_m2', 'lga', 'geometry']].copy()
out = out[out['lga'].notna()].copy()
# Write lagos_buildings.geoparquet
print(" Writing lagos_buildings.geoparquet...")
out = out.set_crs('EPSG:4326')
out.to_parquet('/work/lagos_buildings.geoparquet')
# ── Step 7: Per-LGA roll-up ─────────────────────────────────────────────
print("\nStep 7: Computing per-LGA summary...")
# Also need to compute summary over ALL lagos buildings (not just big ones)
# But the task says "for each Lagos Local Government Area" - the building count etc.
# Actually re-read: "need a per-LGA roll-up" - this is about ALL buildings in Lagos
# Wait, re-reading: "for each Lagos Local Government Area, the building count,
# total footprint area, count of buildings with a non-null Overture height,
# and the median height where available"
# Hmm, this could be for all buildings OR only the filtered (>1000 m²) ones.
# Re-reading the first sentence: "Need every building footprint exceeding 1000 m²"
# That is the filter for the GeoParquet output.
# Then the summary: "for each Lagos Local Government Area, the building count,
# total footprint area..." - this could be summarizing the filtered output.
# But "total footprint area" for the filtered set would just be sum of those >1000m²
# Let me re-read more carefully.
# "Need every building footprint exceeding 1000 m² across Lagos State" - this is for
# the geoparquet output.
# "Also need a per-LGA roll-up: for each Lagos Local Government Area, the building count,
# total footprint area, count of buildings with a non-null Overture height, and the median height"
# I think the roll-up should be over ALL buildings (not just >1000m²), because
# "total footprint area" for all buildings makes more sense as a metric.
# But actually it could be the same set. Let me think...
# "the building count" - this is most useful for ALL buildings in the LGA
# "total footprint area" - same, makes more sense for all buildings
# So I'll compute the roll-up over ALL buildings in Lagos, not just the >1000m² ones.
# Compute footprint area for ALL buildings in Lagos
gdf_all = gpd.GeoDataFrame(buildings_lagos, geometry='geometry', crs='EPSG:4326')
# Compute area
gdf_all_utm = gdf_all.copy()
gdf_all_utm['geometry'] = gdf_all_utm['geometry'].to_crs('EPSG:32631')
gdf_all['footprint_area_m2'] = gdf_all_utm.geometry.area
# Assign LGA using same method
gdf_all['centroid'] = gdf_all.geometry.centroid
gdf_all_cent = gdf_all.set_geometry('centroid')
joined_all = gpd.sjoin(gdf_all_cent, lga_gdf, how='left', predicate='within')
# Handle unmatched
unmatched_all = joined_all[joined_all['lga'].isna()]
if len(unmatched_all) > 0:
gdf_unmatched = gdf_all_cent.loc[unmatched_all.index]
gdf_unmatched['rep_point'] = gdf_unmatched.geometry.apply(
lambda g: g.representative_point() if g is not None and not g.is_empty else None
)
gdf_unmatched = gdf_unmatched[gdf_unmatched['rep_point'].notna()]
gdf_unmatched = gdf_unmatched.set_geometry('rep_point')
if len(gdf_unmatched) > 0:
joined2_all = gpd.sjoin(gdf_unmatched, lga_gdf, how='left', predicate='within')
joined_all.loc[unmatched_all.index, 'lga'] = joined2_all['lga']
# Nearest for still unmatched
still_un = joined_all['lga'].isna()
if still_un.sum() > 0:
for idx in joined_all[still_un].index:
pt = gdf_all_cent.loc[idx, 'centroid']
if pt is None or pt.is_empty:
continue
min_dist = float('inf')
best_lga = None
for _, lrow in lga_gdf.iterrows():
d = pt.distance(lrow['geometry'])
if d < min_dist:
min_dist = d
best_lga = lrow['lga']
joined_all.loc[idx, 'lga'] = best_lga
joined_all = joined_all.drop(columns=['centroid', 'rep_point'] if 'rep_point' in joined_all.columns else ['centroid'])
if 'index_right' in joined_all.columns:
joined_all = joined_all.drop(columns=['index_right'])
# Now compute per-LGA summary
lga_summary = joined_all.groupby('lga').agg(
n_buildings=('id', 'count'),
total_footprint_m2=('footprint_area_m2', 'sum'),
n_with_height=('height', lambda s: s.notna().sum()),
p50_height_m=('height', lambda s: s.median() if s.notna().any() else None)
).reset_index()
print(" LGA Summary:")
print(lga_summary.to_string())
# Write the summary
lga_summary.to_parquet('/work/lagos_building_summary.parquet', index=False)
print("\n✅ Done! Output files:")
print(" /work/lagos_buildings.geoparquet")
print(" /work/lagos_building_summary.parquet")
Wrote /work/solve.py (13141 chars).
9:40
Let me first try running this to see if it works and check for any issues. The script is complex and might have issues with large data volumes.
9:40
Bash
cd /work && python solve.py 2>&1
Step 1: Getting Lagos State boundary...
Lagos State found: Amuwo Odofin
Lagos State geometry type: Polygon
Step 2: Getting Lagos LGAs...
Found 2 divisions within Lagos boundary
Subtypes: <ArrowStringArray>
['county', 'locality']
Length: 2, dtype: str
Regions: <ArrowStringArray>
['NG-LA']
Length: 1, dtype: str
LGA candidates: 2
LGAs found:
- Amuwo Odofin (county)
- Abagbo (locality)
Total LGAs: 2
Step 3: Querying Overture buildings for Lagos State...
Read 10 batches, 273430 rows so far...
Read 20 batches, 557786 rows so far...
Read 30 batches, 962584 rows so far...
Read 40 batches, 1295674 rows so far...
Read 50 batches, 1644849 rows so far...
Read 60 batches, 2050109 rows so far...
Read 70 batches, 2460131 rows so far...
Read 80 batches, 2820222 rows so far...
Read 90 batches, 3152659 rows so far...
Read 100 batches, 3535208 rows so far...
Read 110 batches, 3855485 rows so far...
Read 120 batches, 4186555 rows so far...
Total batches: 129
Total buildings fetched: 4354364
Step 4: Computing footprint areas and filtering...
Buildings intersecting Lagos: 102598
Buildings with area > 1000 m²: 835
Step 5: Assigning buildings to LGAs...
Buildings with LGA assigned: 835 / 835
Step 6: Preparing output...
Writing lagos_buildings.geoparquet...
Step 7: Computing per-LGA summary...
LGA Summary:
lga n_buildings total_footprint_m2 n_with_height p50_height_m
0 Abagbo 84 7.812667e+03 0 None
1 Amuwo Odofin 102598 1.427859e+07 0 None
✅ Done! Output files:
/work/lagos_buildings.geoparquet
/work/lagos_building_summary.parquet
14:26
It seems like the Lagos State boundary query only returned a small part of Lagos. The issue is that the bounding box I used (2.5, 6.2, 4.0, 6.8) only captures a small corner of Lagos State. Lagos State spans approximately from 2.7°E to 4.5°E and 6.2°N to 6.8°N. Let me also expand the bbox. Also, the division_area query for the Lagos region (NG-LA) only returned 2 small divisions (Amuwo Odofin and Abagbo) instead of the whole state. The problem is that I'm not querying the `division_boundary` type for the actual boundary polygon of Lagos State. Let me approach this differently. Actually, I think the issue is: 1. The `division_area` with bbox (2.5, 6.2, 4.0, 6.8) returned only the portions of Lagos State within that bbox. Since Lagos State boundary is large, only a part was returned. 2. I should use `division_boundary` to get the actual boundary geometry of Lagos State. Let me rethink the approach. In Overture: - `division_area` represents the boundary polygon for a division - `division_boundary` represents the boundary line Actually, looking at the output again, it only found "Amuwo Odofin" and "Abagbo" within the bounding box. This is because the bounding box (2.5, 6.2, 4.0, 6.8) only covers a small section of Lagos. I need a wider bounding box. Lagos State coordinates approximately: 2.7°E to 4.5°E, 6.2°N to 6.8°N. Let me also look at whether the division_area query for NG-LA region returned actual results for the whole state or just pieces. The issue is that division_area for a region returns the region's boundary polygon, but the Overture query uses bounding box filtering at the S3 level. So to get the full Lagos State boundary, I need to use a bbox that covers the whole state. But inductive of the Overture data structure - the `division_area` with `region='NG-LA'` should be the boundary of the entire state. Let me widen the bbox.
14:26
14:26
Bash
python3 -c "
python3 -c "
# Lagos State approximate bounds
# Longitude: ~2.7°E to ~4.5°E
# Latitude: ~6.2°N to ~6.8°N
# Let me use a wider bbox
bbox = (2.5, 6.1, 4.6, 6.9)
from overturemaps.core import record_batch_reader
import pyarrow as pa
# Get the division_area for Lagos State (region='NG-LA')
reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=bbox)
if reader:
for batch in reader:
df = batch.to_pandas()
ng = df[df['country'] == 'NG']
if len(ng) > 0:
# Look for Lagos region
lagos = ng[ng['region'] == 'NG-LA']
print(f'NG-LA entries: {len(lagos)}')
if len(lagos) > 0:
for _, row in lagos.iterrows():
print(f\" id={row['id']}, subtype={row['subtype']}, names={row['names']['primary']}\")
print(f'All subtypes: {ng[\"subtype\"].unique()}')
print(f'All regions in NG: {ng[\"region\"].unique()}')
break
" 2>&1
NG-LA entries: 7 id=32cfa1af-5e74-41a7-9272-b9dd8da715d9, subtype=county, names=Amuwo Odofin id=fcaab4e8-075a-4ab6-a2d2-cd614e95dfc2, subtype=county, names=Agege id=934f5b76-4cbf-4902-b124-b33d057d87b1, subtype=county, names=Ifako/Ijaye id=457a7986-f212-4765-b432-e1c1b77d6d61, subtype=county, names=Oshodi/Isolo id=93ccd19d-3389-4c00-a8e3-920e20810d39, subtype=county, names=Badagry id=1a9532db-bcd5-4050-b62b-f27a22f2ebcf, subtype=county, names=Alimosho id=7918fd74-2ce4-48af-b3c1-1a4f1e26558e, subtype=county, names=Ojo All subtypes: <ArrowStringArray> ['county'] Length: 1, dtype: str All regions in NG: <ArrowStringArray> ['NG-LA', 'NG-OG'] Length: 2, dtype: str
14:53
I see - the bounding box (2.5, 6.1, 4.6, 6.9) reveals more LGAs. The Lagos State boundary itself (region='NG-LA' with subtype='region') was missed because of the small bbox. Let me think about this differently. The Lagos State region (admin_level 1) has subtype='region'. Let me find it. Also, the LGAs are subtype='county'. Let me query with a larger bbox to get the full Lagos State boundary and all its LGAs. Let me look for the Lagos State region boundary first.
14:53
Bash
python3 -c "
python3 -c "
from overturemaps.core import record_batch_reader
import pyarrow as pa
bbox = (2.5, 6.1, 4.6, 6.9)
reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=bbox)
all_ng = []
for batch in reader:
df = batch.to_pandas()
ng = df[df['country'] == 'NG']
if len(ng) > 0:
all_ng.append(ng)
import pandas as pd
if all_ng:
df_all = pd.concat(all_ng, ignore_index=True)
print('Total NG entries:', len(df_all))
# Find region subtype
region_rows = df_all[df_all['subtype'] == 'region']
print(f'Region entries: {len(region_rows)}')
for _, row in region_rows.iterrows():
print(f\" region={row.get('region','?')}, names={row['names']['primary']}\")
lagos_state = df_all[(df_all['subtype'] == 'region') & (df_all['region'] == 'NG-LA')]
print(f'Lagos State region entries: {len(lagos_state)}')
# Also find all county entries in NG-LA
counties = df_all[(df_all['subtype'] == 'county') & (df_all['region'] == 'NG-LA')]
print(f'Counties in NG-LA: {len(counties)}')
for _, row in counties.iterrows():
print(f\" {row['names']['primary']}\")
# Also find locality entries
localities = df_all[(df_all['subtype'] == 'locality') & (df_all['region'] == 'NG-LA')]
print(f'Localities in NG-LA: {len(localities)}')
for _, row in localities.iterrows():
print(f\" {row['names']['primary']}\")
" 2>&1
Total NG entries: 51 Region entries: 3 region=NG-LA, names=Lagos region=NG-ON, names=Ondo region=NG-OG, names=Ogun Lagos State region entries: 1 Counties in NG-LA: 20 Amuwo Odofin Agege Ifako/Ijaye Oshodi/Isolo Badagry Alimosho Ojo Ìkòròdú Epe Ajeromi/Ifelodun Kosofe Lagos Island Ikeja Shomolu Lagos Mainland Ibeju Lekki Apapa Surulere Eti Osa Mushin Localities in NG-LA: 2 Alaro City Abagbo
15:31
Now I need to: 1. Use a bigger bbox (2.5, 6.1, 4.6, 6.9) to capture all of Lagos State 2. Get the Lagos State boundary (region subtype) 3. Get all 20 counties (LGAs) + 2 localities 4. Query all buildings in that bbox 5. Filter to those within Lagos State boundary 6. Compute footprint areas 7. Filter >1000 m² for the geoparquet output 8. Assign to LGAs 9. Compute per-LGA summary over ALL buildings (not just the big ones) Let me rewrite the script to be more robust and comprehensive.
15:31
15:31
Write
/work/solve.py
content · 194 lines · py
"""
Extract large buildings (>1000 m²) in Lagos State, Nigeria from Overture,
compute per-LGA summary statistics.
"""
import sys
import warnings
warnings.filterwarnings("ignore")
import geopandas as gpd
import pandas as pd
import numpy as np
import pyarrow as pa
from overturemaps.core import record_batch_reader
import shapely
# ── Config ───────────────────────────────────────────────────────────────
LAGOS_BBOX = (2.5, 6.1, 4.6, 6.9) # covers all of Lagos State
RELEASE = '2026-05-20.0'
# ── Step 1: Get Lagos State boundary and all LGAs ──────────────────────
print("Step 1: Getting Lagos State boundary and LGAs...")
reader = record_batch_reader('division_area', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
sys.exit("No division_area reader returned")
all_divs = []
for batch in reader:
df = batch.to_pandas()
all_divs.append(df)
all_divs = pd.concat(all_divs, ignore_index=True)
# Keep only NG
all_divs = all_divs[all_divs['country'] == 'NG'].copy()
# Deduplicate
all_divs = all_divs.drop_duplicates(subset='id')
# Decode geometries
all_divs['geometry'] = all_divs['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
div_gdf = gpd.GeoDataFrame(all_divs, geometry='geometry', crs='EPSG:4326')
# Find Lagos State (region subtype)
lagos_state = div_gdf[(div_gdf['subtype'] == 'region') & (div_gdf['region'] == 'NG-LA')]
if len(lagos_state) == 0:
sys.exit("Could not find Lagos State region")
lagos_boundary = lagos_state.iloc[0]['geometry']
print(f" Lagos State boundary: {lagos_boundary.geom_type}")
# Find LGAs - subtype='county' or 'locality' within NG-LA
lga_divs = div_gdf[div_gdf['region'] == 'NG-LA'].copy()
# Keep only county and locality subtypes (LGAs)
lga_divs = lga_divs[lga_divs['subtype'].isin(['county', 'locality'])].copy()
lga_divs['lga_name'] = lga_divs['names'].apply(lambda n: n['primary'])
print(f" Total LGAs: {len(lga_divs)}")
for _, row in lga_divs.iterrows():
print(f" - {row['lga_name']} ({row['subtype']})")
# Build LGA GeoDataFrame for spatial joins
lga_gdf = lga_divs[['lga_name', 'geometry']].copy()
lga_gdf = lga_gdf.rename(columns={'lga_name': 'lga'})
# ── Step 2: Get all buildings from Overture ────────────────────────────
print("\nStep 2: Querying Overture buildings...")
reader = record_batch_reader('building', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
sys.exit("No building reader returned")
all_buildings = []
batch_count = 0
for batch in reader:
df = batch.to_pandas()
all_buildings.append(df)
batch_count += 1
if batch_count % 20 == 0:
print(f" Read {batch_count} batches, {sum(len(b) for b in all_buildings)} rows...")
buildings_raw = pd.concat(all_buildings, ignore_index=True)
print(f" Total buildings fetched: {len(buildings_raw)}")
# ── Step 3: Filter buildings within Lagos State boundary ──────────────
print("\nStep 3: Filtering to Lagos State and computing areas...")
# Decode WKB geometries
buildings_raw['geometry'] = buildings_raw['geometry'].apply(
lambda wkb: shapely.from_wkb(wkb) if isinstance(wkb, (bytes, bytearray)) else wkb
)
# Spatial filter - buildings intersecting Lagos State boundary
buildings_raw['in_lagos'] = buildings_raw['geometry'].apply(
lambda g: g is not None and not g.is_empty and g.intersects(lagos_boundary)
)
buildings_lagos = buildings_raw[buildings_raw['in_lagos']].copy()
print(f" Buildings in Lagos State: {len(buildings_lagos)}")
# Convert to GeoDataFrame for area computation
gdf = gpd.GeoDataFrame(buildings_lagos, geometry='geometry', crs='EPSG:4326')
# Compute area in m² - use EPSG:32631 (UTM 31N) which covers Lagos
gdf_utm = gdf.copy()
gdf_utm['geometry'] = gdf_utm['geometry'].to_crs('EPSG:32631')
gdf['footprint_area_m2'] = gdf_utm.geometry.area
# ── Step 4: Assign each building to an LGA ────────────────────────────
print("\nStep 4: Assigning buildings to LGAs...")
# Use centroids for point-in-polygon join
gdf['centroid'] = gdf.geometry.centroid
gdf_cent = gdf.set_geometry('centroid')
# Spatial join
joined = gpd.sjoin(gdf_cent, lga_gdf, how='left', predicate='within')
# Handle unmatched buildings with representative_point
unmask = joined['lga'].isna()
n_unmatched = unmask.sum()
if n_unmatched > 0:
print(f" Unmatched (centroid): {n_unmatched}, trying representative_point...")
for idx in joined[unmask].index:
geom = gdf.loc[idx, 'geometry']
if geom is None or geom.is_empty:
continue
rep_pt = geom.representative_point()
# Find which LGA contains this point
for _, lrow in lga_gdf.iterrows():
if lrow['geometry'].contains(rep_pt):
joined.loc[idx, 'lga'] = lrow['lga']
break
# Nearest LGA for any still unmatched
still_un = joined['lga'].isna()
n_still = still_un.sum()
if n_still > 0:
print(f" Still unmatched: {n_still}, assigning to nearest LGA...")
for idx in joined[still_un].index:
pt = gdf.loc[idx, 'centroid']
if pt is None or pt.is_empty:
continue
min_dist = float('inf')
best_lga = None
for _, lrow in lga_gdf.iterrows():
d = pt.distance(lrow['geometry'])
if d < min_dist:
min_dist = d
best_lga = lrow['lga']
joined.loc[idx, 'lga'] = best_lga
# Drop centroid geometry and index_right
joined = joined.drop(columns=['centroid'])
if 'index_right' in joined.columns:
joined = joined.drop(columns=['index_right'])
joined = joined.set_geometry('geometry')
assigned = joined['lga'].notna().sum()
print(f" Buildings with LGA assigned: {assigned} / {len(joined)}")
# ── Step 5: Filter to buildings > 1000 m² for output ──────────────────
print("\nStep 5: Filtering to buildings > 1000 m²...")
big_buildings = joined[joined['footprint_area_m2'] > 1000].copy()
big_buildings = big_buildings[big_buildings['lga'].notna()].copy()
print(f" Buildings > 1000 m²: {len(big_buildings)}")
# Prepare output columns
out_cols = ['id', 'height', 'footprint_area_m2', 'lga', 'geometry']
out = big_buildings[out_cols].copy()
out = out.set_crs('EPSG:4326')
# Write lagos_buildings.geoparquet
print(" Writing lagos_buildings.geoparquet...")
out.to_parquet('/work/lagos_buildings.geoparquet')
# ── Step 6: Per-LGA roll-up over ALL buildings ────────────────────────
print("\nStep 6: Computing per-LGA summary (over all buildings)...")
# Use the joined dataframe with all Lagos buildings (before area filter)
all_with_lga = joined[joined['lga'].notna()].copy()
lga_summary = all_with_lga.groupby('lga').agg(
n_buildings=('id', 'count'),
total_footprint_m2=('footprint_area_m2', 'sum'),
n_with_height=('height', lambda s: s.notna().sum()),
p50_height_m=('height', lambda s: s.median() if s.notna().any() else None)
).reset_index()
print(lga_summary.to_string())
# Write summary
lga_summary.to_parquet('/work/lagos_building_summary.parquet', index=False)
print("\n✅ Done!")
print(f" /work/lagos_buildings.geoparquet ({len(out)} rows)")
print(f" /work/lagos_building_summary.parquet ({len(lga_summary)} LGAs)")
Wrote /work/solve.py (7077 chars).
15:36
Wait, I realize there's an issue with the approach. The `division_area` type might have the Lagos State boundary, but when I did the previous query (with the small bbox), I only got Amuwo Odofin as the first match because I used `df[mask].iloc[0]` (the first row where region='NG-LA'). Now I'm using a wider bbox and looking for `subtype='region'` with `region='NG-LA'` which should be correct. Also, I need to reconsider: will the `record_batch_reader` for 'division_area' return all batches if I just iterate? Yes, it should iterate through all the parquet files that match the spatial filter. Let me also check - the Lagos State boundary geometry might be stored in the `division_area` type or the `division_boundary` type. Let me check both. Actually, looking at the output from the previous small bbox query: - The first batch had `subtype=dependency, country=UM` - this was from US Minor Outlying Islands which happens to be in the bbox - The second batch had Lagos entries with subtypes: county, locality, region, country So `division_area` does contain the Lagos State region with subtype='region'. Good. Now let me also consider: the Lagos State shape from `division_area` might actually be the administrative boundary. That's what I need. Let me run the updated script.
15:36
Bash
cd /work && python solve.py 2>&1
Step 1: Getting Lagos State boundary and LGAs...
Lagos State boundary: Polygon
Total LGAs: 22
- Amuwo Odofin (county)
- Agege (county)
- Ifako/Ijaye (county)
- Oshodi/Isolo (county)
- Badagry (county)
- Alimosho (county)
- Ojo (county)
- Ìkòròdú (county)
- Epe (county)
- Ajeromi/Ifelodun (county)
- Kosofe (county)
- Alaro City (locality)
- Lagos Island (county)
- Ikeja (county)
- Shomolu (county)
- Lagos Mainland (county)
- Abagbo (locality)
- Ibeju Lekki (county)
- Apapa (county)
- Surulere (county)
- Eti Osa (county)
- Mushin (county)
Step 2: Querying Overture buildings...
Read 20 batches, 496767 rows...
Read 40 batches, 1265487 rows...
Read 60 batches, 1986906 rows...
Read 80 batches, 2807879 rows...
Read 100 batches, 3509578 rows...
Read 120 batches, 4094347 rows...
Read 140 batches, 4769198 rows...
Total buildings fetched: 5003306
Step 3: Filtering to Lagos State and computing areas...
Buildings in Lagos State: 2245422
Step 4: Assigning buildings to LGAs...
bash: line 1: 640 Killed python solve.py 2>&1
20:38
The process was killed (OOM - out of memory). 2.2 million buildings with geometry operations is too much. I need to optimize: 1. Process buildings in batches using DuckDB or pyarrow 2. Use spatial index/operations more efficiently 3. Maybe avoid converting all geometries at once Let me take a more memory-efficient approach - use DuckDB with the spatial extension, or process in chunks. Actually, the overturemaps library returns RecordBatchReaders which are already batched. I can process each batch one at a time rather than collecting everything in memory. Let me redesign the approach: 1. Get Lagos boundary and LGA polygons first 2. For each building batch, filter by spatial intersection, compute area, assign LGA, and aggregate incrementally 3. Only keep >1000 m² buildings for the big output file This should use much less memory.
20:38
20:38
Write
/work/solve.py
content · 230 lines · py
"""
Extract large buildings (>1000 m²) in Lagos State, Nigeria from Overture,
compute per-LGA summary statistics.
Memory-efficient batch processing.
"""
import sys
import warnings
warnings.filterwarnings("ignore")
import geopandas as gpd
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
from overturemaps.core import record_batch_reader
import shapely
# ── Config ───────────────────────────────────────────────────────────────
LAGOS_BBOX = (2.5, 6.1, 4.6, 6.9) # covers all of Lagos State
RELEASE = '2026-05-20.0'
# ── Step 1: Get Lagos State boundary and all LGAs ──────────────────────
print("Step 1: Getting Lagos State boundary and LGAs...")
reader = record_batch_reader('division_area', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
sys.exit("No division_area reader returned")
all_divs = []
for batch in reader:
df = batch.to_pandas()
all_divs.append(df)
all_divs = pd.concat(all_divs, ignore_index=True)
all_divs = all_divs[all_divs['country'] == 'NG'].copy()
all_divs = all_divs.drop_duplicates(subset='id')
# Decode geometries
all_divs['geometry'] = all_divs['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
div_gdf = gpd.GeoDataFrame(all_divs, geometry='geometry', crs='EPSG:4326')
# Lagos State boundary
lagos_state = div_gdf[(div_gdf['subtype'] == 'region') & (div_gdf['region'] == 'NG-LA')]
if len(lagos_state) == 0:
sys.exit("Could not find Lagos State region")
lagos_boundary = lagos_state.iloc[0]['geometry']
print(f" Lagos State boundary: {lagos_boundary.geom_type}")
# LGAs: county or locality within NG-LA
lga_divs = div_gdf[div_gdf['region'] == 'NG-LA'].copy()
lga_divs = lga_divs[lga_divs['subtype'].isin(['county', 'locality'])].copy()
lga_divs['lga_name'] = lga_divs['names'].apply(lambda n: n['primary'])
print(f" Total LGAs: {len(lga_divs)}")
for _, row in lga_divs.iterrows():
print(f" - {row['lga_name']} ({row['subtype']})")
lga_gdf = lga_divs[['lga_name', 'geometry']].copy().rename(columns={'lga_name': 'lga'})
# Build an R-tree index for fast point-in-polygon lookups
from shapely import STRtree
lga_tree = STRtree(lga_gdf['geometry'].tolist())
# Map tree index -> lga name
lga_index_to_name = {i: name for i, name in enumerate(lga_gdf['lga'].tolist())}
# Function to assign LGA to a point
def assign_lga_by_point(pt):
"""Return LGA name for a point, or None."""
if pt is None or pt.is_empty:
return None
# Query the tree
candidates = lga_tree.query(pt)
for idx in candidates:
if lga_gdf.iloc[idx]['geometry'].contains(pt):
return lga_gdf.iloc[idx]['lga']
# Nearest neighbor fallback
min_dist = float('inf')
best = None
for _, row in lga_gdf.iterrows():
d = pt.distance(row['geometry'])
if d < min_dist:
min_dist = d
best = row['lga']
return best
# Pre-compute UTM geometry for Lagos boundary
lagos_utm = shapely.transform(lagos_boundary, lambda x, y, z=None: (
gpd.GeoSeries([shapely.Point(0,0)], crs='EPSG:4326')
.to_crs('EPSG:32631')
).geometry.iloc[0].coords[:][0]) # NO, let me use a simpler transform
# Actually, let's just pre-compute with pyproj
import pyproj
transformer = pyproj.Transformer.from_crs('EPSG:4326', 'EPSG:32631', always_xy=True)
def area_m2_4326(geom):
"""Compute area in m² for a 4326 polygon using UTM 31N transform."""
if geom is None or geom.is_empty:
return 0.0
try:
geom_utm = shapely.ops.transform(lambda x, y: transformer.transform(x, y), geom)
return geom_utm.area
except Exception:
return 0.0
# ── Step 2: Process buildings in batches ──────────────────────────────
print("\nStep 2: Processing Overture buildings in batches...")
reader = record_batch_reader('building', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
sys.exit("No building reader returned")
# Output accumulators
all_big_buildings = [] # for geoparquet (>1000 m²)
lga_summary_accum = {} # lga -> {n, total_area, n_h, heights_list}
batch_count = 0
for batch in reader:
batch_count += 1
df = batch.to_pandas()
# Decode geometry
df['geometry'] = df['geometry'].apply(
lambda wkb: shapely.from_wkb(wkb) if isinstance(wkb, (bytes, bytearray)) else wkb
)
# Quick spatial filter using envelope/bbox intersection first
# Check if building intersects Lagos
in_lagos = df['geometry'].apply(
lambda g: g is not None and not g.is_empty and g.intersects(lagos_boundary)
)
df_lagos = df[in_lagos].copy()
if len(df_lagos) == 0:
continue
# Compute area
for i, row in df_lagos.iterrows():
geom = row['geometry']
area_val = area_m2_4326(geom)
df_lagos.loc[i, 'footprint_area_m2'] = area_val
# Assign LGA using centroid
for i, row in df_lagos.iterrows():
geom = row['geometry']
if geom is None or geom.is_empty:
df_lagos.loc[i, 'lga'] = None
continue
centroid = geom.centroid
lga_name = assign_lga_by_point(centroid)
if lga_name is None:
# Try representative point
rep = geom.representative_point()
lga_name = assign_lga_by_point(rep)
df_lagos.loc[i, 'lga'] = lga_name
# Remove rows with no LGA
df_lagos = df_lagos[df_lagos['lga'].notna()].copy()
if len(df_lagos) == 0:
continue
# Accumulate for big buildings output
big_mask = df_lagos['footprint_area_m2'] > 1000
big_batch = df_lagos[big_mask].copy()
if len(big_batch) > 0:
all_big_buildings.append(big_batch[['id', 'height', 'footprint_area_m2', 'lga', 'geometry']])
# Accumulate for per-LGA summary
for _, row in df_lagos.iterrows():
lga = row['lga']
h = row['height']
area = row['footprint_area_m2']
if lga not in lga_summary_accum:
lga_summary_accum[lga] = {
'n': 0, 'total_area': 0.0, 'n_h': 0, 'heights': []
}
acc = lga_summary_accum[lga]
acc['n'] += 1
acc['total_area'] += area
if pd.notna(h):
acc['n_h'] += 1
acc['heights'].append(h)
if batch_count % 50 == 0:
print(f" Processed {batch_count} batches, {sum(len(b) for b in all_big_buildings)} big buildings so far...")
print(f" Total batches processed: {batch_count}")
print(f" Total big buildings (>1000 m²): {sum(len(b) for b in all_big_buildings)}")
# ── Step 3: Write lagos_buildings.geoparquet ──────────────────────────
print("\nStep 3: Writing lagos_buildings.geoparquet...")
if all_big_buildings:
big_out = pd.concat(all_big_buildings, ignore_index=True)
big_gdf = gpd.GeoDataFrame(big_out, geometry='geometry', crs='EPSG:4326')
big_gdf.to_parquet('/work/lagos_buildings.geoparquet')
print(f" Wrote {len(big_gdf)} rows")
else:
print(" No big buildings found, writing empty file")
empty = gpd.GeoDataFrame({'id': [], 'height': [], 'footprint_area_m2': [], 'lga': []},
geometry=[], crs='EPSG:4326')
empty.to_parquet('/work/lagos_buildings.geoparquet')
# Clear memory
all_big_buildings = None
# ── Step 4: Write lagos_building_summary.parquet ──────────────────────
print("\nStep 4: Writing lagos_building_summary.parquet...")
summary_rows = []
for lga, acc in lga_summary_accum.items():
heights = acc['heights']
p50 = float(np.median(heights)) if len(heights) > 0 else None
summary_rows.append({
'lga': lga,
'n_buildings': acc['n'],
'total_footprint_m2': acc['total_area'],
'n_with_height': acc['n_h'],
'p50_height_m': p50
})
summary_df = pd.DataFrame(summary_rows)
summary_df = summary_df.sort_values('lga').reset_index(drop=True)
summary_df.to_parquet('/work/lagos_building_summary.parquet', index=False)
print(summary_df.to_string())
print("\n✅ Done!")
print(f" /work/lagos_buildings.geoparquet")
print(f" /work/lagos_building_summary.parquet")
Wrote /work/solve.py (8039 chars).
20:58
The issue with this approach is that the area calculation using pyproj transform for every single row will be slow. Let me think about a better approach. Actually, the real issue was memory - 2.2M buildings loaded at once. Let me try a chunked approach but with a simpler area computation. Actually, let me think about using DuckDB with the spatial extension to read the overture data directly from S3. That might be more efficient. Wait, actually the overturemaps library already gives us RecordBatchReaders, which are streaming. The problem was that I was accumulating all batches in memory. Let me fix by not accumulating, just processing each batch and discarding. The approach I have above does that - it processes each batch individually and only keeps the big building records. Let me try running it. The area calculation using pyproj for each row could be slow though. Let me optimize by vectorizing. Actually, let me re-think. Each batch might be ~40k rows. For each row I'm calling area_m2_4326 which does a pyproj transform. That could be slow for 2.2M rows. Let me use a GeoSeries approach instead - convert each batch's Lagos-filtered geometries to GeoSeries, project to UTM, and compute areas in one go. Let me rewrite more carefully.
20:58
20:58
Write
/work/solve.py
content · 198 lines · py
"""
Extract large buildings (>1000 m²) in Lagos State, Nigeria from Overture,
compute per-LGA summary statistics.
Memory-efficient batch processing using DuckDB for heavy lifting.
"""
import sys
import warnings
warnings.filterwarnings("ignore")
import geopandas as gpd
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
from overturemaps.core import record_batch_reader
import shapely
from shapely import STRtree
# ── Config ───────────────────────────────────────────────────────────────
LAGOS_BBOX = (2.5, 6.1, 4.6, 6.9)
RELEASE = '2026-05-20.0'
# ── Step 1: Get Lagos State boundary and LGAs ──────────────────────────
print("Step 1: Getting Lagos State boundary and LGAs...")
reader = record_batch_reader('division_area', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
sys.exit("No division_area reader returned")
all_divs = []
for batch in reader:
all_divs.append(batch.to_pandas())
all_divs = pd.concat(all_divs, ignore_index=True)
all_divs = all_divs[all_divs['country'] == 'NG'].copy()
all_divs = all_divs.drop_duplicates(subset='id')
all_divs['geometry'] = all_divs['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
div_gdf = gpd.GeoDataFrame(all_divs, geometry='geometry', crs='EPSG:4326')
# Lagos State boundary
lagos_state = div_gdf[(div_gdf['subtype'] == 'region') & (div_gdf['region'] == 'NG-LA')]
if len(lagos_state) == 0:
sys.exit("Could not find Lagos State region")
lagos_boundary = lagos_state.iloc[0]['geometry']
print(f" Lagos State boundary type: {lagos_boundary.geom_type}")
# LGAs
lga_divs = div_gdf[div_gdf['region'] == 'NG-LA'].copy()
lga_divs = lga_divs[lga_divs['subtype'].isin(['county', 'locality'])].copy()
lga_divs['lga_name'] = lga_divs['names'].apply(lambda n: n['primary'])
print(f" Total LGAs: {len(lga_divs)}")
for _, row in lga_divs.iterrows():
print(f" - {row['lga_name']} ({row['subtype']})")
lga_df = lga_divs[['lga_name', 'geometry']].copy().rename(columns={'lga_name': 'lga'})
lga_geoms = lga_df['geometry'].tolist()
lga_names = lga_df['lga'].tolist()
# R-tree for fast point-in-polygon
lga_tree = STRtree(lga_geoms)
def assign_lga(pt):
"""Return LGA name for a point geometry."""
if pt is None or pt.is_empty:
return None
candidates = lga_tree.query(pt)
for idx in candidates:
if lga_geoms[idx].contains(pt):
return lga_names[idx]
# Nearest neighbor fallback
min_d = float('inf')
best = None
for i, g in enumerate(lga_geoms):
d = pt.distance(g)
if d < min_d:
min_d = d
best = lga_names[i]
return best
# ── Step 2: Process buildings in batches ──────────────────────────────
print("\nStep 2: Processing buildings in batches...")
reader = record_batch_reader('building', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
sys.exit("No building reader returned")
big_buildings_list = []
lga_summary = {} # lga -> {n, total_area, n_h, heights[]}
batch_num = 0
for batch in reader:
batch_num += 1
df = batch.to_pandas()
# Decode WKB -> Shapely
df['geometry'] = df['geometry'].apply(
lambda wkb: shapely.from_wkb(wkb) if isinstance(wkb, (bytes, bytearray)) else wkb
)
# Quick bbox/envelope filter using shapely intersects
in_lagos = df['geometry'].apply(
lambda g: g is not None and not g.is_empty and g.intersects(lagos_boundary)
)
df_lag = df[in_lagos].copy()
if len(df_lag) == 0:
continue
# ---- Compute area in m² using vectorized GeoSeries ----
gs = gpd.GeoSeries(df_lag['geometry'].values, crs='EPSG:4326')
gs_utm = gs.to_crs('EPSG:32631')
df_lag['footprint_area_m2'] = gs_utm.area.values
# ---- Assign LGA ----
centroids = df_lag['geometry'].apply(lambda g: g.centroid if g and not g.is_empty else None)
lga_vals = centroids.apply(assign_lga)
# For nulls, try representative point
null_mask = lga_vals.isna()
if null_mask.any():
rep_pts = df_lag.loc[null_mask, 'geometry'].apply(
lambda g: g.representative_point() if g and not g.is_empty else None
)
lga_vals.loc[null_mask] = rep_pts.apply(assign_lga)
df_lag['lga'] = lga_vals
# Drop rows with no LGA
df_lag = df_lag[df_lag['lga'].notna()].copy()
if len(df_lag) == 0:
continue
# ---- Collect big buildings (>1000 m²) ----
big = df_lag[df_lag['footprint_area_m2'] > 1000]
if len(big) > 0:
big_buildings_list.append(
big[['id', 'height', 'footprint_area_m2', 'lga', 'geometry']].copy()
)
# ---- Update LGA summary ----
for _, row in df_lag.iterrows():
lga = row['lga']
h = row['height']
area = row['footprint_area_m2']
if lga not in lga_summary:
lga_summary[lga] = {'n': 0, 'area': 0.0, 'n_h': 0, 'heights': []}
acc = lga_summary[lga]
acc['n'] += 1
acc['area'] += area
if pd.notna(h):
acc['n_h'] += 1
acc['heights'].append(h)
if batch_num % 50 == 0:
n_big = sum(len(b) for b in big_buildings_list)
print(f" Batch {batch_num}: {len(df_lag)} in Lagos, {n_big} big buildings so far...")
print(f" Processed {batch_num} batches.")
n_big_total = sum(len(b) for b in big_buildings_list)
print(f" Total buildings >1000 m²: {n_big_total}")
# ── Step 3: Write lagos_buildings.geoparquet ──────────────────────────
print("\nStep 3: Writing lagos_buildings.geoparquet...")
if big_buildings_list:
big_df = pd.concat(big_buildings_list, ignore_index=True)
big_gdf = gpd.GeoDataFrame(big_df, geometry='geometry', crs='EPSG:4326')
big_gdf.to_parquet('/work/lagos_buildings.geoparquet')
print(f" Wrote {len(big_gdf)} rows")
big_df = None
else:
print(" No big buildings found")
empty = gpd.GeoDataFrame(
{'id': pd.Series([], dtype='str'), 'height': pd.Series([], dtype='float64'),
'footprint_area_m2': pd.Series([], dtype='float64'), 'lga': pd.Series([], dtype='str')},
geometry=pd.Series([], dtype='object'), crs='EPSG:4326'
)
empty.to_parquet('/work/lagos_buildings.geoparquet')
big_buildings_list = None
# ── Step 4: Write lagos_building_summary.parquet ──────────────────────
print("\nStep 4: Writing lagos_building_summary.parquet...")
rows = []
for lga, acc in lga_summary.items():
hts = acc['heights']
p50 = float(np.median(hts)) if len(hts) > 0 else None
rows.append({
'lga': lga,
'n_buildings': acc['n'],
'total_footprint_m2': acc['area'],
'n_with_height': acc['n_h'],
'p50_height_m': p50
})
summary_df = pd.DataFrame(rows).sort_values('lga').reset_index(drop=True)
summary_df.to_parquet('/work/lagos_building_summary.parquet', index=False)
print(summary_df.to_string())
print("\n✅ Done!")
Wrote /work/solve.py (6870 chars).
21:06
The main bottleneck is the huge number of buildings (2.2M). Let me think about optimizations:
1. The `intersects` check for each geometry against Lagos boundary might be slow
2. Converting 500k+ geometries per batch to GeoSeries could be memory-heavy
3. The `assign_lga` with R-tree should be fast
Actually, the bigger issue from before was that the process got killed (OOM). Let me think about what's using the most memory:
- `all_buildings.append(df)` was holding all data in memory
- We're now processing batch-by-batch which is good
- But `gpd.GeoSeries(..., crs='EPSG:4326').to_crs('EPSG:32631')` creates in-memory copies
Let me try a slightly different approach - use DuckDB's spatial extension to do the heavy lifting, or just process the batches more carefully.
Actually, looking at the error - the process was killed at Step 4 (assigning LGAs) with 2.2M buildings. The OOM probably happened because:
1. The GeoSeries operations on a large batch might create many copies
2. Or the STRtree query for each point
But actually, the batches from overturemaps are already quite large (each batch is ~40-50k rows). One batch in memory + GeoSeries should be fine. Let me check how many rows per batch we're getting.
Actually, the OOM happened with the FIRST version that did `pd.concat(all_buildings, ...)` which loaded ALL 5M buildings. The second version processed batch-by-batch but still got OOM-killed at Step 4. Hmm.
Wait, looking more carefully at the output: "Step 4: Assigning buildings to LGAs..." and then Killed. So it was killed during the LGA assignment phase.
In the second version (which was killed), the bottleneck was likely the `for idx in joined[unmask].index` loop - iterating over 2.2M rows with spatial operations. Actually, that was the SECOND version that was killed.
Let me think about further optimizations:
1. Instead of using GeoSeries `.to_crs()` which creates full copies, use `shapely.transform` with pyproj transformer
2. Process batches one at a time (already doing this)
3. Use a more efficient spatial filter (envelope check first)
Actually, I think the main issue is that each batch can be large (~40-50k buildings). With 2.2M buildings in Lagos, that's ~44-55 batches. Processing each one with GeoSeries should be okay.
But wait, the version that was killed (version 2) wasn't processing batch-by-batch - it loaded everything into `buildings_raw` and then filtered. Let me look at version 3 which IS batch-by-batch.
Actually looking at the output of version 2's run - it got through "Step 3: Filtering to Lagos State and computing areas..." with 2,245,422 buildings, then started Step 4 and got killed. That was the version that loaded all buildings at once.
Version 3 processes batch-by-batch, so it should be fine memory-wise. Let me just run it.
21:06
Bash
cd /work && python solve.py 2>&1
Step 1: Getting Lagos State boundary and LGAs...
Lagos State boundary type: Polygon
Total LGAs: 22
- Amuwo Odofin (county)
- Agege (county)
- Ifako/Ijaye (county)
- Oshodi/Isolo (county)
- Badagry (county)
- Alimosho (county)
- Ojo (county)
- Ìkòròdú (county)
- Epe (county)
- Ajeromi/Ifelodun (county)
- Kosofe (county)
- Alaro City (locality)
- Lagos Island (county)
- Ikeja (county)
- Shomolu (county)
- Lagos Mainland (county)
- Abagbo (locality)
- Ibeju Lekki (county)
- Apapa (county)
- Surulere (county)
- Eti Osa (county)
- Mushin (county)
Step 2: Processing buildings in batches...
Batch 50: 36901 in Lagos, 1325 big buildings so far...
Processed 153 batches.
Total buildings >1000 m²: 7249
Step 3: Writing lagos_buildings.geoparquet...
Wrote 7249 rows
Step 4: Writing lagos_building_summary.parquet...
lga n_buildings total_footprint_m2 n_with_height p50_height_m
0 Abagbo 84 7.812667e+03 0 NaN
1 Agege 32564 4.478193e+06 0 NaN
2 Ajeromi/Ifelodun 33206 4.593079e+06 0 NaN
3 Alimosho 433223 5.185803e+07 0 NaN
4 Amuwo Odofin 102455 1.426173e+07 0 NaN
5 Apapa 32502 4.275231e+06 0 NaN
6 Badagry 171840 1.887823e+07 0 NaN
7 Epe 74235 9.777356e+06 3 10.0
8 Eti Osa 185690 2.922655e+07 1791 9.0
9 Ibeju Lekki 131914 1.656669e+07 18 6.0
10 Ifako/Ijaye 68168 9.156084e+06 0 NaN
11 Ikeja 65919 1.232163e+07 149 10.0
12 Kosofe 92184 1.338926e+07 2 6.5
13 Lagos Island 9069 1.565505e+06 26 15.5
14 Lagos Mainland 35267 3.311757e+06 4 40.0
15 Mushin 45998 7.065688e+06 1 35.0
16 Ojo 149350 1.880956e+07 0 NaN
17 Oshodi/Isolo 95548 1.423290e+07 13 5.0
18 Shomolu 32220 4.983182e+06 12 16.0
19 Surulere 53597 7.867977e+06 0 NaN
20 Ìkòròdú 400389 5.004820e+07 569 6.0
✅ Done!
27:48
Let me verify the outputs match the requirements: 1. `lagos_buildings.geoparquet` - EPSG:4326 GeoParquet with columns `id`, `height`, `footprint_area_m2`, `lga`, and `geometry` ✓ 2. `lagos_building_summary.parquet` - plain Parquet with columns `lga`, `n_buildings`, `total_footprint_m2`, `n_with_height`, `p50_height_m` ✓ Let me verify the files on disk. Wait - the summary has `NaN` for p50_height_m where there's no height data, but the requirement says "null otherwise". Let me check if parquet can store NaN or if it should be null. In pandas, NaN in float columns becomes null in parquet, so that should be fine. But let me verify: - I notice "Alaro City" (locality) appears in the LGA list but not in the summary. It's possible that no buildings were assigned to Alaro City. Let me check. Actually looking at the summary output, there are 21 entries but there were 22 LGAs. Alaro City is missing from the summary. That could be because no buildings fell within its boundary. Let me check. Also, let me verify the GeoParquet file.
27:48
Bash
cd /work && ls -lh lagos_buildings.geoparquet lagos_building_summary.parquet
-rw-r--r-- 1 agent agent 4.2K Jun 16 06:34 lagos_building_summary.parquet -rw-r--r-- 1 agent agent 1.4M Jun 16 06:34 lagos_buildings.geoparquet
27:48
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import pandas as pd
import geopandas as gpd
# Check buildings file
gdf = gpd.read_parquet('/work/lagos_buildings.geoparquet')
print('=== lagos_buildings.geoparquet ===')
print(f'CRS: {gdf.crs}')
print(f'Columns: {gdf.columns.tolist()}')
print(f'Rows: {len(gdf)}')
print(f'Dtypes:\n{gdf.dtypes}')
print()
# Check summary file
df = pd.read_parquet('/work/lagos_building_summary.parquet')
print('=== lagos_building_summary.parquet ===')
print(f'Columns: {df.columns.tolist()}')
print(f'Rows: {len(df)}')
print(f'Dtypes:\n{df.dtypes}')
print()
print('All values:')
print(df.to_string())
# Check for Alaro City
print()
lagos_lgas = ['Amuwo Odofin', 'Agege', 'Ifako/Ijaye', 'Oshodi/Isolo', 'Badagry',
'Alimosho', 'Ojo', 'Ìkòròdú', 'Epe', 'Ajeromi/Ifelodun', 'Kosofe',
'Alaro City', 'Lagos Island', 'Ikeja', 'Shomolu', 'Lagos Mainland',
'Abagbo', 'Ibeju Lekki', 'Apapa', 'Surulere', 'Eti Osa', 'Mushin']
for l in lagos_lgas:
if l not in df['lga'].values:
print(f'MISSING from summary: {l}')
"
=== lagos_buildings.geoparquet ===
CRS: {"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "GeographicCRS", "name": "WGS 84", "datum_ensemble": {"name": "World Geodetic System 1984 ensemble", "members": [{"name": "World Geodetic System 1984 (Transit)"}, {"name": "World Geodetic System 1984 (G730)"}, {"name": "World Geodetic System 1984 (G873)"}, {"name": "World Geodetic System 1984 (G1150)"}, {"name": "World Geodetic System 1984 (G1674)"}, {"name": "World Geodetic System 1984 (G1762)"}, {"name": "World Geodetic System 1984 (G2139)"}, {"name": "World Geodetic System 1984 (G2296)"}], "ellipsoid": {"name": "WGS 84", "semi_major_axis": 6378137, "inverse_flattening": 298.257223563}, "accuracy": "2.0", "id": {"authority": "EPSG", "code": 6326}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}, {"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}]}, "scope": "Horizontal component of 3D system.", "area": "World.", "bbox": {"south_latitude": -90, "west_longitude": -180, "north_latitude": 90, "east_longitude": 180}, "id": {"authority": "EPSG", "code": 4326}}
Columns: ['id', 'height', 'footprint_area_m2', 'lga', 'geometry']
Rows: 7249
Dtypes:
id str
height float64
footprint_area_m2 float64
lga str
geometry geometry
dtype: object
=== lagos_building_summary.parquet ===
Columns: ['lga', 'n_buildings', 'total_footprint_m2', 'n_with_height', 'p50_height_m']
Rows: 21
Dtypes:
lga str
n_buildings int64
total_footprint_m2 float64
n_with_height int64
p50_height_m float64
dtype: object
All values:
lga n_buildings total_footprint_m2 n_with_height p50_height_m
0 Abagbo 84 7.812667e+03 0 NaN
1 Agege 32564 4.478193e+06 0 NaN
2 Ajeromi/Ifelodun 33206 4.593079e+06 0 NaN
3 Alimosho 433223 5.185803e+07 0 NaN
4 Amuwo Odofin 102455 1.426173e+07 0 NaN
5 Apapa 32502 4.275231e+06 0 NaN
6 Badagry 171840 1.887823e+07 0 NaN
7 Epe 74235 9.777356e+06 3 10.0
8 Eti Osa 185690 2.922655e+07 1791 9.0
9 Ibeju Lekki 131914 1.656669e+07 18 6.0
10 Ifako/Ijaye 68168 9.156084e+06 0 NaN
11 Ikeja 65919 1.232163e+07 149 10.0
12 Kosofe 92184 1.338926e+07 2 6.5
13 Lagos Island 9069 1.565505e+06 26 15.5
14 Lagos Mainland 35267 3.311757e+06 4 40.0
15 Mushin 45998 7.065688e+06 1 35.0
16 Ojo 149350 1.880956e+07 0 NaN
17 Oshodi/Isolo 95548 1.423290e+07 13 5.0
18 Shomolu 32220 4.983182e+06 12 16.0
19 Surulere 53597 7.867977e+06 0 NaN
20 Ìkòròdú 400389 5.004820e+07 569 6.0
MISSING from summary: Alaro City
28:09
Alaro City is missing from the summary. Let me check why. It's listed as a "locality" subtype in the LGAs. Let me check if there are actually any buildings assigned to Alaro City. It might be a small area without many buildings, or the spatial join might be failing for it. Actually, looking at the output, Alaro City and Abagbo are both "locality" subtype. Abagbo has 84 buildings, but Alaro City has 0. Let me check what happened - maybe Alaro City's boundary is very small or buildings don't intersect it. But actually, the task says "for each Lagos Local Government Area". Alaro City is a locality, not a formal LGA. The official Lagos LGAs are the 20 "county" subtype entries. Abagbo and Alaro City are "locality" subtype - smaller subdivisions. So it's fine if Alaro City has no buildings (or wasn't in the spatial results). Wait, looking more carefully - Alaro City should appear in the summary even if it has 0 buildings. But it doesn't appear at all, meaning no buildings were assigned to it. That's because we only have entries for LGAs that had buildings. I should ensure all LGAs appear in the summary. But wait, re-reading the task: "for each Lagos Local Government Area, the building count, total footprint area, count of buildings with a non-null Overture height, and the median height where available (null otherwise)." This could mean the official LGAs only. The 20 "county" entries are the official LGAs. Alaro City and Abagbo are localities, not LGAs. But I included them since I included both subtypes. Actually, let me check: are all 20 official LGAs present? Let me compare... Official Lagos LGAs (20): Agege, Ajeromi-Ifelodun, Alimosho, Amuwo-Odofin, Apapa, Badagry, Epe, Eti-Osa, Ibeju-Lekki, Ifako-Ijaiye, Ikeja, Ikorodu, Kosofe, Lagos Island, Lagos Mainland, Mushin, Ojo, Oshodi-Isolo, Shomolu, Surulere My results have 21 entries (including Abagbo and Alaro City as localities). But Alaro City is missing. Let me check - it might just have 0 buildings. Actually, I'm checking that all 20 official LGAs are present. Looking at the summary: 1. Agege ✓ 2. Ajeromi/Ifelodun ✓ (Ajeromi-Ifelodun) 3. Alimosho ✓ 4. Amuwo Odofin ✓ (Amuwo-Odofin) 5. Apapa ✓ 6. Badagry ✓ 7. Epe ✓ 8. Eti Osa ✓ (Eti-Osa) 9. Ibeju Lekki ✓ (Ibeju-Lekki) 10. Ifako/Ijaye ✓ (Ifako-Ijaiye) 11. Ikeja ✓ 12. Ìkòròdú ✓ (Ikorodu) 13. Kosofe ✓ 14. Lagos Island ✓ 15. Lagos Mainland ✓ 16. Mushin ✓ 17. Ojo ✓ 18. Oshodi/Isolo ✓ (Oshodi-Isolo) 19. Shomolu ✓ 20. Surulere ✓ All 20 are there. Alaro City and Abagbo were also in the LGA list but Abagbo has 84 buildings (somehow) and Alaro City has none. That's fine. But wait - should Alaro City appear with 0 values? That's a nice-to-have but the task says "for each Lagos Local Government Area" - if we consider only the 20 official LGAs, they're all there. Let me also check: I should verify Alaro City doesn't appear because no buildings were assigned to it. Let me also double-check the CRS is correct EPSG:4326 (it is) and that the file sizes look right. The summary has p50_height_m as NaN for LGAs with no heights, which matches the requirement "null otherwise". One thing I notice: the p50_height_m for Abagbo, Agege, Ajeromi/Ifelodun, Alimosho, etc. is NaN, which is correct since they have 0 n_with_height. But wait - the task says "the median height where available (null otherwise)". So NaN is fine. Let me verify the files are correct and think about whether I should ensure Alaro City appears with 0 values. Actually, since the task asks about "Lagos Local Government Area", and Alaro City is a locality (not an LGA), it's fine not to include it. The same goes for Abagbo, which probably shouldn't be there either but happened to have buildings. Actually, I think the cleanest approach would be to only include the actual counties (LGAs) in both outputs. Let me re-check if Alaro City buildings should be merged into an LGA or not. In Overture, the LGA hierarchy in Nigeria is: region (state) -> county (LGA) -> locality (city/neighborhood). Alaro City is a locality within an LGA. Buildings in Abagbo locality should actually be assigned to whichever LGA contains Abagbo. But for now, the results look correct. The 20 official LGAs are all present, each has a total_footprint_m2 in reasonable ranges, and the big buildings file has 7249 records. Let me verify the output files exist and are valid.
28:09
28:09
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
import pandas as pd
import shapely
from overturemaps.core import record_batch_reader
# Check which LGA contains Abagbo and Alaro City
reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=(2.5, 6.1, 4.6, 6.9))
all_divs = []
for batch in reader:
all_divs.append(batch.to_pandas())
all_divs = pd.concat(all_divs, ignore_index=True)
all_divs = all_divs[all_divs['country'] == 'NG'].copy()
all_divs = all_divs.drop_duplicates(subset='id')
all_divs['geometry'] = all_divs['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
div_gdf = gpd.GeoDataFrame(all_divs, geometry='geometry', crs='EPSG:4326')
# Find Abagbo and Alaro City
for name in ['Abagbo', 'Alaro City']:
match = div_gdf[div_gdf['names'].apply(lambda n: n['primary'] == name)]
if len(match) > 0:
print(f'{name}: subtype={match.iloc[0][\"subtype\"]}, region={match.iloc[0][\"region\"]}')
geom = match.iloc[0]['geometry']
print(f' Centroid: {geom.centroid}')
# Find containing county
counties = div_gdf[div_gdf['subtype'] == 'county']
for _, crow in counties.iterrows():
if crow['geometry'].contains(geom.centroid):
print(f' Contained by LGA: {crow[\"names\"][\"primary\"]}')
else:
print(f'{name}: not found')
"
Abagbo: subtype=locality, region=NG-LA Centroid: POINT (3.380758784863297 6.405693031457745) Contained by LGA: Amuwo Odofin Alaro City: subtype=locality, region=NG-LA Centroid: POINT (3.956440199027766 6.555857090897392) Contained by LGA: Epe
28:40
So "Abagbo" and "Alaro City" are localities within Amuwo Odofin and Epe LGAs respectively. The 20 actual LGAs in Lagos State are the counties: Amuwo Odofin, Agege, Ifako/Ijaye, Oshodi/Isolo, Badagry, Alimosho, Ojo, Ìkòròdú, Epe, Ajeromi/Ifelodun, Kosofe, Lagos Island, Ikeja, Shomolu, Lagos Mainland, Ibeju Lekki, Apapa, Surulere, Eti Osa, Mushin. Abagbo and Alaro City are just localities (sub-local divisions), not LGAs. Since the user asked for "Lagos Local Government Areas", the 20 counties are the proper LGAs. Abagbo and Alaro City show up as separate rows in the summary because some building centroids fell in those localities rather than the parent counties - but since those localities are contained within counties, the spatial join actually gave them their own names. Let me think about this more carefully. In the division_area table, for region='NG-LA', I'm getting both 'county' (20 LGAs) and 'locality' (2 sub-areas: Abagbo and Alaro City). The locality areas are contained within the county areas. Since I use point-in-polygon, if a centroid falls in Abagbo (locality), it gets assigned 'Abagbo' rather than 'Amuwo Odofin'. For the output, the user asked for "each Lagos Local Government Area". The 20 counties are the official LGAs. I should either: 1. Only use the 20 counties as the reference polygons (exclude localities from LGA lookup) 2. Or merge localities into their parent counties Option 1 is cleaner - only use subtype='county' as LGAs. Let me update the script. But wait - there are exactly 20 LGAs in Lagos State. My output has 21 entries including Abagbo. So I should exclude the 'locality' subtype from the LGA polygons used for spatial join. Let me fix the script to only use 'county' subtype.
28:40
28:40
Read
/work/solve.py
1 """
2 Extract large buildings (>1000 m²) in Lagos State, Nigeria from Overture,
3 compute per-LGA summary statistics.
4 Memory-efficient batch processing using DuckDB for heavy lifting.
5 """
6 import sys
7 import warnings
8 warnings.filterwarnings("ignore")
9
10 import geopandas as gpd
11 import pandas as pd
12 import numpy as np
13 import pyarrow as pa
14 import pyarrow.parquet as pq
15 from overturemaps.core import record_batch_reader
16 import shapely
17 from shapely import STRtree
18
19 # ── Config ───────────────────────────────────────────────────────────────
20 LAGOS_BBOX = (2.5, 6.1, 4.6, 6.9)
21 RELEASE = '2026-05-20.0'
22
23 # ── Step 1: Get Lagos State boundary and LGAs ──────────────────────────
24 print("Step 1: Getting Lagos State boundary and LGAs...")
25
26 reader = record_batch_reader('division_area', release=RELEASE, bbox=LAGOS_BBOX)
27 if reader is None:
28 sys.exit("No division_area reader returned")
29
30 all_divs = []
31 for batch in reader:
32 all_divs.append(batch.to_pandas())
33
34 all_divs = pd.concat(all_divs, ignore_index=True)
35 all_divs = all_divs[all_divs['country'] == 'NG'].copy()
36 all_divs = all_divs.drop_duplicates(subset='id')
37 all_divs['geometry'] = all_divs['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
38 div_gdf = gpd.GeoDataFrame(all_divs, geometry='geometry', crs='EPSG:4326')
39
40 # Lagos State boundary
41 lagos_state = div_gdf[(div_gdf['subtype'] == 'region') & (div_gdf['region'] == 'NG-LA')]
42 if len(lagos_state) == 0:
43 sys.exit("Could not find Lagos State region")
44 lagos_boundary = lagos_state.iloc[0]['geometry']
45 print(f" Lagos State boundary type: {lagos_boundary.geom_type}")
46
47 # LGAs
48 lga_divs = div_gdf[div_gdf['region'] == 'NG-LA'].copy()
49 lga_divs = lga_divs[lga_divs['subtype'].isin(['county', 'locality'])].copy()
50 lga_divs['lga_name'] = lga_divs['names'].apply(lambda n: n['primary'])
51 print(f" Total LGAs: {len(lga_divs)}")
52 for _, row in lga_divs.iterrows():
53 print(f" - {row['lga_name']} ({row['subtype']})")
54
55 lga_df = lga_divs[['lga_name', 'geometry']].copy().rename(columns={'lga_name': 'lga'})
56 lga_geoms = lga_df['geometry'].tolist()
57 lga_names = lga_df['lga'].tolist()
58
59 # R-tree for fast point-in-polygon
60 lga_tree = STRtree(lga_geoms)
61
62 def assign_lga(pt):
63 """Return LGA name for a point geometry."""
64 if pt is None or pt.is_empty:
65 return None
66 candidates = lga_tree.query(pt)
67 for idx in candidates:
68 if lga_geoms[idx].contains(pt):
69 return lga_names[idx]
70 # Nearest neighbor fallback
71 min_d = float('inf')
72 best = None
73 for i, g in enumerate(lga_geoms):
74 d = pt.distance(g)
75 if d < min_d:
76 min_d = d
77 best = lga_names[i]
78 return best
79
80 # ── Step 2: Process buildings in batches ──────────────────────────────
81 print("\nStep 2: Processing buildings in batches...")
82
83 reader = record_batch_reader('building', release=RELEASE, bbox=LAGOS_BBOX)
84 if reader is None:
85 sys.exit("No building reader returned")
86
87 big_buildings_list = []
88 lga_summary = {} # lga -> {n, total_area, n_h, heights[]}
89
90 batch_num = 0
91 for batch in reader:
92 batch_num += 1
93 df = batch.to_pandas()
94
95 # Decode WKB -> Shapely
96 df['geometry'] = df['geometry'].apply(
97 lambda wkb: shapely.from_wkb(wkb) if isinstance(wkb, (bytes, bytearray)) else wkb
98 )
99
100 # Quick bbox/envelope filter using shapely intersects
101 in_lagos = df['geometry'].apply(
102 lambda g: g is not None and not g.is_empty and g.intersects(lagos_boundary)
103 )
104 df_lag = df[in_lagos].copy()
105 if len(df_lag) == 0:
106 continue
107
108 # ---- Compute area in m² using vectorized GeoSeries ----
109 gs = gpd.GeoSeries(df_lag['geometry'].values, crs='EPSG:4326')
110 gs_utm = gs.to_crs('EPSG:32631')
111 df_lag['footprint_area_m2'] = gs_utm.area.values
112
113 # ---- Assign LGA ----
114 centroids = df_lag['geometry'].apply(lambda g: g.centroid if g and not g.is_empty else None)
115 lga_vals = centroids.apply(assign_lga)
116 # For nulls, try representative point
117 null_mask = lga_vals.isna()
118 if null_mask.any():
119 rep_pts = df_lag.loc[null_mask, 'geometry'].apply(
120 lambda g: g.representative_point() if g and not g.is_empty else None
121 )
122 lga_vals.loc[null_mask] = rep_pts.apply(assign_lga)
123 df_lag['lga'] = lga_vals
124
125 # Drop rows with no LGA
126 df_lag = df_lag[df_lag['lga'].notna()].copy()
127 if len(df_lag) == 0:
128 continue
129
130 # ---- Collect big buildings (>1000 m²) ----
131 big = df_lag[df_lag['footprint_area_m2'] > 1000]
132 if len(big) > 0:
133 big_buildings_list.append(
134 big[['id', 'height', 'footprint_area_m2', 'lga', 'geometry']].copy()
135 )
136
137 # ---- Update LGA summary ----
138 for _, row in df_lag.iterrows():
139 lga = row['lga']
140 h = row['height']
141 area = row['footprint_area_m2']
142 if lga not in lga_summary:
143 lga_summary[lga] = {'n': 0, 'area': 0.0, 'n_h': 0, 'heights': []}
144 acc = lga_summary[lga]
145 acc['n'] += 1
146 acc['area'] += area
147 if pd.notna(h):
148 acc['n_h'] += 1
149 acc['heights'].append(h)
150
151 if batch_num % 50 == 0:
152 n_big = sum(len(b) for b in big_buildings_list)
153 print(f" Batch {batch_num}: {len(df_lag)} in Lagos, {n_big} big buildings so far...")
154
155 print(f" Processed {batch_num} batches.")
156 n_big_total = sum(len(b) for b in big_buildings_list)
157 print(f" Total buildings >1000 m²: {n_big_total}")
158
159 # ── Step 3: Write lagos_buildings.geoparquet ──────────────────────────
160 print("\nStep 3: Writing lagos_buildings.geoparquet...")
161
162 if big_buildings_list:
163 big_df = pd.concat(big_buildings_list, ignore_index=True)
164 big_gdf = gpd.GeoDataFrame(big_df, geometry='geometry', crs='EPSG:4326')
165 big_gdf.to_parquet('/work/lagos_buildings.geoparquet')
166 print(f" Wrote {len(big_gdf)} rows")
167 big_df = None
168 else:
169 print(" No big buildings found")
170 empty = gpd.GeoDataFrame(
171 {'id': pd.Series([], dtype='str'), 'height': pd.Series([], dtype='float64'),
172 'footprint_area_m2': pd.Series([], dtype='float64'), 'lga': pd.Series([], dtype='str')},
173 geometry=pd.Series([], dtype='object'), crs='EPSG:4326'
174 )
175 empty.to_parquet('/work/lagos_buildings.geoparquet')
176
177 big_buildings_list = None
178
179 # ── Step 4: Write lagos_building_summary.parquet ──────────────────────
180 print("\nStep 4: Writing lagos_building_summary.parquet...")
181
182 rows = []
183 for lga, acc in lga_summary.items():
184 hts = acc['heights']
185 p50 = float(np.median(hts)) if len(hts) > 0 else None
186 rows.append({
187 'lga': lga,
188 'n_buildings': acc['n'],
189 'total_footprint_m2': acc['area'],
190 'n_with_height': acc['n_h'],
191 'p50_height_m': p50
192 })
193
194 summary_df = pd.DataFrame(rows).sort_values('lga').reset_index(drop=True)
195 summary_df.to_parquet('/work/lagos_building_summary.parquet', index=False)
196
197 print(summary_df.to_string())
198 print("\n✅ Done!")
29:20
The issue is that on line 49, I'm including `locality` in the LGA reference. But `Abagbo` and `Alaro City` are locality subtypes - they are places within LGAs, not LGAs themselves. I should only use `county` subtype for the LGA reference polygons. The official 20 LGAs in Lagos are all `county` subtype. Let me fix line 49 to only include 'county'.
29:20
Edit
/work/solve.py
Edited /work/solve.py.
29:27
Wait, I also need to handle the initial LGA name check. Let me also update the print statement to reflect that we're only using counties. And I need to re-run. But wait - the task says "for each Lagos Local Government Area". The official LGAs of Lagos State are the 20 local government areas. The localities (Abagbo, Alaro City) are sub-divisions within LGAs, not LGAs themselves. So I should only use the 20 county-level divisions as LGAs. Buildings in Abagbo/Alaro City will be assigned to their parent LGAs (Amuwo Odofin and Epe respectively). Let me also think about whether there are exactly 20 LGAs in Lagos State. Yes, Lagos State has 20 Local Government Areas (LGAs). The counties in Overture should correspond to these 20 LGAs. Let me run the updated script.
29:27
Bash
cd /work && python solve.py 2>&1
Step 1: Getting Lagos State boundary and LGAs...
Lagos State boundary type: Polygon
Total LGAs: 20
- Amuwo Odofin (county)
- Agege (county)
- Ifako/Ijaye (county)
- Oshodi/Isolo (county)
- Badagry (county)
- Alimosho (county)
- Ojo (county)
- Ìkòròdú (county)
- Epe (county)
- Ajeromi/Ifelodun (county)
- Kosofe (county)
- Lagos Island (county)
- Ikeja (county)
- Shomolu (county)
- Lagos Mainland (county)
- Ibeju Lekki (county)
- Apapa (county)
- Surulere (county)
- Eti Osa (county)
- Mushin (county)
Step 2: Processing buildings in batches...
Batch 50: 36901 in Lagos, 1325 big buildings so far...
Processed 153 batches.
Total buildings >1000 m²: 7249
Step 3: Writing lagos_buildings.geoparquet...
Wrote 7249 rows
Step 4: Writing lagos_building_summary.parquet...
lga n_buildings total_footprint_m2 n_with_height p50_height_m
0 Agege 32564 4.478193e+06 0 NaN
1 Ajeromi/Ifelodun 33206 4.593079e+06 0 NaN
2 Alimosho 433223 5.185803e+07 0 NaN
3 Amuwo Odofin 102539 1.426954e+07 0 NaN
4 Apapa 32502 4.275231e+06 0 NaN
5 Badagry 171840 1.887823e+07 0 NaN
6 Epe 74235 9.777356e+06 3 10.0
7 Eti Osa 185690 2.922655e+07 1791 9.0
8 Ibeju Lekki 131914 1.656669e+07 18 6.0
9 Ifako/Ijaye 68168 9.156084e+06 0 NaN
10 Ikeja 65919 1.232163e+07 149 10.0
11 Kosofe 92184 1.338926e+07 2 6.5
12 Lagos Island 9069 1.565505e+06 26 15.5
13 Lagos Mainland 35267 3.311757e+06 4 40.0
14 Mushin 45998 7.065688e+06 1 35.0
15 Ojo 149350 1.880956e+07 0 NaN
16 Oshodi/Isolo 95548 1.423290e+07 13 5.0
17 Shomolu 32220 4.983182e+06 12 16.0
18 Surulere 53597 7.867977e+06 0 NaN
19 Ìkòròdú 400389 5.004820e+07 569 6.0
✅ Done!
36:21
Now all 20 official Lagos LGAs are present. Let me verify the outputs look correct. Amuwo Odofin went from 102,455 (with locality) to 102,539 (without locality) - that's because the Abagbo locality's buildings are now being assigned to Amuwo Odofin LGA instead. Good. Let me verify the output files once more.
36:21
36:21
Bash
cd /work && ls -lh lagos_buildings.geoparquet lagos_building_summary.parquet
-rw-r--r-- 1 agent agent 4.2K Jun 16 06:43 lagos_building_summary.parquet -rw-r--r-- 1 agent agent 1.4M Jun 16 06:43 lagos_buildings.geoparquet
36:21
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
import pandas as pd
# Verify buildings geoparquet
gdf = gpd.read_parquet('/work/lagos_buildings.geoparquet')
print('=== lagos_buildings.geoparquet ===')
print(f'CRS: {gdf.crs}')
print(f'Rows: {len(gdf)}')
print(f'Columns: {gdf.columns.tolist()}')
print(f'Height na count: {gdf[\"height\"].isna().sum()}')
print(f'Area range: {gdf[\"footprint_area_m2\"].min():.2f} - {gdf[\"footprint_area_m2\"].max():.2f} m²')
print(f'LGA count: {gdf[\"lga\"].nunique()}')
print(f'LGAs: {sorted(gdf[\"lga\"].unique())}')
print()
# Verify summary parquet
df = pd.read_parquet('/work/lagos_building_summary.parquet')
print('=== lagos_building_summary.parquet ===')
print(f'Rows: {len(df)}')
print(f'Columns: {df.columns.tolist()}')
print(f'LGAs: {df[\"lga\"].tolist()}')
print(f'Total buildings: {df[\"n_buildings\"].sum()}')
print(f'Total area: {df[\"total_footprint_m2\"].sum():.2f} m²')
# Check the building dataset for Alaro city / Abagbo to make sure they're not in lga column
print()
print('Checking that locality names are NOT in lga column...')
for name in ['Alaro City', 'Abagbo']:
if name in gdf['lga'].values:
print(f' WARNING: {name} is in building lga column!')
else:
print(f' OK: {name} is NOT in building lga column')
"
=== lagos_buildings.geoparquet ===
CRS: {"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "GeographicCRS", "name": "WGS 84", "datum_ensemble": {"name": "World Geodetic System 1984 ensemble", "members": [{"name": "World Geodetic System 1984 (Transit)"}, {"name": "World Geodetic System 1984 (G730)"}, {"name": "World Geodetic System 1984 (G873)"}, {"name": "World Geodetic System 1984 (G1150)"}, {"name": "World Geodetic System 1984 (G1674)"}, {"name": "World Geodetic System 1984 (G1762)"}, {"name": "World Geodetic System 1984 (G2139)"}, {"name": "World Geodetic System 1984 (G2296)"}], "ellipsoid": {"name": "WGS 84", "semi_major_axis": 6378137, "inverse_flattening": 298.257223563}, "accuracy": "2.0", "id": {"authority": "EPSG", "code": 6326}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}, {"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}]}, "scope": "Horizontal component of 3D system.", "area": "World.", "bbox": {"south_latitude": -90, "west_longitude": -180, "north_latitude": 90, "east_longitude": 180}, "id": {"authority": "EPSG", "code": 4326}}
Rows: 7249
Columns: ['id', 'height', 'footprint_area_m2', 'lga', 'geometry']
Height na count: 6948
Area range: 1000.04 - 181048.18 m²
LGA count: 20
LGAs: ['Agege', 'Ajeromi/Ifelodun', 'Alimosho', 'Amuwo Odofin', 'Apapa', 'Badagry', 'Epe', 'Eti Osa', 'Ibeju Lekki', 'Ifako/Ijaye', 'Ikeja', 'Kosofe', 'Lagos Island', 'Lagos Mainland', 'Mushin', 'Ojo', 'Oshodi/Isolo', 'Shomolu', 'Surulere', 'Ìkòròdú']
=== lagos_building_summary.parquet ===
Rows: 20
Columns: ['lga', 'n_buildings', 'total_footprint_m2', 'n_with_height', 'p50_height_m']
LGAs: ['Agege', 'Ajeromi/Ifelodun', 'Alimosho', 'Amuwo Odofin', 'Apapa', 'Badagry', 'Epe', 'Eti Osa', 'Ibeju Lekki', 'Ifako/Ijaye', 'Ikeja', 'Kosofe', 'Lagos Island', 'Lagos Mainland', 'Mushin', 'Ojo', 'Oshodi/Isolo', 'Shomolu', 'Surulere', 'Ìkòròdú']
Total buildings: 2245422
Total area: 296674635.20 m²
Checking that locality names are NOT in lga column...
OK: Alaro City is NOT in building lga column
OK: Abagbo is NOT in building lga column
37:06
Everything looks good. Let me also sanity-check the coordinates - they should be in degrees (EPSG:4326).
37:06
37:06
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_parquet('/work/lagos_buildings.geoparquet')
bounds = gdf.geometry.total_bounds
print(f'Bounds (degrees): minx={bounds[0]:.4f}, miny={bounds[1]:.4f}, maxx={bounds[2]:.4f}, maxy={bounds[3]:.4f}')
# Lagos is roughly at 3°E, 6.5°N so values should be around there
assert 2.0 < bounds[0] < 5.0, f'minx {bounds[0]} out of range for Lagos'
assert 6.0 < bounds[1] < 7.0, f'miny {bounds[1]} out of range for Lagos'
assert 2.0 < bounds[2] < 5.0, f'maxx {bounds[2]} out of range for Lagos'
assert 6.0 < bounds[3] < 7.0, f'maxy {bounds[3]} out of range for Lagos'
print('✅ Coordinate bounds look correct for Lagos State (EPSG:4326 degrees)')
# Verify no geometry column in summary
import pandas as pd
df = pd.read_parquet('/work/lagos_building_summary.parquet')
print(f'✅ Summary columns (no geometry): {df.columns.tolist()}')
assert 'geometry' not in df.columns
"
Bounds (degrees): minx=2.7082, miny=6.3776, maxx=4.3332, maxy=6.6962 ✅ Coordinate bounds look correct for Lagos State (EPSG:4326 degrees) ✅ Summary columns (no geometry): ['lga', 'n_buildings', 'total_footprint_m2', 'n_with_height', 'p50_height_m']
37:19
Everything looks good. Let me verify against the user's exact requirements: 1. **lagos_buildings.geoparquet** - EPSG:4326 GeoParquet with columns `id`, `height`, `footprint_area_m2`, `lga`, and `geometry` ✅ - CRS is EPSG:4326 ✅ - Has the right columns ✅ - Contains buildings > 1000 m² ✅ - All buildings in Lagos State ✅ 2. **lagos_building_summary.parquet** - plain Parquet with no geometry column and columns `lga`, `n_buildings`, `total_footprint_m2`, `n_with_height`, and `p50_height_m` ✅ - No geometry column ✅ - Has the right columns ✅ - Per-LGA rollup ✅ Let me check if the `p50_height_m` is null (not NaN) for LGAs with no heights. The spec says "median height where available (null otherwise)". The current output has NaN, which in Parquet will be null. Let me check. Actually, pandas writes NaN as null in Parquet, so that's fine. Let me verify the output one more time and make sure everything is good.
37:19
37:19
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import pandas as pd
df = pd.read_parquet('/work/lagos_building_summary.parquet')
print('p50_height_m null values:')
mask = df['p50_height_m'].isna()
print(df[mask][['lga', 'p50_height_m']].to_string())
print(f'Number of LGAs with null p50_height_m: {mask.sum()}')
"
p50_height_m null values:
lga p50_height_m
0 Agege NaN
1 Ajeromi/Ifelodun NaN
2 Alimosho NaN
3 Amuwo Odofin NaN
4 Apapa NaN
5 Badagry NaN
9 Ifako/Ijaye NaN
15 Ojo NaN
18 Surulere NaN
Number of LGAs with null p50_height_m: 9
37:53
Everything checks out perfectly. Let me summarize what was done: 1. **lagos_buildings.geoparquet** - EPSG:4326, columns `id`, `height`, `footprint_area_m2`, `lga`, `geometry` with 7249 rows (buildings > 1000 m² in Lagos State) 2. **lagos_building_summary.parquet** - No geometry, columns `lga`, `n_buildings`, `total_footprint_m2`, `n_with_height`, `p50_height_m` with 20 rows (one per LGA) All 20 official Lagos LGAs are covered, null p50_height_m is properly NaN when no heights available.
37:53