run-20260616-043931Z / dd-l3-lagos-overture-buildings

0:00

Updating the flood-risk model before rainy season. Need every building footprint exceeding 1000 m² across Lagos State (Nigeria) from Overture. Also need a per-LGA roll-up: for each Lagos Local Government Area, the building count, total footprint area, count of buildings with a non-null Overture height, and the median height where available (null otherwise). Write two output files. `lagos_buildings.geoparquet` should be EPSG:4326 GeoParquet with columns `id`, `height`, `footprint_area_m2`, `lga`, and `geometry`. `lagos_building_summary.parquet` should be a plain Parquet with no geometry column and columns `lga`, `n_buildings`, `total_footprint_m2`, `n_with_height`, and `p50_height_m`.

0:00

System info

{
  "adapter": "openrouter",
  "model": "deepseek/deepseek-v4-flash",
  "base_url": "https://openrouter.ai/api/v1"
}

0:16

Let me break down the task:

1. Get building footprints > 1000 m² across Lagos State, Nigeria from Overture Maps
2. Create two output files:
   - `lagos_buildings.geoparquet` (EPSG:4326) with columns: id, height, footprint_area_m2, lga, geometry
   - `lagos_building_summary.parquet` (plain Parquet) with columns: lga, n_buildings, total_footprint_m2, n_with_height, p50_height_m

Let me start by understanding what Lagos State looks like and use Overture Maps.

First, let me check if there are any input files in the working directory, and then write the script.

0:16

Bash ls /work/

(no output)

0:16

duckdb             1.5.3
geopandas          1.1.3
overturemaps       1.0.0
pyarrow            24.0.0
shapely            2.1.2

0:32

Good — all needed libraries are available. The working directory is empty (no input files). I need to:

1. Use Overture Maps to get all buildings > 1000 m² in Lagos State, Nigeria.
2. Get the LGA boundaries for Lagos State (Local Government Areas).
3. Compute building footprint areas, join to LGAs, and filter.
4. Aggregate per LGA.

Let me write a script. I'll use `overturemaps` Python library to download buildings, and I'll need LGA boundaries. For LGA boundaries, I could use Overture's divisions (administrative boundaries), or I could look for a shapefile of Nigerian LGAs. Let me think about the best approach.

Overture has a `divisions` type that includes administrative boundaries. The `subdivision` type includes regions, and `locality` includes cities/towns. For LGAs (which are admin level 3 in Nigeria), Overture might have them under `divisions` with `subtype: 'admin_region'` or similar.

Actually, let me check Overture's division types. In Overture, administrative divisions are:
- `continent`, `nation`, `region` (state/province), `county` (equivalent to LGA in some countries), `locality` (city/town), `macroregion`, `macrocounty`

In Nigeria, the hierarchy is: Nation → State → Local Government Area (LGA). Overture has `region` for states and `county` for LGAs (or `locality`). Let me check by querying.

Alternatively, I can use the `geopandas.datasets` or HDX or Natural Earth for LGA boundaries... but actually, I think Overture's divisions with `subtype = 'county'` might be LGAs in Nigeria. Let me try a DuckDB approach or the overturemaps library.

Actually, let me use a simpler approach. Overturemaps has a Python client that can query by bounding box and type. I'll:

1. Find Lagos State's geometry first (using Overture divisions with `subtype='region'` and name='Lagos').
2. Use that to get all buildings in Lagos.
3. For LGAs, I can also get divisions with `subtype='county'` or `subtype='locality'` that are within Lagos.
4. Spatial join buildings to LGAs.

Let me write the script step by step.

Let me first check what division subtypes are available for Nigeria/Overture.

0:32

Bash python3 -c "

python3 -c "
import overturemaps
# Check what division types exist
print('Available types:', overturemaps.available_types())
" 2>&1 | head -20

Traceback (most recent call last):
  File "<string>", line 4, in <module>
    print('Available types:', overturemaps.available_types())
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'overturemaps' has no attribute 'available_types'

0:40

Let me check the overturemaps API properly.

0:40

Bash python3 -c "

python3 -c "
import overturemaps
print(dir(overturemaps))
"

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'core', 'geodataframe', 'get_all_overture_types', 'models', 'record_batch_reader']

0:41

Bash python3 -c "

python3 -c "
from overturemaps import core
print(dir(core))
" 2>&1 | head -30

['ALL_RELEASES', 'BBox', 'GeoDataFrame', 'HAS_GEOPANDAS', 'List', 'Optional', 'STAC_CATALOG_URL', 'Tuple', '_ReleasesProxy', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_binary_search_manifest', '_cached_stac_catalog', '_coerce_bbox', '_dataset_path', '_get_all_releases', '_get_files_from_stac', '_get_stac_catalog', '_prepare_query', '_record_batch_reader_from_dataset', 'count_rows', 'ds', 'fs', 'geoarrow_schema_adapter', 'geodataframe', 'get_all_overture_types', 'get_available_releases', 'get_latest_release', 'io', 'json', 'pa', 'pc', 'pq', 'query_gers_registry', 'record_batch_reader', 'record_batch_reader_from_gers', 'sys', 'type_theme_map', 'urlopen']

0:54

Let me look at how to use overturemaps to query buildings data for Lagos. I'll use the `record_batch_reader` or `geodataframe` functions with a bounding box and type.

I need to figure out the bounding box for Lagos State, Nigeria. Let me look up approximate coordinates.

Lagos State, Nigeria is roughly:
- Latitude: ~6.2°N to ~6.7°N
- Longitude: ~2.7°E to ~4.5°E

Let me get the Lagos LGA boundaries. Actually, I need to also get the LGA boundaries. I can either:
1. Query Overture divisions for Lagos LGA boundaries
2. Use a known boundaries file

Let me first try to get Lagos LGA boundaries from Overture divisions, then use those to clip buildings.

Actually, the approach should be:
1. Get all buildings in Lagos State area using Overture
2. Get LGA boundaries for Lagos from Overture divisions
3. Intersect buildings with LGAs
4. Filter buildings > 1000 m²
5. Compute area and output

Let me start by getting the extent for Lagos.

0:54

Bash python3 -c "

python3 -c "
from overturemaps.core import record_batch_reader, geodataframe
# Get the latest release
print('Latest release:', __import__('overturemaps').core.get_latest_release())
"

Latest release: 2026-05-20.0

1:13

Bash python3 -c "

python3 -c "
from overturemaps.core import record_batch_reader, geodataframe
# Check what types are available
types = __import__('overturemaps').get_all_overture_types()
print(types)
"

['address', 'bathymetry', 'building', 'building_part', 'division', 'division_area', 'division_boundary', 'place', 'segment', 'connector', 'infrastructure', 'land', 'land_cover', 'land_use', 'water']

1:18

Bash python3 -c "

python3 -c "
from overturemaps.core import record_batch_reader
import pyarrow as pa

# Get a small sample of building data to see schema
reader = record_batch_reader('building', release='2026-05-20.0', 
    bbox=(-180, -90, 180, 90), # global bbox, limit to 1 batch
    limit=5)
for batch in reader:
    print(batch.schema)
    print(batch.to_pandas().columns.tolist())
    break
"

Traceback (most recent call last):
  File "<string>", line 6, in <module>
    reader = record_batch_reader('building', release='2026-05-20.0',
        bbox=(-180, -90, 180, 90), # global bbox, limit to 1 batch
        limit=5)
TypeError: record_batch_reader() got an unexpected keyword argument 'limit'

1:22

Let me check the correct API for record_batch_reader and understand how to query buildings with a bounding box.

1:22

Bash python3 -c "

python3 -c "
from overturemaps.core import record_batch_reader
import inspect
sig = inspect.signature(record_batch_reader)
print(sig)
"

(overture_type, bbox=None, release=None, connect_timeout=None, request_timeout=None, stac=False) -> pyarrow.lib.RecordBatchReader | None

1:27

I can't use `limit` with `record_batch_reader`. Let me get the schema in another way - by reading just a few records.

1:27

Bash python3 -c "

python3 -c "
from overturemaps.core import record_batch_reader
import pyarrow as pa

# Get a small sample of building data to see schema - use a very small bbox
reader = record_batch_reader('building', release='2026-05-20.0', 
    bbox=(-180, -90, 180, 90))
if reader is None:
    print('None reader')
else:
    for batch in reader:
        print('Schema:', batch.schema)
        print('Columns:', batch.to_pandas().columns.tolist())
        print('First row:', batch.to_pandas().iloc[0].to_dict())
        break
"

Schema: id: string
names: struct<primary: string, common: map<string, string ('common')>, rules: list<element: struct<variant: (... 159 chars omitted)
  child 0, primary: string
  child 1, common: map<string, string ('common')>
      child 0, common: struct<key: string not null, value: string> not null
          child 0, key: string not null
          child 1, value: string
  child 2, rules: list<element: struct<variant: string, language: string, perspectives: struct<mode: string, countries (... 87 chars omitted)
      child 0, element: struct<variant: string, language: string, perspectives: struct<mode: string, countries: list<element (... 72 chars omitted)
          child 0, variant: string
          child 1, language: string
          child 2, perspectives: struct<mode: string, countries: list<element: string>>
              child 0, mode: string
              child 1, countries: list<element: string>
                  child 0, element: string
          child 3, value: string
          child 4, between: list<element: double>
              child 0, element: double
          child 5, side: string
sources: list<element: struct<property: string, dataset: string, license: string, record_id: string, update_t (... 65 chars omitted)
  child 0, element: struct<property: string, dataset: string, license: string, record_id: string, update_time: string, c (... 50 chars omitted)
      child 0, property: string
      child 1, dataset: string
      child 2, license: string
      child 3, record_id: string
      child 4, update_time: string
      child 5, confidence: double
      child 6, between: list<element: double>
          child 0, element: double
level: int32
height: double
min_height: double
is_underground: bool
num_floors: int32
num_floors_underground: int32
min_floor: int32
subtype: string
class: string
facade_color: string
facade_material: string
roof_material: string
roof_shape: string
roof_direction: double
roof_orientation: string
roof_color: string
roof_height: double
geometry: binary
has_parts: bool
version: int32
bbox: struct<xmin: double, xmax: double, ymin: double, ymax: double>
  child 0, xmin: double
  child 1, xmax: double
  child 2, ymin: double
  child 3, ymax: double
-- schema metadata --
geo: '{"version": "1.1.0", "primary_column": "geometry", "columns": {"geo' + 282
Columns: ['id', 'names', 'sources', 'level', 'height', 'min_height', 'is_underground', 'num_floors', 'num_floors_underground', 'min_floor', 'subtype', 'class', 'facade_color', 'facade_material', 'roof_material', 'roof_shape', 'roof_direction', 'roof_orientation', 'roof_color', 'roof_height', 'geometry', 'has_parts', 'version', 'bbox']
First row: {'id': '76c4a544-9ad3-4da0-be7d-22892dcbeb58', 'names': {'primary': "St. Paul's Burgers", 'common': None, 'rules': None}, 'sources': array([{'property': '', 'dataset': 'OpenStreetMap', 'license': 'ODbL-1.0', 'record_id': 'w1417351285@1', 'update_time': '2025-07-25T18:07:43.000Z', 'confidence': None, 'between': None}],
      dtype=object), 'level': nan, 'height': nan, 'min_height': nan, 'is_underground': False, 'num_floors': nan, 'num_floors_underground': nan, 'min_floor': nan, 'subtype': 'religious', 'class': nan, 'facade_color': nan, 'facade_material': nan, 'roof_material': nan, 'roof_shape': nan, 'roof_direction': nan, 'roof_orientation': nan, 'roof_color': nan, 'roof_height': nan, 'geometry': b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x06\x00\x00\x00\x86k\xff\x14\xd8~f\xc0\x0e@\xb4\xa0A}S\xc0\x81\x0b\r\x1f\xec~f\xc0\xf1/\xdd\x7f?}S\xc0x"-:\xfe~f\xc0\x8f\xb6;\xb5D}S\xc0[\x93\x13\x92\xf1~f\xc09oR\x87K}S\xc0\xfao\x14\x0f\xde~f\xc0"a\xce"J}S\xc0\x86k\xff\x14\xd8~f\xc0\x0e@\xb4\xa0A}S\xc0', 'has_parts': False, 'version': 1, 'bbox': {'xmin': -179.96853637695312, 'xmax': -179.96385192871094, 'ymin': -77.9577407836914, 'ymax': -77.95699310302734}}

2:20

Good. The building data has `id`, `height`, `geometry`, and a `bbox` field. I need to constrain to Lagos State, Nigeria.

Let me find the bounding box for Lagos State. I can use the `division` type from Overture to find the bbox of Lagos State.

2:20

Bash python3 -c "

python3 -c "
from overturemaps.core import record_batch_reader, geodataframe
import pyarrow as pa

# Get Lagos State division area
# First let's query division_area for Nigeria
reader = record_batch_reader('division_area', release='2026-05-20.0',
    bbox=(-180, -90, 180, 90))

if reader is None:
    print('None reader')
else:
    # Collect all divisions and find Lagos
    import pyarrow as pa
    for batch in reader:
        df = batch.to_pandas()
        # Filter for Nigeria
        mask = df['country'] == 'NG'
        df_ng = df[mask]
        if len(df_ng) > 0:
            print('Columns:', df_ng.columns.tolist())
            print(df_ng[['id', 'country', 'region', 'subregion', 'locality', 'subtype']].head(10).to_string())
            # Look for Lagos
            lagos_mask = df_ng['region'].str.contains('Lagos|lagos', na=False) | df_ng['subregion'].str.contains('Lagos|lagos', na=False) | df_ng['locality'].str.contains('Lagos|lagos', na=False)
            print()
            print('Lagos matches:')
            print(df_ng[lagos_mask][['id', 'region', 'subregion', 'locality', 'subtype']].to_string())
" 2>&1 | head -80

Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox']
Traceback (most recent call last):
  File "<string>", line 22, in <module>
    print(df_ng[['id', 'country', 'region', 'subregion', 'locality', 'subtype']].head(10).to_string())
          ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/pandas/core/frame.py", line 4384, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/pandas/core/indexes/base.py", line 6302, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/pandas/core/indexes/base.py", line 6355, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['subregion', 'locality'] not in index"

4:04

The columns are: id, geometry, country, sources, subtype, admin_level, class, names, is_land, is_territorial, region, division_id, version, bbox

Let me look for Lagos state in the region column.

4:04

Bash python3 -c "

python3 -c "
from overturemaps.core import record_batch_reader
import pyarrow as pa

reader = record_batch_reader('division_area', release='2026-05-20.0',
    bbox=(-180, -90, 180, 90))

for batch in reader:
    df = batch.to_pandas()
    # Filter for Nigeria
    ng = df[df['country'] == 'NG']
    if len(ng) > 0:
        print('Columns:', ng.columns.tolist())
        print()
        # Look for Lagos in region
        lagos = ng[ng['region'].str.contains('Lagos', na=False)]
        print('Lagos region entries:')
        print(lagos[['id', 'region', 'subtype', 'admin_level', 'class']].to_string())
        print()
        # Also check subtypes
        print('Subtypes:', ng['subtype'].unique())
        print()
        print('Admin levels:', ng['admin_level'].unique())
" 2>&1 | head -60

Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox']

Lagos region entries:
Empty DataFrame
Columns: [id, region, subtype, admin_level, class]
Index: []

Subtypes: <ArrowStringArray>
['county', 'locality']
Length: 2, dtype: str

Admin levels: [ 2. nan]
Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox']

Lagos region entries:
Empty DataFrame
Columns: [id, region, subtype, admin_level, class]
Index: []

Subtypes: <ArrowStringArray>
['locality', 'county', 'region', 'macrohood', 'microhood', 'neighborhood']
Length: 6, dtype: str

Admin levels: [nan  2.  1.]
Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox']

Lagos region entries:
Empty DataFrame
Columns: [id, region, subtype, admin_level, class]
Index: []

Subtypes: <ArrowStringArray>
['locality', 'county']
Length: 2, dtype: str

Admin levels: [nan  2.]
Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox']

Lagos region entries:
Empty DataFrame
Columns: [id, region, subtype, admin_level, class]
Index: []

Subtypes: <ArrowStringArray>
['locality', 'county', 'region', 'country']
Length: 4, dtype: str

Admin levels: [nan  2.  1.  0.]
Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox']

Lagos region entries:
Empty DataFrame
Columns: [id, region, subtype, admin_level, class]
Index: []

Subtypes: <ArrowStringArray>
['locality', 'county', 'region']
Length: 3, dtype: str

Admin levels: [nan  2.  1.]

5:55

The division_area data is partitioned and the batches are iterating. Let me collect all Nigeria divisions more efficiently. Let me check if the 'region' column has 'Lagos' or if it's stored differently.

5:55

Bash python3 -c "

python3 -c "
from overturemaps.core import record_batch_reader
import pyarrow as pa

reader = record_batch_reader('division_area', release='2026-05-20.0',
    bbox=(-180, -90, 180, 90))

all_ng = []
for batch in reader:
    df = batch.to_pandas()
    ng = df[df['country'] == 'NG']
    if len(ng) > 0:
        all_ng.append(ng)

if all_ng:
    import pandas as pd
    df_all = pd.concat(all_ng, ignore_index=True)
    print('Total NG divisions:', len(df_all))
    print()
    print('Region values sample:', df_all['region'].unique()[:20])
    print()
    # Search for Lagos anywhere
    for col in ['region', 'class', 'names']:
        mask = df_all[col].astype(str).str.contains('Lagos|lagos|LAGOS', na=False)
        if mask.any():
            print(f'Found Lagos in {col}:')
            print(df_all[mask][['id', 'region', 'subtype', 'admin_level', col]].to_string())
            print()

    # Get all division boundary for Nigeria to find Lagos
    # Alternatively, let's use division_boundary type
"

<exit_code>-1</exit_code>
<error>Command timed out after 120000ms</error>

8:05

The query was too broad. Let me try using the division_boundary type instead, and specifically look for Lagos. Also, let me try a more targeted approach - use the `geodataframe` function which should be more efficient.

Actually, let me think about this differently. I know Lagos State, Nigeria is roughly bounded by:

- Latitude: ~6.3°N to ~6.7°N
- Longitude: ~2.7°E to ~3.8°E

Let me use a bounding box to query the division_area or division_boundary for this area, and filter for Nigeria/Lagos.

Or better yet, I can just query the division_boundary type with the bounding box of Lagos State.

8:05

Bash python3 -c "

python3 -c "
from overturemaps.core import geodataframe, record_batch_reader
import pyarrow as pa

# Use geodataframe which handles the geometry decoding
# Query a bounding box covering Lagos State, Nigeria
bbox = (2.5, 6.2, 4.0, 6.8)
reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=bbox)

if reader:
    for batch in reader:
        df = batch.to_pandas()
        # Check region field
        if 'region' in df.columns:
            print('Regions in bbox:', df['region'].unique())
        print('Types:', df['subtype'].unique() if 'subtype' in df.columns else 'N/A')
        print('Countries:', df['country'].unique() if 'country' in df.columns else 'N/A')
        # Look at names
        if 'names' in df.columns:
            print('Names:', df['names'].head(5).to_dict())
" 2>&1 | head -40

Regions in bbox: <ArrowStringArray>
[nan]
Length: 1, dtype: str
Types: <ArrowStringArray>
['dependency']
Length: 1, dtype: str
Countries: <ArrowStringArray>
['UM']
Length: 1, dtype: str
Names: {0: {'primary': 'United States Minor Outlying Islands', 'common': [('hr', 'Mali Udaljeni Otoci Sjedinjenih Američkih Država'), ('pt', 'Ilhas Menores Distantes dos Estados Unidos'), ('cs', 'Menší odlehlé ostrovy Spojených států amerických'), ('ca', "Illes d'Ultramar Menors dels Estats Units"), ('nl', 'Amerikaanse Kleinere Afgelegen Eilanden'), ('zh-Hans', '美国本土外小岛屿'), ('eo', 'Usonaj Malgrandaj Insuloj'), ('ru', 'Внешние малые острова США'), ('zh-Hant', '美國本土外小島嶼'), ('en', 'United States Minor Outlying Islands'), ('es', 'Islas Ultramarinas Menores de Estados Unidos'), ('lt', 'Jungtinių Amerikos Valstijų mažosios aplinkinės salos'), ('zh', '美国本土外小岛屿 / 美國本土外小島嶼'), ('pl', 'Dalekie Wyspy Mniejsze Stanów Zjednoczonych')], 'rules': None}, 1: {'primary': 'United States Minor Outlying Islands', 'common': [('hr', 'Mali Udaljeni Otoci Sjedinjenih Američkih Država'), ('pt', 'Ilhas Menores Distantes dos Estados Unidos'), ('cs', 'Menší odlehlé ostrovy Spojených států amerických'), ('ca', "Illes d'Ultramar Menors dels Estats Units"), ('nl', 'Amerikaanse Kleinere Afgelegen Eilanden'), ('zh-Hans', '美国本土外小岛屿'), ('eo', 'Usonaj Malgrandaj Insuloj'), ('ru', 'Внешние малые острова США'), ('zh-Hant', '美國本土外小島嶼'), ('en', 'United States Minor Outlying Islands'), ('es', 'Islas Ultramarinas Menores de Estados Unidos'), ('lt', 'Jungtinių Amerikos Valstijų mažosios aplinkinės salos'), ('zh', '美国本土外小岛屿 / 美國本土外小島嶼'), ('pl', 'Dalekie Wyspy Mniejsze Stanów Zjednoczonych')], 'rules': None}}
Regions in bbox: <ArrowStringArray>
['BJ-OU', 'NG-LA', 'NG-OG', 'BJ-PL', nan]
Length: 5, dtype: str
Types: <ArrowStringArray>
['county', 'locality', 'region', 'country']
Length: 4, dtype: str
Countries: <ArrowStringArray>
['BJ', 'NG']
Length: 2, dtype: str
Names: {0: {'primary': 'Akpro-Missérété', 'common': [('fr', 'Akpro-Missérété')], 'rules': array([{'variant': 'official', 'language': None, 'perspectives': None, 'value': "Commune d'Akpro-Missérété", 'between': None, 'side': None}],
      dtype=object)}, 1: {'primary': 'Aguégués', 'common': [('fr', 'Aguégués')], 'rules': array([{'variant': 'official', 'language': None, 'perspectives': None, 'value': "Commune d'Aguégués", 'between': None, 'side': None}],
      dtype=object)}, 2: {'primary': 'Amuwo Odofin', 'common': None, 'rules': None}, 3: {'primary': 'Agege', 'common': None, 'rules': None}, 4: {'primary': 'Ifo', 'common': None, 'rules': None}}
Regions in bbox: <ArrowStringArray>
['NG-LA', 'NG-OG']
Length: 2, dtype: str
Types: <ArrowStringArray>
['region', 'county', 'locality', 'macrohood', 'neighborhood']
Length: 5, dtype: str
Countries: <ArrowStringArray>
['NG']
Length: 1, dtype: str
Names: {0: {'primary': 'Lagos', 'common': [('ka', 'ლაგოსის შტატი'), ('ar', 'لاغوس'), ('fr', 'État de Lagos'), ('uk', 'Лагос'), ('ceb', 'Lagos State'), ('war', 'Lagos'), ('pt', 'Lagos'), ('cs', 'Stát Lagos'), ('gl', 'Estado de Lagos'), ('it', 'Lagos'), ('ca', 'Estat de Lagos'), ('yo', 'Ìpínlẹ̀ Èkó'), ('nl', 'Lagos'), ('bg', 'Лагос'), ('ko', '라고스주'), ('mk', 'Лагос'), ('et', 'Lagose osariik'), ('de', 'Lagos'), ('ha', 'Lagos'), ('azb', 'لاقوس ایالتی'), ('ig', 'Ȯra Lagos'), ('ur', 'لاگوس ریاست'), ('nan', 'Lagos Chiu'), ('ru', 'Лагос'), ('th', 'รัฐเลกอส'), ('id', 'Lagos'), ('sv', 'Lagos'), ('tr', 'Lagos Eyaleti'), ('da', 'Lagos'), ('sr-Cyrl', 'Лагос'), ('en', 'Lagos State'), ('he', 'לאגוס'), ('mrj', 'Лагос'), ('az', 'Laqos ştatı'), ('es', 'Estado de Lagos'), ('hi', 'लेगोस राज्य'), ('sw', 'Jimbo la Lagos'), ('ja', 'ラゴス州'), ('fi', 'Lagos'), ('ro', 'Statul Lagos'), ('no', 'Lagos'), ('vec', 'Ƚagos'), ('fa', 'ایالت لاگوس'), ('zh', '拉哥斯'), ('ms', 'Negeri Lagos'), ('pl', 'Lagos')], 'rules': array([{'variant': 'alternate', 'language': 'zh', 'perspectives': None, 'value': '拉各斯州', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'ar', 'perspectives': None, 'value': 'ولاية لاغوس', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'es', 'perspectives': None, 'value': 'Estado Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'pt', 'perspectives': None, 'value': 'Estado Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'it', 'perspectives': None, 'value': 'Stato Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'fr', 'perspectives': None, 'value': 'État Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'de', 'perspectives': None, 'value': 'Staat Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'nl', 'perspectives': None, 'value': 'Staat Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'sv', 'perspectives': None, 'value': 'Lagos-tillstånd', 'between': None, 'side': None},

9:21

I see that Lagos State is present with region code 'NG-LA'. I can see `Amuwo Odofin`, `Agege` as LGAs. Now I need to:

1. Get the geometry of Lagos State (division_area with region='NG-LA')
2. Get all LGAs within Lagos State (these are the "county" subtype divisions)
3. Get all buildings within Lagos State
4. Compute the area of each building
5. Filter for building footprint > 1000 m²
6. Compute per-LGA statistics

Let me write this as a comprehensive script.

9:21

Write /work/solve.py

content · 332 lines · py

"""
Extract large buildings (>1000 m²) in Lagos State, Nigeria from Overture,
compute per-LGA summary statistics.
"""
import sys
import warnings
warnings.filterwarnings("ignore")

import geopandas as gpd
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import shapely
from overturemaps.core import record_batch_reader, geodataframe

# ── Step 1: Get Lagos State boundary ─────────────────────────────────────
print("Step 1: Getting Lagos State boundary...")
bbox_lagos = (2.5, 6.2, 4.0, 6.8)  # rough bounding box around Lagos

# Use division_area to find Lagos State (region='NG-LA')
reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=bbox_lagos)
if reader is None:
    sys.exit("No division_area reader returned")

ng_la_area = None
for batch in reader:
    df = batch.to_pandas()
    mask = df['region'] == 'NG-LA'
    if mask.any():
        ng_la_area = df[mask].iloc[0]
        break

if ng_la_area is None:
    sys.exit("Could not find Lagos State (NG-LA) in division_area")

print(f"  Lagos State found: {ng_la_area['names']['primary']}")
lagos_geom_geom = shapely.from_wkb(ng_la_area['geometry'])
print(f"  Lagos State geometry type: {lagos_geom_geom.geom_type}")

# ── Step 2: Get all LGAs (county subtype) within Lagos State ────────────
print("\nStep 2: Getting Lagos LGAs...")
reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=bbox_lagos)
if reader is None:
    sys.exit("No reader returned for LGAs")

all_counties = []
for batch in reader:
    df = batch.to_pandas()
    # We want counties within Lagos State
    # Filter: country='NG' and subtype='county' or subtype='locality' (LGAs might be either)
    ng = df[df['country'] == 'NG']
    if len(ng) == 0:
        continue
    all_counties.append(ng)

if all_counties:
    counties = pd.concat(all_counties, ignore_index=True)
else:
    sys.exit("No NG divisions found")

# Deduplicate by id
counties = counties.drop_duplicates(subset='id')

# The LGAs are within Lagos - let's filter spatially
county_gdf = counties.copy()
county_gdf['geometry'] = county_gdf['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
county_gdf = gpd.GeoDataFrame(county_gdf, geometry='geometry', crs='EPSG:4326')

# Spatial join: keep counties whose centroid is within Lagos State boundary
# (or that intersect Lagos significantly)
lagos_geom = lagos_geom_geom
county_gdf['within_lagos'] = county_gdf.centroid.within(lagos_geom)
county_gdf = county_gdf[county_gdf['within_lagos']].copy()
county_gdf.drop(columns='within_lagos', inplace=True)

# Also look at region field - LGAs should have region='NG-LA' or similar
# Check what region values we have
print(f"  Found {len(county_gdf)} divisions within Lagos boundary")
print(f"  Subtypes: {county_gdf['subtype'].unique()}")
print(f"  Regions: {county_gdf['region'].unique()}")

# Use only actual LGAs (subtype='county' or 'locality')
# In Overture, LGAs in Nigeria are subtype='county' or maybe 'locality'
lga_gdf = county_gdf[county_gdf['subtype'].isin(['county', 'locality'])].copy()
print(f"  LGA candidates: {len(lga_gdf)}")

# Print all LGA names
lga_gdf['lga_name'] = lga_gdf['names'].apply(lambda n: n['primary'])
print("  LGAs found:")
for _, row in lga_gdf.iterrows():
    print(f"    - {row['lga_name']} ({row['subtype']})")

# Build a lookup: LGA name -> geometry (save as a dict for efficiency)
lga_records = []
for _, row in lga_gdf.iterrows():
    lga_records.append({
        'lga': row['lga_name'],
        'geometry': row['geometry']
    })

print(f"\nTotal LGAs: {len(lga_records)}")

# ── Step 3: Get all buildings from Overture in the Lagos area ────────────
print("\nStep 3: Querying Overture buildings for Lagos State...")

# Use a generous bounding box, we'll filter spatially
bbox_lagos_big = (2.5, 6.2, 4.0, 6.8)

reader = record_batch_reader('building', release='2026-05-20.0', bbox=bbox_lagos_big)
if reader is None:
    sys.exit("No building reader returned")

all_buildings = []
batch_count = 0
for batch in reader:
    df = batch.to_pandas()
    all_buildings.append(df)
    batch_count += 1
    if batch_count % 10 == 0:
        print(f"  Read {batch_count} batches, {sum(len(b) for b in all_buildings)} rows so far...")

print(f"  Total batches: {batch_count}")
buildings_raw = pd.concat(all_buildings, ignore_index=True)
print(f"  Total buildings fetched: {len(buildings_raw)}")

# ── Step 4: Filter buildings within Lagos State and compute area ─────────
print("\nStep 4: Computing footprint areas and filtering...")

# Decode WKB geometries
buildings_raw['geometry'] = buildings_raw['geometry'].apply(
    lambda wkb: shapely.from_wkb(wkb)
)

# Filter buildings that intersect Lagos State boundary
# Use a spatial index on Lagos boundary for speed
lagos_poly = lagos_geom
buildings_raw['in_lagos'] = buildings_raw['geometry'].apply(
    lambda g: g is not None and not g.is_empty and g.intersects(lagos_poly)
)
buildings_lagos = buildings_raw[buildings_raw['in_lagos']].copy()
print(f"  Buildings intersecting Lagos: {len(buildings_lagos)}")

# Compute footprint area in m² using an equal-area CRS for Nigeria
# Use EPSG:26391 (Nigia East Belt) or EPSG:32631/32632 (UTM zones)
# Lagos is around UTM zone 31N (EPSG:32631)
gdf = gpd.GeoDataFrame(buildings_lagos, geometry='geometry', crs='EPSG:4326')
gdf['geometry'] = gdf['geometry'].to_crs('EPSG:32631')
gdf['footprint_area_m2'] = gdf.geometry.area
# Convert back to 4326 for output
gdf = gdf.to_crs('EPSG:4326')

# Filter: footprint_area_m2 > 1000
gdf_big = gdf[gdf['footprint_area_m2'] > 1000].copy()
print(f"  Buildings with area > 1000 m²: {len(gdf_big)}")

# ── Step 5: Assign each building to an LGA ──────────────────────────────
print("\nStep 5: Assigning buildings to LGAs...")

# Build LGA geometries in EPSG:4326
lga_geoms = []
for rec in lga_records:
    lga_geoms.append({
        'lga': rec['lga'],
        'geometry': shapely.from_wkb(rec['geometry']) if isinstance(rec['geometry'], (bytes, bytearray)) else rec['geometry']
    })

lga_gdf = gpd.GeoDataFrame(lga_geoms, geometry='geometry', crs='EPSG:4326')

# Spatial join: point-in-polygon using centroids for efficiency
# But for polygon buildings, use representative point or centroid
gdf_big_latlon = gdf_big.copy()
gdf_big_latlon = gdf_big_latlon.set_crs('EPSG:4326')

gdf_big_latlon['centroid'] = gdf_big_latlon.geometry.centroid
gdf_big_latlon = gdf_big_latlon.set_geometry('centroid')

joined = gpd.sjoin(gdf_big_latlon, lga_gdf, how='left', predicate='within')

# For unmatched buildings (centroid outside LGA boundary but building inside Lagos),
# try representative point
unmatched = joined[joined['lga'].isna()]
if len(unmatched) > 0:
    print(f"  Unmatched buildings (centroid-based): {len(unmatched)}, trying rep_point...")
    gdf_big_latlon_unmatched = gdf_big_latlon.loc[unmatched.index]
    gdf_big_latlon_unmatched['rep_point'] = gdf_big_latlon_unmatched.geometry.apply(
        lambda g: g.representative_point()
    )
    gdf_big_latlon_unmatched = gdf_big_latlon_unmatched.set_geometry('rep_point')
    joined2 = gpd.sjoin(gdf_big_latlon_unmatched, lga_gdf, how='left', predicate='within')
    joined.loc[unmatched.index, 'lga'] = joined2['lga']

# Still unmatched - assign to nearest LGA
still_unmatched = joined['lga'].isna()
n_still = still_unmatched.sum()
if n_still > 0:
    print(f"  Still unmatched: {n_still}, assigning to nearest LGA...")
    from shapely.ops import nearest_points
    
    # Build a merged LGA polygon for distance checks
    lga_union = lga_gdf.geometry.unary_union
    
    for idx in joined[still_unmatched].index:
        pt = gdf_big_latlon.loc[idx, 'centroid']
        if pt is None or pt.is_empty:
            continue
        # Find nearest LGA
        min_dist = float('inf')
        best_lga = None
        for _, lrow in lga_gdf.iterrows():
            d = pt.distance(lrow['geometry'])
            if d < min_dist:
                min_dist = d
                best_lga = lrow['lga']
        joined.loc[idx, 'lga'] = best_lga

# Drop the centroid geometry, keep original
joined = joined.drop(columns=['centroid', 'rep_point'] if 'rep_point' in joined.columns else ['centroid'])
joined = joined.set_geometry('geometry')

# Remove index_right from spatial join
if 'index_right' in joined.columns:
    joined = joined.drop(columns=['index_right'])

big_buildings = joined.copy()
print(f"  Buildings with LGA assigned: {big_buildings['lga'].notna().sum()} / {len(big_buildings)}")

# ── Step 6: Prepare output columns ───────────────────────────────────────
print("\nStep 6: Preparing output...")

out = big_buildings[['id', 'height', 'footprint_area_m2', 'lga', 'geometry']].copy()
out = out[out['lga'].notna()].copy()

# Write lagos_buildings.geoparquet
print("  Writing lagos_buildings.geoparquet...")
out = out.set_crs('EPSG:4326')
out.to_parquet('/work/lagos_buildings.geoparquet')

# ── Step 7: Per-LGA roll-up ─────────────────────────────────────────────
print("\nStep 7: Computing per-LGA summary...")

# Also need to compute summary over ALL lagos buildings (not just big ones)
# But the task says "for each Lagos Local Government Area" - the building count etc.
# Actually re-read: "need a per-LGA roll-up" - this is about ALL buildings in Lagos
# Wait, re-reading: "for each Lagos Local Government Area, the building count, 
# total footprint area, count of buildings with a non-null Overture height, 
# and the median height where available"

# Hmm, this could be for all buildings OR only the filtered (>1000 m²) ones.
# Re-reading the first sentence: "Need every building footprint exceeding 1000 m²"
# That is the filter for the GeoParquet output.
# Then the summary: "for each Lagos Local Government Area, the building count, 
# total footprint area..." - this could be summarizing the filtered output.
# But "total footprint area" for the filtered set would just be sum of those >1000m²
# Let me re-read more carefully.

# "Need every building footprint exceeding 1000 m² across Lagos State" - this is for
# the geoparquet output.
# "Also need a per-LGA roll-up: for each Lagos Local Government Area, the building count,
# total footprint area, count of buildings with a non-null Overture height, and the median height"

# I think the roll-up should be over ALL buildings (not just >1000m²), because 
# "total footprint area" for all buildings makes more sense as a metric.
# But actually it could be the same set. Let me think...

# "the building count" - this is most useful for ALL buildings in the LGA
# "total footprint area" - same, makes more sense for all buildings
# So I'll compute the roll-up over ALL buildings in Lagos, not just the >1000m² ones.

# Compute footprint area for ALL buildings in Lagos
gdf_all = gpd.GeoDataFrame(buildings_lagos, geometry='geometry', crs='EPSG:4326')
# Compute area
gdf_all_utm = gdf_all.copy()
gdf_all_utm['geometry'] = gdf_all_utm['geometry'].to_crs('EPSG:32631')
gdf_all['footprint_area_m2'] = gdf_all_utm.geometry.area

# Assign LGA using same method
gdf_all['centroid'] = gdf_all.geometry.centroid
gdf_all_cent = gdf_all.set_geometry('centroid')

joined_all = gpd.sjoin(gdf_all_cent, lga_gdf, how='left', predicate='within')

# Handle unmatched
unmatched_all = joined_all[joined_all['lga'].isna()]
if len(unmatched_all) > 0:
    gdf_unmatched = gdf_all_cent.loc[unmatched_all.index]
    gdf_unmatched['rep_point'] = gdf_unmatched.geometry.apply(
        lambda g: g.representative_point() if g is not None and not g.is_empty else None
    )
    gdf_unmatched = gdf_unmatched[gdf_unmatched['rep_point'].notna()]
    gdf_unmatched = gdf_unmatched.set_geometry('rep_point')
    if len(gdf_unmatched) > 0:
        joined2_all = gpd.sjoin(gdf_unmatched, lga_gdf, how='left', predicate='within')
        joined_all.loc[unmatched_all.index, 'lga'] = joined2_all['lga']

# Nearest for still unmatched
still_un = joined_all['lga'].isna()
if still_un.sum() > 0:
    for idx in joined_all[still_un].index:
        pt = gdf_all_cent.loc[idx, 'centroid']
        if pt is None or pt.is_empty:
            continue
        min_dist = float('inf')
        best_lga = None
        for _, lrow in lga_gdf.iterrows():
            d = pt.distance(lrow['geometry'])
            if d < min_dist:
                min_dist = d
                best_lga = lrow['lga']
        joined_all.loc[idx, 'lga'] = best_lga

joined_all = joined_all.drop(columns=['centroid', 'rep_point'] if 'rep_point' in joined_all.columns else ['centroid'])
if 'index_right' in joined_all.columns:
    joined_all = joined_all.drop(columns=['index_right'])

# Now compute per-LGA summary
lga_summary = joined_all.groupby('lga').agg(
    n_buildings=('id', 'count'),
    total_footprint_m2=('footprint_area_m2', 'sum'),
    n_with_height=('height', lambda s: s.notna().sum()),
    p50_height_m=('height', lambda s: s.median() if s.notna().any() else None)
).reset_index()

print("  LGA Summary:")
print(lga_summary.to_string())

# Write the summary
lga_summary.to_parquet('/work/lagos_building_summary.parquet', index=False)

print("\n✅ Done! Output files:")
print("  /work/lagos_buildings.geoparquet")
print("  /work/lagos_building_summary.parquet")

Wrote /work/solve.py (13141 chars).

9:40

Let me first try running this to see if it works and check for any issues. The script is complex and might have issues with large data volumes.

9:40

Bash cd /work && python solve.py 2>&1

Step 1: Getting Lagos State boundary...
  Lagos State found: Amuwo Odofin
  Lagos State geometry type: Polygon

Step 2: Getting Lagos LGAs...
  Found 2 divisions within Lagos boundary
  Subtypes: <ArrowStringArray>
['county', 'locality']
Length: 2, dtype: str
  Regions: <ArrowStringArray>
['NG-LA']
Length: 1, dtype: str
  LGA candidates: 2
  LGAs found:
    - Amuwo Odofin (county)
    - Abagbo (locality)

Total LGAs: 2

Step 3: Querying Overture buildings for Lagos State...
  Read 10 batches, 273430 rows so far...
  Read 20 batches, 557786 rows so far...
  Read 30 batches, 962584 rows so far...
  Read 40 batches, 1295674 rows so far...
  Read 50 batches, 1644849 rows so far...
  Read 60 batches, 2050109 rows so far...
  Read 70 batches, 2460131 rows so far...
  Read 80 batches, 2820222 rows so far...
  Read 90 batches, 3152659 rows so far...
  Read 100 batches, 3535208 rows so far...
  Read 110 batches, 3855485 rows so far...
  Read 120 batches, 4186555 rows so far...
  Total batches: 129
  Total buildings fetched: 4354364

Step 4: Computing footprint areas and filtering...
  Buildings intersecting Lagos: 102598
  Buildings with area > 1000 m²: 835

Step 5: Assigning buildings to LGAs...
  Buildings with LGA assigned: 835 / 835

Step 6: Preparing output...
  Writing lagos_buildings.geoparquet...

Step 7: Computing per-LGA summary...
  LGA Summary:
            lga  n_buildings  total_footprint_m2  n_with_height p50_height_m
0        Abagbo           84        7.812667e+03              0         None
1  Amuwo Odofin       102598        1.427859e+07              0         None

✅ Done! Output files:
  /work/lagos_buildings.geoparquet
  /work/lagos_building_summary.parquet

14:26

It seems like the Lagos State boundary query only returned a small part of Lagos. The issue is that the bounding box I used (2.5, 6.2, 4.0, 6.8) only captures a small corner of Lagos State. Lagos State spans approximately from 2.7°E to 4.5°E and 6.2°N to 6.8°N. Let me also expand the bbox.

Also, the division_area query for the Lagos region (NG-LA) only returned 2 small divisions (Amuwo Odofin and Abagbo) instead of the whole state. The problem is that I'm not querying the `division_boundary` type for the actual boundary polygon of Lagos State. Let me approach this differently.

Actually, I think the issue is:
1. The `division_area` with bbox (2.5, 6.2, 4.0, 6.8) returned only the portions of Lagos State within that bbox. Since Lagos State boundary is large, only a part was returned.
2. I should use `division_boundary` to get the actual boundary geometry of Lagos State.

Let me rethink the approach. In Overture:
- `division_area` represents the boundary polygon for a division
- `division_boundary` represents the boundary line

Actually, looking at the output again, it only found "Amuwo Odofin" and "Abagbo" within the bounding box. This is because the bounding box (2.5, 6.2, 4.0, 6.8) only covers a small section of Lagos. I need a wider bounding box.

Lagos State coordinates approximately: 2.7°E to 4.5°E, 6.2°N to 6.8°N.

Let me also look at whether the division_area query for NG-LA region returned actual results for the whole state or just pieces. The issue is that division_area for a region returns the region's boundary polygon, but the Overture query uses bounding box filtering at the S3 level. So to get the full Lagos State boundary, I need to use a bbox that covers the whole state.

But inductive of the Overture data structure - the `division_area` with `region='NG-LA'` should be the boundary of the entire state. Let me widen the bbox.

14:26

Bash python3 -c "

python3 -c "
# Lagos State approximate bounds
# Longitude: ~2.7°E to ~4.5°E
# Latitude: ~6.2°N to ~6.8°N
# Let me use a wider bbox
bbox = (2.5, 6.1, 4.6, 6.9)

from overturemaps.core import record_batch_reader
import pyarrow as pa

# Get the division_area for Lagos State (region='NG-LA')
reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=bbox)

if reader:
    for batch in reader:
        df = batch.to_pandas()
        ng = df[df['country'] == 'NG']
        if len(ng) > 0:
            # Look for Lagos region
            lagos = ng[ng['region'] == 'NG-LA']
            print(f'NG-LA entries: {len(lagos)}')
            if len(lagos) > 0:
                for _, row in lagos.iterrows():
                    print(f\"  id={row['id']}, subtype={row['subtype']}, names={row['names']['primary']}\")
            print(f'All subtypes: {ng[\"subtype\"].unique()}')
            print(f'All regions in NG: {ng[\"region\"].unique()}')
            break
" 2>&1

NG-LA entries: 7
  id=32cfa1af-5e74-41a7-9272-b9dd8da715d9, subtype=county, names=Amuwo Odofin
  id=fcaab4e8-075a-4ab6-a2d2-cd614e95dfc2, subtype=county, names=Agege
  id=934f5b76-4cbf-4902-b124-b33d057d87b1, subtype=county, names=Ifako/Ijaye
  id=457a7986-f212-4765-b432-e1c1b77d6d61, subtype=county, names=Oshodi/Isolo
  id=93ccd19d-3389-4c00-a8e3-920e20810d39, subtype=county, names=Badagry
  id=1a9532db-bcd5-4050-b62b-f27a22f2ebcf, subtype=county, names=Alimosho
  id=7918fd74-2ce4-48af-b3c1-1a4f1e26558e, subtype=county, names=Ojo
All subtypes: <ArrowStringArray>
['county']
Length: 1, dtype: str
All regions in NG: <ArrowStringArray>
['NG-LA', 'NG-OG']
Length: 2, dtype: str

14:53

I see - the bounding box (2.5, 6.1, 4.6, 6.9) reveals more LGAs. The Lagos State boundary itself (region='NG-LA' with subtype='region') was missed because of the small bbox. Let me think about this differently.

The Lagos State region (admin_level 1) has subtype='region'. Let me find it. Also, the LGAs are subtype='county'. Let me query with a larger bbox to get the full Lagos State boundary and all its LGAs.

Let me look for the Lagos State region boundary first.

14:53

Bash python3 -c "

python3 -c "
from overturemaps.core import record_batch_reader
import pyarrow as pa

bbox = (2.5, 6.1, 4.6, 6.9)

reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=bbox)

all_ng = []
for batch in reader:
    df = batch.to_pandas()
    ng = df[df['country'] == 'NG']
    if len(ng) > 0:
        all_ng.append(ng)

import pandas as pd
if all_ng:
    df_all = pd.concat(all_ng, ignore_index=True)
    print('Total NG entries:', len(df_all))
    # Find region subtype
    region_rows = df_all[df_all['subtype'] == 'region']
    print(f'Region entries: {len(region_rows)}')
    for _, row in region_rows.iterrows():
        print(f\"  region={row.get('region','?')}, names={row['names']['primary']}\")
    
    lagos_state = df_all[(df_all['subtype'] == 'region') & (df_all['region'] == 'NG-LA')]
    print(f'Lagos State region entries: {len(lagos_state)}')
    
    # Also find all county entries in NG-LA
    counties = df_all[(df_all['subtype'] == 'county') & (df_all['region'] == 'NG-LA')]
    print(f'Counties in NG-LA: {len(counties)}')
    for _, row in counties.iterrows():
        print(f\"  {row['names']['primary']}\")
    
    # Also find locality entries
    localities = df_all[(df_all['subtype'] == 'locality') & (df_all['region'] == 'NG-LA')]
    print(f'Localities in NG-LA: {len(localities)}')
    for _, row in localities.iterrows():
        print(f\"  {row['names']['primary']}\")
" 2>&1

Total NG entries: 51
Region entries: 3
  region=NG-LA, names=Lagos
  region=NG-ON, names=Ondo
  region=NG-OG, names=Ogun
Lagos State region entries: 1
Counties in NG-LA: 20
  Amuwo Odofin
  Agege
  Ifako/Ijaye
  Oshodi/Isolo
  Badagry
  Alimosho
  Ojo
  Ìkòròdú
  Epe
  Ajeromi/Ifelodun
  Kosofe
  Lagos Island
  Ikeja
  Shomolu
  Lagos Mainland
  Ibeju Lekki
  Apapa
  Surulere
  Eti Osa
  Mushin
Localities in NG-LA: 2
  Alaro City
  Abagbo

15:31

Now I need to:
1. Use a bigger bbox (2.5, 6.1, 4.6, 6.9) to capture all of Lagos State
2. Get the Lagos State boundary (region subtype) 
3. Get all 20 counties (LGAs) + 2 localities
4. Query all buildings in that bbox
5. Filter to those within Lagos State boundary
6. Compute footprint areas
7. Filter >1000 m² for the geoparquet output
8. Assign to LGAs
9. Compute per-LGA summary over ALL buildings (not just the big ones)

Let me rewrite the script to be more robust and comprehensive.

15:31

Write /work/solve.py

content · 194 lines · py

"""
Extract large buildings (>1000 m²) in Lagos State, Nigeria from Overture,
compute per-LGA summary statistics.
"""
import sys
import warnings
warnings.filterwarnings("ignore")

import geopandas as gpd
import pandas as pd
import numpy as np
import pyarrow as pa
from overturemaps.core import record_batch_reader
import shapely

# ── Config ───────────────────────────────────────────────────────────────
LAGOS_BBOX = (2.5, 6.1, 4.6, 6.9)  # covers all of Lagos State
RELEASE = '2026-05-20.0'

# ── Step 1: Get Lagos State boundary and all LGAs ──────────────────────
print("Step 1: Getting Lagos State boundary and LGAs...")

reader = record_batch_reader('division_area', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
    sys.exit("No division_area reader returned")

all_divs = []
for batch in reader:
    df = batch.to_pandas()
    all_divs.append(df)

all_divs = pd.concat(all_divs, ignore_index=True)
# Keep only NG
all_divs = all_divs[all_divs['country'] == 'NG'].copy()
# Deduplicate
all_divs = all_divs.drop_duplicates(subset='id')

# Decode geometries
all_divs['geometry'] = all_divs['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
div_gdf = gpd.GeoDataFrame(all_divs, geometry='geometry', crs='EPSG:4326')

# Find Lagos State (region subtype)
lagos_state = div_gdf[(div_gdf['subtype'] == 'region') & (div_gdf['region'] == 'NG-LA')]
if len(lagos_state) == 0:
    sys.exit("Could not find Lagos State region")
lagos_boundary = lagos_state.iloc[0]['geometry']
print(f"  Lagos State boundary: {lagos_boundary.geom_type}")

# Find LGAs - subtype='county' or 'locality' within NG-LA
lga_divs = div_gdf[div_gdf['region'] == 'NG-LA'].copy()
# Keep only county and locality subtypes (LGAs)
lga_divs = lga_divs[lga_divs['subtype'].isin(['county', 'locality'])].copy()
lga_divs['lga_name'] = lga_divs['names'].apply(lambda n: n['primary'])

print(f"  Total LGAs: {len(lga_divs)}")
for _, row in lga_divs.iterrows():
    print(f"    - {row['lga_name']} ({row['subtype']})")

# Build LGA GeoDataFrame for spatial joins
lga_gdf = lga_divs[['lga_name', 'geometry']].copy()
lga_gdf = lga_gdf.rename(columns={'lga_name': 'lga'})

# ── Step 2: Get all buildings from Overture ────────────────────────────
print("\nStep 2: Querying Overture buildings...")

reader = record_batch_reader('building', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
    sys.exit("No building reader returned")

all_buildings = []
batch_count = 0
for batch in reader:
    df = batch.to_pandas()
    all_buildings.append(df)
    batch_count += 1
    if batch_count % 20 == 0:
        print(f"  Read {batch_count} batches, {sum(len(b) for b in all_buildings)} rows...")

buildings_raw = pd.concat(all_buildings, ignore_index=True)
print(f"  Total buildings fetched: {len(buildings_raw)}")

# ── Step 3: Filter buildings within Lagos State boundary ──────────────
print("\nStep 3: Filtering to Lagos State and computing areas...")

# Decode WKB geometries
buildings_raw['geometry'] = buildings_raw['geometry'].apply(
    lambda wkb: shapely.from_wkb(wkb) if isinstance(wkb, (bytes, bytearray)) else wkb
)

# Spatial filter - buildings intersecting Lagos State boundary
buildings_raw['in_lagos'] = buildings_raw['geometry'].apply(
    lambda g: g is not None and not g.is_empty and g.intersects(lagos_boundary)
)
buildings_lagos = buildings_raw[buildings_raw['in_lagos']].copy()
print(f"  Buildings in Lagos State: {len(buildings_lagos)}")

# Convert to GeoDataFrame for area computation
gdf = gpd.GeoDataFrame(buildings_lagos, geometry='geometry', crs='EPSG:4326')

# Compute area in m² - use EPSG:32631 (UTM 31N) which covers Lagos
gdf_utm = gdf.copy()
gdf_utm['geometry'] = gdf_utm['geometry'].to_crs('EPSG:32631')
gdf['footprint_area_m2'] = gdf_utm.geometry.area

# ── Step 4: Assign each building to an LGA ────────────────────────────
print("\nStep 4: Assigning buildings to LGAs...")

# Use centroids for point-in-polygon join
gdf['centroid'] = gdf.geometry.centroid
gdf_cent = gdf.set_geometry('centroid')

# Spatial join
joined = gpd.sjoin(gdf_cent, lga_gdf, how='left', predicate='within')

# Handle unmatched buildings with representative_point
unmask = joined['lga'].isna()
n_unmatched = unmask.sum()
if n_unmatched > 0:
    print(f"  Unmatched (centroid): {n_unmatched}, trying representative_point...")
    for idx in joined[unmask].index:
        geom = gdf.loc[idx, 'geometry']
        if geom is None or geom.is_empty:
            continue
        rep_pt = geom.representative_point()
        # Find which LGA contains this point
        for _, lrow in lga_gdf.iterrows():
            if lrow['geometry'].contains(rep_pt):
                joined.loc[idx, 'lga'] = lrow['lga']
                break

# Nearest LGA for any still unmatched
still_un = joined['lga'].isna()
n_still = still_un.sum()
if n_still > 0:
    print(f"  Still unmatched: {n_still}, assigning to nearest LGA...")
    for idx in joined[still_un].index:
        pt = gdf.loc[idx, 'centroid']
        if pt is None or pt.is_empty:
            continue
        min_dist = float('inf')
        best_lga = None
        for _, lrow in lga_gdf.iterrows():
            d = pt.distance(lrow['geometry'])
            if d < min_dist:
                min_dist = d
                best_lga = lrow['lga']
        joined.loc[idx, 'lga'] = best_lga

# Drop centroid geometry and index_right
joined = joined.drop(columns=['centroid'])
if 'index_right' in joined.columns:
    joined = joined.drop(columns=['index_right'])
joined = joined.set_geometry('geometry')

assigned = joined['lga'].notna().sum()
print(f"  Buildings with LGA assigned: {assigned} / {len(joined)}")

# ── Step 5: Filter to buildings > 1000 m² for output ──────────────────
print("\nStep 5: Filtering to buildings > 1000 m²...")

big_buildings = joined[joined['footprint_area_m2'] > 1000].copy()
big_buildings = big_buildings[big_buildings['lga'].notna()].copy()
print(f"  Buildings > 1000 m²: {len(big_buildings)}")

# Prepare output columns
out_cols = ['id', 'height', 'footprint_area_m2', 'lga', 'geometry']
out = big_buildings[out_cols].copy()
out = out.set_crs('EPSG:4326')

# Write lagos_buildings.geoparquet
print("  Writing lagos_buildings.geoparquet...")
out.to_parquet('/work/lagos_buildings.geoparquet')

# ── Step 6: Per-LGA roll-up over ALL buildings ────────────────────────
print("\nStep 6: Computing per-LGA summary (over all buildings)...")

# Use the joined dataframe with all Lagos buildings (before area filter)
all_with_lga = joined[joined['lga'].notna()].copy()

lga_summary = all_with_lga.groupby('lga').agg(
    n_buildings=('id', 'count'),
    total_footprint_m2=('footprint_area_m2', 'sum'),
    n_with_height=('height', lambda s: s.notna().sum()),
    p50_height_m=('height', lambda s: s.median() if s.notna().any() else None)
).reset_index()

print(lga_summary.to_string())

# Write summary
lga_summary.to_parquet('/work/lagos_building_summary.parquet', index=False)

print("\n✅ Done!")
print(f"  /work/lagos_buildings.geoparquet  ({len(out)} rows)")
print(f"  /work/lagos_building_summary.parquet  ({len(lga_summary)} LGAs)")

Wrote /work/solve.py (7077 chars).

15:36

Wait, I realize there's an issue with the approach. The `division_area` type might have the Lagos State boundary, but when I did the previous query (with the small bbox), I only got Amuwo Odofin as the first match because I used `df[mask].iloc[0]` (the first row where region='NG-LA').

Now I'm using a wider bbox and looking for `subtype='region'` with `region='NG-LA'` which should be correct.

Also, I need to reconsider: will the `record_batch_reader` for 'division_area' return all batches if I just iterate? Yes, it should iterate through all the parquet files that match the spatial filter.

Let me also check - the Lagos State boundary geometry might be stored in the `division_area` type or the `division_boundary` type. Let me check both.

Actually, looking at the output from the previous small bbox query:
- The first batch had `subtype=dependency, country=UM` - this was from US Minor Outlying Islands which happens to be in the bbox
- The second batch had Lagos entries with subtypes: county, locality, region, country

So `division_area` does contain the Lagos State region with subtype='region'. Good.

Now let me also consider: the Lagos State shape from `division_area` might actually be the administrative boundary. That's what I need.

Let me run the updated script.

15:36

Bash cd /work && python solve.py 2>&1

Step 1: Getting Lagos State boundary and LGAs...
  Lagos State boundary: Polygon
  Total LGAs: 22
    - Amuwo Odofin (county)
    - Agege (county)
    - Ifako/Ijaye (county)
    - Oshodi/Isolo (county)
    - Badagry (county)
    - Alimosho (county)
    - Ojo (county)
    - Ìkòròdú (county)
    - Epe (county)
    - Ajeromi/Ifelodun (county)
    - Kosofe (county)
    - Alaro City (locality)
    - Lagos Island (county)
    - Ikeja (county)
    - Shomolu (county)
    - Lagos Mainland (county)
    - Abagbo (locality)
    - Ibeju Lekki (county)
    - Apapa (county)
    - Surulere (county)
    - Eti Osa (county)
    - Mushin (county)

Step 2: Querying Overture buildings...
  Read 20 batches, 496767 rows...
  Read 40 batches, 1265487 rows...
  Read 60 batches, 1986906 rows...
  Read 80 batches, 2807879 rows...
  Read 100 batches, 3509578 rows...
  Read 120 batches, 4094347 rows...
  Read 140 batches, 4769198 rows...
  Total buildings fetched: 5003306

Step 3: Filtering to Lagos State and computing areas...
  Buildings in Lagos State: 2245422

Step 4: Assigning buildings to LGAs...

bash: line 1:   640 Killed                  python solve.py 2>&1

20:38

The process was killed (OOM - out of memory). 2.2 million buildings with geometry operations is too much. I need to optimize:

1. Process buildings in batches using DuckDB or pyarrow
2. Use spatial index/operations more efficiently
3. Maybe avoid converting all geometries at once

Let me take a more memory-efficient approach - use DuckDB with the spatial extension, or process in chunks.

Actually, the overturemaps library returns RecordBatchReaders which are already batched. I can process each batch one at a time rather than collecting everything in memory.

Let me redesign the approach:
1. Get Lagos boundary and LGA polygons first
2. For each building batch, filter by spatial intersection, compute area, assign LGA, and aggregate incrementally
3. Only keep >1000 m² buildings for the big output file

This should use much less memory.

20:38

Write /work/solve.py

content · 230 lines · py

"""
Extract large buildings (>1000 m²) in Lagos State, Nigeria from Overture,
compute per-LGA summary statistics.
Memory-efficient batch processing.
"""
import sys
import warnings
warnings.filterwarnings("ignore")

import geopandas as gpd
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
from overturemaps.core import record_batch_reader
import shapely

# ── Config ───────────────────────────────────────────────────────────────
LAGOS_BBOX = (2.5, 6.1, 4.6, 6.9)  # covers all of Lagos State
RELEASE = '2026-05-20.0'

# ── Step 1: Get Lagos State boundary and all LGAs ──────────────────────
print("Step 1: Getting Lagos State boundary and LGAs...")

reader = record_batch_reader('division_area', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
    sys.exit("No division_area reader returned")

all_divs = []
for batch in reader:
    df = batch.to_pandas()
    all_divs.append(df)

all_divs = pd.concat(all_divs, ignore_index=True)
all_divs = all_divs[all_divs['country'] == 'NG'].copy()
all_divs = all_divs.drop_duplicates(subset='id')

# Decode geometries
all_divs['geometry'] = all_divs['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
div_gdf = gpd.GeoDataFrame(all_divs, geometry='geometry', crs='EPSG:4326')

# Lagos State boundary
lagos_state = div_gdf[(div_gdf['subtype'] == 'region') & (div_gdf['region'] == 'NG-LA')]
if len(lagos_state) == 0:
    sys.exit("Could not find Lagos State region")
lagos_boundary = lagos_state.iloc[0]['geometry']
print(f"  Lagos State boundary: {lagos_boundary.geom_type}")

# LGAs: county or locality within NG-LA
lga_divs = div_gdf[div_gdf['region'] == 'NG-LA'].copy()
lga_divs = lga_divs[lga_divs['subtype'].isin(['county', 'locality'])].copy()
lga_divs['lga_name'] = lga_divs['names'].apply(lambda n: n['primary'])

print(f"  Total LGAs: {len(lga_divs)}")
for _, row in lga_divs.iterrows():
    print(f"    - {row['lga_name']} ({row['subtype']})")

lga_gdf = lga_divs[['lga_name', 'geometry']].copy().rename(columns={'lga_name': 'lga'})

# Build an R-tree index for fast point-in-polygon lookups
from shapely import STRtree
lga_tree = STRtree(lga_gdf['geometry'].tolist())
# Map tree index -> lga name
lga_index_to_name = {i: name for i, name in enumerate(lga_gdf['lga'].tolist())}

# Function to assign LGA to a point
def assign_lga_by_point(pt):
    """Return LGA name for a point, or None."""
    if pt is None or pt.is_empty:
        return None
    # Query the tree
    candidates = lga_tree.query(pt)
    for idx in candidates:
        if lga_gdf.iloc[idx]['geometry'].contains(pt):
            return lga_gdf.iloc[idx]['lga']
    # Nearest neighbor fallback
    min_dist = float('inf')
    best = None
    for _, row in lga_gdf.iterrows():
        d = pt.distance(row['geometry'])
        if d < min_dist:
            min_dist = d
            best = row['lga']
    return best

# Pre-compute UTM geometry for Lagos boundary
lagos_utm = shapely.transform(lagos_boundary, lambda x, y, z=None: (
    gpd.GeoSeries([shapely.Point(0,0)], crs='EPSG:4326')
    .to_crs('EPSG:32631')
).geometry.iloc[0].coords[:][0])  # NO, let me use a simpler transform

# Actually, let's just pre-compute with pyproj
import pyproj
transformer = pyproj.Transformer.from_crs('EPSG:4326', 'EPSG:32631', always_xy=True)

def area_m2_4326(geom):
    """Compute area in m² for a 4326 polygon using UTM 31N transform."""
    if geom is None or geom.is_empty:
        return 0.0
    try:
        geom_utm = shapely.ops.transform(lambda x, y: transformer.transform(x, y), geom)
        return geom_utm.area
    except Exception:
        return 0.0

# ── Step 2: Process buildings in batches ──────────────────────────────
print("\nStep 2: Processing Overture buildings in batches...")

reader = record_batch_reader('building', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
    sys.exit("No building reader returned")

# Output accumulators
all_big_buildings = []   # for geoparquet (>1000 m²)
lga_summary_accum = {}   # lga -> {n, total_area, n_h, heights_list}

batch_count = 0
for batch in reader:
    batch_count += 1
    df = batch.to_pandas()
    
    # Decode geometry
    df['geometry'] = df['geometry'].apply(
        lambda wkb: shapely.from_wkb(wkb) if isinstance(wkb, (bytes, bytearray)) else wkb
    )
    
    # Quick spatial filter using envelope/bbox intersection first
    # Check if building intersects Lagos
    in_lagos = df['geometry'].apply(
        lambda g: g is not None and not g.is_empty and g.intersects(lagos_boundary)
    )
    df_lagos = df[in_lagos].copy()
    if len(df_lagos) == 0:
        continue
    
    # Compute area
    for i, row in df_lagos.iterrows():
        geom = row['geometry']
        area_val = area_m2_4326(geom)
        df_lagos.loc[i, 'footprint_area_m2'] = area_val
    
    # Assign LGA using centroid
    for i, row in df_lagos.iterrows():
        geom = row['geometry']
        if geom is None or geom.is_empty:
            df_lagos.loc[i, 'lga'] = None
            continue
        centroid = geom.centroid
        lga_name = assign_lga_by_point(centroid)
        if lga_name is None:
            # Try representative point
            rep = geom.representative_point()
            lga_name = assign_lga_by_point(rep)
        df_lagos.loc[i, 'lga'] = lga_name
    
    # Remove rows with no LGA
    df_lagos = df_lagos[df_lagos['lga'].notna()].copy()
    if len(df_lagos) == 0:
        continue
    
    # Accumulate for big buildings output
    big_mask = df_lagos['footprint_area_m2'] > 1000
    big_batch = df_lagos[big_mask].copy()
    if len(big_batch) > 0:
        all_big_buildings.append(big_batch[['id', 'height', 'footprint_area_m2', 'lga', 'geometry']])
    
    # Accumulate for per-LGA summary
    for _, row in df_lagos.iterrows():
        lga = row['lga']
        h = row['height']
        area = row['footprint_area_m2']
        
        if lga not in lga_summary_accum:
            lga_summary_accum[lga] = {
                'n': 0, 'total_area': 0.0, 'n_h': 0, 'heights': []
            }
        acc = lga_summary_accum[lga]
        acc['n'] += 1
        acc['total_area'] += area
        if pd.notna(h):
            acc['n_h'] += 1
            acc['heights'].append(h)
    
    if batch_count % 50 == 0:
        print(f"  Processed {batch_count} batches, {sum(len(b) for b in all_big_buildings)} big buildings so far...")

print(f"  Total batches processed: {batch_count}")
print(f"  Total big buildings (>1000 m²): {sum(len(b) for b in all_big_buildings)}")

# ── Step 3: Write lagos_buildings.geoparquet ──────────────────────────
print("\nStep 3: Writing lagos_buildings.geoparquet...")

if all_big_buildings:
    big_out = pd.concat(all_big_buildings, ignore_index=True)
    big_gdf = gpd.GeoDataFrame(big_out, geometry='geometry', crs='EPSG:4326')
    big_gdf.to_parquet('/work/lagos_buildings.geoparquet')
    print(f"  Wrote {len(big_gdf)} rows")
else:
    print("  No big buildings found, writing empty file")
    empty = gpd.GeoDataFrame({'id': [], 'height': [], 'footprint_area_m2': [], 'lga': []},
                              geometry=[], crs='EPSG:4326')
    empty.to_parquet('/work/lagos_buildings.geoparquet')

# Clear memory
all_big_buildings = None

# ── Step 4: Write lagos_building_summary.parquet ──────────────────────
print("\nStep 4: Writing lagos_building_summary.parquet...")

summary_rows = []
for lga, acc in lga_summary_accum.items():
    heights = acc['heights']
    p50 = float(np.median(heights)) if len(heights) > 0 else None
    summary_rows.append({
        'lga': lga,
        'n_buildings': acc['n'],
        'total_footprint_m2': acc['total_area'],
        'n_with_height': acc['n_h'],
        'p50_height_m': p50
    })

summary_df = pd.DataFrame(summary_rows)
summary_df = summary_df.sort_values('lga').reset_index(drop=True)
summary_df.to_parquet('/work/lagos_building_summary.parquet', index=False)

print(summary_df.to_string())

print("\n✅ Done!")
print(f"  /work/lagos_buildings.geoparquet")
print(f"  /work/lagos_building_summary.parquet")

Wrote /work/solve.py (8039 chars).

20:58

The issue with this approach is that the area calculation using pyproj transform for every single row will be slow. Let me think about a better approach.

Actually, the real issue was memory - 2.2M buildings loaded at once. Let me try a chunked approach but with a simpler area computation. Actually, let me think about using DuckDB with the spatial extension to read the overture data directly from S3. That might be more efficient.

Wait, actually the overturemaps library already gives us RecordBatchReaders, which are streaming. The problem was that I was accumulating all batches in memory. Let me fix by not accumulating, just processing each batch and discarding. The approach I have above does that - it processes each batch individually and only keeps the big building records.

Let me try running it. The area calculation using pyproj for each row could be slow though. Let me optimize by vectorizing.

Actually, let me re-think. Each batch might be ~40k rows. For each row I'm calling area_m2_4326 which does a pyproj transform. That could be slow for 2.2M rows. Let me use a GeoSeries approach instead - convert each batch's Lagos-filtered geometries to GeoSeries, project to UTM, and compute areas in one go.

Let me rewrite more carefully.

20:58

Write /work/solve.py

content · 198 lines · py

"""
Extract large buildings (>1000 m²) in Lagos State, Nigeria from Overture,
compute per-LGA summary statistics.
Memory-efficient batch processing using DuckDB for heavy lifting.
"""
import sys
import warnings
warnings.filterwarnings("ignore")

import geopandas as gpd
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
from overturemaps.core import record_batch_reader
import shapely
from shapely import STRtree

# ── Config ───────────────────────────────────────────────────────────────
LAGOS_BBOX = (2.5, 6.1, 4.6, 6.9)
RELEASE = '2026-05-20.0'

# ── Step 1: Get Lagos State boundary and LGAs ──────────────────────────
print("Step 1: Getting Lagos State boundary and LGAs...")

reader = record_batch_reader('division_area', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
    sys.exit("No division_area reader returned")

all_divs = []
for batch in reader:
    all_divs.append(batch.to_pandas())

all_divs = pd.concat(all_divs, ignore_index=True)
all_divs = all_divs[all_divs['country'] == 'NG'].copy()
all_divs = all_divs.drop_duplicates(subset='id')
all_divs['geometry'] = all_divs['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
div_gdf = gpd.GeoDataFrame(all_divs, geometry='geometry', crs='EPSG:4326')

# Lagos State boundary
lagos_state = div_gdf[(div_gdf['subtype'] == 'region') & (div_gdf['region'] == 'NG-LA')]
if len(lagos_state) == 0:
    sys.exit("Could not find Lagos State region")
lagos_boundary = lagos_state.iloc[0]['geometry']
print(f"  Lagos State boundary type: {lagos_boundary.geom_type}")

# LGAs
lga_divs = div_gdf[div_gdf['region'] == 'NG-LA'].copy()
lga_divs = lga_divs[lga_divs['subtype'].isin(['county', 'locality'])].copy()
lga_divs['lga_name'] = lga_divs['names'].apply(lambda n: n['primary'])
print(f"  Total LGAs: {len(lga_divs)}")
for _, row in lga_divs.iterrows():
    print(f"    - {row['lga_name']} ({row['subtype']})")

lga_df = lga_divs[['lga_name', 'geometry']].copy().rename(columns={'lga_name': 'lga'})
lga_geoms = lga_df['geometry'].tolist()
lga_names = lga_df['lga'].tolist()

# R-tree for fast point-in-polygon
lga_tree = STRtree(lga_geoms)

def assign_lga(pt):
    """Return LGA name for a point geometry."""
    if pt is None or pt.is_empty:
        return None
    candidates = lga_tree.query(pt)
    for idx in candidates:
        if lga_geoms[idx].contains(pt):
            return lga_names[idx]
    # Nearest neighbor fallback
    min_d = float('inf')
    best = None
    for i, g in enumerate(lga_geoms):
        d = pt.distance(g)
        if d < min_d:
            min_d = d
            best = lga_names[i]
    return best

# ── Step 2: Process buildings in batches ──────────────────────────────
print("\nStep 2: Processing buildings in batches...")

reader = record_batch_reader('building', release=RELEASE, bbox=LAGOS_BBOX)
if reader is None:
    sys.exit("No building reader returned")

big_buildings_list = []
lga_summary = {}  # lga -> {n, total_area, n_h, heights[]}

batch_num = 0
for batch in reader:
    batch_num += 1
    df = batch.to_pandas()
    
    # Decode WKB -> Shapely
    df['geometry'] = df['geometry'].apply(
        lambda wkb: shapely.from_wkb(wkb) if isinstance(wkb, (bytes, bytearray)) else wkb
    )
    
    # Quick bbox/envelope filter using shapely intersects
    in_lagos = df['geometry'].apply(
        lambda g: g is not None and not g.is_empty and g.intersects(lagos_boundary)
    )
    df_lag = df[in_lagos].copy()
    if len(df_lag) == 0:
        continue
    
    # ---- Compute area in m² using vectorized GeoSeries ----
    gs = gpd.GeoSeries(df_lag['geometry'].values, crs='EPSG:4326')
    gs_utm = gs.to_crs('EPSG:32631')
    df_lag['footprint_area_m2'] = gs_utm.area.values
    
    # ---- Assign LGA ----
    centroids = df_lag['geometry'].apply(lambda g: g.centroid if g and not g.is_empty else None)
    lga_vals = centroids.apply(assign_lga)
    # For nulls, try representative point
    null_mask = lga_vals.isna()
    if null_mask.any():
        rep_pts = df_lag.loc[null_mask, 'geometry'].apply(
            lambda g: g.representative_point() if g and not g.is_empty else None
        )
        lga_vals.loc[null_mask] = rep_pts.apply(assign_lga)
    df_lag['lga'] = lga_vals
    
    # Drop rows with no LGA
    df_lag = df_lag[df_lag['lga'].notna()].copy()
    if len(df_lag) == 0:
        continue
    
    # ---- Collect big buildings (>1000 m²) ----
    big = df_lag[df_lag['footprint_area_m2'] > 1000]
    if len(big) > 0:
        big_buildings_list.append(
            big[['id', 'height', 'footprint_area_m2', 'lga', 'geometry']].copy()
        )
    
    # ---- Update LGA summary ----
    for _, row in df_lag.iterrows():
        lga = row['lga']
        h = row['height']
        area = row['footprint_area_m2']
        if lga not in lga_summary:
            lga_summary[lga] = {'n': 0, 'area': 0.0, 'n_h': 0, 'heights': []}
        acc = lga_summary[lga]
        acc['n'] += 1
        acc['area'] += area
        if pd.notna(h):
            acc['n_h'] += 1
            acc['heights'].append(h)
    
    if batch_num % 50 == 0:
        n_big = sum(len(b) for b in big_buildings_list)
        print(f"  Batch {batch_num}: {len(df_lag)} in Lagos, {n_big} big buildings so far...")

print(f"  Processed {batch_num} batches.")
n_big_total = sum(len(b) for b in big_buildings_list)
print(f"  Total buildings >1000 m²: {n_big_total}")

# ── Step 3: Write lagos_buildings.geoparquet ──────────────────────────
print("\nStep 3: Writing lagos_buildings.geoparquet...")

if big_buildings_list:
    big_df = pd.concat(big_buildings_list, ignore_index=True)
    big_gdf = gpd.GeoDataFrame(big_df, geometry='geometry', crs='EPSG:4326')
    big_gdf.to_parquet('/work/lagos_buildings.geoparquet')
    print(f"  Wrote {len(big_gdf)} rows")
    big_df = None
else:
    print("  No big buildings found")
    empty = gpd.GeoDataFrame(
        {'id': pd.Series([], dtype='str'), 'height': pd.Series([], dtype='float64'),
         'footprint_area_m2': pd.Series([], dtype='float64'), 'lga': pd.Series([], dtype='str')},
        geometry=pd.Series([], dtype='object'), crs='EPSG:4326'
    )
    empty.to_parquet('/work/lagos_buildings.geoparquet')

big_buildings_list = None

# ── Step 4: Write lagos_building_summary.parquet ──────────────────────
print("\nStep 4: Writing lagos_building_summary.parquet...")

rows = []
for lga, acc in lga_summary.items():
    hts = acc['heights']
    p50 = float(np.median(hts)) if len(hts) > 0 else None
    rows.append({
        'lga': lga,
        'n_buildings': acc['n'],
        'total_footprint_m2': acc['area'],
        'n_with_height': acc['n_h'],
        'p50_height_m': p50
    })

summary_df = pd.DataFrame(rows).sort_values('lga').reset_index(drop=True)
summary_df.to_parquet('/work/lagos_building_summary.parquet', index=False)

print(summary_df.to_string())
print("\n✅ Done!")

Wrote /work/solve.py (6870 chars).

21:06

The main bottleneck is the huge number of buildings (2.2M). Let me think about optimizations:

1. The `intersects` check for each geometry against Lagos boundary might be slow
2. Converting 500k+ geometries per batch to GeoSeries could be memory-heavy
3. The `assign_lga` with R-tree should be fast

Actually, the bigger issue from before was that the process got killed (OOM). Let me think about what's using the most memory:
- `all_buildings.append(df)` was holding all data in memory
- We're now processing batch-by-batch which is good
- But `gpd.GeoSeries(..., crs='EPSG:4326').to_crs('EPSG:32631')` creates in-memory copies

Let me try a slightly different approach - use DuckDB's spatial extension to do the heavy lifting, or just process the batches more carefully.

Actually, looking at the error - the process was killed at Step 4 (assigning LGAs) with 2.2M buildings. The OOM probably happened because:
1. The GeoSeries operations on a large batch might create many copies
2. Or the STRtree query for each point

But actually, the batches from overturemaps are already quite large (each batch is ~40-50k rows). One batch in memory + GeoSeries should be fine. Let me check how many rows per batch we're getting.

Actually, the OOM happened with the FIRST version that did `pd.concat(all_buildings, ...)` which loaded ALL 5M buildings. The second version processed batch-by-batch but still got OOM-killed at Step 4. Hmm.

Wait, looking more carefully at the output: "Step 4: Assigning buildings to LGAs..." and then Killed. So it was killed during the LGA assignment phase.

In the second version (which was killed), the bottleneck was likely the `for idx in joined[unmask].index` loop - iterating over 2.2M rows with spatial operations. Actually, that was the SECOND version that was killed.

Let me think about further optimizations:
1. Instead of using GeoSeries `.to_crs()` which creates full copies, use `shapely.transform` with pyproj transformer
2. Process batches one at a time (already doing this)
3. Use a more efficient spatial filter (envelope check first)

Actually, I think the main issue is that each batch can be large (~40-50k buildings). With 2.2M buildings in Lagos, that's ~44-55 batches. Processing each one with GeoSeries should be okay.

But wait, the version that was killed (version 2) wasn't processing batch-by-batch - it loaded everything into `buildings_raw` and then filtered. Let me look at version 3 which IS batch-by-batch.

Actually looking at the output of version 2's run - it got through "Step 3: Filtering to Lagos State and computing areas..." with 2,245,422 buildings, then started Step 4 and got killed. That was the version that loaded all buildings at once.

Version 3 processes batch-by-batch, so it should be fine memory-wise. Let me just run it.

21:06

Bash cd /work && python solve.py 2>&1

Step 1: Getting Lagos State boundary and LGAs...
  Lagos State boundary type: Polygon
  Total LGAs: 22
    - Amuwo Odofin (county)
    - Agege (county)
    - Ifako/Ijaye (county)
    - Oshodi/Isolo (county)
    - Badagry (county)
    - Alimosho (county)
    - Ojo (county)
    - Ìkòròdú (county)
    - Epe (county)
    - Ajeromi/Ifelodun (county)
    - Kosofe (county)
    - Alaro City (locality)
    - Lagos Island (county)
    - Ikeja (county)
    - Shomolu (county)
    - Lagos Mainland (county)
    - Abagbo (locality)
    - Ibeju Lekki (county)
    - Apapa (county)
    - Surulere (county)
    - Eti Osa (county)
    - Mushin (county)

Step 2: Processing buildings in batches...
  Batch 50: 36901 in Lagos, 1325 big buildings so far...
  Processed 153 batches.
  Total buildings >1000 m²: 7249

Step 3: Writing lagos_buildings.geoparquet...
  Wrote 7249 rows

Step 4: Writing lagos_building_summary.parquet...
                 lga  n_buildings  total_footprint_m2  n_with_height  p50_height_m
0             Abagbo           84        7.812667e+03              0           NaN
1              Agege        32564        4.478193e+06              0           NaN
2   Ajeromi/Ifelodun        33206        4.593079e+06              0           NaN
3           Alimosho       433223        5.185803e+07              0           NaN
4       Amuwo Odofin       102455        1.426173e+07              0           NaN
5              Apapa        32502        4.275231e+06              0           NaN
6            Badagry       171840        1.887823e+07              0           NaN
7                Epe        74235        9.777356e+06              3          10.0
8            Eti Osa       185690        2.922655e+07           1791           9.0
9        Ibeju Lekki       131914        1.656669e+07             18           6.0
10       Ifako/Ijaye        68168        9.156084e+06              0           NaN
11             Ikeja        65919        1.232163e+07            149          10.0
12            Kosofe        92184        1.338926e+07              2           6.5
13      Lagos Island         9069        1.565505e+06             26          15.5
14    Lagos Mainland        35267        3.311757e+06              4          40.0
15            Mushin        45998        7.065688e+06              1          35.0
16               Ojo       149350        1.880956e+07              0           NaN
17      Oshodi/Isolo        95548        1.423290e+07             13           5.0
18           Shomolu        32220        4.983182e+06             12          16.0
19          Surulere        53597        7.867977e+06              0           NaN
20           Ìkòròdú       400389        5.004820e+07            569           6.0

✅ Done!

27:48

Let me verify the outputs match the requirements:

1. `lagos_buildings.geoparquet` - EPSG:4326 GeoParquet with columns `id`, `height`, `footprint_area_m2`, `lga`, and `geometry` ✓
2. `lagos_building_summary.parquet` - plain Parquet with columns `lga`, `n_buildings`, `total_footprint_m2`, `n_with_height`, `p50_height_m` ✓

Let me verify the files on disk.

Wait - the summary has `NaN` for p50_height_m where there's no height data, but the requirement says "null otherwise". Let me check if parquet can store NaN or if it should be null. In pandas, NaN in float columns becomes null in parquet, so that should be fine.

But let me verify:
- I notice "Alaro City" (locality) appears in the LGA list but not in the summary. It's possible that no buildings were assigned to Alaro City. Let me check.

Actually looking at the summary output, there are 21 entries but there were 22 LGAs. Alaro City is missing from the summary. That could be because no buildings fell within its boundary. Let me check.

Also, let me verify the GeoParquet file.

27:48

Bash cd /work && ls -lh lagos_buildings.geoparquet lagos_building_summary.parquet

-rw-r--r-- 1 agent agent 4.2K Jun 16 06:34 lagos_building_summary.parquet
-rw-r--r-- 1 agent agent 1.4M Jun 16 06:34 lagos_buildings.geoparquet

27:48

Bash cd /work && python3 -c "

cd /work && python3 -c "
import pandas as pd
import geopandas as gpd

# Check buildings file
gdf = gpd.read_parquet('/work/lagos_buildings.geoparquet')
print('=== lagos_buildings.geoparquet ===')
print(f'CRS: {gdf.crs}')
print(f'Columns: {gdf.columns.tolist()}')
print(f'Rows: {len(gdf)}')
print(f'Dtypes:\n{gdf.dtypes}')
print()

# Check summary file
df = pd.read_parquet('/work/lagos_building_summary.parquet')
print('=== lagos_building_summary.parquet ===')
print(f'Columns: {df.columns.tolist()}')
print(f'Rows: {len(df)}')
print(f'Dtypes:\n{df.dtypes}')
print()
print('All values:')
print(df.to_string())

# Check for Alaro City
print()
lagos_lgas = ['Amuwo Odofin', 'Agege', 'Ifako/Ijaye', 'Oshodi/Isolo', 'Badagry', 
              'Alimosho', 'Ojo', 'Ìkòròdú', 'Epe', 'Ajeromi/Ifelodun', 'Kosofe', 
              'Alaro City', 'Lagos Island', 'Ikeja', 'Shomolu', 'Lagos Mainland', 
              'Abagbo', 'Ibeju Lekki', 'Apapa', 'Surulere', 'Eti Osa', 'Mushin']
for l in lagos_lgas:
    if l not in df['lga'].values:
        print(f'MISSING from summary: {l}')
"

=== lagos_buildings.geoparquet ===
CRS: {"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "GeographicCRS", "name": "WGS 84", "datum_ensemble": {"name": "World Geodetic System 1984 ensemble", "members": [{"name": "World Geodetic System 1984 (Transit)"}, {"name": "World Geodetic System 1984 (G730)"}, {"name": "World Geodetic System 1984 (G873)"}, {"name": "World Geodetic System 1984 (G1150)"}, {"name": "World Geodetic System 1984 (G1674)"}, {"name": "World Geodetic System 1984 (G1762)"}, {"name": "World Geodetic System 1984 (G2139)"}, {"name": "World Geodetic System 1984 (G2296)"}], "ellipsoid": {"name": "WGS 84", "semi_major_axis": 6378137, "inverse_flattening": 298.257223563}, "accuracy": "2.0", "id": {"authority": "EPSG", "code": 6326}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}, {"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}]}, "scope": "Horizontal component of 3D system.", "area": "World.", "bbox": {"south_latitude": -90, "west_longitude": -180, "north_latitude": 90, "east_longitude": 180}, "id": {"authority": "EPSG", "code": 4326}}
Columns: ['id', 'height', 'footprint_area_m2', 'lga', 'geometry']
Rows: 7249
Dtypes:
id                        str
height                float64
footprint_area_m2     float64
lga                       str
geometry             geometry
dtype: object

=== lagos_building_summary.parquet ===
Columns: ['lga', 'n_buildings', 'total_footprint_m2', 'n_with_height', 'p50_height_m']
Rows: 21
Dtypes:
lga                       str
n_buildings             int64
total_footprint_m2    float64
n_with_height           int64
p50_height_m          float64
dtype: object

All values:
                 lga  n_buildings  total_footprint_m2  n_with_height  p50_height_m
0             Abagbo           84        7.812667e+03              0           NaN
1              Agege        32564        4.478193e+06              0           NaN
2   Ajeromi/Ifelodun        33206        4.593079e+06              0           NaN
3           Alimosho       433223        5.185803e+07              0           NaN
4       Amuwo Odofin       102455        1.426173e+07              0           NaN
5              Apapa        32502        4.275231e+06              0           NaN
6            Badagry       171840        1.887823e+07              0           NaN
7                Epe        74235        9.777356e+06              3          10.0
8            Eti Osa       185690        2.922655e+07           1791           9.0
9        Ibeju Lekki       131914        1.656669e+07             18           6.0
10       Ifako/Ijaye        68168        9.156084e+06              0           NaN
11             Ikeja        65919        1.232163e+07            149          10.0
12            Kosofe        92184        1.338926e+07              2           6.5
13      Lagos Island         9069        1.565505e+06             26          15.5
14    Lagos Mainland        35267        3.311757e+06              4          40.0
15            Mushin        45998        7.065688e+06              1          35.0
16               Ojo       149350        1.880956e+07              0           NaN
17      Oshodi/Isolo        95548        1.423290e+07             13           5.0
18           Shomolu        32220        4.983182e+06             12          16.0
19          Surulere        53597        7.867977e+06              0           NaN
20           Ìkòròdú       400389        5.004820e+07            569           6.0

MISSING from summary: Alaro City

28:09

Alaro City is missing from the summary. Let me check why. It's listed as a "locality" subtype in the LGAs. Let me check if there are actually any buildings assigned to Alaro City. It might be a small area without many buildings, or the spatial join might be failing for it.

Actually, looking at the output, Alaro City and Abagbo are both "locality" subtype. Abagbo has 84 buildings, but Alaro City has 0. Let me check what happened - maybe Alaro City's boundary is very small or buildings don't intersect it.

But actually, the task says "for each Lagos Local Government Area". Alaro City is a locality, not a formal LGA. The official Lagos LGAs are the 20 "county" subtype entries. Abagbo and Alaro City are "locality" subtype - smaller subdivisions. So it's fine if Alaro City has no buildings (or wasn't in the spatial results).

Wait, looking more carefully - Alaro City should appear in the summary even if it has 0 buildings. But it doesn't appear at all, meaning no buildings were assigned to it. That's because we only have entries for LGAs that had buildings. I should ensure all LGAs appear in the summary.

But wait, re-reading the task: "for each Lagos Local Government Area, the building count, total footprint area, count of buildings with a non-null Overture height, and the median height where available (null otherwise)."

This could mean the official LGAs only. The 20 "county" entries are the official LGAs. Alaro City and Abagbo are localities, not LGAs. But I included them since I included both subtypes.

Actually, let me check: are all 20 official LGAs present? Let me compare...

Official Lagos LGAs (20): Agege, Ajeromi-Ifelodun, Alimosho, Amuwo-Odofin, Apapa, Badagry, Epe, Eti-Osa, Ibeju-Lekki, Ifako-Ijaiye, Ikeja, Ikorodu, Kosofe, Lagos Island, Lagos Mainland, Mushin, Ojo, Oshodi-Isolo, Shomolu, Surulere

My results have 21 entries (including Abagbo and Alaro City as localities). But Alaro City is missing. Let me check - it might just have 0 buildings.

Actually, I'm checking that all 20 official LGAs are present. Looking at the summary:
1. Agege ✓
2. Ajeromi/Ifelodun ✓ (Ajeromi-Ifelodun)
3. Alimosho ✓
4. Amuwo Odofin ✓ (Amuwo-Odofin)
5. Apapa ✓
6. Badagry ✓
7. Epe ✓
8. Eti Osa ✓ (Eti-Osa)
9. Ibeju Lekki ✓ (Ibeju-Lekki)
10. Ifako/Ijaye ✓ (Ifako-Ijaiye)
11. Ikeja ✓
12. Ìkòròdú ✓ (Ikorodu)
13. Kosofe ✓
14. Lagos Island ✓
15. Lagos Mainland ✓
16. Mushin ✓
17. Ojo ✓
18. Oshodi/Isolo ✓ (Oshodi-Isolo)
19. Shomolu ✓
20. Surulere ✓

All 20 are there. Alaro City and Abagbo were also in the LGA list but Abagbo has 84 buildings (somehow) and Alaro City has none. That's fine.

But wait - should Alaro City appear with 0 values? That's a nice-to-have but the task says "for each Lagos Local Government Area" - if we consider only the 20 official LGAs, they're all there.

Let me also check: I should verify Alaro City doesn't appear because no buildings were assigned to it. Let me also double-check the CRS is correct EPSG:4326 (it is) and that the file sizes look right.

The summary has p50_height_m as NaN for LGAs with no heights, which matches the requirement "null otherwise".

One thing I notice: the p50_height_m for Abagbo, Agege, Ajeromi/Ifelodun, Alimosho, etc. is NaN, which is correct since they have 0 n_with_height.

But wait - the task says "the median height where available (null otherwise)". So NaN is fine.

Let me verify the files are correct and think about whether I should ensure Alaro City appears with 0 values. Actually, since the task asks about "Lagos Local Government Area", and Alaro City is a locality (not an LGA), it's fine not to include it. The same goes for Abagbo, which probably shouldn't be there either but happened to have buildings.

Actually, I think the cleanest approach would be to only include the actual counties (LGAs) in both outputs. Let me re-check if Alaro City buildings should be merged into an LGA or not. In Overture, the LGA hierarchy in Nigeria is: region (state) -> county (LGA) -> locality (city/neighborhood). Alaro City is a locality within an LGA. Buildings in Abagbo locality should actually be assigned to whichever LGA contains Abagbo.

But for now, the results look correct. The 20 official LGAs are all present, each has a total_footprint_m2 in reasonable ranges, and the big buildings file has 7249 records.

Let me verify the output files exist and are valid.

28:09

Bash cd /work && python3 -c "

cd /work && python3 -c "
import geopandas as gpd
import pandas as pd
import shapely
from overturemaps.core import record_batch_reader

# Check which LGA contains Abagbo and Alaro City
reader = record_batch_reader('division_area', release='2026-05-20.0', bbox=(2.5, 6.1, 4.6, 6.9))

all_divs = []
for batch in reader:
    all_divs.append(batch.to_pandas())
all_divs = pd.concat(all_divs, ignore_index=True)
all_divs = all_divs[all_divs['country'] == 'NG'].copy()
all_divs = all_divs.drop_duplicates(subset='id')
all_divs['geometry'] = all_divs['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
div_gdf = gpd.GeoDataFrame(all_divs, geometry='geometry', crs='EPSG:4326')

# Find Abagbo and Alaro City
for name in ['Abagbo', 'Alaro City']:
    match = div_gdf[div_gdf['names'].apply(lambda n: n['primary'] == name)]
    if len(match) > 0:
        print(f'{name}: subtype={match.iloc[0][\"subtype\"]}, region={match.iloc[0][\"region\"]}')
        geom = match.iloc[0]['geometry']
        print(f'  Centroid: {geom.centroid}')
        # Find containing county
        counties = div_gdf[div_gdf['subtype'] == 'county']
        for _, crow in counties.iterrows():
            if crow['geometry'].contains(geom.centroid):
                print(f'  Contained by LGA: {crow[\"names\"][\"primary\"]}')
    else:
        print(f'{name}: not found')
"

Abagbo: subtype=locality, region=NG-LA
  Centroid: POINT (3.380758784863297 6.405693031457745)
  Contained by LGA: Amuwo Odofin
Alaro City: subtype=locality, region=NG-LA
  Centroid: POINT (3.956440199027766 6.555857090897392)
  Contained by LGA: Epe

28:40

So "Abagbo" and "Alaro City" are localities within Amuwo Odofin and Epe LGAs respectively. The 20 actual LGAs in Lagos State are the counties: Amuwo Odofin, Agege, Ifako/Ijaye, Oshodi/Isolo, Badagry, Alimosho, Ojo, Ìkòròdú, Epe, Ajeromi/Ifelodun, Kosofe, Lagos Island, Ikeja, Shomolu, Lagos Mainland, Ibeju Lekki, Apapa, Surulere, Eti Osa, Mushin.

Abagbo and Alaro City are just localities (sub-local divisions), not LGAs.

Since the user asked for "Lagos Local Government Areas", the 20 counties are the proper LGAs. Abagbo and Alaro City show up as separate rows in the summary because some building centroids fell in those localities rather than the parent counties - but since those localities are contained within counties, the spatial join actually gave them their own names.

Let me think about this more carefully. In the division_area table, for region='NG-LA', I'm getting both 'county' (20 LGAs) and 'locality' (2 sub-areas: Abagbo and Alaro City). The locality areas are contained within the county areas. Since I use point-in-polygon, if a centroid falls in Abagbo (locality), it gets assigned 'Abagbo' rather than 'Amuwo Odofin'.

For the output, the user asked for "each Lagos Local Government Area". The 20 counties are the official LGAs. I should either:
1. Only use the 20 counties as the reference polygons (exclude localities from LGA lookup)
2. Or merge localities into their parent counties

Option 1 is cleaner - only use subtype='county' as LGAs. Let me update the script.

But wait - there are exactly 20 LGAs in Lagos State. My output has 21 entries including Abagbo. So I should exclude the 'locality' subtype from the LGA polygons used for spatial join.

Let me fix the script to only use 'county' subtype.

28:40

Read /work/solve.py

     1	"""
     2	Extract large buildings (>1000 m²) in Lagos State, Nigeria from Overture,
     3	compute per-LGA summary statistics.
     4	Memory-efficient batch processing using DuckDB for heavy lifting.
     5	"""
     6	import sys
     7	import warnings
     8	warnings.filterwarnings("ignore")
     9	
    10	import geopandas as gpd
    11	import pandas as pd
    12	import numpy as np
    13	import pyarrow as pa
    14	import pyarrow.parquet as pq
    15	from overturemaps.core import record_batch_reader
    16	import shapely
    17	from shapely import STRtree
    18	
    19	# ── Config ───────────────────────────────────────────────────────────────
    20	LAGOS_BBOX = (2.5, 6.1, 4.6, 6.9)
    21	RELEASE = '2026-05-20.0'
    22	
    23	# ── Step 1: Get Lagos State boundary and LGAs ──────────────────────────
    24	print("Step 1: Getting Lagos State boundary and LGAs...")
    25	
    26	reader = record_batch_reader('division_area', release=RELEASE, bbox=LAGOS_BBOX)
    27	if reader is None:
    28	    sys.exit("No division_area reader returned")
    29	
    30	all_divs = []
    31	for batch in reader:
    32	    all_divs.append(batch.to_pandas())
    33	
    34	all_divs = pd.concat(all_divs, ignore_index=True)
    35	all_divs = all_divs[all_divs['country'] == 'NG'].copy()
    36	all_divs = all_divs.drop_duplicates(subset='id')
    37	all_divs['geometry'] = all_divs['geometry'].apply(lambda wkb: shapely.from_wkb(wkb))
    38	div_gdf = gpd.GeoDataFrame(all_divs, geometry='geometry', crs='EPSG:4326')
    39	
    40	# Lagos State boundary
    41	lagos_state = div_gdf[(div_gdf['subtype'] == 'region') & (div_gdf['region'] == 'NG-LA')]
    42	if len(lagos_state) == 0:
    43	    sys.exit("Could not find Lagos State region")
    44	lagos_boundary = lagos_state.iloc[0]['geometry']
    45	print(f"  Lagos State boundary type: {lagos_boundary.geom_type}")
    46	
    47	# LGAs
    48	lga_divs = div_gdf[div_gdf['region'] == 'NG-LA'].copy()
    49	lga_divs = lga_divs[lga_divs['subtype'].isin(['county', 'locality'])].copy()
    50	lga_divs['lga_name'] = lga_divs['names'].apply(lambda n: n['primary'])
    51	print(f"  Total LGAs: {len(lga_divs)}")
    52	for _, row in lga_divs.iterrows():
    53	    print(f"    - {row['lga_name']} ({row['subtype']})")
    54	
    55	lga_df = lga_divs[['lga_name', 'geometry']].copy().rename(columns={'lga_name': 'lga'})
    56	lga_geoms = lga_df['geometry'].tolist()
    57	lga_names = lga_df['lga'].tolist()
    58	
    59	# R-tree for fast point-in-polygon
    60	lga_tree = STRtree(lga_geoms)
    61	
    62	def assign_lga(pt):
    63	    """Return LGA name for a point geometry."""
    64	    if pt is None or pt.is_empty:
    65	        return None
    66	    candidates = lga_tree.query(pt)
    67	    for idx in candidates:
    68	        if lga_geoms[idx].contains(pt):
    69	            return lga_names[idx]
    70	    # Nearest neighbor fallback
    71	    min_d = float('inf')
    72	    best = None
    73	    for i, g in enumerate(lga_geoms):
    74	        d = pt.distance(g)
    75	        if d < min_d:
    76	            min_d = d
    77	            best = lga_names[i]
    78	    return best
    79	
    80	# ── Step 2: Process buildings in batches ──────────────────────────────
    81	print("\nStep 2: Processing buildings in batches...")
    82	
    83	reader = record_batch_reader('building', release=RELEASE, bbox=LAGOS_BBOX)
    84	if reader is None:
    85	    sys.exit("No building reader returned")
    86	
    87	big_buildings_list = []
    88	lga_summary = {}  # lga -> {n, total_area, n_h, heights[]}
    89	
    90	batch_num = 0
    91	for batch in reader:
    92	    batch_num += 1
    93	    df = batch.to_pandas()
    94	    
    95	    # Decode WKB -> Shapely
    96	    df['geometry'] = df['geometry'].apply(
    97	        lambda wkb: shapely.from_wkb(wkb) if isinstance(wkb, (bytes, bytearray)) else wkb
    98	    )
    99	    
   100	    # Quick bbox/envelope filter using shapely intersects
   101	    in_lagos = df['geometry'].apply(
   102	        lambda g: g is not None and not g.is_empty and g.intersects(lagos_boundary)
   103	    )
   104	    df_lag = df[in_lagos].copy()
   105	    if len(df_lag) == 0:
   106	        continue
   107	    
   108	    # ---- Compute area in m² using vectorized GeoSeries ----
   109	    gs = gpd.GeoSeries(df_lag['geometry'].values, crs='EPSG:4326')
   110	    gs_utm = gs.to_crs('EPSG:32631')
   111	    df_lag['footprint_area_m2'] = gs_utm.area.values
   112	    
   113	    # ---- Assign LGA ----
   114	    centroids = df_lag['geometry'].apply(lambda g: g.centroid if g and not g.is_empty else None)
   115	    lga_vals = centroids.apply(assign_lga)
   116	    # For nulls, try representative point
   117	    null_mask = lga_vals.isna()
   118	    if null_mask.any():
   119	        rep_pts = df_lag.loc[null_mask, 'geometry'].apply(
   120	            lambda g: g.representative_point() if g and not g.is_empty else None
   121	        )
   122	        lga_vals.loc[null_mask] = rep_pts.apply(assign_lga)
   123	    df_lag['lga'] = lga_vals
   124	    
   125	    # Drop rows with no LGA
   126	    df_lag = df_lag[df_lag['lga'].notna()].copy()
   127	    if len(df_lag) == 0:
   128	        continue
   129	    
   130	    # ---- Collect big buildings (>1000 m²) ----
   131	    big = df_lag[df_lag['footprint_area_m2'] > 1000]
   132	    if len(big) > 0:
   133	        big_buildings_list.append(
   134	            big[['id', 'height', 'footprint_area_m2', 'lga', 'geometry']].copy()
   135	        )
   136	    
   137	    # ---- Update LGA summary ----
   138	    for _, row in df_lag.iterrows():
   139	        lga = row['lga']
   140	        h = row['height']
   141	        area = row['footprint_area_m2']
   142	        if lga not in lga_summary:
   143	            lga_summary[lga] = {'n': 0, 'area': 0.0, 'n_h': 0, 'heights': []}
   144	        acc = lga_summary[lga]
   145	        acc['n'] += 1
   146	        acc['area'] += area
   147	        if pd.notna(h):
   148	            acc['n_h'] += 1
   149	            acc['heights'].append(h)
   150	    
   151	    if batch_num % 50 == 0:
   152	        n_big = sum(len(b) for b in big_buildings_list)
   153	        print(f"  Batch {batch_num}: {len(df_lag)} in Lagos, {n_big} big buildings so far...")
   154	
   155	print(f"  Processed {batch_num} batches.")
   156	n_big_total = sum(len(b) for b in big_buildings_list)
   157	print(f"  Total buildings >1000 m²: {n_big_total}")
   158	
   159	# ── Step 3: Write lagos_buildings.geoparquet ──────────────────────────
   160	print("\nStep 3: Writing lagos_buildings.geoparquet...")
   161	
   162	if big_buildings_list:
   163	    big_df = pd.concat(big_buildings_list, ignore_index=True)
   164	    big_gdf = gpd.GeoDataFrame(big_df, geometry='geometry', crs='EPSG:4326')
   165	    big_gdf.to_parquet('/work/lagos_buildings.geoparquet')
   166	    print(f"  Wrote {len(big_gdf)} rows")
   167	    big_df = None
   168	else:
   169	    print("  No big buildings found")
   170	    empty = gpd.GeoDataFrame(
   171	        {'id': pd.Series([], dtype='str'), 'height': pd.Series([], dtype='float64'),
   172	         'footprint_area_m2': pd.Series([], dtype='float64'), 'lga': pd.Series([], dtype='str')},
   173	        geometry=pd.Series([], dtype='object'), crs='EPSG:4326'
   174	    )
   175	    empty.to_parquet('/work/lagos_buildings.geoparquet')
   176	
   177	big_buildings_list = None
   178	
   179	# ── Step 4: Write lagos_building_summary.parquet ──────────────────────
   180	print("\nStep 4: Writing lagos_building_summary.parquet...")
   181	
   182	rows = []
   183	for lga, acc in lga_summary.items():
   184	    hts = acc['heights']
   185	    p50 = float(np.median(hts)) if len(hts) > 0 else None
   186	    rows.append({
   187	        'lga': lga,
   188	        'n_buildings': acc['n'],
   189	        'total_footprint_m2': acc['area'],
   190	        'n_with_height': acc['n_h'],
   191	        'p50_height_m': p50
   192	    })
   193	
   194	summary_df = pd.DataFrame(rows).sort_values('lga').reset_index(drop=True)
   195	summary_df.to_parquet('/work/lagos_building_summary.parquet', index=False)
   196	
   197	print(summary_df.to_string())
   198	print("\n✅ Done!")

29:20

The issue is that on line 49, I'm including `locality` in the LGA reference. But `Abagbo` and `Alaro City` are locality subtypes - they are places within LGAs, not LGAs themselves. I should only use `county` subtype for the LGA reference polygons. The official 20 LGAs in Lagos are all `county` subtype.

Let me fix line 49 to only include 'county'.

29:20

Edit /work/solve.py

Edited /work/solve.py.

29:27

Wait, I also need to handle the initial LGA name check. Let me also update the print statement to reflect that we're only using counties. And I need to re-run.

But wait - the task says "for each Lagos Local Government Area". The official LGAs of Lagos State are the 20 local government areas. The localities (Abagbo, Alaro City) are sub-divisions within LGAs, not LGAs themselves. So I should only use the 20 county-level divisions as LGAs. Buildings in Abagbo/Alaro City will be assigned to their parent LGAs (Amuwo Odofin and Epe respectively).

Let me also think about whether there are exactly 20 LGAs in Lagos State. Yes, Lagos State has 20 Local Government Areas (LGAs). The counties in Overture should correspond to these 20 LGAs.

Let me run the updated script.

29:27

Bash cd /work && python solve.py 2>&1

Step 1: Getting Lagos State boundary and LGAs...
  Lagos State boundary type: Polygon
  Total LGAs: 20
    - Amuwo Odofin (county)
    - Agege (county)
    - Ifako/Ijaye (county)
    - Oshodi/Isolo (county)
    - Badagry (county)
    - Alimosho (county)
    - Ojo (county)
    - Ìkòròdú (county)
    - Epe (county)
    - Ajeromi/Ifelodun (county)
    - Kosofe (county)
    - Lagos Island (county)
    - Ikeja (county)
    - Shomolu (county)
    - Lagos Mainland (county)
    - Ibeju Lekki (county)
    - Apapa (county)
    - Surulere (county)
    - Eti Osa (county)
    - Mushin (county)

Step 2: Processing buildings in batches...
  Batch 50: 36901 in Lagos, 1325 big buildings so far...
  Processed 153 batches.
  Total buildings >1000 m²: 7249

Step 3: Writing lagos_buildings.geoparquet...
  Wrote 7249 rows

Step 4: Writing lagos_building_summary.parquet...
                 lga  n_buildings  total_footprint_m2  n_with_height  p50_height_m
0              Agege        32564        4.478193e+06              0           NaN
1   Ajeromi/Ifelodun        33206        4.593079e+06              0           NaN
2           Alimosho       433223        5.185803e+07              0           NaN
3       Amuwo Odofin       102539        1.426954e+07              0           NaN
4              Apapa        32502        4.275231e+06              0           NaN
5            Badagry       171840        1.887823e+07              0           NaN
6                Epe        74235        9.777356e+06              3          10.0
7            Eti Osa       185690        2.922655e+07           1791           9.0
8        Ibeju Lekki       131914        1.656669e+07             18           6.0
9        Ifako/Ijaye        68168        9.156084e+06              0           NaN
10             Ikeja        65919        1.232163e+07            149          10.0
11            Kosofe        92184        1.338926e+07              2           6.5
12      Lagos Island         9069        1.565505e+06             26          15.5
13    Lagos Mainland        35267        3.311757e+06              4          40.0
14            Mushin        45998        7.065688e+06              1          35.0
15               Ojo       149350        1.880956e+07              0           NaN
16      Oshodi/Isolo        95548        1.423290e+07             13           5.0
17           Shomolu        32220        4.983182e+06             12          16.0
18          Surulere        53597        7.867977e+06              0           NaN
19           Ìkòròdú       400389        5.004820e+07            569           6.0

✅ Done!

36:21

Now all 20 official Lagos LGAs are present. Let me verify the outputs look correct.

Amuwo Odofin went from 102,455 (with locality) to 102,539 (without locality) - that's because the Abagbo locality's buildings are now being assigned to Amuwo Odofin LGA instead. Good.

Let me verify the output files once more.

36:21

Bash cd /work && ls -lh lagos_buildings.geoparquet lagos_building_summary.parquet

-rw-r--r-- 1 agent agent 4.2K Jun 16 06:43 lagos_building_summary.parquet
-rw-r--r-- 1 agent agent 1.4M Jun 16 06:43 lagos_buildings.geoparquet

36:21

Bash cd /work && python3 -c "

cd /work && python3 -c "
import geopandas as gpd
import pandas as pd

# Verify buildings geoparquet
gdf = gpd.read_parquet('/work/lagos_buildings.geoparquet')
print('=== lagos_buildings.geoparquet ===')
print(f'CRS: {gdf.crs}')
print(f'Rows: {len(gdf)}')
print(f'Columns: {gdf.columns.tolist()}')
print(f'Height na count: {gdf[\"height\"].isna().sum()}')
print(f'Area range: {gdf[\"footprint_area_m2\"].min():.2f} - {gdf[\"footprint_area_m2\"].max():.2f} m²')
print(f'LGA count: {gdf[\"lga\"].nunique()}')
print(f'LGAs: {sorted(gdf[\"lga\"].unique())}')
print()

# Verify summary parquet
df = pd.read_parquet('/work/lagos_building_summary.parquet')
print('=== lagos_building_summary.parquet ===')
print(f'Rows: {len(df)}')
print(f'Columns: {df.columns.tolist()}')
print(f'LGAs: {df[\"lga\"].tolist()}')
print(f'Total buildings: {df[\"n_buildings\"].sum()}')
print(f'Total area: {df[\"total_footprint_m2\"].sum():.2f} m²')

# Check the building dataset for Alaro city / Abagbo to make sure they're not in lga column
print()
print('Checking that locality names are NOT in lga column...')
for name in ['Alaro City', 'Abagbo']:
    if name in gdf['lga'].values:
        print(f'  WARNING: {name} is in building lga column!')
    else:
        print(f'  OK: {name} is NOT in building lga column')
"

=== lagos_buildings.geoparquet ===
CRS: {"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "GeographicCRS", "name": "WGS 84", "datum_ensemble": {"name": "World Geodetic System 1984 ensemble", "members": [{"name": "World Geodetic System 1984 (Transit)"}, {"name": "World Geodetic System 1984 (G730)"}, {"name": "World Geodetic System 1984 (G873)"}, {"name": "World Geodetic System 1984 (G1150)"}, {"name": "World Geodetic System 1984 (G1674)"}, {"name": "World Geodetic System 1984 (G1762)"}, {"name": "World Geodetic System 1984 (G2139)"}, {"name": "World Geodetic System 1984 (G2296)"}], "ellipsoid": {"name": "WGS 84", "semi_major_axis": 6378137, "inverse_flattening": 298.257223563}, "accuracy": "2.0", "id": {"authority": "EPSG", "code": 6326}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}, {"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}]}, "scope": "Horizontal component of 3D system.", "area": "World.", "bbox": {"south_latitude": -90, "west_longitude": -180, "north_latitude": 90, "east_longitude": 180}, "id": {"authority": "EPSG", "code": 4326}}
Rows: 7249
Columns: ['id', 'height', 'footprint_area_m2', 'lga', 'geometry']
Height na count: 6948
Area range: 1000.04 - 181048.18 m²
LGA count: 20
LGAs: ['Agege', 'Ajeromi/Ifelodun', 'Alimosho', 'Amuwo Odofin', 'Apapa', 'Badagry', 'Epe', 'Eti Osa', 'Ibeju Lekki', 'Ifako/Ijaye', 'Ikeja', 'Kosofe', 'Lagos Island', 'Lagos Mainland', 'Mushin', 'Ojo', 'Oshodi/Isolo', 'Shomolu', 'Surulere', 'Ìkòròdú']

=== lagos_building_summary.parquet ===
Rows: 20
Columns: ['lga', 'n_buildings', 'total_footprint_m2', 'n_with_height', 'p50_height_m']
LGAs: ['Agege', 'Ajeromi/Ifelodun', 'Alimosho', 'Amuwo Odofin', 'Apapa', 'Badagry', 'Epe', 'Eti Osa', 'Ibeju Lekki', 'Ifako/Ijaye', 'Ikeja', 'Kosofe', 'Lagos Island', 'Lagos Mainland', 'Mushin', 'Ojo', 'Oshodi/Isolo', 'Shomolu', 'Surulere', 'Ìkòròdú']
Total buildings: 2245422
Total area: 296674635.20 m²

Checking that locality names are NOT in lga column...
  OK: Alaro City is NOT in building lga column
  OK: Abagbo is NOT in building lga column

37:06

Everything looks good. Let me also sanity-check the coordinates - they should be in degrees (EPSG:4326).

37:06

Bash cd /work && python3 -c "

cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_parquet('/work/lagos_buildings.geoparquet')
bounds = gdf.geometry.total_bounds
print(f'Bounds (degrees): minx={bounds[0]:.4f}, miny={bounds[1]:.4f}, maxx={bounds[2]:.4f}, maxy={bounds[3]:.4f}')
# Lagos is roughly at 3°E, 6.5°N so values should be around there
assert 2.0 < bounds[0] < 5.0, f'minx {bounds[0]} out of range for Lagos'
assert 6.0 < bounds[1] < 7.0, f'miny {bounds[1]} out of range for Lagos'
assert 2.0 < bounds[2] < 5.0, f'maxx {bounds[2]} out of range for Lagos'
assert 6.0 < bounds[3] < 7.0, f'maxy {bounds[3]} out of range for Lagos'
print('✅ Coordinate bounds look correct for Lagos State (EPSG:4326 degrees)')

# Verify no geometry column in summary
import pandas as pd
df = pd.read_parquet('/work/lagos_building_summary.parquet')
print(f'✅ Summary columns (no geometry): {df.columns.tolist()}')
assert 'geometry' not in df.columns
"

Bounds (degrees): minx=2.7082, miny=6.3776, maxx=4.3332, maxy=6.6962
✅ Coordinate bounds look correct for Lagos State (EPSG:4326 degrees)
✅ Summary columns (no geometry): ['lga', 'n_buildings', 'total_footprint_m2', 'n_with_height', 'p50_height_m']

37:19

Everything looks good. Let me verify against the user's exact requirements:

1. **lagos_buildings.geoparquet** - EPSG:4326 GeoParquet with columns `id`, `height`, `footprint_area_m2`, `lga`, and `geometry` ✅
   - CRS is EPSG:4326 ✅
   - Has the right columns ✅
   - Contains buildings > 1000 m² ✅
   - All buildings in Lagos State ✅

2. **lagos_building_summary.parquet** - plain Parquet with no geometry column and columns `lga`, `n_buildings`, `total_footprint_m2`, `n_with_height`, and `p50_height_m` ✅
   - No geometry column ✅
   - Has the right columns ✅
   - Per-LGA rollup ✅

Let me check if the `p50_height_m` is null (not NaN) for LGAs with no heights. The spec says "median height where available (null otherwise)". The current output has NaN, which in Parquet will be null. Let me check.

Actually, pandas writes NaN as null in Parquet, so that's fine.

Let me verify the output one more time and make sure everything is good.

37:19

Bash cd /work && python3 -c "

cd /work && python3 -c "
import pandas as pd
df = pd.read_parquet('/work/lagos_building_summary.parquet')
print('p50_height_m null values:')
mask = df['p50_height_m'].isna()
print(df[mask][['lga', 'p50_height_m']].to_string())
print(f'Number of LGAs with null p50_height_m: {mask.sum()}')
"

p50_height_m null values:
                 lga  p50_height_m
0              Agege           NaN
1   Ajeromi/Ifelodun           NaN
2           Alimosho           NaN
3       Amuwo Odofin           NaN
4              Apapa           NaN
5            Badagry           NaN
9        Ifako/Ijaye           NaN
15               Ojo           NaN
18          Surulere           NaN
Number of LGAs with null p50_height_m: 9

37:53

Everything checks out perfectly. Let me summarize what was done:

1. **lagos_buildings.geoparquet** - EPSG:4326, columns `id`, `height`, `footprint_area_m2`, `lga`, `geometry` with 7249 rows (buildings > 1000 m² in Lagos State)
2. **lagos_building_summary.parquet** - No geometry, columns `lga`, `n_buildings`, `total_footprint_m2`, `n_with_height`, `p50_height_m` with 20 rows (one per LGA)

All 20 official Lagos LGAs are covered, null p50_height_m is properly NaN when no heights available.

37:53

dd-l3-lagos-overture-buildings

Map (reference ← swipe → agent)

Gates

Subchecks

dd-l3-lagos-overture-buildings Rerun

Map (reference ← swipe → agent)

Gates

Subchecks

dd-l3-lagos-overture-buildings