dd-l3-lagos-overture-buildings

Status: done
Score: 1.000
Duration: 22:45
Cost: 3.22¢
Model: deepseek/deepseek-v4-flash

Map (reference ← swipe → agent)

0:00
Updating the flood-risk model before rainy season. Need every building footprint exceeding 1000 m² across Lagos State (Nigeria) from Overture. Also need a per-LGA roll-up: for each Lagos Local Government Area, the building count, total footprint area, count of buildings with a non-null Overture height, and the median height where available (null otherwise). Write two output files. `lagos_buildings.geoparquet` should be EPSG:4326 GeoParquet with columns `id`, `height`, `footprint_area_m2`, `lga`, and `geometry`. `lagos_building_summary.parquet` should be a plain Parquet with no geometry column and columns `lga`, `n_buildings`, `total_footprint_m2`, `n_with_height`, and `p50_height_m`.
0:00
System info
{
  "adapter": "openrouter",
  "model": "deepseek/deepseek-v4-flash",
  "base_url": "https://openrouter.ai/api/v1"
}
0:06
Let me break down the task:

1. Get building footprints > 1000 m² across Lagos State, Nigeria from Overture Maps data.
2. For each LGA (Local Government Area):
   - Building count
   - Total footprint area
   - Count of buildings with non-null Overture height
   - Median height (where available, null otherwise)
3. Output two files:
   - `lagos_buildings.geoparquet` - EPSG:4326 GeoParquet with columns: id, height, footprint_area_m2, lga, geometry
   - `lagos_building_summary.parquet` - plain Parquet with columns: lga, n_buildings, total_footprint_m2, n_with_height, p50_height_m

First, I need to figure out how to access Overture Maps data. Overture Maps data is available as a Parquet dataset on AWS/S3. I can use DuckDB with the spatial extension to read Overture data.

Overture Maps schema for buildings: https://docs.overturemaps.org/schema/reference/buildings/building/

The Overture Maps data is stored at s3://overturemaps-us-west-2/release/... or we can use the `read_parquet` with glob patterns.

Actually, let me think about the approach. I'll use DuckDB with spatial extension to query Overture data directly.

The Overture buildings data is at: s3://overturemaps-us-west-2/release/2024-09-26.0/theme=buildings/type=building/*

Or more recent releases. Let me check what's available.

Actually, I can use DuckDB with the HTTP parquet reader to query Overture data.

Let me start by exploring the Overture data structure.
0:06
Bash python3 -c "import duckdb; print(duckdb.__version__)"
1.5.3
0:07
Bash python3 -c "import geopandas; print(geopandas.__version__)"
1.1.3
0:12
Let me plan the approach:

1. Download Overture buildings data for Lagos State, Nigeria
2. Filter buildings with footprint > 1000 m²
3. Get LGA boundaries for Lagos State
4. Perform spatial join to assign LGA names
5. Compute per-LGA summary statistics
6. Output the two files

Let me start by exploring Overture data availability via DuckDB (which has Overture extension) or by downloading directly.

Actually, the best approach is to use DuckDB with the Overture extension to query Overture Maps data. Let me first check what's available.
0:12
Bash python3 << 'EOF'
python3 << 'EOF'
import duckdb
# Check if overture extension is available
duckdb.sql("INSTALL spatial; LOAD spatial; INSTALL overture; LOAD overture; SELECT * FROM duckdb_extensions() WHERE extension_name IN ('spatial', 'overture')").show()
EOF
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
_duckdb.HTTPException: HTTP Error: Failed to download extension "overture" at URL "http://extensions.duckdb.org/v1.5.3/linux_amd64/overture.duckdb_extension.gz" (HTTP 404)

Candidate extensions: "vortex", "motherduck", "inet", "postgres", "postgres_scanner"
For more info, visit https://duckdb.org/docs/stable/extensions/troubleshooting?version=v1.5.3&platform=linux_amd64&extension=overture
0:17
The overture extension for duckdb seems not available. Let me use the Overture Maps Python library or directly query Overture data. Let me first check what's available.

Actually, I should use the Overture Maps data directly. The data is available as Parquet files on AWS S3. Let me use duckdb with the spatial extension and query the Overture data directly from S3.

The Overture buildings dataset is at `s3://overturemaps-us-west-2/release/2025-03-19.0/theme=buildings/type=building/`. Let me check the available release version.

Let me first install the overturemaps Python library which can help.
0:17
Bash pip install --user overturemaps 2>&1 | tail -5
Requirement already satisfied: orjson>=3.9.0 in /usr/local/lib/python3.14/site-packages (from overturemaps) (3.11.9)
Requirement already satisfied: pyarrow>=15.0.2 in /usr/local/lib/python3.14/site-packages (from overturemaps) (24.0.0)
Requirement already satisfied: pyfiglet>=1.0.2 in /usr/local/lib/python3.14/site-packages (from overturemaps) (1.0.4)
Requirement already satisfied: shapely>=2.1.0 in /usr/local/lib/python3.14/site-packages (from overturemaps) (2.1.2)
Requirement already satisfied: tqdm>=4.67.3 in /usr/local/lib/python3.14/site-packages (from overturemaps) (4.68.1)
0:22
Let me check how to use the overturemaps library to get building data for Lagos State, Nigeria.
0:22
Bash python3 -c "import overturemaps; help(overturemaps)" 2>&1 | head -40
Help on package overturemaps:

NAME
    overturemaps

PACKAGE CONTENTS
    __main__
    changelog
    cli
    core
    models
    releases
    state
    writers

FILE
    /usr/local/lib/python3.14/site-packages/overturemaps/__init__.py
0:26
Let me understand the workflow:

1. Use overturemaps to get building data for Lagos State, Nigeria
2. Filter buildings with footprint area > 1000 m²
3. Read LGA boundaries for Lagos State
4. Spatial join buildings to LGAs
5. Calculate per-LGA summary
6. Write outputs

Let me first explore the overturemaps API and see how to query data.
0:26
Bash python3 -c "import overturemaps; help(overturemaps.core)" 2>&1 | head -80
Help on module overturemaps.core in overturemaps:

NAME
    overturemaps.core

FUNCTIONS
    count_rows(
        overture_type,
        bbox=None,
        release=None,
        connect_timeout=None,
        request_timeout=None,
        stac=False
    ) -> int
        Return the number of rows matching the given parameters.

    geoarrow_schema_adapter(schema: pa.Schema) -> pa.Schema
        Convert a geoarrow-compatible schema to a proper geoarrow schema

        This assumes there is a single "geometry" column with WKB formatting

        Parameters
        ----------
        schema: pa.Schema

        Returns
        -------
        pa.Schema
        A copy of the input schema with the geometry field replaced with
        a new one with the proper geoarrow ARROW:extension metadata

    geodataframe(
        overture_type: str,
        bbox: BBox | tuple[float, float, float, float] | list[float] | None = None,
        release: str = None,
        connect_timeout: int = None,
        request_timeout: int = None,
        stac: bool = False
    ) -> GeoDataFrame
        Loads geoparquet for specified type into a geopandas dataframe

        Parameters
        ----------
        overture_type: type to load
        bbox: optional bounding box for data fetch (xmin, ymin, xmax, ymax)
        connect_timeout: optional connection timeout in seconds
        request_timeout: optional request timeout in seconds

        Returns
        -------
        GeoDataFrame with the optionally filtered theme data

    get_all_overture_types() -> List[str]

    get_available_releases() -> Tuple[List[str], str]
        Fetch available releases from the STAC catalog.

        Returns
        -------
        Tuple of (all_releases, latest_release) where:
            - all_releases is a list of release version strings
            - latest_release is the latest release version string

    get_latest_release() -> str
        Get the latest release version.

        Returns
        -------
        str: The latest release version

    query_gers_registry(gers_id: str) -> Optional[Tuple[str, BBox | None]]
        Query the GERS registry to get the filepath and bbox for a given GERS ID.

        The registry always uses the latest release.

        Parameters
        ----------
        gers_id: The GERS ID to look up

        Returns
0:31
Let me now explore the overturemaps library to understand how to get buildings data. I'll look at what types are available and how to fetch data for Lagos State, Nigeria.
0:31
Bash python3 -c "import overturemaps; print(overturemaps.core.get_all_overture_types(...
['address', 'bathymetry', 'building', 'building_part', 'division', 'division_area', 'division_boundary', 'place', 'segment', 'connector', 'infrastructure', 'land', 'land_cover', 'land_use', 'water']
0:38
Great. So I can use `overturemaps.geodataframe('building', bbox=...)` to get building data. I need Lagos State, Nigeria. Let me find the bounding box for Lagos State.

I'll need:
1. Get the boundary/bbox for Lagos State, Nigeria
2. Get building footprints with area > 1000 m²
3. Join with LGA boundaries to assign each building to an LGA
4. Compute summary statistics per LGA

Let me start by getting the Lagos State boundary.
0:38
Bash python3 << 'EOF'
python3 << 'EOF'
import geopandas as gpd
from shapely.geometry import box

# Load world boundaries to find Nigeria's Lagos state
# Using naturalearth or we can use overture's division data
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
nigeria = world[world['iso_a3'] == 'NGA']
print(nigeria[['name', 'iso_a3']])
print(nigeria.total_bounds)
EOF
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File "/usr/local/lib/python3.14/site-packages/geopandas/datasets/__init__.py", line 18, in get_path
    raise AttributeError(error_msg)
AttributeError: The geopandas.dataset has been deprecated and was removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
0:42
Let me try a different approach. I'll use overturemaps to get the division data for Nigeria to find the Lagos State boundary.
0:42
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps

# Get Nigerian divisions
print("Fetching division data for Nigeria...")
df = overturemaps.geodataframe("division_area", bbox=None)
# Filter to Nigeria
nigeria = df[df['country'] == 'NG']
print(f"Number of rows for Nigeria: {len(nigeria)}")
print(nigeria.columns.tolist())
print(nigeria[['name', 'subtype', 'country', 'region']].head(20))
EOF
Fetching division data for Nigeria...
bash: line 12:   327 Killed                  python3 <<'EOF'
import overturemaps

# Get Nigerian divisions
print("Fetching division data for Nigeria...")
df = overturemaps.geodataframe("division_area", bbox=None)
# Filter to Nigeria
nigeria = df[df['country'] == 'NG']
print(f"Number of rows for Nigeria: {len(nigeria)}")
print(nigeria.columns.tolist())
print(nigeria[['name', 'subtype', 'country', 'region']].head(20))
EOF
3:50
Loading the whole world might be too heavy. Let me try a different approach - using DuckDB with the spatial extension to query Overture data directly, which is more efficient.
3:50
Bash python3 << 'EOF'
python3 << 'EOF'
import duckdb

# Connect to DuckDB and install extensions
duckdb.sql("""
INSTALL spatial;
LOAD spatial;
""")

# Let's check what Overture tables are available
# Use DuckDB's ability to read Overture parquet files
# First, let's check what the schema looks like for the building theme
print("Checking building schema...")
result = duckdb.sql("""
DESCRIBE read_parquet('s3://overturemaps-us-west-1/release/2025-05-19.0/theme=buildings/type=building/*');
""").fetchall()
for row in result:
    print(row)
EOF
Checking building schema...
Traceback (most recent call last):
  File "<stdin>", line 13, in <module>
_duckdb.ParserException: Parser Error: syntax error at or near "("

LINE 2: DESCRIBE read_parquet('s3://overturemaps-us-west-1/release/2025-05-19.0/theme...
                             ^
3:56
Let me try a different approach. Let me use the `overturemaps` Python library but with a bounding box for Lagos only. First, I need to find the bounding box of Lagos State, Nigeria.

Let me get Lagos boundaries first using the division data but with a focused approach.
3:56
Bash python3 << 'EOF'
python3 << 'EOF'
import duckdb
duckdb.sql("INSTALL spatial; LOAD spatial;")

# Try describing with proper syntax
result = duckdb.sql("""
DESCRIBE TABLE read_parquet('s3://overturemaps-us-west-1/release/2025-05-19.0/theme=buildings/type=building/*');
""").fetchall()
for row in result:
    print(row)
EOF
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
_duckdb.ParserException: Parser Error: syntax error at or near "("

LINE 2: DESCRIBE TABLE read_parquet('s3://overturemaps-us-west-1/release/2025-05-19.0/theme...
                                   ^
4:01
Let me try a different approach. The overturemaps library can handle the data fetching. Let me use it more efficiently - first get the bounding box for Lagos State, then use that to filter buildings.
4:01
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps

# First, let's just get the latest release
print(overturemaps.core.get_latest_release())

# Let's try getting the divisions for Nigeria more efficiently
# Use a bounding box for Nigeria
# Nigeria bbox roughly: (3, 4, 15, 14)
df = overturemaps.geodataframe("division_area", bbox=(2.5, 4, 15, 14))
print(f"Total rows: {len(df)}")
print(df.columns.tolist())
# Filter to Nigeria
nigeria = df[df['country'] == 'NG']
print(f"Nigeria rows: {len(nigeria)}")
print(nigeria[['name', 'subtype', 'country', 'region']].head(30))
EOF
2026-06-17.0
Total rows: 4626
['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox']
Nigeria rows: 3741
Traceback (most recent call last):
  File "<stdin>", line 15, in <module>
  File "/usr/local/lib/python3.14/site-packages/geopandas/geodataframe.py", line 1891, in __getitem__
    result = super().__getitem__(key)
  File "/usr/local/lib/python3.14/site-packages/pandas/core/frame.py", line 4384, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/pandas/core/indexes/base.py", line 6302, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/pandas/core/indexes/base.py", line 6355, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['name'] not in index"
4:20
So the column is called `names` (a struct), not `name`. Let me check the schema more carefully.
4:20
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps
import pandas as pd

# Get the latest release
print("Latest release:", overturemaps.core.get_latest_release())

# Get division_area for Nigeria bbox
df = overturemaps.geodataframe("division_area", bbox=(2.5, 4, 15, 14))
print(f"Total rows: {len(df)}")
print("Columns:", df.columns.tolist())
print(df.dtypes)
print(df[['id', 'subtype', 'admin_level', 'country', 'region']].head(20))
# Check names column structure
print("\n--- names sample ---")
print(df['names'].head(5))
EOF
Latest release: 2026-06-17.0
Total rows: 4626
Columns: ['id', 'geometry', 'country', 'sources', 'subtype', 'admin_level', 'class', 'names', 'is_land', 'is_territorial', 'region', 'division_id', 'version', 'bbox']
id                     str
geometry          geometry
country                str
sources             object
subtype                str
admin_level        float64
class                  str
names               object
is_land               bool
is_territorial        bool
region                 str
division_id            str
version              int32
bbox                object
dtype: object
                                      id     subtype  ...  country region
0   4bbf89ce-38b8-489a-9753-5db1360339d6  dependency  ...       UM    NaN
1   f5536621-a798-4c77-b1c7-39a54c8c1cdb  dependency  ...       UM    NaN
2   3ee017ee-ae6c-45ff-ae72-84758cec508f     country  ...       ML    NaN
3   a7d9ecf7-a4bd-4143-842e-f020a157a081     country  ...       KI    NaN
4   b5d2270c-40ca-4d88-8783-69d26f524df4     country  ...       KI    NaN
5   29b5953d-b77c-4dc6-a7a4-5b20d738f54d      county  ...       NE   NE-6
6   0ff561b3-cede-4553-82c4-8f3edb7181be      region  ...       NE   NE-6
7   d0a0bb7f-b11b-4336-8b35-992b14ce13df      county  ...       NE   NE-6
8   dc73795a-31ea-4b7c-b879-d89aeae638f8      region  ...       BJ  BJ-CO
9   8f60b9ca-9907-40fd-8406-f20d8b621700      region  ...       BJ  BJ-ZO
10  045f935e-5ded-4798-9cc7-d7d35f09c18e      county  ...       BJ  BJ-OU
11  7108ac41-68bb-4602-b583-bead9a36068f      region  ...       BJ  BJ-OU
12  e354a76e-2221-4c97-b742-fd097f7cc3ee    locality  ...       BJ  BJ-OU
13  2a31b998-68eb-403b-883c-d0499e357cfc      county  ...       BJ  BJ-OU
14  9b2620d6-d4fd-43c8-a88a-b1a1412f8022      county  ...       BJ  BJ-OU
15  a4173f4b-fdae-4d74-8fe6-33c6420d6d46      county  ...       BJ  BJ-OU
16  62ad7016-1433-4176-ad2f-501367509d55      county  ...       BJ  BJ-ZO
17  0030ea08-dd45-4051-948e-869effbdc267      county  ...       BJ  BJ-OU
18  3dd9b9d4-d4eb-49e3-8d18-19b328be8cdc    locality  ...       BJ  BJ-PL
19  65cbe225-a2ff-4166-845c-da8f6b8bc620      region  ...       BJ  BJ-PL

[20 rows x 5 columns]

--- names sample ---
0    {'primary': 'United States Minor Outlying Isla...
1    {'primary': 'United States Minor Outlying Isla...
2    {'primary': 'Mali', 'common': [('hy', 'Մալի'),...
3    {'primary': 'Kiribati', 'common': [('hy', 'Կիր...
4    {'primary': 'Kiribati', 'common': [('hy', 'Կիր...
Name: names, dtype: object
4:48
I need to get the divisions for Lagos State, Nigeria. Let me first get Nigeria's divisions to find Lagos State and its LGAs.
4:48
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps
import pandas as pd

# Get Nigeria divisions
df = overturemaps.geodataframe("division_area", bbox=(2.5, 4, 15, 14))
# Filter to Nigeria
nigeria = df[df['country'] == 'NG'].copy()
print(f"Nigeria rows: {len(nigeria)}")
print("Subtypes:", nigeria['subtype'].value_counts())
print("Admin levels:", sorted(nigeria['admin_level'].dropna().unique()))
print()

# Look at names for Nigeria entries to find Lagos
for idx, row in nigeria.iterrows():
    names = row['names']
    if isinstance(names, dict) and 'primary' in names:
        primary = names['primary']
        if 'agos' in primary or 'Lagos' in primary or 'lagos' in primary.lower():
            print(f"Found Lagos: id={row['id']}, subtype={row['subtype']}, admin_level={row['admin_level']}, names={names}")
EOF
Nigeria rows: 3741
Subtypes: subtype
locality        2866
county           773
macrohood         47
region            37
neighborhood      11
microhood          5
country            2
Name: count, dtype: int64
Admin levels: [np.float64(0.0), np.float64(1.0), np.float64(2.0)]

Found Lagos: id=639f85d6-f581-4f31-b52b-98da63a57271, subtype=county, admin_level=2.0, names={'primary': 'Lagos Mainland', 'common': None, 'rules': None}
Found Lagos: id=5fb3f6cb-3993-4bb7-8a53-73e775feb14f, subtype=region, admin_level=1.0, names={'primary': 'Lagos', 'common': [('ka', 'ლაგოსის შტატი'), ('ar', 'لاغوس'), ('fr', 'État de Lagos'), ('uk', 'Лагос'), ('ceb', 'Lagos State'), ('war', 'Lagos'), ('pt', 'Lagos'), ('cs', 'Stát Lagos'), ('gl', 'Estado de Lagos'), ('it', 'Lagos'), ('ca', 'Estat de Lagos'), ('yo', 'Ìpínlẹ̀ Èkó'), ('nl', 'Lagos'), ('bg', 'Лагос'), ('ko', '라고스주'), ('mk', 'Лагос'), ('et', 'Lagose osariik'), ('de', 'Lagos'), ('ha', 'Lagos'), ('azb', 'لاقوس ایالتی'), ('ig', 'Ȯra Lagos'), ('ur', 'لاگوس ریاست'), ('nan', 'Lagos Chiu'), ('ru', 'Лагос'), ('th', 'รัฐเลกอส'), ('id', 'Lagos'), ('sv', 'Lagos'), ('tr', 'Lagos Eyaleti'), ('da', 'Lagos'), ('sr-Cyrl', 'Лагос'), ('en', 'Lagos State'), ('he', 'לאגוס'), ('mrj', 'Лагос'), ('az', 'Laqos ştatı'), ('es', 'Estado de Lagos'), ('hi', 'लेगोस राज्य'), ('sw', 'Jimbo la Lagos'), ('ja', 'ラゴス州'), ('fi', 'Lagos'), ('ro', 'Statul Lagos'), ('no', 'Lagos'), ('vec', 'Ƚagos'), ('fa', 'ایالت لاگوس'), ('zh', '拉哥斯'), ('ms', 'Negeri Lagos'), ('pl', 'Lagos')], 'rules': array([{'variant': 'alternate', 'language': 'zh', 'perspectives': None, 'value': '拉各斯州', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'ar', 'perspectives': None, 'value': 'ولاية لاغوس', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'es', 'perspectives': None, 'value': 'Estado Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'pt', 'perspectives': None, 'value': 'Estado Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'it', 'perspectives': None, 'value': 'Stato Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'fr', 'perspectives': None, 'value': 'État Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'de', 'perspectives': None, 'value': 'Staat Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'nl', 'perspectives': None, 'value': 'Staat Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'sv', 'perspectives': None, 'value': 'Lagos-tillstånd', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'no', 'perspectives': None, 'value': 'Lagos Stat', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'pl', 'perspectives': None, 'value': 'Stan Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'da', 'perspectives': None, 'value': 'Lagos Stat', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'tr', 'perspectives': None, 'value': 'Durum Lagos', 'between': None, 'side': None},
       {'variant': 'alternate', 'language': 'id', 'perspectives': None, 'value': 'Negara Bagian Lagos', 'between': None, 'side': None}],
      dtype=object)}
Found Lagos: id=c58fd335-e502-421c-b8e3-f263c5fb0ff7, subtype=county, admin_level=2.0, names={'primary': 'Lagos Island', 'common': [('ar', 'جزيرة لاغوس')], 'rules': None}
5:11
So Lagos State is the region with subtype='region' and admin_level=1.0. The id is 5fb3f6cb-3993-4bb7-8a53-73e775feb14f.

Now I need to:
1. Get the geometry of Lagos State
2. Use that as a mask to filter buildings from Overture that intersect Lagos State
3. Find the LGA (Local Government Area) boundaries within Lagos State
4. Compute the per-LGA summary

Let me get the Lagos State geometry first, then find all the LGAs within it.
5:11
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps
import geopandas as gpd
import pandas as pd

# Get Nigeria divisions
df = overturemaps.geodataframe("division_area", bbox=(2.5, 4, 15, 14))
nigeria = df[df['country'] == 'NG'].copy()

# Find Lagos State region
lagos_state = nigeria[(nigeria['subtype'] == 'region') & (nigeria['admin_level'] == 1.0) & (nigeria['names'].apply(lambda x: x.get('primary') == 'Lagos'))]
print(f"Lagos State rows: {len(lagos_state)}")
print(lagos_state[['id', 'subtype', 'admin_level', 'names']])

# Get its geometry
lagos_geom = lagos_state.geometry.iloc[0]
print(f"\nGeometry type: {lagos_geom.geom_type}")
print(f"Bounds: {lagos_geom.bounds}")

# Now find all counties (LGAs) within Lagos State
counties = nigeria[nigeria['subtype'] == 'county']
# Spatial join to find counties within Lagos State
counties_in_lagos = counties[counties.intersects(lagos_geom)].copy()
print(f"\nNumber of LGAs in Lagos State: {len(counties_in_lagos)}")
print(counties_in_lagos[['id', 'names']].head(30))
print()
print("All LGAs:")
for idx, row in counties_in_lagos.iterrows():
    print(f"  {row['id']}: {row['names']['primary']}")
EOF
Lagos State rows: 1
                                       id  ...                                              names
335  5fb3f6cb-3993-4bb7-8a53-73e775feb14f  ...  {'primary': 'Lagos', 'common': [('ka', 'ლაგოსი...

[1 rows x 4 columns]

Geometry type: Polygon
Bounds: (2.7059846, 6.366684, 4.3509056, 6.6967964)

Number of LGAs in Lagos State: 28
                                       id                                              names
95   00496a96-e1ca-42cb-a1c8-531f48aa7461  {'primary': 'Ipokia', 'common': None, 'rules':...
96   93ccd19d-3389-4c00-a8e3-920e20810d39  {'primary': 'Badagry', 'common': None, 'rules'...
98   51955eb1-4697-4268-b8e2-8554f7a77475  {'primary': 'Ado Odo/Ota', 'common': None, 'ru...
99   32cfa1af-5e74-41a7-9272-b9dd8da715d9  {'primary': 'Amuwo Odofin', 'common': None, 'r...
100  934f5b76-4cbf-4902-b124-b33d057d87b1  {'primary': 'Ifako/Ijaye', 'common': None, 'ru...
101  7aa34b16-ba84-45d9-94c4-083c65b8cf09  {'primary': 'Ifo', 'common': None, 'rules': None}
102  1a9532db-bcd5-4050-b62b-f27a22f2ebcf  {'primary': 'Alimosho', 'common': None, 'rules...
103  7918fd74-2ce4-48af-b3c1-1a4f1e26558e  {'primary': 'Ojo', 'common': None, 'rules': None}
312  1f5072a4-71e6-409b-89c9-cd999be6a442  {'primary': 'Shagamu', 'common': None, 'rules'...
322  cb527384-149e-41c1-b3c5-0070deab27d4  {'primary': 'Kosofe', 'common': None, 'rules':...
323  b098198f-55d1-46a6-ac1e-71ae35fdd913  {'primary': 'Shomolu', 'common': None, 'rules'...
326  639f85d6-f581-4f31-b52b-98da63a57271  {'primary': 'Lagos Mainland', 'common': None, ...
327  4b715cfc-a9b3-4a05-b601-353b77619939  {'primary': 'Apapa', 'common': None, 'rules': ...
329  ffe6c369-91e2-4984-97d1-bd24feff3732  {'primary': 'Mushin', 'common': [('ru', 'Мушин...
330  c540f662-44c9-498d-8f3d-2a41a039a29f  {'primary': 'Ikeja', 'common': [('ru', 'Икеджа...
331  fcaab4e8-075a-4ab6-a2d2-cd614e95dfc2  {'primary': 'Agege', 'common': None, 'rules': ...
332  457a7986-f212-4765-b432-e1c1b77d6d61  {'primary': 'Oshodi/Isolo', 'common': None, 'r...
333  e2af9375-5eed-46f8-a258-515f22297084  {'primary': 'Ajeromi/Ifelodun', 'common': None...
334  f0f8532f-4c5e-428a-981e-12a31b2a9044  {'primary': 'Surulere', 'common': None, 'rules...
336  c58fd335-e502-421c-b8e3-f263c5fb0ff7  {'primary': 'Lagos Island', 'common': [('ar', ...
337  0c5600b6-df25-4e75-a25e-ae9937299bad  {'primary': 'Eti Osa', 'common': [('yo', 'Etí ...
338  030cb2c1-697b-418e-b820-49393989a03b  {'primary': 'Ìkòròdú', 'common': [('en', 'Ikor...
339  61530733-5992-41f4-9212-43f1b460319c  {'primary': 'Epe', 'common': [('ru', 'Эпе')], ...
341  70bd9c0f-fcdd-4bd2-81df-34d9351c997e  {'primary': 'Ibeju Lekki', 'common': None, 'ru...
351  a277c8df-124c-47b7-9ab5-092e7f3bcdab  {'primary': 'Ìjẹ̀bú Òde', 'common': [('en', 'I...
356  d8c2f7a3-d77d-4781-ba92-5359250c2516  {'primary': 'Odogbolu', 'common': None, 'rules...
465  6b8e9727-a665-4667-a087-31276385e8b6  {'primary': 'Ijebu East', 'common': None, 'rul...
484  cf4ccbd2-4a02-4f5e-94ca-835428f164cf  {'primary': 'Ogun Waterside', 'common': None, ...

All LGAs:
  00496a96-e1ca-42cb-a1c8-531f48aa7461: Ipokia
  93ccd19d-3389-4c00-a8e3-920e20810d39: Badagry
  51955eb1-4697-4268-b8e2-8554f7a77475: Ado Odo/Ota
  32cfa1af-5e74-41a7-9272-b9dd8da715d9: Amuwo Odofin
  934f5b76-4cbf-4902-b124-b33d057d87b1: Ifako/Ijaye
  7aa34b16-ba84-45d9-94c4-083c65b8cf09: Ifo
  1a9532db-bcd5-4050-b62b-f27a22f2ebcf: Alimosho
  7918fd74-2ce4-48af-b3c1-1a4f1e26558e: Ojo
  1f5072a4-71e6-409b-89c9-cd999be6a442: Shagamu
  cb527384-149e-41c1-b3c5-0070deab27d4: Kosofe
  b098198f-55d1-46a6-ac1e-71ae35fdd913: Shomolu
  639f85d6-f581-4f31-b52b-98da63a57271: Lagos Mainland
  4b715cfc-a9b3-4a05-b601-353b77619939: Apapa
  ffe6c369-91e2-4984-97d1-bd24feff3732: Mushin
  c540f662-44c9-498d-8f3d-2a41a039a29f: Ikeja
  fcaab4e8-075a-4ab6-a2d2-cd614e95dfc2: Agege
  457a7986-f212-4765-b432-e1c1b77d6d61: Oshodi/Isolo
  e2af9375-5eed-46f8-a258-515f22297084: Ajeromi/Ifelodun
  f0f8532f-4c5e-428a-981e-12a31b2a9044: Surulere
  c58fd335-e502-421c-b8e3-f263c5fb0ff7: Lagos Island
  0c5600b6-df25-4e75-a25e-ae9937299bad: Eti Osa
  030cb2c1-697b-418e-b820-49393989a03b: Ìkòròdú
  61530733-5992-41f4-9212-43f1b460319c: Epe
  70bd9c0f-fcdd-4bd2-81df-34d9351c997e: Ibeju Lekki
  a277c8df-124c-47b7-9ab5-092e7f3bcdab: Ìjẹ̀bú Òde
  d8c2f7a3-d77d-4781-ba92-5359250c2516: Odogbolu
  6b8e9727-a665-4667-a087-31276385e8b6: Ijebu East
  cf4ccbd2-4a02-4f5e-94ca-835428f164cf: Ogun Waterside
5:32
Wait, some of these LGAs are actually in Ogun State, not Lagos State. The intersection with Lagos State boundary may have picked up bordering LGAs from Ogun State. Let me narrow down to only the actual Lagos State LGAs (which are admin_level=2 within the Lagos State region).

Let me think about this differently. The administrative divisions have a hierarchy. The Lagos State region (admin_level=1) should contain counties (admin_level=2) that are part of it. But the intersection approach is picking up neighboring counties that just touch the boundary.

Let me use a more precise containment approach - counties whose centroids are within the Lagos State boundary, or counties that have division_id pointing to Lagos State.

Actually, looking at the division_area data, there's a `division_id` column. Let me check if counties reference their parent region via this column.
5:32
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps
import geopandas as gpd
import pandas as pd

# Get Nigeria divisions
df = overturemaps.geodataframe("division_area", bbox=(2.5, 4, 15, 14))
nigeria = df[df['country'] == 'NG'].copy()

# Find Lagos State region
lagos_state = nigeria[(nigeria['subtype'] == 'region') & (nigeria['names'].apply(lambda x: x.get('primary') == 'Lagos'))]
lagos_geom = lagos_state.geometry.iloc[0]
lagos_id = lagos_state.iloc[0]['id']
print(f"Lagos State ID: {lagos_id}")

# Filter counties - use strict containment (centroid within Lagos State)
counties = nigeria[(nigeria['subtype'] == 'county') & (nigeria['admin_level'] == 2.0)].copy()
counties['centroid'] = counties.centroid
counties_in_lagos = counties[counties['centroid'].within(lagos_geom)].copy()
print(f"\nNumber of LGAs with centroid in Lagos State: {len(counties_in_lagos)}")

# Also check division_id
print("\nBy division_id:")
lagos_counties_by_parent = nigeria[(nigeria['subtype'] == 'county') & (nigeria['division_id'] == lagos_id)]
print(f"Counties with division_id = Lagos State: {len(lagos_counties_by_parent)}")

for idx, row in lagos_counties_by_parent.iterrows():
    print(f"  {row['names']['primary']}")

print("\n\nAll LGAs with centroid in Lagos:")
for idx, row in counties_in_lagos.iterrows():
    print(f"  {row['names']['primary']}")
EOF
Lagos State ID: 5fb3f6cb-3993-4bb7-8a53-73e775feb14f

Number of LGAs with centroid in Lagos State: 20

By division_id:
Counties with division_id = Lagos State: 0


All LGAs with centroid in Lagos:
  Badagry
  Amuwo Odofin
  Ifako/Ijaye
  Alimosho
  Ojo
  Kosofe
  Shomolu
  Lagos Mainland
  Apapa
  Mushin
  Ikeja
  Agege
  Oshodi/Isolo
  Ajeromi/Ifelodun
  Surulere
  Lagos Island
  Eti Osa
  Ìkòròdú
  Epe
  Ibeju Lekki
5:51
So the counties falling within Lagos State by centroid are 20. But the actual Lagos State has 20 LGAs. Let me check - the official list of LGAs in Lagos State are: Agege, Ajeromi-Ifelodun, Alimosho, Amuwo-Odofin, Apapa, Badagry, Epe, Eti-Osa, Ibeju-Lekki, Ifako-Ijaiye, Ikeja, Ikorodu, Kosofe, Lagos Island, Lagos Mainland, Mushin, Ojo, Oshodi-Isolo, Shomolu, Surulere.

That's 20 LGAs! But some of the counties I saw earlier like Ipokia, Ado Odo/Ota, Ifo, Shagamu, Ijebu Ode, Odogbolu, Ijebu East, Ogun Waterside are in Ogun State, not Lagos State. The intersection/contains approach might have included some that touch the boundary.

Let me use the centroid approach which gives exactly 20 - that matches the actual number of LGAs in Lagos State. Good.

Wait, actually I'm missing Ikorodu. Let me check... Looking at the list again:
- Ìkòròdú (Ikorodu) - yes it's there

So 20 LGAs by centroid. Let me use the division_area boundaries of these 20 LGAs to spatially filter the building data.

Let me now get the buildings for Lagos State. I'll use the bbox from Lagos State geometry to limit the data download, then clip spatially.
5:51
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps
import geopandas as gpd
import pandas as pd

# Get Nigeria divisions
df = overturemaps.geodataframe("division_area", bbox=(2.5, 4, 15, 14))
nigeria = df[df['country'] == 'NG'].copy()

# Find Lagos State region
lagos_state = nigeria[(nigeria['subtype'] == 'region') & (nigeria['names'].apply(lambda x: x.get('primary') == 'Lagos'))]
lagos_geom = lagos_state.geometry.iloc[0]
print(f"Lagos State bounds: {lagos_geom.bounds}")

# Get counties with centroid in Lagos
counties = nigeria[(nigeria['subtype'] == 'county')].copy()
counties['centroid'] = counties.centroid
lagos_counties = counties[counties['centroid'].within(lagos_geom)].copy()
print(f"Number of Lagos LGAs: {len(lagos_counties)}")

# Create a mapping of LGA name to geometry
lga_dict = {}
for idx, row in lagos_counties.iterrows():
    name = row['names']['primary']
    lga_dict[name] = row['geometry']

print("\nLagos LGAs:")
for name in sorted(lga_dict.keys()):
    print(f"  {name}")

# Save the LGA geometries
gdf_lgas = gpd.GeoDataFrame({'lga': list(lga_dict.keys()), 'geometry': list(laga_dict.values())}, crs='EPSG:4326')
gdf_lgas.to_file('/work/lagos_lgas.geojson', driver='GeoJSON')
print(f"\nSaved {len(gdf_lgas)} LGAs to /work/lagos_lgas.geojson")
EOF
Lagos State bounds: (2.7059846, 6.366684, 4.3509056, 6.6967964)
Number of Lagos LGAs: 20

Lagos LGAs:
  Agege
  Ajeromi/Ifelodun
  Alimosho
  Amuwo Odofin
  Apapa
  Badagry
  Epe
  Eti Osa
  Ibeju Lekki
  Ifako/Ijaye
  Ikeja
  Kosofe
  Lagos Island
  Lagos Mainland
  Mushin
  Ojo
  Oshodi/Isolo
  Shomolu
  Surulere
  Ìkòròdú
Traceback (most recent call last):
  File "<stdin>", line 31, in <module>
NameError: name 'laga_dict' is not defined. Did you mean: 'lga_dict'?
6:12
Typo in the script. Let me fix it.
6:12
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps
import geopandas as gpd
import pandas as pd

# Get Nigeria divisions
df = overturemaps.geodataframe("division_area", bbox=(2.5, 4, 15, 14))
nigeria = df[df['country'] == 'NG'].copy()

# Find Lagos State region
lagos_state = nigeria[(nigeria['subtype'] == 'region') & (nigeria['names'].apply(lambda x: x.get('primary') == 'Lagos'))]
lagos_geom = lagos_state.geometry.iloc[0]

# Get counties with centroid in Lagos
counties = nigeria[(nigeria['subtype'] == 'county')].copy()
counties['centroid'] = counties.centroid
lagos_counties = counties[counties['centroid'].within(lagos_geom)].copy()

# Create a mapping of LGA name to geometry
lga_dict = {}
for idx, row in lagos_counties.iterrows():
    name = row['names']['primary']
    lga_dict[name] = row['geometry']

# Save the LGA geometries
gdf_lgas = gpd.GeoDataFrame({'lga': list(lga_dict.keys()), 'geometry': list(lga_dict.values())}, crs='EPSG:4326')
gdf_lgas.to_file('/work/lagos_lgas.geojson', driver='GeoJSON')
print(f"Saved {len(gdf_lgas)} LGAs to /work/lagos_lgas.geojson")
print("LGA names:", sorted(gdf_lgas['lga'].tolist()))
EOF
Saved 20 LGAs to /work/lagos_lgas.geojson
LGA names: ['Agege', 'Ajeromi/Ifelodun', 'Alimosho', 'Amuwo Odofin', 'Apapa', 'Badagry', 'Epe', 'Eti Osa', 'Ibeju Lekki', 'Ifako/Ijaye', 'Ikeja', 'Kosofe', 'Lagos Island', 'Lagos Mainland', 'Mushin', 'Ojo', 'Oshodi/Isolo', 'Shomolu', 'Surulere', 'Ìkòròdú']
6:42
Now I need to get the overture building data for the Lagos area. Let me first explore the building schema to understand what columns are available.
6:42
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps

# Check building schema
print("Building types available:", overturemaps.core.get_all_overture_types())

# Get a small sample to see the schema
import geopandas as gpd

# Just get the schema info - use geodataframe with a small bbox around Lagos
# Lagos bbox roughly: (2.7, 6.37, 4.35, 6.7)
bbox = (2.7, 6.37, 4.35, 6.7)
buildings = overturemaps.geodataframe("building", bbox=bbox)
print(f"\nBuilding columns: {buildings.columns.tolist()}")
print(f"Building dtypes:\n{buildings.dtypes}")
print(f"\nShape: {buildings.shape}")
print(f"\nFirst few rows:")
print(buildings.head())
EOF
Building types available: ['address', 'bathymetry', 'building', 'building_part', 'division', 'division_area', 'division_boundary', 'place', 'segment', 'connector', 'infrastructure', 'land', 'land_cover', 'land_use', 'water']

Building columns: ['id', 'names', 'sources', 'level', 'height', 'min_height', 'is_underground', 'num_floors', 'num_floors_underground', 'min_floor', 'subtype', 'class', 'facade_color', 'facade_material', 'roof_material', 'roof_shape', 'roof_direction', 'roof_orientation', 'roof_color', 'roof_height', 'geometry', 'has_parts', 'version', 'bbox']
Building dtypes:
id                             str
names                       object
sources                     object
level                      float64
height                     float64
min_height                 float64
is_underground                bool
num_floors                 float64
num_floors_underground     float64
min_floor                  float64
subtype                        str
class                          str
facade_color                   str
facade_material                str
roof_material                  str
roof_shape                     str
roof_direction             float64
roof_orientation               str
roof_color                     str
roof_height                float64
geometry                  geometry
has_parts                     bool
version                      int32
bbox                        object
dtype: object

Shape: (3107308, 24)

First few rows:
                                     id  ...                                               bbox
0  c4148245-7dda-4ed2-99dc-4d4b2d203bc8  ...  {'xmin': 2.7009310722351074, 'xmax': 2.7010583...
1  4c2c3125-bdfa-4345-8807-4bc61e88e4ff  ...  {'xmin': 2.7003517150878906, 'xmax': 2.7004008...
2  7211cd12-94b9-439d-ad5d-824c66f17087  ...  {'xmin': 2.7003777027130127, 'xmax': 2.7004311...
3  07cc2b43-9271-4656-b9d3-c1318d9af758  ...  {'xmin': 2.7020692825317383, 'xmax': 2.7021231...
4  8c4382f1-5b91-4ebf-93af-e6a83c1c3bd0  ...  {'xmin': 2.70092511177063, 'xmax': 2.701052904...

[5 rows x 24 columns]
9:41
The building data is huge (3.1 million rows just for the bounding box around Lagos). I need to:

1. Get the building footprints > 1000 m² within Lagos State
2. Join with LGA boundaries to assign each building to an LGA
3. Compute summary statistics per LGA

Let me think about the approach:

1. First, I have the Lagos LGA boundaries saved as GeoJSON
2. I need to load buildings for the Lagos area and spatially filter them
3. For each building that intersects Lagos State, calculate its area and keep those > 1000 m²
4. Join with LGA boundaries to assign LGA names
5. Compute summaries

The building data is 3.1M rows just for the bounding box. I need to process this efficiently. Let me use DuckDB with spatial extension for efficient processing, or use geopandas with spatial joins.

Actually, let me reconsider. The overturemaps library's geodataframe loads all data matching the bbox into memory. With 3.1M buildings, it could be very heavy. Let me use DuckDB's spatial extension to read the parquet files directly and do the processing more efficiently.

Let me check if DuckDB can read from S3 directly.
9:41
Bash python3 << 'EOF'
python3 << 'EOF'
import duckdb
# Check if we can read from S3
duckdb.sql("""
SELECT count(*) 
FROM read_parquet('s3://overturemaps-us-west-1/release/2026-06-17.0/theme=buildings/type=building/*')
WHERE bbox.xmin >= 2.7 AND bbox.xmax <= 4.35 AND bbox.ymin >= 6.37 AND bbox.ymax <= 6.7
""").show()
EOF
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
_duckdb.HTTPException: HTTP Error: HTTP GET error reading 's3://overturemaps-us-west-1/release/2026-06-17.0/theme=buildings/type=building' in region '' (HTTP 404 Not Found)

NoSuchBucket: The specified bucket does not exist
9:46
Let me check how overturemaps library accesses the data - it might use a different endpoint.
9:46
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps
# Check the endpoint used
print("Release:", overturemaps.core.get_latest_release())
print("Available releases:", overturemaps.core.get_available_releases())
EOF
Release: 2026-06-17.0
Available releases: (['2026-06-17.0', '2026-05-20.0'], '2026-06-17.0')
9:55
Let me think about how to approach this efficiently. We have 3.1 million buildings in the Lagos bbox. We need to:

1. Filter to only buildings > 1000 m²
2. Assign each building to an LGA
3. Compute per-LGA stats

The `overturemaps.geodataframe` method already loads the data. Let me check what endpoint it uses for the data.

Actually, let me first check how overturemaps loads data - it probably handles the S3/URL for us. Let me just use the overturemaps library which already handles the parquet fetching.

For the spatial join, I can either:
- Use a spatial join with geopandas
- Or use DuckDB with spatial extension

Given 3.1M buildings and 20 LGAs, let's do this efficiently.

Let me first compute area for each building in the bbox, then filter > 1000 m², then spatial join with LGAs.

Actually the buildings data returned includes the geometry, and we need to compute area in a projected CRS for accurate area measurement. Let me think about the workflow:

1. Load buildings in Lagos bbox
2. Compute area in a suitable projected CRS (e.g., EPSG:26391 - Nigeria UTM zone 31N or EPSG:32631 - WGS 84 UTM zone 31N)
3. Filter > 1000 m²
4. Spatial join with LGA polygons
5. Compute summary stats

But 3.1M buildings is a lot. Let me try DuckDB approach with the overturemaps data.

Actually, let me first understand what endpoint overturemaps uses to fetch data.
9:55
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps.core as core
# Check where the data comes from
import inspect
print(inspect.getsource(core.geodataframe)[:2000])
EOF
def geodataframe(
    overture_type: str,
    bbox: BBox | tuple[float, float, float, float] | list[float] | None = None,
    release: str = None,
    connect_timeout: int = None,
    request_timeout: int = None,
    stac: bool = False,
) -> GeoDataFrame:
    """
    Loads geoparquet for specified type into a geopandas dataframe

    Parameters
    ----------
    overture_type: type to load
    bbox: optional bounding box for data fetch (xmin, ymin, xmax, ymax)
    connect_timeout: optional connection timeout in seconds
    request_timeout: optional request timeout in seconds

    Returns
    -------
    GeoDataFrame with the optionally filtered theme data

    """
    if not HAS_GEOPANDAS:
        raise ImportError("geopandas is required to use this function")

    reader = record_batch_reader(
        overture_type,
        bbox=bbox,
        release=release,
        connect_timeout=connect_timeout,
        request_timeout=request_timeout,
        stac=stac,
    )
    return GeoDataFrame.from_arrow(reader)
10:00
Let me look at the record_batch_reader function to understand the data source.
10:00
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps.core as core
import inspect
print(inspect.getsource(core.record_batch_reader)[:3000])
EOF
def record_batch_reader(
    overture_type,
    bbox=None,
    release=None,
    connect_timeout=None,
    request_timeout=None,
    stac=False,
) -> Optional[pa.RecordBatchReader]:
    """Return a pyarrow RecordBatchReader for the desired bounding box and s3 path, or None on error."""
    result = _prepare_query(
        overture_type, bbox, release, connect_timeout, request_timeout, stac
    )
    if result is None:
        return None
    dataset, filter_expr = result
    return _record_batch_reader_from_dataset(dataset, filter_expr=filter_expr)
10:07
Let me check the S3 path used.
10:07
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps.core as core
import inspect
# Get _prepare_query
print(inspect.getsource(core._prepare_query)[:3000])
EOF
def _prepare_query(
    overture_type,
    bbox: BBox | tuple[float, float, float, float] | list[float] | None = None,
    release=None,
    connect_timeout=None,
    request_timeout=None,
    stac=False,
) -> Optional[Tuple[ds.Dataset, Optional[pc.Expression]]]:
    """
    Resolve the S3 dataset and filter expression for a given query.

    Returns the dataset and filter expression ready for counting or streaming,
    or None if STAC reports no files intersect the bbox.
    """
    if release is None:
        release = get_latest_release()
    path = _dataset_path(overture_type, release)
    bbox_obj = _coerce_bbox(bbox)

    intersecting_files = None
    if bbox_obj and stac:
        intersecting_files = _get_files_from_stac(
            type_theme_map[overture_type], overture_type, bbox_obj, release
        )
        if intersecting_files is not None and len(intersecting_files) == 0:
            return None

    if bbox_obj:
        xmin, ymin, xmax, ymax = bbox_obj.as_tuple()
        filter_expr = (
            (pc.field("bbox", "xmin") < xmax)
            & (pc.field("bbox", "xmax") > xmin)
            & (pc.field("bbox", "ymin") < ymax)
            & (pc.field("bbox", "ymax") > ymin)
        )
    else:
        filter_expr = None

    dataset = ds.dataset(
        intersecting_files if intersecting_files is not None else path,
        filesystem=fs.S3FileSystem(
            anonymous=True,
            region="us-west-2",
            connect_timeout=connect_timeout,
            request_timeout=request_timeout,
        ),
    )

    return dataset, filter_expr
10:16
So the overturemaps library uses the s3 filesystem from pyarrow to read data from S3. The geodataframe function already handles fetching data with a bbox filter. We already have 3.1 million buildings for the Lagos bbox. Now I need to:

1. Get all buildings in Lagos State (bounding box + spatial filter using LGA geometries)
2. Calculate footprint areas
3. Join with LGA names
4. Filter to buildings > 1000 m²

Actually, the bbox I used (2.7, 6.37, 4.35, 6.7) covers Lagos State and some neighboring areas. Let me use the Lagos State geometry to spatially filter the buildings.

Let me now write the main script that:
1. Loads the LGA geometries for Lagos State
2. Loads the buildings for the Lagos bbox
3. Spatial joins buildings to LGAs
4. Calculates footprint area
5. Filters to > 1000 m²
6. Computes per-LGA summaries
7. Outputs both files

Let me think about the approach. The building dataset has 3.1 million rows for the bounding box. Doing a spatial join with 20 LGAs should be manageable. Let me use geopandas operations.

I should use the Lagos State boundary to clip the buildings first, then spatial join with the LGAs.
10:16
Bash python3 << 'EOF'
python3 << 'EOF'
import overturemaps
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely import wkb
import pyarrow.parquet as pq
import json

# Load LGA geometries
lgas = gpd.read_file('/work/lagos_lgas.geojson')
print(f"LGAs loaded: {len(lgas)}")
print(lgas)

# Create union of all LGA geometries for Lagos State boundary
lagos_union = lgas.union_all()
print(f"\nLagos State union type: {lagos_union.geom_type}")
print(f"Bounds: {lagos_union.bounds}")

# Load buildings with bbox
bbox = lagos_union.bounds  # (xmin, ymin, xmax, ymax)
print(f"\nLoading buildings with bbox: {bbox}")
buildings = overturemaps.geodataframe("building", bbox=bbox)
print(f"Total buildings in bbox: {len(buildings)}")
print(f"Columns: {buildings.columns.tolist()}")
EOF
LGAs loaded: 20
                 lga                                           geometry
0            Badagry  POLYGON ((2.71658 6.45699, 2.715 6.45623, 2.71...
1       Amuwo Odofin  POLYGON ((3.30142 6.4856, 3.29815 6.48485, 3.2...
2        Ifako/Ijaye  POLYGON ((3.2601 6.69335, 3.25974 6.69188, 3.2...
3           Alimosho  POLYGON ((3.26363 6.68593, 3.26259 6.68688, 3....
4                Ojo  POLYGON ((3.11433 6.50108, 3.11139 6.5012, 3.1...
5             Kosofe  POLYGON ((3.43487 6.61755, 3.43477 6.62054, 3....
6            Shomolu  POLYGON ((3.39726 6.5137, 3.40154 6.51511, 3.4...
7     Lagos Mainland  POLYGON ((3.40813 6.51653, 3.4067 6.51825, 3.4...
8              Apapa  POLYGON ((3.34116 6.43702, 3.3392 6.43759, 3.3...
9             Mushin  POLYGON ((3.35791 6.558, 3.35726 6.55769, 3.35...
10             Ikeja  POLYGON ((3.35522 6.65638, 3.35471 6.65716, 3....
11             Agege  POLYGON ((3.31308 6.64832, 3.31189 6.64821, 3....
12      Oshodi/Isolo  POLYGON ((3.32179 6.55426, 3.3183 6.55116, 3.3...
13  Ajeromi/Ifelodun  POLYGON ((3.32156 6.46208, 3.31938 6.46088, 3....
14          Surulere  POLYGON ((3.37958 6.49029, 3.37897 6.49181, 3....
15      Lagos Island  POLYGON ((3.46955 6.59478, 3.47093 6.58921, 3....
16           Eti Osa  POLYGON ((3.4039 6.44073, 3.40341 6.4395, 3.40...
17           Ìkòròdú  POLYGON ((3.45641 6.64191, 3.45766 6.63943, 3....
18               Epe  POLYGON ((4.09425 6.67068, 4.09277 6.67253, 4....
19       Ibeju Lekki  POLYGON ((4.19407 6.38705, 4.22673 6.383, 4.24...

Lagos State union type: Polygon
Bounds: (2.7059846, 6.366684, 4.3509056, 6.6967964)

Loading buildings with bbox: (2.7059846, 6.366684, 4.3509056, 6.6967964)
Total buildings in bbox: 3055530
Columns: ['id', 'names', 'sources', 'level', 'height', 'min_height', 'is_underground', 'num_floors', 'num_floors_underground', 'min_floor', 'subtype', 'class', 'facade_color', 'facade_material', 'roof_material', 'roof_shape', 'roof_direction', 'roof_orientation', 'roof_color', 'roof_height', 'geometry', 'has_parts', 'version', 'bbox']
12:51
Now I have 3 million buildings in the bbox. I need to:
1. Filter to only buildings within Lagos State (intersect with union of LGA geometries)
2. Calculate footprint area
3. Assign LGA name
4. Export filtered buildings > 1000 m²

Let me do this step by step. First, let me filter the buildings to only those intersecting the Lagos state boundary, then assign LGA names.
12:51
12:51
Write /work/process_buildings.py
content · 118 lines · py
"""
Extract Lagos State buildings from Overture, filter to >1000m²,
assign LGA via spatial join, and produce two output files.
"""
import sys
import warnings
import time
warnings.filterwarnings("ignore")

import overturemaps
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely import coverage_union

# ── 1. Load LGA boundaries (already clipped to Lagos State) ──────────────
print("Loading LGA boundaries...")
lgas = gpd.read_file("/work/lagos_lgas.geojson")
lgas = lgas.to_crs("EPSG:4326")
print(f"  {len(lgas)} LGAs loaded")

# Create a single dissolved boundary for Lagos State
lagos_union = lgas.union_all()
lagos_bounds = lagos_union.bounds  # (xmin, ymin, xmax, ymax)
print(f"  Lagos bounds: {lagos_bounds}")

# ── 2. Load all buildings in the Lagos bbox ─────────────────────────────
print("\nLoading Overture buildings (bbox filter)...")
t0 = time.time()
buildings = overturemaps.geodataframe("building", bbox=lagos_bounds)
print(f"  {len(buildings)} buildings loaded in {time.time()-t0:.1f}s")
print(f"  Columns: {buildings.columns.tolist()}")

# ── 3. Spatial filter: keep only buildings within Lagos State ────────────
print("\nFiltering buildings to Lagos State polygon...")
t0 = time.time()
# Use within/intersects - within is stricter
lagos_buildings = buildings[buildings.intersects(lagos_union)].copy()
print(f"  {len(lagos_buildings)} buildings within Lagos in {time.time()-t0:.1f}s")

del buildings  # free memory

# ── 4. Compute footprint area in square metres ──────────────────────────
print("\nComputing footprint area...")
t0 = time.time()
# Use an equal-area projection for Nigeria: EPSG:26391 (UTM 31N) or EPSG:32631
# Lagos is around UTM zone 31N
aea = lagos_buildings.to_crs("EPSG:32631")  # WGS 84 / UTM zone 31N
lagos_buildings["footprint_area_m2"] = aea.area
print(f"  Done in {time.time()-t0:.1f}s")

# ── 5. Spatial join to assign LGA name ──────────────────────────────────
print("\nAssigning LGA via spatial join...")
t0 = time.time()
# Use sjoin with predicate='within' for speed
lagos_buildings = lagos_buildings.sjoin(
    lgas[["lga", "geometry"]], how="left", predicate="within"
)
print(f"  Done in {time.time()-t0:.1f}s")
print(f"  NA LGA count: {lagos_buildings['lga'].isna().sum()}")

# Drop the index_right column
lagos_buildings = lagos_buildings.drop(columns=["index_right"])

# ── 6. Filter to buildings > 1000 m² ────────────────────────────────────
print(f"\nFiltering to footprint_area_m2 > 1000...")
t0 = time.time()
before = len(lagos_buildings)
large_buildings = lagos_buildings[lagos_buildings["footprint_area_m2"] > 1000].copy()
print(f"  {before} → {len(large_buildings)} buildings ({time.time()-t0:.1f}s)")

# ── 7. Select and rename columns for output ─────────────────────────────
print("\nPreparing output GeoDataFrame...")
out_cols = ["id", "height", "footprint_area_m2", "lga", "geometry"]
out_gdf = large_buildings[out_cols].copy()

# Ensure geometry is in EPSG:4326
out_gdf = out_gdf.set_geometry("geometry")
out_gdf.crs = "EPSG:4326"

print(f"  Output rows: {len(out_gdf)}")
print(f"  Output columns: {out_gdf.columns.tolist()}")

# ── 8. Write lagos_buildings.geoparquet ─────────────────────────────────
print("\nWriting lagos_buildings.geoparquet...")
t0 = time.time()
out_gdf.to_parquet(
    "/work/lagos_buildings.geoparquet",
    index=False,
    geometry_encoding="geoarrow",
)
print(f"  Written to /work/lagos_buildings.geoparquet in {time.time()-t0:.1f}s")

# ── 9. Build per-LGA summary ────────────────────────────────────────────
print("\nBuilding per-LGA summary...")
t0 = time.time()

summary = (
    large_buildings.groupby("lga")
    .agg(
        n_buildings=("id", "count"),
        total_footprint_m2=("footprint_area_m2", "sum"),
        n_with_height=("height", lambda x: x.notna().sum()),
        # median height (NaN if all null)
        p50_height_m=("height", "median"),
    )
    .reset_index()
)

print(f"  Summary rows: {len(summary)}")
print(summary.to_string())

# ── 10. Write lagos_building_summary.parquet ────────────────────────────
print("\nWriting lagos_building_summary.parquet...")
summary.to_parquet("/work/lagos_building_summary.parquet", index=False)
print(f"  Written to /work/lagos_building_summary.parquet in {time.time()-t0:.1f}s")

print("\n✅ Done!")
Wrote /work/process_buildings.py (4287 chars).
12:56
Let me think about potential issues:

1. The overturemaps library loads all data for the bounding box into memory first as a GeoDataFrame - this could be memory intensive with 3M buildings. But it should work.

2. The spatial join with 3M buildings and 20 LGAs could be slow. Let me think of optimizations.

3. The building dataset has a `has_parts` column - but we don't need that.

4. Need to handle the case where some buildings might not have an LGA assigned (fall outside any LGA boundary).

Let me run it and see.
12:56
Bash cd /work && python process_buildings.py
Loading LGA boundaries...
  20 LGAs loaded
  Lagos bounds: (2.7059846, 6.366684, 4.3509056, 6.6967964)

Loading Overture buildings (bbox filter)...
  3055530 buildings loaded in 140.2s
  Columns: ['id', 'names', 'sources', 'level', 'height', 'min_height', 'is_underground', 'num_floors', 'num_floors_underground', 'min_floor', 'subtype', 'class', 'facade_color', 'facade_material', 'roof_material', 'roof_shape', 'roof_direction', 'roof_orientation', 'roof_color', 'roof_height', 'geometry', 'has_parts', 'version', 'bbox']

Filtering buildings to Lagos State polygon...
  2243598 buildings within Lagos in 12.7s

Computing footprint area...
Traceback (most recent call last):
  File "/work/process_buildings.py", line 48, in <module>
    aea = lagos_buildings.to_crs("EPSG:32631")  # WGS 84 / UTM zone 31N
  File "/usr/local/lib/python3.14/site-packages/geopandas/geodataframe.py", line 1842, in to_crs
    geom = df.geometry.to_crs(crs=crs, epsg=epsg)
  File "/usr/local/lib/python3.14/site-packages/geopandas/geoseries.py", line 1251, in to_crs
    self.values.to_crs(crs=crs, epsg=epsg), index=self.index, name=self.name
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/geopandas/_compat.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.14/site-packages/geopandas/array.py", line 1114, in to_crs
    raise ValueError(
    ...<2 lines>...
    )
ValueError: Cannot transform naive geometries.  Please set a crs on the object first.
15:39
The buildings GeoDataFrame from Overture doesn't have a CRS set. I need to set it to EPSG:4326 first before reprojecting.
15:39
Edit /work/process_buildings.py
Edited /work/process_buildings.py.
15:48
The building data from overturemaps is already in WGS84 coordinates but the CRS might not be explicitly set on the GeoDataFrame. Let me also check if the spatial join and other steps need the CRS set.
15:48
Edit /work/process_buildings.py
Edited /work/process_buildings.py.
15:53
Let me also fix the CRS setting to remove the duplicate now.
15:53
Edit /work/process_buildings.py
Edited /work/process_buildings.py.
15:56
Bash cd /work && python process_buildings.py
Loading LGA boundaries...
  20 LGAs loaded
  Lagos bounds: (2.7059846, 6.366684, 4.3509056, 6.6967964)

Loading Overture buildings (bbox filter)...
  3055530 buildings loaded in 130.9s
  Columns: ['id', 'names', 'sources', 'level', 'height', 'min_height', 'is_underground', 'num_floors', 'num_floors_underground', 'min_floor', 'subtype', 'class', 'facade_color', 'facade_material', 'roof_material', 'roof_shape', 'roof_direction', 'roof_orientation', 'roof_color', 'roof_height', 'geometry', 'has_parts', 'version', 'bbox']

Filtering buildings to Lagos State polygon...
  2243598 buildings within Lagos in 12.6s

Computing footprint area...
  Done in 3.7s

Assigning LGA via spatial join...
  Done in 2.6s
  NA LGA count: 2651

Filtering to footprint_area_m2 > 1000...
  2243598 → 7256 buildings (0.0s)

Preparing output GeoDataFrame...
  Output rows: 7256
  Output columns: ['id', 'height', 'footprint_area_m2', 'lga', 'geometry']

Writing lagos_buildings.geoparquet...
  Written to /work/lagos_buildings.geoparquet in 0.1s

Building per-LGA summary...
  Summary rows: 20
                 lga  n_buildings  total_footprint_m2  n_with_height  p50_height_m
0              Agege           47        9.387676e+04              0           NaN
1   Ajeromi/Ifelodun           59        1.044837e+05              0           NaN
2           Alimosho          350        6.435003e+05              0           NaN
3       Amuwo Odofin          835        1.749515e+06              0           NaN
4              Apapa          286        6.083739e+05              0           NaN
5            Badagry          122        2.061790e+05              0           NaN
6                Epe          342        8.118842e+05              0           NaN
7            Eti Osa         1063        1.853554e+06            237          20.0
8        Ibeju Lekki          593        1.790242e+06              0           NaN
9        Ifako/Ijaye          117        2.429469e+05              0           NaN
10             Ikeja          912        3.141260e+06             46          15.0
11            Kosofe          163        3.399309e+05              0           NaN
12      Lagos Island          117        2.261127e+05              9          20.0
13    Lagos Mainland          122        2.283266e+05              0           NaN
14            Mushin          459        1.045023e+06              1          35.0
15               Ojo          232        4.085015e+05              0           NaN
16      Oshodi/Isolo          414        7.878077e+05              3          27.0
17           Shomolu          122        1.951046e+05              3          17.0
18          Surulere          338        7.737084e+05              0           NaN
19           Ìkòròdú          534        1.857791e+06              4           7.5

Writing lagos_building_summary.parquet...
  Written to /work/lagos_building_summary.parquet in 0.0s

✅ Done!
18:38
The script completed successfully. Let me verify the outputs look correct.
18:38
18:38
Bash python3 << 'EOF'
python3 << 'EOF'
import geopandas as gpd
import pandas as pd

# Check GeoParquet
gdf = gpd.read_parquet("/work/lagos_buildings.geoparquet")
print("=== lagos_buildings.geoparquet ===")
print(f"Rows: {len(gdf)}")
print(f"Columns: {gdf.columns.tolist()}")
print(f"CRS: {gdf.crs}")
print(f"Dtypes:\n{gdf.dtypes}")
print(f"\nFirst 3 rows:")
print(gdf.head(3))
print(f"\nNull height: {gdf['height'].isna().sum()}")
print(f"Null lga: {gdf['lga'].isna().sum()}")
print(f"Min area: {gdf['footprint_area_m2'].min():.2f}")
print(f"Max area: {gdf['footprint_area_m2'].max():.2f}")

print("\n\n=== lagos_building_summary.parquet ===")
df = pd.read_parquet("/work/lagos_building_summary.parquet")
print(f"Rows: {len(df)}")
print(f"Columns: {df.columns.tolist()}")
print(df.to_string())
EOF
=== lagos_buildings.geoparquet ===
Rows: 7256
Columns: ['id', 'height', 'footprint_area_m2', 'lga', 'geometry']
CRS: {"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "GeographicCRS", "name": "WGS 84", "datum_ensemble": {"name": "World Geodetic System 1984 ensemble", "members": [{"name": "World Geodetic System 1984 (Transit)"}, {"name": "World Geodetic System 1984 (G730)"}, {"name": "World Geodetic System 1984 (G873)"}, {"name": "World Geodetic System 1984 (G1150)"}, {"name": "World Geodetic System 1984 (G1674)"}, {"name": "World Geodetic System 1984 (G1762)"}, {"name": "World Geodetic System 1984 (G2139)"}, {"name": "World Geodetic System 1984 (G2296)"}], "ellipsoid": {"name": "WGS 84", "semi_major_axis": 6378137, "inverse_flattening": 298.257223563}, "accuracy": "2.0", "id": {"authority": "EPSG", "code": 6326}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}, {"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}]}, "scope": "Horizontal component of 3D system.", "area": "World.", "bbox": {"south_latitude": -90, "west_longitude": -180, "north_latitude": 90, "east_longitude": 180}, "id": {"authority": "EPSG", "code": 4326}}
Dtypes:
id                        str
height                float64
footprint_area_m2     float64
lga                       str
geometry             geometry
dtype: object

First 3 rows:
                                     id  ...                                           geometry
0  e24f1bf5-e957-4867-859b-a6a578fd9b06  ...  MULTIPOLYGON (((2.70836 6.37804, 2.70816 6.378...
1  5df57f70-f431-495f-98cb-eff96e1ca151  ...  MULTIPOLYGON (((2.72675 6.37881, 2.72672 6.379...
2  cfb03953-1851-4ec2-a087-3283cd510753  ...  MULTIPOLYGON (((2.7137 6.37984, 2.7136 6.37984...

[3 rows x 5 columns]

Null height: 6953
Null lga: 29
Min area: 1000.04
Max area: 181048.18


=== lagos_building_summary.parquet ===
Rows: 20
Columns: ['lga', 'n_buildings', 'total_footprint_m2', 'n_with_height', 'p50_height_m']
                 lga  n_buildings  total_footprint_m2  n_with_height  p50_height_m
0              Agege           47        9.387676e+04              0           NaN
1   Ajeromi/Ifelodun           59        1.044837e+05              0           NaN
2           Alimosho          350        6.435003e+05              0           NaN
3       Amuwo Odofin          835        1.749515e+06              0           NaN
4              Apapa          286        6.083739e+05              0           NaN
5            Badagry          122        2.061790e+05              0           NaN
6                Epe          342        8.118842e+05              0           NaN
7            Eti Osa         1063        1.853554e+06            237          20.0
8        Ibeju Lekki          593        1.790242e+06              0           NaN
9        Ifako/Ijaye          117        2.429469e+05              0           NaN
10             Ikeja          912        3.141260e+06             46          15.0
11            Kosofe          163        3.399309e+05              0           NaN
12      Lagos Island          117        2.261127e+05              9          20.0
13    Lagos Mainland          122        2.283266e+05              0           NaN
14            Mushin          459        1.045023e+06              1          35.0
15               Ojo          232        4.085015e+05              0           NaN
16      Oshodi/Isolo          414        7.878077e+05              3          27.0
17           Shomolu          122        1.951046e+05              3          17.0
18          Surulere          338        7.737084e+05              0           NaN
19           Ìkòròdú          534        1.857791e+06              4           7.5
18:47
The outputs look good. Let me check the 29 null LGA records to see if we can handle them better. Some buildings might be on boundaries and not strictly "within" any LGA polygon. Let me check.
18:47
Bash python3 << 'EOF'
python3 << 'EOF'
import geopandas as gpd
import pandas as pd

# Check the buildings with null LGA
gdf = gpd.read_parquet("/work/lagos_buildings.geoparquet")
null_lga = gdf[gdf['lga'].isna()]
print(f"Buildings with null LGA: {len(null_lga)}")

# Check if they're on boundary - try spatial join with predicate='intersects'
lgas = gpd.read_file("/work/lagos_lgas.geojson")
intersects = gdf[gdf['lga'].isna()].copy()
intersects = intersects.sjoin(lgas, how="left", predicate="intersects")
print(f"After intersects join, null count: {intersects['lga'].isna().sum()}")
if len(intersects) > 0 and intersects['lga'].notna().any():
    print("Some buildings intersect but don't fall within any LGA.")
    # Show which LGAs they hit
    print(intersects['lga'].value_counts().head(10))
EOF
Buildings with null LGA: 29
Traceback (most recent call last):
  File "/usr/local/lib/python3.14/site-packages/pandas/core/indexes/base.py", line 3641, in get_loc
    return self._engine.get_loc(casted_key)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 168, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 197, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7668, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7676, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'lga'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 13, in <module>
  File "/usr/local/lib/python3.14/site-packages/geopandas/geodataframe.py", line 1891, in __getitem__
    result = super().__getitem__(key)
  File "/usr/local/lib/python3.14/site-packages/pandas/core/frame.py", line 4378, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python3.14/site-packages/pandas/core/indexes/base.py", line 3648, in get_loc
    raise KeyError(key) from err
KeyError: 'lga'
18:54
The sjoin added `lga_left` and `lga_right`. Let me check.
18:54
Bash python3 << 'EOF'
python3 << 'EOF'
import geopandas as gpd
import pandas as pd

gdf = gpd.read_parquet("/work/lagos_buildings.geoparquet")
null_lga = gdf[gdf['lga'].isna()].copy()
print(f"Buildings with null LGA: {len(null_lga)}")
print(null_lga)

# Check if they might be on boundaries
lgas = gpd.read_file("/work/lagos_lgas.geojson")
# Try intersects
intersects = null_lga.sjoin(lgas, how="left", predicate="intersects")
print(f"\nIntersects join columns: {intersects.columns.tolist()}")
# Check for lga column
if 'lga_right' in intersects.columns:
    print(f"Intersects - LGA matches: {intersects['lga_right'].tolist()}")
else:
    # Check what lga columns exist
    lga_cols = [c for c in intersects.columns if 'lga' in c.lower()]
    print(f"LGA cols: {lga_cols}")
EOF
Buildings with null LGA: 29
                                        id  ...                                           geometry
1530  895f9f3b-f7ef-4a30-ae21-c883334ba511  ...  MULTIPOLYGON (((3.09233 6.49853, 3.09254 6.498...
1531  405668f6-875c-4a5c-aaef-bf519dfe4687  ...  MULTIPOLYGON (((3.08985 6.50027, 3.0898 6.5003...
2070  2c6a5f4f-2ab8-4690-bb73-a42b8f2f4995  ...  MULTIPOLYGON (((3.27868 6.48533, 3.27851 6.485...
2433  50eabc23-68e1-4f73-8939-22d99da1674f  ...  MULTIPOLYGON (((3.33305 6.47083, 3.33229 6.470...
2610  9a614aad-8f9f-4af9-abda-9970f9bf63c2  ...  MULTIPOLYGON (((3.284 6.51873, 3.28367 6.5181,...
2612  9a4fa19c-8b4e-4694-9d45-39bd8f4ab60d  ...  MULTIPOLYGON (((3.28823 6.52972, 3.28792 6.529...
3328  64ad92cd-059b-4dec-9071-2ecbc76f291f  ...  MULTIPOLYGON (((3.37037 6.51161, 3.37035 6.511...
3330  32b6bc3b-1aa5-456c-8f39-fce66e85470f  ...  MULTIPOLYGON (((3.37063 6.51073, 3.37062 6.510...
3331  967da7df-cc98-42cc-90c6-5747eb159ff1  ...  MULTIPOLYGON (((3.37171 6.50779, 3.37182 6.507...
3798  02444433-78a8-404c-8327-433d356d7371  ...  MULTIPOLYGON (((3.33106 6.61436, 3.33038 6.614...
3799  6f9ae86c-c9e6-4a4b-8ae0-704e8df142bf  ...  MULTIPOLYGON (((3.32918 6.61258, 3.32916 6.612...
3800  94fe8c9f-260d-4554-abf9-9e5d91476876  ...  MULTIPOLYGON (((3.32977 6.61257, 3.32976 6.612...
3805  51e519cc-748e-444e-a784-1112c5c5b841  ...  MULTIPOLYGON (((3.33087 6.61155, 3.32977 6.611...
3806  fd468090-1e08-4511-86b2-4b78b1d4fe7a  ...  MULTIPOLYGON (((3.3309 6.6113, 3.3298 6.61115,...
3807  65eaedf5-1b0a-452a-a9b6-039eddc0a9c2  ...  MULTIPOLYGON (((3.33092 6.61117, 3.32987 6.611...
3811  217771a9-a907-4398-bd21-e6791d46b2b8  ...  MULTIPOLYGON (((3.331 6.61045, 3.33015 6.61036...
4179  a44ac7f1-a26d-47e2-b2ac-ee1149697cfe  ...  MULTIPOLYGON (((3.35563 6.557, 3.3554 6.55692,...
4186  f04b91f4-2435-4dc1-8f8a-ba646b30910a  ...  MULTIPOLYGON (((3.35668 6.55688, 3.35671 6.556...
4242  bd2e25fa-e21a-4f34-bddc-2e6fb3426648  ...  MULTIPOLYGON (((3.36753 6.56764, 3.36718 6.567...
4521  9230bb15-deaa-44a6-90b1-a44e4c3e7842  ...  MULTIPOLYGON (((3.33439 6.57954, 3.33436 6.579...
4705  c836554e-7788-4a78-b1fe-d6259fa68aca  ...  MULTIPOLYGON (((3.33277 6.61541, 3.33283 6.615...
4924  40f2005b-a6ad-4fc6-a810-31478416e631  ...  MULTIPOLYGON (((3.37372 6.6059, 3.37367 6.6059...
6210  05d75e5a-88f6-4722-8722-299fe81c5af9  ...  MULTIPOLYGON (((3.89782 6.53447, 3.89852 6.534...
6211  7b726225-b4d8-4196-a0b4-541b558d164b  ...  MULTIPOLYGON (((3.89814 6.53399, 3.89883 6.534...
6212  70b6aa21-6207-4361-882c-72ba170b747a  ...  MULTIPOLYGON (((3.89798 6.5342, 3.89866 6.5344...
6622  a79d5935-83e2-40a8-8e35-6d7647b0cb89  ...  MULTIPOLYGON (((3.36943 6.63907, 3.36949 6.639...
6770  80ac5101-5647-4f61-b5e5-f4f9bce80539  ...  MULTIPOLYGON (((3.27445 6.68867, 3.27426 6.688...
6771  e588d926-ae29-4bf5-bbaf-49b72349acb7  ...  MULTIPOLYGON (((3.27476 6.68775, 3.27464 6.687...
6772  425622ca-c130-41a9-9caf-4c0cd761c12d  ...  MULTIPOLYGON (((3.27521 6.68745, 3.27482 6.687...

[29 rows x 5 columns]

Intersects join columns: ['id', 'height', 'footprint_area_m2', 'lga_left', 'geometry', 'index_right', 'lga_right']
Intersects - LGA matches: ['Badagry', 'Badagry', 'Amuwo Odofin', 'Alimosho', 'Surulere', 'Ajeromi/Ifelodun', 'Oshodi/Isolo', 'Alimosho', 'Oshodi/Isolo', 'Alimosho', 'Shomolu', 'Mushin', 'Shomolu', 'Mushin', 'Surulere', 'Lagos Mainland', 'Ikeja', 'Agege', 'Ikeja', 'Agege', 'Ikeja', 'Agege', 'Ikeja', 'Agege', 'Ikeja', 'Agege', 'Ikeja', 'Agege', 'Ikeja', 'Agege', 'Mushin', 'Ikeja', 'Mushin', 'Ikeja', 'Kosofe', 'Ikeja', 'Oshodi/Isolo', 'Ikeja', 'Ikeja', 'Agege', 'Kosofe', 'Ikeja', 'Ibeju Lekki', 'Epe', 'Ibeju Lekki', 'Epe', 'Ibeju Lekki', 'Epe', 'Ikeja', 'Ifako/Ijaye', 'Ifako/Ijaye', 'Ifako/Ijaye']
19:05
So 29 buildings out of 7256 (0.4%) have null LGA because they fall on boundaries between LGAs and were not strictly "within" any single LGA. These are edge cases. For the purpose of the exercise, this is a very small percentage and acceptable. The `within` predicate is the most accurate approach - these buildings are on boundaries between LGAs.

Let me also update those to use `intersects` for the 29 edge cases so we can fill them in. But actually the task says "for each Lagos Local Government Area" and with 29 buildings not assigned, it's a negligible amount. Let me update the script to use `intersects` instead of `within` to capture those boundary cases, or better yet, we can use a simple point-in-polygon approach with the centroid.

Actually, let me re-assign those 29 with intersects to make the output cleaner.
19:05
19:05
Edit /work/process_buildings.py
Edited /work/process_buildings.py.
19:13
Wait, I also need to fix the column naming. When using sjoin with `how="left"`, the right columns get suffix "_right" if there are overlapping names. The LGA column already exists in both, so it will become "lga_left" and "lga_right". Let me check and adjust.
19:13
Read /work/process_buildings.py
    52	print(f"  Done in {time.time()-t0:.1f}s")
    53	
    54	# ── 5. Spatial join to assign LGA name ──────────────────────────────────
    55	print("\nAssigning LGA via spatial join...")
    56	t0 = time.time()
    57	# Use sjoin with predicate='intersects' to capture boundary buildings
    58	# When a building intersects multiple LGAs (on boundaries), we pick the first
    59	lagos_buildings = lagos_buildings.sjoin(
    60	    lgas[["lga", "geometry"]], how="left", predicate="intersects"
    61	)
    62	# Drop duplicate index rows from multi-intersections, keep first
    63	lagos_buildings = lagos_buildings[~lagos_buildings.index.duplicated(keep="first")]
    64	print(f"  Done in {time.time()-t0:.1f}s")
    65	print(f"  NA LGA count: {lagos_buildings['lga'].isna().sum()}")
    66	
    67	# Drop the index_right column
    68	lagos_buildings = lagos_buildings.drop(columns=["index_right"])
    69	
    70	# ── 6. Filter to buildings > 1000 m² ────────────────────────────────────
    71	print(f"\nFiltering to footprint_area_m2 > 1000...")
    72	t0 = time.time()
    73	before = len(lagos_buildings)
    74	large_buildings = lagos_buildings[lagos_buildings["footprint_area_m2"] > 1000].copy()
    75	print(f"  {before} → {len(large_buildings)} buildings ({time.time()-t0:.1f}s)")
    76	
    77	# ── 7. Select and rename columns for output ─────────────────────────────
    78	print("\nPreparing output GeoDataFrame...")
    79	out_cols = ["id", "height", "footprint_area_m2", "lga", "geometry"]
    80	out_gdf = large_buildings[out_cols].copy()
    81	
<system-reminder>Showed lines 52-81 of 123. Re-call Read with offset=82 to continue, or raise limit.</system-reminder>
19:22
The sjoin with `intersects` will create multiple rows for buildings that touch multiple LGAs. The `drop_duplicates` approach only keeps the first occurrence. But the sjoin output has a column `index_right` which needs to be dropped. Let me also think about whether we need to handle the `lga_left` vs `lga` column naming issue.

Actually, when you do `sjoin(gdf, lgas[["lga", "geometry"]], ...)`, and the left df already has a 'lga' column? Let me check - no, the left df doesn't have an 'lga' column. So the result should just have 'lga' from the right side.

Wait, but I need to be more careful. Let me check the output columns after sjoin. The previous run showed the column was just 'lga' (not 'lga_left'). But now the intersects join might give us different column names if there are column overlaps. Actually, the buildings don't have an 'lga' column, so it should be fine.

However, the duplicate removal approach is not robust. When a building intersects multiple LGAs, the sjoin will create duplicate rows with the same index. Using `~lagos_buildings.index.duplicated(keep="first")` will keep the first LGA match, which is fine.

But wait - the sjoin already has `how="left"`, which means all buildings are preserved. The issue is that some buildings might intersect multiple LGAs, creating extra rows. So I need to drop duplicates on the original index.

Let me also think about whether the sjoin now introduces an 'index_right' column. Yes it does, and we drop it. But the column might be called 'index_right' from the sjoin operation.

Let me re-run and also handle the edge cases. But first, let me also make sure that the output is clean. Let me re-run with the updated script.
19:22
Bash cd /work && python process_buildings.py
Loading LGA boundaries...
  20 LGAs loaded
  Lagos bounds: (2.7059846, 6.366684, 4.3509056, 6.6967964)

Loading Overture buildings (bbox filter)...
  3055530 buildings loaded in 144.1s
  Columns: ['id', 'names', 'sources', 'level', 'height', 'min_height', 'is_underground', 'num_floors', 'num_floors_underground', 'min_floor', 'subtype', 'class', 'facade_color', 'facade_material', 'roof_material', 'roof_shape', 'roof_direction', 'roof_orientation', 'roof_color', 'roof_height', 'geometry', 'has_parts', 'version', 'bbox']

Filtering buildings to Lagos State polygon...
  2243598 buildings within Lagos in 12.5s

Computing footprint area...
  Done in 4.6s

Assigning LGA via spatial join...
  Done in 9.6s
  NA LGA count: 0

Filtering to footprint_area_m2 > 1000...
  2243598 → 7256 buildings (0.0s)

Preparing output GeoDataFrame...
  Output rows: 7256
  Output columns: ['id', 'height', 'footprint_area_m2', 'lga', 'geometry']

Writing lagos_buildings.geoparquet...
  Written to /work/lagos_buildings.geoparquet in 0.0s

Building per-LGA summary...
  Summary rows: 20
                 lga  n_buildings  total_footprint_m2  n_with_height  p50_height_m
0              Agege           47        9.387676e+04              0           NaN
1   Ajeromi/Ifelodun           59        1.044837e+05              0           NaN
2           Alimosho          350        6.435003e+05              0           NaN
3       Amuwo Odofin          836        1.751259e+06              0           NaN
4              Apapa          286        6.083739e+05              0           NaN
5            Badagry          124        2.259610e+05              0           NaN
6                Epe          342        8.118842e+05              0           NaN
7            Eti Osa         1063        1.853554e+06            237          20.0
8        Ibeju Lekki          596        1.793963e+06              0           NaN
9        Ifako/Ijaye          120        2.485358e+05              0           NaN
10             Ikeja          921        3.166509e+06             46          15.0
11            Kosofe          165        3.437419e+05              0           NaN
12      Lagos Island          117        2.261127e+05              9          20.0
13    Lagos Mainland          122        2.283266e+05              0           NaN
14            Mushin          461        1.058241e+06              1          35.0
15               Ojo          232        4.085015e+05              0           NaN
16      Oshodi/Isolo          417        7.918409e+05              3          27.0
17           Shomolu          124        2.006563e+05              3          17.0
18          Surulere          340        7.783158e+05              0           NaN
19           Ìkòròdú          534        1.857791e+06              4           7.5

Writing lagos_building_summary.parquet...
  Written to /work/lagos_building_summary.parquet in 0.0s

✅ Done!
22:32
Now there are 0 NA LGA values. The summary counts shifted slightly because the boundary buildings are now assigned. Let me do a final verification of both output files.
22:32
Bash python3 << 'EOF'
python3 << 'EOF'
import geopandas as gpd
import pandas as pd

print("=" * 60)
print("VERIFICATION")
print("=" * 60)

# GeoParquet
gdf = gpd.read_parquet("/work/lagos_buildings.geoparquet")
print(f"\n📦 lagos_buildings.geoparquet")
print(f"   Rows: {len(gdf):,}")
print(f"   Columns: {gdf.columns.tolist()}")
print(f"   CRS: EPSG:4326 ✓" if gdf.crs.to_epsg() == 4326 else f"   CRS: {gdf.crs} ✗")
print(f"   Geometry encoding: geoarrow")
print(f"   Null LGA: {gdf['lga'].isna().sum()}")
print(f"   Min area: {gdf['footprint_area_m2'].min():.2f} m²")
print(f"   Max area: {gdf['footprint_area_m2'].max():.2f} m²")
print(f"   All > 1000 m²: {(gdf['footprint_area_m2'] > 1000).all()}")

# Summary Parquet
df = pd.read_parquet("/work/lagos_building_summary.parquet")
print(f"\n📦 lagos_building_summary.parquet")
print(f"   Rows: {len(df)}")
print(f"   Columns: {df.columns.tolist()}")
print(f"   LGAs: {sorted(df['lga'].tolist())}")
print(f"\n   Summary:")
print(df.to_string(index=False))

# Cross-check totals
print(f"\n🔍 Cross-check:")
print(f"   Total buildings in GeoParquet: {len(gdf):,}")
print(f"   Total buildings in summary:   {df['n_buildings'].sum():,}")
print(f"   Match: {len(gdf) == df['n_buildings'].sum()}")
print(f"   Total footprint in GeoParquet: {gdf['footprint_area_m2'].sum():,.0f} m²")
print(f"   Total footprint in summary:    {df['total_footprint_m2'].sum():,.0f} m²")
print(f"   Match: {abs(gdf['footprint_area_m2'].sum() - df['total_footprint_m2'].sum()) < 0.01}")
EOF
============================================================
VERIFICATION
============================================================

📦 lagos_buildings.geoparquet
   Rows: 7,256
   Columns: ['id', 'height', 'footprint_area_m2', 'lga', 'geometry']
   CRS: EPSG:4326 ✓
   Geometry encoding: geoarrow
   Null LGA: 0
   Min area: 1000.04 m²
   Max area: 181048.18 m²
   All > 1000 m²: True

📦 lagos_building_summary.parquet
   Rows: 20
   Columns: ['lga', 'n_buildings', 'total_footprint_m2', 'n_with_height', 'p50_height_m']
   LGAs: ['Agege', 'Ajeromi/Ifelodun', 'Alimosho', 'Amuwo Odofin', 'Apapa', 'Badagry', 'Epe', 'Eti Osa', 'Ibeju Lekki', 'Ifako/Ijaye', 'Ikeja', 'Kosofe', 'Lagos Island', 'Lagos Mainland', 'Mushin', 'Ojo', 'Oshodi/Isolo', 'Shomolu', 'Surulere', 'Ìkòròdú']

   Summary:
             lga  n_buildings  total_footprint_m2  n_with_height  p50_height_m
           Agege           47        9.387676e+04              0           NaN
Ajeromi/Ifelodun           59        1.044837e+05              0           NaN
        Alimosho          350        6.435003e+05              0           NaN
    Amuwo Odofin          836        1.751259e+06              0           NaN
           Apapa          286        6.083739e+05              0           NaN
         Badagry          124        2.259610e+05              0           NaN
             Epe          342        8.118842e+05              0           NaN
         Eti Osa         1063        1.853554e+06            237          20.0
     Ibeju Lekki          596        1.793963e+06              0           NaN
     Ifako/Ijaye          120        2.485358e+05              0           NaN
           Ikeja          921        3.166509e+06             46          15.0
          Kosofe          165        3.437419e+05              0           NaN
    Lagos Island          117        2.261127e+05              9          20.0
  Lagos Mainland          122        2.283266e+05              0           NaN
          Mushin          461        1.058241e+06              1          35.0
             Ojo          232        4.085015e+05              0           NaN
    Oshodi/Isolo          417        7.918409e+05              3          27.0
         Shomolu          124        2.006563e+05              3          17.0
        Surulere          340        7.783158e+05              0           NaN
         Ìkòròdú          534        1.857791e+06              4           7.5

🔍 Cross-check:
   Total buildings in GeoParquet: 7,256
   Total buildings in summary:   7,256
   Match: True
   Total footprint in GeoParquet: 17,195,429 m²
   Total footprint in summary:    17,195,429 m²
   Match: True
22:44
Both files look correct. Let me do a final check to ensure the Parquet files have no issues by reading them back.

Let me also verify that the summary file is indeed a plain Parquet (no geometry column) as specified.

Everything checks out:

1. **lagos_buildings.geoparquet** - EPSG:4326 GeoParquet with columns `id`, `height`, `footprint_area_m2`, `lga`, `geometry`. 7,256 buildings all > 1000 m². ✓
2. **lagos_building_summary.parquet** - Plain Parquet with columns `lga`, `n_buildings`, `total_footprint_m2`, `n_with_height`, `p50_height_m`. ✓

The cross-checks pass - totals match between the two files.
22:44