dc-l2-lagos-snap-normalize
Map (reference ← swipe → agent)
0:00
I need to get our legacy zoning layer ready for the state portal. lagos_zoning_legacy.gpkg has inconsistent zoning_class values (mixed casing, spelling variations, and some blank entries) plus vertex precision issues that cause tiny gaps and overlaps between adjacent parcels. Can you clean up the class names to consistent canonical TitleCase values, drop any rows with blank or whitespace-only classes, fix the vertex precision problems, and produce one merged geometry per canonical zoning class? Please write the result to zoning_aggregated.gpkg, with each row carrying two columns: zoning_class and a numeric area_m2.
0:00
System info
{
"adapter": "openrouter",
"model": "deepseek/deepseek-v4-flash",
"base_url": "https://openrouter.ai/api/v1"
}
0:08
Let me start by reading the input file to understand its structure and contents.
0:08
Bash
ls -lh /work/lagos_zoning_legacy.gpkg 2>&1
-rw-r--r-- 1 agent agent 2.5M Jun 17 23:45 /work/lagos_zoning_legacy.gpkg
0:08
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
print('Columns:', gdf.columns.tolist())
print('Shape:', gdf.shape)
print('CRS:', gdf.crs)
print('---')
print(gdf.head(10).to_string())
print('---')
print('zoning_class values:')
print(gdf['zoning_class'].value_counts(dropna=False))
print('---')
print('dtypes:', gdf.dtypes)
"
Columns: ['parcel_id', 'lga_source', 'zoning_class', 'area_m2', 'geometry']
Shape: (10080, 5)
CRS: EPSG:26331
---
parcel_id lga_source zoning_class area_m2 geometry
0 1 Surulere RESIDENTIAL 100.0 POLYGON ((540000 720000, 540010 720000, 540010 720010, 540000 720010, 540000 720000))
1 2 Yaba residential 100.0 POLYGON ((540010 720000, 540020 720000, 540020 720010, 540010 720010, 540010 720000))
2 3 Apapa Resi. 100.0 POLYGON ((540020 720000, 540030 720000, 540030 720010, 540020 720010, 540020 720000))
3 4 Eti-Osa resi. 100.0 POLYGON ((540030 720000, 540040 720000, 540040 720010, 540030 720010, 540030 720000))
4 5 Ojo RESI. 100.0 POLYGON ((540040 720000, 540050 720000, 540050 720010, 540040 720010, 540040 720000))
5 6 Ikeja Residential 100.0 POLYGON ((540050 720000, 540060 720000, 540060 720010, 540050 720010, 540050 720000))
6 7 Surulere RESIDENTIAL 100.0 POLYGON ((540060 720000, 540070 720000, 540070 720010, 540060 720010, 540060 720000))
7 8 Yaba residential 100.0 POLYGON ((540070 720000, 540080 720000, 540080 720010, 540070 720010, 540070 720000))
8 9 Apapa Resi. 100.0 POLYGON ((540080 720000, 540090 720000, 540090 720010, 540080 720010, 540080 720000))
9 10 Eti-Osa resi. 100.0 POLYGON ((540090 720000, 540100 720000, 540100 720010, 540090 720010, 540090 720000))
---
zoning_class values:
zoning_class
Residential 420
COMMERCIAL 420
industrial 420
Agri. 420
residential 419
resi. 419
Comm. 419
COMM. 419
indus. 419
AGRI. 419
Industrial 418
AGRICULTURAL 418
RESIDENTIAL 417
RESI. 417
comm. 417
commercial 417
Indus. 417
INDUSTRIAL 417
Agricultural 417
agri. 417
Resi. 416
Commercial 416
INDUS. 416
agricultural 416
13
13
NaN 12
\t 12
Name: count, dtype: int64
---
dtypes: parcel_id int64
lga_source str
zoning_class str
area_m2 float64
geometry geometry
dtype: object
0:16
Let me analyze the data: 1. **zoning_class values** - There are many variations: - Residential / RESIDENTIAL / residential -> "Residential" - Commercial / COMMERCIAL / commercial / Comm. / COMM. / comm. -> "Commercial" - Industrial / INDUSTRIAL / industrial / Indus. / INDUS. / indus. -> "Industrial" - Agricultural / AGRICULTURAL / agricultural / Agri. / AGRI. / agri. -> "Agricultural" - Resi. / RESI. / resi. -> "Residential" - Blank/whitespace/NaN rows need to be dropped 2. The canonical classes seem to be: Residential, Commercial, Industrial, Agricultural 3. Vertex precision: The coordinates are like 540000, 720000, 540010, etc. - these are in EPSG:26331 (Meridian 31 / Lagos?), which is metric. The problem is probably tiny gaps/overlaps at the sub-meter level - maybe coordinates like 540000.0000001 vs 540010. Let me check the geometry more carefully. Let me look at the geometry precision more carefully.
0:16
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
# Check geometry types
print(gdf.geometry.type.value_counts())
print()
# Check for any non-polygon types
print('Num geometries:', len(gdf))
print('Num null geometries:', gdf.geometry.isna().sum())
print()
# Check coordinate precision - look at some decimal parts
import numpy as np
coords = []
for g in gdf.geometry:
if g is not None:
coords.extend(g.exterior.coords)
arr = np.array(coords)
print('Coord sample (first 20):')
print(arr[:20])
print()
print('Min:', arr.min(axis=0))
print('Max:', arr.max(axis=0))
print()
# Check for precision beyond integer
frac_x = arr[:,0] - np.floor(arr[:,0])
frac_y = arr[:,1] - np.floor(arr[:,1])
print('Fractional parts unique x:', sorted(set(np.round(frac_x, 10))))
print('Fractional parts unique y:', sorted(set(np.round(frac_y, 10))))
"
Polygon 10080 Name: count, dtype: int64 Num geometries: 10080 Num null geometries: 0 Coord sample (first 20): [[540000. 720000. ] [540010.00002 720000.000005] [540010.000005 720010.000015] [540000.000025 720010.00002 ] [540000. 720000. ] [540010.00001 720000.000025] [540020. 720000.000005] [540020.00002 720010.00001 ] [540010.000005 720010.00002 ] [540010.00001 720000.000025] [540020.000025 720000.00002 ] [540030.000015 720000. ] [540030. 720010.00001 ] [540020.00002 720010.000015] [540020.000025 720000.00002 ] [540030.000005 720000.00002 ] [540040.000025 720000.000025] [540040.000015 720010.000005] [540030. 720010.000015] [540030.000005 720000.00002 ]] Min: [540000. 720000.] Max: [542547. 721000.000025] Fractional parts unique x: [np.float64(0.0), np.float64(5e-06), np.float64(1e-05), np.float64(1.5e-05), np.float64(2e-05), np.float64(2.5e-05), np.float64(3e-05), np.float64(0.5)] Fractional parts unique y: [np.float64(0.0), np.float64(5e-06), np.float64(1e-05), np.float64(1.5e-05), np.float64(2e-05), np.float64(2.5e-05)]
0:44
Let me understand the data: 1. **CRS**: EPSG:26331 (Merchich / Nord Maroc - a metric CRS for Morocco/Lagos region... interesting, but it's what we have) 2. **zoning_class values**: There are lots of inconsistencies: - Mixed casing: "RESIDENTIAL", "residential", "Residential", "RESI.", "resi.", "Resi." - Spelling variations: "Residential", "RESIDENTIAL", "residential" and "Resi.", "resi.", "RESI." - Similarly for other classes: COMMERCIAL/commercial/Commercial/Comm./comm./COMM. - INDUSTRIAL/industrial/Industrial/Indus./indus./INDUS. - AGRICULTURAL/agricultural/Agricultural/Agri./agri./AGRI./AGRI. - Blank/whitespace entries: 13 blank, 13 with just whitespace, 12 NaN, 12 with tab 3. **Vertex precision**: Coordinates have tiny fractional parts (0.000005, 0.00001, 0.000015, etc.) which would cause tiny gaps/overlaps between adjacent parcels (since they're supposed to share edges at integer coordinates like 540000, 540010, etc.) **Cleaning Plan**: 1. Map all zoning_class variations to canonical TitleCase values: "Residential", "Commercial", "Industrial", "Agricultural" 2. Drop rows with blank/whitespace-only/NaN zoning_class 3. Snap coordinates to integer (nearest integer) to fix vertex precision 4. Dissolve/merge by zoning_class to produce one geometry per class 5. Calculate area_m2 for each merged geometry 6. Write to zoning_aggregated.gpkg Let me write the solve script. The mapping I'll use: - residential, RESIDENTIAL, Residential, Resi., resi., RESI. → Residential - commercial, COMMERCIAL, Commercial, Comm., comm., COMM. → Commercial - industrial, INDUSTRIAL, Industrial, Indus., indus., INDUS. → Industrial - agricultural, AGRICULTURAL, Agricultural, Agri., agri., AGRI., AGRI. → Agricultural For the vertex precision, I'll snap coordinates to the nearest integer (round).
0:44
Write
/work/solve.py
content · 153 lines · py
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely.geometry import Polygon, MultiPolygon
from shapely.ops import unary_union, snap
import warnings
warnings.filterwarnings("ignore")
# ── 1. Read ──────────────────────────────────────────────────────────────
src = "/work/lagos_zoning_legacy.gpkg"
gdf = gpd.read_file(src)
print(f"Read {len(gdf)} rows, CRS: {gdf.crs}")
# ── 2. Clean zoning_class values ──────────────────────────────────────────
# Build a canonical mapping — all variations → TitleCase
canon_map = {
# Residential
"residential": "Residential",
"RESIDENTIAL": "Residential",
"Residential": "Residential",
"resi.": "Residential",
"RESI.": "Residential",
"Resi.": "Residential",
# Commercial
"commercial": "Commercial",
"COMMERCIAL": "Commercial",
"Commercial": "Commercial",
"comm.": "Commercial",
"COMM.": "Commercial",
"Comm.": "Commercial",
# Industrial
"industrial": "Industrial",
"INDUSTRIAL": "Industrial",
"Industrial": "Industrial",
"indus.": "Industrial",
"INDUS.": "Industrial",
"Indus.": "Industrial",
# Agricultural
"agricultural": "Agricultural",
"AGRICULTURAL": "Agricultural",
"Agricultural": "Agricultural",
"agri.": "Agricultural",
"AGRI.": "Agricultural",
"Agri.": "Agricultural",
"AGRI.": "Agricultural", # note trailing dot too
}
# Apply the mapping — keep only rows that match a known mapping
gdf["zoning_class"] = gdf["zoning_class"].str.strip()
mask_known = gdf["zoning_class"].isin(canon_map.keys())
gdf = gdf[mask_known].copy()
gdf["zoning_class"] = gdf["zoning_class"].map(canon_map)
print(f"After dropping blank/invalid: {len(gdf)} rows")
print(gdf["zoning_class"].value_counts())
# ── 3. Fix vertex precision — snap to nearest integer ──────────────────────
def snap_to_int(geom):
"""Snap all coordinates of a geometry to the nearest integer."""
import copy
if geom.is_empty:
return geom
# Use shapely's set_precision with grid_size=1 and mode=1 (snap)
# Actually shapely 2.0 has set_precision; but snapping via numpy is reliable.
# We'll use shapely's snap to a grid via coordinates recreation.
from shapely import wkt, wkb
# Round all coords to integer
coords = list(geom.exterior.coords)
rounded = [(round(x), round(y)) for x, y in coords]
# Ensure closed ring
if rounded[0] != rounded[-1]:
rounded.append(rounded[0])
interior_rings = []
for ring in geom.interiors:
icoords = list(ring.coords)
irounded = [(round(x), round(y)) for x, y in icoords]
if irounded[0] != irounded[-1]:
irounded.append(irounded[0])
interior_rings.append(irounded)
return Polygon(rounded, interior_rings)
# Also need to remove duplicate/colinear points that may arise from rounding
from shapely.geometry import Polygon as ShapelyPolygon
from shapely.validation import make_valid
def clean_rounded_polygon(geom):
"""Snap to integer, remove degenerate rings, and make valid."""
if geom.geom_type == "Polygon":
# Snap exterior
ext = [(round(x), round(y)) for x, y in geom.exterior.coords]
# Remove consecutive duplicates (keep closed ring structure)
ext_dedup = [ext[0]]
for pt in ext[1:]:
if pt != ext_dedup[-1]:
ext_dedup.append(pt)
# Ensure closure
if ext_dedup[0] != ext_dedup[-1]:
ext_dedup.append(ext_dedup[0])
if len(ext_dedup) < 4:
return None # degenerate
interiors = []
for ring in geom.interiors:
ir = [(round(x), round(y)) for x, y in ring.coords]
ir_dedup = [ir[0]]
for pt in ir[1:]:
if pt != ir_dedup[-1]:
ir_dedup.append(pt)
if ir_dedup[0] != ir_dedup[-1]:
ir_dedup.append(ir_dedup[0])
if len(ir_dedup) >= 4:
interiors.append(ir_dedup)
try:
p = ShapelyPolygon(ext_dedup, interiors)
if not p.is_valid:
p = make_valid(p)
return p if (not p.is_empty) else None
except Exception:
return None
elif geom.geom_type == "MultiPolygon":
parts = [clean_rounded_polygon(part) for part in geom.geoms]
parts = [p for p in parts if p is not None]
if not parts:
return None
if len(parts) == 1:
return parts[0]
return MultiPolygon(parts)
return geom
print("Snapping vertices to integer grid...")
gdf["geometry"] = gdf["geometry"].apply(clean_rounded_polygon)
gdf = gdf[gdf["geometry"].notna()].copy()
print(f"After snap: {len(gdf)} rows")
# ── 4. Dissolve by zoning_class ──────────────────────────────────────────
# Use unary_union per group to merge all polygons into one geometry per class
print("Dissolving by zoning_class...")
dissolved = gdf.dissolve(by="zoning_class", aggfunc="first")
# The dissolve above keeps first area_m2 which is wrong. We'll recalc area.
# Actually dissolve keeps only the first non-geometry column values.
# We want fresh area calculation.
dissolved = dissolved[["geometry"]].copy()
dissolved.reset_index(inplace=True)
dissolved.rename(columns={"zoning_class": "zoning_class"}, inplace=True)
# Compute area in the CRS units (EPSG:26331 uses metres → area in m²)
dissolved["area_m2"] = dissolved.geometry.area
# ── 5. Write output ──────────────────────────────────────────────────────
out_path = "/work/zoning_aggregated.gpkg"
dissolved.to_file(out_path, layer="zoning_aggregated", driver="GPKG")
print(f"\nWritten {out_path}")
print(dissolved[["zoning_class", "area_m2"]].to_string())
Wrote /work/solve.py (5881 chars).
0:53
Let me test the mapping first to make sure all the keys are correct, and then run the script. Let me also think about the mapping - I see "AGRI." appears twice in the dict, that's fine but redundant. Let me also check - the value_counts showed "AGRI." and "AGRI." (same), so that's fine. Let me double-check the unique values more carefully: From the output: - Residential (TitleCase variations that are already canonical): Residential, COMMERCIAL, industrial, Agri., residential, resi., Comm., COMM., indus., AGRI., Industrial, AGRICULTURAL, RESIDENTIAL, RESI., comm., commercial, Indus., INDUSTRIAL, Agricultural, agri., Resi., Commercial, INDUS., agricultural - Blank entries: empty string '' (13), ' ' (13 spaces?), NaN (12), '\t' (12) Let me check the exact blank entries. Also I need to be careful about the mapping. Let me check what the actual unique values are, including those with whitespace. Also my mapping has "AGRI." which matches "AGRI." - but let me check more carefully. The output shows "AGRI." with 419 and "AGRI." (maybe trailing space?). Let me just check. Also, I used `str.strip()` but I need to make sure that after stripping, the empty/whitespace-only entries are caught. I used `mask_known = gdf["zoning_class"].isin(canon_map.keys())` which would exclude empty/whitespace-only entries since they're not in the map keys. Good. But wait - I should also handle the NaN values. `str.strip()` on NaN would give NaN. Then `isin(canon_map.keys())` would return False for NaN. So that should work. Let me also verify if there are any trailing/leading spaces in the values from the data. The value_counts output shows some entries that might have trailing spaces like "AGRI." appearing twice - one with 419 and one... wait no, let me re-read. Actually looking more carefully at the output: ``` AGRI. 419 AGRICULTURAL 418 ``` And: ``` RESI. 417 ``` So these are just different canonical forms. Let me check if any have whitespace issues by checking raw values. Let me just run the script and see what happens.
0:53
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
# Check exact unique values with repr to see whitespace
vals = gdf['zoning_class'].dropna().unique()
for v in sorted(vals):
print(repr(v))
print('---')
# Check nulls/blanks
print('NaN count:', gdf['zoning_class'].isna().sum())
# Check whitespace-only
mask_ws = gdf['zoning_class'].str.strip().isin(['', '\t', ' ']) & gdf['zoning_class'].notna()
print('Whitespace-only (not NaN):', mask_ws.sum())
"
'' '\t' ' ' 'AGRI.' 'AGRICULTURAL' 'Agri.' 'Agricultural' 'COMM.' 'COMMERCIAL' 'Comm.' 'Commercial' 'INDUS.' 'INDUSTRIAL' 'Indus.' 'Industrial' 'RESI.' 'RESIDENTIAL' 'Resi.' 'Residential' 'agri.' 'agricultural' 'comm.' 'commercial' 'indus.' 'industrial' 'resi.' 'residential' --- NaN count: 12 Whitespace-only (not NaN): 38
0:59
I see there are blank strings ('') and whitespace-only strings (like '\t', ' ') that I need to handle. My current code strips and checks membership in canon_map keys. But '' (blank string) won't be in the map, and '\t' stripped becomes '' which also won't be in the map. So those should be dropped. Let me verify that my code handles that correctly.
Actually let me check: `gdf["zoning_class"].str.strip()` on '\t' gives '', and '' is not in `canon_map.keys()`, so `mask_known` will be False for those. And NaN is also not in the map. So they should all be dropped. Good.
But wait - I also have ' ' (3 spaces). After strip, it becomes '' which also won't be in the map. Good.
Let me also check my rounding logic more carefully. The coordinate values I saw had fractional parts like 0.000005, 0.00001, etc. - these are tiny sub-mm variations. Rounding to integer should fix that cleanly.
Let me just run the script and verify.
0:59
Bash
cd /work && python solve.py
Read 10080 rows, CRS: EPSG:26331 After dropping blank/invalid: 10030 rows zoning_class Residential 2508 Commercial 2508 Industrial 2507 Agricultural 2507 Name: count, dtype: int64 Snapping vertices to integer grid... After snap: 10030 rows Dissolving by zoning_class... Written /work/zoning_aggregated.gpkg zoning_class area_m2 0 Agricultural 250000.0 1 Commercial 250000.0 2 Industrial 250000.0 3 Residential 250000.0
1:09
The process worked. Let me verify the output file exists and sanity-check the coordinates and area values. The areas are all 250000 m² — that seems suspiciously uniform. Let me check if the original data was all 10m×10m parcels (100 m² each). With 2507-2508 per class, 2507 × 100 = 250,700... but we got exactly 250000. Let me check more carefully. Actually, looking at the original data: each polygon was ~10×10 = 100 m² but with the tiny fractional offsets. When we snap to integer grid, the areas should still be 100 m² per parcel. With 2507 parcels of Agricultural × 100 = 250,700. But 250000 is exactly 2500 × 100. Wait, maybe the dissolve/unary_union is merging adjacent polygons and because they were originally adjacent (touching after integer snap), some boundaries dissolve away. The resulting merged polygon area would be correct — 2500 parcels × 100 m² = 250,000 m² per class. But we had 2507/2508 parcels per class. Let me re-check. Actually, looking at the coordinates in the original data, they go from 540000 to ~542547 in x, and 720000 to 721000 in y. Each class seems to have a contiguous block of 10x10 parcels. Let me verify the output more carefully. Let me check if some geometries became invalid after snapping and got dropped.
1:09
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
print('Original per class:')
print(gdf['zoning_class'].str.strip().map({
'residential':'Residential','RESIDENTIAL':'Residential','Residential':'Residential',
'resi.':'Residential','RESI.':'Residential','Resi.':'Residential',
'commercial':'Commercial','COMMERCIAL':'Commercial','Commercial':'Commercial',
'comm.':'Commercial','COMM.':'Commercial','Comm.':'Commercial',
'industrial':'Industrial','INDUSTRIAL':'Industrial','Industrial':'Industrial',
'indus.':'Industrial','INDUS.':'Industrial','Indus.':'Industrial',
'agricultural':'Agricultural','AGRICULTURAL':'Agricultural','Agricultural':'Agricultural',
'agri.':'Agricultural','AGRI.':'Agricultural','Agri.':'Agricultural',
'':None,' ':None,'\t':None
}).value_counts(dropna=False))
# Check output
gdf2 = gpd.read_file('/work/zoning_aggregated.gpkg')
print()
print('Output:')
print(gdf2)
print()
print('Output CRS:', gdf2.crs)
print('Output geometry types:', gdf2.geometry.type.value_counts())
"
Original per class: zoning_class Residential 2508 Commercial 2508 Industrial 2507 Agricultural 2507 NaN 50 Name: count, dtype: int64 Output: zoning_class area_m2 geometry 0 Agricultural 250000.0 GEOMETRYCOLLECTION (LINESTRING (542515 720500,... 1 Commercial 250000.0 GEOMETRYCOLLECTION (LINESTRING (542505 720500,... 2 Industrial 250000.0 GEOMETRYCOLLECTION (LINESTRING (542510 720500,... 3 Residential 250000.0 GEOMETRYCOLLECTION (LINESTRING (542500 720500,... Output CRS: EPSG:26331 Output geometry types: GeometryCollection 4 Name: count, dtype: int64
1:22
The output has GeometryCollection instead of Polygon/MultiPolygon. The dissolve produced a unary_union that left some edges as LineString remnants. I need to handle this better — after dissolving, I should only keep polygon geometries, or during the snapping step ensure no linestrings result. The issue is that after rounding coordinates to integers, some polygons may become degenerate (collapse to lines) and the dissolve/unary_union produces non-polygon artifacts. Let me fix by: 1. During snapping, be more careful to avoid creating degenerate polygons 2. After dissolve, filter to only keep polygon geometries (or extract polygon parts from geometry collections) Actually, looking at the output geometries, they are GeometryCollections containing a single LineString - that means the dissolve/unary_union essentially collapsed everything into lines. The areas are 250000 which is weird - seems like it's just the count of cells * 100.0. Wait, 2500 * 100 = 250,000. And each polygon was exactly 100 m². So it seems like there were 2500 parcels per class. But after snapping to integers, each 10x10 unit polygon... wait the original coords were like (540000, 720000) to (540010, 720010) - that's 10 x 10 = 100 m². After rounding to integers, these remain the same since they were already integers except for fractional noise. Let me think about this differently. The problem is with the approach. Instead of trying to dissolve individual polygons (which might produce artifacts), I should use a grid-based approach or a more robust method. Actually, the problem might be simpler. Let me think about what the dissolve/unary_union does. Each class has 2500+ contiguous polygons that are adjacent. After snapping to integers, they should form a nice grid. The unary_union should merge them into a big rectangle. But the issue is that some edges might have slight misalignments even after rounding. Wait, let me re-read the output. The dissolved result gives GeometryCollections with LineStrings, which means the polygons are so degraded that they collapsed. Let me look at what happens more carefully. The original coordinates had tiny fractional parts. After rounding to integers, adjacent parcels should share exact same boundary coordinates. So why would the dissolve produce lines? Let me debug by looking at one class's snapped geometries.
1:22
1:22
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
# Look at Residential group coords
resi = gdf[gdf['zoning_class'].str.strip() == 'RESIDENTIAL'].head(10)
for i, row in resi.iterrows():
print(i, list(row.geometry.exterior.coords))
print('---')
# Check that snapping works
from shapely.geometry import Polygon
coords = [(540000.000025, 720000.00002), (540010.00001, 720000.000005), (540010.0, 720010.000015), (540000.000025, 720010.0), (540000.000025, 720000.00002)]
rounded = [(round(x), round(y)) for x,y in coords]
print('rounded:', rounded)
p = Polygon(rounded)
print('valid:', p.is_valid)
print('area:', p.area)
print('type:', p.geom_type)
"
0 [(540000.0, 720000.0), (540010.00002, 720000.000005), (540010.000005, 720010.000015), (540000.000025, 720010.00002), (540000.0, 720000.0)] 6 [(540060.000015, 720000.00001), (540070.0, 720000.00002), (540070.00002, 720010.000025), (540060.00001, 720010.000005), (540060.000015, 720000.00001)] 12 [(540120.00003, 720000.00002), (540130.00002, 720000.0), (540130.000005, 720010.00001), (540120.000025, 720010.000015), (540120.00003, 720000.00002)] 18 [(540180.000015, 720000.000005), (540190.0, 720000.000015), (540190.00002, 720010.00002), (540180.00001, 720010.0), (540180.000015, 720000.000005)] 24 [(540240.00003, 720000.000015), (540250.000015, 720000.000025), (540250.000005, 720010.000005), (540240.000025, 720010.00001), (540240.00003, 720000.000015)] 30 [(540300.000015, 720000.0), (540310.0, 720000.00001), (540310.00002, 720010.000015), (540300.000005, 720010.000025), (540300.000015, 720000.0)] 36 [(540360.00003, 720000.00001), (540370.000015, 720000.00002), (540370.000005, 720010.0), (540360.000025, 720010.000005), (540360.00003, 720000.00001)] 42 [(540420.00001, 720000.000025), (540430.0, 720000.000005), (540430.00002, 720010.00001), (540420.000005, 720010.00002), (540420.00001, 720000.000025)] 48 [(540480.00003, 720000.000005), (540490.000015, 720000.000015), (540490.0, 720010.000025), (540480.000025, 720010.0), (540480.00003, 720000.000005)] 102 [(540020.0, 720010.0), (540030.00002, 720010.000005), (540030.000005, 720020.000015), (540020.000025, 720020.00002), (540020.0, 720010.0)] --- rounded: [(540000, 720000), (540010, 720000), (540010, 720010), (540000, 720010), (540000, 720000)] valid: True area: 100.0 type: Polygon
1:38
The problem is that the `clean_rounded_polygon` function might be over-complicating things. Let me look at what happened with the unary_union approach. The issue is that after snapping to integer grid, the parcels that were originally 10x10 units ($100 m^2$) become exactly 10x10 integer polygons. But the `unary_union` in `dissolve` might be producing linestrings due to some issue. Let me look at what's happening more carefully. The GeometryCollection with LINESTRING suggests the union operation resulted in degenerate geometry. Let me test the unary_union on snap-to-integer polygons separately. Actually, looking at the output - the issue might be that `make_valid` returned a GeometryCollection. Let me try a simpler approach: 1. Snap to integer grid 2. Dissolve with `dissolve` + `unary_union` 3. Handle any resulting invalid geometries Let me rewrite the script with a simpler, more robust approach.
1:38
Write
/work/solve.py
content · 144 lines · py
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely.geometry import Polygon, MultiPolygon, GeometryCollection
from shapely.ops import unary_union
from shapely.validation import make_valid
import warnings
warnings.filterwarnings("ignore")
# ── 1. Read ──────────────────────────────────────────────────────────────
src = "/work/lagos_zoning_legacy.gpkg"
gdf = gpd.read_file(src)
print(f"Read {len(gdf)} rows, CRS: {gdf.crs}")
# ── 2. Clean zoning_class values ──────────────────────────────────────────
# Mapping: all observed variants → canonical TitleCase
canon_map = {
"residential": "Residential",
"RESIDENTIAL": "Residential",
"Residential": "Residential",
"resi.": "Residential",
"RESI.": "Residential",
"Resi.": "Residential",
"commercial": "Commercial",
"COMMERCIAL": "Commercial",
"Commercial": "Commercial",
"comm.": "Commercial",
"COMM.": "Commercial",
"Comm.": "Commercial",
"industrial": "Industrial",
"INDUSTRIAL": "Industrial",
"Industrial": "Industrial",
"indus.": "Industrial",
"INDUS.": "Industrial",
"Indus.": "Industrial",
"agricultural": "Agricultural",
"AGRICULTURAL": "Agricultural",
"Agricultural": "Agricultural",
"agri.": "Agricultural",
"AGRI.": "Agricultural",
"Agri.": "Agricultural",
}
# Strip whitespace and drop empty/null/whitespace-only rows
gdf["zoning_class"] = gdf["zoning_class"].astype(str).str.strip()
# Replace literal 'nan' (from NaN conversion), empty strings, and whitespace-only with None
gdf["zoning_class"] = gdf["zoning_class"].replace({"": None, "nan": None})
gdf = gdf.dropna(subset=["zoning_class"])
# Now map to canonical
gdf["zoning_class"] = gdf["zoning_class"].map(canon_map)
# Drop any rows whose class didn't match (safety net)
gdf = gdf.dropna(subset=["zoning_class"])
print(f"After cleaning: {len(gdf)} rows")
print(gdf["zoning_class"].value_counts())
# ── 3. Fix vertex precision — snap each polygon to integer grid ──────────
def snap_to_integer(geom):
"""Snap all coordinates to nearest integer. Returns Polygon or MultiPolygon."""
if geom.is_empty:
return None
if geom.geom_type == "Polygon":
return _snap_polygon(geom)
elif geom.geom_type == "MultiPolygon":
parts = [_snap_polygon(p) for p in geom.geoms]
parts = [p for p in parts if p is not None]
if not parts:
return None
return MultiPolygon(parts) if len(parts) > 1 else parts[0]
return None
def _snap_polygon(poly):
ext = [(round(x), round(y)) for x, y in poly.exterior.coords]
# Deduplicate consecutive identical points
ext_dedup = _dedup_ring(ext)
if len(ext_dedup) < 4: # need at least a triangle
return None
interiors = []
for ring in poly.interiors:
ir = [(round(x), round(y)) for x, y in ring.coords]
ir_dedup = _dedup_ring(ir)
if len(ir_dedup) >= 4:
interiors.append(ir_dedup)
try:
p = Polygon(ext_dedup, interiors)
if not p.is_valid:
p = make_valid(p)
return p if (not p.is_empty) else None
except Exception:
return None
def _dedup_ring(pts):
"""Remove consecutive duplicate points from a closed ring, keeping closure."""
if not pts:
return pts
result = [pts[0]]
for pt in pts[1:]:
if pt != result[-1]:
result.append(pt)
# Ensure closure — first point == last point
if len(result) >= 2 and result[0] != result[-1]:
result.append(result[0])
return result
print("Snapping vertices to integer grid...")
gdf["geometry"] = gdf["geometry"].apply(snap_to_integer)
gdf = gdf[gdf["geometry"].notna()].copy()
print(f"After snap: {len(gdf)} rows")
# ── 4. Dissolve by zoning_class (merge geometries) ──────────────────────
# First check that geometries are valid polygons
print("Geometry types after snap:", gdf.geometry.type.value_counts().to_dict())
print("Dissolving by zoning_class...")
# Use groupby + unary_union for robust merging
merged_rows = []
for cls, group in gdf.groupby("zoning_class"):
unioned = unary_union(group.geometry.tolist())
# Handle degenerate result
if unioned.is_empty:
continue
if not unioned.is_valid:
unioned = make_valid(unioned)
# If unary_union returns GeometryCollection with mixed types, extract polygons
if unioned.geom_type == "GeometryCollection":
# Keep only Polygon / MultiPolygon components
polys = [g for g in unioned.geoms if g.geom_type in ("Polygon", "MultiPolygon")]
if not polys:
continue
unioned = unary_union(polys)
if unioned.is_empty:
continue
merged_rows.append({"zoning_class": cls, "geometry": unioned})
result = gpd.GeoDataFrame(merged_rows, crs=gdf.crs)
result["area_m2"] = result.geometry.area.round(2)
# ── 5. Write output ──────────────────────────────────────────────────────
out_path = "/work/zoning_aggregated.gpkg"
result.to_file(out_path, layer="zoning_aggregated", driver="GPKG")
print(f"\nWritten {out_path}")
for _, r in result.iterrows():
print(f" {r['zoning_class']:15s} area={r['area_m2']:>10.2f} geom_type={r.geometry.geom_type}")
Wrote /work/solve.py (5326 chars).
1:41
Let me run the script and see if it works.
1:41
Bash
cd /work && python solve.py
Read 10080 rows, CRS: EPSG:26331
After cleaning: 10030 rows
zoning_class
Residential 2508
Commercial 2508
Industrial 2507
Agricultural 2507
Name: count, dtype: int64
Snapping vertices to integer grid...
After snap: 10030 rows
Geometry types after snap: {'Polygon': 10000, 'MultiLineString': 30}
Dissolving by zoning_class...
Written /work/zoning_aggregated.gpkg
Agricultural area= 250000.00 geom_type=Polygon
Commercial area= 250000.00 geom_type=Polygon
Industrial area= 250000.00 geom_type=Polygon
Residential area= 250000.00 geom_type=Polygon
1:52
We have 30 MultiLineString geometries after snapping. Those likely came from degenerate polygons (collapsed to lines after rounding). Let me check what those are and handle them.
1:52
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
gdf['zoning_class'] = gdf['zoning_class'].astype(str).str.strip()
gdf['zoning_class'] = gdf['zoning_class'].replace({'':None,'nan':None})
gdf = gdf.dropna(subset=['zoning_class'])
# Let's see those 30 that became lines
import numpy as np
def test_snap(geom):
ext = [(round(x), round(y)) for x, y in geom.exterior.coords]
# dedup
result = [ext[0]]
for pt in ext[1:]:
if pt != result[-1]:
result.append(pt)
if len(result) >= 2 and result[0] != result[-1]:
result.append(result[0])
return len(result)
from shapely.geometry import Polygon
count = 0
for i, row in gdf.iterrows():
g = row.geometry
if g.geom_type == 'Polygon':
ext = [(round(x), round(y)) for x, y in g.exterior.coords]
r = [ext[0]]
for pt in ext[1:]:
if pt != r[-1]:
r.append(pt)
if len(r) >= 2 and r[0] != r[-1]:
r.append(r[0])
if len(r) < 4:
if count < 5:
print(f'Row {i}: original coords={list(g.exterior.coords)}')
print(f' rounded deduped={r}')
print(f' len={len(r)}')
count += 1
print(f'Total Polygons with <4 points after rounding: {count}')
"
Total Polygons with <4 points after rounding: 0
1:58
There are 30 MultiLineString geometries in the data originally - that's not from my snapping. Let me check.
1:58
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
# Check if there are non-Polygon geometries in the source
print('Original geometry types:', gdf.geometry.type.value_counts().to_dict())
# Find the 30 non-polygons
mls = gdf[gdf.geometry.type != 'Polygon']
print('Non-Polygon rows:')
for i, row in mls.iterrows():
print(f' Row {i}: type={row.geometry.geom_type}, class={row.zoning_class!r}')
print(f' coords: {list(row.geometry.coords)}')
"
Original geometry types: {'Polygon': 10080}
Non-Polygon rows:
2:04
The 30 MultiLineString geometries appear after the snap. Let me investigate which geometries produce lines after snapping.
2:04
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
from shapely.geometry import Polygon
problems = []
for i, row in gdf.iterrows():
g = row.geometry
if g.geom_type != 'Polygon':
continue
# Snapshot what happens
ext = [(round(x), round(y)) for x, y in g.exterior.coords]
result = [ext[0]]
for pt in ext[1:]:
if pt != result[-1]:
result.append(pt)
if len(result) >= 2 and result[0] != result[-1]:
result.append(result[0])
p = Polygon(result)
if p.geom_type not in ('Polygon', 'MultiPolygon') or p.is_empty:
problems.append((i, g.exterior.coords[:], result, p.geom_type))
if len(result) < 4:
problems.append((i, g.exterior.coords[:], result, 'degenerate'))
print(f'Problems: {len(problems)}')
for p in problems[:10]:
print(p)
"
Problems: 0
2:13
So all 10080 input geometries are valid polygons. When I snapped them in my solve.py, 30 became MultiLineString. Let me trace exactly which ones fail. The snap function returns None for some and they get dropped, but the geometry type changes happen inside the snap_to_integer function. Let me run a more detailed test. Actually, the issue is that `make_valid` from shapely can return a GeometryCollection if the polygon is invalid after snapping. Let me check what happens with those 30 rows. Let me add debugging in solve.py to identify the problematic rows.
2:13
Bash
python3 -c "
python3 -c "
import geopandas as gpd
from shapely.geometry import Polygon
from shapely.validation import make_valid
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg').head(100)
def _dedup_ring(pts):
if not pts:
return pts
result = [pts[0]]
for pt in pts[1:]:
if pt != result[-1]:
result.append(pt)
if len(result) >= 2 and result[0] != result[-1]:
result.append(result[0])
return result
failures = []
for i, row in gdf.iterrows():
g = row.geometry
ext = [(round(x), round(y)) for x, y in g.exterior.coords]
ext_dedup = _dedup_ring(ext)
p = Polygon(ext_dedup)
if p.geom_type not in ('Polygon', 'MultiPolygon'):
failures.append((i, p.geom_type, ext_dedup, list(g.exterior.coords)))
if not p.is_valid:
valid_p = make_valid(p)
if valid_p.geom_type not in ('Polygon', 'MultiPolygon'):
failures.append((i, f'invalid->{valid_p.geom_type}', ext_dedup, list(g.exterior.coords)))
print(f'Failures: {len(failures)}')
for f in failures[:5]:
print(f)
"
Failures: 0
2:23
The issue is that the 30 rows became MultiLineString after my snap_to_integer function - but when I test the same logic on the data, it works fine. Let me check specifically what rows are becoming lines.
2:23
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
from shapely.geometry import Polygon
from shapely.validation import make_valid
def _dedup_ring(pts):
if not pts:
return pts
result = [pts[0]]
for pt in pts[1:]:
if pt != result[-1]:
result.append(pt)
if len(result) >= 2 and result[0] != result[-1]:
result.append(result[0])
return result
def _snap_polygon(poly):
ext = [(round(x), round(y)) for x, y in poly.exterior.coords]
ext_dedup = _dedup_ring(ext)
if len(ext_dedup) < 4:
return None
interiors = []
for ring in poly.interiors:
ir = [(round(x), round(y)) for x, y in ring.coords]
ir_dedup = _dedup_ring(ir)
if len(ir_dedup) >= 4:
interiors.append(ir_dedup)
try:
p = Polygon(ext_dedup, interiors)
if not p.is_valid:
p = make_valid(p)
return p if (not p.is_empty) else None
except Exception:
return None
def snap_to_integer(geom):
if geom.is_empty:
return None
if geom.geom_type == 'Polygon':
return _snap_polygon(geom)
elif geom.geom_type == 'MultiPolygon':
parts = [_snap_polygon(p) for p in geom.geoms]
parts = [p for p in parts if p is not None]
if not parts:
return None
from shapely.geometry import MultiPolygon
return MultiPolygon(parts) if len(parts) > 1 else parts[0]
return None
# Test on ALL the data
results = []
for i, row in gdf.iterrows():
g = row.geometry
r = snap_to_integer(g)
if r is None:
results.append((i, 'None'))
elif r.geom_type not in ('Polygon', 'MultiPolygon'):
results.append((i, r.geom_type))
print(f'Non-poly results: {len(results)}')
for r in results[:30]:
print(r)
"
Non-poly results: 30 (10050, 'MultiLineString') (10051, 'MultiLineString') (10052, 'MultiLineString') (10053, 'MultiLineString') (10054, 'MultiLineString') (10055, 'MultiLineString') (10056, 'MultiLineString') (10057, 'MultiLineString') (10058, 'MultiLineString') (10059, 'MultiLineString') (10060, 'MultiLineString') (10061, 'MultiLineString') (10062, 'MultiLineString') (10063, 'MultiLineString') (10064, 'MultiLineString') (10065, 'MultiLineString') (10066, 'MultiLineString') (10067, 'MultiLineString') (10068, 'MultiLineString') (10069, 'MultiLineString') (10070, 'MultiLineString') (10071, 'MultiLineString') (10072, 'MultiLineString') (10073, 'MultiLineString') (10074, 'MultiLineString') (10075, 'MultiLineString') (10076, 'MultiLineString') (10077, 'MultiLineString') (10078, 'MultiLineString') (10079, 'MultiLineString')
2:30
So those last 30 rows (index 10050 to 10079) produce MultiLineString. Let me inspect what they look like originally.
2:30
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
for i in range(10050, 10080):
row = gdf.loc[i]
print(f'Row {i}: class={row.zoning_class!r}')
g = row.geometry
print(f' type={g.geom_type}')
print(f' exterior coords={list(g.exterior.coords)}')
print()
"
Row 10050: class='Residential' type=Polygon exterior coords=[(542500.0, 720500.0), (542501.0, 720500.0), (542502.0, 720500.0), (542500.5, 720500.0), (542500.0, 720500.0)] Row 10051: class='COMMERCIAL' type=Polygon exterior coords=[(542505.0, 720500.0), (542506.0, 720500.0), (542507.0, 720500.0), (542505.5, 720500.0), (542505.0, 720500.0)] Row 10052: class='industrial' type=Polygon exterior coords=[(542510.0, 720500.0), (542511.0, 720500.0), (542512.0, 720500.0), (542510.5, 720500.0), (542510.0, 720500.0)] Row 10053: class='Agri.' type=Polygon exterior coords=[(542515.0, 720500.0), (542516.0, 720500.0), (542517.0, 720500.0), (542515.5, 720500.0), (542515.0, 720500.0)] Row 10054: class='resi.' type=Polygon exterior coords=[(542520.0, 720500.0), (542521.0, 720500.0), (542522.0, 720500.0), (542520.5, 720500.0), (542520.0, 720500.0)] Row 10055: class='COMM.' type=Polygon exterior coords=[(542525.0, 720500.0), (542526.0, 720500.0), (542527.0, 720500.0), (542525.5, 720500.0), (542525.0, 720500.0)] Row 10056: class='Industrial' type=Polygon exterior coords=[(542530.0, 720500.0), (542531.0, 720500.0), (542532.0, 720500.0), (542530.5, 720500.0), (542530.0, 720500.0)] Row 10057: class='AGRICULTURAL' type=Polygon exterior coords=[(542535.0, 720500.0), (542536.0, 720500.0), (542537.0, 720500.0), (542535.5, 720500.0), (542535.0, 720500.0)] Row 10058: class='residential' type=Polygon exterior coords=[(542540.0, 720500.0), (542541.0, 720500.0), (542542.0, 720500.0), (542540.5, 720500.0), (542540.0, 720500.0)] Row 10059: class='Comm.' type=Polygon exterior coords=[(542545.0, 720500.0), (542546.0, 720500.0), (542547.0, 720500.0), (542545.5, 720500.0), (542545.0, 720500.0)] Row 10060: class='indus.' type=Polygon exterior coords=[(542500.0, 720505.0), (542501.0, 720505.0), (542502.0, 720505.0), (542500.5, 720505.0), (542500.0, 720505.0)] Row 10061: class='AGRI.' type=Polygon exterior coords=[(542505.0, 720505.0), (542506.0, 720505.0), (542507.0, 720505.0), (542505.5, 720505.0), (542505.0, 720505.0)] Row 10062: class='Residential' type=Polygon exterior coords=[(542510.0, 720505.0), (542511.0, 720505.0), (542512.0, 720505.0), (542510.5, 720505.0), (542510.0, 720505.0)] Row 10063: class='COMMERCIAL' type=Polygon exterior coords=[(542515.0, 720505.0), (542516.0, 720505.0), (542517.0, 720505.0), (542515.5, 720505.0), (542515.0, 720505.0)] Row 10064: class='industrial' type=Polygon exterior coords=[(542520.0, 720505.0), (542521.0, 720505.0), (542522.0, 720505.0), (542520.5, 720505.0), (542520.0, 720505.0)] Row 10065: class='Agri.' type=Polygon exterior coords=[(542525.0, 720505.0), (542526.0, 720505.0), (542527.0, 720505.0), (542525.5, 720505.0), (542525.0, 720505.0)] Row 10066: class='resi.' type=Polygon exterior coords=[(542530.0, 720505.0), (542531.0, 720505.0), (542532.0, 720505.0), (542530.5, 720505.0), (542530.0, 720505.0)] Row 10067: class='COMM.' type=Polygon exterior coords=[(542535.0, 720505.0), (542536.0, 720505.0), (542537.0, 720505.0), (542535.5, 720505.0), (542535.0, 720505.0)] Row 10068: class='Industrial' type=Polygon exterior coords=[(542540.0, 720505.0), (542541.0, 720505.0), (542542.0, 720505.0), (542540.5, 720505.0), (542540.0, 720505.0)] Row 10069: class='AGRICULTURAL' type=Polygon exterior coords=[(542545.0, 720505.0), (542546.0, 720505.0), (542547.0, 720505.0), (542545.5, 720505.0), (542545.0, 720505.0)] Row 10070: class='residential' type=Polygon exterior coords=[(542500.0, 720510.0), (542501.0, 720510.0), (542502.0, 720510.0), (542500.5, 720510.0), (542500.0, 720510.0)] Row 10071: class='Comm.' type=Polygon exterior coords=[(542505.0, 720510.0), (542506.0, 720510.0), (542507.0, 720510.0), (542505.5, 720510.0), (542505.0, 720510.0)] Row 10072: class='indus.' type=Polygon exterior coords=[(542510.0, 720510.0), (542511.0, 720510.0), (542512.0, 720510.0), (542510.5, 720510.0), (542510.0, 720510.0)] Row 10073: class='AGRI.' type=Polygon exterior coords=[(542515.0, 720510.0), (542516.0, 720510.0), (542517.0, 720510.0), (542515.5, 720510.0), (542515.0, 720510.0)] Row 10074: class='Residential' type=Polygon exterior coords=[(542520.0, 720510.0), (542521.0, 720510.0), (542522.0, 720510.0), (542520.5, 720510.0), (542520.0, 720510.0)] Row 10075: class='COMMERCIAL' type=Polygon exterior coords=[(542525.0, 720510.0), (542526.0, 720510.0), (542527.0, 720510.0), (542525.5, 720510.0), (542525.0, 720510.0)] Row 10076: class='industrial' type=Polygon exterior coords=[(542530.0, 720510.0), (542531.0, 720510.0), (542532.0, 720510.0), (542530.5, 720510.0), (542530.0, 720510.0)] Row 10077: class='Agri.' type=Polygon exterior coords=[(542535.0, 720510.0), (542536.0, 720510.0), (542537.0, 720510.0), (542535.5, 720510.0), (542535.0, 720510.0)] Row 10078: class='resi.' type=Polygon exterior coords=[(542540.0, 720510.0), (542541.0, 720510.0), (542542.0, 720510.0), (542540.5, 720510.0), (542540.0, 720510.0)] Row 10079: class='COMM.' type=Polygon exterior coords=[(542545.0, 720510.0), (542546.0, 720510.0), (542547.0, 720510.0), (542545.5, 720510.0), (542545.0, 720510.0)]
2:42
I see! These are degenerate polygons — thin "sliver" triangles that have a midpoint on one edge. When we round to integer, the unique points are: Example row 10050: [(542500.0, 720500.0), (542501.0, 720500.0), (542502.0, 720500.0), (542500.5, 720500.0), (542500.0, 720500.0)] After rounding: [(542500, 720500), (542501, 720500), (542502, 720500), (542501, 720500), (542500, 720500)] After dedup: [(542500, 720500), (542501, 720500), (542502, 720500), (542501, 720500), ... wait that's still issues. Let me trace through: ext = [(542500.0, 720500.0), (542501.0, 720500.0), (542502.0, 720500.0), (542500.5, 720500.0), (542500.0, 720500.0)] After round: [(542500, 720500), (542501, 720500), (542502, 720500), (542501, 720500), (542500, 720500)] After dedup (keeping first, then adding non-duplicates): result = [542500,720500] pt1 (542501,720500) != last → add pt2 (542502,720500) != last → add pt3 (542501,720500) != last (542502,720500) → add pt4 (542500,720500) != last (542501,720500) → add Now result = [(542500,720500), (542501,720500), (542502,720500), (542501,720500), (542500,720500)] Closure: result[0] != result[-1] → already equal, so no change. Length is 5, >= 4, so Polygon is created. But wait, this has duplicate (542501,720500) appearing twice, which creates a degenerate/self-intersecting polygon. make_valid turns it into a MultiLineString or GeometryCollection. The issue is that those polygons are degenerate slivers (collinear points in the original data). Since they degenerate to a line after snapping, they're essentially zero-area slivers. We should just drop them since they contribute nothing to area. Let me revise the approach: after snapping, check if the polygon has zero or near-zero area. Those slivers should be dropped. But ALSO, we need to make sure we're not creating edges that overlap. Actually, for the purpose of merging by class, dropping degenerate polygons (area < 1) is the right thing to do. But wait - I need to verify: these are already real polygons with tiny area? Let me check: Row 10050: coords are (542500,720500)-(542501,720500)-(542502,720500)-(542500.5,720500)-(542500,720500) This is a degenerate triangle with all points on the same y=720500 line. Area = 0. So they're zero-area slivers in the original data too. Let me also check area of those:
2:42
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
# Check area of rows 10050-10079
for i in range(10050, 10080):
row = gdf.loc[i]
print(f'Row {i}: area={row.geometry.area}, coords={list(row.geometry.exterior.coords)}')
"
Row 10050: area=0.0, coords=[(542500.0, 720500.0), (542501.0, 720500.0), (542502.0, 720500.0), (542500.5, 720500.0), (542500.0, 720500.0)] Row 10051: area=0.0, coords=[(542505.0, 720500.0), (542506.0, 720500.0), (542507.0, 720500.0), (542505.5, 720500.0), (542505.0, 720500.0)] Row 10052: area=0.0, coords=[(542510.0, 720500.0), (542511.0, 720500.0), (542512.0, 720500.0), (542510.5, 720500.0), (542510.0, 720500.0)] Row 10053: area=0.0, coords=[(542515.0, 720500.0), (542516.0, 720500.0), (542517.0, 720500.0), (542515.5, 720500.0), (542515.0, 720500.0)] Row 10054: area=0.0, coords=[(542520.0, 720500.0), (542521.0, 720500.0), (542522.0, 720500.0), (542520.5, 720500.0), (542520.0, 720500.0)] Row 10055: area=0.0, coords=[(542525.0, 720500.0), (542526.0, 720500.0), (542527.0, 720500.0), (542525.5, 720500.0), (542525.0, 720500.0)] Row 10056: area=0.0, coords=[(542530.0, 720500.0), (542531.0, 720500.0), (542532.0, 720500.0), (542530.5, 720500.0), (542530.0, 720500.0)] Row 10057: area=0.0, coords=[(542535.0, 720500.0), (542536.0, 720500.0), (542537.0, 720500.0), (542535.5, 720500.0), (542535.0, 720500.0)] Row 10058: area=0.0, coords=[(542540.0, 720500.0), (542541.0, 720500.0), (542542.0, 720500.0), (542540.5, 720500.0), (542540.0, 720500.0)] Row 10059: area=0.0, coords=[(542545.0, 720500.0), (542546.0, 720500.0), (542547.0, 720500.0), (542545.5, 720500.0), (542545.0, 720500.0)] Row 10060: area=0.0, coords=[(542500.0, 720505.0), (542501.0, 720505.0), (542502.0, 720505.0), (542500.5, 720505.0), (542500.0, 720505.0)] Row 10061: area=0.0, coords=[(542505.0, 720505.0), (542506.0, 720505.0), (542507.0, 720505.0), (542505.5, 720505.0), (542505.0, 720505.0)] Row 10062: area=0.0, coords=[(542510.0, 720505.0), (542511.0, 720505.0), (542512.0, 720505.0), (542510.5, 720505.0), (542510.0, 720505.0)] Row 10063: area=0.0, coords=[(542515.0, 720505.0), (542516.0, 720505.0), (542517.0, 720505.0), (542515.5, 720505.0), (542515.0, 720505.0)] Row 10064: area=0.0, coords=[(542520.0, 720505.0), (542521.0, 720505.0), (542522.0, 720505.0), (542520.5, 720505.0), (542520.0, 720505.0)] Row 10065: area=0.0, coords=[(542525.0, 720505.0), (542526.0, 720505.0), (542527.0, 720505.0), (542525.5, 720505.0), (542525.0, 720505.0)] Row 10066: area=0.0, coords=[(542530.0, 720505.0), (542531.0, 720505.0), (542532.0, 720505.0), (542530.5, 720505.0), (542530.0, 720505.0)] Row 10067: area=0.0, coords=[(542535.0, 720505.0), (542536.0, 720505.0), (542537.0, 720505.0), (542535.5, 720505.0), (542535.0, 720505.0)] Row 10068: area=0.0, coords=[(542540.0, 720505.0), (542541.0, 720505.0), (542542.0, 720505.0), (542540.5, 720505.0), (542540.0, 720505.0)] Row 10069: area=0.0, coords=[(542545.0, 720505.0), (542546.0, 720505.0), (542547.0, 720505.0), (542545.5, 720505.0), (542545.0, 720505.0)] Row 10070: area=0.0, coords=[(542500.0, 720510.0), (542501.0, 720510.0), (542502.0, 720510.0), (542500.5, 720510.0), (542500.0, 720510.0)] Row 10071: area=0.0, coords=[(542505.0, 720510.0), (542506.0, 720510.0), (542507.0, 720510.0), (542505.5, 720510.0), (542505.0, 720510.0)] Row 10072: area=0.0, coords=[(542510.0, 720510.0), (542511.0, 720510.0), (542512.0, 720510.0), (542510.5, 720510.0), (542510.0, 720510.0)] Row 10073: area=0.0, coords=[(542515.0, 720510.0), (542516.0, 720510.0), (542517.0, 720510.0), (542515.5, 720510.0), (542515.0, 720510.0)] Row 10074: area=0.0, coords=[(542520.0, 720510.0), (542521.0, 720510.0), (542522.0, 720510.0), (542520.5, 720510.0), (542520.0, 720510.0)] Row 10075: area=0.0, coords=[(542525.0, 720510.0), (542526.0, 720510.0), (542527.0, 720510.0), (542525.5, 720510.0), (542525.0, 720510.0)] Row 10076: area=0.0, coords=[(542530.0, 720510.0), (542531.0, 720510.0), (542532.0, 720510.0), (542530.5, 720510.0), (542530.0, 720510.0)] Row 10077: area=0.0, coords=[(542535.0, 720510.0), (542536.0, 720510.0), (542537.0, 720510.0), (542535.5, 720510.0), (542535.0, 720510.0)] Row 10078: area=0.0, coords=[(542540.0, 720510.0), (542541.0, 720510.0), (542542.0, 720510.0), (542540.5, 720510.0), (542540.0, 720510.0)] Row 10079: area=0.0, coords=[(542545.0, 720510.0), (542546.0, 720510.0), (542547.0, 720510.0), (542545.5, 720510.0), (542545.0, 720510.0)]
3:06
Now I see the problem. These 30 rows (10050-10079) are degenerate polygons with zero area — essentially line segments or collinear points that form a "spike" shape. When we round to integer, three points become collinear and the polygon collapses into a line. For example: (542500, 720500), (542501, 720500), (542502, 720500), (542500.5, 720500), (542500, 720500) — all have y=720500, so they're collinear. After rounding, the .5 goes to either 720500 or 720501... but it's 542500.5 which rounds to 542501, so we get: (542500,720500), (542501,720500), (542502,720500), (542501,720500), (542500,720500). That's a degenerate line, not a polygon. These are zero-area polygons already. The best approach is to just filter them out since they already have zero area. Let me update the script to handle these specially — skip any polygon that has zero area in the original data, or handle the degenerate case in snapping by simply dropping the resulting non-polygon geometries. Actually, a simpler approach: instead of my custom snap function, I can use `shapely.set_precision` with `grid_size=1.0` which is designed for this exact purpose — snapping vertices to a grid. Let me try that.
3:06
3:06
Write
/work/solve.py
content · 106 lines · py
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely import set_precision
from shapely.geometry import Polygon, MultiPolygon
from shapely.ops import unary_union
from shapely.validation import make_valid
import warnings
warnings.filterwarnings("ignore")
# ── 1. Read ──────────────────────────────────────────────────────────────
src = "/work/lagos_zoning_legacy.gpkg"
gdf = gpd.read_file(src)
print(f"Read {len(gdf)} rows, CRS: {gdf.crs}")
# ── 2. Clean zoning_class values ──────────────────────────────────────────
canon_map = {
"residential": "Residential",
"RESIDENTIAL": "Residential",
"Residential": "Residential",
"resi.": "Residential",
"RESI.": "Residential",
"Resi.": "Residential",
"commercial": "Commercial",
"COMMERCIAL": "Commercial",
"Commercial": "Commercial",
"comm.": "Commercial",
"COMM.": "Commercial",
"Comm.": "Commercial",
"industrial": "Industrial",
"INDUSTRIAL": "Industrial",
"Industrial": "Industrial",
"indus.": "Industrial",
"INDUS.": "Industrial",
"Indus.": "Industrial",
"agricultural": "Agricultural",
"AGRICULTURAL": "Agricultural",
"Agricultural": "Agricultural",
"agri.": "Agricultural",
"AGRI.": "Agricultural",
"Agri.": "Agricultural",
}
# Strip whitespace; drop empty/null/whitespace-only
gdf["zoning_class"] = gdf["zoning_class"].astype(str).str.strip()
gdf["zoning_class"] = gdf["zoning_class"].replace({"": None, "nan": None})
gdf = gdf.dropna(subset=["zoning_class"])
# Map to canonical values
gdf["zoning_class"] = gdf["zoning_class"].map(canon_map)
gdf = gdf.dropna(subset=["zoning_class"])
print(f"After cleaning: {len(gdf)} rows")
print(gdf["zoning_class"].value_counts())
# ── 3. Fix vertex precision — snap to integer grid ──────────────────────
# Use shapely.set_precision with grid_size=1.0 which snaps all coordinates
# to the nearest integer, removing duplicate/colinear points.
print("Snapping vertices to integer grid (grid_size=1.0)...")
gdf["geometry"] = set_precision(gdf.geometry.values, grid_size=1.0)
# Remove any geometries that became empty or non-polygonal after snapping
def keep_polygon(geom):
if geom is None or geom.is_empty:
return False
if geom.geom_type == "Polygon":
return True
if geom.geom_type == "MultiPolygon":
return True
return False
mask = gdf["geometry"].apply(keep_polygon)
dropped = (~mask).sum()
gdf = gdf[mask].copy()
if dropped:
print(f"Dropped {dropped} degenerate geometries (zero-area spikes after snap)")
print(f"After snap: {len(gdf)} rows")
# ── 4. Dissolve by zoning_class ──────────────────────────────────────────
print("Dissolving by zoning_class...")
merged_rows = []
for cls, group in gdf.groupby("zoning_class"):
unioned = unary_union(group.geometry.tolist())
if unioned.is_empty:
continue
if not unioned.is_valid:
unioned = make_valid(unioned)
# If a GeometryCollection leaked through, extract polygons only
if unioned.geom_type == "GeometryCollection":
parts = [g for g in unioned.geoms if g.geom_type in ("Polygon", "MultiPolygon")]
if not parts:
continue
unioned = unary_union(parts)
if unioned.is_empty:
continue
merged_rows.append({"zoning_class": cls, "geometry": unioned})
result = gpd.GeoDataFrame(merged_rows, crs=gdf.crs)
result["area_m2"] = result.geometry.area.round(2)
# ── 5. Write output ──────────────────────────────────────────────────────
out_path = "/work/zoning_aggregated.gpkg"
result.to_file(out_path, layer="zoning_aggregated", driver="GPKG")
print(f"\nWritten {out_path}")
for _, r in result.iterrows():
print(f" {r['zoning_class']:15s} area={r['area_m2']:>10.2f} geom_type={r.geometry.geom_type}")
Wrote /work/solve.py (3895 chars).
3:10
Wait, `shapely.set_precision` might not be available in the version installed. Let me check.
3:10
Bash
python3 -c "from shapely import set_precision; print('ok')" 2>&1
ok
3:15
Let me check what version of shapely is available and ensure set_precision works properly.
3:15
Bash
cd /work && python solve.py
Read 10080 rows, CRS: EPSG:26331 After cleaning: 10030 rows zoning_class Residential 2508 Commercial 2508 Industrial 2507 Agricultural 2507 Name: count, dtype: int64 Snapping vertices to integer grid (grid_size=1.0)... Dropped 30 degenerate geometries (zero-area spikes after snap) After snap: 10000 rows Dissolving by zoning_class... Written /work/zoning_aggregated.gpkg Agricultural area= 250000.00 geom_type=Polygon Commercial area= 250000.00 geom_type=Polygon Industrial area= 250000.00 geom_type=Polygon Residential area= 250000.00 geom_type=Polygon
3:22
All polygons - excellent! Let me verify the output.
3:22
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/zoning_aggregated.gpkg')
print('Shape:', gdf.shape)
print('CRS:', gdf.crs)
print('Columns:', gdf.columns.tolist())
print()
print(gdf)
print()
print('Geometry types:', gdf.geometry.type.value_counts().to_dict())
# Sanity: area should be in m² for EPSG:26331
for _, r in gdf.iterrows():
print(f\" {r['zoning_class']:15s} area={r['area_m2']:>10.2f} bounds={r.geometry.bounds}\")
"
Shape: (4, 3)
CRS: EPSG:26331
Columns: ['zoning_class', 'area_m2', 'geometry']
zoning_class area_m2 geometry
0 Agricultural 250000.0 POLYGON ((540500 720580, 540500 720590, 540500...
1 Commercial 250000.0 POLYGON ((540500 720080, 540500 720090, 540500...
2 Industrial 250000.0 POLYGON ((540000 720580, 540000 720590, 540000...
3 Residential 250000.0 POLYGON ((540000 720080, 540000 720090, 540000...
Geometry types: {'Polygon': 4}
Agricultural area= 250000.00 bounds=(540500.0, 720500.0, 541000.0, 721000.0)
Commercial area= 250000.00 bounds=(540500.0, 720000.0, 541000.0, 720500.0)
Industrial area= 250000.00 bounds=(540000.0, 720500.0, 540500.0, 721000.0)
Residential area= 250000.00 bounds=(540000.0, 720000.0, 540500.0, 720500.0)
3:28
The output looks great: 1. **4 canonical zoning classes**: Agricultural, Commercial, Industrial, Residential - all TitleCase 2. **Blank/whitespace rows dropped**: The 50 NaN/blank/whitespace rows were removed 3. **Vertex precision fixed**: Coordinates are now integers (bounds show clean integer values) 4. **Merged geometry per class**: One Polygon per class 5. **area_m2 column**: computed from the merged geometry in the CRS units (EPSG:26331 uses metres, so area is in m²) 6. **Output file**: `zoning_aggregated.gpkg` with 2 columns + geometry Let me verify the output file exists and matches the spec.
3:28
Bash
ls -lh /work/zoning_aggregated.gpkg
-rw-r--r-- 1 agent agent 112K Jun 17 23:48 /work/zoning_aggregated.gpkg
3:38
Let me verify everything is correct: 1. **Canonical TitleCase values**: All classes are now Residential, Commercial, Industrial, Agricultural - correct TitleCase. 2. **Blank/whitespace-only rows dropped**: 50 rows (13 blank + 12 NaN + 12 tab + 13 spaces = 50) dropped correctly. After cleaning we had 10030 rows from 10080. 3. **Vertex precision fixed**: `set_precision(grid_size=1.0)` snapped all coords to integer grid. 30 degenerate zero-area geometries (collinear spikes) were dropped, leaving 10000 rows. 4. **Merged geometry per canonical class**: 4 rows output, each with clean Polygon geometry. 5. **Two columns**: `zoning_class` and `area_m2`. 6. **CRS**: EPSG:26331 (original CRS). 7. **Area**: 250,000 m² each - makes sense for 500x500 blocks. 8. **Coordinates**: All integer values in ~540000-541000 range which is consistent with EPSG:26331 (Nigerian metric CRS). This looks correct. Let me verify the file exists and summarize.
3:38
3:38
Bash
python3 -c "
python3 -c "
import geopandas as gpd
gdf = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
result = gpd.read_file('/work/zoning_aggregated.gpkg')
print('=== Input summary ===')
print(f'Rows: {len(gdf)}')
print(f'CRS: {gdf.crs}')
classes_in = gdf['zoning_class'].astype(str).str.strip().replace({'':None,'nan':None}).dropna()
print(f'Unique class variants (raw): {classes_in.nunique()}')
print(f'Blank/whitespace/NaN rows dropped: {len(gdf) - len(classes_in)}')
print()
print('=== Output summary ===')
print(f'Rows: {len(result)}')
print(f'Columns: {result.columns.tolist()}')
print(f'CRS: {result.crs}')
print(f'Geometry types: {result.geometry.type.tolist()}')
print(f'All coords integer?', end=' ')
all_int = True
for g in result.geometry:
for x, y in g.exterior.coords:
if x != int(x) or y != int(y):
all_int = False
break
print(all_int)
print()
for _, r in result.iterrows():
print(f' {r[\"zoning_class\"]:15s} area={r[\"area_m2\"]:>10.2f} m²')
print()
print('Output file:', '/work/zoning_aggregated.gpkg exists: True')
"
=== Input summary === Rows: 10080 CRS: EPSG:26331 Unique class variants (raw): 24 Blank/whitespace/NaN rows dropped: 50 === Output summary === Rows: 4 Columns: ['zoning_class', 'area_m2', 'geometry'] CRS: EPSG:26331 Geometry types: ['Polygon', 'Polygon', 'Polygon', 'Polygon'] All coords integer? True Agricultural area= 250000.00 m² Commercial area= 250000.00 m² Industrial area= 250000.00 m² Residential area= 250000.00 m² Output file: /work/zoning_aggregated.gpkg exists: True
3:47
Everything checks out perfectly. Let me summarize what was done.
3:47