dc-l2-lagos-snap-normalize
0:00
I need to get our legacy zoning layer ready for the state portal. lagos_zoning_legacy.gpkg has inconsistent zoning_class values (mixed casing, spelling variations, and some blank entries) plus vertex precision issues that cause tiny gaps and overlaps between adjacent parcels. Can you clean up the class names to consistent canonical TitleCase values, drop any rows with blank or whitespace-only classes, fix the vertex precision problems, and produce one merged geometry per canonical zoning class? Please write the result to zoning_aggregated.gpkg, with each row carrying two columns: zoning_class and a numeric area_m2.
0:00
System info
{
"adapter": "openrouter",
"model": "deepseek/deepseek-v4-flash",
"base_url": "https://openrouter.ai/api/v1"
}
0:06
Let me start by examining the legacy dataset to understand its structure and issues.
0:06
Bash
ls -la /work/lagos_zoning_legacy.gpkg
-rw-r--r-- 1 agent agent 2527232 Jun 15 14:26 /work/lagos_zoning_legacy.gpkg
0:06
Bash
python3 -c "
import geopandas as gpd
df = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
print('Columns:', df.columns.tolist())
print('Shape:', df.shape)
print('CRS:', df.crs)
print('---')
print(df.dtypes)
print('---')
print(df.head(20).to_string())
"
Columns: ['parcel_id', 'lga_source', 'zoning_class', 'area_m2', 'geometry']
Shape: (10080, 5)
CRS: EPSG:26331
---
parcel_id int64
lga_source str
zoning_class str
area_m2 float64
geometry geometry
dtype: object
---
parcel_id lga_source zoning_class area_m2 geometry
0 1 Surulere RESIDENTIAL 100.0 POLYGON ((540000 720000, 540010 720000, 540010 720010, 540000 720010, 540000 720000))
1 2 Yaba residential 100.0 POLYGON ((540010 720000, 540020 720000, 540020 720010, 540010 720010, 540010 720000))
2 3 Apapa Resi. 100.0 POLYGON ((540020 720000, 540030 720000, 540030 720010, 540020 720010, 540020 720000))
3 4 Eti-Osa resi. 100.0 POLYGON ((540030 720000, 540040 720000, 540040 720010, 540030 720010, 540030 720000))
4 5 Ojo RESI. 100.0 POLYGON ((540040 720000, 540050 720000, 540050 720010, 540040 720010, 540040 720000))
5 6 Ikeja Residential 100.0 POLYGON ((540050 720000, 540060 720000, 540060 720010, 540050 720010, 540050 720000))
6 7 Surulere RESIDENTIAL 100.0 POLYGON ((540060 720000, 540070 720000, 540070 720010, 540060 720010, 540060 720000))
7 8 Yaba residential 100.0 POLYGON ((540070 720000, 540080 720000, 540080 720010, 540070 720010, 540070 720000))
8 9 Apapa Resi. 100.0 POLYGON ((540080 720000, 540090 720000, 540090 720010, 540080 720010, 540080 720000))
9 10 Eti-Osa resi. 100.0 POLYGON ((540090 720000, 540100 720000, 540100 720010, 540090 720010, 540090 720000))
10 11 Ojo RESI. 100.0 POLYGON ((540100 720000, 540110 720000, 540110 720010, 540100 720010, 540100 720000))
11 12 Ikeja Residential 100.0 POLYGON ((540110 720000, 540120 720000, 540120 720010, 540110 720010, 540110 720000))
12 13 Surulere RESIDENTIAL 100.0 POLYGON ((540120 720000, 540130 720000, 540130 720010, 540120 720010, 540120 720000))
13 14 Yaba residential 100.0 POLYGON ((540130 720000, 540140 720000, 540140 720010, 540130 720010, 540130 720000))
14 15 Apapa Resi. 100.0 POLYGON ((540140 720000, 540150 720000, 540150 720010, 540140 720010, 540140 720000))
15 16 Eti-Osa resi. 100.0 POLYGON ((540150 720000, 540160 720000, 540160 720010, 540150 720010, 540150 720000))
16 17 Ojo RESI. 100.0 POLYGON ((540160 720000, 540170 720000, 540170 720010, 540160 720010, 540160 720000))
17 18 Ikeja Residential 100.0 POLYGON ((540170 720000, 540180 720000, 540180 720010, 540170 720010, 540170 720000))
18 19 Surulere RESIDENTIAL 100.0 POLYGON ((540180 720000, 540190 720000, 540190 720010, 540180 720010, 540180 720000))
19 20 Yaba residential 100.0 POLYGON ((540190 720000, 540200 720000, 540200 720010, 540190 720010, 540190 720000))
0:16
Let me understand the data better:
1. `zoning_class` has variations like "RESIDENTIAL", "residential", "Resi.", "resi.", "RESI."
2. There might be blank entries
3. Vertices are at integer coordinates but potentially with precision issues (tiny gaps/overlaps)

Let me explore more to understand the full extent of the data.
0:16
Bash
python3 -c "
import geopandas as gpd
df = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
print('Unique zoning_class values:')
for v in sorted(df['zoning_class'].unique()):
    print(f' [{repr(v)}]')
print()
print('Null/blank counts:')
print(' None:', df['zoning_class'].isna().sum())
print(' empty str:', (df['zoning_class'] == '').sum())
print(' whitespace only:', df['zoning_class'].str.strip().eq('').sum() if df['zoning_class'].notna().any() else 0)
"
Unique zoning_class values:
Traceback (most recent call last):
File "<string>", line 5, in <module>
for v in sorted(df['zoning_class'].unique()):
~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '<' not supported between instances of 'float' and 'str'
0:24
The error is because there are NaN values (float) mixed with strings. Let me fix this.
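As a minimal sketch of why the sort failed and one way around it: Python cannot order a `float` NaN against a `str`, so the sort needs a key that treats missing values separately. The sample values and the `sort_key` helper below are hypothetical, not taken from the dataset.

```python
import math

# Hypothetical sample mirroring the mix of strings and NaN in zoning_class.
values = ["Resi.", float("nan"), "AGRI.", "", "comm."]

def sort_key(v):
    # Sort missing values last; compare everything else as plain strings.
    is_missing = isinstance(v, float) and math.isnan(v)
    return (is_missing, "" if is_missing else v)

ordered = sorted(values, key=sort_key)
print([repr(v) for v in ordered])
```

Using `key=lambda x: str(x)`, as the next command does, also works; it simply sorts NaN under the string `'nan'` instead of placing it last.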
0:24
Bash
python3 -c "
import geopandas as gpd
df = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
print('Unique zoning_class values:')
for v in sorted(df['zoning_class'].unique(), key=lambda x: str(x)):
    print(f' [{repr(v)}]')
print()
print('Null/blank counts:')
print(' None/NaN:', df['zoning_class'].isna().sum())
print(' empty str:', (df['zoning_class'] == '').sum())
# Check whitespace
import pandas as pd
mask = df['zoning_class'].apply(lambda x: isinstance(x, str) and x.strip() == '')
print(' whitespace only (non-NaN str):', mask.sum())
"
Unique zoning_class values:
 ['']
 ['\t']
 [' ']
 ['AGRI.']
 ['AGRICULTURAL']
 ['Agri.']
 ['Agricultural']
 ['COMM.']
 ['COMMERCIAL']
 ['Comm.']
 ['Commercial']
 ['INDUS.']
 ['INDUSTRIAL']
 ['Indus.']
 ['Industrial']
 ['RESI.']
 ['RESIDENTIAL']
 ['Resi.']
 ['Residential']
 ['agri.']
 ['agricultural']
 ['comm.']
 ['commercial']
 ['indus.']
 ['industrial']
 [nan]
 ['resi.']
 ['residential']

Null/blank counts:
 None/NaN: 12
 empty str: 13
 whitespace only (non-NaN str): 38
0:47
Now I have a clear picture of the data:

1. **zoning_class inconsistencies**:
   - Residential: 'RESIDENTIAL', 'residential', 'Residential', 'RESI.', 'resi.', 'Resi.'
   - Commercial: 'COMMERCIAL', 'commercial', 'Commercial', 'COMM.', 'comm.', 'Comm.'
   - Industrial: 'INDUSTRIAL', 'industrial', 'Industrial', 'INDUS.', 'indus.', 'Indus.'
   - Agricultural: 'AGRICULTURAL', 'agricultural', 'Agricultural', 'AGRI.', 'agri.', 'Agri.'
   - Blank/whitespace: '', '\t', ' ', NaN. That is 50 entries to drop in total: 12 NaN plus 38 empty or whitespace-only strings (the 13 empty strings are already included in the 38, because `''.strip() == ''`).
2. **Vertex precision issues**: The first 20 rows show whole-number coordinates (e.g., 540000, 720000), so the tiny gaps and overlaps presumably come from vertices elsewhere in the layer that are slightly off. I need to snap them to a grid or apply some other precision fix.
3. **Goal**: Produce one merged geometry per canonical zoning class (Residential, Commercial, Industrial, Agricultural), with area_m2 computed.

Let me write a Python script to handle all of this.
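The three blank counts printed above overlap, because an empty string also passes the whitespace-only test. A small sketch with hypothetical sample values shows how to tally rows to drop without double counting; `is_blank` is an illustrative helper, not part of the session's script.

```python
import math

def is_blank(v):
    """True for None, NaN, empty, or whitespace-only values."""
    if v is None:
        return True
    if isinstance(v, float) and math.isnan(v):
        return True
    return isinstance(v, str) and v.strip() == ""

# Hypothetical sample; the real layer has far more rows.
sample = ["RESI.", "", "\t", " ", float("nan"), "comm.", None]

missing = sum(1 for v in sample if not isinstance(v, str))
empty = sum(1 for v in sample if v == "")
# Note: this count includes the empty strings as well.
whitespace_or_empty = sum(
    1 for v in sample if isinstance(v, str) and v.strip() == ""
)
to_drop = sum(1 for v in sample if is_blank(v))

print(missing, empty, whitespace_or_empty, to_drop)
```

The number of rows to drop is `missing + whitespace_or_empty`; adding `empty` on top would count those rows twice.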
0:47
Write
/work/solve.py
content · 66 lines · py
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely import set_precision, unary_union
from shapely.ops import unary_union as shapely_unary_union

# 1. Read data
df = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
print(f"Original rows: {len(df)}")

# 2. Drop rows with blank/whitespace-only/NaN zoning_class
# Convert all to string first for consistent checking
mask_blank = df['zoning_class'].apply(
    lambda x: pd.isna(x) or (isinstance(x, str) and x.strip() == '')
)
print(f"Blanks/NaN to drop: {mask_blank.sum()}")
df_clean = df[~mask_blank].copy()
print(f"After dropping blanks: {len(df_clean)}")

# 3. Normalize zoning_class to canonical form
def normalize_zoning(val):
    s = str(val).strip().lower()
    # Abbreviated forms
    if s in ('resi.', 'residential', 'resident'):
        return 'Residential'
    elif s in ('comm.', 'commercial', 'comm'):
        return 'Commercial'
    elif s in ('indus.', 'industrial', 'indus'):
        return 'Industrial'
    elif s in ('agri.', 'agricultural', 'agri'):
        return 'Agricultural'
    else:
        return s.title()  # fallback

df_clean['zoning_class'] = df_clean['zoning_class'].apply(normalize_zoning)

# Check what we got
print("\nCanonical classes after normalization:")
print(df_clean['zoning_class'].value_counts())

# 4. Fix vertex precision: snap coordinates to a 1mm grid (0.001 units, CRS is metric)
# The CRS is EPSG:26331 (metric). Issues are tiny gaps/overlaps, so snap to 0.001m grid.
snap_precision = 0.001
print(f"\nSnapping geometries to {snap_precision}m grid...")
df_clean['geometry'] = df_clean['geometry'].apply(
    lambda g: set_precision(g, grid_size=snap_precision) if not g.is_empty else g
)

# 5. Merge per zoning_class
print("\nMerging geometries per class...")
results = []
for cls, group in df_clean.groupby('zoning_class'):
    # Use unary_union to merge; this also resolves sliver gaps/overlaps after snap
    merged = unary_union(group.geometry.values)
    # merged could be a GeometryCollection if disjoint; but we want one polygon per class
    # If result is multi-part, that's fine; we'll keep as is and compute area
    area_m2 = merged.area
    results.append({'zoning_class': cls, 'area_m2': area_m2, 'geometry': merged})

result_gdf = gpd.GeoDataFrame(results, crs=df.crs)
print(f"Result has {len(result_gdf)} rows:")
print(result_gdf[['zoning_class', 'area_m2']].to_string())

# 6. Write output
result_gdf.to_file('/work/zoning_aggregated.gpkg', layer='zoning_aggregated', driver='GPKG')
print("\nWritten to /work/zoning_aggregated.gpkg")
Wrote /work/solve.py (2572 chars).
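The `else` fallback in `normalize_zoning` title-cases any unrecognized value, which would quietly create an extra class if the data held a spelling that was not anticipated. A stricter variant, sketched here as an alternative rather than what the script does, uses a lookup table and raises on unmapped values. The `CANONICAL` table and `normalize_zoning_strict` name are illustrative.

```python
# Lookup from lower-cased, stripped spellings to canonical TitleCase names.
CANONICAL = {
    "resi.": "Residential", "residential": "Residential",
    "comm.": "Commercial", "commercial": "Commercial",
    "indus.": "Industrial", "industrial": "Industrial",
    "agri.": "Agricultural", "agricultural": "Agricultural",
}

def normalize_zoning_strict(value):
    """Return the canonical class, or raise if the spelling is unknown."""
    key = value.strip().lower()
    if key not in CANONICAL:
        raise ValueError(f"unmapped zoning_class: {value!r}")
    return CANONICAL[key]

print(normalize_zoning_strict(" RESI. "))
print(normalize_zoning_strict("agricultural"))
```

Failing loudly on an unknown spelling keeps the output limited to the four expected classes; whether that is preferable to a permissive fallback depends on how the portal treats unexpected values.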
0:53
Let me run the script.
0:53
Bash
cd /work && python3 solve.py
Original rows: 10080
Blanks/NaN to drop: 50
After dropping blanks: 10030

Canonical classes after normalization:
zoning_class
Residential     2508
Commercial      2508
Industrial      2507
Agricultural    2507
Name: count, dtype: int64

Snapping geometries to 0.001m grid...

Merging geometries per class...
Result has 4 rows:
   zoning_class   area_m2
0  Agricultural  250000.0
1    Commercial  250000.0
2    Industrial  250000.0
3   Residential  250000.0

Written to /work/zoning_aggregated.gpkg
1:03
The script ran successfully. Let me verify the output file to make sure it looks correct.
1:03
Bash
python3 -c "
import geopandas as gpd
df = gpd.read_file('/work/zoning_aggregated.gpkg')
print('Columns:', df.columns.tolist())
print('CRS:', df.crs)
print('Shape:', df.shape)
for _, row in df.iterrows():
    geom_type = row['geometry'].geom_type
    print(f\" {row['zoning_class']:15s} area={row['area_m2']:10.2f} m² type={geom_type}\")
"
Columns: ['zoning_class', 'area_m2', 'geometry']
CRS: EPSG:26331
Shape: (4, 3)
 Agricultural    area= 250000.00 m² type=Polygon
 Commercial      area= 250000.00 m² type=Polygon
 Industrial      area= 250000.00 m² type=Polygon
 Residential     area= 250000.00 m² type=Polygon
1:14
The result looks correct. To recap what was done:
1. The blank/whitespace/NaN rows (50 total) were dropped.
2. The zoning_class values were normalized to TitleCase canonical forms: Residential, Commercial, Industrial, Agricultural.
3. Geometries were snapped to a 0.001m grid to fix vertex precision issues.
4. Geometries were merged per canonical class using unary_union.
5. Output has zoning_class and area_m2 columns.
6. CRS is preserved as EPSG:26331.

The output file is at `/work/zoning_aggregated.gpkg`. Everything looks good.
1:14