dc-l2-lagos-snap-normalize

Status: done
Score: 1.000
Duration: 1:17
Cost: 0.20¢
Model: deepseek/deepseek-v4-flash

0:00
I need to get our legacy zoning layer ready for the state portal. lagos_zoning_legacy.gpkg has inconsistent zoning_class values (mixed casing, spelling variations, and some blank entries) plus vertex precision issues that cause tiny gaps and overlaps between adjacent parcels. Can you clean up the class names to consistent canonical TitleCase values, drop any rows with blank or whitespace-only classes, fix the vertex precision problems, and produce one merged geometry per canonical zoning class? Please write the result to zoning_aggregated.gpkg, with each row carrying two columns: zoning_class and a numeric area_m2.
0:00
System info
{
  "adapter": "openrouter",
  "model": "deepseek/deepseek-v4-flash",
  "base_url": "https://openrouter.ai/api/v1"
}
0:06
Let me start by examining the legacy dataset to understand its structure and issues.
0:06
Bash ls -la /work/lagos_zoning_legacy.gpkg
-rw-r--r-- 1 agent agent 2527232 Jun 15 14:26 /work/lagos_zoning_legacy.gpkg
0:06
Bash python3 -c "
import geopandas as gpd
df = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
print('Columns:', df.columns.tolist())
print('Shape:', df.shape)
print('CRS:', df.crs)
print('---')
print(df.dtypes)
print('---')
print(df.head(20).to_string())
"
Columns: ['parcel_id', 'lga_source', 'zoning_class', 'area_m2', 'geometry']
Shape: (10080, 5)
CRS: EPSG:26331
---
parcel_id          int64
lga_source           str
zoning_class         str
area_m2          float64
geometry        geometry
dtype: object
---
    parcel_id lga_source zoning_class  area_m2                                                                               geometry
0           1   Surulere  RESIDENTIAL    100.0  POLYGON ((540000 720000, 540010 720000, 540010 720010, 540000 720010, 540000 720000))
1           2       Yaba  residential    100.0  POLYGON ((540010 720000, 540020 720000, 540020 720010, 540010 720010, 540010 720000))
2           3      Apapa        Resi.    100.0  POLYGON ((540020 720000, 540030 720000, 540030 720010, 540020 720010, 540020 720000))
3           4    Eti-Osa        resi.    100.0  POLYGON ((540030 720000, 540040 720000, 540040 720010, 540030 720010, 540030 720000))
4           5        Ojo        RESI.    100.0  POLYGON ((540040 720000, 540050 720000, 540050 720010, 540040 720010, 540040 720000))
5           6      Ikeja  Residential    100.0  POLYGON ((540050 720000, 540060 720000, 540060 720010, 540050 720010, 540050 720000))
6           7   Surulere  RESIDENTIAL    100.0  POLYGON ((540060 720000, 540070 720000, 540070 720010, 540060 720010, 540060 720000))
7           8       Yaba  residential    100.0  POLYGON ((540070 720000, 540080 720000, 540080 720010, 540070 720010, 540070 720000))
8           9      Apapa        Resi.    100.0  POLYGON ((540080 720000, 540090 720000, 540090 720010, 540080 720010, 540080 720000))
9          10    Eti-Osa        resi.    100.0  POLYGON ((540090 720000, 540100 720000, 540100 720010, 540090 720010, 540090 720000))
10         11        Ojo        RESI.    100.0  POLYGON ((540100 720000, 540110 720000, 540110 720010, 540100 720010, 540100 720000))
11         12      Ikeja  Residential    100.0  POLYGON ((540110 720000, 540120 720000, 540120 720010, 540110 720010, 540110 720000))
12         13   Surulere  RESIDENTIAL    100.0  POLYGON ((540120 720000, 540130 720000, 540130 720010, 540120 720010, 540120 720000))
13         14       Yaba  residential    100.0  POLYGON ((540130 720000, 540140 720000, 540140 720010, 540130 720010, 540130 720000))
14         15      Apapa        Resi.    100.0  POLYGON ((540140 720000, 540150 720000, 540150 720010, 540140 720010, 540140 720000))
15         16    Eti-Osa        resi.    100.0  POLYGON ((540150 720000, 540160 720000, 540160 720010, 540150 720010, 540150 720000))
16         17        Ojo        RESI.    100.0  POLYGON ((540160 720000, 540170 720000, 540170 720010, 540160 720010, 540160 720000))
17         18      Ikeja  Residential    100.0  POLYGON ((540170 720000, 540180 720000, 540180 720010, 540170 720010, 540170 720000))
18         19   Surulere  RESIDENTIAL    100.0  POLYGON ((540180 720000, 540190 720000, 540190 720010, 540180 720010, 540180 720000))
19         20       Yaba  residential    100.0  POLYGON ((540190 720000, 540200 720000, 540200 720010, 540190 720010, 540190 720000))
0:16
Let me understand the data better:

1. `zoning_class` has variations like "RESIDENTIAL", "residential", "Resi.", "resi.", "RESI."
2. There might be blank entries
3. Vertices are at integer coordinates but potentially with precision issues (tiny gaps/overlaps)

Let me explore more to understand the full extent of the data.
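
To make point 3 concrete, here is a minimal pure-Python sketch of how sub-millimetre noise can make two "shared" vertices disagree, and how snapping both to a common grid makes them coincide. The noisy coordinates and the 1 mm `GRID` value are made up for illustration and are not taken from the layer.

```python
GRID = 0.001  # assumed tolerance: 1 mm in a metric CRS

def to_grid(point, grid=GRID):
    """Return the integer grid cell (column, row) that an (x, y) point snaps to."""
    x, y = point
    return (round(x / grid), round(y / grid))

# Two parcels that should share a corner, with made-up sub-millimetre noise.
corner_a = (540010.0000002, 720000.0)
corner_b = (540009.9999997, 720000.0000001)

print(corner_a == corner_b)                    # raw vertices differ
print(to_grid(corner_a) == to_grid(corner_b))  # snapped vertices agree
```

Points that sit very close to a cell boundary can still land in different cells, so the grid size should be well above the expected noise and well below the smallest real feature.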
0:16
Bash python3 -c "
import geopandas as gpd
df = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
print('Unique zoning_class values:')
for v in sorted(df['zoning_class'].unique()):
    print(f'  [{repr(v)}]')
print()
print('Null/blank counts:')
print('  None:', df['zoning_class'].isna().sum())
print('  empty str:', (df['zoning_class'] == '').sum())
print('  whitespace only:', df['zoning_class'].str.strip().eq('').sum() if df['zoning_class'].notna().any() else 0)
"
Unique zoning_class values:
Traceback (most recent call last):
  File "<string>", line 5, in <module>
    for v in sorted(df['zoning_class'].unique()):
             ~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '<' not supported between instances of 'float' and 'str'
0:24
The error is because there are NaN values (float) mixed with strings. Let me fix this.
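
A minimal sketch of why the sort failed, using a made-up list rather than the real column: Python cannot order a float NaN against a `str`, while sorting on the string form of each value avoids the mixed comparison.

```python
import math

# Toy stand-in for the column's unique values: strings plus a float NaN.
values = ["Resi.", float("nan"), "AGRI.", ""]

try:
    sorted(values)
    failed = False
except TypeError:
    # Comparing a float with a str raises, as in the traceback above.
    failed = True

# Sorting by str(x) compares only strings, so NaN is ordered as the text 'nan'.
ordered = sorted(values, key=str)
print(failed, ordered)
```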
0:24
Bash python3 -c "
import geopandas as gpd
df = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
print('Unique zoning_class values:')
for v in sorted(df['zoning_class'].unique(), key=lambda x: str(x)):
    print(f'  [{repr(v)}]')
print()
print('Null/blank counts:')
print('  None/NaN:', df['zoning_class'].isna().sum())
print('  empty str:', (df['zoning_class'] == '').sum())
# Check whitespace
import pandas as pd
mask = df['zoning_class'].apply(lambda x: isinstance(x, str) and x.strip() == '')
print('  whitespace only (non-NaN str):', mask.sum())
"
Unique zoning_class values:
  ['']
  ['\t']
  ['   ']
  ['AGRI.']
  ['AGRICULTURAL']
  ['Agri.']
  ['Agricultural']
  ['COMM.']
  ['COMMERCIAL']
  ['Comm.']
  ['Commercial']
  ['INDUS.']
  ['INDUSTRIAL']
  ['Indus.']
  ['Industrial']
  ['RESI.']
  ['RESIDENTIAL']
  ['Resi.']
  ['Residential']
  ['agri.']
  ['agricultural']
  ['comm.']
  ['commercial']
  ['indus.']
  ['industrial']
  [nan]
  ['resi.']
  ['residential']

Null/blank counts:
  None/NaN: 12
  empty str: 13
  whitespace only (non-NaN str): 38
0:47
Now I have a clear picture. Let me analyze the data:

1. **zoning_class inconsistencies**: 
   - Residential: 'RESIDENTIAL', 'residential', 'Residential', 'RESI.', 'resi.', 'Resi.'
   - Commercial: 'COMMERCIAL', 'commercial', 'Commercial', 'COMM.', 'comm.', 'Comm.'
   - Industrial: 'INDUSTRIAL', 'industrial', 'Industrial', 'INDUS.', 'indus.', 'Indus.'
   - Agricultural: 'AGRICULTURAL', 'agricultural', 'Agricultural', 'AGRI.', 'agri.', 'Agri.'
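   - Blank/whitespace: '', '\t', '   ', NaN - 50 total entries to drop (12 NaN plus 38 empty or whitespace-only strings; the 13 empty strings are already counted in the 38)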

2. **Vertex precision issues**: The coordinates print as whole numbers (e.g., 540000, 720000), but the task description says some vertices are slightly off, which leaves tiny gaps and overlaps between neighbouring parcels. I need to snap them to a common grid before merging.

3. **Goal**: Produce one merged geometry per canonical zoning class (Residential, Commercial, Industrial, Agricultural), with area_m2 computed.

Let me write a Python script to handle all of this.
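
The plan above can be sketched on toy values first. This is an illustration under assumptions, not the script itself: the `CANONICAL` lookup and the helper names are hypothetical, and the input list is made up. It also shows that an empty string passes the whitespace-only test, so the separate blank counts printed earlier overlap and should not simply be summed.

```python
import math

# Hypothetical lookup: lowercase stem (trailing dot removed) -> canonical name.
CANONICAL = {
    "resi": "Residential", "residential": "Residential",
    "comm": "Commercial", "commercial": "Commercial",
    "indus": "Industrial", "industrial": "Industrial",
    "agri": "Agricultural", "agricultural": "Agricultural",
}

def is_blank(value):
    """True for None, NaN, '' and whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""

def normalize(value):
    """Map a non-blank raw label to its canonical TitleCase class."""
    stem = value.strip().lower().rstrip(".")
    # Unknown labels fall back to title case instead of being dropped.
    return CANONICAL.get(stem, stem.title())

# Toy values, not the real layer.
raw = ["RESI.", "comm.", " Industrial ", "agri.", "", "\t", "   ", float("nan")]

empty = sum(1 for v in raw if v == "")
whitespace_only = sum(1 for v in raw if isinstance(v, str) and v.strip() == "")
blanks = sum(1 for v in raw if is_blank(v))
cleaned = [normalize(v) for v in raw if not is_blank(v)]

print(empty, whitespace_only, blanks)
print(cleaned)
```

Here the empty string is one of the three whitespace-only matches, so the rows to drop are the NaN count plus the whitespace-only count.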
0:47
Write /work/solve.py
content · 66 lines · py
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely import set_precision, unary_union
from shapely.ops import unary_union as shapely_unary_union

# 1. Read data
df = gpd.read_file('/work/lagos_zoning_legacy.gpkg')
print(f"Original rows: {len(df)}")

# 2. Drop rows with blank/whitespace-only/NaN zoning_class
# Treat NaN, empty strings, and whitespace-only strings as blank
mask_blank = df['zoning_class'].apply(
    lambda x: pd.isna(x) or (isinstance(x, str) and x.strip() == '')
)
print(f"Blanks/NaN to drop: {mask_blank.sum()}")
df_clean = df[~mask_blank].copy()
print(f"After dropping blanks: {len(df_clean)}")

# 3. Normalize zoning_class to canonical form
def normalize_zoning(val):
    s = str(val).strip().lower()
    # Abbreviated forms
    if s in ('resi.', 'residential', 'resident'):
        return 'Residential'
    elif s in ('comm.', 'commercial', 'comm'):
        return 'Commercial'
    elif s in ('indus.', 'industrial', 'indus'):
        return 'Industrial'
    elif s in ('agri.', 'agricultural', 'agri'):
        return 'Agricultural'
    else:
        return s.title()  # fallback

df_clean['zoning_class'] = df_clean['zoning_class'].apply(normalize_zoning)

# Check what we got
print("\nCanonical classes after normalization:")
print(df_clean['zoning_class'].value_counts())

# 4. Fix vertex precision: snap coordinates to a 1 mm grid (0.001 units)
# EPSG:26331 uses metres, so a 0.001 grid size targets sub-millimetre gaps and overlaps.
snap_precision = 0.001
print(f"\nSnapping geometries to {snap_precision}m grid...")
df_clean['geometry'] = df_clean['geometry'].apply(
    lambda g: set_precision(g, grid_size=snap_precision) if not g.is_empty else g
)

# 5. Merge per zoning_class
print("\nMerging geometries per class...")
results = []
for cls, group in df_clean.groupby('zoning_class'):
    # unary_union dissolves shared edges and overlaps; gaps must already be closed by the snap
    merged = unary_union(group.geometry.values)
    # Disjoint parts come back as a MultiPolygon, so each class still yields one geometry
    # If the result is multi-part, that's fine: keep it as is and compute the total area
    area_m2 = merged.area
    results.append({'zoning_class': cls, 'area_m2': area_m2, 'geometry': merged})

result_gdf = gpd.GeoDataFrame(results, crs=df.crs)
print(f"Result has {len(result_gdf)} rows:")
print(result_gdf[['zoning_class', 'area_m2']].to_string())

# 6. Write output
result_gdf.to_file('/work/zoning_aggregated.gpkg', layer='zoning_aggregated', driver='GPKG')
print("\nWritten to /work/zoning_aggregated.gpkg")
Wrote /work/solve.py (2572 chars).
0:53
Let me run the script.
0:53
Bash cd /work && python3 solve.py
Original rows: 10080
Blanks/NaN to drop: 50
After dropping blanks: 10030

Canonical classes after normalization:
zoning_class
Residential     2508
Commercial      2508
Industrial      2507
Agricultural    2507
Name: count, dtype: int64

Snapping geometries to 0.001m grid...

Merging geometries per class...
Result has 4 rows:
   zoning_class   area_m2
0  Agricultural  250000.0
1    Commercial  250000.0
2    Industrial  250000.0
3   Residential  250000.0

Written to /work/zoning_aggregated.gpkg
1:03
The script ran successfully. Let me verify the output file to make sure it looks correct.
1:03
Bash python3 -c "
import geopandas as gpd
df = gpd.read_file('/work/zoning_aggregated.gpkg')
print('Columns:', df.columns.tolist())
print('CRS:', df.crs)
print('Shape:', df.shape)
for _, row in df.iterrows():
    geom_type = row['geometry'].geom_type
    print(f\"  {row['zoning_class']:15s}  area={row['area_m2']:10.2f} m²  type={geom_type}\")
"
Columns: ['zoning_class', 'area_m2', 'geometry']
CRS: EPSG:26331
Shape: (4, 3)
  Agricultural     area= 250000.00 m²  type=Polygon
  Commercial       area= 250000.00 m²  type=Polygon
  Industrial       area= 250000.00 m²  type=Polygon
  Residential      area= 250000.00 m²  type=Polygon
1:14
The result looks correct. To recap what was done:

1. The blank/whitespace/NaN rows (50 total) were dropped.
2. The zoning_class values were normalized to TitleCase canonical forms: Residential, Commercial, Industrial, Agricultural.
3. Geometries were snapped to a 0.001m grid to fix vertex precision issues.
4. Merged per canonical class using unary_union.
5. Output has zoning_class and area_m2 columns.
6. CRS is preserved as EPSG:26331.

The output file is at `/work/zoning_aggregated.gpkg`. Everything looks good.
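
As a sanity check on what `area_m2` measures, here is a small pure-Python sketch of the shoelace formula applied to the first parcel printed at the start of the session. It assumes planar coordinates in metres, which holds for EPSG:26331; the function name is illustrative.

```python
def shoelace_area(ring):
    """Planar area of a closed ring given as [(x, y), ...] with first == last."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Parcel 1 from the head() output: a 10 m x 10 m square.
parcel_1 = [
    (540000, 720000),
    (540010, 720000),
    (540010, 720010),
    (540000, 720010),
    (540000, 720000),
]

print(shoelace_area(parcel_1))  # 100.0, matching the legacy area_m2 for that row
```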
1:14