dc-l2-cairo-invalid-dedup
0:00
cairo_parcels_legacy.geojson is a parcel snapshot stitched together from three legacy provincial systems before unification, and I need to turn it into the foundation for our new central registry. Can you clean it up so we end up with one canonical record per parcel? When duplicates exist, keep the record with the earliest record_seq, and discard any polygon fragments smaller than 1 m².
Please write the result to parcels_canonical.geoparquet including these columns: parcel_id (the join key), parcel_class, district, and area_m2. Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers.
0:00
System info
{
  "adapter": "openrouter",
  "model": "google/gemma-4-26b-a4b-it",
  "base_url": "https://openrouter.ai/api/v1"
}
0:01
Bash
ls -F
cairo_parcels_legacy.geojson
0:04
Write
/work/inspect_data.py
content · 23 lines · py
import geopandas as gpd
# Load the data to inspect it
gdf = gpd.read_file('cairo_parcels_legacy.geojson')
print("Columns:", gdf.columns.tolist())
print("Shape:", gdf.shape)
print("CRS:", gdf.crs)
print("\nFirst few rows:")
print(gdf.head())
# Check for null geometries
print("\nNull geometries count:", gdf.geometry.isna().sum())
# Check for empty geometries
print("Empty geometries count:", gdf.geometry.is_empty.sum())
# Check for invalid geometries
print("Invalid geometries count:", not gdf.geometry.is_valid.all())
# Check for duplicates based on parcel_id
print("\nDuplicate parcel_id count:", gdf['parcel_id'].duplicated().sum())
Wrote /work/inspect_data.py (643 chars).
0:04
Bash
python inspect_data.py
Columns: ['parcel_id', 'record_seq', 'parcel_class', 'district', 'area_m2', 'geometry']
Shape: (290, 6)
CRS: EPSG:22992

First few rows:
   parcel_id  ...                                           geometry
0          1  ...  POLYGON ((640000 815000, 640030 815000, 640030...
1          2  ...  POLYGON ((640050 815000, 640080 815000, 640080...
2          3  ...  POLYGON ((640100 815000, 640130 815000, 640130...
3          4  ...  POLYGON ((640150 815000, 640180 815000, 640180...
4          5  ...  POLYGON ((640200 815000, 640230 815000, 640230...

[5 rows x 6 columns]

Null geometries count: 0
Empty geometries count: 0
Invalid geometries count: True

Duplicate parcel_id count: 0
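Note that the "Invalid geometries count" line above prints a boolean (`True`), because the inspection script evaluates `not gdf.geometry.is_valid.all()` rather than counting rows. A minimal sketch of how an actual count and a per-row reason could be obtained is shown here on toy data; the two polygons and the `toy` frame are made up for illustration and are not taken from cairo_parcels_legacy.geojson.

```python
import geopandas as gpd
from shapely.geometry import Polygon
from shapely.validation import explain_validity

# Toy data (hypothetical): one valid square and one self-intersecting "bowtie".
square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
toy = gpd.GeoDataFrame({"parcel_id": [1, 2]}, geometry=[square, bowtie])

# Select the invalid rows, then count them instead of printing a boolean.
invalid = toy[~toy.geometry.is_valid]
invalid_count = len(invalid)
print("Invalid geometries count:", invalid_count)

# Report why each invalid geometry fails validation.
for pid, geom in zip(invalid["parcel_id"], invalid.geometry):
    print(pid, explain_validity(geom))
```

Run against the real file, the same pattern would show how many of the 290 rows are invalid and what kind of defect each has, which helps when choosing a repair strategy.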
0:12
Write
/work/solve.py
content · 69 lines · py
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

def solve():
    # Load the data
    gdf = gpd.read_file('cairo_parcels_legacy.geojson')

    # 1. Fix invalid geometries
    # buffer(0) is a common trick to fix minor topological errors
    gdf['geometry'] = gdf.geometry.buffer(0)

    # 2. Handle duplicates (keep earliest record_seq)
    # Sort by parcel_id and record_seq, then drop duplicates of parcel_id
    gdf = gdf.sort_values(['parcel_id', 'record_seq'])
    gdf = gdf.drop_duplicates(subset=['parcel_id'], keep='first')

    # 3. Discard polygon fragments smaller than 1 m2
    # We check the area of the geometry
    gdf = gdf[gdf.geometry.area >= 1.0]

    # 4. Ensure every feature is a MultiPolygon
    def ensure_multipolygon(geom):
        if geom.is_empty:
            return None
        if geom.geom_type == 'Polygon':
            return MultiPolygon([geom])
        elif geom.geom_type == 'MultiPolygon':
            return geom
        else:
            # If it's something else (like a GeometryCollection), try to extract Polygons
            if hasattr(geom, 'geoms'):
                polys = [g for g in geom.geoms if g.geom_type in ['Polygon', 'MultiPolygon']]
                if not polys:
                    return None
                # Union them or just wrap them
                from shapely.ops import unary_union
                return unary_union(polys)
            return None

    gdf['geometry'] = gdf.geometry.apply(ensure_multipolygon)

    # Drop rows that resulted in None geometries
    gdf = gdf.dropna(subset=['geometry'])
    gdf = gdf[~gdf.geometry.is_empty]

    # 5. Remove duplicate geometries
    # Note: duplicate geometries might exist even if parcel_ids are unique
    gdf = gdf.drop_duplicates(subset=['geometry'])

    # 6. Final cleanup of columns and selection
    # Columns to include: parcel_id, parcel_class, district, and area_m2
    # Requirement: "Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers."
    # We already calculated area_m2 in the input, but let's make sure it's consistent with the actual geometry
    gdf['area_m2'] = gdf.geometry.area
    final_gdf = gdf[['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']]

    # Final check for invalid geometries after manipulations
    final_gdf = final_gdf[final_gdf.geometry.is_valid]

    # Write to GeoParquet
    final_gdf.to_parquet('parcels_canonical.geoparquet')
    print(f"Final count: {len(final_gdf)}")
    print(f"Columns: {final_gdf.columns.tolist()}")

if __name__ == "__main__":
    solve()
Wrote /work/solve.py (2713 chars).
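The script repairs invalid geometries with `buffer(0)`. That trick can silently discard part of a self-intersecting ring, so it may be worth comparing it with `shapely.validation.make_valid`, which aims to keep all of the covered area. The sketch below uses a made-up bowtie polygon, not a parcel from the dataset, and only illustrates how the two repairs could be compared.

```python
from shapely.geometry import Polygon
from shapely.validation import make_valid

# Hypothetical self-intersecting ring: two unit-area triangles meeting at (1, 1).
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])

buffered = bowtie.buffer(0)    # the repair used in solve.py
repaired = make_valid(bowtie)  # alternative repair that keeps both lobes

# Compare how much area each repair retains.
print("buffer(0):  ", buffered.geom_type, buffered.area)
print("make_valid: ", repaired.geom_type, repaired.area)
```

If the two areas differ for a real parcel, the `buffer(0)` result has lost part of the original footprint, and the recomputed `area_m2` would be smaller than intended.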
0:13
Bash
python solve.py
Final count: 210
Columns: ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
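The inspection run reported 0 duplicate `parcel_id` values, so the `parcel_id` de-duplication in the script removes nothing, and the reduction comes from the other steps. The geometry-based `drop_duplicates` runs on a frame sorted by `parcel_id` first, so the surviving row is the one with the lowest `parcel_id`, not necessarily the earliest `record_seq`. A minimal sketch of a geometry-keyed de-duplication that honors `record_seq` follows; the toy polygons, the `geom_key` column, and the use of normalized WKB as a comparison key are assumptions for illustration.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical toy data: parcels 1 and 2 are the same square with a different
# starting vertex; parcel 3 is a distinct square.
a = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
a_rotated = Polygon([(10, 0), (10, 10), (0, 10), (0, 0)])
b = Polygon([(20, 0), (30, 0), (30, 10), (20, 10)])
toy = gpd.GeoDataFrame(
    {"parcel_id": [1, 2, 3], "record_seq": [7, 3, 5]},
    geometry=[a, a_rotated, b],
)

# normalize() puts rings into a canonical vertex order, so equal shapes
# produce identical WKB bytes regardless of where the ring starts.
toy["geom_key"] = toy.geometry.apply(lambda g: g.normalize().wkb)

# Sort by record_seq so keep="first" retains the earliest record per shape.
deduped = (
    toy.sort_values("record_seq")
    .drop_duplicates(subset="geom_key", keep="first")
    .drop(columns="geom_key")
    .sort_values("parcel_id")
)
kept_ids = deduped["parcel_id"].tolist()
print(kept_ids)
```

Here parcel 2 survives over parcel 1 because its `record_seq` (3) is earlier than 7. Near-duplicates that differ by small coordinate noise would still need a tolerance-based comparison, which this sketch does not attempt.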
0:21
Write
/work/solve.py
content · 73 lines · py
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

def solve():
    # Load the data
    gdf = gpd.read_file('cairo_parcels_legacy.geojson')

    # 1. Fix invalid geometries (buffer(0) is a good start)
    gdf['geometry'] = gdf.geometry.buffer(0)

    # 2. Handle duplicates (keep earliest record_seq)
    # Sort by parcel_id and record_seq, then drop duplicates of parcel_id
    gdf = gdf.sort_values(['parcel_id', 'record_seq'])
    gdf = gdf.drop_duplicates(subset=['parcel_id'], keep='first')

    # 3. Discard polygon fragments smaller than 1 m2
    # (This helps with artifact slivers)
    gdf = gdf[gdf.geometry.area >= 1.0]

    # 4. Ensure every feature is a MultiPolygon
    def ensure_multipolygon(geom):
        if geom.is_empty:
            return None
        if geom.geom_type == 'Polygon':
            return MultiPolygon([geom])
        elif geom.geom_type == 'MultiPolygon':
            return geom
        else:
            # If it's a GeometryCollection or other, try to extract Polygons
            if hasattr(geom, 'geoms'):
                polys = [g for g in geom.geoms if g.geom_type in ['Polygon', 'MultiPolygon']]
                if not polys:
                    return None
                # Combine them into a single MultiPolygon
                from shapely.ops import unary_union
                merged = unary_union(polys)
                if merged.geom_type == 'Polygon':
                    return MultiPolygon([merged])
                return merged
            return None

    gdf['geometry'] = gdf.geometry.apply(ensure_multipolygon)

    # Drop rows that resulted in None geometries
    gdf = gdf.dropna(subset=['geometry'])
    gdf = gdf[~gdf.geometry.is_empty]

    # 5. Remove duplicate geometries (if any remain)
    # We use a small tolerance for geometry comparison if necessary,
    # but standard drop_duplicates on geometry is usually fine for exact matches.
    gdf = gdf.drop_duplicates(subset=['geometry'])

    # 6. Final selection and column formatting
    # Requirement: parcel_id, parcel_class, district, and area_m2
    # We should recalculate area_m2 to be sure it reflects the cleaned geometry.
    gdf['area_m2'] = gdf.geometry.area
    final_gdf = gdf[['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']]

    # Final validation check
    final_gdf = final_gdf[final_gdf.geometry.is_valid]
    final_gdf = final_gdf[~final_gdf.geometry.is_empty]

    # Write to GeoParquet
    final_gdf.to_parquet('parcels_canonical.geoparquet')
    print(f"Final count: {len(final_gdf)}")
    print(f"Columns: {final_gdf.columns.tolist()}")
    print("Geometry types:", final_gdf.geometry.geom_type.unique())

if __name__ == "__main__":
    solve()
Wrote /work/solve.py (2776 chars).
0:22
Bash
python solve.py
Final count: 210
Columns: ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
Geometry types: <ArrowStringArray>
['MultiPolygon']
Length: 1, dtype: str
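Step 3 of the script filters on the area of each whole feature, so a sub-1 m² fragment that sits inside a larger MultiPolygon (for example one produced by the geometry repair) would pass through. A minimal sketch of a part-level filter is given below; `drop_small_parts` and `MIN_AREA_M2` are hypothetical names, and the sketch assumes inputs are already Polygon or MultiPolygon in a metric CRS such as the file's EPSG:22992.

```python
from shapely.geometry import MultiPolygon, Polygon

MIN_AREA_M2 = 1.0  # threshold from the task statement

def drop_small_parts(geom, min_area=MIN_AREA_M2):
    """Return a MultiPolygon of the parts with area >= min_area, or None if none remain."""
    # Treat a lone Polygon as a one-part collection.
    parts = [geom] if geom.geom_type == "Polygon" else list(geom.geoms)
    kept = [p for p in parts if p.geom_type == "Polygon" and p.area >= min_area]
    return MultiPolygon(kept) if kept else None

# Hypothetical feature: a 600 m2 parcel plus a detached 0.25 m2 sliver.
parcel = Polygon([(0, 0), (30, 0), (30, 20), (0, 20)])
sliver = Polygon([(40, 0), (40.5, 0), (40.5, 0.5), (40, 0.5)])
cleaned = drop_small_parts(MultiPolygon([parcel, sliver]))
print(cleaned.geom_type, len(cleaned.geoms), cleaned.area)
```

Applied with `gdf.geometry.apply(drop_small_parts)` before the null-geometry drop, this would also take care of the MultiPolygon conversion in one pass.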
0:24
Bash
# Check if the file exists and has size
ls -lh parcels_canonical.geoparquet
-rw-r--r-- 1 agent agent 21K Jun 15 19:05 parcels_canonical.geoparquet
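The `ls` check confirms only that a file was written. A sketch of a content-level check against the stated requirements could look like the following; `check_canonical` is a hypothetical helper, and the toy frame stands in for the real output, which could be loaded with `gpd.read_parquet('parcels_canonical.geoparquet')`.

```python
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

def check_canonical(gdf, min_area_m2=1.0):
    """Count violations of each output requirement; all zeros means the frame passes."""
    geom = gdf.geometry
    present = geom[~geom.isna()]
    # Normalized WKB as an exact-shape key for duplicate detection.
    keys = present.apply(lambda g: g.normalize().wkb)
    part_areas = present.explode(index_parts=False).area
    return {
        "null_or_empty": int(geom.isna().sum() + present.is_empty.sum()),
        "invalid": int((~present.is_valid).sum()),
        "not_multipolygon": int((present.geom_type != "MultiPolygon").sum()),
        "duplicate_geometries": int(keys.duplicated().sum()),
        "parts_below_min_area": int((part_areas < min_area_m2).sum()),
        "duplicate_parcel_id": int(gdf["parcel_id"].duplicated().sum()),
    }

# Hypothetical stand-in for the written file: two distinct 600 m2 parcels.
p1 = MultiPolygon([Polygon([(0, 0), (30, 0), (30, 20), (0, 20)])])
p2 = MultiPolygon([Polygon([(50, 0), (80, 0), (80, 20), (50, 20)])])
toy = gpd.GeoDataFrame({"parcel_id": [1, 2]}, geometry=[p1, p2])

report = check_canonical(toy)
print(report)
```

Any non-zero entry in the report points at the requirement that the output still violates.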
0:31