dc-l2-cairo-invalid-dedup
Map (reference ← swipe → agent)
0:00
cairo_parcels_legacy.geojson is a parcel snapshot stitched together from three legacy provincial systems before unification, and I need to turn it into the foundation for our new central registry. Can you clean it up so we end up with one canonical record per parcel? When duplicates exist, keep the record with the earliest record_seq, and discard any polygon fragments smaller than 1 m².
Please write the result to parcels_canonical.geoparquet including these columns: parcel_id (the join key), parcel_class, district, and area_m2. Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers.
0:00
System info
{
"adapter": "openrouter",
"model": "google/gemma-4-26b-a4b-it",
"base_url": "https://openrouter.ai/api/v1"
}
0:01
Bash
ls -l /work/cairo_parcels_legacy.geojson
-rw-r--r-- 1 agent agent 223991 Jun 16 12:14 /work/cairo_parcels_legacy.geojson
0:03
Write
/work/inspect_data.py
content · 14 lines · py
import geopandas as gpd
# Load the data
gdf = gpd.read_file('/work/cairo_parcels_legacy.geojson')
# Inspect columns and first few rows
print("Columns:", gdf.columns.tolist())
print("\nFirst few rows:")
print(gdf.head())
print("\nInfo:")
print(gdf.info())
print("\nGeometry types:")
print(gdf.geometry.type.value_counts())
Wrote /work/inspect_data.py (324 chars).
0:04
Bash
python /work/inspect_data.py
Columns: ['parcel_id', 'record_seq', 'parcel_class', 'district', 'area_m2', 'geometry'] First few rows: parcel_id ... geometry 0 1 ... POLYGON ((640000 815000, 640030 815000, 640030... 1 2 ... POLYGON ((640050 815000, 640080 815000, 640080... 2 3 ... POLYGON ((640100 815000, 640130 815000, 640130... 3 4 ... POLYGON ((640150 815000, 640180 815000, 640180... 4 5 ... POLYGON ((640200 815000, 640230 815000, 640230... [5 rows x 6 columns] Info: <class 'geopandas.geodataframe.GeoDataFrame'> RangeIndex: 290 entries, 0 to 289 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 parcel_id 290 non-null int32 1 record_seq 290 non-null int32 2 parcel_class 290 non-null str 3 district 290 non-null str 4 area_m2 290 non-null float64 5 geometry 290 non-null geometry dtypes: float64(1), geometry(1), int32(2), str(2) memory usage: 17.8 KB None Geometry types: Polygon 260 MultiPolygon 30 Name: count, dtype: int64
0:17
Write
/work/solve.py
content · 75 lines · py
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon
def solve():
# 1. Load the data
gdf = gpd.read_file('/work/cairo_parcels_legacy.geojson')
# 2. Clean geometries: remove null/empty and fix invalid geometries
# Drop null/empty geometries first
gdf = gdf[gdf.geometry.notnull() & ~gdf.geometry.is_empty].copy()
# Fix invalid geometries (e.g. self-intersections) using buffer(0)
gdf['geometry'] = gdf['geometry'].buffer(0)
# Re-filter in case buffer(0) produced empty geometries
gdf = gdf[gdf.geometry.notnull() & ~gdf.geometry.is_empty].copy()
# 3. Handle Duplicates: keep record with earliest record_seq per parcel_id
# Sort by parcel_id and record_seq (ascending)
gdf = gdf.sort_values(by=['parcel_id', 'record_seq'], ascending=[True, True])
# Drop duplicates based on parcel_id, keeping the first one (which has lowest record_seq)
gdf = gdf.drop_duplicates(subset=['parcel_id'], keep='first').copy()
# 4. Discard polygon fragments smaller than 1 m2
# Note: The prompt asks to discard fragments smaller than 1m2.
# Since we are working with the parsed geometries, we check area of the geometry.
gdf['area_calc'] = gdf.geometry.area
gdf = gdf[gdf['area_calc'] >= 1.0].copy()
# Drop the helper column
gdf = gdf.drop(columns=['area_calc'])
# 5. Ensure every feature is a MultiPolygon
def ensure_multipolygon(geom):
if isinstance(geom, Polygon):
return MultiPolygon([geom])
return geom
gdf['geometry'] = gdf['geometry'].apply(ensure_multipolygon)
# 6. Remove duplicate geometries
# If multiple parcel_ids now have the exact same geometry, we should probably
# keep only one. The prompt says "no duplicate geometries".
# To be safe, we drop duplicate geometries.
# Note: In a real registry, different parcels might share a geometry (e.g. zero-area or identical boundaries),
# but the instruction says "no duplicate geometries".
gdf = gdf.drop_duplicates(subset=['geometry']).copy()
# 7. Final cleaning: Ensure no nulls in required columns and columns exist
# Required columns: parcel_id, parcel_class, district, area_m2
# area_m2 was in the original, but let's ensure it's accurate based on the geometry
gdf['area_m2'] = gdf.geometry.area
# Check for nulls in required columns
cols_to_keep = ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
gdf = gdf[cols_to_keep]
# Drop any remaining rows with nulls in these essential columns
gdf = gdf.dropna(subset=['parcel_id', 'parcel_class', 'district', 'area_m2'])
# Final verification of geometry types
# All should be MultiPolygon
unique_types = gdf.geometry.type.unique()
print(f"Unique geometry types remaining: {unique_types}")
if any(t != 'MultiPolygon' for t in unique_types):
# This shouldn't happen given our logic, but as a safeguard:
gdf['geometry'] = gdf['geometry'].apply(ensure_multipolygon)
# 8. Write to GeoParquet
gdf.to_parquet('/work/parcels_canonical.geoparquet')
print("Successfully wrote parcels_canonical.geoparquet")
print(f"Final count: {len(gdf)}")
if __name__ == "__main__":
solve()
Wrote /work/solve.py (3268 chars).
0:18
Bash
python /work/solve.py
Unique geometry types remaining: <ArrowStringArray> ['MultiPolygon'] Length: 1, dtype: str Successfully wrote parcels_canonical.geoparquet Final count: 210
0:25
Write
/work/verify.py
content · 40 lines · py
import geopandas as gpd
def verify():
gdf = gpd.read_parquet('/work/parcels_canonical.geoparquet')
print("Shape:", gdf.shape)
print("\nColumns:", gdf.columns.tolist())
# Check geometry type
geom_types = gdf.geometry.type.unique()
print("Geometry types:", geom_types)
# Check for null geometries
print("Null geometries:", gdf.geometry.isnull().any())
print("Empty geometries:", gdf.geometry.is_empty.any())
# Check for invalid geometries
print("Invalid geometries:", not gdf.geometry.is_valid.all())
# Check for duplicate geometries
# We use the WKB representation to compare geometries accurately
duplicated_geoms = gdf.geometry.duplicated().any()
print("Duplicate geometries:", duplicated_geoms)
# Check column requirements
required_cols = ['parcel_id', 'parcel_class', 'district', 'area_m2']
missing_cols = [c for c in required_cols if c not in gdf.columns]
print("Missing required columns:", missing_cols)
# Check for nulls in required columns
nulls_in_cols = gdf[required_cols].isnull().any().any()
print("Nulls in required columns:", nulls_in_cols)
# Check area requirement (> 1)
small_areas = (gdf.geometry.area < 1.0).any()
print("Areas < 1m2:", small_areas)
if __name__ == "__main__":
verify()
Wrote /work/verify.py (1348 chars).
0:26
Bash
python /work/verify.py
Shape: (210, 5) Columns: ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry'] Geometry types: <ArrowStringArray> ['MultiPolygon'] Length: 1, dtype: str Null geometries: False Empty geometries: False Invalid geometries: False Duplicate geometries: False Missing required columns: [] Nulls in required columns: False Areas < 1m2: False
0:38