dc-l2-cairo-invalid-dedup

Status: done
Score: 0.914
Duration: 0:32
Cost: 0.27¢
Model: google/gemma-4-26b-a4b-it

Map (reference ← swipe → agent)

0:00
cairo_parcels_legacy.geojson is a parcel snapshot stitched together from three legacy provincial systems before unification, and I need to turn it into the foundation for our new central registry. Can you clean it up so we end up with one canonical record per parcel? When duplicates exist, keep the record with the earliest record_seq, and discard any polygon fragments smaller than 1 m². Please write the result to parcels_canonical.geoparquet including these columns: parcel_id (the join key), parcel_class, district, and area_m2. Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers.
0:00
System info
{
  "adapter": "openrouter",
  "model": "google/gemma-4-26b-a4b-it",
  "base_url": "https://openrouter.ai/api/v1"
}
0:01
Bash ls -F
ls -F
cairo_parcels_legacy.geojson
0:04
Write /work/inspect_data.py
content · 23 lines · py
import geopandas as gpd

# Load the data to inspect it
gdf = gpd.read_file('cairo_parcels_legacy.geojson')

print("Columns:", gdf.columns.tolist())
print("Shape:", gdf.shape)
print("CRS:", gdf.crs)
print("\nFirst few rows:")
print(gdf.head())

# Check for null geometries
print("\nNull geometries count:", gdf.geometry.isna().sum())

# Check for empty geometries
print("Empty geometries count:", gdf.geometry.is_empty.sum())

# Check for invalid geometries
print("Invalid geometries count:", not gdf.geometry.is_valid.all())

# Check for duplicates based on parcel_id
print("\nDuplicate parcel_id count:", gdf['parcel_id'].duplicated().sum())
Wrote /work/inspect_data.py (643 chars).
0:04
Bash python inspect_data.py
python inspect_data.py
Columns: ['parcel_id', 'record_seq', 'parcel_class', 'district', 'area_m2', 'geometry']
Shape: (290, 6)
CRS: EPSG:22992

First few rows:
   parcel_id  ...                                           geometry
0          1  ...  POLYGON ((640000 815000, 640030 815000, 640030...
1          2  ...  POLYGON ((640050 815000, 640080 815000, 640080...
2          3  ...  POLYGON ((640100 815000, 640130 815000, 640130...
3          4  ...  POLYGON ((640150 815000, 640180 815000, 640180...
4          5  ...  POLYGON ((640200 815000, 640230 815000, 640230...

[5 rows x 6 columns]

Null geometries count: 0
Empty geometries count: 0
Invalid geometries count: True

Duplicate parcel_id count: 0
0:12
Write /work/solve.py
content · 69 lines · py
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

def solve():
    # Load the data
    gdf = gpd.read_file('cairo_parcels_legacy.geojson')
    
    # 1. Fix invalid geometries
    # buffer(0) is a common trick to fix minor topological errors
    gdf['geometry'] = gdf.geometry.buffer(0)
    
    # 2. Handle duplicates (keep earliest record_seq)
    # Sort by parcel_id and record_seq, then drop duplicates of parcel_id
    gdf = gdf.sort_values(['parcel_id', 'record_seq'])
    gdf = gdf.drop_duplicates(subset=['parcel_id'], keep='first')
    
    # 3. Discard polygon fragments smaller than 1 m2
    # We check the area of the geometry
    gdf = gdf[gdf.geometry.area >= 1.0]
    
    # 4. Ensure every feature is a MultiPolygon
    def ensure_multipolygon(geom):
        if geom.is_empty:
            return None
        if geom.geom_type == 'Polygon':
            return MultiPolygon([geom])
        elif geom.geom_type == 'MultiPolygon':
            return geom
        else:
            # If it's something else (like a GeometryCollection), try to extract Polygons
            if hasattr(geom, 'geoms'):
                polys = [g for g in geom.geoms if g.geom_type in ['Polygon', 'MultiPolygon']]
                if not polys:
                    return None
                # Union them or just wrap them
                from shapely.ops import unary_union
                return unary_union(polys)
            return None

    gdf['geometry'] = gdf.geometry.apply(ensure_multipolygon)
    
    # Drop rows that resulted in None geometries
    gdf = gdf.dropna(subset=['geometry'])
    gdf = gdf[~gdf.geometry.is_empty]
    
    # 5. Remove duplicate geometries
    # Note: duplicate geometries might exist even if parcel_ids are unique
    gdf = gdf.drop_duplicates(subset=['geometry'])

    # 6. Final cleanup of columns and selection
    # Columns to include: parcel_id, parcel_class, district, and area_m2
    # Requirement: "Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers."
    # We already calculated area_m2 in the input, but let's make sure it's consistent with the actual geometry
    gdf['area_m2'] = gdf.geometry.area
    
    final_gdf = gdf[['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']]
    
    # Final check for invalid geometries after manipulations
    final_gdf = final_gdf[final_gdf.geometry.is_valid]

    # Write to GeoParquet
    final_gdf.to_parquet('parcels_canonical.geoparquet')
    
    print(f"Final count: {len(final_gdf)}")
    print(f"Columns: {final_gdf.columns.tolist()}")

if __name__ == "__main__":
    solve()
Wrote /work/solve.py (2713 chars).
0:13
Bash python solve.py
python solve.py
Final count: 210
Columns: ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
0:21
Write /work/solve.py
content · 73 lines · py
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

def solve():
    # Load the data
    gdf = gpd.read_file('cairo_parcels_legacy.geojson')
    
    # 1. Fix invalid geometries (buffer(0) is a good start)
    gdf['geometry'] = gdf.geometry.buffer(0)
    
    # 2. Handle duplicates (keep earliest record_seq)
    # Sort by parcel_id and record_seq, then drop duplicates of parcel_id
    gdf = gdf.sort_values(['parcel_id', 'record_seq'])
    gdf = gdf.drop_duplicates(subset=['parcel_id'], keep='first')
    
    # 3. Discard polygon fragments smaller than 1 m2
    # (This helps with artifact slivers)
    gdf = gdf[gdf.geometry.area >= 1.0]
    
    # 4. Ensure every feature is a MultiPolygon
    def ensure_multipolygon(geom):
        if geom.is_empty:
            return None
        if geom.geom_type == 'Polygon':
            return MultiPolygon([geom])
        elif geom.geom_type == 'MultiPolygon':
            return geom
        else:
            # If it's a GeometryCollection or other, try to extract Polygons
            if hasattr(geom, 'geoms'):
                polys = [g for g in geom.geoms if g.geom_type in ['Polygon', 'MultiPolygon']]
                if not polys:
                    return None
                # Combine them into a single MultiPolygon
                from shapely.ops import unary_union
                merged = unary_union(polys)
                if merged.geom_type == 'Polygon':
                    return MultiPolygon([merged])
                return merged
            return None

    gdf['geometry'] = gdf.geometry.apply(ensure_multipolygon)
    
    # Drop rows that resulted in None geometries
    gdf = gdf.dropna(subset=['geometry'])
    gdf = gdf[~gdf.geometry.is_empty]
    
    # 5. Remove duplicate geometries (if any remain)
    # We use a small tolerance for geometry comparison if necessary, 
    # but standard drop_duplicates on geometry is usually fine for exact matches.
    gdf = gdf.drop_duplicates(subset=['geometry'])

    # 6. Final selection and column formatting
    # Requirement: parcel_id, parcel_class, district, and area_m2
    # We should recalculate area_m2 to be sure it reflects the cleaned geometry.
    gdf['area_m2'] = gdf.geometry.area
    
    final_gdf = gdf[['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']]
    
    # Final validation check
    final_gdf = final_gdf[final_gdf.geometry.is_valid]
    final_gdf = final_gdf[~final_gdf.geometry.is_empty]
    
    # Write to GeoParquet
    final_gdf.to_parquet('parcels_canonical.geoparquet')
    
    print(f"Final count: {len(final_gdf)}")
    print(f"Columns: {final_gdf.columns.tolist()}")
    print("Geometry types:", final_gdf.geometry.geom_type.unique())

if __name__ == "__main__":
    solve()
Wrote /work/solve.py (2776 chars).
0:22
Bash python solve.py
python solve.py
Final count: 210
Columns: ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
Geometry types: <ArrowStringArray>
['MultiPolygon']
Length: 1, dtype: str
0:24
Bash # Check if the file exists and has size
# Check if the file exists and has size
ls -lh parcels_canonical.geoparquet
-rw-r--r-- 1 agent agent 21K Jun 15 19:05 parcels_canonical.geoparquet
0:31