dc-l2-cairo-invalid-dedup

Status: done
Score: 1.000
Duration: 3:25
Cost: 0.24¢
Model: google/gemma-4-26b-a4b-it
Model error Retrospective assessed 2026-06-09T12:12:10Z

0:00
cairo_parcels_legacy.geojson is a parcel snapshot stitched together from three legacy provincial systems before unification, and I need to turn it into the foundation for our new central registry. Can you clean it up so we end up with one canonical record per parcel? When duplicates exist, keep the record with the earliest record_seq, and discard any polygon fragments smaller than 1 m². Please write the result to parcels_canonical.geoparquet including these columns: parcel_id (the join key), parcel_class, district, and area_m2. Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers.
0:00
System info
{
  "adapter": "openrouter",
  "model": "google/gemma-4-26b-a4b-it",
  "base_url": "https://openrouter.ai/api/v1"
}
0:02
Bash ls -F
cairo_parcels_legacy.geojson
0:04
Write /work/inspect_data.py
content · 11 lines · py
import geopandas as gpd

# Read the file to inspect columns and geometry
gdf = gpd.read_file('cairo_parcels_legacy.geojson')

print("Columns:", gdf.columns.tolist())
print("First 5 rows:\n", gdf.head())
print("\nGeometry types:", gdf.geometry.type.value_counts())
print("\nCRS:", gdf.crs)
print("\nNull geometries:", gdf.geometry.isna().sum())
Wrote /work/inspect_data.py (344 chars).
0:05
Bash python /work/inspect_data.py
Columns: ['parcel_id', 'record_seq', 'parcel_class', 'district', 'area_m2', 'geometry']
First 5 rows:
    parcel_id  ...                                           geometry
0          1  ...  POLYGON ((640000 815000, 640030 815000, 640030...
1          2  ...  POLYGON ((640050 815000, 640080 815000, 640080...
2          3  ...  POLYGON ((640100 815000, 640130 815000, 640130...
3          4  ...  POLYGON ((640150 815000, 640180 815000, 640180...
4          5  ...  POLYGON ((640200 815000, 640230 815000, 640230...

[5 rows x 6 columns]

Geometry types: Polygon         260
MultiPolygon     30
Name: count, dtype: int64

CRS: EPSG:22992

Null geometries: 0
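
The inspection above reports geometry types, CRS, and null counts, but not validity. The following is a minimal sketch of an extra validity check, assuming Shapely 2.x is installed. The helper name `validity_report` and the sample geometries are illustrative and were not part of the run.

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity, make_valid


def validity_report(geoms):
    """Return (index, reason) for every invalid geometry in geoms."""
    return [(i, explain_validity(g)) for i, g in enumerate(geoms) if not g.is_valid]


# Hypothetical sample data: one valid square and one self-intersecting "bowtie".
square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])

report = validity_report([square, bowtie])
repaired = make_valid(bowtie)  # splits the bowtie into valid polygonal parts

print(report)
print(repaired.geom_type, repaired.is_valid)
```

On a GeoDataFrame the same helper could be called as `validity_report(gdf.geometry)`, which would show how many input features need repair before any deduplication is attempted.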
2:07
Write /work/solve.py
content · 72 lines · py
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

def solve():
    # 1. Read the data
    gdf = gpd.read_file('cairo_parcels_legacy.geojson')
    print(f"Initial record count: {len(gdf)}")

    # 2. Ensure valid geometries and remove nulls/empty
    gdf = gdf[gdf.geometry.notna()]
    gdf = gdf[~gdf.geometry.is_empty]
    gdf['geometry'] = gdf.geometry.make_valid()
    
    # 3. Handle duplicates based on parcel_id
    # "When duplicates exist, keep the record with the earliest record_seq"
    # We sort by parcel_id and record_seq, then drop duplicates of parcel_id
    gdf = gdf.sort_values(by=['parcel_id', 'record_seq'], ascending=[True, True])
    gdf = gdf.drop_duplicates(subset='parcel_id', keep='first')
    print(f"Record count after deduplication: {len(gdf)}")

    # 4. Convert all geometries to MultiPolygon
    def ensure_multipolygon(geom):
        if geom.is_empty:
            return None
        if isinstance(geom, MultiPolygon):
            return geom
        elif isinstance(geom, Polygon):
            return MultiPolygon([geom])
        else:
            # For cases like GeometryCollection or others, try to extract polygons
            if hasattr(geom, 'geoms'):
                # Flatten one level so nested MultiPolygons contribute plain Polygons
                polys = [p for g in geom.geoms for p in getattr(g, 'geoms', [g]) if isinstance(p, Polygon)]
                if not polys:
                    return None
                return MultiPolygon(polys)
            return None

    gdf['geometry'] = gdf['geometry'].apply(ensure_multipolygon)
    gdf = gdf[gdf.geometry.notna()]

    # 5. Calculate area and filter fragments < 1 m2
    # The prompt mentions "discard any polygon fragments smaller than 1 m2"
    # Since we are making them MultiPolygons, we should probably check the total area 
    # of the MultiPolygon or individual components? 
    # Usually, "parcel fragments" refers to the feature itself. 
    # Let's calculate the area of the resulting geometry.
    gdf['area_m2'] = gdf.geometry.area
    gdf = gdf[gdf['area_m2'] >= 1.0]
    print(f"Record count after area filter (< 1m2): {len(gdf)}")

    # 6. Remove invalid geometries (some might have become invalid during conversion/repair)
    gdf = gdf[gdf.is_valid]
    
    # 7. Remove duplicate geometries
    # "no duplicate geometries"
    gdf = gdf.drop_duplicates(subset='geometry')
    print(f"Record count after removing duplicate geometries: {len(gdf)}")

    # 8. Final selection of columns
    # "including these columns: parcel_id, parcel_class, district, and area_m2"
    # Ensure geometry is preserved
    output_cols = ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
    gdf_final = gdf[output_cols].copy()

    # 9. Write to GeoParquet
    gdf_final.to_parquet('parcels_canonical.geoparquet')
    print(f"Successfully wrote {len(gdf_final)} records to parcels_canonical.geoparquet")

if __name__ == "__main__":
    solve()
Wrote /work/solve.py (2948 chars).
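
Step 5 of solve.py filters on the total area of each feature. Another reading of "polygon fragments smaller than 1 m²" is that every component polygon should be checked individually, so that a small part attached to a large parcel is also dropped. The following is a hedged sketch of that per-part variant, assuming Shapely 2.x. The names `polygon_parts` and `to_multipolygon` are hypothetical and do not appear in the run.

```python
from shapely.geometry import GeometryCollection, LineString, MultiPolygon, Polygon


def polygon_parts(geom):
    """Yield every Polygon inside geom, descending into multi-part geometries."""
    if geom is None or geom.is_empty:
        return
    if isinstance(geom, Polygon):
        yield geom
    elif hasattr(geom, "geoms"):
        for part in geom.geoms:
            yield from polygon_parts(part)


def to_multipolygon(geom, min_part_area=1.0):
    """Keep polygon parts with area >= min_part_area; return None if none remain."""
    parts = [p for p in polygon_parts(geom) if p.area >= min_part_area]
    return MultiPolygon(parts) if parts else None


# Hypothetical sample: a 100 m² parcel, a 0.25 m² sliver, and a stray line.
big = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
sliver = Polygon([(20, 0), (20.5, 0), (20.5, 0.5), (20, 0.5)])
mixed = GeometryCollection([MultiPolygon([big, sliver]), LineString([(0, 0), (1, 1)])])

cleaned = to_multipolygon(mixed)
print(cleaned.geom_type, len(cleaned.geoms), cleaned.area)
```

Applied with `gdf.geometry.apply(to_multipolygon)` after `make_valid()`, this would replace both the MultiPolygon conversion and the area filter. Rows that come back as `None` would still need to be dropped, and `area_m2` recomputed afterwards.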
2:10
Bash python /work/solve.py
Initial record count: 290
Record count after deduplication: 290
Record count after area filter (< 1m2): 260
Record count after removing duplicate geometries: 210
Successfully wrote 210 records to parcels_canonical.geoparquet
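
The counts above show that deduplicating on parcel_id removed nothing, while deduplicating on geometry removed 50 records. Because the frame was sorted by parcel_id first, `drop_duplicates(subset='geometry')` keeps the lowest parcel_id within each group of identical geometries, which is not necessarily the earliest record_seq. The following is a sketch of geometry deduplication that orders by record_seq and compares normalized geometries, assuming Shapely 2.x. The record layout (plain dicts) and the function name `dedupe_by_geometry` are illustrative only.

```python
import shapely
from shapely.geometry import Polygon


def dedupe_by_geometry(records):
    """Keep, for each distinct geometry, the record with the smallest record_seq.

    Geometries are normalized first, so rings that differ only in start
    vertex or winding direction count as the same shape.
    """
    kept = {}
    for rec in sorted(records, key=lambda r: r["record_seq"]):
        key = shapely.normalize(rec["geometry"]).wkb
        kept.setdefault(key, rec)  # first (earliest) record for this shape wins
    return list(kept.values())


# Hypothetical sample: records 1 and 2 describe the same square with
# different vertex order; record 2 has the earlier record_seq.
records = [
    {"parcel_id": 1, "record_seq": 7,
     "geometry": Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])},
    {"parcel_id": 2, "record_seq": 3,
     "geometry": Polygon([(4, 4), (0, 4), (0, 0), (4, 0)])},
    {"parcel_id": 3, "record_seq": 5,
     "geometry": Polygon([(10, 0), (14, 0), (14, 4), (10, 4)])},
]

survivors = dedupe_by_geometry(records)
print([r["parcel_id"] for r in survivors])
```

Whether the legacy data actually contains duplicates that differ only in vertex order is not known from this transcript. The normalization step is a precaution rather than something the run demonstrated a need for.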
2:54
Write /work/verify.py
content · 40 lines · py
import geopandas as gpd

def verify():
    gdf = gpd.read_parquet('parcels_canonical.geoparquet')
    
    print("--- Verification ---")
    print(f"Total records: {len(gdf)}")
    print(f"Columns: {gdf.columns.tolist()}")
    
    # Check if all are MultiPolygon
    geom_types = gdf.geometry.type.unique()
    print(f"Geometry types present: {geom_types}")
    assert all(t == 'MultiPolygon' for t in geom_types), "Not all geometries are MultiPolygons"

    # Check for nulls or empty
    assert gdf.geometry.notna().all(), "Found null geometries"
    assert not gdf.geometry.is_empty.any(), "Found empty geometries"
    
    # Check for invalid
    assert gdf.is_valid.all(), "Found invalid geometries"

    # Check for duplicate geometries
    num_duplicates = gdf.geometry.duplicated().sum()
    print(f"Duplicate geometries: {num_duplicates}")
    assert num_duplicates == 0, "Found duplicate geometries"

    # Check columns
    expected_cols = {'parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry'}
    assert expected_cols.issubset(set(gdf.columns)), f"Missing columns. Expected {expected_cols}"

    # Check area
    min_area = gdf.geometry.area.min()
    print(f"Minimum area: {min_area}")
    assert min_area >= 1.0, "Found geometry smaller than 1 m2"

    print("All checks passed!")

if __name__ == "__main__":
    verify()
Wrote /work/verify.py (1348 chars).
2:56
Bash python /work/verify.py
--- Verification ---
Total records: 210
Columns: ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
Geometry types present: <ArrowStringArray>
['MultiPolygon']
Length: 1, dtype: str
Duplicate geometries: 0
Minimum area: 600.0
All checks passed!
2:58
Bash ls -F
cairo_parcels_legacy.geojson
inspect_data.py
parcels_canonical.geoparquet
solve.py
verify.py
3:24