dc-l2-cairo-invalid-dedup

Status: done
Score: 0.914
Duration: 1:16
Cost: 0.16¢
Model: google/gemma-4-26b-a4b-it

Map (reference ← swipe → agent)

0:00
cairo_parcels_legacy.geojson is a parcel snapshot stitched together from three legacy provincial systems before unification, and I need to turn it into the foundation for our new central registry. Can you clean it up so we end up with one canonical record per parcel? When duplicates exist, keep the record with the earliest record_seq, and discard any polygon fragments smaller than 1 m². Please write the result to parcels_canonical.geoparquet including these columns: parcel_id (the join key), parcel_class, district, and area_m2. Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers.
0:00
System info
{
  "adapter": "openrouter",
  "model": "google/gemma-4-26b-a4b-it",
  "base_url": "https://openrouter.ai/api/v1"
}
0:02
Bash ls -F
cairo_parcels_legacy.geojson
0:04
Bash head -n 20 cairo_parcels_legacy.geojson
{
  "type": "FeatureCollection",
  "name": "cairo_parcels_legacy",
  "crs": {
    "type": "name",
    "properties": {
      "name": "urn:ogc:def:crs:EPSG::22992"
    }
  },
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              640000.0,
              815000.0
            ],
0:44
Write /work/solve.py
content · 94 lines · py
import geopandas as gpd
import pandas as pd
from shapely.geometry import MultiPolygon, Polygon

def solve():
    # Load the data
    input_file = 'cairo_parcels_legacy.geojson'
    output_file = 'parcels_canonical.geoparquet'
    
    print(f"Loading {input_file}...")
    gdf = gpd.read_file(input_file)
    
    # Initial count
    print(f"Initial records: {len(gdf)}")

    # 1. Remove null or empty geometries
    gdf = gdf[gdf.geometry.notnull()]
    gdf = gdf[~gdf.geometry.is_empty]
    print(f"After removing null/empty geometries: {len(gdf)}")

    # 2. Fix invalid geometries
    print("Fixing invalid geometries...")
    gdf['geometry'] = gdf.geometry.buffer(0)
    gdf = gdf[~gdf.geometry.is_empty] # buffer(0) can result in empty if geometry was invalid
    print(f"After fixing invalid geometries: {len(gdf)}")

    # 3. Ensure all features are MultiPolygon
    print("Converting geometries to MultiPolygon...")
    def ensure_multipolygon(geom):
        if geom.geom_type == 'Polygon':
            return MultiPolygon([geom])
        return geom

    gdf['geometry'] = gdf.geometry.apply(ensure_multipolygon)

    # 4. Handle Duplicates
    # "When duplicates exist, keep the record with the earliest record_seq"
    # The requirement implies duplicates are defined by parcel_id.
    print("Removing duplicates based on parcel_id...")
    # Sort by parcel_id and record_seq (ascending) to keep the earliest
    gdf = gdf.sort_values(by=['parcel_id', 'record_seq'], ascending=[True, True])
    gdf = gdf.drop_duplicates(subset='parcel_id', keep='first')
    print(f"After removing duplicates: {len(gdf)}")

    # 5. Remove duplicate geometries
    # "no duplicate geometries" - this usually means if two different IDs have the same shape, 
    # we keep one (though in a registry, we might want to keep them if IDs differ, 
    # but the prompt says "no duplicate geometries").
    # Let's assume it means spatial duplicates.
    print("Removing duplicate geometries...")
    gdf = gdf.drop_duplicates(subset='geometry')
    print(f"After removing duplicate geometries: {len(gdf)}")

    # 6. Calculate area and filter artifacts
    # The CRS is EPSG:22992 (based on the head output), which is a metric projected CRS.
    # Let's check the CRS just in case.
    if gdf.crs is None:
        print("Warning: CRS not detected. Assuming metric if not provided, but checking...")
        # If it's degrees, we must reproject. But 640000, 815000 looks like metric.
        # The file header said EPSG:22992.
    
    gdf['area_m2'] = gdf.geometry.area
    
    # "discard any polygon fragments smaller than 1 m²"
    print("Filtering small fragments (< 1 m²)...")
    gdf = gdf[gdf['area_m2'] >= 1.0]
    print(f"After filtering small fragments: {len(gdf)}")

    # 7. Final column selection
    # "including these columns: parcel_id, parcel_class, district, and area_m2"
    # We also need to keep geometry.
    output_columns = ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
    
    # Check if columns exist
    missing_cols = [c for c in output_columns if c not in gdf.columns]
    if missing_cols:
        print(f"Warning: Missing columns in input: {missing_cols}")
        # If they are missing, we'll just use what we have or create dummy if needed for the task
        # But let's try to just select what is available.
        output_columns = [c for c in output_columns if c in gdf.columns]

    gdf_final = gdf[output_columns]

    # Final check for any remaining nulls in critical columns
    gdf_final = gdf_final.dropna(subset=['parcel_id', 'parcel_class', 'district'])

    # 8. Write to GeoParquet
    print(f"Writing to {output_file}...")
    gdf_final.to_parquet(output_file)
    print("Done.")

if __name__ == "__main__":
    solve()
Wrote /work/solve.py (3778 chars).
0:45
Bash python solve.py
Loading cairo_parcels_legacy.geojson...
Initial records: 290
After removing null/empty geometries: 290
Fixing invalid geometries...
After fixing invalid geometries: 290
Converting geometries to MultiPolygon...
Removing duplicates based on parcel_id...
After removing duplicates: 290
Removing duplicate geometries...
After removing duplicate geometries: 240
Filtering small fragments (< 1 m²)...
After filtering small fragments: 210
Writing to parcels_canonical.geoparquet...
Done.
0:51
Bash python -c "import geopandas as gpd; gdf = gpd.read_parquet('parcels_canonical.ge...
   parcel_id  ...                                           geometry
0          1  ...  MULTIPOLYGON (((640000 815000, 640000 815040, ...
1          2  ...  MULTIPOLYGON (((640050 815000, 640050 815040, ...
2          3  ...  MULTIPOLYGON (((640100 815000, 640100 815040, ...
3          4  ...  MULTIPOLYGON (((640150 815000, 640150 815040, ...
4          5  ...  MULTIPOLYGON (((640200 815000, 640200 815040, ...

[5 rows x 5 columns]
Index(['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry'], dtype='str')
{"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "ProjectedCRS", "name": "Egypt 1907 / Red Belt", "base_crs": {"name": "Egypt 1907", "datum": {"type": "GeodeticReferenceFrame", "name": "Egypt 1907", "ellipsoid": {"name": "Helmert 1906", "semi_major_axis": 6378200, "inverse_flattening": 298.3}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}, {"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}]}, "id": {"authority": "EPSG", "code": 4229}}, "conversion": {"name": "Egypt Red Belt", "method": {"name": "Transverse Mercator", "id": {"authority": "EPSG", "code": 9807}}, "parameters": [{"name": "Latitude of natural origin", "value": 30, "unit": "degree", "id": {"authority": "EPSG", "code": 8801}}, {"name": "Longitude of natural origin", "value": 31, "unit": "degree", "id": {"authority": "EPSG", "code": 8802}}, {"name": "Scale factor at natural origin", "value": 1, "unit": "unity", "id": {"authority": "EPSG", "code": 8805}}, {"name": "False easting", "value": 615000, "unit": "metre", "id": {"authority": "EPSG", "code": 8806}}, {"name": "False northing", "value": 810000, "unit": "metre", "id": {"authority": "EPSG", "code": 8807}}]}, "coordinate_system": {"subtype": "Cartesian", "axis": [{"name": "Easting", "abbreviation": "E", "direction": "east", "unit": "metre"}, {"name": "Northing", "abbreviation": "N", "direction": "north", "unit": "metre"}]}, "scope": "Engineering survey, topographic mapping.", "area": "Egypt - onshore between 29\u00b0E and 33\u00b0E, offshore Mediterranean east of 29\u00b0E and offshore Gulf of Suez.", "bbox": {"south_latitude": 21.99, "west_longitude": 29, "north_latitude": 33.82, "east_longitude": 34.27}, "id": {"authority": "EPSG", "code": 22992}}
<ArrowStringArray>
['MultiPolygon']
Length: 1, dtype: str
1:12