dc-l2-cairo-invalid-dedup

Status: done
Score: 1.000
Duration: 1:33
Cost: 0.25¢
Model: google/gemma-4-26b-a4b-it

0:00
cairo_parcels_legacy.geojson is a parcel snapshot stitched together from three legacy provincial systems before unification, and I need to turn it into the foundation for our new central registry. Can you clean it up so we end up with one canonical record per parcel? When duplicates exist, keep the record with the earliest record_seq, and discard any polygon fragments smaller than 1 m². Please write the result to parcels_canonical.geoparquet including these columns: parcel_id (the join key), parcel_class, district, and area_m2. Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers.
0:00
System info
{
  "adapter": "openrouter",
  "model": "google/gemma-4-26b-a4b-it",
  "base_url": "https://openrouter.ai/api/v1"
}
0:07
Bash ls -lh cairo_parcels_legacy.geojson
ls -lh cairo_parcels_legacy.geojson
-rw-r--r-- 1 agent agent 219K Jun 15 23:06 cairo_parcels_legacy.geojson
0:21
Write /work/inspect_data.py
content · 18 lines · py
import geopandas as gpd

def inspect_data(file_path):
    gdf = gpd.read_file(file_path)
    print("--- Info ---")
    print(gdf.info())
    print("\n--- Head ---")
    print(gdf.head())
    print("\n--- Columns ---")
    print(gdf.columns)
    print("\n--- Geometry Types ---")
    print(gdf.geom_type.value_counts())
    print("\n--- Null Values ---")
    print(gdf.isnull().sum())

if __name__ == "__main__":
    inspect_data('cairo_parcels_legacy.geojson')
Wrote /work/inspect_data.py (461 chars).
0:25
Bash python /work/inspect_data.py
python /work/inspect_data.py
--- Info ---
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 290 entries, 0 to 289
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   parcel_id     290 non-null    int32   
 1   record_seq    290 non-null    int32   
 2   parcel_class  290 non-null    str     
 3   district      290 non-null    str     
 4   area_m2       290 non-null    float64 
 5   geometry      290 non-null    geometry
dtypes: float64(1), geometry(1), int32(2), str(2)
memory usage: 17.8 KB
None

--- Head ---
   parcel_id  ...                                           geometry
0          1  ...  POLYGON ((640000 815000, 640030 815000, 640030...
1          2  ...  POLYGON ((640050 815000, 640080 815000, 640080...
2          3  ...  POLYGON ((640100 815000, 640130 815000, 640130...
3          4  ...  POLYGON ((640150 815000, 640180 815000, 640180...
4          5  ...  POLYGON ((640200 815000, 640230 815000, 640230...

[5 rows x 6 columns]

--- Columns ---
Index(['parcel_id', 'record_seq', 'parcel_class', 'district', 'area_m2',
       'geometry'],
      dtype='str')

--- Geometry Types ---
Polygon         260
MultiPolygon     30
Name: count, dtype: int64

--- Null Values ---
parcel_id       0
record_seq      0
parcel_class    0
district        0
area_m2         0
geometry        0
dtype: int64
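The inspection above covers column types, geometry types, and nulls, but it does not show how many parcel_id values repeat, which is the quantity the deduplication step depends on. A minimal standalone sketch of that count follows; the `rows` list and its values are hypothetical and are not taken from the file.

```python
from collections import Counter

# Hypothetical (parcel_id, record_seq) pairs standing in for the real table.
rows = [(1, 10), (2, 11), (1, 12), (3, 13), (2, 14), (2, 15)]

def repeated_ids(pairs):
    """Return {parcel_id: occurrence count} for ids that appear more than once."""
    counts = Counter(pid for pid, _ in pairs)
    return {pid: n for pid, n in counts.items() if n > 1}

duplicates = repeated_ids(rows)
print(duplicates)
```

On the real GeoDataFrame, `gdf['parcel_id'].duplicated(keep=False).sum()` would report how many rows share an id with another row.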
0:50
Write /work/solve.py
content · 78 lines · py
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

def clean_parcels(input_file, output_file):
    # 1. Load the data
    gdf = gpd.read_file(input_file)
    
    # 2. Remove null or empty geometries
    gdf = gdf[gdf.geometry.notnull() & ~gdf.geometry.is_empty]
    
    # 3. Fix invalid geometries
    gdf['geometry'] = gdf.geometry.make_valid()
    
    # 4. Discard any polygon fragments smaller than 1 m2
    # "Fragments" is read here as the individual polygon parts of each feature,
    # so a MultiPolygon is filtered part by part rather than by its total area.
    
    def filter_small_fragments(geom):
        if geom.is_empty:
            return None
        if isinstance(geom, Polygon):
            return geom if geom.area >= 1.0 else None
        elif isinstance(geom, MultiPolygon):
            # Keep only the parts of the MultiPolygon that are >= 1 m2
            parts = [p for p in geom.geoms if p.area >= 1.0]
            if not parts:
                return None
            return MultiPolygon(parts)
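        # Any other geometry type (e.g. a GeometryCollection from make_valid)
        # is passed through unchanged.
        return geom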

    gdf['geometry'] = gdf['geometry'].apply(filter_small_fragments)
    gdf = gdf[gdf.geometry.notnull() & ~gdf.geometry.is_empty]

    # 5. Handle duplicates: one canonical record per parcel.
    # Keep the record with the earliest record_seq.
    gdf = gdf.sort_values('record_seq')
    gdf = gdf.drop_duplicates(subset=['parcel_id'], keep='first')
    
    # 6. Ensure every feature is a MultiPolygon
    def ensure_multipolygon(geom):
        if isinstance(geom, Polygon):
            return MultiPolygon([geom])
        return geom

    gdf['geometry'] = gdf['geometry'].apply(ensure_multipolygon)
    
    # 7. Remove duplicate geometries
    # Rows are still ordered by record_seq, so the first (earliest) record of
    # each repeated geometry is the one kept.
    gdf = gdf.drop_duplicates(subset=['geometry'])

    # 8. Final cleaning: remove any remaining empty/invalid (safety check)
    gdf = gdf[gdf.geometry.notnull() & ~gdf.geometry.is_empty]
    gdf['geometry'] = gdf.geometry.make_valid()
    
    # Ensure we only have MultiPolygons after make_valid (if make_valid returned Polygons)
    gdf['geometry'] = gdf['geometry'].apply(ensure_multipolygon)

    # 9. Select final columns: parcel_id, parcel_class, district, and area_m2.
    # The request does not say whether area_m2 should be recomputed, but the
    # cleaning steps can change a geometry, so the area is recalculated from
    # the cleaned geometry to keep the two consistent.
    gdf['area_m2'] = gdf.geometry.area

    final_gdf = gdf[['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']]
    
    # 10. Save to GeoParquet
    final_gdf.to_parquet(output_file)
    print(f"Successfully saved {len(final_gdf)} records to {output_file}")

if __name__ == "__main__":
    clean_parcels('cairo_parcels_legacy.geojson', 'parcels_canonical.geoparquet')
Wrote /work/solve.py (3257 chars).
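Step 7 relies on `drop_duplicates` over the geometry column, which appears to compare geometries by their exact coordinate sequence. If so, the same polygon digitized from a different start vertex or in the opposite direction would not be caught. The sketch below builds a vertex-order-insensitive key for a single ring in plain Python so that it runs standalone; `ring_key` and `signed_area` are hypothetical helpers and are not part of solve.py.

```python
def signed_area(ring):
    """Shoelace formula on an unclosed ring; positive for counter-clockwise."""
    return 0.5 * sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1])
    )

def ring_key(coords):
    """Canonical tuple for a closed ring, independent of start vertex and direction."""
    ring = list(coords)
    if len(ring) > 1 and ring[0] == ring[-1]:
        ring = ring[:-1]           # drop the repeated closing vertex
    if signed_area(ring) < 0:
        ring.reverse()             # force counter-clockwise orientation
    start = ring.index(min(ring))  # rotate so the smallest (x, y) comes first
    return tuple(ring[start:] + ring[:start])

# The same 30 x 20 square, once counter-clockwise and once reversed
# with a different start vertex.
a = [(0, 0), (30, 0), (30, 20), (0, 20), (0, 0)]
b = [(30, 20), (30, 0), (0, 0), (0, 20), (30, 20)]
print(ring_key(a) == ring_key(b))
```

With shapely 2.x, calling `normalize()` on each geometry before comparing is likely the simpler route to a comparable canonical form for whole geometries.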
0:56
Bash python /work/solve.py
python /work/solve.py
Successfully saved 210 records to parcels_canonical.geoparquet
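The run reduces 290 input rows to 210 output rows through the filtering and deduplication steps combined. The keep-earliest rule from step 5 can be illustrated in isolation with a small standalone sketch. The records and the `keep_earliest` helper are hypothetical; they only mirror what `sort_values('record_seq')` followed by `drop_duplicates(subset=['parcel_id'], keep='first')` does.

```python
# Hypothetical records: (parcel_id, record_seq, source system)
records = [
    (7, 305, "system_c"),
    (7, 112, "system_a"),
    (8, 240, "system_b"),
    (7, 198, "system_b"),
]

def keep_earliest(rows):
    """Return one row per parcel_id, choosing the lowest record_seq."""
    best = {}
    for row in rows:
        pid, seq = row[0], row[1]
        if pid not in best or seq < best[pid][1]:
            best[pid] = row
    return sorted(best.values())

canonical = keep_earliest(records)
print(canonical)
```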
1:17
Write /work/verify.py
content · 51 lines · py
import geopandas as gpd

def verify_output(file_path):
    gdf = gpd.read_parquet(file_path)
    
    print("--- Info ---")
    print(gdf.info())
    
    print("\n--- Head ---")
    print(gdf.head())
    
    print("\n--- Column Check ---")
    expected_cols = {'parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry'}
    actual_cols = set(gdf.columns)
    print(f"Expected: {expected_cols}")
    print(f"Actual: {actual_cols}")
    assert expected_cols.issubset(actual_cols), "Missing columns!"
    
    print("\n--- Geometry Type Check ---")
    geom_types = gdf.geom_type.unique()
    print(f"Geometry types: {geom_types}")
    assert all(t == 'MultiPolygon' for t in geom_types), f"Unexpected geometry types: {geom_types}"
    
    print("\n--- Area Check ---")
    min_area = gdf.geometry.area.min()
    print(f"Minimum area: {min_area}")
    assert min_area >= 1.0, f"Found area smaller than 1m2: {min_area}"

    print("\n--- Null/Empty Check ---")
    print(f"Null geometries: {gdf.geometry.isnull().sum()}")
    print(f"Empty geometries: {gdf.geometry.is_empty.sum()}")
    assert gdf.geometry.isnull().sum() == 0, "Null geometries found!"
    assert gdf.geometry.is_empty.sum() == 0, "Empty geometries found!"

    print("\n--- Duplicate Geometry Check ---")
    # Note: this assertion fails if two distinct parcels share a geometry, which
    # is unlikely but possible; the requirement is still "no duplicate geometries".
    dupes = gdf.geometry.duplicated().sum()
    print(f"Duplicate geometries: {dupes}")
    assert dupes == 0, "Duplicate geometries found!"

    print("\n--- Duplicate Parcel ID Check ---")
    dupe_ids = gdf['parcel_id'].duplicated().sum()
    print(f"Duplicate parcel_ids: {dupe_ids}")
    assert dupe_ids == 0, "Duplicate parcel_ids found!"

    print("\nAll checks passed!")

if __name__ == "__main__":
    verify_output('parcels_canonical.geoparquet')
Wrote /work/verify.py (1890 chars).
1:20
Bash python /work/verify.py
python /work/verify.py
--- Info ---
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   parcel_id     210 non-null    int32   
 1   parcel_class  210 non-null    str     
 2   district      210 non-null    str     
 3   area_m2       210 non-null    float64 
 4   geometry      210 non-null    geometry
dtypes: float64(1), geometry(1), int32(1), str(2)
memory usage: 12.2 KB
None

--- Head ---
   parcel_id  ...                                           geometry
0          1  ...  MULTIPOLYGON (((640000 815000, 640030 815000, ...
1          2  ...  MULTIPOLYGON (((640050 815000, 640080 815000, ...
2          3  ...  MULTIPOLYGON (((640100 815000, 640130 815000, ...
3          4  ...  MULTIPOLYGON (((640150 815000, 640180 815000, ...
4          5  ...  MULTIPOLYGON (((640200 815000, 640230 815000, ...

[5 rows x 5 columns]

--- Column Check ---
Expected: {'parcel_id', 'district', 'parcel_class', 'geometry', 'area_m2'}
Actual: {'district', 'parcel_class', 'geometry', 'area_m2', 'parcel_id'}

--- Geometry Type Check ---
Geometry types: <ArrowStringArray>
['MultiPolygon']
Length: 1, dtype: str

--- Area Check ---
Minimum area: 600.0

--- Null/Empty Check ---
Null geometries: 0
Empty geometries: 0

--- Duplicate Geometry Check ---
Duplicate geometries: 0

--- Duplicate Parcel ID Check ---
Duplicate parcel_ids: 0

All checks passed!
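The checks above leave two gaps. The Area Check measures each feature's total area, so a MultiPolygon holding one large part plus a sub-1 m² sliver would still pass, and geometry validity is never asserted. The standalone sketch below shows a per-part check using the shoelace formula; the coordinates and the `small_parts` helper are hypothetical, and interior rings (holes) are ignored for brevity.

```python
def ring_area(ring):
    """Unsigned area of a closed ring via the shoelace formula."""
    twice = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(twice) / 2.0

def small_parts(multipolygon, min_area=1.0):
    """Return indices of parts whose exterior ring covers less than min_area."""
    return [
        i for i, exterior in enumerate(multipolygon)
        if ring_area(exterior) < min_area
    ]

# Hypothetical feature: a 30 m x 20 m parcel plus a 0.5 m x 0.5 m sliver.
feature = [
    [(0, 0), (30, 0), (30, 20), (0, 20), (0, 0)],
    [(40, 0), (40.5, 0), (40.5, 0.5), (40, 0.5), (40, 0)],
]
total = sum(ring_area(part) for part in feature)

# The whole-feature test passes even though part 1 is a sliver.
print(total >= 1.0, small_parts(feature))
```

On the real output, the equivalent would be to check `part.area` for every `part` in `geom.geoms`, alongside `gdf.geometry.is_valid.all()` for the validity requirement.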
1:30