dc-l2-cairo-invalid-dedup
0:00
cairo_parcels_legacy.geojson is a parcel snapshot stitched together from three legacy provincial systems before unification, and I need to turn it into the foundation for our new central registry. Can you clean it up so we end up with one canonical record per parcel? When duplicates exist, keep the record with the earliest record_seq, and discard any polygon fragments smaller than 1 m².
Please write the result to parcels_canonical.geoparquet including these columns: parcel_id (the join key), parcel_class, district, and area_m2. Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers.
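The selection rule in the request (one record per parcel, earliest record_seq wins) can be illustrated in isolation. The following is only a minimal sketch in plain Python; the `keep_earliest` helper and the sample records are hypothetical and are not taken from the dataset.

```python
def keep_earliest(records):
    """Return one record per parcel_id, choosing the lowest record_seq."""
    best = {}
    for rec in records:
        pid = rec["parcel_id"]
        # Replace the stored record only if this one has a smaller sequence number.
        if pid not in best or rec["record_seq"] < best[pid]["record_seq"]:
            best[pid] = rec
    return sorted(best.values(), key=lambda r: r["parcel_id"])


# Hypothetical sample: parcel 1 appears twice, parcel 2 once.
sample = [
    {"parcel_id": 1, "record_seq": 7, "district": "A"},
    {"parcel_id": 1, "record_seq": 3, "district": "A"},
    {"parcel_id": 2, "record_seq": 5, "district": "B"},
]
canonical = keep_earliest(sample)
print([(r["parcel_id"], r["record_seq"]) for r in canonical])
```

This sketch assumes duplicates share a parcel_id. If the legacy systems assigned different ids to the same parcel, the grouping key would have to come from the geometry instead.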
0:00
System info
{
"adapter": "openrouter",
"model": "google/gemma-4-26b-a4b-it",
"base_url": "https://openrouter.ai/api/v1"
}
0:02
Bash
ls -F
cairo_parcels_legacy.geojson
0:07
Write
/work/inspect_data.py
content · 14 lines · py
import geopandas as gpd
# Load the data
gdf = gpd.read_file('cairo_parcels_legacy.geojson')
print("Initial info:")
print(gdf.info())
print("\nFirst few rows:")
print(gdf.head())
print("\nGeometry type counts:")
print(gdf.geom_type.value_counts())
print("\nColumns:")
print(gdf.columns)
Wrote /work/inspect_data.py (288 chars).
0:09
Bash
python /work/inspect_data.py
Initial info:
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 290 entries, 0 to 289
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 parcel_id 290 non-null int32
1 record_seq 290 non-null int32
2 parcel_class 290 non-null str
3 district 290 non-null str
4 area_m2 290 non-null float64
5 geometry 290 non-null geometry
dtypes: float64(1), geometry(1), int32(2), str(2)
memory usage: 17.8 KB
None
First few rows:
parcel_id ... geometry
0 1 ... POLYGON ((640000 815000, 640030 815000, 640030...
1 2 ... POLYGON ((640050 815000, 640080 815000, 640080...
2 3 ... POLYGON ((640100 815000, 640130 815000, 640130...
3 4 ... POLYGON ((640150 815000, 640180 815000, 640180...
4 5 ... POLYGON ((640200 815000, 640230 815000, 640230...
[5 rows x 6 columns]
Geometry type counts:
Polygon 260
MultiPolygon 30
Name: count, dtype: int64
Columns:
Index(['parcel_id', 'record_seq', 'parcel_class', 'district', 'area_m2',
'geometry'],
dtype='str')
0:34
Write
/work/solve.py
content · 76 lines · py
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon


def clean_parcels(input_file, output_file):
    # 1. Load the data
    gdf = gpd.read_file(input_file)

    # 2. Handle invalid geometries
    # Ensure geometries are valid. If not, try to fix them using buffer(0)
    gdf['geometry'] = gdf['geometry'].buffer(0)

    # 3. Remove null or empty geometries
    gdf = gdf[gdf.geometry.notnull() & ~gdf.geometry.is_empty]

    # 4. Handle duplicates based on parcel_id
    # Keep the record with the earliest record_seq
    gdf = gdf.sort_values('record_seq').drop_duplicates(subset='parcel_id', keep='first')

    # 5. Discard polygon fragments smaller than 1 m^2
    # Note: The requirement says "discard any polygon fragments smaller than 1 m^2".
    # Usually this implies that if a parcel is represented by multiple parts,
    # we should filter the parts. However, the instructions also say "Every feature must be a MultiPolygon".
    # This suggests we should process the geometry components.
    def filter_small_parts(geom):
        if geom.is_empty:
            return geom
        # If it's a Polygon, check its area
        if isinstance(geom, Polygon):
            return geom if geom.area >= 1.0 else None
        # If it's a MultiPolygon, check each part
        if isinstance(geom, MultiPolygon):
            parts = [p for p in geom.geoms if p.area >= 1.0]
            if not parts:
                return None
            return MultiPolygon(parts)
        return geom

    # Apply filtering to geometries
    gdf['geometry'] = gdf['geometry'].apply(filter_small_parts)

    # Drop any rows that became empty/None after filtering parts
    gdf = gdf[gdf.geometry.notnull() & ~gdf.geometry.is_empty]

    # 6. Ensure every feature is a MultiPolygon
    def ensure_multipolygon(geom):
        if isinstance(geom, Polygon):
            return MultiPolygon([geom])
        return geom

    gdf['geometry'] = gdf['geometry'].apply(ensure_multipolygon)

    # 7. Remove duplicate geometries
    gdf = gdf.drop_duplicates(subset='geometry')

    # 8. Final column selection and cleanup
    # Required columns: parcel_id, parcel_class, district, area_m2
    # We'll also recalculate area_m2 based on the cleaned geometry to be sure,
    # or just keep the column if it's expected to be the one from the record.
    # The prompt says "including these columns", implying they should be present.
    # Let's recalculate area_m2 to ensure it matches the cleaned geometry.
    gdf['area_m2'] = gdf['geometry'].area
    output_cols = ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
    gdf = gdf[output_cols]

    # Save to geoparquet
    gdf.to_parquet(output_file)
    print(f"Successfully processed {len(gdf)} parcels.")


if __name__ == "__main__":
    clean_parcels('cairo_parcels_legacy.geojson', 'parcels_canonical.geoparquet')
Wrote /work/solve.py (2939 chars).
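One caveat on step 7 of the script: as far as I understand, dropping duplicates on the geometry column compares the stored representation, so two records tracing the same outline with a different start vertex or winding direction may be treated as distinct. The sketch below shows, in plain Python, one way such rings could be brought to a comparable form. The `normalize_ring` helper and the coordinates are hypothetical, and it handles only a single exterior ring without holes or redundant vertices.

```python
def signed_area(ring):
    """Shoelace formula on an open ring; positive for counter-clockwise order."""
    pairs = zip(ring, ring[1:] + ring[:1])
    return 0.5 * sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)


def normalize_ring(coords):
    """Return a canonical tuple: open ring, counter-clockwise, starting at the smallest vertex."""
    ring = list(coords)
    if len(ring) > 1 and ring[0] == ring[-1]:
        ring = ring[:-1]          # drop the repeated closing vertex
    if signed_area(ring) < 0:
        ring.reverse()            # force counter-clockwise winding
    start = ring.index(min(ring))  # rotate so the lexicographically smallest vertex is first
    return tuple(ring[start:] + ring[:start])


# Hypothetical rings: same 30 x 10 rectangle, different start vertex and direction.
ring_a = [(0, 0), (30, 0), (30, 10), (0, 10), (0, 0)]
ring_b = [(30, 10), (30, 0), (0, 0), (0, 10), (30, 10)]
print(ring_a == ring_b, normalize_ring(ring_a) == normalize_ring(ring_b))
```

Comparing the normalized tuples rather than the raw coordinate lists is what lets the two rings above be recognized as the same shape.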
0:35
Bash
python /work/solve.py
Successfully processed 210 parcels.
1:07
Write
/work/validate_result.py
content · 37 lines · py
import geopandas as gpd
# Load the result
gdf = gpd.read_parquet('parcels_canonical.geoparquet')
print("Final info:")
print(gdf.info())
print("\nGeometry type counts:")
print(gdf.geom_type.value_counts())
# Check for duplicates in parcel_id
duplicates = gdf.duplicated(subset='parcel_id').sum()
print(f"\nDuplicate parcel_ids: {duplicates}")
# Check for null/empty geometries
null_empty = gdf.geometry.isna().sum() + gdf.geometry.is_empty.sum()
print(f"Null or empty geometries: {null_empty}")
# Check for invalid geometries (is_valid.sum() counts the valid ones)
valid_count = gdf.geometry.is_valid.sum()
print(f"Invalid geometries: {len(gdf) - valid_count}")
# Check for duplicate geometries
geom_duplicates = gdf.duplicated(subset='geometry').sum()
print(f"Duplicate geometries: {geom_duplicates}")
# Check all are MultiPolygon
all_multipolygon = all(gdf.geom_type == 'MultiPolygon')
print(f"All are MultiPolygon: {all_multipolygon}")
# Check area_m2 matches geometry area
area_mismatch = (gdf['area_m2'] - gdf.geometry.area).abs().max()
print(f"Max area mismatch: {area_mismatch}")
# Check smallest area
print(f"Minimum area: {gdf['area_m2'].min()}")
Wrote /work/validate_result.py (1117 chars).
1:09
Bash
python /work/validate_result.py
Final info:
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 parcel_id 210 non-null int32
1 parcel_class 210 non-null str
2 district 210 non-null str
3 area_m2 210 non-null float64
4 geometry 210 non-null geometry
dtypes: float64(1), geometry(1), int32(1), str(2)
memory usage: 12.2 KB
None
Geometry type counts:
MultiPolygon 210
Name: count, dtype: int64
Duplicate parcel_ids: 0
Null or empty geometries: 0
Invalid geometries: 0
Duplicate geometries: 0
All are MultiPolygon: True
Max area mismatch: 0.0
Minimum area: 300.0
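The "Minimum area" line reports the smallest whole-feature area, which on its own does not show that every part of every MultiPolygon is at least 1 m². A per-part check could look like the sketch below, written in plain Python with the shoelace formula. The `small_parts` helper and the sample feature are hypothetical, and holes are ignored for brevity.

```python
def ring_area(ring):
    """Unsigned shoelace area of an open ring given as (x, y) tuples."""
    pairs = zip(ring, ring[1:] + ring[:1])
    return abs(0.5 * sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs))


def small_parts(parts, min_area=1.0):
    """Return the areas of all parts smaller than min_area (in squared map units)."""
    return [a for a in (ring_area(r) for r in parts) if a < min_area]


# Hypothetical feature: a 300 m^2 parcel plus a 0.5 m^2 sliver.
feature = [
    [(0, 0), (30, 0), (30, 10), (0, 10)],
    [(40, 0), (40.5, 0), (40.5, 1), (40, 1)],
]
total = sum(ring_area(r) for r in feature)
print(total, small_parts(feature))
```

Here the whole feature measures 300.5 m², so a feature-level minimum would pass while the sliver remains. Checking each part separately is what exposes it.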
1:26