dc-l2-cairo-invalid-dedup
0:00
cairo_parcels_legacy.geojson is a parcel snapshot stitched together from three legacy provincial systems before unification, and I need to turn it into the foundation for our new central registry. Can you clean it up so we end up with one canonical record per parcel? When duplicates exist, keep the record with the earliest record_seq, and discard any polygon fragments smaller than 1 m².
Please write the result to parcels_canonical.geoparquet including these columns: parcel_id (the join key), parcel_class, district, and area_m2. Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers.
0:00
System info
{
  "adapter": "openrouter",
  "model": "google/gemma-4-26b-a4b-it",
  "base_url": "https://openrouter.ai/api/v1"
}
0:07
Bash
ls -lh cairo_parcels_legacy.geojson
-rw-r--r-- 1 agent agent 219K Jun 15 23:06 cairo_parcels_legacy.geojson
0:21
Write
/work/inspect_data.py
content · 18 lines · py
import geopandas as gpd


def inspect_data(file_path):
    gdf = gpd.read_file(file_path)
    print("--- Info ---")
    print(gdf.info())
    print("\n--- Head ---")
    print(gdf.head())
    print("\n--- Columns ---")
    print(gdf.columns)
    print("\n--- Geometry Types ---")
    print(gdf.geom_type.value_counts())
    print("\n--- Null Values ---")
    print(gdf.isnull().sum())


if __name__ == "__main__":
    inspect_data('cairo_parcels_legacy.geojson')
Wrote /work/inspect_data.py (461 chars).
0:25
Bash
python /work/inspect_data.py
--- Info ---
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 290 entries, 0 to 289
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   parcel_id     290 non-null    int32
 1   record_seq    290 non-null    int32
 2   parcel_class  290 non-null    str
 3   district      290 non-null    str
 4   area_m2       290 non-null    float64
 5   geometry      290 non-null    geometry
dtypes: float64(1), geometry(1), int32(2), str(2)
memory usage: 17.8 KB
None
--- Head ---
parcel_id ... geometry
0 1 ... POLYGON ((640000 815000, 640030 815000, 640030...
1 2 ... POLYGON ((640050 815000, 640080 815000, 640080...
2 3 ... POLYGON ((640100 815000, 640130 815000, 640130...
3 4 ... POLYGON ((640150 815000, 640180 815000, 640180...
4 5 ... POLYGON ((640200 815000, 640230 815000, 640230...
[5 rows x 6 columns]
--- Columns ---
Index(['parcel_id', 'record_seq', 'parcel_class', 'district', 'area_m2',
'geometry'],
dtype='str')
--- Geometry Types ---
Polygon 260
MultiPolygon 30
Name: count, dtype: int64
--- Null Values ---
parcel_id 0
record_seq 0
parcel_class 0
district 0
area_m2 0
geometry 0
dtype: int64
0:50
Write
/work/solve.py
content · 78 lines · py
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon


def clean_parcels(input_file, output_file):
    # 1. Load the data
    gdf = gpd.read_file(input_file)

    # 2. Remove null or empty geometries
    gdf = gdf[gdf.geometry.notnull() & ~gdf.geometry.is_empty]

    # 3. Fix invalid geometries
    gdf['geometry'] = gdf.geometry.make_valid()

    # 4. Discard any polygon fragments smaller than 1 m2
    # Note: A MultiPolygon might consist of several polygons.
    # The requirement says "discard any polygon fragments smaller than 1 m2".
    # This usually implies we should check the components of the MultiPolygons.
    # However, if a MultiPolygon is considered a single feature, we should clarify.
    # Given the instruction "discard any polygon fragments", let's handle individual polygons.
    def filter_small_fragments(geom):
        if geom.is_empty:
            return None
        if isinstance(geom, Polygon):
            return geom if geom.area >= 1.0 else None
        elif isinstance(geom, MultiPolygon):
            # Keep only the parts of the MultiPolygon that are >= 1 m2
            parts = [p for p in geom.geoms if p.area >= 1.0]
            if not parts:
                return None
            return MultiPolygon(parts)
        return geom

    gdf['geometry'] = gdf['geometry'].apply(filter_small_fragments)
    gdf = gdf[gdf.geometry.notnull() & ~gdf.geometry.is_empty]

    # 5. Handle duplicates: one canonical record per parcel.
    # Keep the record with the earliest record_seq.
    gdf = gdf.sort_values('record_seq')
    gdf = gdf.drop_duplicates(subset=['parcel_id'], keep='first')

    # 6. Ensure every feature is a MultiPolygon
    def ensure_multipolygon(geom):
        if isinstance(geom, Polygon):
            return MultiPolygon([geom])
        return geom

    gdf['geometry'] = gdf['geometry'].apply(ensure_multipolygon)

    # 7. Remove duplicate geometries
    # We use a subset of geometry to identify duplicates
    gdf = gdf.drop_duplicates(subset=['geometry'])

    # 8. Final cleaning: remove any remaining empty/invalid (safety check)
    gdf = gdf[gdf.geometry.notnull() & ~gdf.geometry.is_empty]
    gdf['geometry'] = gdf.geometry.make_valid()
    # Ensure we only have MultiPolygons after make_valid (if make_valid returned Polygons)
    gdf['geometry'] = gdf['geometry'].apply(ensure_multipolygon)

    # 9. Select final columns
    # parcel_id, parcel_class, district, and area_m2
    # We should recalculate area_m2 to ensure it matches the cleaned geometry?
    # The instruction says "including these columns". It doesn't explicitly say
    # to recalculate, but usually, in a registry, area should be accurate.
    # However, the original 'area_m2' might be useful. Let's see.
    # Let's recalculate to be safe and consistent with the cleaned geometry.
    gdf['area_m2'] = gdf.geometry.area
    final_gdf = gdf[['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']]

    # 10. Save to GeoParquet
    final_gdf.to_parquet(output_file)
    print(f"Successfully saved {len(final_gdf)} records to {output_file}")


if __name__ == "__main__":
    clean_parcels('cairo_parcels_legacy.geojson', 'parcels_canonical.geoparquet')
Wrote /work/solve.py (3257 chars).
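Editor's note: in solve.py, filter_small_fragments and ensure_multipolygon pass any geometry that is neither a Polygon nor a MultiPolygon through unchanged. make_valid can return a GeometryCollection that mixes polygons with lines or points, so such a result would reach the output as a non-MultiPolygon. Whether that happens for this dataset is not shown in the transcript. The sketch below, assuming Shapely 2.x, shows one way to keep only the polygonal parts; to_multipolygon is a hypothetical helper and is not part of the agent's script.

```python
from shapely.geometry import GeometryCollection, LineString, MultiPolygon, Polygon


def to_multipolygon(geom, min_area=1.0):
    """Hypothetical helper: collect polygonal parts with area >= min_area
    into a MultiPolygon; return None if nothing polygonal survives."""
    if geom is None or geom.is_empty:
        return None
    if isinstance(geom, Polygon):
        parts = [geom]
    elif isinstance(geom, (MultiPolygon, GeometryCollection)):
        parts = []
        for sub in geom.geoms:
            # Recurse without an area filter; the threshold is applied once below.
            converted = to_multipolygon(sub, min_area=0.0)
            if converted is not None:
                parts.extend(converted.geoms)
    else:
        # Lines and points carry no area and are dropped.
        return None
    parts = [p for p in parts if p.area >= min_area]
    return MultiPolygon(parts) if parts else None


# Illustrative input: a 10 x 10 parcel, a stray line, and a 0.25 m2 sliver.
mixed = GeometryCollection([
    Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
    LineString([(0, 0), (5, 5)]),
    Polygon([(20, 20), (20.5, 20), (20.5, 20.5), (20, 20.5)]),
])
cleaned = to_multipolygon(mixed)
print(cleaned.geom_type, len(cleaned.geoms))
```

Applying one helper like this after make_valid would replace both the fragment filter and the MultiPolygon promotion, at the cost of silently dropping line and point debris.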
0:56
Bash
python /work/solve.py
Successfully saved 210 records to parcels_canonical.geoparquet
1:17
Write
/work/verify.py
content · 51 lines · py
import geopandas as gpd


def verify_output(file_path):
    gdf = gpd.read_parquet(file_path)
    print("--- Info ---")
    print(gdf.info())
    print("\n--- Head ---")
    print(gdf.head())

    print("\n--- Column Check ---")
    expected_cols = {'parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry'}
    actual_cols = set(gdf.columns)
    print(f"Expected: {expected_cols}")
    print(f"Actual: {actual_cols}")
    assert expected_cols.issubset(actual_cols), "Missing columns!"

    print("\n--- Geometry Type Check ---")
    geom_types = gdf.geom_type.unique()
    print(f"Geometry types: {geom_types}")
    assert all(t == 'MultiPolygon' for t in geom_types), f"Unexpected geometry types: {geom_types}"

    print("\n--- Area Check ---")
    min_area = gdf.geometry.area.min()
    print(f"Minimum area: {min_area}")
    assert min_area >= 1.0, f"Found area smaller than 1m2: {min_area}"

    print("\n--- Null/Empty Check ---")
    print(f"Null geometries: {gdf.geometry.isnull().sum()}")
    print(f"Empty geometries: {gdf.geometry.is_empty.sum()}")
    assert gdf.geometry.isnull().sum() == 0, "Null geometries found!"
    assert gdf.geometry.is_empty.sum() == 0, "Empty geometries found!"

    print("\n--- Duplicate Geometry Check ---")
    # Note: This might fail if different parcels have the same geometry (unlikely but possible)
    # but the requirement says "no duplicate geometries".
    dupes = gdf.geometry.duplicated().sum()
    print(f"Duplicate geometries: {dupes}")
    assert dupes == 0, "Duplicate geometries found!"

    print("\n--- Duplicate Parcel ID Check ---")
    dupe_ids = gdf['parcel_id'].duplicated().sum()
    print(f"Duplicate parcel_ids: {dupe_ids}")
    assert dupe_ids == 0, "Duplicate parcel_ids found!"

    print("\nAll checks passed!")


if __name__ == "__main__":
    verify_output('parcels_canonical.geoparquet')
Wrote /work/verify.py (1890 chars).
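Editor's note: verify.py measures area per feature and never tests is_valid, although the request asks for no invalid geometries and no slivers. A MultiPolygon made of a large parcel plus a tiny fragment would pass the feature-level area check. The sketch below, assuming Shapely 2.x, shows what a per-part check might look like; part_problems is a hypothetical helper and the sample geometries are illustrative, not taken from the dataset.

```python
from shapely.geometry import MultiPolygon, Polygon


def part_problems(geom, min_area=1.0):
    """Hypothetical helper: list the ways one geometry violates the output rules."""
    if geom is None or geom.is_empty:
        return ["null or empty"]
    problems = []
    if geom.geom_type != "MultiPolygon":
        problems.append(f"type is {geom.geom_type}")
    if not geom.is_valid:
        problems.append("invalid")
    # Inspect each member polygon, not just the feature as a whole.
    parts = list(geom.geoms) if geom.geom_type.startswith("Multi") else [geom]
    small = [p for p in parts if p.area < min_area]
    if small:
        problems.append(f"{len(small)} part(s) under {min_area} m2")
    return problems


parcel = Polygon([(0, 0), (30, 0), (30, 20), (0, 20)])  # 600 m2
sliver = Polygon([(100, 100), (100.5, 100), (100.5, 100.5), (100, 100.5)])  # 0.25 m2
with_sliver = MultiPolygon([parcel, sliver])

print(with_sliver.area >= 1.0)       # the feature-level check alone would pass
print(part_problems(with_sliver))    # the per-part check reports the sliver
```

On a GeoDataFrame this could be applied with gdf.geometry.apply(part_problems) and asserted to be empty for every row.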
1:20
Bash
python /work/verify.py
--- Info ---
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   parcel_id     210 non-null    int32
 1   parcel_class  210 non-null    str
 2   district      210 non-null    str
 3   area_m2       210 non-null    float64
 4   geometry      210 non-null    geometry
dtypes: float64(1), geometry(1), int32(1), str(2)
memory usage: 12.2 KB
None
--- Head ---
parcel_id ... geometry
0 1 ... MULTIPOLYGON (((640000 815000, 640030 815000, ...
1 2 ... MULTIPOLYGON (((640050 815000, 640080 815000, ...
2 3 ... MULTIPOLYGON (((640100 815000, 640130 815000, ...
3 4 ... MULTIPOLYGON (((640150 815000, 640180 815000, ...
4 5 ... MULTIPOLYGON (((640200 815000, 640230 815000, ...
[5 rows x 5 columns]
--- Column Check ---
Expected: {'parcel_id', 'district', 'parcel_class', 'geometry', 'area_m2'}
Actual: {'district', 'parcel_class', 'geometry', 'area_m2', 'parcel_id'}
--- Geometry Type Check ---
Geometry types: <ArrowStringArray>
['MultiPolygon']
Length: 1, dtype: str
--- Area Check ---
Minimum area: 600.0
--- Null/Empty Check ---
Null geometries: 0
Empty geometries: 0
--- Duplicate Geometry Check ---
Duplicate geometries: 0
--- Duplicate Parcel ID Check ---
Duplicate parcel_ids: 0
All checks passed!
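Editor's note: the zero-duplicate result depends on how geometries are compared. If the comparison works on the exact coordinate sequence (an assumption about the duplicated and drop_duplicates calls used here), two records that trace the same parcel from a different start vertex or in the opposite direction would not be counted as duplicates. The sketch below, assuming Shapely 2.x, builds a comparison key from the normalized geometry; geometry_key is a hypothetical helper and the squares are illustrative.

```python
from shapely.geometry import Polygon


def geometry_key(geom):
    """Hypothetical helper: WKB of the normalized geometry, so that vertex
    order and ring direction do not affect equality."""
    return geom.normalize().wkb


a = Polygon([(0, 0), (30, 0), (30, 20), (0, 20)])
b = Polygon([(30, 0), (30, 20), (0, 20), (0, 0)])  # same ring, different start vertex

print(a.wkb == b.wkb)                        # raw bytes differ
print(geometry_key(a) == geometry_key(b))    # normalized keys match
```

Deduplicating on a column of such keys, after sorting by record_seq, would keep the earliest record among geometrically identical parcels. It still would not catch near-duplicates whose coordinates differ by small offsets.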
1:30