dc-l2-cairo-invalid-dedup
Map (reference ← swipe → agent)
0:00
cairo_parcels_legacy.geojson is a parcel snapshot stitched together from three legacy provincial systems before unification, and I need to turn it into the foundation for our new central registry. Can you clean it up so we end up with one canonical record per parcel? When duplicates exist, keep the record with the earliest record_seq, and discard any polygon fragments smaller than 1 m².
Please write the result to parcels_canonical.geoparquet including these columns: parcel_id (the join key), parcel_class, district, and area_m2. Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers.
0:00
System info
{
"adapter": "openrouter",
"model": "google/gemma-4-26b-a4b-it",
"base_url": "https://openrouter.ai/api/v1"
}
0:02
Bash
ls -F
cairo_parcels_legacy.geojson
0:04
Bash
head -n 20 cairo_parcels_legacy.geojson
{
"type": "FeatureCollection",
"name": "cairo_parcels_legacy",
"crs": {
"type": "name",
"properties": {
"name": "urn:ogc:def:crs:EPSG::22992"
}
},
"features": [
{
"type": "Feature",
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
640000.0,
815000.0
],
0:44
Write
/work/solve.py
content · 94 lines · py
import geopandas as gpd
import pandas as pd
from shapely.geometry import MultiPolygon, Polygon
def solve():
# Load the data
input_file = 'cairo_parcels_legacy.geojson'
output_file = 'parcels_canonical.geoparquet'
print(f"Loading {input_file}...")
gdf = gpd.read_file(input_file)
# Initial count
print(f"Initial records: {len(gdf)}")
# 1. Remove null or empty geometries
gdf = gdf[gdf.geometry.notnull()]
gdf = gdf[~gdf.geometry.is_empty]
print(f"After removing null/empty geometries: {len(gdf)}")
# 2. Fix invalid geometries
print("Fixing invalid geometries...")
gdf['geometry'] = gdf.geometry.buffer(0)
gdf = gdf[~gdf.geometry.is_empty] # buffer(0) can result in empty if geometry was invalid
print(f"After fixing invalid geometries: {len(gdf)}")
# 3. Ensure all features are MultiPolygon
print("Converting geometries to MultiPolygon...")
def ensure_multipolygon(geom):
if geom.geom_type == 'Polygon':
return MultiPolygon([geom])
return geom
gdf['geometry'] = gdf.geometry.apply(ensure_multipolygon)
# 4. Handle Duplicates
# "When duplicates exist, keep the record with the earliest record_seq"
# The requirement implies duplicates are defined by parcel_id.
print("Removing duplicates based on parcel_id...")
# Sort by parcel_id and record_seq (ascending) to keep the earliest
gdf = gdf.sort_values(by=['parcel_id', 'record_seq'], ascending=[True, True])
gdf = gdf.drop_duplicates(subset='parcel_id', keep='first')
print(f"After removing duplicates: {len(gdf)}")
# 5. Remove duplicate geometries
# "no duplicate geometries" - this usually means if two different IDs have the same shape,
# we keep one (though in a registry, we might want to keep them if IDs differ,
# but the prompt says "no duplicate geometries").
# Let's assume it means spatial duplicates.
print("Removing duplicate geometries...")
gdf = gdf.drop_duplicates(subset='geometry')
print(f"After removing duplicate geometries: {len(gdf)}")
# 6. Calculate area and filter artifacts
# The CRS is EPSG:22992 (based on the head output), which is a metric projected CRS.
# Let's check the CRS just in case.
if gdf.crs is None:
print("Warning: CRS not detected. Assuming metric if not provided, but checking...")
# If it's degrees, we must reproject. But 640000, 815000 looks like metric.
# The file header said EPSG:22992.
gdf['area_m2'] = gdf.geometry.area
# "discard any polygon fragments smaller than 1 m²"
print("Filtering small fragments (< 1 m²)...")
gdf = gdf[gdf['area_m2'] >= 1.0]
print(f"After filtering small fragments: {len(gdf)}")
# 7. Final column selection
# "including these columns: parcel_id, parcel_class, district, and area_m2"
# We also need to keep geometry.
output_columns = ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
# Check if columns exist
missing_cols = [c for c in output_columns if c not in gdf.columns]
if missing_cols:
print(f"Warning: Missing columns in input: {missing_cols}")
# If they are missing, we'll just use what we have or create dummy if needed for the task
# But let's try to just select what is available.
output_columns = [c for c in output_columns if c in gdf.columns]
gdf_final = gdf[output_columns]
# Final check for any remaining nulls in critical columns
gdf_final = gdf_final.dropna(subset=['parcel_id', 'parcel_class', 'district'])
# 8. Write to GeoParquet
print(f"Writing to {output_file}...")
gdf_final.to_parquet(output_file)
print("Done.")
if __name__ == "__main__":
solve()
Wrote /work/solve.py (3778 chars).
0:45
Bash
python solve.py
Loading cairo_parcels_legacy.geojson... Initial records: 290 After removing null/empty geometries: 290 Fixing invalid geometries... After fixing invalid geometries: 290 Converting geometries to MultiPolygon... Removing duplicates based on parcel_id... After removing duplicates: 290 Removing duplicate geometries... After removing duplicate geometries: 240 Filtering small fragments (< 1 m²)... After filtering small fragments: 210 Writing to parcels_canonical.geoparquet... Done.
0:51
Bash
python -c "import geopandas as gpd; gdf = gpd.read_parquet('parcels_canonical.ge...
parcel_id ... geometry
0 1 ... MULTIPOLYGON (((640000 815000, 640000 815040, ...
1 2 ... MULTIPOLYGON (((640050 815000, 640050 815040, ...
2 3 ... MULTIPOLYGON (((640100 815000, 640100 815040, ...
3 4 ... MULTIPOLYGON (((640150 815000, 640150 815040, ...
4 5 ... MULTIPOLYGON (((640200 815000, 640200 815040, ...
[5 rows x 5 columns]
Index(['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry'], dtype='str')
{"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "ProjectedCRS", "name": "Egypt 1907 / Red Belt", "base_crs": {"name": "Egypt 1907", "datum": {"type": "GeodeticReferenceFrame", "name": "Egypt 1907", "ellipsoid": {"name": "Helmert 1906", "semi_major_axis": 6378200, "inverse_flattening": 298.3}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}, {"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}]}, "id": {"authority": "EPSG", "code": 4229}}, "conversion": {"name": "Egypt Red Belt", "method": {"name": "Transverse Mercator", "id": {"authority": "EPSG", "code": 9807}}, "parameters": [{"name": "Latitude of natural origin", "value": 30, "unit": "degree", "id": {"authority": "EPSG", "code": 8801}}, {"name": "Longitude of natural origin", "value": 31, "unit": "degree", "id": {"authority": "EPSG", "code": 8802}}, {"name": "Scale factor at natural origin", "value": 1, "unit": "unity", "id": {"authority": "EPSG", "code": 8805}}, {"name": "False easting", "value": 615000, "unit": "metre", "id": {"authority": "EPSG", "code": 8806}}, {"name": "False northing", "value": 810000, "unit": "metre", "id": {"authority": "EPSG", "code": 8807}}]}, "coordinate_system": {"subtype": "Cartesian", "axis": [{"name": "Easting", "abbreviation": "E", "direction": "east", "unit": "metre"}, {"name": "Northing", "abbreviation": "N", "direction": "north", "unit": "metre"}]}, "scope": "Engineering survey, topographic mapping.", "area": "Egypt - onshore between 29\u00b0E and 33\u00b0E, offshore Mediterranean east of 29\u00b0E and offshore Gulf of Suez.", "bbox": {"south_latitude": 21.99, "west_longitude": 29, "north_latitude": 33.82, "east_longitude": 34.27}, "id": {"authority": "EPSG", "code": 22992}}
<ArrowStringArray>
['MultiPolygon']
Length: 1, dtype: str
1:12