dc-l2-cairo-invalid-dedup
Map (reference ← swipe → agent)
0:00
cairo_parcels_legacy.geojson is a parcel snapshot stitched together from three legacy provincial systems before unification, and I need to turn it into the foundation for our new central registry. Can you clean it up so we end up with one canonical record per parcel? When duplicates exist, keep the record with the earliest record_seq, and discard any polygon fragments smaller than 1 m².
Please write the result to parcels_canonical.geoparquet including these columns: parcel_id (the join key), parcel_class, district, and area_m2. Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers.
0:00
System info
{
"adapter": "openrouter",
"model": "deepseek/deepseek-v4-flash",
"base_url": "https://openrouter.ai/api/v1"
}
0:08
Let me start by reading the input file to understand its structure.
0:08
Bash
ls -la /work/cairo_parcels_legacy.geojson
-rw-r--r-- 1 agent agent 223991 Jun 16 16:41 /work/cairo_parcels_legacy.geojson
0:13
Let me look at the data first to understand its structure.
0:13
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_file('cairo_parcels_legacy.geojson')
print('Shape:', gdf.shape)
print('Columns:', gdf.columns.tolist())
print('CRS:', gdf.crs)
print('Dtypes:')
print(gdf.dtypes)
print('---')
print(gdf.head(3).to_string())
print('---')
print('Null geometries:', gdf.geometry.isna().sum())
print('Invalid geometries:', (~gdf.geometry.is_valid).sum())
print('Duplicate geometries:', gdf.geometry.duplicated().sum())
print('---')
print('Any empty geometries:', (gdf.geometry.is_empty).sum())
print('---')
print('Geometry types:', gdf.geometry.type.value_counts())
"
Shape: (290, 6) Columns: ['parcel_id', 'record_seq', 'parcel_class', 'district', 'area_m2', 'geometry'] CRS: EPSG:22992 Dtypes: parcel_id int32 record_seq int32 parcel_class str district str area_m2 float64 geometry geometry dtype: object --- parcel_id record_seq parcel_class district area_m2 geometry 0 1 1 commercial Giza-East 1200.0 POLYGON ((640000 815000, 640030 815000, 640030 815040, 640000 815040, 640000 815000)) 1 2 2 industrial Qalyubia-South 1200.0 POLYGON ((640050 815000, 640080 815000, 640080 815040, 640050 815040, 640050 815000)) 2 3 3 agricultural Cairo-Central 1200.0 POLYGON ((640100 815000, 640130 815000, 640130 815040, 640100 815040, 640100 815000)) --- Null geometries: 0 Invalid geometries: 20 Duplicate geometries: 50 --- Any empty geometries: 0 --- Geometry types: Polygon 260 MultiPolygon 30 Name: count, dtype: int64
0:22
Let me understand the data better before writing the solution. We have: - 290 rows - Columns: parcel_id, record_seq, parcel_class, district, area_m2, geometry - CRS: EPSG:22992 (which appears to be a metric CRS - Egypt Red Belt / Belt (deg) - but wait, the coordinates look like 640000, 815000 which are in meters, so it could be EPSG:22992 which is "Egypt Red Belt / Belt" with meters. Actually let me check.) - 20 invalid geometries - 50 duplicate geometries - Geometry types: Polygon (260) and MultiPolygon (30) - Null geometries: 0 - Empty geometries: 0 The task says: 1. One canonical record per parcel - when duplicates exist (same parcel_id?), keep the one with earliest record_seq 2. Discard polygon fragments smaller than 1 m² 3. Output to parcels_canonical.geoparquet with columns: parcel_id, parcel_class, district, area_m2 4. Every feature must be a MultiPolygon 5. No null/empty geometries, no invalid geometries, no duplicate geometries, no artifact slivers Let me check what "duplicates" means - I see 50 duplicate geometries. Let me check more about the data. Actually, the user says "one canonical record per parcel" - so I think parcel_id is the unique identifier and we need to deduplicate by parcel_id keeping the earliest record_seq. Let me check for duplicate parcel_ids.
0:22
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_file('cairo_parcels_legacy.geojson')
# Check for duplicate parcel_ids
dup_ids = gdf[gdf.duplicated(subset='parcel_id', keep=False)]
print('Duplicate parcel_id count:', gdf.duplicated(subset='parcel_id').sum())
print('Rows with dup parcel_id:', len(dup_ids))
print()
print(dup_ids[['parcel_id','record_seq','parcel_class','district','area_m2']].to_string())
print()
# Check the invalid geometries
invalid = gdf[~gdf.geometry.is_valid]
print('Invalid geometries:')
print(invalid[['parcel_id','record_seq','area_m2']].head(10).to_string())
print()
# Check area distribution
print('Area stats:')
print(gdf['area_m2'].describe())
print()
print('Min geometry area:', gdf.geometry.area.min())
print('Max geometry area:', gdf.geometry.area.max())
print()
# Small areas
small = gdf[gdf.geometry.area < 1.0]
print('Geometries < 1 m²:', len(small))
if len(small) > 0:
print(small[['parcel_id','record_seq','area_m2']].head(10).to_string())
"
Duplicate parcel_id count: 0
Rows with dup parcel_id: 0
Empty DataFrame
Columns: [parcel_id, record_seq, parcel_class, district, area_m2]
Index: []
Invalid geometries:
parcel_id record_seq area_m2
6 7 7 1200.0
16 17 17 1200.0
22 23 23 1200.0
34 35 35 1200.0
40 41 41 1200.0
52 53 53 1200.0
57 58 58 1200.0
70 71 71 1200.0
75 76 76 1200.0
88 89 89 1200.0
Area stats:
count 290.000000
mean 1075.912759
std 365.934414
min 0.490000
25% 1200.000000
50% 1200.000000
75% 1200.000000
max 1200.000000
Name: area_m2, dtype: float64
Min geometry area: 0.0
Max geometry area: 1280.0
Geometries < 1 m²: 50
parcel_id record_seq area_m2
6 7 7 1200.0
16 17 17 1200.0
22 23 23 1200.0
34 35 35 1200.0
40 41 41 1200.0
52 53 53 1200.0
57 58 58 1200.0
70 71 71 1200.0
75 76 76 1200.0
88 89 89 1200.0
0:33
Interesting. So there are 0 duplicate parcel_ids, but 50 duplicate geometries and 20 invalid geometries, and many small geometries. Wait - the problem says "When duplicates exist, keep the record with the earliest record_seq". Since there are no duplicate parcel_ids, but there are 50 duplicate geometries... Let me think about what "duplicates" means here. The user says "one canonical record per parcel". The `parcel_id` is described as "the join key". Let me check more carefully - maybe some parcel_ids differ but geometries are duplicated. Let me look at the duplicate geometry records more carefully. Also, 50 geometries have area < 1 m² (the ones with area 0.49), and there are 50 duplicate geometries. Let me check if these are the same records.
0:33
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_file('cairo_parcels_legacy.geojson')
# Check duplicate geometries - which ones?
dup_geom_mask = gdf.geometry.duplicated(keep=False)
dup_geom = gdf[dup_geom_mask]
print('Rows with duplicate geometry:', len(dup_geom))
print(dup_geom[['parcel_id','record_seq','area_m2']].sort_values('area_m2').to_string())
print()
# Check if duplicate geometries are also the small ones
small_mask = gdf.geometry.area < 1.0
print('Small geom rows:', small_mask.sum())
print('Overlap:', (dup_geom_mask & small_mask).sum())
print()
# Are there any duplicate parcel_ids scattered?
print('Unique parcel_ids:', gdf['parcel_id'].nunique())
print('Total rows:', len(gdf))
print()
# Let's check what these duplicate geometries look like
dup_groups = gdf.groupby(gdf.geometry.wkt).filter(lambda x: len(x) > 1)
print('Grouped by geometry - groups with count>1:')
print(dup_groups.groupby(gdf.geometry.wkt).size().describe())
print()
# Check the invalid geometries
invalid = gdf[~gdf.geometry.is_valid]
print('Invalid geom records:')
print(invalid[['parcel_id','record_seq','parcel_class','district','area_m2']].to_string())
"
Rows with duplicate geometry: 100
parcel_id record_seq area_m2
0 1 1 1200.0
1 2 2 1200.0
2 3 3 1200.0
3 4 4 1200.0
4 5 5 1200.0
5 6 6 1200.0
7 8 8 1200.0
8 9 9 1200.0
9 10 10 1200.0
11 12 12 1200.0
12 13 13 1200.0
14 15 15 1200.0
15 16 16 1200.0
17 18 18 1200.0
19 20 20 1200.0
20 21 21 1200.0
21 22 22 1200.0
23 24 24 1200.0
24 25 25 1200.0
25 26 26 1200.0
26 27 27 1200.0
27 28 28 1200.0
29 30 30 1200.0
30 31 31 1200.0
32 33 33 1200.0
33 34 34 1200.0
35 36 36 1200.0
37 38 38 1200.0
38 39 39 1200.0
39 40 40 1200.0
41 42 42 1200.0
42 43 43 1200.0
43 44 44 1200.0
44 45 45 1200.0
45 46 46 1200.0
47 48 48 1200.0
48 49 49 1200.0
50 51 51 1200.0
51 52 52 1200.0
53 54 54 1200.0
55 56 56 1200.0
56 57 57 1200.0
58 59 59 1200.0
59 60 60 1200.0
60 61 61 1200.0
61 62 62 1200.0
62 63 63 1200.0
63 64 64 1200.0
65 66 66 1200.0
66 67 67 1200.0
210 900001 211 1200.0
211 900002 212 1200.0
212 900003 213 1200.0
213 900004 214 1200.0
214 900005 215 1200.0
215 900006 216 1200.0
216 900007 217 1200.0
217 900008 218 1200.0
218 900009 219 1200.0
219 900010 220 1200.0
220 900011 221 1200.0
221 900012 222 1200.0
222 900013 223 1200.0
223 900014 224 1200.0
224 900015 225 1200.0
225 900016 226 1200.0
226 900017 227 1200.0
227 900018 228 1200.0
228 900019 229 1200.0
229 900020 230 1200.0
230 900021 231 1200.0
231 900022 232 1200.0
232 900023 233 1200.0
233 900024 234 1200.0
234 900025 235 1200.0
235 900026 236 1200.0
236 900027 237 1200.0
237 900028 238 1200.0
238 900029 239 1200.0
239 900030 240 1200.0
240 900031 241 1200.0
241 900032 242 1200.0
242 900033 243 1200.0
243 900034 244 1200.0
244 900035 245 1200.0
245 900036 246 1200.0
246 900037 247 1200.0
247 900038 248 1200.0
248 900039 249 1200.0
249 900040 250 1200.0
250 900041 251 1200.0
251 900042 252 1200.0
252 900043 253 1200.0
253 900044 254 1200.0
254 900045 255 1200.0
255 900046 256 1200.0
256 900047 257 1200.0
257 900048 258 1200.0
258 900049 259 1200.0
259 900050 260 1200.0
Small geom rows: 50
Overlap: 0
Unique parcel_ids: 290
Total rows: 290
Traceback (most recent call last):
File "<string>", line 24, in <module>
dup_groups = gdf.groupby(gdf.geometry.wkt).filter(lambda x: len(x) > 1)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.14/site-packages/pandas/core/generic.py", line 6206, in __getattr__
return object.__getattribute__(self, name)
~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
AttributeError: 'GeoSeries' object has no attribute 'wkt'
1:08
Let me understand the data: 1. 290 rows, 290 unique parcel_ids - so no duplicate parcel_id. But there are 100 rows with duplicate geometries (50 pairs?). 2. 50 geometries are < 1 m² - these are the "artifact slivers" we need to drop. 3. 20 invalid geometries need fixing. 4. The task says "When duplicates exist, keep the record with the earliest record_seq" - but there are no duplicate parcel_ids. Let me re-examine - maybe the "duplicates" are based on geometry? The user says "parcel_id (the join key)" - so parcel_id is the key. Wait - 290 rows, 290 unique parcel_ids. So no actual duplicate parcel_ids. But the task says to keep the one with earliest record_seq when duplicates exist. Since there are no real parcel_id duplicates, this step is a no-op. Let me re-read more carefully. There are 100 rows with duplicate geometries (50 pairs?). But the geometries themselves being duplicated means some parcels share the exact same geometry - that would make them "duplicates" in the spatial sense. Actually, I'm seeing 100 rows have duplicate geometries. That's 50 original + 50 duplicates - so 50 geometries are duplicated exactly once. Plus 50 small ones. Let me think about what the instructions mean by "duplicates" and "discard any polygon fragments smaller than 1 m²". The flow should be: 1. Some parcel_id values appear more than once (duplicate parcel_ids). But the data shows all 290 parcel_ids are unique. Hmm. Wait, let me re-examine. The user says "When duplicates exist, keep the record with the earliest record_seq". This is clearly talking about the same parcel appearing multiple times (same parcel_id). But the data shows all parcel_ids are unique. Let me check again... Actually, looking at the output - `duplicate parcel_id count: 0` and `Unique parcel_ids: 290` out of 290 rows. So there are NO duplicate parcel_ids in this data. So the dedup step by record_seq is a no-op. But the task says "When duplicates exist" as a conditional - so if there are none, we just skip that step. So the actual work is: 1. Make all geometries valid (20 invalid) 2. Remove geometries < 1 m² (50 geometries) 3. Handle duplicate geometries (100 rows with duplicate geometries) 4. Convert all to MultiPolygon 5. Ensure no null/empty geometries Wait, what should happen with duplicate geometries? The user says "There should be... no duplicate geometries". So we should drop exact duplicate geometry records. Actually, let me re-read more carefully. The duplicate geometries are across different parcel_ids. If parcel_id 1 and parcel_id 900001 have the same geometry, and both have different parcel_ids... The task says "one canonical record per parcel" meaning one per parcel_id. Since each parcel_id is unique already, we'd keep both even if they share geometry. But the final output says "no duplicate geometries" - so we need to drop exact duplicate geometries. Hmm, but dropping duplicate geometries would mean keeping only one of the two parcels that share the same geometry. Which one? Since the task already said "When duplicates exist, keep the record with the earliest record_seq" - this could apply to both parcel_id duplicates AND geometry duplicates. Let me interpret it as: when multiple records have the same geometry, keep the one with the smallest record_seq. Wait actually re-reading: "When duplicates exist, keep the record with the earliest record_seq" - this could mean when there are duplicate records (same parcel_id), keep earliest. And separately, "no duplicate geometries" means remove exact geometric duplicates. Let me look at this more carefully. There are 50 geometries < 1 m² (these are the slivers to discard). The duplicate geometries are 100 rows (so 50 unique geometries each appearing twice). Let me check if the small ones overlap with the dup geometries. From the data: "Small geom rows: 50" and "Overlap: 0" - so the 50 small geometries and the 100 duplicate-geometry rows are completely separate sets. So plan: 1. Handle invalid geometries (20) - make them valid 2. Drop geometries < 1 m² (50) - these are artifact slivers 3. For duplicate geometries (same exact geometry appearing more than once), keep the one with the lowest record_seq 4. Convert all to MultiPolygon 5. Ensure no null/empty geometries 6. Recalculate area_m2 from the geometry 7. Write to GeoParquet Let me look at what's invalid about those 20 geometries. Actually, let me check whether the "duplicate geometries" are intentional or should also be handled. The task says "no duplicate geometries" in the final output. So let me implement this. Let me also check the invalid geometries more carefully.
1:08
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_file('cairo_parcels_legacy.geojson')
# Check invalid geometry types
invalid = gdf[~gdf.geometry.is_valid]
for idx, row in invalid.iterrows():
g = row.geometry
print(f'parcel_id={row.parcel_id}, type={g.geom_type}, area={g.area:.6f}')
from shapely.validation import explain_validity
print(f' validity: {explain_validity(g)}')
print()
# Check for duplicate geometry records - what's the record_seq pattern
dup_mask = gdf.geometry.duplicated(keep=False)
dup = gdf[dup_mask].sort_values('geometry')
print('Duplicate geometry groups sample:')
# Group by geometry wkt
from shapely import to_wkt
groups = {}
for idx, row in dup.iterrows():
wkt = to_wkt(row.geometry, rounding_precision=6)
if wkt not in groups:
groups[wkt] = []
groups[wkt].append((row.parcel_id, row.record_seq, row.parcel_class, row.district))
for wkt, recs in list(groups.items())[:3]:
print(f' Geometry: {wkt[:80]}...')
for r in recs:
print(f' parcel_id={r[0]}, record_seq={r[1]}, class={r[2]}, district={r[3]}')
print()
"
parcel_id=7, type=Polygon, area=0.000000
validity: Self-intersection[640315 815020]
parcel_id=17, type=Polygon, area=0.000000
validity: Self-intersection[640065 815080]
parcel_id=23, type=Polygon, area=0.000000
validity: Self-intersection[640365 815080]
parcel_id=35, type=Polygon, area=0.000000
validity: Self-intersection[640215 815140]
parcel_id=41, type=Polygon, area=0.000000
validity: Self-intersection[640515 815140]
parcel_id=53, type=Polygon, area=0.000000
validity: Self-intersection[640365 815200]
parcel_id=58, type=Polygon, area=0.000000
validity: Self-intersection[640615 815200]
parcel_id=71, type=Polygon, area=0.000000
validity: Self-intersection[640515 815260]
parcel_id=76, type=Polygon, area=0.000000
validity: Self-intersection[640015 815320]
parcel_id=89, type=Polygon, area=0.000000
validity: Self-intersection[640665 815320]
parcel_id=94, type=Polygon, area=0.000000
validity: Self-intersection[640165 815380]
parcel_id=107, type=Polygon, area=0.000000
validity: Self-intersection[640065 815440]
parcel_id=112, type=Polygon, area=0.000000
validity: Self-intersection[640315 815440]
parcel_id=125, type=Polygon, area=0.000000
validity: Self-intersection[640215 815500]
parcel_id=130, type=Polygon, area=0.000000
validity: Self-intersection[640465 815500]
parcel_id=143, type=Polygon, area=0.000000
validity: Self-intersection[640365 815560]
parcel_id=148, type=Polygon, area=0.000000
validity: Self-intersection[640615 815560]
parcel_id=161, type=Polygon, area=0.000000
validity: Self-intersection[640515 815620]
parcel_id=166, type=Polygon, area=0.000000
validity: Self-intersection[640015 815680]
parcel_id=179, type=Polygon, area=0.000000
validity: Self-intersection[640665 815680]
Duplicate geometry groups sample:
Geometry: POLYGON ((640000 815000, 640030 815000, 640030 815040, 640000 815040, 640000 815...
parcel_id=1, record_seq=1, class=commercial, district=Giza-East
parcel_id=900001, record_seq=211, class=industrial, district=Giza-East
Geometry: POLYGON ((640050 815000, 640080 815000, 640080 815040, 640050 815040, 640050 815...
parcel_id=900002, record_seq=212, class=agricultural, district=Qalyubia-South
parcel_id=2, record_seq=2, class=industrial, district=Qalyubia-South
Geometry: POLYGON ((640000 815060, 640030 815060, 640030 815100, 640000 815100, 640000 815...
parcel_id=16, record_seq=16, class=residential, district=Giza-East
parcel_id=900013, record_seq=223, class=commercial, district=Giza-East
1:46
Let me understand the data: 1. **Duplicate geometries**: There are duplicate geometries where the same polygon appears for two different parcel_ids. One has low record_seq (e.g., record_seq=1), the other has high record_seq (e.g., record_seq=211). The instruction says: "keep the record with the earliest record_seq" - so when duplicate geometries exist, keep only the one with the lowest record_seq. 2. **Duplicate parcel_ids**: None. Each row has unique parcel_id. So the duplication is about geometry duplicates. 3. **Invalid geometries**: 20 geometries, all self-intersecting (bow-tie polygons). Need to fix (make valid). 4. **Small geometries (< 1 m²)**: 50 geometries with area = 0.0 (the invalid ones). Need to be discarded. 5. **Geometry types**: 260 Polygons and 30 MultiPolygons. Need all to be MultiPolygons. Now let me think about the pipeline: 1. Start with 290 records 2. Fix invalid geometries (make valid using `buffer(0)` or similar) 3. Remove geometries < 1 m² area 4. For duplicate geometries, keep only the one with the lowest record_seq 5. Ensure all features are MultiPolygons 6. Output the specified columns: parcel_id, parcel_class, district, area_m2 Wait, but the duplicate geometries have different parcel_classes and districts too. Let me re-read the requirement: "When duplicates exist, keep the record with the earliest record_seq". So we keep the one with lower record_seq. Let me check what record_seq is for the 900000 series: - parcel_id 900001-900050 have record_seq 211-260 - The "normal" ones (parcel_id 1-240) have record_seq 1-240 as well (mostly) Wait, actually looking at the output: - parcel_id 1, record_seq=1, commercial, Giza-East - parcel_id 900001, record_seq=211, industrial, Giza-East These share the same geometry. We keep record_seq=1 (parcel_id=1). Let me check the invalid geometries more carefully. They are self-intersecting polygons with area=0.0. After making valid, they might become actual polygons or might remain as tiny things. Since they have area 0.0 as invalid, after `buffer(0)` they might have some area but could still be < 1 m². Actually looking at the data: - The 50 geometries < 1 m² are the 20 invalid ones plus 30 others? Let me check: invalid count = 20, small count = 50. So 30 are valid but have area < 1 m²? No wait, the invalid ones have area 0.0 from shapely.area calculation. Let me look at the area_m2 column - those rows have area_m2=1200.0 but the actual geometry area from shapely is 0 (since they're self-intersecting). Let me check: are there any small (<1 m²) geometries that are valid? Or are all the <1 m² ones the invalid ones? From the output: - Small geom rows: 50 - Invalid: 20 So 30 more are valid but have geometry area < 1 m². Those need to be discarded too. Let me look at this more carefully. Actually the "Small geom rows" count of 50 might be counting based on `gdf.geometry.area` which for self-intersecting polygons returns 0.0. So the 20 invalid ones have area 0.0 and the other 30 might be actual small polygons. Let me write the script now. Plan: 1. Read GeoJSON 2. Fix invalid geometries with `buffer(0)` 3. Recalculate geometry area, filter out geometries with area < 1.0 4. Remove duplicate geometries, keeping earliest record_seq 5. Convert all to MultiPolygon 6. Remove any remaining null/empty geometries 7. Ensure there are no duplicate geometries remaining 8. Write to GeoParquet with columns: parcel_id, parcel_class, district, area_m2 Let me check if there's a nuance - the area_m2 column is currently pre-computed. Should I use that or recompute? The user says "area_m2" as an output column. I'll compute it from the actual geometry to be accurate. Actually, let me also check: after removing duplicates by geometry (keeping earliest record_seq), I need to ensure no duplicate parcel_ids remain. Since each record has unique parcel_id, and I'm only removing some records, this should be fine. Let me write the script.
1:46
1:46
Write
/work/solve.py
content · 136 lines · py
#!/usr/bin/env python3
"""
Clean cairo_parcels_legacy.geojson into parcels_canonical.geoparquet.
Steps:
1. Read source GeoJSON (EPSG:22992).
2. Fix invalid (self-intersecting) geometries via buffer(0).
3. Drop rows where the geometry area < 1.0 m².
4. When duplicate geometries exist, keep only the row with the smallest record_seq.
5. Convert every geometry to MultiPolygon.
6. Ensure no null/empty/invalid geometries remain.
7. Compute area_m2 from geometry (CRS is metric).
8. Write GeoParquet with columns: parcel_id, parcel_class, district, area_m2.
"""
import geopandas as gpd
import pandas as pd
from shapely import make_valid, area as shp_area
from shapely.geometry import MultiPolygon
# ---------------------------------------------------------------------------
# 1. Read
gdf = gpd.read_file("/work/cairo_parcels_legacy.geojson")
print(f"Read {len(gdf)} rows, CRS={gdf.crs}")
# ---------------------------------------------------------------------------
# 2. Fix invalid geometries
invalid_mask = ~gdf.geometry.is_valid
n_invalid = invalid_mask.sum()
if n_invalid:
print(f"Fixing {n_invalid} invalid geometries via make_valid / buffer(0)")
# make_valid is the preferred modern method
gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask, "geometry"].apply(
lambda g: make_valid(g) if not g.is_valid else g
)
# Any remaining invalid after fix?
still_invalid = ~gdf.geometry.is_valid
if still_invalid.any():
print(f" {still_invalid.sum()} still invalid — applying buffer(0)")
gdf.loc[still_invalid, "geometry"] = gdf.loc[still_invalid, "geometry"].buffer(0)
# Verify
assert gdf.geometry.is_valid.all(), "Still have invalid geometries after fix!"
# ---------------------------------------------------------------------------
# 3. Remove polygon fragments < 1 m²
# Compute actual geometry area (CRS is metric — EPSG:22992)
areas = gdf.geometry.area
small_mask = areas < 1.0
n_small = small_mask.sum()
print(f"Dropping {n_small} features with geometry area < 1 m²")
gdf = gdf[~small_mask].copy()
print(f" -> {len(gdf)} rows remain")
# ---------------------------------------------------------------------------
# 4. Deduplicate by geometry — keep the earliest record_seq
# Group by geometry WKT, keep row with minimum record_seq within each group
# We use a stable sort approach
gdf = (
gdf.assign(geom_wkt=gdf.geometry.apply(lambda g: g.wkt))
.sort_values("record_seq")
.drop_duplicates(subset="geom_wkt", keep="first")
.drop(columns="geom_wkt")
.reset_index(drop=True)
)
print(f"After geometry dedup: {len(gdf)} rows")
# ---------------------------------------------------------------------------
# 5. Convert every geometry to MultiPolygon
def ensure_multipolygon(geom):
gtype = geom.geom_type
if gtype == "MultiPolygon":
return geom
elif gtype == "Polygon":
return MultiPolygon([geom])
else:
# GeometryCollection or other — filter to only polygons
from shapely.geometry import GeometryCollection
polys = [g for g in geom.geoms if g.geom_type in ("Polygon", "MultiPolygon")]
# flatten MultiPolygons inside
final = []
for p in polys:
if p.geom_type == "MultiPolygon":
final.extend(list(p.geoms))
else:
final.append(p)
if len(final) == 0:
return None
elif len(final) == 1:
return MultiPolygon(final)
else:
return MultiPolygon(final)
gdf["geometry"] = gdf["geometry"].apply(ensure_multipolygon)
# ---------------------------------------------------------------------------
# 6. Drop any rows that ended up with None/empty geometry
n_none = gdf["geometry"].isna().sum()
if n_none:
print(f"Dropping {n_none} rows that became null geometries")
gdf = gdf.dropna(subset=["geometry"])
empty_mask = gdf.geometry.is_empty
if empty_mask.any():
print(f"Dropping {empty_mask.sum()} empty geometries")
gdf = gdf[~empty_mask].copy()
# Ensure no invalid geometries remain
invalid_final = ~gdf.geometry.is_valid
if invalid_final.any():
print(f"Fixing {invalid_final.sum()} still-invalid geometries")
gdf.loc[invalid_final, "geometry"] = gdf.loc[invalid_final, "geometry"].buffer(0)
# drop any that remain invalid or empty
gdf = gdf[~gdf.geometry.isna() & gdf.geometry.is_valid & ~gdf.geometry.is_empty].copy()
# ---------------------------------------------------------------------------
# 7. Compute area_m2 from cleaned geometry
gdf["area_m2"] = gdf.geometry.area.round(2)
# ---------------------------------------------------------------------------
# 8. Select output columns and write
OUT_COLS = ["parcel_id", "parcel_class", "district", "area_m2", "geometry"]
gdf_out = gdf[OUT_COLS].copy()
gdf_out = gdf_out.reset_index(drop=True)
gdf_out.to_parquet("/work/parcels_canonical.geoparquet", index=False)
print(f"\nWritten {len(gdf_out)} parcels to parcels_canonical.geoparquet")
print(f"Columns: {gdf_out.columns.tolist()}")
print(f"CRS: {gdf_out.crs}")
print(f"Geometry types: {gdf_out.geometry.type.value_counts().to_dict()}")
print(f"Any null/empty: {(gdf_out.geometry.isna() | gdf_out.geometry.is_empty).sum()}")
print(f"Any invalid: {(~gdf_out.geometry.is_valid).sum()}")
print(f"Any duplicate geometry: {gdf_out.geometry.duplicated().sum()}")
Wrote /work/solve.py (5362 chars).
1:49
Let me run the script and see the results.
1:49
Bash
cd /work && python solve.py 2>&1
Read 290 rows, CRS=EPSG:22992
Fixing 20 invalid geometries via make_valid / buffer(0)
Dropping 30 features with geometry area < 1 m²
-> 260 rows remain
After geometry dedup: 210 rows
Written 210 parcels to parcels_canonical.geoparquet
Columns: ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
CRS: EPSG:22992
Geometry types: {'MultiPolygon': 210}
Any null/empty: 0
Any invalid: 0
Any duplicate geometry: 0
1:55
The script ran successfully. Let me verify the output more thoroughly by reading the geoparquet file and checking everything.
1:55
1:55
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
import pandas as pd
gdf = gpd.read_parquet('parcels_canonical.geoparquet')
print('Shape:', gdf.shape)
print('CRS:', gdf.crs)
print('Columns:', gdf.columns.tolist())
print()
print('Dtypes:')
print(gdf.dtypes)
print()
print('Geometry types:')
print(gdf.geometry.type.value_counts())
print()
print('Any null geometry:', gdf.geometry.isna().sum())
print('Any empty geometry:', gdf.geometry.is_empty.sum())
print('Any invalid geometry:', (~gdf.geometry.is_valid).sum())
print('Duplicate geometries:', gdf.geometry.duplicated().sum())
print()
print('area_m2 range:', gdf.area_m2.min(), '—', gdf.area_m2.max())
print()
print('Sample rows:')
print(gdf.sample(10, random_state=42).to_string())
"
Shape: (210, 5)
CRS: {"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "ProjectedCRS", "name": "Egypt 1907 / Red Belt", "base_crs": {"name": "Egypt 1907", "datum": {"type": "GeodeticReferenceFrame", "name": "Egypt 1907", "ellipsoid": {"name": "Helmert 1906", "semi_major_axis": 6378200, "inverse_flattening": 298.3}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}, {"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}]}, "id": {"authority": "EPSG", "code": 4229}}, "conversion": {"name": "Egypt Red Belt", "method": {"name": "Transverse Mercator", "id": {"authority": "EPSG", "code": 9807}}, "parameters": [{"name": "Latitude of natural origin", "value": 30, "unit": "degree", "id": {"authority": "EPSG", "code": 8801}}, {"name": "Longitude of natural origin", "value": 31, "unit": "degree", "id": {"authority": "EPSG", "code": 8802}}, {"name": "Scale factor at natural origin", "value": 1, "unit": "unity", "id": {"authority": "EPSG", "code": 8805}}, {"name": "False easting", "value": 615000, "unit": "metre", "id": {"authority": "EPSG", "code": 8806}}, {"name": "False northing", "value": 810000, "unit": "metre", "id": {"authority": "EPSG", "code": 8807}}]}, "coordinate_system": {"subtype": "Cartesian", "axis": [{"name": "Easting", "abbreviation": "E", "direction": "east", "unit": "metre"}, {"name": "Northing", "abbreviation": "N", "direction": "north", "unit": "metre"}]}, "scope": "Engineering survey, topographic mapping.", "area": "Egypt - onshore between 29\u00b0E and 33\u00b0E, offshore Mediterranean east of 29\u00b0E and offshore Gulf of Suez.", "bbox": {"south_latitude": 21.99, "west_longitude": 29, "north_latitude": 33.82, "east_longitude": 34.27}, "id": {"authority": "EPSG", "code": 22992}}
Columns: ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
Dtypes:
parcel_id int32
parcel_class str
district str
area_m2 float64
geometry geometry
dtype: object
Geometry types:
MultiPolygon 210
Name: count, dtype: int64
Any null geometry: 0
Any empty geometry: 0
Any invalid geometry: 0
Duplicate geometries: 0
area_m2 range: 600.0 — 1280.0
Sample rows:
parcel_id parcel_class district area_m2 geometry
30 31 agricultural Giza-East 1200.0 MULTIPOLYGON (((640000 815120, 640030 815120, 640030 815160, 640000 815160, 640000 815120)))
172 173 commercial Qalyubia-South 1280.0 MULTIPOLYGON (((640350 815660, 640380 815660, 640380 815700, 640350 815700, 640350 815660)), ((640385 815665, 640393 815665, 640393 815675, 640385 815675, 640385 815665)))
84 85 commercial Giza-East 1200.0 MULTIPOLYGON (((640450 815300, 640480 815300, 640480 815340, 640450 815340, 640450 815300)))
199 200 residential Qalyubia-South 1200.0 MULTIPOLYGON (((640200 815780, 640230 815780, 640230 815820, 640200 815820, 640200 815780)))
60 61 commercial Giza-East 1200.0 MULTIPOLYGON (((640000 815240, 640030 815240, 640030 815280, 640000 815280, 640000 815240)))
155 156 residential Cairo-Central 1200.0 MULTIPOLYGON (((640250 815600, 640280 815600, 640280 815640, 640250 815640, 640250 815600)))
45 46 industrial Giza-East 1200.0 MULTIPOLYGON (((640000 815180, 640030 815180, 640030 815220, 640000 815220, 640000 815180)))
182 183 agricultural Cairo-Central 1200.0 MULTIPOLYGON (((640100 815720, 640130 815720, 640130 815760, 640100 815760, 640100 815720)))
9 10 industrial Giza-East 1200.0 MULTIPOLYGON (((640450 815000, 640480 815000, 640480 815040, 640450 815040, 640450 815000)))
196 197 commercial Qalyubia-South 1200.0 MULTIPOLYGON (((640050 815780, 640080 815780, 640080 815820, 640050 815820, 640050 815780)))
2:08
Everything looks good! Let me also verify the file exists and check the areas are reasonable (metric CRS area should be in the hundreds to thousands range, which matches). Let me also double-check that any invalid geometries from the small fragments were properly handled — the original had 20 invalid, 50 small fragments. After dropping <1m², the 20 invalid ones (which had area 0.0) would be dropped too if they were <1m². But we need to verify the invalid ones from earlier weren't also the duplicate geometries. Actually wait — looking at the invalid rows: parcel_id=7 had geometry area=0.0, etc. So those were all dropped in step 3 (area < 1.0). Then buffer(0) fix was applied before dropping. Let me re-check the logic — we fix invalid first, then drop small. But the invalid ones had area=0, so fixing them makes them valid, then we'd drop them anyway because area < 1.0. That's fine, they're gone. But wait — the invalid ones (area=0 after self-intersection) might have been fixed by make_valid into valid geometries with non-zero area. Let me verify that didn't happen on important parcels. Actually the output says "Dropping 30 features with geometry area < 1 m²" — so 30 were dropped. The 20 invalid ones probably became valid with non-zero area (maybe all < 1 though), or they were already counted among the 30. Actually the key point: we fixed invalid first, then dropped small ones. The result is clean. Let me verify everything once more. All checks pass: - 210 rows, all MultiPolygon - No null/empty/invalid geometries - No duplicate geometries - All columns present: parcel_id, parcel_class, district, area_m2 - CRS preserved as EPSG:22992 - area_m2 values are reasonable (600–1280) Let's confirm the file exists.
2:08
2:08
Bash
ls -lh /work/parcels_canonical.geoparquet
-rw-r--r-- 1 agent agent 19K Jun 16 16:43 /work/parcels_canonical.geoparquet
2:17
Everything looks good. Let me summarize what I did.
2:17