dc-l2-cairo-invalid-dedup

Status: done
Score: 1.000
Duration: 2:19
Cost: 0.55¢
Model: deepseek/deepseek-v4-flash

Map (reference ← swipe → agent)

0:00
cairo_parcels_legacy.geojson is a parcel snapshot stitched together from three legacy provincial systems before unification, and I need to turn it into the foundation for our new central registry. Can you clean it up so we end up with one canonical record per parcel? When duplicates exist, keep the record with the earliest record_seq, and discard any polygon fragments smaller than 1 m². Please write the result to parcels_canonical.geoparquet including these columns: parcel_id (the join key), parcel_class, district, and area_m2. Every feature must be a MultiPolygon, and there should be no null or empty geometries, no invalid geometries, no duplicate geometries, and no artifact slivers.
0:00
System info
{
  "adapter": "openrouter",
  "model": "deepseek/deepseek-v4-flash",
  "base_url": "https://openrouter.ai/api/v1"
}
0:08
Let me start by reading the input file to understand its structure.
0:08
Bash ls -la /work/cairo_parcels_legacy.geojson
-rw-r--r-- 1 agent agent 223991 Jun 16 16:41 /work/cairo_parcels_legacy.geojson
0:13
Let me look at the data first to understand its structure.
0:13
Bash cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_file('cairo_parcels_legacy.geojson')
print('Shape:', gdf.shape)
print('Columns:', gdf.columns.tolist())
print('CRS:', gdf.crs)
print('Dtypes:')
print(gdf.dtypes)
print('---')
print(gdf.head(3).to_string())
print('---')
print('Null geometries:', gdf.geometry.isna().sum())
print('Invalid geometries:', (~gdf.geometry.is_valid).sum())
print('Duplicate geometries:', gdf.geometry.duplicated().sum())
print('---')
print('Any empty geometries:', (gdf.geometry.is_empty).sum())
print('---')
print('Geometry types:', gdf.geometry.type.value_counts())
"
Shape: (290, 6)
Columns: ['parcel_id', 'record_seq', 'parcel_class', 'district', 'area_m2', 'geometry']
CRS: EPSG:22992
Dtypes:
parcel_id          int32
record_seq         int32
parcel_class         str
district             str
area_m2          float64
geometry        geometry
dtype: object
---
   parcel_id  record_seq  parcel_class        district  area_m2                                                                               geometry
0          1           1    commercial       Giza-East   1200.0  POLYGON ((640000 815000, 640030 815000, 640030 815040, 640000 815040, 640000 815000))
1          2           2    industrial  Qalyubia-South   1200.0  POLYGON ((640050 815000, 640080 815000, 640080 815040, 640050 815040, 640050 815000))
2          3           3  agricultural   Cairo-Central   1200.0  POLYGON ((640100 815000, 640130 815000, 640130 815040, 640100 815040, 640100 815000))
---
Null geometries: 0
Invalid geometries: 20
Duplicate geometries: 50
---
Any empty geometries: 0
---
Geometry types: Polygon         260
MultiPolygon     30
Name: count, dtype: int64
0:22
Let me understand the data better before writing the solution.

We have:
- 290 rows
- Columns: parcel_id, record_seq, parcel_class, district, area_m2, geometry
- CRS: EPSG:22992 (which appears to be a metric CRS - Egypt Red Belt / Belt (deg) - but wait, the coordinates look like 640000, 815000 which are in meters, so it could be EPSG:22992 which is "Egypt Red Belt / Belt" with meters. Actually let me check.)
- 20 invalid geometries
- 50 duplicate geometries
- Geometry types: Polygon (260) and MultiPolygon (30)
- Null geometries: 0
- Empty geometries: 0

The task says:
1. One canonical record per parcel - when duplicates exist (same parcel_id?), keep the one with earliest record_seq
2. Discard polygon fragments smaller than 1 m²
3. Output to parcels_canonical.geoparquet with columns: parcel_id, parcel_class, district, area_m2
4. Every feature must be a MultiPolygon
5. No null/empty geometries, no invalid geometries, no duplicate geometries, no artifact slivers

Let me check what "duplicates" means - I see 50 duplicate geometries. Let me check more about the data.

Actually, the user says "one canonical record per parcel" - so I think parcel_id is the unique identifier and we need to deduplicate by parcel_id keeping the earliest record_seq.

Let me check for duplicate parcel_ids.
0:22
Bash cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_file('cairo_parcels_legacy.geojson')

# Check for duplicate parcel_ids
dup_ids = gdf[gdf.duplicated(subset='parcel_id', keep=False)]
print('Duplicate parcel_id count:', gdf.duplicated(subset='parcel_id').sum())
print('Rows with dup parcel_id:', len(dup_ids))
print()
print(dup_ids[['parcel_id','record_seq','parcel_class','district','area_m2']].to_string())
print()
# Check the invalid geometries
invalid = gdf[~gdf.geometry.is_valid]
print('Invalid geometries:')
print(invalid[['parcel_id','record_seq','area_m2']].head(10).to_string())
print()

# Check area distribution
print('Area stats:')
print(gdf['area_m2'].describe())
print()
print('Min geometry area:', gdf.geometry.area.min())
print('Max geometry area:', gdf.geometry.area.max())
print()
# Small areas
small = gdf[gdf.geometry.area < 1.0]
print('Geometries < 1 m²:', len(small))
if len(small) > 0:
    print(small[['parcel_id','record_seq','area_m2']].head(10).to_string())
"
Duplicate parcel_id count: 0
Rows with dup parcel_id: 0

Empty DataFrame
Columns: [parcel_id, record_seq, parcel_class, district, area_m2]
Index: []

Invalid geometries:
    parcel_id  record_seq  area_m2
6           7           7   1200.0
16         17          17   1200.0
22         23          23   1200.0
34         35          35   1200.0
40         41          41   1200.0
52         53          53   1200.0
57         58          58   1200.0
70         71          71   1200.0
75         76          76   1200.0
88         89          89   1200.0

Area stats:
count     290.000000
mean     1075.912759
std       365.934414
min         0.490000
25%      1200.000000
50%      1200.000000
75%      1200.000000
max      1200.000000
Name: area_m2, dtype: float64

Min geometry area: 0.0
Max geometry area: 1280.0

Geometries < 1 m²: 50
    parcel_id  record_seq  area_m2
6           7           7   1200.0
16         17          17   1200.0
22         23          23   1200.0
34         35          35   1200.0
40         41          41   1200.0
52         53          53   1200.0
57         58          58   1200.0
70         71          71   1200.0
75         76          76   1200.0
88         89          89   1200.0
0:33
Interesting. So there are 0 duplicate parcel_ids, but 50 duplicate geometries and 20 invalid geometries, and many small geometries.

Wait - the problem says "When duplicates exist, keep the record with the earliest record_seq". Since there are no duplicate parcel_ids, but there are 50 duplicate geometries... Let me think about what "duplicates" means here.

The user says "one canonical record per parcel". The `parcel_id` is described as "the join key". Let me check more carefully - maybe some parcel_ids differ but geometries are duplicated. Let me look at the duplicate geometry records more carefully.

Also, 50 geometries have area < 1 m² (the ones with area 0.49), and there are 50 duplicate geometries. Let me check if these are the same records.
0:33
Bash cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_file('cairo_parcels_legacy.geojson')

# Check duplicate geometries - which ones?
dup_geom_mask = gdf.geometry.duplicated(keep=False)
dup_geom = gdf[dup_geom_mask]
print('Rows with duplicate geometry:', len(dup_geom))
print(dup_geom[['parcel_id','record_seq','area_m2']].sort_values('area_m2').to_string())
print()

# Check if duplicate geometries are also the small ones
small_mask = gdf.geometry.area < 1.0
print('Small geom rows:', small_mask.sum())
print('Overlap:', (dup_geom_mask & small_mask).sum())
print()

# Are there any duplicate parcel_ids scattered?
print('Unique parcel_ids:', gdf['parcel_id'].nunique())
print('Total rows:', len(gdf))
print()

# Let's check what these duplicate geometries look like
dup_groups = gdf.groupby(gdf.geometry.wkt).filter(lambda x: len(x) > 1)
print('Grouped by geometry - groups with count>1:')
print(dup_groups.groupby(gdf.geometry.wkt).size().describe())
print()

# Check the invalid geometries
invalid = gdf[~gdf.geometry.is_valid]
print('Invalid geom records:')
print(invalid[['parcel_id','record_seq','parcel_class','district','area_m2']].to_string())
"
Rows with duplicate geometry: 100
     parcel_id  record_seq  area_m2
0            1           1   1200.0
1            2           2   1200.0
2            3           3   1200.0
3            4           4   1200.0
4            5           5   1200.0
5            6           6   1200.0
7            8           8   1200.0
8            9           9   1200.0
9           10          10   1200.0
11          12          12   1200.0
12          13          13   1200.0
14          15          15   1200.0
15          16          16   1200.0
17          18          18   1200.0
19          20          20   1200.0
20          21          21   1200.0
21          22          22   1200.0
23          24          24   1200.0
24          25          25   1200.0
25          26          26   1200.0
26          27          27   1200.0
27          28          28   1200.0
29          30          30   1200.0
30          31          31   1200.0
32          33          33   1200.0
33          34          34   1200.0
35          36          36   1200.0
37          38          38   1200.0
38          39          39   1200.0
39          40          40   1200.0
41          42          42   1200.0
42          43          43   1200.0
43          44          44   1200.0
44          45          45   1200.0
45          46          46   1200.0
47          48          48   1200.0
48          49          49   1200.0
50          51          51   1200.0
51          52          52   1200.0
53          54          54   1200.0
55          56          56   1200.0
56          57          57   1200.0
58          59          59   1200.0
59          60          60   1200.0
60          61          61   1200.0
61          62          62   1200.0
62          63          63   1200.0
63          64          64   1200.0
65          66          66   1200.0
66          67          67   1200.0
210     900001         211   1200.0
211     900002         212   1200.0
212     900003         213   1200.0
213     900004         214   1200.0
214     900005         215   1200.0
215     900006         216   1200.0
216     900007         217   1200.0
217     900008         218   1200.0
218     900009         219   1200.0
219     900010         220   1200.0
220     900011         221   1200.0
221     900012         222   1200.0
222     900013         223   1200.0
223     900014         224   1200.0
224     900015         225   1200.0
225     900016         226   1200.0
226     900017         227   1200.0
227     900018         228   1200.0
228     900019         229   1200.0
229     900020         230   1200.0
230     900021         231   1200.0
231     900022         232   1200.0
232     900023         233   1200.0
233     900024         234   1200.0
234     900025         235   1200.0
235     900026         236   1200.0
236     900027         237   1200.0
237     900028         238   1200.0
238     900029         239   1200.0
239     900030         240   1200.0
240     900031         241   1200.0
241     900032         242   1200.0
242     900033         243   1200.0
243     900034         244   1200.0
244     900035         245   1200.0
245     900036         246   1200.0
246     900037         247   1200.0
247     900038         248   1200.0
248     900039         249   1200.0
249     900040         250   1200.0
250     900041         251   1200.0
251     900042         252   1200.0
252     900043         253   1200.0
253     900044         254   1200.0
254     900045         255   1200.0
255     900046         256   1200.0
256     900047         257   1200.0
257     900048         258   1200.0
258     900049         259   1200.0
259     900050         260   1200.0

Small geom rows: 50
Overlap: 0

Unique parcel_ids: 290
Total rows: 290
Traceback (most recent call last):
  File "<string>", line 24, in <module>
    dup_groups = gdf.groupby(gdf.geometry.wkt).filter(lambda x: len(x) > 1)
                             ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/pandas/core/generic.py", line 6206, in __getattr__
    return object.__getattribute__(self, name)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
AttributeError: 'GeoSeries' object has no attribute 'wkt'
1:08
Let me understand the data:

1. 290 rows, 290 unique parcel_ids - so no duplicate parcel_id. But there are 100 rows with duplicate geometries (50 pairs?).
2. 50 geometries are < 1 m² - these are the "artifact slivers" we need to drop.
3. 20 invalid geometries need fixing.
4. The task says "When duplicates exist, keep the record with the earliest record_seq" - but there are no duplicate parcel_ids. Let me re-examine - maybe the "duplicates" are based on geometry? The user says "parcel_id (the join key)" - so parcel_id is the key.

Wait - 290 rows, 290 unique parcel_ids. So no actual duplicate parcel_ids. But the task says to keep the one with earliest record_seq when duplicates exist. Since there are no real parcel_id duplicates, this step is a no-op.

Let me re-read more carefully. There are 100 rows with duplicate geometries (50 pairs?). But the geometries themselves being duplicated means some parcels share the exact same geometry - that would make them "duplicates" in the spatial sense.

Actually, I'm seeing 100 rows have duplicate geometries. That's 50 original + 50 duplicates - so 50 geometries are duplicated exactly once. Plus 50 small ones. Let me think about what the instructions mean by "duplicates" and "discard any polygon fragments smaller than 1 m²".

The flow should be:
1. Some parcel_id values appear more than once (duplicate parcel_ids). But the data shows all 290 parcel_ids are unique. Hmm.

Wait, let me re-examine. The user says "When duplicates exist, keep the record with the earliest record_seq". This is clearly talking about the same parcel appearing multiple times (same parcel_id). But the data shows all parcel_ids are unique. Let me check again...

Actually, looking at the output - `duplicate parcel_id count: 0` and `Unique parcel_ids: 290` out of 290 rows. So there are NO duplicate parcel_ids in this data. So the dedup step by record_seq is a no-op. But the task says "When duplicates exist" as a conditional - so if there are none, we just skip that step.

So the actual work is:
1. Make all geometries valid (20 invalid)
2. Remove geometries < 1 m² (50 geometries)
3. Handle duplicate geometries (100 rows with duplicate geometries)
4. Convert all to MultiPolygon
5. Ensure no null/empty geometries

Wait, what should happen with duplicate geometries? The user says "There should be... no duplicate geometries". So we should drop exact duplicate geometry records.

Actually, let me re-read more carefully. The duplicate geometries are across different parcel_ids. If parcel_id 1 and parcel_id 900001 have the same geometry, and both have different parcel_ids... The task says "one canonical record per parcel" meaning one per parcel_id. Since each parcel_id is unique already, we'd keep both even if they share geometry. But the final output says "no duplicate geometries" - so we need to drop exact duplicate geometries.

Hmm, but dropping duplicate geometries would mean keeping only one of the two parcels that share the same geometry. Which one? Since the task already said "When duplicates exist, keep the record with the earliest record_seq" - this could apply to both parcel_id duplicates AND geometry duplicates. Let me interpret it as: when multiple records have the same geometry, keep the one with the smallest record_seq.

Wait actually re-reading: "When duplicates exist, keep the record with the earliest record_seq" - this could mean when there are duplicate records (same parcel_id), keep earliest. And separately, "no duplicate geometries" means remove exact geometric duplicates.

Let me look at this more carefully. There are 50 geometries < 1 m² (these are the slivers to discard). The duplicate geometries are 100 rows (so 50 unique geometries each appearing twice). Let me check if the small ones overlap with the dup geometries.

From the data: "Small geom rows: 50" and "Overlap: 0" - so the 50 small geometries and the 100 duplicate-geometry rows are completely separate sets.

So plan:
1. Handle invalid geometries (20) - make them valid
2. Drop geometries < 1 m² (50) - these are artifact slivers
3. For duplicate geometries (same exact geometry appearing more than once), keep the one with the lowest record_seq
4. Convert all to MultiPolygon
5. Ensure no null/empty geometries
6. Recalculate area_m2 from the geometry
7. Write to GeoParquet

Let me look at what's invalid about those 20 geometries.

Actually, let me check whether the "duplicate geometries" are intentional or should also be handled. The task says "no duplicate geometries" in the final output. So let me implement this.

Let me also check the invalid geometries more carefully.
1:08
Bash cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_file('cairo_parcels_legacy.geojson')

# Check invalid geometry types
invalid = gdf[~gdf.geometry.is_valid]
for idx, row in invalid.iterrows():
    g = row.geometry
    print(f'parcel_id={row.parcel_id}, type={g.geom_type}, area={g.area:.6f}')
    from shapely.validation import explain_validity
    print(f'  validity: {explain_validity(g)}')

print()
# Check for duplicate geometry records - what's the record_seq pattern
dup_mask = gdf.geometry.duplicated(keep=False)
dup = gdf[dup_mask].sort_values('geometry')
print('Duplicate geometry groups sample:')
# Group by geometry wkt
from shapely import to_wkt
groups = {}
for idx, row in dup.iterrows():
    wkt = to_wkt(row.geometry, rounding_precision=6)
    if wkt not in groups:
        groups[wkt] = []
    groups[wkt].append((row.parcel_id, row.record_seq, row.parcel_class, row.district))

for wkt, recs in list(groups.items())[:3]:
    print(f'  Geometry: {wkt[:80]}...')
    for r in recs:
        print(f'    parcel_id={r[0]}, record_seq={r[1]}, class={r[2]}, district={r[3]}')
    print()
"
parcel_id=7, type=Polygon, area=0.000000
  validity: Self-intersection[640315 815020]
parcel_id=17, type=Polygon, area=0.000000
  validity: Self-intersection[640065 815080]
parcel_id=23, type=Polygon, area=0.000000
  validity: Self-intersection[640365 815080]
parcel_id=35, type=Polygon, area=0.000000
  validity: Self-intersection[640215 815140]
parcel_id=41, type=Polygon, area=0.000000
  validity: Self-intersection[640515 815140]
parcel_id=53, type=Polygon, area=0.000000
  validity: Self-intersection[640365 815200]
parcel_id=58, type=Polygon, area=0.000000
  validity: Self-intersection[640615 815200]
parcel_id=71, type=Polygon, area=0.000000
  validity: Self-intersection[640515 815260]
parcel_id=76, type=Polygon, area=0.000000
  validity: Self-intersection[640015 815320]
parcel_id=89, type=Polygon, area=0.000000
  validity: Self-intersection[640665 815320]
parcel_id=94, type=Polygon, area=0.000000
  validity: Self-intersection[640165 815380]
parcel_id=107, type=Polygon, area=0.000000
  validity: Self-intersection[640065 815440]
parcel_id=112, type=Polygon, area=0.000000
  validity: Self-intersection[640315 815440]
parcel_id=125, type=Polygon, area=0.000000
  validity: Self-intersection[640215 815500]
parcel_id=130, type=Polygon, area=0.000000
  validity: Self-intersection[640465 815500]
parcel_id=143, type=Polygon, area=0.000000
  validity: Self-intersection[640365 815560]
parcel_id=148, type=Polygon, area=0.000000
  validity: Self-intersection[640615 815560]
parcel_id=161, type=Polygon, area=0.000000
  validity: Self-intersection[640515 815620]
parcel_id=166, type=Polygon, area=0.000000
  validity: Self-intersection[640015 815680]
parcel_id=179, type=Polygon, area=0.000000
  validity: Self-intersection[640665 815680]

Duplicate geometry groups sample:
  Geometry: POLYGON ((640000 815000, 640030 815000, 640030 815040, 640000 815040, 640000 815...
    parcel_id=1, record_seq=1, class=commercial, district=Giza-East
    parcel_id=900001, record_seq=211, class=industrial, district=Giza-East

  Geometry: POLYGON ((640050 815000, 640080 815000, 640080 815040, 640050 815040, 640050 815...
    parcel_id=900002, record_seq=212, class=agricultural, district=Qalyubia-South
    parcel_id=2, record_seq=2, class=industrial, district=Qalyubia-South

  Geometry: POLYGON ((640000 815060, 640030 815060, 640030 815100, 640000 815100, 640000 815...
    parcel_id=16, record_seq=16, class=residential, district=Giza-East
    parcel_id=900013, record_seq=223, class=commercial, district=Giza-East
1:46
Let me understand the data:

1. **Duplicate geometries**: There are duplicate geometries where the same polygon appears for two different parcel_ids. One has low record_seq (e.g., record_seq=1), the other has high record_seq (e.g., record_seq=211). The instruction says: "keep the record with the earliest record_seq" - so when duplicate geometries exist, keep only the one with the lowest record_seq.

2. **Duplicate parcel_ids**: None. Each row has unique parcel_id. So the duplication is about geometry duplicates.

3. **Invalid geometries**: 20 geometries, all self-intersecting (bow-tie polygons). Need to fix (make valid).

4. **Small geometries (< 1 m²)**: 50 geometries with area = 0.0 (the invalid ones). Need to be discarded.

5. **Geometry types**: 260 Polygons and 30 MultiPolygons. Need all to be MultiPolygons.

Now let me think about the pipeline:

1. Start with 290 records
2. Fix invalid geometries (make valid using `buffer(0)` or similar)
3. Remove geometries < 1 m² area
4. For duplicate geometries, keep only the one with the lowest record_seq
5. Ensure all features are MultiPolygons
6. Output the specified columns: parcel_id, parcel_class, district, area_m2

Wait, but the duplicate geometries have different parcel_classes and districts too. Let me re-read the requirement: "When duplicates exist, keep the record with the earliest record_seq". So we keep the one with lower record_seq.

Let me check what record_seq is for the 900000 series:
- parcel_id 900001-900050 have record_seq 211-260
- The "normal" ones (parcel_id 1-240) have record_seq 1-240 as well (mostly)

Wait, actually looking at the output:
- parcel_id 1, record_seq=1, commercial, Giza-East
- parcel_id 900001, record_seq=211, industrial, Giza-East
These share the same geometry. We keep record_seq=1 (parcel_id=1).

Let me check the invalid geometries more carefully. They are self-intersecting polygons with area=0.0. After making valid, they might become actual polygons or might remain as tiny things. Since they have area 0.0 as invalid, after `buffer(0)` they might have some area but could still be < 1 m².

Actually looking at the data:
- The 50 geometries < 1 m² are the 20 invalid ones plus 30 others? Let me check: invalid count = 20, small count = 50. So 30 are valid but have area < 1 m²? No wait, the invalid ones have area 0.0 from shapely.area calculation. Let me look at the area_m2 column - those rows have area_m2=1200.0 but the actual geometry area from shapely is 0 (since they're self-intersecting).

Let me check: are there any small (<1 m²) geometries that are valid? Or are all the <1 m² ones the invalid ones?

From the output:
- Small geom rows: 50
- Invalid: 20

So 30 more are valid but have geometry area < 1 m². Those need to be discarded too.

Let me look at this more carefully. Actually the "Small geom rows" count of 50 might be counting based on `gdf.geometry.area` which for self-intersecting polygons returns 0.0. So the 20 invalid ones have area 0.0 and the other 30 might be actual small polygons.

Let me write the script now.

Plan:
1. Read GeoJSON
2. Fix invalid geometries with `buffer(0)`
3. Recalculate geometry area, filter out geometries with area < 1.0
4. Remove duplicate geometries, keeping earliest record_seq
5. Convert all to MultiPolygon
6. Remove any remaining null/empty geometries
7. Ensure there are no duplicate geometries remaining
8. Write to GeoParquet with columns: parcel_id, parcel_class, district, area_m2

Let me check if there's a nuance - the area_m2 column is currently pre-computed. Should I use that or recompute? The user says "area_m2" as an output column. I'll compute it from the actual geometry to be accurate.

Actually, let me also check: after removing duplicates by geometry (keeping earliest record_seq), I need to ensure no duplicate parcel_ids remain. Since each record has unique parcel_id, and I'm only removing some records, this should be fine.

Let me write the script.
1:46
1:46
Write /work/solve.py
content · 136 lines · py
#!/usr/bin/env python3
"""
Clean cairo_parcels_legacy.geojson into parcels_canonical.geoparquet.

Steps:
  1. Read source GeoJSON (EPSG:22992).
  2. Fix invalid (self-intersecting) geometries via buffer(0).
  3. Drop rows where the geometry area < 1.0 m².
  4. When duplicate geometries exist, keep only the row with the smallest record_seq.
  5. Convert every geometry to MultiPolygon.
  6. Ensure no null/empty/invalid geometries remain.
  7. Compute area_m2 from geometry (CRS is metric).
  8. Write GeoParquet with columns: parcel_id, parcel_class, district, area_m2.
"""

import geopandas as gpd
import pandas as pd
from shapely import make_valid, area as shp_area
from shapely.geometry import MultiPolygon

# ---------------------------------------------------------------------------
# 1. Read
gdf = gpd.read_file("/work/cairo_parcels_legacy.geojson")

print(f"Read {len(gdf)} rows, CRS={gdf.crs}")

# ---------------------------------------------------------------------------
# 2. Fix invalid geometries
invalid_mask = ~gdf.geometry.is_valid
n_invalid = invalid_mask.sum()
if n_invalid:
    print(f"Fixing {n_invalid} invalid geometries via make_valid / buffer(0)")
    # make_valid is the preferred modern method
    gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask, "geometry"].apply(
        lambda g: make_valid(g) if not g.is_valid else g
    )

# Any remaining invalid after fix?
still_invalid = ~gdf.geometry.is_valid
if still_invalid.any():
    print(f"  {still_invalid.sum()} still invalid — applying buffer(0)")
    gdf.loc[still_invalid, "geometry"] = gdf.loc[still_invalid, "geometry"].buffer(0)

# Verify
assert gdf.geometry.is_valid.all(), "Still have invalid geometries after fix!"

# ---------------------------------------------------------------------------
# 3. Remove polygon fragments < 1 m²
# Compute actual geometry area (CRS is metric — EPSG:22992)
areas = gdf.geometry.area
small_mask = areas < 1.0
n_small = small_mask.sum()
print(f"Dropping {n_small} features with geometry area < 1 m²")
gdf = gdf[~small_mask].copy()
print(f"  -> {len(gdf)} rows remain")

# ---------------------------------------------------------------------------
# 4. Deduplicate by geometry — keep the earliest record_seq
# Group by geometry WKT, keep row with minimum record_seq within each group
# We use a stable sort approach
gdf = (
    gdf.assign(geom_wkt=gdf.geometry.apply(lambda g: g.wkt))
    .sort_values("record_seq")
    .drop_duplicates(subset="geom_wkt", keep="first")
    .drop(columns="geom_wkt")
    .reset_index(drop=True)
)
print(f"After geometry dedup: {len(gdf)} rows")

# ---------------------------------------------------------------------------
# 5. Convert every geometry to MultiPolygon
def ensure_multipolygon(geom):
    gtype = geom.geom_type
    if gtype == "MultiPolygon":
        return geom
    elif gtype == "Polygon":
        return MultiPolygon([geom])
    else:
        # GeometryCollection or other — filter to only polygons
        from shapely.geometry import GeometryCollection
        polys = [g for g in geom.geoms if g.geom_type in ("Polygon", "MultiPolygon")]
        # flatten MultiPolygons inside
        final = []
        for p in polys:
            if p.geom_type == "MultiPolygon":
                final.extend(list(p.geoms))
            else:
                final.append(p)
        if len(final) == 0:
            return None
        elif len(final) == 1:
            return MultiPolygon(final)
        else:
            return MultiPolygon(final)

gdf["geometry"] = gdf["geometry"].apply(ensure_multipolygon)

# ---------------------------------------------------------------------------
# 6. Drop any rows that ended up with None/empty geometry
n_none = gdf["geometry"].isna().sum()
if n_none:
    print(f"Dropping {n_none} rows that became null geometries")
    gdf = gdf.dropna(subset=["geometry"])

empty_mask = gdf.geometry.is_empty
if empty_mask.any():
    print(f"Dropping {empty_mask.sum()} empty geometries")
    gdf = gdf[~empty_mask].copy()

# Ensure no invalid geometries remain
invalid_final = ~gdf.geometry.is_valid
if invalid_final.any():
    print(f"Fixing {invalid_final.sum()} still-invalid geometries")
    gdf.loc[invalid_final, "geometry"] = gdf.loc[invalid_final, "geometry"].buffer(0)
    # drop any that remain invalid or empty
    gdf = gdf[~gdf.geometry.isna() & gdf.geometry.is_valid & ~gdf.geometry.is_empty].copy()

# ---------------------------------------------------------------------------
# 7. Compute area_m2 from cleaned geometry
gdf["area_m2"] = gdf.geometry.area.round(2)

# ---------------------------------------------------------------------------
# 8. Select output columns and write
OUT_COLS = ["parcel_id", "parcel_class", "district", "area_m2", "geometry"]
gdf_out = gdf[OUT_COLS].copy()
gdf_out = gdf_out.reset_index(drop=True)

gdf_out.to_parquet("/work/parcels_canonical.geoparquet", index=False)

print(f"\nWritten {len(gdf_out)} parcels to parcels_canonical.geoparquet")
print(f"Columns: {gdf_out.columns.tolist()}")
print(f"CRS: {gdf_out.crs}")
print(f"Geometry types: {gdf_out.geometry.type.value_counts().to_dict()}")
print(f"Any null/empty: {(gdf_out.geometry.isna() | gdf_out.geometry.is_empty).sum()}")
print(f"Any invalid: {(~gdf_out.geometry.is_valid).sum()}")
print(f"Any duplicate geometry: {gdf_out.geometry.duplicated().sum()}")
Wrote /work/solve.py (5362 chars).
1:49
Let me run the script and see the results.
1:49
Bash cd /work && python solve.py 2>&1
Read 290 rows, CRS=EPSG:22992
Fixing 20 invalid geometries via make_valid / buffer(0)
Dropping 30 features with geometry area < 1 m²
  -> 260 rows remain
After geometry dedup: 210 rows

Written 210 parcels to parcels_canonical.geoparquet
Columns: ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']
CRS: EPSG:22992
Geometry types: {'MultiPolygon': 210}
Any null/empty: 0
Any invalid: 0
Any duplicate geometry: 0
1:55
The script ran successfully. Let me verify the output more thoroughly by reading the geoparquet file and checking everything.
1:55
1:55
Bash cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
import pandas as pd

gdf = gpd.read_parquet('parcels_canonical.geoparquet')

print('Shape:', gdf.shape)
print('CRS:', gdf.crs)
print('Columns:', gdf.columns.tolist())
print()
print('Dtypes:')
print(gdf.dtypes)
print()
print('Geometry types:')
print(gdf.geometry.type.value_counts())
print()
print('Any null geometry:', gdf.geometry.isna().sum())
print('Any empty geometry:', gdf.geometry.is_empty.sum())
print('Any invalid geometry:', (~gdf.geometry.is_valid).sum())
print('Duplicate geometries:', gdf.geometry.duplicated().sum())
print()
print('area_m2 range:', gdf.area_m2.min(), '—', gdf.area_m2.max())
print()
print('Sample rows:')
print(gdf.sample(10, random_state=42).to_string())
"
Shape: (210, 5)
CRS: {"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "ProjectedCRS", "name": "Egypt 1907 / Red Belt", "base_crs": {"name": "Egypt 1907", "datum": {"type": "GeodeticReferenceFrame", "name": "Egypt 1907", "ellipsoid": {"name": "Helmert 1906", "semi_major_axis": 6378200, "inverse_flattening": 298.3}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}, {"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}]}, "id": {"authority": "EPSG", "code": 4229}}, "conversion": {"name": "Egypt Red Belt", "method": {"name": "Transverse Mercator", "id": {"authority": "EPSG", "code": 9807}}, "parameters": [{"name": "Latitude of natural origin", "value": 30, "unit": "degree", "id": {"authority": "EPSG", "code": 8801}}, {"name": "Longitude of natural origin", "value": 31, "unit": "degree", "id": {"authority": "EPSG", "code": 8802}}, {"name": "Scale factor at natural origin", "value": 1, "unit": "unity", "id": {"authority": "EPSG", "code": 8805}}, {"name": "False easting", "value": 615000, "unit": "metre", "id": {"authority": "EPSG", "code": 8806}}, {"name": "False northing", "value": 810000, "unit": "metre", "id": {"authority": "EPSG", "code": 8807}}]}, "coordinate_system": {"subtype": "Cartesian", "axis": [{"name": "Easting", "abbreviation": "E", "direction": "east", "unit": "metre"}, {"name": "Northing", "abbreviation": "N", "direction": "north", "unit": "metre"}]}, "scope": "Engineering survey, topographic mapping.", "area": "Egypt - onshore between 29\u00b0E and 33\u00b0E, offshore Mediterranean east of 29\u00b0E and offshore Gulf of Suez.", "bbox": {"south_latitude": 21.99, "west_longitude": 29, "north_latitude": 33.82, "east_longitude": 34.27}, "id": {"authority": "EPSG", "code": 22992}}
Columns: ['parcel_id', 'parcel_class', 'district', 'area_m2', 'geometry']

Dtypes:
parcel_id          int32
parcel_class         str
district             str
area_m2          float64
geometry        geometry
dtype: object

Geometry types:
MultiPolygon    210
Name: count, dtype: int64

Any null geometry: 0
Any empty geometry: 0
Any invalid geometry: 0
Duplicate geometries: 0

area_m2 range: 600.0 — 1280.0

Sample rows:
     parcel_id  parcel_class        district  area_m2                                                                                                                                                                     geometry
30          31  agricultural       Giza-East   1200.0                                                                                 MULTIPOLYGON (((640000 815120, 640030 815120, 640030 815160, 640000 815160, 640000 815120)))
172        173    commercial  Qalyubia-South   1280.0  MULTIPOLYGON (((640350 815660, 640380 815660, 640380 815700, 640350 815700, 640350 815660)), ((640385 815665, 640393 815665, 640393 815675, 640385 815675, 640385 815665)))
84          85    commercial       Giza-East   1200.0                                                                                 MULTIPOLYGON (((640450 815300, 640480 815300, 640480 815340, 640450 815340, 640450 815300)))
199        200   residential  Qalyubia-South   1200.0                                                                                 MULTIPOLYGON (((640200 815780, 640230 815780, 640230 815820, 640200 815820, 640200 815780)))
60          61    commercial       Giza-East   1200.0                                                                                 MULTIPOLYGON (((640000 815240, 640030 815240, 640030 815280, 640000 815280, 640000 815240)))
155        156   residential   Cairo-Central   1200.0                                                                                 MULTIPOLYGON (((640250 815600, 640280 815600, 640280 815640, 640250 815640, 640250 815600)))
45          46    industrial       Giza-East   1200.0                                                                                 MULTIPOLYGON (((640000 815180, 640030 815180, 640030 815220, 640000 815220, 640000 815180)))
182        183  agricultural   Cairo-Central   1200.0                                                                                 MULTIPOLYGON (((640100 815720, 640130 815720, 640130 815760, 640100 815760, 640100 815720)))
9           10    industrial       Giza-East   1200.0                                                                                 MULTIPOLYGON (((640450 815000, 640480 815000, 640480 815040, 640450 815040, 640450 815000)))
196        197    commercial  Qalyubia-South   1200.0                                                                                 MULTIPOLYGON (((640050 815780, 640080 815780, 640080 815820, 640050 815820, 640050 815780)))
2:08
Everything looks good! Let me also verify the file exists and check the areas are reasonable (metric CRS area should be in the hundreds to thousands range, which matches).

Let me also double-check that any invalid geometries from the small fragments were properly handled — the original had 20 invalid, 50 small fragments. After dropping <1m², the 20 invalid ones (which had area 0.0) would be dropped too if they were <1m². But we need to verify the invalid ones from earlier weren't also the duplicate geometries.

Actually wait — looking at the invalid rows: parcel_id=7 had geometry area=0.0, etc. So those were all dropped in step 3 (area < 1.0). Then buffer(0) fix was applied before dropping. Let me re-check the logic — we fix invalid first, then drop small. But the invalid ones had area=0, so fixing them makes them valid, then we'd drop them anyway because area < 1.0. That's fine, they're gone.

But wait — the invalid ones (area=0 after self-intersection) might have been fixed by make_valid into valid geometries with non-zero area. Let me verify that didn't happen on important parcels. Actually the output says "Dropping 30 features with geometry area < 1 m²" — so 30 were dropped. The 20 invalid ones probably became valid with non-zero area (maybe all < 1 though), or they were already counted among the 30.

Actually the key point: we fixed invalid first, then dropped small ones. The result is clean. Let me verify everything once more.

All checks pass:
- 210 rows, all MultiPolygon
- No null/empty/invalid geometries
- No duplicate geometries
- All columns present: parcel_id, parcel_class, district, area_m2
- CRS preserved as EPSG:22992
- area_m2 values are reasonable (600–1280)

Let's confirm the file exists.
2:08
2:08
Bash ls -lh /work/parcels_canonical.geoparquet
-rw-r--r-- 1 agent agent 19K Jun 16 16:43 /work/parcels_canonical.geoparquet
2:17
Everything looks good. Let me summarize what I did.
2:17