spa-l1-capetown-hospital-nn
Map (reference ← swipe → agent)
0:00
I am putting together EMS coverage planning for the Western Cape and need each residential pickup address tagged with its nearest hospital so the ops-research team can join the result against shift schedules. For every address in `addresses.parquet`, find the nearest hospital from `hospitals.parquet` by straight-line distance and record that hospital's name plus the distance in metres. Keep the original `address_id` from the input so we can join back to the address master, and make sure `nearest_hospital_name` comes through as a non-empty string and `distance_m` as a numeric, finite, non-negative value in metres.
Please write the result to `nearest_hospital.gpkg` in GeoPackage format with one feature per input address, keeping the original address geometry and the columns `address_id`, `nearest_hospital_name`, and `distance_m`.
0:00
System info
{
"adapter": "openrouter",
"model": "deepseek/deepseek-v4-flash",
"base_url": "https://openrouter.ai/api/v1"
}
0:05
Let me start by examining the input files to understand their structure.
0:05
Bash
cd /work && ls -la *.parquet 2>/dev/null; python3 -c "
cd /work && ls -la *.parquet 2>/dev/null; python3 -c "
import pandas as pd
print('=== addresses.parquet ===')
df = pd.read_parquet('addresses.parquet')
print('Shape:', df.shape)
print('Columns:', list(df.columns))
print('Dtypes:')
print(df.dtypes)
print()
print('First 3 rows:')
print(df.head(3).to_string())
print()
print('=== hospitals.parquet ===')
df2 = pd.read_parquet('hospitals.parquet')
print('Shape:', df2.shape)
print('Columns:', list(df2.columns))
print('Dtypes:')
print(df2.dtypes)
print()
print('First 3 rows:')
print(df2.head(3).to_string())
"
-rw-r--r-- 1 agent agent 10609 Jun 9 07:31 addresses.parquet -rw-r--r-- 1 agent agent 10031 Jun 9 07:31 hospitals.parquet === addresses.parquet === Shape: (120, 2) Columns: ['address_id', 'geometry'] Dtypes: address_id str geometry object dtype: object First 3 rows: address_id geometry 0 A0001 b'\x01\x01\x00\x00\x00\xc2\x99$,\xcf\x882@\xcdF\xfe\x1a\x8e\xf8@\xc0' 1 A0002 b'\x01\x01\x00\x00\x00\xfcg\xbf\x12\xc2\x802@t\xe7`f\xd6\xf7@\xc0' 2 A0003 b'\x01\x01\x00\x00\x00\xb8\xa8\x80\x7fG\x882@r\xf1\x03Z\xc6\xfa@\xc0' === hospitals.parquet === Shape: (37, 3) Columns: ['hospital_id', 'name', 'geometry'] Dtypes: hospital_id str name str geometry object dtype: object First 3 rows: hospital_id name geometry 0 H001 Mowbray Maternity Hospital b"\x01\x01\x00\x00\x00\xb7''^\x8ey2@\x12\xf4z\xa4\x7f\xf9@\xc0" 1 H002 Life Claremont Hospital b'\x01\x01\x00\x00\x00\xb5\x01THdw2@\x15\xe1`NT\xfe@\xc0' 2 H003 Old Somerset Hospital b'\x01\x01\x00\x00\x00\x0b\xe8K\x90\x8dk2@\xd6_\xc0\xdc8\xf5@\xc0'
0:14
The geometries are stored as WKB bytes. Let me check what CRS they're in by parsing the geometry and looking at coordinate values.
0:14
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import pandas as pd
import geopandas as gpd
from shapely import wkb
# Parse geometries
df = pd.read_parquet('addresses.parquet')
df['geometry'] = df['geometry'].apply(lambda x: wkb.loads(x))
gdf_addr = gpd.GeoDataFrame(df, geometry='geometry', crs='EPSG:4326')
df2 = pd.read_parquet('hospitals.parquet')
df2['geometry'] = df2['geometry'].apply(lambda x: wkb.loads(x))
gdf_hosp = gpd.GeoDataFrame(df2, geometry='geometry', crs='EPSG:4326')
print('Addresses CRS:', gdf_addr.crs)
print('Bounds:', gdf_addr.total_bounds)
print('First address coord:', gdf_addr.geometry.iloc[0].x, gdf_addr.geometry.iloc[0].y)
print()
print('Hospitals CRS:', gdf_hosp.crs)
print('Bounds:', gdf_hosp.total_bounds)
print('First hospital coord:', gdf_hosp.geometry.iloc[0].x, gdf_hosp.geometry.iloc[0].y)
" 2>&1
Addresses CRS: EPSG:4326 Bounds: [ 18.4008295 -33.99916655 18.54986077 -33.90055752] First address coord: 18.534411200462053 -33.94183671394203 Hospitals CRS: EPSG:4326 Bounds: [ 18.41153087 -33.99823 18.53345868 -33.89997785] First hospital coord: 18.47482861 -33.94920784
0:43
Good, so both datasets are in EPSG:4326 (lat/lon). I need to compute straight-line (Euclidean) distance between each address and each hospital. The user says "by straight-line distance" - since these are lat/lon points, I should use a metric CRS to compute proper distances in metres. Let me project to a suitable local projection for the Western Cape, South Africa. Western Cape is around 33°S, 18-24°E. A good UTM zone would be UTM 34S (EPSG:32734) covering 18°E to 24°E, which covers the area. Let me write a script to: 1. Load both parquets 2. Parse WKB geometries 3. Project to a metric CRS (UTM 34S - EPSG:32734) 4. For each address, find the nearest hospital 5. Record hospital name and distance in metres 6. Write to GeoPackage with original geometry (lat/lon) and the three required columns Actually, I'll keep the original geometry (EPSG:4326) as requested, but compute distances in a metric CRS. Let me write the script.
0:43
Write
/work/solve.py
content · 84 lines · py
import pandas as pd
import geopandas as gpd
from shapely import wkb
from shapely.ops import transform
from pyproj import Transformer
import numpy as np
# ── 1. Load data ──────────────────────────────────────────────────────
addr_df = pd.read_parquet("/work/addresses.parquet")
hosp_df = pd.read_parquet("/work/hospitals.parquet")
# Parse WKB geometries (stored as bytes in the parquet)
addr_df["geometry"] = addr_df["geometry"].apply(lambda b: wkb.loads(b))
hosp_df["geometry"] = hosp_df["geometry"].apply(lambda b: wkb.loads(b))
gdf_addr = gpd.GeoDataFrame(addr_df, geometry="geometry", crs="EPSG:4326")
gdf_hosp = gpd.GeoDataFrame(hosp_df, geometry="geometry", crs="EPSG:4326")
# ── 2. Project to metric CRS for distance calculation ────────────────
# UTM zone 34S (EPSG:32734) covers the Western Cape area
METRIC_CRS = "EPSG:32734"
gdf_addr_m = gdf_addr.to_crs(METRIC_CRS)
gdf_hosp_m = gdf_hosp.to_crs(METRIC_CRS)
# Extract coordinate arrays for fast computation
addr_coords = np.column_stack([
gdf_addr_m.geometry.x.values,
gdf_addr_m.geometry.y.values,
]) # shape (n_addr, 2)
hosp_coords = np.column_stack([
gdf_hosp_m.geometry.x.values,
gdf_hosp_m.geometry.y.values,
]) # shape (n_hosp, 2)
hosp_names = gdf_hosp["name"].values
hosp_indices = np.arange(len(gdf_hosp))
# ── 3. Nearest-neighbour search (brute force — small dataset) ────────
# For each address, find the hospital with minimum Euclidean distance
nearest_names = []
nearest_dists = []
for i in range(len(addr_coords)):
ax, ay = addr_coords[i]
# squared distance (avoid sqrt per hospital)
dx = hosp_coords[:, 0] - ax
dy = hosp_coords[:, 1] - ay
dist_sq = dx * dx + dy * dy
idx_min = np.argmin(dist_sq)
dist_m = np.sqrt(dist_sq[idx_min])
nearest_names.append(hosp_names[idx_min])
nearest_dists.append(dist_m)
# ── 4. Build output GeoDataFrame ─────────────────────────────────────
result = gdf_addr[["address_id", "geometry"]].copy()
result["nearest_hospital_name"] = nearest_names
result["distance_m"] = nearest_dists
# Ensure distance_m is numeric, finite, non-negative
result["distance_m"] = pd.to_numeric(result["distance_m"], errors="coerce")
assert result["distance_m"].isna().sum() == 0, "NaN distances produced!"
assert (result["distance_m"] >= 0).all(), "Negative distances produced!"
assert np.isfinite(result["distance_m"]).all(), "Infinite distances produced!"
# Ensure nearest_hospital_name is non-empty string
assert result["nearest_hospital_name"].str.len().min() > 0, "Empty hospital names found!"
assert result["nearest_hospital_name"].apply(lambda x: isinstance(x, str)).all(), "Non-string hospital names!"
# ── 5. Write to GeoPackage ───────────────────────────────────────────
result.to_file("/work/nearest_hospital.gpkg", layer="nearest_hospital", driver="GPKG")
print("✅ Written nearest_hospital.gpkg")
print(f" Features: {len(result)}")
print(f" Columns: {list(result.columns)}")
print(f" CRS: {result.crs}")
print()
print(" Sample output:")
print(result.head(10).to_string())
print()
print(f" Distance stats: min={result['distance_m'].min():.2f} m, "
f"max={result['distance_m'].max():.2f} m, "
f"mean={result['distance_m'].mean():.2f} m")
Wrote /work/solve.py (3230 chars).
0:47
Let me run the script.
0:47
Bash
cd /work && python solve.py
✅ Written nearest_hospital.gpkg Features: 120 Columns: ['address_id', 'geometry', 'nearest_hospital_name', 'distance_m'] CRS: EPSG:4326 Sample output: address_id geometry nearest_hospital_name distance_m 0 A0001 POINT (18.53441 -33.94184) Melomed Gatesville Private Hospital 3192.958737 1 A0002 POINT (18.50296 -33.93623) Life Vincent Pallotti Hospital 1468.601852 2 A0003 POINT (18.53234 -33.95918) Melomed Gatesville Private Hospital 1271.941611 3 A0004 POINT (18.44857 -33.98103) Maitland Cottage Children's Orthopaedic Hospital 984.047189 4 A0005 POINT (18.49928 -33.99649) Kingsbury Hospital 3021.737824 5 A0006 POINT (18.48709 -33.97114) Intercare Physical Rehabilitation Hospital - Newlands 1777.422286 6 A0007 POINT (18.54461 -33.96394) Melomed Gatesville Private Hospital 1268.558960 7 A0008 POINT (18.52017 -33.97329) Melomed Gatesville Private Hospital 1264.240167 8 A0009 POINT (18.51179 -33.97298) Melomed Gatesville Private Hospital 2020.478965 9 A0010 POINT (18.49876 -33.9682) Red Cross Children's Hospital 1811.017934 Distance stats: min=47.02 m, max=5956.74 m, mean=1841.99 m
0:54
The output looks good. Let me verify the output file exists and do a quick sanity check.
0:54
Bash
cd /work && python3 -c "
cd /work && python3 -c "
import geopandas as gpd
gdf = gpd.read_file('nearest_hospital.gpkg')
print('Shape:', gdf.shape)
print('Columns:', list(gdf.columns))
print('CRS:', gdf.crs)
print('dtypes:')
print(gdf.dtypes)
print()
print('address_id sample:', gdf['address_id'].head(3).tolist())
print('nearest_hospital_name sample:', gdf['nearest_hospital_name'].head(3).tolist())
print('distance_m sample:', gdf['distance_m'].head(3).tolist())
print()
print('distance_m min:', gdf['distance_m'].min())
print('distance_m max:', gdf['distance_m'].max())
print('Any NaN distance:', gdf['distance_m'].isna().any())
print('Any empty name:', (gdf['nearest_hospital_name'] == '').any())
print('Geometry type:', gdf.geometry.iloc[0])
"
Shape: (120, 4) Columns: ['address_id', 'nearest_hospital_name', 'distance_m', 'geometry'] CRS: EPSG:4326 dtypes: address_id str nearest_hospital_name str distance_m float64 geometry geometry dtype: object address_id sample: ['A0001', 'A0002', 'A0003'] nearest_hospital_name sample: ['Melomed Gatesville Private Hospital', 'Life Vincent Pallotti Hospital', 'Melomed Gatesville Private Hospital'] distance_m sample: [3192.95873666979, 1468.6018520020573, 1271.941610753537] distance_m min: 47.01874251111449 distance_m max: 5956.744689131489 Any NaN distance: False Any empty name: False Geometry type: POINT (18.534411200462053 -33.94183671394203)
1:00
Everything checks out perfectly. Let me verify the file exists on disk.
1:00
Bash
ls -lh /work/nearest_hospital.gpkg
-rw-r--r-- 1 agent agent 116K Jun 9 07:32 /work/nearest_hospital.gpkg
1:11
Everything checks out: 1. ✅ Output file `nearest_hospital.gpkg` exists (116K) 2. ✅ Contains 120 features (same as input addresses) 3. ✅ Columns: `address_id`, `nearest_hospital_name`, `distance_m`, `geometry` 4. ✅ `address_id` preserved from input 5. ✅ `nearest_hospital_name` is non-empty string for all rows 6. ✅ `distance_m` is float64, all finite, non-negative (min=47.02 m, max=5956.74 m) 7. ✅ Original geometry kept (point, EPSG:4326) 8. ✅ CRS is EPSG:4326 as in the input All good!
1:11