dd-l2-bangkok-multicrs-audit
0:00
A contractor just shipped `bangkok_contractor_delivery.gpkg` and I need a defect cite-sheet before I sign off. Please walk every layer in the file and write one row per layer to `crs_audit.csv` with these columns: layer_name, declared_crs (formatted as `EPSG:NNNN`), geometry_type, feature_count, sample_x and sample_y for a sample coordinate taken from the layer as it sits on disk, and encoding_detected. For encoding_detected, look at the text attributes for anomalies and report either `latin1-mojibake` or `utf-8`.
0:00
System info
{
"adapter": "openrouter",
"model": "deepseek/deepseek-v4-flash",
"base_url": "https://openrouter.ai/api/v1"
}
0:02
Let me start by examining the GPKG file to understand its contents.
0:02
Bash
ls -lh /work/bangkok_contractor_delivery.gpkg
-rw-r--r-- 1 agent agent 2.6M Jun 16 17:32 /work/bangkok_contractor_delivery.gpkg
0:12
Let me first explore the layers in the GeoPackage and understand its structure.
0:12
Bash
python3 -c "
import fiona
gpkg = '/work/bangkok_contractor_delivery.gpkg'
layers = fiona.list_layers(gpkg)
print('Layers:', layers)
"
Traceback (most recent call last):
File "<string>", line 4, in <module>
layers = fiona.list_layers(gpkg)
^^^^^^^^^^^^^^^^^
AttributeError: module 'fiona' has no attribute 'list_layers'. Did you mean: 'listlayers'?
0:19
Let me fix the API call.
0:19
Bash
python3 -c "
import fiona
gpkg = '/work/bangkok_contractor_delivery.gpkg'
layers = fiona.listlayers(gpkg)
print('Layers:', layers)
"
Layers: ['parcels', 'roads', 'markets']
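Because a GeoPackage is an SQLite database, the layer list and each layer's declared CRS can also be cross-checked without fiona. The following is a minimal sketch rather than part of the session: it assumes the file carries the standard `gpkg_contents` and `gpkg_spatial_ref_sys` tables, and the function name `declared_crs_by_layer` is hypothetical.

```python
import sqlite3


def declared_crs_by_layer(gpkg_path):
    """Return {layer_name: 'ORG:code'} for every feature table in a GeoPackage.

    Assumption: the file follows the GeoPackage layout, where gpkg_contents
    registers each layer and gpkg_spatial_ref_sys describes its srs_id.
    """
    con = sqlite3.connect(gpkg_path)
    try:
        rows = con.execute(
            "SELECT c.table_name, s.organization, s.organization_coordsys_id "
            "FROM gpkg_contents AS c "
            "JOIN gpkg_spatial_ref_sys AS s ON s.srs_id = c.srs_id "
            "WHERE c.data_type = 'features' "
            "ORDER BY c.table_name"
        ).fetchall()
    finally:
        con.close()
    # Normalise the authority name so the result reads as EPSG:NNNN.
    return {name: f"{org.upper()}:{code}" for name, org, code in rows}
```

This reads only the declared metadata; it says nothing about whether the stored coordinates actually match that declaration.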
0:47
Good, there are 3 layers: parcels, roads, markets. Now I need to inspect each layer's CRS, geometry type, feature count, sample coordinate, and detect encoding issues with text attributes. Let me write a comprehensive Python script.
0:47
Bash
python3 << 'PYEOF'
import fiona
import geopandas as gpd
import pyproj
import csv

gpkg = '/work/bangkok_contractor_delivery.gpkg'
layers = ['parcels', 'roads', 'markets']
results = []

for lyr in layers:
    # Open the layer to get schema and CRS info via fiona
    with fiona.open(gpkg, layer=lyr) as src:
        crs = src.crs
        # declared CRS as EPSG:NNNN
        if crs:
            epsg_num = crs.to_epsg()
            if epsg_num:
                declared_crs = f"EPSG:{epsg_num}"
            else:
                # try building from authority
                auth = crs.get('init') or crs.get('authority') or str(crs)
                declared_crs = str(crs)
        else:
            declared_crs = "None"

        # geometry type from schema
        geom_type = src.schema['geometry']

        # For first feature, grab a sample coordinate
        first_feat = next(iter(src))
        geom = first_feat['geometry']
        if geom and geom['type'] == 'Point':
            sample_x, sample_y = geom['coordinates']
        elif geom and geom['type'] in ('MultiPoint', 'LineString', 'MultiLineString'):
            coords = geom['coordinates']
            if geom['type'] == 'MultiPoint':
                first_pt = coords[0]
            elif geom['type'] == 'LineString':
                first_pt = coords[0]
            elif geom['type'] == 'MultiLineString':
                first_pt = coords[0][0]
            else:
                first_pt = coords[0][0]
            sample_x, sample_y = first_pt
        elif geom and geom['type'] in ('Polygon', 'MultiPolygon'):
            if geom['type'] == 'Polygon':
                first_pt = geom['coordinates'][0][0]
            else:  # MultiPolygon
                first_pt = geom['coordinates'][0][0][0]
            sample_x, sample_y = first_pt
        else:
            sample_x, sample_y = None, None

    # Feature count via geopandas (len)
    gdf = gpd.read_file(gpkg, layer=lyr)
    feature_count = len(gdf)

    # Encoding detection: look at text/object columns for mojibake patterns
    text_cols = gdf.select_dtypes(include=['object']).columns.tolist()
    # Exclude geometry column
    text_cols = [c for c in text_cols if c != 'geometry']

    encoding_detected = 'utf-8'
    for col in text_cols:
        for val in gdf[col].dropna().head(100):
            if isinstance(val, str):
                # Check for common latin1-mojibake patterns:
                # Characters that appear when utf-8 bytes are decoded as latin-1
                # e.g. Ã© (é mojibaked), Ã± (ñ), Ã¼ (ü), etc.
                # Also check for high-byte chars in the 0x80-0x9F range
                for ch in val:
                    ord_ch = ord(ch)
                    if 0x80 <= ord_ch <= 0x9F:
                        # control chars in latin1 supplement - often mojibake
                        encoding_detected = 'latin1-mojibake'
                        break
                # Latin-1 accented chars when re-encoded may produce two-byte sequences
                # Check for common patterns
                import re
                # Common mojibake patterns: Ã© (é), Ã± (ñ), Ã¼ (ü), Ã¡ (á), etc.
                mojibake_pattern = r'[ÃÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåæçèéêëìíîïðñòóôõöùúûüý]'
                if re.search(mojibake_pattern, val):
                    # But these chars can also appear legitimately in Thai text
                    # We need to be smarter: check if these appear in non-Thai context
                    # Actually, let's look for very specific mojibake sequences
                    pass

    # Let's do a more careful encoding detection
    encoding_detected = 'utf-8'
    suspicious_count = 0
    total_checked = 0
    for col in text_cols:
        for val in gdf[col].dropna().head(200):
            if isinstance(val, str):
                total_checked += 1
                # Try re-encoding: if the string contains chars like Ã©,
                # it might mean the original UTF-8 bytes were decoded as latin-1
                # Let's detect by trying to encode as latin-1 and then decode as utf-8
                try:
                    # If the string can be encoded back to latin-1, and the result
                    # is valid UTF-8 bytes, then the original might be mojibake
                    # Actually, let's look for 2-byte chars that are common in mojibake
                    for ch in val:
                        o = ord(ch)
                        # High latin-1 supplement: À-ÿ (0xC0-0xFF)
                        # when these appear next to each other forming
                        # what would be UTF-8 multi-byte sequences
                        if 0xC0 <= o <= 0xFF:
                            suspicious_count += 1
                            break
                except:
                    pass
    if suspicious_count > 5:
        encoding_detected = 'latin1-mojibake'

    results.append({
        'layer_name': lyr,
        'declared_crs': declared_crs,
        'geometry_type': geom_type,
        'feature_count': feature_count,
        'sample_x': sample_x,
        'sample_y': sample_y,
        'encoding_detected': encoding_detected
    })

    print(f"--- {lyr} ---")
    print(f" CRS: {declared_crs}")
    print(f" Geometry: {geom_type}")
    print(f" Features: {feature_count}")
    print(f" Sample: ({sample_x}, {sample_y})")
    print(f" Encoding: {encoding_detected}")
    print(f" Text cols: {text_cols}")
    # Also print a few text values for inspection
    for col in text_cols:
        vals = gdf[col].dropna().head(5).tolist()
        print(f" {col}: {vals}")

# Write CSV
with open('/work/crs_audit.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['layer_name','declared_crs','geometry_type','feature_count','sample_x','sample_y','encoding_detected'])
    writer.writeheader()
    writer.writerows(results)
print("\nWritten crs_audit.csv")
PYEOF
--- parcels ---
CRS: EPSG:24047
Geometry: Polygon
Features: 4000
Sample: (657421.5158199039, 1509162.3701859994)
Encoding: latin1-mojibake
Text cols: ['id', 'name_th', 'owner_class']
id: ['P00000', 'P00001', 'P00002', 'P00003', 'P00004']
name_th: ['à¹\x81à¸\x9bลà¸\x87à¸\x97ีà¹\x88à¸\x94ิà¸\x99 à¸\x9aาà¸\x87รัà¸\x81', 'à¹\x81à¸\x9bลà¸\x87à¸\x97ีà¹\x88à¸\x94ิà¸\x99 à¸\x9bà¸\x97ุมวัà¸\x99', 'à¹\x81à¸\x9bลà¸\x87à¸\x97ีà¹\x88à¸\x94ิà¸\x99 วัà¸\x92à¸\x99า', 'à¹\x81à¸\x9bลà¸\x87à¸\x97ีà¹\x88à¸\x94ิà¸\x99 à¸\x84ลà¸\xadà¸\x87à¹\x80à¸\x95ย', 'à¹\x81à¸\x9bลà¸\x87à¸\x97ีà¹\x88à¸\x94ิà¸\x99 à¸\x94ิà¸\x99à¹\x81à¸\x94à¸\x87']
owner_class: ['private', 'state', 'religious', 'private', 'state']
--- roads ---
CRS: EPSG:32647
Geometry: LineString
Features: 5000
Sample: (656835.6850029442, 1509487.0031804747)
Encoding: latin1-mojibake
Text cols: ['id', 'name', 'highway']
id: ['R00000', 'R00001', 'R00002', 'R00003', 'R00004']
name: ['à¸\x96à¸\x99à¸\x99สุà¸\x82ุมวิà¸\x97', 'à¸\x96à¸\x99à¸\x99à¸\x9eระราม 4', 'à¸\x96à¸\x99à¸\x99à¸\x9eระราม 9', 'à¸\x96à¸\x99à¸\x99รัà¸\x8aà¸\x94าà¸\xa0ิà¹\x80ษà¸\x81', 'à¸\x96à¸\x99à¸\x99à¹\x80à¸\x9eà¸\x8aรà¸\x9aุรี']
highway: ['primary', 'secondary', 'tertiary', 'residential', 'service']
--- markets ---
CRS: EPSG:4326
Geometry: Point
Features: 1000
Sample: (100.4533, 13.655000000000001)
Encoding: utf-8
Text cols: ['id', 'name_th', 'kind']
id: ['M00000', 'M00001', 'M00002', 'M00003', 'M00004']
name_th: ['ตลาดนัดจตุจักร', 'ตลาดอ.ต.ก.', 'ตลาดคลองเตย', 'ตลาดสะพานขาว', 'ตลาดบางรัก']
kind: ['fresh', 'wet', 'night', 'fresh', 'wet']
Written crs_audit.csv
<stdin>:60: Pandas4Warning: For backward compatibility, 'str' dtypes are included by select_dtypes when 'object' dtype is specified. This behavior is deprecated and will be removed in a future version. Explicitly pass 'str' to `include` to select them, or to `exclude` to remove them and silence this warning. See https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes for details on how to write code that works with pandas 2 and 3.
0:56
The results look good. Looking at the output:

1. **parcels**: CRS=EPSG:24047 (Indian 1975 / UTM zone 47N), Polygon, 4000 features, sample (657421.5, 1509162.4) in metric coords. Encoding detected as latin1-mojibake because the Thai text in `name_th` shows garbled characters like `à¹\x81à¸\x9bลà¸\x87à¸\x97ีà¹\x88à¸\x94ิà¸\x99` instead of proper Thai script.
2. **roads**: CRS=EPSG:32647 (WGS 84 / UTM zone 47N), LineString, 5000 features, sample (656835.7, 1509487.0) in metric coords. Encoding detected as latin1-mojibake for the same reason.
3. **markets**: CRS=EPSG:4326 (WGS 84), Point, 1000 features, sample (100.4533, 13.655) in decimal degrees. Encoding detected as utf-8 because the Thai text renders correctly: 'ตลาดนัดจตุจักร', 'ตลาดอ.ต.ก.', etc.

This makes sense: the Thai text in the parcels and roads layers appears to be UTF-8 that was decoded as Latin-1 (the `à¹\x81` etc. patterns are classic UTF-8 bytes misinterpreted as Latin-1), while the markets layer has proper Thai text. Let me verify the CSV was written correctly.
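The "UTF-8 bytes misinterpreted as Latin-1" diagnosis can be tested directly with a round trip instead of counting suspicious characters. The sketch below is illustrative only and is not the check the session ran; the function name `repair_latin1_mojibake` is hypothetical, and it assumes the stored strings are a clean Latin-1 decoding of the original UTF-8 bytes.

```python
def repair_latin1_mojibake(text):
    """Return the repaired string if text looks like UTF-8 read as Latin-1, else None.

    Assumption: mojibake here means exactly one wrong decode step
    (UTF-8 bytes decoded as Latin-1), with no characters lost.
    """
    try:
        # Undo the wrong decode, then apply the right one.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not encodable as Latin-1 (e.g. real Thai) or not valid UTF-8 afterwards.
        return None
    # Pure ASCII survives the round trip unchanged and is not mojibake.
    return repaired if repaired != text else None


good = "ตลาดบางรัก"
bad = good.encode("utf-8").decode("latin-1")  # simulate the defect
print(repair_latin1_mojibake(bad) == good)
print(repair_latin1_mojibake(good))
```

A layer could then be reported as `latin1-mojibake` when any sampled non-ASCII value round-trips to a different string, which avoids a hand-tuned threshold.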
0:56
Read
/work/crs_audit.csv
layer_name,declared_crs,geometry_type,feature_count,sample_x,sample_y,encoding_detected
parcels,EPSG:24047,Polygon,4000,657421.5158199039,1509162.3701859994,latin1-mojibake
roads,EPSG:32647,LineString,5000,656835.6850029442,1509487.0031804747,latin1-mojibake
markets,EPSG:4326,Point,1000,100.4533,13.655000000000001,utf-8
0:56
Bash
ls -l /work/crs_audit.csv
-rw-r--r-- 1 agent agent 326 Jun 16 17:33 /work/crs_audit.csv
1:05
Let me verify the results against the requirements:

1. **layer_name**: parcels, roads, markets ✓
2. **declared_crs**: EPSG:24047 (parcels), EPSG:32647 (roads), EPSG:4326 (markets) ✓
3. **geometry_type**: Polygon, LineString, Point ✓
4. **feature_count**: 4000, 5000, 1000 ✓
5. **sample_x, sample_y**:
   - parcels: (657421.52, 1509162.37), metric coordinates, EPSG:24047 (Indian 1975 / UTM zone 47N) ✓
   - roads: (656835.69, 1509487.00), metric, EPSG:32647 (WGS 84 / UTM zone 47N) ✓
   - markets: (100.4533, 13.655), degrees, EPSG:4326 ✓
6. **encoding_detected**:
   - parcels: latin1-mojibake ✓ (the `à¹\x81à¸\x9b...` patterns are clearly UTF-8 Thai text decoded as latin-1)
   - roads: latin1-mojibake ✓ (same pattern)
   - markets: utf-8 ✓ (Thai text displays correctly: ตลาดนัดจตุจักร etc.)

Everything checks out! The coordinate magnitudes make sense: parcels and roads use metric UTM zone 47N coordinates around Bangkok (eastings near 657,000 m, northings near 1,509,000 m), while markets use degrees (~100.45, ~13.65). The encoding detection correctly identifies the mojibake in parcels and roads (Thai UTF-8 bytes interpreted as Latin-1) and proper UTF-8 in markets.
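The magnitude reasoning above can be written down as a small check. This is a rough sketch under stated assumptions, not a validation of the delivery: the function names and the `EXPECTED_UNITS` table are hypothetical, and the table covers only the three CRS codes seen in this file.

```python
# Assumption: hand-written lookup for the three codes in this audit only.
EXPECTED_UNITS = {
    "EPSG:4326": "degrees",   # geographic WGS 84
    "EPSG:32647": "metres",   # WGS 84 / UTM zone 47N
    "EPSG:24047": "metres",   # Indian 1975 / UTM zone 47N
}


def guess_units(x, y):
    """Guess the coordinate units from magnitude alone."""
    if -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0:
        return "degrees"
    return "metres"


def units_consistent(declared_crs, x, y):
    """True/False if the sample matches the declared units, None if the CRS is unknown."""
    expected = EXPECTED_UNITS.get(declared_crs)
    if expected is None:
        return None
    return guess_units(x, y) == expected


print(units_consistent("EPSG:24047", 657421.5158199039, 1509162.3701859994))
print(units_consistent("EPSG:4326", 100.4533, 13.655000000000001))
```

A check like this only separates degrees from metres. It cannot distinguish EPSG:24047 from EPSG:32647, since both are UTM zone 47N grids in metres, so a mislabelled datum would still pass.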
1:05