dd-l2-bangkok-multicrs-audit
0:00
A contractor just shipped `bangkok_contractor_delivery.gpkg` and I need a defect cite-sheet before I sign off. Please walk every layer in the file and write one row per layer to `crs_audit.csv` with these columns: layer_name, declared_crs (formatted as `EPSG:NNNN`), geometry_type, feature_count, sample_x and sample_y for a sample coordinate taken from the layer as it sits on disk, and encoding_detected. For encoding_detected, look at the text attributes for anomalies and report either `latin1-mojibake` or `utf-8`.
0:00
System info
{
"adapter": "openrouter",
"model": "deepseek/deepseek-v4-flash",
"base_url": "https://openrouter.ai/api/v1"
}
0:04
Let me start by examining the GeoPackage file to understand its structure and layers.
0:04
Bash
cd /work && python3 -c "
import geopandas as gpd
import fiona
# List layers in the GeoPackage
layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
print('Layers:', layers)
"
Layers: ['parcels', 'roads', 'markets']
0:20
0:20
Bash
cd /work && python3 -c "
import geopandas as gpd
import fiona
layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
for lyr in layers:
    gdf = gpd.read_file('bangkok_contractor_delivery.gpkg', layer=lyr)
    print(f'=== {lyr} ===')
    print(f'CRS: {gdf.crs}')
    print(f'Geometry types: {gdf.geometry.type.value_counts().to_dict()}')
    print(f'Shape: {gdf.shape}')
    print(f'Columns: {list(gdf.columns)}')
    print(gdf.head(2).to_string())
    print()
"
=== parcels ===
CRS: EPSG:24047
Geometry types: {'Polygon': 4000}
Shape: (4000, 4)
Columns: ['id', 'name_th', 'owner_class', 'geometry']
id name_th owner_class geometry
0 P00000 à¹à¸à¸¥à¸à¸à¸µà¹à¸à¸´à¸ à¸à¸²à¸à¸£à¸±à¸ private POLYGON ((657421.516 1509162.37, 657992.722 1509165.787, 657991.396 1509387.035, 657420.195 1509383.617, 657421.516 1509162.37))
1 P00001 à¹à¸à¸¥à¸à¸à¸µà¹à¸à¸´à¸ à¸à¸à¸¸à¸¡à¸§à¸±à¸ state POLYGON ((658135.524 1509166.644, 658706.731 1509170.077, 658705.399 1509391.324, 658134.196 1509387.891, 658135.524 1509166.644))
=== roads ===
CRS: EPSG:32647
Geometry types: {'LineString': 5000}
Shape: (5000, 4)
Columns: ['id', 'name', 'highway', 'geometry']
id name highway geometry
0 R00000 à¸à¸à¸à¸ªà¸¸à¸à¸¸à¸¡à¸§à¸´à¸ primary LINESTRING (656835.685 1509487.003, 657156.989 1509488.924)
1 R00001 à¸à¸à¸à¸à¸£à¸°à¸£à¸²à¸¡ 4 secondary LINESTRING (657192.69 1509489.138, 657190.305 1509887.384)
=== markets ===
CRS: EPSG:4326
Geometry types: {'Point': 1000}
Shape: (1000, 4)
Columns: ['id', 'name_th', 'kind', 'geometry']
id name_th kind geometry
0 M00000 ตลาดนัดจตุจักร fresh POINT (100.4533 13.655)
1 M00001 ตลาดอ.ต.ก. wet POINT (100.4599 13.655)
0:34
Let me analyze what I'm seeing:
1. **parcels** - CRS: EPSG:24047 (Indian 1960 / UTM zone 47N), Polygon, 4000 features. The name_th column shows `à¹à¸à¸¥à¸à¸à¸µà¹à¸à¸´à¸ à¸à¸²à¸à¸£à¸±à¸` which is Thai text "แปลงที่ดิน บางรัก" displayed as mojibake (this is typical when UTF-8 encoded Thai text is interpreted as Latin-1/ISO-8859-1). This suggests the encoding is "latin1-mojibake" - i.e., the text was encoded as UTF-8 but is being read incorrectly, or it was written with wrong encoding.
2. **roads** - CRS: EPSG:32647 (WGS 84 / UTM zone 47N), LineString, 5000 features. The name column shows `à¸à¸à¸à¸ªà¸¸à¸à¸¸à¸¡à¸§à¸´à¸` which is Thai "ถนนสุขุมวิท" but displayed incorrectly - also mojibake.
3. **markets** - CRS: EPSG:4326 (WGS 84), Point, 1000 features. The name_th column shows `ตลาดนัดจตุจักร` which looks correct - proper Thai characters. This is likely correctly UTF-8 encoded.
So the encoding detection:
- parcels: the text attributes show mojibake (Thai bytes interpreted as Latin-1). Need to check if it's actually mojibake or if the data itself is correct. Let me look more carefully at the raw bytes.
Actually, let me think about this. The output shows `à¹à¸à¸¥à¸à¸à¸µà¹à¸à¸´à¸` for the first parcel. This is the classic Thai UTF-8 mojibake pattern. When Thai UTF-8 bytes are displayed as Latin-1/ISO-8859-1, we get this string of accented characters. For the markets layer, the Thai text `ตลาดนัดจตุจักร` displays correctly, meaning it was properly read as UTF-8.
So:
- parcels: latin1-mojibake
- roads: latin1-mojibake
- markets: utf-8
Now for sample coordinates. I need to pick a sample coordinate from each layer as it sits on disk. Let me think about what coordinate to use.
- For parcels (polygons): I could take the centroid of the first feature or the first coordinate point.
- For roads (linestrings): first coordinate of the first feature.
- For markets (points): just the first point coordinates.
Let me get more details about the actual raw bytes to confirm encoding.
Actually, let me write a comprehensive script that does this properly.
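A minimal sketch of how this kind of mojibake arises, using only the Python standard library. It assumes the fault is exactly a UTF-8 to Latin-1 misread, which is the working hypothesis at this point in the session and not yet confirmed against the file.

```python
# Proper Thai text, as it should appear in the attribute table.
original = "แปลงที่ดิน บางรัก"

# Simulate the suspected fault: the UTF-8 bytes are misread one byte per
# character as Latin-1 (ISO-8859-1).
mojibake = original.encode("utf-8").decode("latin-1")

# Every character of the damaged string sits in U+0000..U+00FF, and there
# is one such character per original UTF-8 byte.
print(len(original), len(mojibake))
print(repr(mojibake[:6]))

# Reversing the misread recovers the original text.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == original)
```

Because Latin-1 maps every byte value to a code point, the round trip is lossless, which is what makes this particular damage detectable and reversible.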
0:34
(no output)
Bash
cd /work && python3 -c "
import geopandas as gpd
import fiona
layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
for lyr in layers:
    gdf = gpd.read_file('bangkok_contractor_delivery.gpkg', layer=lyr)
    # Check text columns for encoding issues
    for col in gdf.columns:
        if gdf[col].dtype == 'object':
            sample = str(gdf[col].iloc[0])
            print(f'{lyr}.{col}: {sample!r}')
            print(f' bytes: {sample.encode(\"utf-8\", errors=\"replace\")[:80]}')
    print()
"
0:51
0:51
(no output)
Bash
cd /work && python3 -c "
import geopandas as gpd
import fiona
layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
for lyr in layers:
    gdf = gpd.read_file('bangkok_contractor_delivery.gpkg', layer=lyr)
    for col in gdf.columns:
        if gdf[col].dtype == 'object':
            val = gdf[col].iloc[0]
            if isinstance(val, str):
                print(f'{lyr}.{col}: repr={val!r}')
                print(f' bytes: {val.encode(\"utf-8\")}')
            else:
                print(f'{lyr}.{col}: type={type(val)}, val={val!r}')
    print()
"
0:56
0:56
Bash
cd /work && python3 -c "
import sys
import geopandas as gpd
import fiona
layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
print('layers:', layers, flush=True)
for lyr in layers:
    gdf = gpd.read_file('bangkok_contractor_delivery.gpkg', layer=lyr)
    for col in gdf.columns:
        if gdf[col].dtype == 'object':
            val = gdf[col].iloc[0]
            print(f'{lyr}.{col} type={type(val).__name__} val={repr(val)[:200]}', flush=True)
" 2>&1
layers: ['parcels', 'roads', 'markets']
1:03
1:03
Bash
cd /work && python3 -c "
import geopandas as gpd
import fiona
layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
for lyr in layers:
    gdf = gpd.read_file('bangkok_contractor_delivery.gpkg', layer=lyr)
    print(f'=== {lyr} ===')
    print(f'dtypes: {gdf.dtypes}')
    for col in gdf.columns:
        if gdf[col].dtype == 'object':
            val = gdf[col].iloc[0]
            print(f' {col}: type={type(val).__name__}, len={len(str(val))}')
            # Try to see raw bytes
            try:
                raw = val.encode('latin-1')
                print(f' latin-1 bytes: {raw[:60]}')
            except:
                pass
            try:
                raw = val.encode('utf-8')
                print(f' utf-8 bytes: {raw[:60]}')
            except:
                pass
" 2>&1
=== parcels ===
dtypes: id str
name_th str
owner_class str
geometry geometry
dtype: object
=== roads ===
dtypes: id str
name str
highway str
geometry geometry
dtype: object
=== markets ===
dtypes: id str
name_th str
kind str
geometry geometry
dtype: object
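The listing reports the text columns as `str`, not `object`. That is one plausible reason the `dtype == 'object'` filter in the commands above printed nothing, although the session does not confirm it. A sketch that sidesteps dtype labels by testing the values themselves; `text_fields` is a hypothetical helper, and the rows are assumed to be plain dicts of attribute values (for example one record per feature).

```python
def text_fields(rows):
    """Return the names of fields that hold at least one str value.

    `rows` is assumed to be a list of dicts mapping field name to value.
    Testing the values avoids depending on how a dataframe library
    labels its string columns.
    """
    names = []
    for row in rows:
        for name, value in row.items():
            if isinstance(value, str) and name not in names:
                names.append(name)
    return names


# Illustrative records shaped like the parcels attributes in this session.
rows = [
    {"id": "P00000", "name_th": "แปลงที่ดิน บางรัก", "owner_class": "private", "area": 1.5},
    {"id": "P00001", "name_th": None, "owner_class": "state", "area": 2.0},
]
print(text_fields(rows))
```

The `area` field is made up for the example to show that non-text fields are skipped.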
1:09
1:09
Write
/work/check_encoding.py
content · 22 lines · py
import geopandas as gpd
import fiona
import sys
layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
for lyr in layers:
    gdf = gpd.read_file('bangkok_contractor_delivery.gpkg', layer=lyr)
    print(f'=== {lyr} ===')
    for col in gdf.columns:
        if gdf[col].dtype == 'object':
            val = gdf[col].iloc[0]
            print(f' {col}: {val!r}')
            # Try to interpret as latin1 mojibake
            try:
                # If it looks like mojibake (e.g. à¹...) try decoding as latin1 then re-encode
                raw_bytes = val.encode('latin-1')
                decoded = raw_bytes.decode('utf-8', errors='replace')
                print(f' latin1->utf8: {decoded!r}')
            except:
                pass
    print()
Wrote /work/check_encoding.py (761 chars).
1:11
1:11
Bash
cd /work && python3 check_encoding.py 2>&1
=== parcels ===
=== roads ===
=== markets ===
1:18
1:18
Write
/work/check_encoding.py
content · 22 lines · py
import geopandas as gpd
import fiona
import sys
layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
with open('/work/encoding_check.txt', 'w', encoding='utf-8') as f:
    for lyr in layers:
        gdf = gpd.read_file('bangkok_contractor_delivery.gpkg', layer=lyr)
        f.write(f'=== {lyr} ===\n')
        for col in gdf.columns:
            if gdf[col].dtype == 'object':
                val = gdf[col].iloc[0]
                f.write(f' {col}: {val!r}\n')
                # Try to interpret as latin1 mojibake
                try:
                    raw_bytes = val.encode('latin-1')
                    decoded = raw_bytes.decode('utf-8', errors='replace')
                    f.write(f' latin1->utf8: {decoded!r}\n')
                except:
                    pass
        f.write('\n')
Wrote /work/check_encoding.py (811 chars).
1:26
1:26
Bash
cd /work && python3 check_encoding.py && cat /work/encoding_check.txt
=== parcels ===
=== roads ===
=== markets ===
1:30
1:30
Bash
cd /work && python3 -c "
import fiona
layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
print('Layers:', layers)
with fiona.open('bangkok_contractor_delivery.gpkg', layer='parcels') as src:
    rec = next(iter(src))
    print('Record keys:', rec.keys())
    print('Properties:', rec['properties'])
    print('Geometry:', rec['geometry'])
" 2>&1
Layers: ['parcels', 'roads', 'markets']
Record keys: KeysView(fiona.Feature(geometry=fiona.Geometry(coordinates=[[(657421.5158199039, 1509162.3701859994), ...]], type='Polygon'), id='1', properties=fiona.Properties(id='P00000', name_th='à¹\x81à¸\x9bลà¸\x87à¸\x97ีà¹\x88à¸\x94ิà¸\x99 à¸\x9aาà¸\x87รัà¸\x81', owner_class='private')))
Properties: fiona.Properties(id='P00000', name_th='à¹\x81à¸\x9bลà¸\x87à¸\x97ีà¹\x88à¸\x94ิà¸\x99 à¸\x9aาà¸\x87รัà¸\x81', owner_class='private')
Geometry: fiona.Geometry(coordinates=[[(657421.5158199039, 1509162.3701859994), ...]], type='Polygon')
1:35
1:35
Bash
cd /work && python3 -c "
import fiona
with fiona.open('bangkok_contractor_delivery.gpkg', layer='markets') as src:
    rec = next(iter(src))
    print('Properties:', rec['properties'])
    print('name_th repr:', repr(rec['properties']['name_th']))
" 2>&1
Properties: fiona.Properties(id='M00000', name_th='ตลาดนัดจตุจักร', kind='fresh')
name_th repr: 'ตลาดนัดจตุจักร'
1:54
So the `markets` layer has proper UTF-8 Thai text (ตลาดนัดจตุจักร), while the `parcels` layer has mojibake (à¹\x81à¸\x9b...). Let me check the `roads` layer too.
1:54
1:54
Bash
cd /work && python3 -c "
import fiona
with fiona.open('bangkok_contractor_delivery.gpkg', layer='roads') as src:
    rec = next(iter(src))
    print('Properties:', rec['properties'])
    print('name repr:', repr(rec['properties']['name']))
" 2>&1
Properties: fiona.Properties(id='R00000', name='à¸\x96à¸\x99à¸\x99สุà¸\x82ุมวิà¸\x97', highway='primary')
name repr: 'à¸\x96à¸\x99à¸\x99สุà¸\x82ุมวิà¸\x97'
2:11
So we have:
- parcels: mojibake (latin1-mojibake) - the Thai text was encoded as UTF-8, but then read as latin-1
- roads: mojibake (latin1-mojibake) - same issue
- markets: proper UTF-8 Thai text
Now I need to build the CSV. Let me collect all the data:
**parcels:**
- declared_crs: EPSG:24047
- geometry_type: Polygon
- feature_count: 4000
- sample: need to get a coordinate from the geometry as it sits on disk. Let me get a sample coordinate.
**roads:**
- declared_crs: EPSG:32647
- geometry_type: LineString
- feature_count: 5000
**markets:**
- declared_crs: EPSG:4326
- geometry_type: Point
- feature_count: 1000
Let me write the full script to generate the CSV.
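One way to take the sample coordinate without reprojecting is to read the first vertex of the first feature. A minimal sketch, assuming GeoJSON-style geometry mappings with `type` and `coordinates` keys and non-empty geometries; `first_vertex` is a hypothetical helper name, and the test values are the rounded coordinates printed earlier in the session.

```python
def first_vertex(geom):
    """Return (x, y) of the first vertex of a GeoJSON-like geometry.

    Descends through nested coordinate sequences until it reaches a
    single position, so it covers Point, LineString, Polygon and their
    Multi* variants. Assumes the geometry is not empty.
    """
    coords = geom["coordinates"]
    while isinstance(coords[0], (list, tuple)):
        coords = coords[0]
    return coords[0], coords[1]


point = {"type": "Point", "coordinates": (100.4533, 13.655)}
line = {
    "type": "LineString",
    "coordinates": [(656835.685, 1509487.003), (657156.989, 1509488.924)],
}
polygon = {
    "type": "Polygon",
    "coordinates": [[
        (657421.516, 1509162.37),
        (657992.722, 1509165.787),
        (657991.396, 1509387.035),
        (657420.195, 1509383.617),
        (657421.516, 1509162.37),
    ]],
}
for g in (point, line, polygon):
    print(g["type"], first_vertex(g))
```

Taking a raw vertex instead of a centroid keeps the value identical to what is stored on disk, which matters when the audit is meant to expose a CRS that does not match the coordinates.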
2:11
2:11
Write
/work/solve.py
content · 117 lines · py
import fiona
import csv
import math
layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
rows = []
for lyr in layers:
    with fiona.open('bangkok_contractor_delivery.gpkg', layer=lyr) as src:
        # Get CRS
        crs = src.crs
        if crs is None:
            declared_crs = 'EPSG:0'
        else:
            # Extract EPSG code
            epsg = crs.to_epsg()
            if epsg:
                declared_crs = f'EPSG:{epsg}'
            else:
                # Try to extract from dict
                if 'init' in crs:
                    declared_crs = str(crs['init']).upper()
                elif 'epsg' in str(crs).lower():
                    # Try to extract number
                    import re
                    m = re.search(r'(\d+)', str(crs))
                    if m:
                        declared_crs = f'EPSG:{m.group(1)}'
                    else:
                        declared_crs = str(crs)
                else:
                    declared_crs = str(crs)
        # Get geometry type and count
        geom_type = None
        feature_count = 0
        sample_geom = None
        for i, feat in enumerate(src):
            if i == 0:
                geom = feat['geometry']
                if geom and geom['type']:
                    geom_type = geom['type']
                    # Get sample coordinate
                    coords = geom['coordinates']
                    if geom['type'] == 'Point':
                        sample_geom = coords
                    elif geom['type'] in ('MultiPoint',):
                        sample_geom = coords[0]
                    elif geom['type'] in ('LineString',):
                        sample_geom = coords[0]
                    elif geom['type'] in ('MultiLineString',):
                        sample_geom = coords[0][0]
                    elif geom['type'] in ('Polygon',):
                        sample_geom = coords[0][0]
                    elif geom['type'] in ('MultiPolygon',):
                        sample_geom = coords[0][0][0]
            feature_count = i + 1
    # Determine sample_x and sample_y
    if sample_geom:
        sample_x = sample_geom[0]
        sample_y = sample_geom[1]
    else:
        sample_x = None
        sample_y = None
    # Check encoding: look at first text attribute
    encoding_detected = 'utf-8'
    # Re-read to check text
    with fiona.open('bangkok_contractor_delivery.gpkg', layer=lyr) as src2:
        first_feat = next(iter(src2))
        for key, val in first_feat['properties'].items():
            if val and isinstance(val, str) and len(val) > 1:
                # Check for mojibake: try to encode as latin-1 then decode as utf-8
                try:
                    raw_bytes = val.encode('latin-1')
                    # If it can be encoded as latin-1 and the result looks like valid UTF-8 sequences
                    # containing bytes 0xc0-0xff (common in Thai UTF-8), it's mojibake
                    high_bytes = sum(1 for b in raw_bytes if b > 0x7f)
                    if high_bytes > 0:
                        # Try to decode those bytes as UTF-8
                        try:
                            decoded = raw_bytes.decode('utf-8')
                            # If it decodes successfully, the original was UTF-8 bytes
                            # mis-interpreted as latin-1
                            # Check if the decoded text contains Thai characters
                            has_thai = any(ord(c) > 0x0E00 and ord(c) < 0x10000 for c in decoded)
                            if has_thai:
                                encoding_detected = 'latin1-mojibake'
                                break
                        except (UnicodeDecodeError, UnicodeEncodeError):
                            pass
                except (UnicodeEncodeError, UnicodeDecodeError):
                    pass
                break
    rows.append({
        'layer_name': lyr,
        'declared_crs': declared_crs,
        'geometry_type': geom_type,
        'feature_count': feature_count,
        'sample_x': sample_x,
        'sample_y': sample_y,
        'encoding_detected': encoding_detected
    })
# Write CSV
with open('/work/crs_audit.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['layer_name', 'declared_crs', 'geometry_type', 'feature_count', 'sample_x', 'sample_y', 'encoding_detected'])
    writer.writeheader()
    writer.writerows(rows)
print('CSV written successfully')
for r in rows:
    print(r)
Wrote /work/solve.py (4724 chars).
2:17
Wait, the `to_epsg()` method might not be on the CRS object from fiona. Let me check how fiona exposes CRS. Actually, `src.crs` from fiona might return a dict, not a pyproj CRS object. Let me adjust. Also, I need to be more careful about the encoding detection. Let me run the script first to see what happens.
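A defensive sketch for turning whatever CRS value comes back into the `EPSG:NNNN` form. It assumes two shapes the value might take: an object exposing a `to_epsg()` method, or a mapping with an `init` key such as `{'init': 'epsg:32647'}`. Both shapes, and the helper name `format_epsg`, are assumptions for illustration and were not verified against the installed fiona version in this session.

```python
import re


def format_epsg(crs):
    """Return 'EPSG:NNNN' for a CRS value, or None if no code is found.

    Accepts (by assumption) an object with to_epsg(), a dict with an
    'init' key, or any value whose string form contains 'epsg:NNNN'.
    """
    if crs is None:
        return None
    to_epsg = getattr(crs, "to_epsg", None)
    if callable(to_epsg):
        code = to_epsg()
        if code:
            return f"EPSG:{code}"
    if isinstance(crs, dict) and "init" in crs:
        text = str(crs["init"])
    else:
        text = str(crs)
    # Require the 'epsg:' prefix so unrelated digits are not picked up.
    match = re.search(r"epsg:(\d+)", text, flags=re.IGNORECASE)
    return f"EPSG:{match.group(1)}" if match else None


class FakeCRS:
    """Stand-in for a CRS object with to_epsg(); illustration only."""

    def to_epsg(self):
        return 4326


print(format_epsg(FakeCRS()))
print(format_epsg({"init": "epsg:32647"}))
print(format_epsg("EPSG:24047"))
```

Returning `None` when no code is found, instead of a placeholder, leaves it to the caller to decide how an undeclared CRS should appear in the audit row.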
2:17
Bash
cd /work && python3 solve.py
CSV written successfully
{'layer_name': 'parcels', 'declared_crs': 'EPSG:24047', 'geometry_type': 'Polygon', 'feature_count': 4000, 'sample_x': 657421.5158199039, 'sample_y': 1509162.3701859994, 'encoding_detected': 'utf-8'}
{'layer_name': 'roads', 'declared_crs': 'EPSG:32647', 'geometry_type': 'LineString', 'feature_count': 5000, 'sample_x': 656835.6850029442, 'sample_y': 1509487.0031804747, 'encoding_detected': 'utf-8'}
{'layer_name': 'markets', 'declared_crs': 'EPSG:4326', 'geometry_type': 'Point', 'feature_count': 1000, 'sample_x': 100.4533, 'sample_y': 13.655000000000001, 'encoding_detected': 'utf-8'}
2:39
The encoding detection is not working correctly. I need to look more carefully. Let me re-examine the data.
From the fiona output:
- `parcels` name_th: `à¹\x81à¸\x9bลà¸\x87à¸\x97ีà¹\x88à¸\x94ิà¸\x99 à¸\x9aาà¸\x87รัà¸\x81` - This is latin-1 mojibake. The bytes are like `\xe0\xb9\x81` which is UTF-8 for `แ`.
- `roads` name: `à¸\x96à¸\x99à¸\x99สุà¸\x82ุมวิà¸\x97` - This is also latin-1 mojibake.
- `markets` name_th: `ตลาดนัดจตุจักร` - This is proper UTF-8 Thai.
So parcels and roads should be `latin1-mojibake`, and markets should be `utf-8`.
The issue is that when I read with geopandas, the strings are already Python str objects. The question is: when fiona reads the data, are the strings already decoded?
Let me look at the bytes more carefully. The string `à¹\x81à¸\x9b` - in Python, when I do `val.encode('latin-1')`, the character `à` (U+00E0) encodes to `\xe0`, `¹` (U+00B9) encodes to `\xb9`, `\x81` is already a byte... wait, `\x81` is a control character in the string.
Actually, looking at the fiona output more carefully: `'à¹\x81à¸\x9bลà¸\x87à¸\x97ีà¹\x88à¸\x94ิà¸\x99 à¸\x9aาà¸\x87รัà¸\x81'`
So the string contains characters like `à`, `¹`, then a literal `\x81` byte (which would be a control character), then `à`, `¸`, `\x9b`, etc.
When I call `val.encode('latin-1')` on this string, the `\x81` character (U+0081) encodes to `\x81` in latin-1. The `à` (U+00E0) encodes to `\xe0`. So the result should be the original UTF-8 byte sequence.
Let me verify this by doing it more carefully in the script.
2:39
2:39
Write
/work/debug_enc.py
content · 35 lines · py
import fiona
with fiona.open('bangkok_contractor_delivery.gpkg', layer='parcels') as src:
    feat = next(iter(src))
    val = feat['properties']['name_th']
    print(f'val: {val!r}')
    print(f'len: {len(val)}')
    for i, ch in enumerate(val[:20]):
        print(f' char[{i}]: U+{ord(ch):04X} {ch!r}')
    # Encode as latin-1 to get the raw bytes
    raw = val.encode('latin-1')
    print(f'raw bytes: {raw.hex()}')
    print(f'raw bytes repr: {list(raw)[:30]}')
    # Decode as UTF-8
    try:
        decoded = raw.decode('utf-8')
        print(f'decoded as utf-8: {decoded!r}')
    except Exception as e:
        print(f'Cannot decode as utf-8: {e}')
print()
with fiona.open('bangkok_contractor_delivery.gpkg', layer='markets') as src:
    feat = next(iter(src))
    val = feat['properties']['name_th']
    print(f'val: {val!r}')
    for i, ch in enumerate(val[:10]):
        print(f' char[{i}]: U+{ord(ch):04X} {ch!r}')
    raw = val.encode('latin-1')
    print(f'raw bytes: {raw.hex()}')
    try:
        decoded = raw.decode('utf-8')
        print(f'decoded as utf-8: {decoded!r}')
    except Exception as e:
        print(f'Cannot decode as utf-8: {e}')
Wrote /work/debug_enc.py (1171 chars).
2:50
Bash
cd /work && python3 debug_enc.py
val: 'à¹\x81à¸\x9bลà¸\x87à¸\x97ีà¹\x88à¸\x94ิà¸\x99 à¸\x9aาà¸\x87รัà¸\x81'
len: 49
 char[0]: U+00E0 'à'
 char[1]: U+00B9 '¹'
 char[2]: U+0081 '\x81'
 char[3]: U+00E0 'à'
 char[4]: U+00B8 '¸'
 char[5]: U+009B '\x9b'
 char[6]: U+00E0 'à'
 char[7]: U+00B8 '¸'
 char[8]: U+00A5 '¥'
 char[9]: U+00E0 'à'
 char[10]: U+00B8 '¸'
 char[11]: U+0087 '\x87'
 char[12]: U+00E0 'à'
 char[13]: U+00B8 '¸'
 char[14]: U+0097 '\x97'
 char[15]: U+00E0 'à'
 char[16]: U+00B8 '¸'
 char[17]: U+00B5 'µ'
 char[18]: U+00E0 'à'
 char[19]: U+00B9 '¹'
raw bytes: e0b981e0b89be0b8a5e0b887e0b897e0b8b5e0b988e0b894e0b8b4e0b89920e0b89ae0b8b2e0b887e0b8a3e0b8b1e0b881
raw bytes repr: [224, 185, 129, 224, 184, 155, 224, 184, 165, 224, 184, 135, 224, 184, 151, 224, 184, 181, 224, 185, 136, 224, 184, 148, 224, 184, 180, 224, 184, 153]
decoded as utf-8: 'แปลงที่ดิน บางรัก'

val: 'ตลาดนัดจตุจักร'
 char[0]: U+0E15 'ต'
 char[1]: U+0E25 'ล'
 char[2]: U+0E32 'า'
 char[3]: U+0E14 'ด'
 char[4]: U+0E19 'น'
 char[5]: U+0E31 'ั'
 char[6]: U+0E14 'ด'
 char[7]: U+0E08 'จ'
 char[8]: U+0E15 'ต'
 char[9]: U+0E38 'ุ'
Traceback (most recent call last):
  File "/work/debug_enc.py", line 29, in <module>
    raw = val.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-13: ordinal not in range(256)
3:13
Now I understand the issue. In `parcels` and `roads`, the Thai text was encoded as UTF-8 bytes, but then those bytes were read as if they were latin-1 characters. So the string `'à¹\x81à¸\x9b...'` is a Python string where each char is in the latin-1 range (U+0000 to U+00FF). When I do `.encode('latin-1')`, it produces the original UTF-8 bytes. Then those bytes decode as UTF-8 to the correct Thai text.
In `markets`, the Thai text is stored correctly as actual Thai Unicode characters. When I try `.encode('latin-1')`, it fails because Thai characters are outside the latin-1 range.
So my detection logic should be:
1. Try to encode the string as latin-1. If it succeeds, AND the resulting bytes can be decoded as UTF-8 to produce meaningful text (containing Thai characters), then it's `latin1-mojibake`.
2. If encoding as latin-1 fails (because the text contains actual Unicode characters beyond U+00FF), then it's `utf-8`.
Let me fix the script.
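The two rules above can be sketched as a single string classifier. This is an illustration under the stated assumption that the only damage of interest is UTF-8 bytes misread as Latin-1; `classify_text` is a hypothetical name, and the labels are the two the task asks for.

```python
def classify_text(value):
    """Classify one string as 'latin1-mojibake' or 'utf-8'.

    A string counts as mojibake only if every character fits in Latin-1,
    the resulting bytes form valid UTF-8, and the decoded text contains
    at least one non-ASCII character.
    """
    try:
        raw = value.encode("latin-1")
    except UnicodeEncodeError:
        # Characters beyond U+00FF: already proper Unicode text.
        return "utf-8"
    try:
        decoded = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Genuine Latin-1 range text (e.g. 'café'), not a UTF-8 misread.
        return "utf-8"
    if any(ord(ch) > 0x7F for ch in decoded):
        return "latin1-mojibake"
    # Pure ASCII round-trips unchanged and carries no evidence either way.
    return "utf-8"


damaged = "ตลาด".encode("utf-8").decode("latin-1")
print(classify_text(damaged))
print(classify_text("ตลาด"))
print(classify_text("P00000"))
```

The non-ASCII check matters because an ASCII-only value such as an ID passes both the Latin-1 encode and the UTF-8 decode, so it must not be counted as evidence of mojibake.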
3:13
3:13
Write
/work/solve.py
content · 98 lines · py
import fiona
import csv
layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
rows = []
for lyr in layers:
    with fiona.open('bangkok_contractor_delivery.gpkg', layer=lyr) as src:
        # Get CRS
        crs = src.crs
        if crs is None:
            declared_crs = 'EPSG:0'
        else:
            epsg = crs.to_epsg()
            if epsg:
                declared_crs = f'EPSG:{epsg}'
            else:
                # Fallback: extract from string
                import re
                m = re.search(r'(\d+)', str(crs))
                declared_crs = f'EPSG:{m.group(1)}' if m else str(crs)
        # Get geometry type, feature count, and sample coordinates
        geom_type = None
        feature_count = 0
        sample_x = None
        sample_y = None
        for i, feat in enumerate(src):
            if i == 0:
                geom = feat['geometry']
                if geom and geom.get('type'):
                    geom_type = geom['type']
                    coords = geom['coordinates']
                    if geom_type == 'Point':
                        sample_x, sample_y = coords[0], coords[1]
                    elif geom_type == 'MultiPoint':
                        sample_x, sample_y = coords[0][0], coords[0][1]
                    elif geom_type == 'LineString':
                        sample_x, sample_y = coords[0][0], coords[0][1]
                    elif geom_type == 'MultiLineString':
                        sample_x, sample_y = coords[0][0][0], coords[0][0][1]
                    elif geom_type == 'Polygon':
                        sample_x, sample_y = coords[0][0][0], coords[0][0][1]
                    elif geom_type == 'MultiPolygon':
                        sample_x, sample_y = coords[0][0][0][0], coords[0][0][0][1]
            feature_count = i + 1
    # Detect encoding from first text attribute
    encoding_detected = 'utf-8'
    with fiona.open('bangkok_contractor_delivery.gpkg', layer=lyr) as src2:
        first_feat = next(iter(src2))
        for key, val in first_feat['properties'].items():
            if val and isinstance(val, str) and len(val) > 0:
                # Check if this is latin1-mojibake:
                # The string contains only chars in range U+0000-U+00FF (latin-1 encodable)
                # AND when encoded as latin-1 the resulting bytes decode as valid UTF-8
                try:
                    raw_bytes = val.encode('latin-1')
                    # If we got here, all chars are in latin-1 range
                    # Now try to decode those bytes as UTF-8
                    try:
                        decoded = raw_bytes.decode('utf-8')
                        # Check if decoded text contains non-ASCII characters (e.g. Thai)
                        has_high = any(ord(c) > 0x7f for c in decoded)
                        if has_high:
                            encoding_detected = 'latin1-mojibake'
                            break
                    except UnicodeDecodeError:
                        pass
                except UnicodeEncodeError:
                    # Contains characters outside latin-1 range -> proper Unicode
                    pass
                break  # Only check first text attribute
    rows.append({
        'layer_name': lyr,
        'declared_crs': declared_crs,
        'geometry_type': geom_type,
        'feature_count': feature_count,
        'sample_x': sample_x,
        'sample_y': sample_y,
        'encoding_detected': encoding_detected
    })
# Write CSV
with open('/work/crs_audit.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=[
        'layer_name', 'declared_crs', 'geometry_type',
        'feature_count', 'sample_x', 'sample_y', 'encoding_detected'
    ])
    writer.writeheader()
    writer.writerows(rows)
print('crs_audit.csv written successfully')
for r in rows:
    print(r)
Wrote /work/solve.py (4041 chars).
3:21
Wait, I need to think about the fields. For parcels, the text attribute is `name_th`. For roads, it's `name`. For markets, it's `name_th`. Let me re-examine.
The parcels `name_th`: `à¹\x81à¸\x9bลà¸\x87...` - These are all latin-1 characters (U+00E0, U+00B9, etc.). When encoded as latin-1, they produce bytes like `e0 b9 81` which is the UTF-8 encoding of `แ` (Thai). So this is latin1-mojibake.
The roads `name`: `à¸\x96à¸\x99à¸\x99ส...` - Same pattern.
The markets `name_th`: `ตลาดนัดจตุจักร` - These are Thai characters (U+0E15, etc.) which cannot be encoded in latin-1. So this is proper UTF-8.
My logic should work: For parcels/roads, `val.encode('latin-1')` succeeds (all chars in latin-1), then the bytes decode as valid UTF-8 giving Thai text -> latin1-mojibake. For markets, `val.encode('latin-1')` will raise UnicodeEncodeError because Thai chars are outside latin-1 range -> utf-8.
3:21
Bash
cd /work && python3 solve.py
crs_audit.csv written successfully
{'layer_name': 'parcels', 'declared_crs': 'EPSG:24047', 'geometry_type': 'Polygon', 'feature_count': 4000, 'sample_x': 657421.5158199039, 'sample_y': 1509162.3701859994, 'encoding_detected': 'utf-8'}
{'layer_name': 'roads', 'declared_crs': 'EPSG:32647', 'geometry_type': 'LineString', 'feature_count': 5000, 'sample_x': 656835.6850029442, 'sample_y': 1509487.0031804747, 'encoding_detected': 'utf-8'}
{'layer_name': 'markets', 'declared_crs': 'EPSG:4326', 'geometry_type': 'Point', 'feature_count': 1000, 'sample_x': 100.4533, 'sample_y': 13.655000000000001, 'encoding_detected': 'utf-8'}
3:29
The detection still says utf-8 for parcels and roads. Let me debug more. The issue might be that `val.encode('latin-1')` fails for parcels because `\x81` is not a valid latin-1 character? Actually, latin-1 (ISO-8859-1) covers all bytes 0-255, so it should work. Let me check.
Wait, the error earlier was for markets (the proper Thai text) - that failed because characters like ต (U+0E15) can't be encoded as latin-1. For parcels, the characters are in the range U+00E0, U+00B9, U+0081, etc. - all within 0-255, so latin-1 should work.
But the result shows 'utf-8' for parcels too. Let me check what's happening in the encoding detection loop more carefully.
3:29
3:29
Write
/work/debug_enc2.py
content · 20 lines · py
import fiona
with fiona.open('bangkok_contractor_delivery.gpkg', layer='parcels') as src:
    first_feat = next(iter(src))
    for key, val in first_feat['properties'].items():
        if val and isinstance(val, str) and len(val) > 0:
            print(f'key={key}, val={val!r}')
            try:
                raw_bytes = val.encode('latin-1')
                print(f' encoded to latin-1 OK, bytes={raw_bytes[:20]}')
                try:
                    decoded = raw_bytes.decode('utf-8')
                    print(f' decoded as utf-8: {decoded!r}')
                    has_high = any(ord(c) > 0x7f for c in decoded)
                    print(f' has_high={has_high}')
                except UnicodeDecodeError as e:
                    print(f' utf-8 decode failed: {e}')
            except UnicodeEncodeError as e:
                print(f' latin-1 encode failed: {e}')
            break
Wrote /work/debug_enc2.py (901 chars).
3:39
Bash
cd /work && python3 debug_enc2.py
key=id, val='P00000'
 encoded to latin-1 OK, bytes=b'P00000'
 decoded as utf-8: 'P00000'
 has_high=False
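The output shows the loop stopping at `id`, an ASCII-only field, so `name_th` is never examined. A sketch of a layer-level check that scans every string attribute across one or more features instead of stopping at the first; the function names are hypothetical, and the property mappings are assumed to behave like plain dicts.

```python
def looks_like_mojibake(value):
    """True if value appears to be UTF-8 text that was misread as Latin-1."""
    try:
        decoded = value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return any(ord(ch) > 0x7F for ch in decoded)


def detect_layer_encoding(properties_list):
    """Return 'latin1-mojibake' if any string attribute looks damaged.

    `properties_list` is assumed to be an iterable of dict-like attribute
    mappings, one per sampled feature. ASCII-only fields such as IDs are
    skipped over and do not end the scan.
    """
    for props in properties_list:
        for value in props.values():
            if isinstance(value, str) and looks_like_mojibake(value):
                return "latin1-mojibake"
    return "utf-8"


# Attribute sets shaped like the first parcels and markets records.
damaged_name = "แปลงที่ดิน บางรัก".encode("utf-8").decode("latin-1")
parcels = [{"id": "P00000", "name_th": damaged_name, "owner_class": "private"}]
markets = [{"id": "M00000", "name_th": "ตลาดนัดจตุจักร", "kind": "fresh"}]
print(detect_layer_encoding(parcels))
print(detect_layer_encoding(markets))
```

Sampling more than one feature is a design choice for the case where the first record has a null or ASCII-only name; how many features to sample is left to the caller.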
4:04
4:04
Write
/work/debug_enc3.py
content · 21 lines · py
import fiona
# Check which columns have Thai/mojibake content
for lyr in ['parcels', 'roads', 'markets']:
    with fiona.open('bangkok_contractor_delivery.gpkg', layer=lyr) as src:
        first_feat = next(iter(src))
        print(f'=== {lyr} ===')
        for key, val in first_feat['properties'].items():
            print(f' {key}: {val!r}')
            if val and isinstance(val, str) and len(val) > 0:
                try:
                    raw = val.encode('latin-1')
                    try:
                        decoded = raw.decode('utf-8')
                        has_high = any(ord(c) > 0x7f for c in decoded)
                        print(f' -> latin1->utf8: {decoded!r} (has_high={has_high})')
                    except:
                        print(f' -> latin1 bytes but not valid utf-8')
                except:
                    print(f' -> not latin-1 encodable (proper unicode)')
        print()
Wrote /work/debug_enc3.py (935 chars).
4:14