run-20260528-2225Z / dd-l2-bangkok-multicrs-audit

0:00

A contractor just shipped `bangkok_contractor_delivery.gpkg` and I need a defect cite-sheet before I sign off. Please walk every layer in the file and write one row per layer to `crs_audit.csv` with these columns: layer_name, declared_crs (formatted as `EPSG:NNNN`), geometry_type, feature_count, sample_x and sample_y for a sample coordinate taken from the layer as it sits on disk, and encoding_detected. For encoding_detected, look at the text attributes for anomalies and report either `latin1-mojibake` or `utf-8`.

0:00

System info

{
  "adapter": "openrouter",
  "model": "google/gemma-4-26b-a4b-it",
  "base_url": "https://openrouter.ai/api/v1"
}

0:10

Bash ls -F

bangkok_contractor_delivery.gpkg

0:50

Write /work/solve.py

content · 102 lines · py

import geopandas as gpd
import fiona
import pandas as pd
import numpy as np

def detect_encoding(gdf):
    """
    Heuristic check for mojibake in the layer's text columns.

    Returns "latin1-mojibake" if any string value looks like UTF-8 bytes that
    were mis-decoded as Latin-1, otherwise "utf-8". This flags suspicious
    values only; it does not prove the encoding of the underlying file.
    """
    # The task asks for exactly one of two labels: 'latin1-mojibake' or 'utf-8'.
    # Layers with no text columns, or with clean text, are reported as 'utf-8'.

    # Let's check all object columns
    str_cols = gdf.select_dtypes(include=['object']).columns
    if len(str_cols) == 0:
        return "utf-8" # Default if no text
    
    for col in str_cols:
        # Get all unique non-null values
        samples = gdf[col].dropna().unique()
        for s in samples:
            if not isinstance(s, str):
                continue
            # Mojibake here means UTF-8 bytes that were decoded as Latin-1
            # (accented Latin shows up as "Ã"/"Â" pairs, Thai as "à¸"/"à¹").
            # Round trip: if s re-encodes to Latin-1 and the bytes are valid
            # UTF-8 for a different string, it was very likely mis-decoded.
            try:
                if s.encode("latin-1").decode("utf-8") != s:
                    return "latin1-mojibake"
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
    return "utf-8"

def audit_layers(file_path):
    layers = fiona.listlayers(file_path)
    audit_results = []

    for layer in layers:
        gdf = gpd.read_file(file_path, layer=layer)
        
        # layer_name
        layer_name = layer
        
        # declared_crs (formatted as EPSG:NNNN)
        crs = gdf.crs
        # pyproj CRS has no `is_epsg`; to_epsg() returns None if no EPSG match.
        epsg = crs.to_epsg() if crs is not None else None
        if epsg is not None:
            declared_crs = f"EPSG:{epsg}"
        elif crs is not None:
            declared_crs = str(crs)  # fallback: CRS without a matching EPSG code
        else:
            declared_crs = "None"
            
        # geometry_type
        # geom_type gives one type per feature; report the first feature's type
        if not gdf.empty:
            geometry_type = gdf.geometry.geom_type.iloc[0]
        else:
            geometry_type = "None"
            
        # feature_count
        feature_count = len(gdf)
        
        # sample_x and sample_y
        sample_x = None
        sample_y = None
        if feature_count > 0:
            # Use the first stored vertex, not a centroid: a centroid is computed
            # and may not match any coordinate that actually sits on disk.
            # GeoSeries.get_coordinates() needs geopandas >= 0.13.
            coords = gdf.geometry.iloc[[0]].get_coordinates()
            if not coords.empty:
                sample_x = float(coords["x"].iloc[0])
                sample_y = float(coords["y"].iloc[0])
                
        # encoding_detected
        encoding_detected = detect_encoding(gdf)
        
        audit_results.append({
            'layer_name': layer_name,
            'declared_crs': declared_crs,
            'geometry_type': geometry_type,
            'feature_count': feature_count,
            'sample_x': sample_x,
            'sample_y': sample_y,
            'encoding_detected': encoding_detected
        })

    return pd.DataFrame(audit_results)

if __name__ == "__main__":
    file_path = 'bangkok_contractor_delivery.gpkg'
    df = audit_layers(file_path)
    df.to_csv('crs_audit.csv', index=False)
    print("Audit complete. Results written to crs_audit.csv")

Wrote /work/solve.py (3651 chars).
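
One possible cross-check on the declared_crs column, sketched under assumptions and not part of the run itself: a GeoPackage is a SQLite database, and the GeoPackage specification records each layer's declared SRS in the `gpkg_geometry_columns` and `gpkg_spatial_ref_sys` tables. Reading those tables directly shows what the file declares without going through a geometry reader. The helper name `declared_crs_by_layer` is hypothetical, and the sketch assumes the file follows the standard metadata schema.

```python
import sqlite3


def declared_crs_by_layer(conn):
    """Map each feature table to (declared CRS string, geometry type name).

    Reads only the GeoPackage metadata tables; no geometries are decoded.
    """
    rows = conn.execute(
        """
        SELECT g.table_name, g.geometry_type_name,
               s.organization, s.organization_coordsys_id
        FROM gpkg_geometry_columns AS g
        LEFT JOIN gpkg_spatial_ref_sys AS s ON s.srs_id = g.srs_id
        ORDER BY g.table_name
        """
    ).fetchall()

    result = {}
    for table, geom_type, org, code in rows:
        if org is None or code is None:
            # The layer points at an srs_id with no entry in gpkg_spatial_ref_sys.
            crs = "None"
        else:
            # Usually "EPSG:NNNN"; the spec's undefined SRS rows
            # (srs_id -1 and 0) come out as "NONE:-1" and "NONE:0".
            crs = f"{org.upper()}:{code}"
        result[table] = (crs, geom_type)
    return result
```

To use it against the delivery, open the file with `sqlite3.connect("bangkok_contractor_delivery.gpkg")` and pass the connection in. Any layer where this result disagrees with the value in `crs_audit.csv` is worth a closer look.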