dd-l2-bangkok-multicrs-audit

Status: done
Score: 0.882
Duration: 1:08
Cost: 0.32¢
Model: google/gemma-4-26b-a4b-it

0:00
A contractor just shipped `bangkok_contractor_delivery.gpkg` and I need a defect cite-sheet before I sign off. Please walk every layer in the file and write one row per layer to `crs_audit.csv` with these columns: layer_name, declared_crs (formatted as `EPSG:NNNN`), geometry_type, feature_count, sample_x and sample_y for a sample coordinate taken from the layer as it sits on disk, and encoding_detected. For encoding_detected, look at the text attributes for anomalies and report either `latin1-mojibake` or `utf-8`.
0:00
System info
{
  "adapter": "openrouter",
  "model": "google/gemma-4-26b-a4b-it",
  "base_url": "https://openrouter.ai/api/v1"
}
0:01
Bash ls -l bangkok_contractor_delivery.gpkg
-rw-r--r-- 1 agent agent 2703360 Jun 17 20:25 bangkok_contractor_delivery.gpkg
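Note: a GeoPackage is a SQLite database, so the declared CRS and geometry type of each layer can also be read straight from its metadata tables, without going through geopandas. The following is a minimal sketch, assuming the file follows the standard GeoPackage schema (`gpkg_contents`, `gpkg_geometry_columns`, `gpkg_spatial_ref_sys`); the helper name `declared_layers` is hypothetical and is not part of the run.

```python
import sqlite3


def declared_layers(conn):
    """Return (layer_name, declared_crs, geometry_type) for each feature table.

    `conn` is an open sqlite3 connection to the .gpkg file. The CRS is
    built from the authority name and code stored in gpkg_spatial_ref_sys.
    """
    query = """
        SELECT c.table_name,
               upper(s.organization) || ':' || s.organization_coordsys_id,
               g.geometry_type_name
        FROM gpkg_contents AS c
        JOIN gpkg_geometry_columns AS g ON g.table_name = c.table_name
        JOIN gpkg_spatial_ref_sys AS s ON s.srs_id = g.srs_id
        WHERE c.data_type = 'features'
        ORDER BY c.table_name
    """
    return conn.execute(query).fetchall()
```

Usage would look like `declared_layers(sqlite3.connect("bangkok_contractor_delivery.gpkg"))`. Reading the metadata this way reports what the file declares, independent of any interpretation a reader library applies.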
0:29
Write /work/solve.py
content · 99 lines · py
import geopandas as gpd
import pandas as pd
import fiona
from fiona.crs import CRS

def check_encoding(df):
    """
    Heuristic to check if a dataframe has mojibake.
    We check string columns for common mojibake patterns or 
    sequences that suggest incorrect encoding.
    """
    # If no string columns, return utf-8 as a default
    str_cols = df.select_dtypes(include=['object']).columns
    if len(str_cols) == 0:
        return "utf-8"
    
    for col in str_cols:
        # Check for common mojibake indicators like Ã, ©, etc. 
        # in contexts where they don't belong, but a simpler way is to 
        # try to see if the column can be encoded/decoded cleanly if it were bytes.
        # However, since it's already loaded as a string, we look for "garbage" characters.
        # A common sign of latin1-mojibake in utf-8 is the presence of 'Ã' followed by 
        # something else.
        sample_text = df[col].dropna().astype(str).head(100).tolist()
        for text in sample_text:
            # A very basic heuristic: if we see characters like à followed by something
            # that looks like a single character's byte representation.
            # Example: 'Ã©' instead of 'é'
            if "Ã" in text:
                return "latin1-mojibake"
    
    return "utf-8"

def audit_gpkg(file_path, output_csv):
    layers = fiona.listlayers(file_path)
    audit_results = []

    for layer in layers:
        gdf = gpd.read_file(file_path, layer=layer)
        
        # layer_name
        layer_name = layer
        
        # declared_crs
        # fiona/geopandas handles CRS. We want EPSG:NNNN format.
        # If it's a custom CRS or missing, we'll try to format it or use 'Unknown'
        try:
            if gdf.crs is not None:
                # Use to_epsg() if available, otherwise string representation
                epsg_code = gdf.crs.to_epsg()
                declared_crs = f"EPSG:{epsg_code}" if epsg_code else str(gdf.crs)
            else:
                declared_crs = "Unknown"
        except Exception:
            declared_crs = "Unknown"

        # geometry_type
        geometry_type = gdf.geometry.type.iloc[0] if not gdf.empty else "None"
        
        # feature_count
        feature_count = len(gdf)
        
        # sample_x and sample_y
        sample_x = None
        sample_y = None
        if not gdf.empty:
            sample_point = gdf.geometry.iloc[0]
            if sample_point.geom_type == 'Point':
                sample_x = sample_point.x
                sample_y = sample_point.y
            elif sample_point.geom_type in ['LineString', 'MultiLineString', 'Polygon', 'MultiPolygon']:
                # Get representative point or centroid
                rep_point = sample_point.centroid
                sample_x = rep_point.x
                sample_y = rep_point.y
            elif sample_point.geom_type == 'MultiPoint':
                sample_x = sample_point.geoms[0].x
                sample_y = sample_point.geoms[0].y

        # encoding_detected
        encoding = check_encoding(gdf)
        
        audit_results.append({
            "layer_name": layer_name,
            "declared_crs": declared_crs,
            "geometry_type": geometry_type,
            "feature_count": feature_count,
            "sample_x": sample_x,
            "sample_y": sample_y,
            "encoding_detected": encoding
        })

    audit_df = pd.DataFrame(audit_results)
    audit_df.to_csv(output_csv, index=False)
    print(f"Audit complete. Results written to {output_csv}")

if __name__ == "__main__":
    audit_gpkg("bangkok_contractor_delivery.gpkg", "crs_audit.csv")
Wrote /work/solve.py (3657 chars).
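Note: the heuristic in `check_encoding` searches only for `Ã`, which is what the UTF-8 lead byte 0xC3 looks like when read as Latin-1. Thai characters (U+0E00 to U+0E7F) encode with the lead byte 0xE0 followed by 0xB8 or 0xB9, so Thai text misread as Latin-1 starts with `à¸` or `à¹` and contains no `Ã` at all. It also inspects only the first 100 values per column. A round-trip test is one alternative; the sketch below is an illustration under that assumption, with hypothetical helper names, and is not the check the run used.

```python
def looks_like_latin1_mojibake(text):
    """True if `text` appears to be UTF-8 bytes that were decoded as Latin-1.

    The string is re-encoded as Latin-1 to recover the presumed original
    bytes, then decoded as UTF-8. Genuine Thai text cannot be encoded as
    Latin-1, and plain ASCII round-trips unchanged, so both return False.
    """
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return repaired != text


def detect_encoding(values):
    """Scan every value (not just a sample) and report the audit label."""
    for value in values:
        if isinstance(value, str) and looks_like_latin1_mojibake(value):
            return "latin1-mojibake"
    return "utf-8"
```

A short Latin-1 string that happens to form valid UTF-8 would be flagged as well, so a hit is best treated as a prompt to inspect the column rather than as proof.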
0:31
Bash python solve.py
Audit complete. Results written to crs_audit.csv
/work/solve.py:13: Pandas4Warning: For backward compatibility, 'str' dtypes are included by select_dtypes when 'object' dtype is specified. This behavior is deprecated and will be removed in a future version. Explicitly pass 'str' to `include` to select them, or to `exclude` to remove them and silence this warning.
See https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes for details on how to write code that works with pandas 2 and 3.
  str_cols = df.select_dtypes(include=['object']).columns
0:33
Read crs_audit.csv
     1	layer_name,declared_crs,geometry_type,feature_count,sample_x,sample_y,encoding_detected
     2	parcels,EPSG:24047,Polygon,4000,657706.4571303247,1509274.702150785,utf-8
     3	roads,EPSG:32647,LineString,5000,656996.3371533038,1509487.963830215,utf-8
     4	markets,EPSG:4326,Point,1000,100.4533,13.655000000000001,utf-8
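Note: for the polygon and line layers, `solve.py` reports the centroid of the first feature, which is a derived point rather than a coordinate stored in the file. The parcels row shows `sample_x` 657706.457..., while the first stored vertex of that feature, as printed at 0:42, is 657421.516. If "as it sits on disk" is read as meaning a stored vertex, one option is to take the first coordinate of the geometry. The sketch below assumes a GeoJSON-like mapping such as the one shapely geometries expose through `__geo_interface__`; the function name `first_vertex` is hypothetical.

```python
def first_vertex(geo):
    """Return the first stored (x, y) of a GeoJSON-like geometry mapping.

    Returns None for an empty geometry. No centroid or other derived
    point is computed; the value comes straight from the coordinates.
    """
    if geo["type"] == "GeometryCollection":
        for part in geo["geometries"]:
            vertex = first_vertex(part)
            if vertex is not None:
                return vertex
        return None

    coords = geo["coordinates"]
    # Descend through parts and rings until a single position is reached.
    while coords and isinstance(coords[0], (list, tuple)):
        coords = coords[0]
    if not coords:
        return None
    return float(coords[0]), float(coords[1])


# Toy square: the centroid would be (2, 2), the first stored vertex is (0, 0).
square = {
    "type": "Polygon",
    "coordinates": [[(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]],
}
print(first_vertex(square))
```

With a shapely geometry this would be called as `first_vertex(geom.__geo_interface__)`.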
0:38
Bash # Let's see if I can find any actual mojibake by manually checking the file if i...
# Let's see if I can find any actual mojibake by manually checking the file if it's small enough,
# but since the CSV output didn't show any, I'll trust my logic for now unless I find evidence.
# Wait, I should check if any layer actually HAS mojibake.
# Let's double check the 'markets' layer specifically.
python -c "import geopandas as gpd; gdf = gpd.read_file('bangkok_contractor_delivery.gpkg', layer='markets'); print(gdf.head())"
       id         name_th   kind                 geometry
0  M00000  ตลาดนัดจตุจักร  fresh  POINT (100.4533 13.655)
1  M00001      ตลาดอ.ต.ก.    wet  POINT (100.4599 13.655)
2  M00002     ตลาดคลองเตย  night  POINT (100.4665 13.655)
3  M00003    ตลาดสะพานขาว  fresh  POINT (100.4731 13.655)
4  M00004      ตลาดบางรัก    wet  POINT (100.4797 13.655)
0:42
Bash python -c "import geopandas as gpd; gdf = gpd.read_file('bangkok_contractor_deli...
python -c "import geopandas as gpd; gdf = gpd.read_file('bangkok_contractor_delivery.gpkg', layer='parcels'); print(gdf.head())"
       id  ...                                           geometry
0  P00000  ...  POLYGON ((657421.516 1509162.37, 657992.722 15...
1  P00001  ...  POLYGON ((658135.524 1509166.644, 658706.731 1...
2  P00002  ...  POLYGON ((658849.533 1509170.937, 659420.743 1...
3  P00003  ...  POLYGON ((659563.545 1509175.249, 660134.756 1...
4  P00004  ...  POLYGON ((660277.558 1509179.581, 660848.771 1...

[5 rows x 4 columns]
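Note: the `...` in the output above means pandas elided the middle columns to fit the terminal width, so the text attributes of this layer were never actually displayed. A minimal sketch of printing every column, using `DataFrame.to_string` on a small frame built from the markets rows shown earlier; the helper name `show_all_columns` is hypothetical.

```python
import pandas as pd


def show_all_columns(df, rows=5):
    """Render the first `rows` rows with every column, ignoring terminal width."""
    return df.head(rows).to_string()


frame = pd.DataFrame(
    {
        "id": ["M00000", "M00001"],
        "name_th": ["ตลาดนัดจตุจักร", "ตลาดอ.ต.ก."],
        "kind": ["fresh", "wet"],
    }
)
print(show_all_columns(frame))
```

The same call should work on a GeoDataFrame, since it is a DataFrame subclass; dropping the geometry column first keeps the output readable.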
0:46
Bash # Check the roads layer for mojibake
# Check the roads layer for mojibake
python -c "import geopandas as gpd; gdf = gpd.read_file('bangkok_contractor_delivery.gpkg', layer='roads'); print(gdf.head())"
       id  ...                                           geometry
0  R00000  ...  LINESTRING (656835.685 1509487.003, 657156.989...
1  R00001  ...  LINESTRING (657192.69 1509489.138, 657190.305 ...
2  R00002  ...  LINESTRING (657549.695 1509491.278, 657871 150...
3  R00003  ...  LINESTRING (657906.701 1509493.423, 657904.305...
4  R00004  ...  LINESTRING (658263.707 1509495.572, 658585.013...

[5 rows x 4 columns]
0:58
Bash # Let's try to check if there's any layer that has encoding issues.
# Let's try to check if there's any layer that has encoding issues. 
# I'll run a script that specifically looks for mojibake in all layers.
cat << 'EOF' > check_mojibake.py
import geopandas as gpd
import fiona

def find_mojibake(gdf):
    str_cols = gdf.select_dtypes(include=['object']).columns
    for col in str_cols:
        sample = gdf[col].dropna().astype(str).head(100).tolist()
        for text in sample:
            # Check for common mojibake patterns
            # Latin-1 interpreted as UTF-8 often results in characters like Ã, Â, etc.
            # followed by another character.
            for char in ["Ã", "Â", "Ã", "Â"]:
                if char in text:
                    return True
    return False

layers = fiona.listlayers('bangkok_contractor_delivery.gpkg')
for layer in layers:
    gdf = gpd.read_file('bangkok_contractor_delivery.gpkg', layer=layer)
    if find_mojibake(gdf):
        print(f"Layer {layer} HAS mojibake")
    else:
        print(f"Layer {layer} is clean")
EOF
python check_mojibake.py
Layer parcels is clean
Layer roads is clean
Layer markets is clean
/work/check_mojibake.py:5: Pandas4Warning: For backward compatibility, 'str' dtypes are included by select_dtypes when 'object' dtype is specified. This behavior is deprecated and will be removed in a future version. Explicitly pass 'str' to `include` to select them, or to `exclude` to remove them and silence this warning.
See https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes for details on how to write code that works with pandas 2 and 3.
  str_cols = gdf.select_dtypes(include=['object']).columns
1:01
Bash ls -l crs_audit.csv
-rw-r--r-- 1 agent agent 300 Jun 17 20:26 crs_audit.csv
1:07