skills/narenaryan/agent-skills/python-unicode-gotchas

python-unicode-gotchas

Installation
SKILL.md

Python Unicode Gotchas

str = code points, bytes = encoded output. Visually identical strings can differ in length (precomposed vs combining). str + bytesTypeError; decode at input, encode at output.

Error handlers

Handler dec enc Effect
strict (default) raises UnicodeError
ignore drops offending char
replace U+FFFD / ?
backslashreplace \x80, \uXXXX
xmlcharrefreplace &#NNN;
namereplace \N{NAME}
surrogateescape lossless round-trip of unknown bytes
surrogatepass allows lone surrogates in UTF-8/16/32

surrogateescape maps bytes 0x80–0xFFU+DC80–U+DCFF. Use for read-then-write-back of files with unknown encoding.

File I/O

open(p, encoding='utf-8-sig')                      # strip UTF-8 BOM on read
open(p, encoding='ascii', errors='surrogateescape') # round-trip unknown bytes

utf-16 auto-detects endianness + writes BOM; utf-16-le/-be skip BOM. Default encoding on Py3.15+ is UTF-8; on 3.10–3.14 it follows locale.getpreferredencoding() unless PYTHONUTF8=1.

Normalization

Four forms. Same visual character can be one code point or base + combining:

import unicodedata
s1 = 'ê'                                   # len 1  — precomposed
s2 = 'e\N{COMBINING CIRCUMFLEX ACCENT}'    # len 2  — decomposed
s1 == s2                                   # False
unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2)  # True
Form Decomposes? Compat map? Use for
NFC recomposed no storage, equality after input
NFD fully split no searching / regex on base chars
NFKC recomposed yes (fi) loose matching, search keys
NFKD fully split yes aggressive normalization

Case-insensitive compare

def eq_ci(a, b):
    n = unicodedata.normalize
    return n('NFD', n('NFD', a).casefold()) == n('NFD', n('NFD', b).casefold())

casefold() handles cases lower() misses (ß → ss, Greek final sigma). Double NFD because casefolding may introduce combining sequences.

Regex

Pattern \d \w \s
str (default) any Unicode digit letters+digits+_ (Unicode) Unicode whitespace
bytes [0-9] [A-Za-z0-9_] ASCII whitespace
str + re.ASCII [0-9] [A-Za-z0-9_] ASCII whitespace

Thai ๕๗ matches \d+ in a str pattern but not with re.ASCII.

Pitfalls

  • len(s) ≠ grapheme count: normalize first, or use a grapheme library.
  • Undecodable filenames: os.listdir('.') can raise — use surrogateescape or pass bytes.
  • UTF-8 BOM from Excel/Notepad: breaks naive encoding='utf-8'; use utf-8-sig.
  • Python <3.15 on non-UTF-8 locale: open() default is cp1252 etc. Set PYTHONUTF8=1 or always pass encoding=.
  • casefold() still needs normalization: ssß without it; fi without NFKC.
  • Lone surrogates only encode under surrogatepass: a str holding U+DC80 errors on plain UTF-8 encode.
Weekly Installs
2
First Seen
1 day ago
Installed on
amp2
cline2
opencode2
cursor2
kimi-cli2
warp2