Python Unicode Gotchas
str = code points, bytes = encoded output. Visually identical strings can differ in length (precomposed vs combining). str + bytes → TypeError; decode at input, encode at output.
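As a minimal sketch of the decode-at-input, encode-at-output rule (the sample bytes and variable names are illustrative, not from the skill itself):

```python
raw = b'caf\xc3\xa9'              # 5 bytes of UTF-8, e.g. read from a socket or file
text = raw.decode('utf-8')        # decode at input: 'café', 4 code points

try:
    text + raw                    # mixing str and bytes is rejected
    mixed_ok = True
except TypeError:
    mixed_ok = False

out = (text + '!').encode('utf-8')   # encode only at the output boundary
print(len(raw), len(text), mixed_ok, out)
```

Keeping everything in between as str avoids the TypeError path entirely.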
Error handlers
| Handler | dec | enc | Effect |
|---|---|---|---|
| strict (default) | ✓ | ✓ | raises UnicodeError |
| ignore | ✓ | ✓ | drops offending char |
| replace | ✓ | ✓ | U+FFFD / ? |
| backslashreplace | ✓ | ✓ | \x80, \uXXXX |
| xmlcharrefreplace | ✗ | ✓ | &#NNN; |
| namereplace | ✗ | ✓ | \N{NAME} |
| surrogateescape | ✓ | ✓ | lossless round-trip of unknown bytes |
| surrogatepass | ✓ | ✓ | allows lone surrogates in UTF-8/16/32 |
surrogateescape maps bytes 0x80–0xFF ↔ U+DC80–U+DCFF. Use for read-then-write-back of files with unknown encoding.
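The handlers in the table can be compared side by side on one bad input. This is an illustrative sketch; the sample data and the dictionary names are assumptions chosen for the demo:

```python
bad = b'abc\x80'   # 0x80 alone is not valid UTF-8

# Decode-side handlers applied to the same undecodable byte.
decoded = {
    h: bad.decode('utf-8', errors=h)
    for h in ('ignore', 'replace', 'backslashreplace', 'surrogateescape')
}

# Encode-side handlers applied to a character ASCII cannot represent.
encoded = {
    h: 'é'.encode('ascii', errors=h)
    for h in ('ignore', 'replace', 'backslashreplace',
              'xmlcharrefreplace', 'namereplace')
}

# surrogateescape: the stray byte survives a decode/encode round trip.
round_trip = decoded['surrogateescape'].encode('utf-8', errors='surrogateescape')

for name, value in {**decoded, **encoded}.items():
    print(f'{name:18} {value!r}')
```

Only surrogateescape gives back the original bytes; the other handlers either lose the byte or replace it with visible markup.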
File I/O
open(p, encoding='utf-8-sig') # strip UTF-8 BOM on read
open(p, encoding='ascii', errors='surrogateescape') # round-trip unknown bytes
utf-16 detects endianness from the BOM on read and writes a BOM on write; utf-16-le/-be neither consume nor write one. The default open() encoding is UTF-8 on Py3.15+; on 3.10–3.14 it follows locale.getpreferredencoding() unless PYTHONUTF8=1.
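A short sketch of the BOM behaviour described above, using a temporary file whose name (bom.txt) is purely illustrative:

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bom.txt')          # hypothetical file name
    with open(path, 'wb') as f:
        f.write(codecs.BOM_UTF8 + 'héllo'.encode('utf-8'))

    with open(path, encoding='utf-8') as f:      # BOM leaks through as U+FEFF
        naive = f.read()
    with open(path, encoding='utf-8-sig') as f:  # BOM stripped on read
        clean = f.read()

utf16 = 'hi'.encode('utf-16')        # BOM + native byte order
utf16_le = 'hi'.encode('utf-16-le')  # no BOM, fixed byte order

# The generic utf-16 decoder uses the BOM to pick the byte order.
from_be = (codecs.BOM_UTF16_BE + 'hi'.encode('utf-16-be')).decode('utf-16')

print(repr(naive), repr(clean), utf16, utf16_le, from_be)
```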
Normalization
Four forms. Same visual character can be one code point or base + combining:
import unicodedata
s1 = 'ê' # len 1 — precomposed
s2 = 'e\N{COMBINING CIRCUMFLEX ACCENT}' # len 2 — decomposed
s1 == s2 # False
unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2) # True
| Form | Decomposes? | Compat map? | Use for |
|---|---|---|---|
| NFC | recomposed | no | storage, equality after input |
| NFD | fully split | no | searching / regex on base chars |
| NFKC | recomposed | yes (ﬁ → fi) | loose matching, search keys |
| NFKD | fully split | yes | aggressive normalization |
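To make the table concrete, the sketch below runs all four forms over one sample string (a ligature followed by a precomposed letter, chosen here for illustration) and records the resulting code points:

```python
import unicodedata

sample = '\ufb01\u00ea'   # 'ﬁ' (LATIN SMALL LIGATURE FI) + 'ê' (precomposed)

results = {}
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    out = unicodedata.normalize(form, sample)
    results[form] = out
    print(form, len(out), [f'U+{ord(c):04X}' for c in out])
```

Only the K forms split the ligature into f + i, and only the D forms leave the accent as a separate combining code point.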
Case-insensitive compare
import unicodedata

def eq_ci(a, b):
    n = unicodedata.normalize
    return n('NFD', n('NFD', a).casefold()) == n('NFD', n('NFD', b).casefold())
casefold() handles cases lower() misses (ß → ss, Greek final sigma ς → σ). NFD is applied twice because casefold() can return a string that is no longer normalized.
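A quick usage sketch for the comparison above. The function is repeated so the block runs on its own, and the test strings are illustrative:

```python
import unicodedata

def eq_ci(a, b):
    n = unicodedata.normalize
    return n('NFD', n('NFD', a).casefold()) == n('NFD', n('NFD', b).casefold())

checks = {
    'sharp s': eq_ci('Straße', 'STRASSE'),          # ß folds to ss
    'combining': eq_ci('\u00ea', 'E\u0302'),        # precomposed vs decomposed
    'different': eq_ci('abc', 'abd'),
}

# lower() leaves both characters alone; casefold() maps them.
lower_vs_fold = ('ß'.lower(), 'ß'.casefold(), 'ς'.lower(), 'ς'.casefold())

print(checks, lower_vs_fold)
```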
Regex
| Pattern | \d | \w | \s |
|---|---|---|---|
| str (default) | any Unicode digit | letters+digits+_ (Unicode) | Unicode whitespace |
| bytes | [0-9] | [A-Za-z0-9_] | ASCII whitespace |
| str + re.ASCII | [0-9] | [A-Za-z0-9_] | ASCII whitespace |
Thai ๕๗ matches \d+ in a str pattern but not with re.ASCII.
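The Thai-digit case can be checked directly. This is a small sketch; the variable names are illustrative:

```python
import re

thai = '\u0e55\u0e57'   # '๕๗': THAI DIGIT FIVE, THAI DIGIT SEVEN

unicode_match = re.fullmatch(r'\d+', thai)                   # str pattern, Unicode-aware
ascii_match = re.fullmatch(r'\d+', thai, flags=re.ASCII)     # restricted to [0-9]
bytes_match = re.fullmatch(rb'\d+', thai.encode('utf-8'))    # bytes patterns are ASCII-only

print(unicode_match is not None, ascii_match is None, bytes_match is None)
print(int(thai))   # int() also accepts Unicode decimal digits
```

If input must be limited to 0-9, pass re.ASCII or spell out [0-9] rather than relying on \d.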
Pitfalls
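- `len(s)` ≠ grapheme count: NFC helps only where a precomposed form exists; otherwise use a grapheme library.
- Undecodable filenames: `os.listdir('.')` returns such names with lone surrogates (decoded via `surrogateescape`), which raise when printed or encoded as UTF-8; recover the bytes with `os.fsencode()` or pass `bytes` paths.
- UTF-8 BOM from Excel/Notepad: breaks naive `encoding='utf-8'`; use `utf-8-sig`.
- Python <3.15 on non-UTF-8 locale: `open()` default is cp1252 etc. Set `PYTHONUTF8=1` or always pass `encoding=`.
- `casefold()` still needs normalization: it folds `ß` to `ss`, but precomposed `ê` ≠ `e` + U+0302 until NFC/NFD is applied, and compatibility forms such as fullwidth `Ａ` ≠ `A` without NFKC.
- Lone surrogates do not encode under `strict`: a `str` holding `U+DC80` raises `UnicodeEncodeError` on plain UTF-8 encode; use `surrogatepass`, or `surrogateescape` to restore the original byte.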
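Two of the pitfalls above, lone surrogates and code-point counts, can be reproduced in a few lines. This is a sketch with illustrative sample values:

```python
import unicodedata

lone = '\udc80'   # what surrogateescape produces for the byte 0x80

try:
    lone.encode('utf-8')              # strict handler rejects lone surrogates
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

as_pass = lone.encode('utf-8', errors='surrogatepass')      # encodes the surrogate itself
as_escape = lone.encode('utf-8', errors='surrogateescape')  # restores the original byte

# len() counts code points, not what a reader sees as one character.
decomposed = 'e\u0302'                                  # one visible 'ê'
composed_len = len(unicodedata.normalize('NFC', decomposed))
flag = '\U0001F1EB\U0001F1F7'                           # one visible flag, two code points
flag_len = len(unicodedata.normalize('NFC', flag))      # NFC cannot merge these

print(strict_ok, as_pass, as_escape, len(decomposed), composed_len, flag_len)
```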
Repository: narenaryan/agent-skills