Checklist
TagStudio Version
9.5.6
Operating System & Version
Fedora 43 x86_64
Description
There are a few places where we don't specify UTF-8 as the encoding for files that TagStudio owns. On Linux and macOS this is probably not a problem, since they normally default to UTF-8, but on Windows systems, which can use non-UTF-8 codepages, it will cause problems.
Examples:
- A user on Japanese Windows (cp932) saves a library path containing 写真 to settings.toml. The file's written in cp932. If that user opens TagStudio on a system with a different Windows codepage, or on Linux, read_settings() blows up with UnicodeDecodeError.
- TOML and JSON specs both mandate UTF-8, so anything we write that isn't UTF-8 is nonconforming and won't round-trip through other tools. (Probably pretty minor, in the grand scheme of things.)
- The compiled ignore file (.compiled_ignore) passed to ripgrep is opened with default encoding, so a user with non-ASCII glob patterns can poison the ripgrep input.
- ResourceManager.read_text() reads bundled resource files (English translations, etc.) under the system encoding.
PR inbound.
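For reference, the general shape of the fix is to pass encoding="utf-8" on every read and write of a TagStudio-owned file. The snippet below is a standalone sketch, not TagStudio code: the file name is reused from the example above, but the `library_path` key and the temporary directory are made up for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical settings content; the key name is illustrative only.
text = 'library_path = "C:/Users/me/写真"\n'

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "settings.toml"

    # Simulates a write without encoding= on a system whose default is cp932.
    path.write_text(text, encoding="cp932")

    # Reading that file as UTF-8 fails, because the cp932 bytes for 写真
    # are not a valid UTF-8 sequence.
    try:
        path.read_text(encoding="utf-8")
        legacy_readable = True
    except UnicodeDecodeError:
        legacy_readable = False

    # With an explicit encoding on both sides, the result no longer
    # depends on the system codepage.
    path.write_text(text, encoding="utf-8")
    round_tripped = path.read_text(encoding="utf-8")
```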
Expected Behavior
TagStudio files (configurations like settings.toml, .ts_ignore, etc.) should always be read and written as UTF-8, so that files moved between systems stay readable even when one or more of those systems uses a non-UTF-8 codepage.
If a TagStudio-owned file was written in a non-UTF-8 encoding, it should be rewritten in UTF-8 on next save.
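One way this migration could work is sketched below. Both helpers (`read_owned_text`, `write_owned_text`) are hypothetical names, not existing TagStudio functions, and which legacy encodings to try is an assumption; here the default is the current locale's preferred encoding.

```python
import locale
from pathlib import Path


def read_owned_text(path: Path, legacy_encodings=None) -> str:
    """Hypothetical helper: decode as UTF-8, else try legacy encodings."""
    if legacy_encodings is None:
        # Assumption: a legacy file was most likely written under the
        # locale's default encoding.
        legacy_encodings = (locale.getpreferredencoding(False),)
    raw = path.read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as utf8_error:
        for encoding in legacy_encodings:
            try:
                return raw.decode(encoding)
            except UnicodeDecodeError:
                continue
        # Nothing worked: surface the original UTF-8 failure.
        raise utf8_error


def write_owned_text(path: Path, text: str) -> None:
    """Hypothetical helper: always write UTF-8, whatever the locale is."""
    path.write_text(text, encoding="utf-8")
```

A legacy file read through `read_owned_text` and saved through `write_owned_text` ends up as UTF-8 on disk. The fallback is best-effort: a wrong legacy codec can decode without error and produce mojibake, so the list of fallbacks should stay short.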
Steps to Reproduce
# Build a settings.toml whose ASCII structure is intact but whose string *value*
# contains raw cp932 bytes (invalid as UTF-8).
import pathlib
legacy = "写真".encode("cp932") # b"\x8e\xca\x90^"
payload = b'language = "ja"\ndate_format = "' + legacy + b'"\n'
pathlib.Path("/tmp/settings.toml").write_bytes(payload)
Use that settings file in TagStudio on a non-UTF-8 system with a non-cp932 codepage and watch it fall over.
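To see why the payload breaks outside a cp932 system without launching TagStudio, it can be decoded directly under a few codecs. This is only a sketch of the decoding step, assuming the reader decodes the whole file strictly; `try_decode` is a hypothetical helper, and cp1252 stands in for "some other Windows codepage".

```python
# Same payload as in the reproduction script above.
legacy = "写真".encode("cp932")
payload = b'language = "ja"\ndate_format = "' + legacy + b'"\n'


def try_decode(data: bytes, encoding: str) -> str:
    """Hypothetical helper: report whether data decodes under encoding."""
    try:
        data.decode(encoding)
        return "ok"
    except UnicodeDecodeError:
        return "UnicodeDecodeError"


# cp932 is the codepage the file was written under; utf-8 models Linux
# and macOS; cp1252 models a Western-European Windows install.
results = {enc: try_decode(payload, enc) for enc in ("cp932", "utf-8", "cp1252")}
print(results)
```

Only the cp932 decode succeeds. The UTF-8 decode fails on the first legacy byte, and cp1252 fails because byte 0x90 has no mapping in Python's cp1252 codec.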
Logs
No response