Lua: truncate a string containing UTF-8 encoded chars

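I'm rewriting an awk program that formats a string for output to a status bar. I'm not a programmer, just a hobbyist trying to learn during whatever downtime I get.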

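Truncating a string in the middle of any non-ASCII character, e.g. Cyrillic (UTF-8), produces corrupted output that is displayed as a series of question marks: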

Ouverture Il Ritorno dall'Estero op. 89 / Mendelsshon / Великие ?… / 320 kb/s

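string.len and # count bytes, not characters. A single Cyrillic character counts as 2 bytes, not 1, which obviously complicates truncation. Luckily, Lua 5.3 includes the utf8 library, and there is a wiki page on Unicode support, both of which make working with non-ASCII characters easier. I modified the "shorten" function to use utf8.len to get an accurate character count for truncation, but the problem persists.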

-- from penlight library, use utf8.len, not string.len
function shorten(s, w)
    local ellipsis = "…"
    local n_ellipsis = utf8.len(ellipsis)
    assert_string(1, s)
    if utf8.len(s) > w then
        return s:sub(1, w - n_ellipsis) .. ellipsis
    end
    return s
end
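
To illustrate why the function above still corrupts the output: utf8.len counts characters, but s:sub still cuts at a byte index. A small check, assuming Lua 5.3 or later and a source file saved as UTF-8 (the sample string is taken from the output shown earlier):

```lua
-- Requires Lua 5.3+ for the utf8 library; this file must be saved as UTF-8.
local s = "Великие"        -- 7 Cyrillic letters, 2 bytes each in UTF-8

print(#s)                  --> 14 (bytes)
print(utf8.len(s))         --> 7  (characters)

-- string.sub works on bytes: asking for the first 3 "characters"
-- really returns 3 bytes, i.e. one and a half Cyrillic letters.
local cut = s:sub(1, 3)

-- utf8.len returns nil (plus the position of the first invalid byte)
-- when the string is not valid UTF-8, which is the case for the cut string.
print(utf8.len(cut))
```

The dangling half character at the end of the cut string is what the status bar shows as a question mark.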

From further reading, I learned that utf8.offset should be used whenever truncation is needed:

You should use these functions anywhere you need to manipulate text that you didn’t write yourself or may contain non-ASCII or non-English characters. If you truncate a string at a byte index that is not between whole codepoints you will end up with an invalid UTF-8 string that may render incorrectly or cannot be stored in a DataStore.

If you are truncating a string at an index you should use string.sub with a byte index given by utf8.offset.
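
The advice quoted above could be applied to the shorten function roughly as follows. This is only a sketch, assuming Lua 5.3 or later and valid UTF-8 input; the Penlight assert_string check is replaced by a plain assert so the snippet runs on its own, and the names keep and last_byte are made up for illustration.

```lua
-- Requires Lua 5.3+ for the utf8 library. Assumes s is valid UTF-8.
function shorten(s, w)
    local ellipsis = "…"
    local n_ellipsis = utf8.len(ellipsis)   -- 1 character (3 bytes)
    local len = utf8.len(s)                 -- nil if s is not valid UTF-8
    assert(len, "shorten: input is not valid UTF-8")
    if len <= w then
        return s
    end
    -- Number of whole characters to keep before the ellipsis.
    -- Assumption: for very small w we keep nothing and return just "…".
    local keep = math.max(w - n_ellipsis, 0)
    -- utf8.offset(s, n) gives the byte index where the n-th character
    -- starts, so the byte just before character keep+1 is the last byte
    -- of character number keep.
    local last_byte = utf8.offset(s, keep + 1) - 1
    return s:sub(1, last_byte) .. ellipsis
end

print(shorten("Великие", 4))       --> Вел…
print(shorten("hello world", 6))   --> hello…
```

The byte index passed to s:sub always falls between two whole characters here, so the result stays valid UTF-8.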

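I've been trying to figure out how to use utf8.offset to get the byte index I need, but so far with zero success. In case further context helps, here is the complete script I've been tinkering with.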

Any hints, code, criticism, etc. would be greatly appreciated.