Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?

In Rust, you can get UTF-8 text from bytes by doing the following:

if let Ok(s) = str::from_utf8(some_u8_slice) {
    println!("example {}", s);
}

This either works or it doesn't, but Python is able to handle the errors, for example:

s = some_bytes.decode(encoding='utf-8', errors='surrogateescape');

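In this example, the argument surrogateescape converts invalid UTF-8 sequences to escape codes, so instead of ignoring or replacing text that can't be decoded, they are replaced with a byte literal expression that is itself valid UTF-8. See the Python docs for details.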

Does Rust have a way to get a UTF-8 string from bytes that escapes errors instead of failing entirely?

Yes, via String::from_utf8_lossy:

fn main() {
    let text = [104, 101, 0xFF, 108, 111];
    let s = String::from_utf8_lossy(&text);
    println!("{}", s); // he�lo (the invalid byte becomes U+FFFD)
}
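
One detail worth knowing about String::from_utf8_lossy is that it returns a Cow<str> rather than a String: when the input is already valid UTF-8 it borrows the bytes without allocating, and it only builds a new String when it has to insert U+FFFD replacement characters. A minimal sketch of that behavior follows; the helper name `describe` is made up for illustration and is not part of the standard library.

```rust
use std::borrow::Cow;

/// Hypothetical helper: returns whether the lossy conversion had to
/// allocate, together with the resulting text.
fn describe(bytes: &[u8]) -> (bool, String) {
    match String::from_utf8_lossy(bytes) {
        // The input was valid UTF-8, so the Cow borrows the original bytes.
        Cow::Borrowed(s) => (false, s.to_string()),
        // The input had invalid sequences, so a new String was built.
        Cow::Owned(s) => (true, s),
    }
}

fn main() {
    let (allocated, text) = describe(b"hello");
    println!("{} {}", allocated, text);

    let (allocated, text) = describe(&[104, 101, 0xFF, 108, 111]);
    println!("{} {}", allocated, text);
}
```

If you always need an owned String, calling into_owned() on the result covers both cases.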

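If you need more control over the process, you can use std::str::from_utf8, as suggested by the other answer. However, there is no reason to validate the bytes twice, as that answer does.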

A quick example:

use std::str;

fn example(mut bytes: &[u8]) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // The entire rest of the string was valid UTF-8, we are done
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                if !good.is_empty() {
                    let s = unsafe {
                        // This is safe because we have already validated this
                        // UTF-8 data via the call to `str::from_utf8`; there's
                        // no need to check it a second time
                        str::from_utf8_unchecked(good)
                    };
                    output.push_str(s);
                }

                if bad.is_empty() {
                    // No more data left
                    return output;
                }

                // Do whatever type of recovery you need to here
                output.push_str("<badbyte>");

                // Skip the bad byte and try again
                bytes = &bad[1..];
            }
        }
    }
}

fn main() {
    let r = example(&[104, 101, 0xFF, 108, 111]);
    println!("{}", r); // he<badbyte>lo
}

You could extend this to take a value to replace bad bytes with, a closure to handle the bad bytes, and so on. For example:

fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    // ...
    handler(&mut output, bad);
    // ...
}

let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
    use std::fmt::Write;
    write!(output, "\\U{{{}}}", bytes[0]).unwrap()
});
println!("{}", r); // he\U{255}lo
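
Putting the pieces together, the closure-based variant might look like the sketch below. It is one possible way to fill in the elided parts, not the only one. It assumes two things that the fragments above do not specify: the names `decode_with` and `hex_escape` are made up for illustration, and instead of skipping a single byte it uses Utf8Error::error_len to hand the whole invalid sequence to the handler.

```rust
use std::fmt::Write;
use std::str;

/// Hypothetical name: decodes `bytes`, calling `handler` once for each
/// invalid sequence so the caller decides how to represent it.
fn decode_with(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // Everything that remains is valid UTF-8, we are done.
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, rest) = bytes.split_at(e.valid_up_to());

                // SAFETY: `from_utf8` reported every byte before
                // `valid_up_to()` as valid UTF-8.
                output.push_str(unsafe { str::from_utf8_unchecked(good) });

                // `error_len()` is `None` when the input ends in the middle
                // of a sequence; treat all remaining bytes as invalid then.
                let bad_len = e.error_len().unwrap_or(rest.len());
                let (bad, after) = rest.split_at(bad_len);

                handler(&mut output, bad);
                bytes = after;
            }
        }
    }
}

/// Illustrative handler: writes each invalid byte as `\xNN`.
fn hex_escape(out: &mut String, bad: &[u8]) {
    for b in bad {
        write!(out, "\\x{:02X}", b).unwrap();
    }
}

fn main() {
    let r = decode_with(&[104, 101, 0xFF, 108, 111], hex_escape);
    println!("{}", r); // he\xFFlo
}
```

Note that this is still not a lossless round trip like Python's surrogateescape: a literal backslash in valid input is indistinguishable from an escape written by the handler unless you also escape backslashes.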

See also:

How do I convert a Vector of bytes (u8) to a string
How to print a u8 slice as text if I don't care about the particular encoding?