Determine a string's encoding in C#

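Is there any way to determine a string's encoding in C#?

Say I have a filename string, but I don't know whether it is encoded in ~~Unicode~~ UTF-16 or in the system default encoding. How do I find out?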

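Related discussion

You cannot encode in "Unicode". And without any other prior information, there is no way to automatically determine the encoding of a given string.

"You cannot encode in Unicode": if we interpret Unicode as UTF-16 (or any other specific UTF-*), then it is a perfectly valid way of writing code points as byte sequences (= an encoding).

How can you write such approximations? UTF-16 is one of the possible ways to encode Unicode data. You cannot "encode in Unicode"; Unicode is not UTF-*; UTF-* is not Unicode. Sorry, but if we keep writing such approximations, how will Unicode-related behavior ever change? Beginners will always be confused by the dark Unicode monster, and things will never change. Let's be clear.

Clearer, maybe: you encode Unicode code points into a charset's byte strings using an "encoding" scheme (utf-*, iso-*, big5, shift-jis, etc.), and you decode byte strings from a charset into Unicode. You don't encode byte strings in Unicode. You don't decode Unicode into byte strings.

Thanks NicDumZ, you made me feel stupid. :S

@NicDumZ - the encoding itself (particularly UTF-16) is also commonly called "Unicode". Right or wrong, that is life. Even in .NET, look at Encoding.Unicode, which means UTF-16.

Oh well, I did not know that .NET was so misleading. That looks like a terrible habit to learn. And sorry @krebstar, that was not my intention (I still think your edited question makes much more sense now than before).

@NicDumZ: There is a way to determine probabilistically which encoding to use. Look at what IE (and now also FF, with View - Character Encoding - Auto-detect) does: it tries one encoding and checks whether the result could be "well-written <put a language name here>", or else changes it and tries again. Come on, this can be fun!

Is there a final solution with a full source code sample?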

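The question is ill-posed. In .NET, once you have a string object, its characters are UTF-16 code units in the range U+0000 to U+FFFF. It no longer "has an encoding" in the sense the question is asking about. Or you could equally say that a .NET string's encoding is always UTF-16. Any "encoding" is dealt with by whatever code turned the raw byte stream into the .NET string object.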
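To make that distinction concrete, here is a minimal sketch. The names StringVsBytesDemo and ToHex are illustrative and not part of the answer. It shows that one .NET string yields different byte sequences depending on which Encoding is applied to it, and that Encoding.Unicode is simply the framework's name for little-endian UTF-16.

```csharp
using System;
using System.Text;

public static class StringVsBytesDemo
{
    // Returns the bytes of 'text' under 'encoding' as a hex string such as "C3-A9".
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string s = "\u00E9"; // one char: LATIN SMALL LETTER E WITH ACUTE

        Console.WriteLine(s.Length);                             // 1
        Console.WriteLine(ToHex(s, Encoding.UTF8));              // C3-A9
        Console.WriteLine(ToHex(s, Encoding.Unicode));           // E9-00
        Console.WriteLine(ToHex(s, Encoding.BigEndianUnicode));  // 00-E9
        Console.WriteLine(Encoding.Unicode.WebName);             // utf-16
    }
}
```

The string itself never changes; only the byte representation chosen at the I/O boundary does, which is why the detection code in the answers works on byte arrays and files rather than on string objects.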

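The code below has the following features:

Detects, or attempts to detect, UTF-7 and UTF-8/16/32 (BOM, no BOM, little & big endian).

Falls back to the local default code page if no Unicode encoding is found.

Detects (with high probability) Unicode files that are missing the BOM/signature.

Searches for charset=xyz and encoding=xyz inside the file to help determine the encoding.

To save processing, you can "taste" the file (a definable number of bytes).

Returns both the encoding and the decoded text of the file.

Purely byte-based solution, for efficiency.

As others have said, no solution can be perfect (and it is certainly hard to tell apart the various 8-bit extended ASCII encodings in use around the world), but we can get "good enough", especially if the developer also offers the user a list of alternative encodings, as shown in: What is the most common encoding of each language?
A full list of encodings can be obtained with Encoding.GetEncodings();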

using System;
using System.IO;
using System.Text;

// Function to detect the encoding for UTF-7, UTF-8/16/32 (bom, no bom, little
// & big endian), and local default codepage, and potentially other codepages.
// 'taster' = number of bytes to check of the file (to save processing). Higher
// value is slower, but more reliable (especially UTF-8 with special characters
// later on may appear to be ASCII initially). If taster = 0, then taster
// becomes the length of the file (for maximum reliability). 'text' is simply
// the string with the discovered encoding applied to the file.
public Encoding detectTextEncoding(string filename, out String text, int taster = 1000)
{
    byte[] b = File.ReadAllBytes(filename);

    //////////////// First check the low hanging fruit by checking if a
    //////////////// BOM/signature exists (sourced from http://www.unicode.org/faq/utf_bom.html#bom4)
    if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) { text = Encoding.GetEncoding("utf-32BE").GetString(b, 4, b.Length - 4); return Encoding.GetEncoding("utf-32BE"); } // UTF-32, big-endian
    else if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) { text = Encoding.UTF32.GetString(b, 4, b.Length - 4); return Encoding.UTF32; } // UTF-32, little-endian
    else if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) { text = Encoding.BigEndianUnicode.GetString(b, 2, b.Length - 2); return Encoding.BigEndianUnicode; } // UTF-16, big-endian
    else if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) { text = Encoding.Unicode.GetString(b, 2, b.Length - 2); return Encoding.Unicode; } // UTF-16, little-endian
    else if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) { text = Encoding.UTF8.GetString(b, 3, b.Length - 3); return Encoding.UTF8; } // UTF-8
    else if (b.Length >= 3 && b[0] == 0x2b && b[1] == 0x2f && b[2] == 0x76) { text = Encoding.UTF7.GetString(b, 3, b.Length - 3); return Encoding.UTF7; } // UTF-7 (Encoding.UTF7 is marked obsolete in .NET 5 and later)

    //////////// If the code reaches here, no BOM/signature was found, so now
    //////////// we need to 'taste' the file to see if can manually discover
    //////////// the encoding. A high taster value is desired for UTF-8
    if (taster == 0 || taster > b.Length) taster = b.Length; // Taster size can't be bigger than the filesize obviously.

    // Some text files are encoded in UTF8, but have no BOM/signature. Hence
    // the below manually checks for a UTF8 pattern. This code is based off
    // the top answer at: https://stackoverflow.com/questions/6555015/check-for-invalid-utf8
    // For our purposes, an unnecessarily strict (and terser/slower)
    // implementation is shown at: https://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c
    // For the below, false positives should be exceedingly rare (and would
    // be either slightly malformed UTF-8 (which would suit our purposes
    // anyway) or 8-bit extended ASCII/UTF-16/32 at a vanishingly long shot).
    int i = 0;
    bool utf8 = false;
    while (i < taster - 4)
    {
        if (b[i] <= 0x7F) { i += 1; continue; } // If all characters are below 0x80, then it is valid UTF8, but UTF8 is not 'required' (and therefore the text is more desirable to be treated as the default codepage of the computer). Hence, there's no "utf8 = true;" code unlike the next three checks.
        if (b[i] >= 0xC2 && b[i] <= 0xDF && b[i + 1] >= 0x80 && b[i + 1] < 0xC0) { i += 2; utf8 = true; continue; }
        // Three-byte lead bytes are 0xE0..0xEF; 0xF0 starts a four-byte sequence.
        if (b[i] >= 0xE0 && b[i] <= 0xEF && b[i + 1] >= 0x80 && b[i + 1] < 0xC0 && b[i + 2] >= 0x80 && b[i + 2] < 0xC0) { i += 3; utf8 = true; continue; }
        if (b[i] >= 0xF0 && b[i] <= 0xF4 && b[i + 1] >= 0x80 && b[i + 1] < 0xC0 && b[i + 2] >= 0x80 && b[i + 2] < 0xC0 && b[i + 3] >= 0x80 && b[i + 3] < 0xC0) { i += 4; utf8 = true; continue; }
        utf8 = false; break;
    }
    if (utf8 == true) {
        text = Encoding.UTF8.GetString(b);
        return Encoding.UTF8;
    }

    // The next check is a heuristic attempt to detect UTF-16 without a BOM.
    // We simply look for zeroes in odd or even byte places, and if a certain
    // threshold is reached, the code is 'probably' UTF-16.
    double threshold = 0.1; // proportion of chars step 2 which must be zeroed to be diagnosed as utf-16. 0.1 = 10%
    int count = 0;
    for (int n = 0; n < taster; n += 2) if (b[n] == 0) count++;
    if (((double)count) / taster > threshold) { text = Encoding.BigEndianUnicode.GetString(b); return Encoding.BigEndianUnicode; }
    count = 0;
    for (int n = 1; n < taster; n += 2) if (b[n] == 0) count++;
    if (((double)count) / taster > threshold) { text = Encoding.Unicode.GetString(b); return Encoding.Unicode; } // (little-endian)

    // Finally, a long shot - let's see if we can find "charset=xyz" or
    // "encoding=xyz" to identify the encoding:
    for (int n = 0; n < taster - 9; n++)
    {
        if (
            ((b[n + 0] == 'c' || b[n + 0] == 'C') && (b[n + 1] == 'h' || b[n + 1] == 'H') && (b[n + 2] == 'a' || b[n + 2] == 'A') && (b[n + 3] == 'r' || b[n + 3] == 'R') && (b[n + 4] == 's' || b[n + 4] == 'S') && (b[n + 5] == 'e' || b[n + 5] == 'E') && (b[n + 6] == 't' || b[n + 6] == 'T') && (b[n + 7] == '=')) ||
            ((b[n + 0] == 'e' || b[n + 0] == 'E') && (b[n + 1] == 'n' || b[n + 1] == 'N') && (b[n + 2] == 'c' || b[n + 2] == 'C') && (b[n + 3] == 'o' || b[n + 3] == 'O') && (b[n + 4] == 'd' || b[n + 4] == 'D') && (b[n + 5] == 'i' || b[n + 5] == 'I') && (b[n + 6] == 'n' || b[n + 6] == 'N') && (b[n + 7] == 'g' || b[n + 7] == 'G') && (b[n + 8] == '='))
            )
        {
            if (b[n + 0] == 'c' || b[n + 0] == 'C') n += 8; else n += 9;
            if (b[n] == '"' || b[n] == '\'') n++;
            int oldn = n;
            while (n < taster && (b[n] == '_' || b[n] == '-' || (b[n] >= '0' && b[n] <= '9') || (b[n] >= 'a' && b[n] <= 'z') || (b[n] >= 'A' && b[n] <= 'Z')))
            { n++; }
            byte[] nb = new byte[n - oldn];
            Array.Copy(b, oldn, nb, 0, n - oldn);
            try {
                string internalEnc = Encoding.ASCII.GetString(nb);
                text = Encoding.GetEncoding(internalEnc).GetString(b);
                return Encoding.GetEncoding(internalEnc);
            }
            catch { break; } // If C# doesn't recognize the name of the encoding, break.
        }
    }

    // If all else fails, the encoding is probably (though certainly not
    // definitely) the user's local codepage! One might present to the user a
    // list of alternative encodings as shown here: https://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
    // A full list can be found using Encoding.GetEncodings();
    text = Encoding.Default.GetString(b);
    return Encoding.Default;
}
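The last resort mentioned in the closing comment, showing the user a list of candidate encodings, could be sketched roughly as follows. EncodingMenu and ListEncodingNames are hypothetical names chosen for this illustration; Encoding.GetEncodings() and EncodingInfo.Name are the framework's own. On .NET Core and later the list is short unless an extra provider such as CodePagesEncodingProvider has been registered, so treat the exact contents as platform-dependent.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingMenu
{
    // Collects the IANA names of every encoding the runtime currently exposes,
    // sorted so they can be shown in a drop-down or printed as a menu.
    public static List<string> ListEncodingNames()
    {
        var names = new List<string>();
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            names.Add(info.Name);
        }
        names.Sort(StringComparer.OrdinalIgnoreCase);
        return names;
    }

    public static void Main()
    {
        foreach (string name in ListEncodingNames())
        {
            Console.WriteLine(name);
        }
    }
}
```

Whatever name the user picks can then be passed to Encoding.GetEncoding(name) to re-decode the same bytes.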

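Related discussion

This works for Cyrillic (and probably all other) .eml files (taken from the mail's charset header).

UTF-7 cannot be decoded that naively, actually; its full preamble is longer, and it includes two bits of the first character. The .NET system seems to have no support at all for UTF-7's preamble system.

Worked for me when none of the other methods I checked helped! Thanks, Dan.

Thanks for your solution. I'm using it to determine the encoding of files from completely different sources. What I found, though, is that if I use too low a taster value, the result can be wrong. (For example, the code was returning Encoding.Default for a UTF-8 file, even though I was using b.Length / 10 as my taster.) So I wondered: what is the argument for using a taster smaller than b.Length? It seems I can only conclude that Encoding.Default is acceptable if and only if I have scanned the whole file.

@Sean: It's for when speed matters more than accuracy, especially for files that may be dozens or hundreds of megabytes in size. In my experience, even a low taster value gets correct results ~99.9% of the time. Your experience may differ.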

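Check out Utf8Checker; it is a simple class that does exactly this in pure managed code. http://utf8checker.codeplex.com
Note: as already pointed out, "determining the encoding" only makes sense for byte streams. If you have a string, it has already been decoded by someone along the way who knew or guessed the encoding in order to get the string in the first place.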

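Related discussion

If the string is the result of an incorrect decoding done with a simple 8-bit encoding, and you have the encoding that was used to decode it, you can usually get the bytes back without any corruption, though.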
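A small sketch of that recovery, under the assumption that the wrong decoder was ISO-8859-1, which maps every byte value to exactly one character. MojibakeRepair and Redecode are names invented for this illustration.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Re-encodes 'garbled' with the encoding that was wrongly used to decode it,
    // which recovers the original bytes, then decodes those bytes correctly.
    public static string Redecode(string garbled, Encoding wrong, Encoding right)
    {
        byte[] originalBytes = wrong.GetBytes(garbled);
        return right.GetString(originalBytes);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // UTF-8 bytes for "naive" with a diaeresis on the i ...
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("na\u00EFve");
        // ... mistakenly decoded as ISO-8859-1: the two-byte sequence becomes two chars.
        string garbled = latin1.GetString(utf8Bytes);

        string repaired = Redecode(garbled, latin1, Encoding.UTF8);
        Console.WriteLine(garbled.Length);           // 6
        Console.WriteLine(repaired == "na\u00EFve"); // True
    }
}
```

This only works when the wrong decoder was lossless. If the first decoding replaced unmappable bytes with '?' or U+FFFD, the original bytes are gone.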

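It depends on where the string came from. A .NET string is Unicode (UTF-16). The only way it could be different is if you, say, read the data from a database into a byte array.
This CodeProject article might be of interest: Detect Encoding for in- and outgoing text
Jon Skeet's "Strings in C# and .NET" is an excellent explanation of .NET strings.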

Related discussion

It came from a non-Unicode C++ app. The CodeProject article seems a bit too complex; however, it seems to do what I want to do. Thanks...

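I know this is a bit late, but to be clear:
A string doesn't really have an encoding... in .NET, a string is a collection of char objects. Essentially, if it is a string, it has already been decoded.
However, if you are reading the contents of a file, which is made of bytes, and want to convert that into a string, then the file's encoding must be used.
.NET includes encoding and decoding classes for ASCII, UTF7, UTF8, UTF32 and more.
The Unicode encodings among these can start with a byte-order mark (BOM), which can be used to tell which encoding was used.
The .NET class System.IO.StreamReader is able to determine the encoding used within a stream by reading those byte-order marks.
Here is an example: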

using System;
using System.IO;
using System.Text;

/// <summary>
/// Returns the detected encoding and the contents of the file.
/// </summary>
/// <param name="fileName">Path of the file to read.</param>
/// <param name="contents">Receives the decoded contents of the file.</param>
/// <returns>The encoding the StreamReader settled on after reading the file.</returns>
public static Encoding DetectEncoding(String fileName, out String contents)
{
    // open the file with the stream-reader, with BOM detection enabled:
    using (StreamReader reader = new StreamReader(fileName, true))
    {
        // read the contents of the file into a string
        contents = reader.ReadToEnd();

        // return the encoding (only meaningful after the read above).
        return reader.CurrentEncoding;
    }
}
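To see what BOM-based detection does and does not catch, here is a small sketch. BomDetectionDemo and DetectedName are made-up names; the StreamReader, File and Encoding calls are standard. It writes the same text with and without a BOM and reports which encoding StreamReader ends up with, using ASCII as an arbitrary fallback for the demo.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectionDemo
{
    // Writes 'text' to a temp file using 'writeAs', reads it back with BOM
    // detection enabled, and returns the name of the encoding that was used.
    public static string DetectedName(string text, Encoding writeAs)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, writeAs);
            using (var reader = new StreamReader(path, Encoding.ASCII, true))
            {
                reader.ReadToEnd(); // CurrentEncoding is only updated by a read
                return reader.CurrentEncoding.WebName;
            }
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        // Encoding.Unicode and Encoding.UTF8 both emit a BOM when writing.
        Console.WriteLine(DetectedName("hello", Encoding.Unicode)); // utf-16
        Console.WriteLine(DetectedName("hello", Encoding.UTF8));    // utf-8

        // UTF-16 without a BOM: nothing to detect, so the fallback is kept.
        Console.WriteLine(DetectedName("hello", new UnicodeEncoding(false, false))); // us-ascii
    }
}
```

The last case is why the answers that add byte-pattern heuristics exist: without a BOM, StreamReader simply keeps whatever encoding it was given.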

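Related discussion

This won't work for detecting UTF-16 without the BOM. It also won't fall back to the user's local default code page if it fails to detect any Unicode encoding. You can fix the latter by passing Encoding.Default as a StreamReader argument, but then the code won't detect UTF-8 without the BOM.

@DanW: Is UTF-16 without a BOM ever actually done, though? I would never use that; it's bound to be a disaster to open in pretty much anything.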

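Another option, very late in coming, sorry:
http://www.architectshack.com/textfileencodingdetector.ashx
This small C#-only class uses BOMs if present, otherwise tries to auto-detect possible Unicode encodings, and falls back if none of the Unicode encodings is possible or likely.
It sounds like the UTF8Checker referenced above does something similar, but I think this one is slightly broader in scope: instead of just UTF-8, it also checks for other possible Unicode encodings (UTF-16 LE or BE) that might be missing a BOM.
Hope this helps someone!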

Related discussion

Pretty nice, it solved my encoding detection problem :)

The SimpleHelpers.FileEncoding NuGet package wraps a C# port of the Mozilla Universal Charset Detector in a dead-simple API:

var encoding = FileEncoding.DetectFileEncoding(txtFile);

Related discussion

This should be higher up; it offers a very simple solution: let others do the work :D

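My solution is to use built-in stuff with some fallbacks.
I picked up the strategy from an answer to another similar question on Stack Overflow, but I can't find it now.
It first checks for a BOM using the built-in logic in StreamReader; if there is a BOM, the encoding will be something other than Encoding.Default, and we should trust that result.
If not, it checks whether the byte sequence is valid UTF-8. If it is, it guesses UTF-8 as the encoding; if not, the result is again the default encoding, Encoding.Default.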

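using System;
using System.IO;
using System.Text;

static Encoding getEncoding(string path) {
    // Open read-only; the using block disposes the stream on every exit path.
    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read)) {
        // First pass: let StreamReader look for a BOM. leaveOpen: true keeps
        // the stream usable for the second pass.
        using (var reader = new StreamReader(stream, Encoding.Default, true, 1024, true)) {
            reader.Read();

            if (reader.CurrentEncoding != Encoding.Default) {
                return reader.CurrentEncoding;
            }
        }

        stream.Position = 0;

        // Second pass: a strict UTF8Encoding throws on invalid byte sequences.
        using (var reader = new StreamReader(stream, new UTF8Encoding(false, true))) {
            try {
                reader.ReadToEnd();
                return Encoding.UTF8;
            }
            catch (DecoderFallbackException) {
                return Encoding.Default;
            }
        }
    }
}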

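Note: this was an experiment to see how the UTF-8 encoding works internally. The solution offered by vilicvane, which uses a UTF8Encoding object initialized to throw an exception on decoding failure, is much simpler and does basically the same thing.
I wrote this piece of code to differentiate between UTF-8 and Windows-1252. It shouldn't be used for gigantic text files, though, since it loads the entire file into memory and scans it completely. I used it for .srt subtitle files, just to be able to save them back in the encoding in which they were loaded.
The encoding passed to the function by ref should be the 8-bit fallback encoding to use in case the file is detected as not being valid UTF-8; on Windows systems this will generally be Windows-1252. It doesn't do anything fancy like checking actual valid ASCII ranges, though, and it doesn't detect UTF-16, not even from a byte order mark.
The theory behind the bitwise detection can be found here: https://ianthehenry.com/2015/1/17/decoding-utf-8/
Basically, the bit pattern of the first byte determines how many of the bytes after it are part of the same UTF-8 entity. Those following bytes are always in the same bit range.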
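As a rough illustration of that rule, the sketch below maps a lead byte to the total length of its sequence and tests whether a byte has the continuation pattern 10xxxxxx. Utf8LeadByte, SequenceLength and IsContinuation are names invented here, and the check is limited to the four lengths used by current UTF-8. It looks at bit patterns only, so it does not reject overlong forms or lead bytes such as 0xC0 and 0xF5.

```csharp
using System;

public static class Utf8LeadByte
{
    // Returns the total number of bytes in the UTF-8 sequence that starts with
    // 'lead', or -1 if 'lead' has no valid lead-byte bit pattern.
    public static int SequenceLength(byte lead)
    {
        if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: one byte follows
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: two bytes follow
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: three bytes follow
        return -1;                           // 10xxxxxx or 11111xxx
    }

    // Every byte after the lead byte must match 10xxxxxx.
    public static bool IsContinuation(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    public static void Main()
    {
        Console.WriteLine(SequenceLength(0x41)); // 1
        Console.WriteLine(SequenceLength(0xC3)); // 2
        Console.WriteLine(SequenceLength(0xE2)); // 3
        Console.WriteLine(SequenceLength(0xF0)); // 4
        Console.WriteLine(SequenceLength(0x80)); // -1
        Console.WriteLine(IsContinuation(0xA9)); // True
    }
}
```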

using System;
using System.Text;

/// <summary>
/// Detects whether the encoding of the data is valid UTF-8 or ascii. If detection fails, the text is decoded using the given fallback encoding.
/// Bit-wise mechanism for detecting valid UTF-8 based on https://ianthehenry.com/2015/1/17/decoding-utf-8/
/// Note that pure ascii detection should not be trusted: it might mean the file is meant to be UTF-8 or Windows-1252 but simply contains no special characters.
/// </summary>
/// <param name="docBytes">The bytes of the text document.</param>
/// <param name="encoding">The default encoding to use as fallback if the text is detected not to be pure ascii or UTF-8 compliant. This ref parameter is changed to the detected encoding, or Windows-1252 if the given encoding parameter is null and the text is not valid UTF-8.</param>
/// <returns>The contents of the read file</returns>
public static String ReadFileAndGetEncoding(Byte[] docBytes, ref Encoding encoding)
{
    // Note: on .NET Core / .NET 5+, code page 1252 is only available after
    // registering CodePagesEncodingProvider.
    if (encoding == null)
        encoding = Encoding.GetEncoding(1252);
    // BOM detection is not added in this example. Add it yourself if you feel like it. Should set the "encoding" param and return the decoded string.
    //String file = DetectByBOM(docBytes, ref encoding);
    //if (file != null)
    //    return file;
    Boolean isPureAscii = true;
    Boolean isUtf8Valid = true;
    for (Int32 i = 0; i < docBytes.Length; i++)
    {
        Int32 skip = TestUtf8(docBytes, i);
        if (skip != 0)
        {
            if (isPureAscii)
                isPureAscii = false;
            if (skip < 0)
                isUtf8Valid = false;
            else
                i += skip;
        }
        // if already detected that it's not valid utf8, there's no sense in going on.
        if (!isUtf8Valid)
            break;
    }
    if (isPureAscii)
        encoding = new ASCIIEncoding(); // pure 7-bit ascii.
    else if (isUtf8Valid)
        encoding = new UTF8Encoding(false);
    // else, retain given fallback encoding.
    return encoding.GetString(docBytes);
}

/// <summary>
/// Tests if the bytes following the given offset are UTF-8 valid, and returns
/// the extra amount of bytes to skip ahead to do the next read if it is
/// (meaning, detecting a single-byte ascii character would return 0).
/// If the text is not UTF-8 valid it returns -1.
/// </summary>
/// <param name="binFile">Byte array to test</param>
/// <param name="offset">Offset in the byte array to test.</param>
/// <returns>The amount of extra bytes to skip ahead for the next read, or -1 if the byte sequence wasn't valid UTF-8</returns>
public static Int32 TestUtf8(Byte[] binFile, Int32 offset)
{
    Byte current = binFile[offset];
    if ((current & 0x80) == 0)
        return 0; // valid 7-bit ascii. Added length is 0 bytes.
    else
    {
        Int32 len = binFile.Length;
        Int32 fullmask = 0xC0;
        Int32 testmask = 0;
        for (Int32 addedlength = 1; addedlength < 6; addedlength++)
        {
            // This code adds shifted bits to get the desired full mask.
            // If the full mask is [111]0 0000, then test mask will be [110]0 0000. Since this is
            // effectively always the previous step in the iteration I just store it each time.
            testmask = fullmask;
            fullmask += (0x40 >> addedlength);
            // Test bit mask for this level
            if ((current & fullmask) == testmask)
            {
                // End of file. Might be cut off, but either way, deemed invalid.
                if (offset + addedlength >= len)
                    return -1;
                else
                {
                    // Lookahead. Pattern of any following bytes is always 10xxxxxx
                    for (Int32 i = 1; i <= addedlength; i++)
                    {
                        // If it does not match the pattern for an added byte, it is deemed invalid.
                        if ((binFile[offset + i] & 0xC0) != 0x80)
                            return -1;
                    }
                    return addedlength;
                }
            }
        }
        // Value is greater than the start of a 6-byte utf8 sequence. Deemed invalid.
        return -1;
    }
}

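Related discussion

Also, there is no else statement after the last if in if ((current & 0xE0) == 0xC0) { ... } else if ((current & 0xF0) == 0xE0) { ... } else if ((current & 0xF0) == 0xE0) { ... } else if ((current & 0xF8) == 0xF0) { ... }. I suppose that else case would be invalid UTF-8: isUtf8Valid = false;. Wouldn't it?

@hal Ah, true... I have since updated my own code with a more general (and more advanced) system that uses a loop going up to 3 but that could technically be changed to loop further (the specs are a bit unclear on that; I think it is possible to expand UTF-8 up to 6 added bytes, but only 3 are used in current implementations), so I didn't update this code.

@hal Updated it to my new solution. The principle stays the same, but the bit masks are created and checked in a loop rather than all being written out explicitly in code.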
