Java: Reading UTF-8 bytes from a file into a String in a system-independent way

How do I accurately read a UTF-8 encoded file into a String in Java?

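When I change the encoding of this .java file to UTF-8 (Eclipse > right-click App.java > Properties > Resource > Text file encoding), the program works fine when run from Eclipse, but not from the command line. It seems Eclipse sets the file.encoding parameter when it runs the application.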

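Why should the encoding of the source file have any effect on creating a String from bytes? What is the foolproof way to create a String from bytes when the encoding is known? I may have files in different encodings. Once the encoding of a file is known, I must be able to read it into a String regardless of the value of file.encoding.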

The contents of the UTF-8 file are as follows:

English Hello World.
Korean ?????.
Japanese 世界こんにちは。
Russian Привет мир.
German Hallo Welt.
Spanish Hola mundo.
Hindi ???? ???????
Gujarati ???? ??????.
Thai ????????????.

-End of file-

The code is below. My observations are in the comments.

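import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class App {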
public static void main(String[] args) {
    String slash = System.getProperty("file.separator");
    File inputUtfFile = new File("C:" + slash +"sources" + slash +"TestUtfRead" + slash +"utf8text.txt");
    File outputUtfFile = new File("C:" + slash +"sources" + slash +"TestUtfRead" + slash +"utf8text_out.txt");
    File outputUtfByteWrittenFile = new File(
           "C:" + slash +"sources" + slash +"TestUtfRead" + slash +"utf8text_byteout.txt");
    outputUtfFile.delete();
    outputUtfByteWrittenFile.delete();

    try {

        /*
         * read a utf8 text file with internationalized strings into bytes.
         * there should be no information loss here, when read into raw bytes.
         * We are sure that this file is UTF-8 encoded.
         * Input file created using Notepad++. Text copied from Google translate.
         */

        byte[] fileBytes = readBytes(inputUtfFile);

        /*
         * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
         */

        String str = new String(fileBytes, StandardCharsets.UTF_8);

        /*
         * The console is incapable of displaying this string.
         * So we write into another file. Open in notepad++ to check.
         */

        ArrayList<String> list = new ArrayList<>();
        list.add(str);
        writeLines(list, outputUtfFile);

        /*
         * Works fine when I read bytes and write bytes.
         * Open the other output file in notepad++ and check.
         */

        writeBytes(fileBytes, outputUtfByteWrittenFile);

        /*
         * I am using JDK 8u60.
         * I tried running this on command line instead of eclipse. Does not work.
         * I tried using apache commons io library. Does not work.
         *  
         * This means that new String(bytes, charset); does not work correctly.
         * There is no real effect of specifying charset to string.
         */

    } catch (IOException e) {
        e.printStackTrace();
    }

}

public static void writeLines(List<String> lines, File file) throws IOException {
    BufferedWriter writer = null;
    OutputStreamWriter osw = null;
    OutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        osw = new OutputStreamWriter(fos);
        writer = new BufferedWriter(osw);
        String lineSeparator = System.getProperty("line.separator");
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            writer.write(line);
            if (i < lines.size() - 1) {
                writer.write(lineSeparator);
            }
        }
    } catch (IOException e) {
        throw e;
    } finally {
        close(writer);
        close(osw);
        close(fos);
    }
}

public static byte[] readBytes(File file) {
    FileInputStream fis = null;
    byte[] b = null;
    try {
        fis = new FileInputStream(file);
        b = readBytesFromStream(fis);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fis);
    }
    return b;
}

public static void writeBytes(byte[] inBytes, File file) {
    FileOutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        writeBytesToStream(inBytes, fos);
        fos.flush();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fos);
    }
}

public static void close(InputStream inStream) {
    try {
        inStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    inStream = null;
}

public static void close(OutputStream outStream) {
    try {
        outStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    outStream = null;
}

public static void close(Writer writer) {
    if (writer != null) {
        try {
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        writer = null;
    }
}

public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
    int bytesread = -1;
    byte[] b = new byte[4096]; //4096 is default cluster size in Windows for < 2TB NTFS partitions
    long count = 0;
    bytesread = readStream.read(b);
    while (bytesread != -1) {
        writeStream.write(b, 0, bytesread);
        count += bytesread;
        bytesread = readStream.read(b);
    }
    return count;
}
public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
    ByteArrayOutputStream writeStream = null;
    byte[] byteArr = null;
    writeStream = new ByteArrayOutputStream();
    try {
        copy(readStream, writeStream);
        writeStream.flush();
        byteArr = writeStream.toByteArray();
    } finally {
        close(writeStream);
    }
    return byteArr;
}
public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
    ByteArrayInputStream bis = null;
    bis = new ByteArrayInputStream(inBytes);
    try {
        copy(bis, writeStream);
    } finally {
        close(bis);
    }
}
};

Edit: In response to @JB Nizet and everyone :)

//writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
//writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work.
writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works

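I need to specify the byte encoding when reading bytes into a String. I also need to specify the byte encoding when writing the bytes of a String to a file.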

Once I have a String in JVM, I do not need to remember the source byte encoding, am I right?

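When I write to file, it should convert the String into the default Charset of my machine (be it UTF8 or ASCII or cp1252). That is failing. It fails for UTF16 too. Why does it fail for some Charsets?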


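The Java source file encoding is indeed irrelevant. The reading part of your code is correct (although inefficient). What is incorrect is the writing part: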

osw = new OutputStreamWriter(fos);

should be changed to

osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);

Otherwise, you use the default encoding (which doesn't seem to be UTF-8 on your system) instead of UTF-8.
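As a rough sketch of the same idea with less code, the whole read/write cycle can name the charset explicitly on both sides. The class name `Utf8RoundTrip`, the helper names, and the temporary file are made up for this illustration; `Files.readAllBytes` and `Files.write` (available since Java 7) stand in for the hand-written copy loops. The sample text uses `\u` escapes so that the result does not depend on how the compiler decodes the source file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8RoundTrip {

    // Decode the whole file as UTF-8, regardless of the platform default charset.
    public static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // Encode the text as UTF-8 before writing, again ignoring the platform default.
    public static void writeUtf8(Path path, String text) throws IOException {
        Files.write(path, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical temporary file, used only for this illustration.
        Path tmp = Files.createTempFile("utf8text", ".txt");
        try {
            // "Привет мир." written with Unicode escapes.
            String original = "\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440.";
            writeUtf8(tmp, original);
            String readBack = readUtf8(tmp);
            System.out.println(original.equals(readBack)); // prints true
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because both directions name `StandardCharsets.UTF_8`, the value of file.encoding should not influence the result.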

Note that Java allows forward slashes in file paths, even on Windows. You could simply write

File inputUtfFile = new File("C:/sources/TestUtfRead/utf8text.txt");

Edit:

Once I have a String in JVM, I do not need to remember the source byte encoding, am I right?

Yes, you are right.

When I write to file, it should convert the String into the default Charset of my machine (be it UTF8 or ASCII or cp1252). That is failing.

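If you don't specify any encoding, Java will indeed use the platform default encoding to transform characters to bytes. If you specify an encoding (as suggested at the beginning of this answer), it uses the encoding you tell it to use.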
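To see which default a given JVM is actually using, something like the following can be run both from Eclipse and from the command line. It is only a diagnostic sketch; the class and method names are invented for this example. On JDK 8 the default is fixed at JVM startup from the OS locale or a `-Dfile.encoding=...` argument, which would be consistent with the two launch methods behaving differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultCharsetDemo {

    // Platform dependent: uses whatever Charset.defaultCharset() happens to be.
    public static byte[] encodeWithDefault(String text) {
        return text.getBytes();
    }

    // Same bytes on every machine.
    public static byte[] encodeAsUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // "Привет" written with Unicode escapes so the source encoding does not matter.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442";
        boolean same = Arrays.equals(encodeWithDefault(text), encodeAsUtf8(text));
        System.out.println("default produces the same bytes as UTF-8: " + same);
    }
}
```

If the last line reports `false`, any writer or `getBytes()` call without an explicit charset will not produce UTF-8 on that machine.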

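However, not every encoding can represent all Unicode characters the way UTF-8 can. ASCII, for example, supports only 128 distinct characters. CP1252, AFAIK, supports only 256. So the encoding succeeds, but it replaces each unencodable character with a special character (I can't remember which one), which means: I can't encode this Thai or Russian character because it is not part of my supported character set.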
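The substitution can be observed directly. The sketch below (class and method names are illustrative only) encodes a short Russian string with US-ASCII: `String.getBytes(Charset)` silently substitutes the charset's replacement bytes, which for US-ASCII is `?`, while a `CharsetEncoder` configured with `CodingErrorAction.REPORT` throws instead of losing data.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class UnmappableDemo {

    // Lossy: unencodable characters are silently replaced during getBytes().
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Strict: throws CharacterCodingException instead of substituting.
    public static byte[] encodeStrict(String text, Charset charset)
            throws CharacterCodingException {
        ByteBuffer buf = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        // "Hi Привет" written with Unicode escapes.
        String text = "Hi \u041f\u0440\u0438\u0432\u0435\u0442";

        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));           // prints Hi ??????
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints true

        try {
            encodeStrict(text, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode: " + e);
        }
    }
}
```

The strict variant is useful when a file must be written in a limited charset and silent data loss would be worse than a failure.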

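The UTF-16 encoding should work fine. But make sure your text editor is also configured to use UTF-16 when it reads and displays the content of the file. If it is configured to use another encoding, the displayed content won't be correct.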
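One detail that may matter here: Java's three UTF-16 charsets produce different bytes for the same String. The sketch below (names are illustrative) prints the bytes for a single Cyrillic letter. `UTF_16` writes a big-endian byte-order mark first, while `UTF_16BE` and `UTF_16LE` write no mark at all, so an editor has to guess the byte order. That could explain why only some variants appeared to work in the editor, though this was not verified against the asker's setup.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {

    // Render bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u041f"; // Cyrillic capital letter Pe

        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16)));   // prints FE FF 04 1F
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE))); // prints 04 1F
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE))); // prints 1F 04

        // Decoding with the wrong byte order yields a different character.
        String wrong = new String(text.getBytes(StandardCharsets.UTF_16LE),
                                  StandardCharsets.UTF_16BE);
        System.out.println(String.format("U+%04X", (int) wrong.charAt(0))); // prints U+1F04
    }
}
```

Whichever variant is chosen, the reader (another program or a text editor) has to decode with the same one.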