Java: Any concerns about executing URLDecoder against a URL that was not encoded?

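I am currently incorporating URLEncoder and URLDecoder into some code.
Many URLs that are already saved will now be processed by the URLDecoder routine, even though they were never processed by the URLEncoder routine in the first place.

Based on some testing there do not seem to be any problems, but I have not tested every scenario.

I did notice that characters such as /, which would normally be encoded, pass through the decode routine unchanged even when they were never encoded.

This led me to an overly simplistic analysis. It looks as if the URLDecoder routine essentially scans the URL for % and the 2 bytes that follow it (with UTF-8 supplied as the encoding). As long as the previously saved URLs do not contain any %, there should be no problem when they are processed by the URLDecoder routine. Does that sound right?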


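Yes, it works for "simple" cases, but if you call URLDecoder.decode on an unencoded URL that contains certain special characters, you may run into a) exceptions or b) unexpected behavior.

Consider the following example. For the third test it throws java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern, and for the second test it changes the URL without any exception (whereas a regular encode/decode round trip works without problems):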

import java.net.URLDecoder;
import java.net.URLEncoder;

public class Test {
    public static void main(String[] args) throws Exception {
        test("http://www.foo.bar/");
        test("http://www.foo.bar/?q=a+b");
        test("http://www.foo.bar/?q=äöüß%"); // Will throw exception
    }

    private static void test(String url) throws Exception {
        String encoded = URLEncoder.encode(url,"UTF-8");
        String decoded = URLDecoder.decode(encoded,"UTF-8");
        System.out.println("encoded: " + encoded);
        System.out.println("decoded: " + decoded);
        System.out.println(URLDecoder.decode(decoded,"UTF-8")); // second decode: input is no longer encoded
    }
}

Output (note how the + sign disappears):

encoded: http%3A%2F%2Fwww.foo.bar%2F
decoded: http://www.foo.bar/
http://www.foo.bar/
encoded: http%3A%2F%2Fwww.foo.bar%2F%3Fq%3Da%2Bb
decoded: http://www.foo.bar/?q=a+b
http://www.foo.bar/?q=a b
encoded: http%3A%2F%2Fwww.foo.bar%2F%3Fq%3D%C3%A4%C3%B6%C3%BC%C3%9F%25
decoded: http://www.foo.bar/?q=äöüß%
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern
    at java.net.URLDecoder.decode(Unknown Source)
    at Test.test(Test.java:16)

For both cases, see also the Javadoc of URLDecoder:

  • The plus sign "+" is converted into a space character " ".
  • A sequence of the form "%xy" will be treated as representing a byte where xy is the two-digit hexadecimal representation of the 8 bits.
    Then, all substrings that contain one or more of these byte sequences
    consecutively will be replaced by the character(s) whose encoding
    would result in those consecutive bytes. The encoding scheme used to
    decode these characters may be specified, or if unspecified, the
    default encoding of the platform will be used.
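
The two rules quoted above can be seen in isolation with a small sketch. The class and method names (DecodeRulesDemo, decode, failsToDecode) are made up for illustration; only URLDecoder.decode is a real JDK call, and the Charset overload used here assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original answer.
final class DecodeRulesDemo {

    // Thin wrapper around the JDK decoder (Charset overload, Java 10+).
    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    // Returns true if decoding the input throws IllegalArgumentException.
    static boolean failsToDecode(String s) {
        try {
            decode(s);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Rule 1: '+' is turned into a space, even if it was a literal plus.
        System.out.println(decode("q=a+b"));
        // Rule 2: "%xy" sequences are read as bytes and decoded as UTF-8.
        System.out.println(decode("q=%C3%A4"));
        // A '%' that is not followed by two hex digits is rejected.
        System.out.println(failsToDecode("q=100%"));
    }
}
```

A string that contains neither + nor % passes through unchanged, which is why the "simple" URLs in the test above survive a second decode.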

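If you are sure that the unencoded URLs contain neither + nor %, then I would say it is safe to call URLDecoder.decode. Otherwise, I suggest adding extra checks, e.g. try to decode and compare the result with the original (see this question on SO).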
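
One possible shape for such a check is sketched below. It is only an illustration under assumptions: the class LenientUrlDecoder and its methods are hypothetical names, the Charset overload of URLDecoder.decode requires Java 10 or later, and treating a malformed escape as "never encoded" is a guess about the stored data. Note that it cannot tell a literal + in an unencoded URL from an encoded space, so that case would still be altered.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a sketch, not a definitive implementation.
final class LenientUrlDecoder {

    // True when the string contains neither '%' nor '+',
    // i.e. URLDecoder.decode would return it unchanged.
    static boolean isUnaffectedByDecoding(String url) {
        return url.indexOf('%') < 0 && url.indexOf('+') < 0;
    }

    // Decodes the URL, but keeps the original when decoding fails.
    static String decodeOrKeep(String url) {
        if (isUnaffectedByDecoding(url)) {
            return url; // nothing for the decoder to do
        }
        try {
            return URLDecoder.decode(url, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            // Assumption: a malformed escape means the URL was never encoded.
            return url;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeOrKeep("http%3A%2F%2Fwww.foo.bar%2F"));
        System.out.println(decodeOrKeep("http://www.foo.bar/?q=100%"));
    }
}
```

Comparing the return value with the input shows whether decoding changed anything, which can be logged or reviewed before the stored URL is overwritten.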