Java: Using jsoup to escape disallowed tags

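I am evaluating jsoup for functionality that sanitizes (but does not remove!) tags that are not whitelisted. Suppose only a single tag is allowed, so the following input

foo bar <script onLoad='stealYourCookies();'>baz

must produce the following: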

foo bar &lt;script onLoad='stealYourCookies();'&gt;baz

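I see the following problems/questions with jsoup:

1. document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements(), but the point is that I don't know whether my source is a full HTML document or just the body, and I want the result in the same shape and form as the input;
2. How do I replace <script>...</script> with &lt;script&gt;...&lt;/script&gt;? I only want to replace the angle brackets with escaped entities and do not want to change any attributes, etc. Node.replaceWith sounds like overkill for this.
3. Is it possible to switch off pretty printing completely (e.g. the insertion of new lines, etc.)?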

Or should I use another framework? So far I have looked at htmlcleaner, but the examples given do not suggest that it supports the functionality I need.
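
For reference, the escaping asked for in the second point can be sketched in plain Java, without jsoup. This is only an illustration of the target output: BracketEscaper and escapeBrackets are hypothetical names, and the sketch escapes every angle bracket in the string it is given, so deciding which tags count as disallowed is still left to the caller.

```java
class BracketEscaper {

    // Hypothetical helper: replaces only '<' and '>' with entities,
    // leaving attributes and all other characters exactly as they were.
    static String escapeBrackets(String markup) {
        StringBuilder out = new StringBuilder(markup.length() + 16);
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (c == '<') {
                out.append("&lt;");
            } else if (c == '>') {
                out.append("&gt;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String tag = "<script onLoad='stealYourCookies();'>";
        // prints: &lt;script onLoad='stealYourCookies();'&gt;
        System.out.println(escapeBrackets(tag));
    }
}
```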

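Answer 1

How do you load/parse the Document with jsoup? If you use parse() or connect().get(), jsoup automatically formats the HTML (inserting html, body and head tags). This ensures that you always have a complete HTML document, even if the input is incomplete.

Assuming you only want to clean the input (without any further processing), you should use clean() instead of the methods listed above.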

Example 1 - using parse()

final String html = "a";

// parse() always builds a complete document around the input.
System.out.println(Jsoup.parse(html));

Output:

<html>
<head></head>
<body>
a
</body>
</html>

The input HTML has been completed so that you have a full document.

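Example 2 - using clean()

final String html = "a";

// Newer jsoup releases call this class Safelist instead of Whitelist.
System.out.println(Jsoup.clean(html, Whitelist.relaxed()));

Output:

a

The input HTML is cleaned, and nothing more.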

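Answer 2

The replaceWith() method does exactly what you need:

Example:

final String html = "<body><script>your script here</script></body>";
Document doc = Jsoup.parse(html);

for( Element element : doc.select("script") )
{
    // Replace the element with a text node holding its markup; jsoup escapes it on output.
    // In newer jsoup releases the baseUri overload is deprecated or gone: use createFromEncoded(element.toString()).
    element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
}

System.out.println(doc);

Output:

<html>
<head></head>
<body>
&lt;script&gt;your script here&lt;/script&gt;
</body>
</html>

Or only the body:

System.out.println(doc.body().html());

Output:

&lt;script&gt;your script here&lt;/script&gt;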

Documentation:

Node.replaceWith(Node in)
TextNode
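
Generalizing this to "escape everything that is not allowed", a minimal sketch might look as follows. It assumes jsoup 1.11 or later (for the single-argument TextNode constructor). DisallowedTagEscaper, escapeDisallowed and the choice of b as the only allowed tag are illustrative assumptions, not part of jsoup or of the question. Parsing with Jsoup.parseBodyFragment() and returning body().html() keeps the html/head/body wrapper out of the result.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class DisallowedTagEscaper {

    // Hypothetical helper: escapes every element whose tag name is not in 'allowed'.
    static String escapeDisallowed(String html, Set<String> allowed) {
        // Parse as a body fragment so body().html() returns only the input's content.
        Document doc = Jsoup.parseBodyFragment(html);
        // Turn off pretty printing so jsoup adds no line breaks or indentation.
        doc.outputSettings().prettyPrint(false);

        // select() returns its own list, so replacing nodes while iterating is safe.
        for (Element element : doc.body().select("*")) {
            if (element == doc.body() || allowed.contains(element.tagName())) {
                continue;
            }
            // The text node holds the raw markup; jsoup escapes it when serializing.
            element.replaceWith(new TextNode(element.outerHtml()));
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Assumption for illustration: only <b> is allowed.
        Set<String> allowed = new HashSet<>(Arrays.asList("b"));
        String input = "foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>";
        System.out.println(escapeDisallowed(input, allowed));
    }
}
```

Two limitations are worth knowing. jsoup re-serializes the tag, so attribute names come out lower-cased and values double-quoted rather than byte-for-byte as they were typed. A disallowed element is also escaped together with everything inside it, including any allowed tags it contains.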

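Answer 3

Yes, the prettyPrint() method of Document.OutputSettings does this.

Example:

final String html = "<p>\nyour html here\n</p>";

Document doc = Jsoup.parse(html);
// Disable pretty printing so jsoup adds no line breaks or indentation of its own.
doc.outputSettings().prettyPrint(false);

System.out.println(doc);

Note: if the outputSettings() method is not available, update jsoup.

Output:

<html><head></head><body><p>
your html here
</p></body></html>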

Documentation:

Document.OutputSettings.prettyPrint(boolean pretty)
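
To see the effect side by side, the following sketch renders the same input with pretty printing on and off and returns only the body content. PrettyPrintDemo and render are hypothetical names used for illustration; the jsoup calls are the ones discussed in this answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PrettyPrintDemo {

    // Hypothetical helper: parses the input and returns the body content,
    // with pretty printing switched on or off.
    static String render(String html, boolean pretty) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(pretty);
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div><p>one</p><p>two</p></div>";

        // With pretty printing off, this input comes back unchanged.
        System.out.println(render(html, false));

        // With pretty printing on, jsoup adds line breaks and indentation.
        System.out.println(render(html, true));
    }
}
```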

Answer 4 (to the last question, which is not in the list)

No! jsoup is one of the best and most capable HTML libraries out there!