Java: How to convert HTML text to plain text?

Friends,
I have to parse a description from a URL, and the parsed content contains a few HTML tags. How can I convert it to plain text?

Yes, Jsoup would be the better choice. Just do the following to convert the whole HTML text to plain text.

String plainText = Jsoup.parse(your_html_text).text();


Getting rid of the HTML tags is simple:

// replace all occurrences of one or more HTML tags with optional
// whitespace in between with a single space character
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");

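But unfortunately, the requirements are never that simple:

Usually, some elements need separate handling, and there may be CDATA blocks containing > characters (e.g. JavaScript) that confuse the regular expression, and so on.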
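To illustrate one of these complications, here is a minimal sketch that removes script/style blocks and comments before stripping the remaining tags. It uses only java.util.regex. The class and method names (RegexHtmlStripper, toPlainText) are made up for this example, and it is a rough approximation under simple assumptions, not a replacement for a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the names are not from any library.
final class RegexHtmlStripper {
    // Whole <script>...</script> and <style>...</style> blocks, case-insensitive, across lines.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // HTML comments, which may themselves contain '>' characters.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // One or more tags with optional whitespace in between (the regex from the answer).
    private static final Pattern TAGS = Pattern.compile("(?s)<[^>]*>(\\s*<[^>]*>)*");

    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        // Collapse the whitespace left behind by the replacements.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello</p><script>if (a > b) { run(); }</script><p>world</p>";
        System.out.println(toPlainText(html)); // prints "Hello world"
    }
}
```

Without the first step, the body of the script element would stay in the output as if it were page text, because the tag regex only removes the tags themselves.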


You can use this single line to remove the HTML tags and display the result as plain text.

htmlString = htmlString.replaceAll("\\<.*?\\>", "");
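One caveat with a lazy tag pattern like this: in Java, `.` does not match line terminators by default, so a tag that spans several lines is left in place. The sketch below shows the difference the `(?s)` (DOTALL) flag makes. The class and method names are made up for the example, and the pattern is written as `<.*?>`, which matches the same text as the escaped form.

```java
// Hypothetical demo class; names are for illustration only.
final class LazyTagRegexDemo {
    // Lazy tag pattern without DOTALL: '.' stops at line breaks.
    static String stripSingleLine(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // Same pattern with (?s), so '.' also matches line breaks inside a tag.
    static String stripDotAll(String html) {
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        String html = "<a\nhref=\"x\">link</a>";
        System.out.println(stripSingleLine(html)); // the opening tag survives
        System.out.println(stripDotAll(html));     // prints "link"
    }
}
```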

Use an HTML parser such as htmlCleaner.

For a detailed answer, see: How to remove HTML tags in Java
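As a rough illustration of the parser-based approach, the sketch below uses the HTML parser that ships with the JDK (javax.swing.text.html) rather than htmlCleaner, so that it runs without extra dependencies. That swap is an assumption made for the example, as are the class and method names. The bundled parser is old and lenient, so a dedicated library will usually cope better with modern pages.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of htmlCleaner or any other library.
final class SwingHtmlText {
    static String toPlainText(String html) throws IOException {
        StringBuilder out = new StringBuilder();
        // The callback receives each run of text between tags.
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append(' ');
            }
        };
        // 'true' tells the parser to ignore any charset declared in the document.
        new ParserDelegator().parse(new StringReader(html), collector, true);
        // Normalise the separators added between text runs.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(toPlainText(html));
    }
}
```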

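I suggest parsing the raw HTML with jTidy, which should give you output that you can write XPath expressions against. This is the most reliable way of scraping HTML that I have found.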

I use HTMLUtil.textFromHTML(value)
from

<dependency>
    <groupId>org.clapper</groupId>
    <artifactId>javautil</artifactId>
    <version>3.2.0</version>
</dependency>

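I needed a plain-text representation of some HTML that contained FreeMarker tags. The question was solved with the JSoup solution, but JSoup escapes the FreeMarker tags and so breaks their functionality. I also tried htmlCleaner (SourceForge), but it left the HTML header and style content in the output (with the tags removed).
http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726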

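My code:

return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();

setMaxLineLength(Integer.MAX_VALUE) ensures that lines are not artificially wrapped at 80 characters.
setNewLine(null) uses the same newline character(s) as the source.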

If you want the text rendered the way a browser would display it, use:

import net.htmlparser.jericho.*;
import java.util.*;
import java.io.*;
import java.net.*;

public class RenderToText {
    public static void main(String[] args) throws Exception {
        // Fall back to a local test file when no argument is given.
        String sourceUrlString = "data/test.html";
        if (args.length == 0)
            System.err.println("Using default argument of \"" + sourceUrlString + '"');
        else
            sourceUrlString = args[0];
        // Treat an argument without a scheme as a local file path.
        if (sourceUrlString.indexOf(':') == -1) sourceUrlString = "file:" + sourceUrlString;
        Source source = new Source(new URL(sourceUrlString));
        // The renderer produces a text layout of the parsed document.
        String renderedText = source.getRenderer().toString();
        System.out.println("\nSimple rendering of the HTML document:\n");
        System.out.println(renderedText);
    }
}

I hope this helps with rendering tables the way a browser formats them.

Thanks,
Ganesh
