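PHP: Extract text from doc and docx

I'd like to know how to read the contents of a doc or docx file. I'm using a Linux VPS with PHP, but if there is a simpler solution in another language, please let me know, as long as it runs on a Linux web server.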

This is a .DOCX-only solution. For .DOC or .PDF you need something else, such as pdf2text.php for PDF.

function docx2text($filename) {
    return readZippedXML($filename, "word/document.xml");
}

function readZippedXML($archiveFile, $dataFile) {
    // Create new ZIP archive
    $zip = new ZipArchive;

    // Open received archive file
    if (true === $zip->open($archiveFile)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string, skipping errors and warnings.
            // LIBXML_NOENT and LIBXML_XINCLUDE are left out: entity
            // substitution on untrusted uploads opens the door to XXE.
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags
            return strip_tags($xml->saveXML());
        }
        $zip->close();
    }

    // In case of failure return empty string
    return "";
}

echo docx2text("test.docx"); // Or save the output to a file

Here I've added a solution for getting text from .doc and .docx Word files.

How to extract text from Word .doc and .docx files in PHP

For .doc

private function read_doc() {
    // Open in binary mode; .doc is a binary format
    $fileHandle = fopen($this->filename, "rb");
    if ($fileHandle === false) {
        return false;
    }
    $line = @fread($fileHandle, filesize($this->filename));
    fclose($fileHandle);
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Keep only non-empty chunks that contain no NUL bytes
        $pos = strpos($thisline, chr(0x00));
        if ($pos === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Drop everything except basic ASCII text characters
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}

For .docx

private function read_docx() {
    $striped_content = '';
    $content = '';

    // Note: the zip_* functions are deprecated as of PHP 8.0
    $zip = zip_open($this->filename);

    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {
        // Only the main document part is of interest
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    } // end while

    zip_close($zip);

    // Separate table cells with a space and paragraphs with a line break
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $striped_content = strip_tags($content);

    return $striped_content;
}

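Parsing .docx, .odt, .doc and .rtf files

I wrote a library that parses docx, odt and rtf documents, based on the answers here and elsewhere.

The main improvement I made to .docx and .odt parsing is that the library processes the XML that describes the document and tries to map it to HTML tags (i.e. em and strong tags). This means that if you use the library in a CMS, the text formatting is not lost.

You can get it here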

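My solution is Antiword for .doc and docx2txt for .docx.

Assuming you control a Linux server, download each one, extract it, then install it. I installed both system-wide:

Antiword: make global_install
docx2txt: make install

Then use these tools to extract the text into a string in PHP: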

// for .doc
$text = shell_exec('/usr/local/bin/antiword -w 0 ' .
    escapeshellarg($docFilePath));

// for .docx
$text = shell_exec('/usr/local/bin/docx2txt.pl ' .
    escapeshellarg($docxFilePath) . ' -');

docx2txt requires Perl.
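
The two calls above could be folded into one helper that picks the tool from the file extension. This is only a sketch: the function names buildExtractCommand and extractText are made up for illustration, and the binary paths are simply the ones used in this answer, so adjust them to wherever the tools live on your server.

```php
<?php
// Hypothetical helper: build the shell command for a .doc or .docx file.
// Returns null for any other extension.
function buildExtractCommand(string $path): ?string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    switch ($ext) {
        case 'doc':
            // -w 0 makes antiword put each paragraph on a single line
            return '/usr/local/bin/antiword -w 0 ' . escapeshellarg($path);
        case 'docx':
            // the trailing "-" makes docx2txt write to stdout
            return '/usr/local/bin/docx2txt.pl ' . escapeshellarg($path) . ' -';
        default:
            return null;
    }
}

// Hypothetical wrapper: run the command and return the text, or null on failure.
function extractText(string $path): ?string
{
    $cmd = buildExtractCommand($path);
    if ($cmd === null || !is_readable($path)) {
        return null;
    }
    $out = shell_exec($cmd);
    return is_string($out) ? $out : null;
}
```

Keeping the command building separate from shell_exec makes it easy to check what would be run without having either tool installed.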

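no_freedom's solution does extract text from docx files, but it can mangle whitespace. Most of the files I tested had places where words that should be separated had no space between them. That's not good if you want to run full-text search over the documents you process.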
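
One way to avoid the run-together words is to walk the XML instead of stripping tags, and emit a separator at each paragraph, tab and line break. The following is a rough sketch, not a tested library: the function names are made up, it assumes the usual (transitional) WordprocessingML namespace, and it ignores text boxes, footnotes and headers.

```php
<?php
// Sketch: convert the XML of word/document.xml to plain text while keeping
// paragraph, tab and line-break boundaries. Names are illustrative only.
function documentXmlToText(string $xml): string
{
    $dom = new DOMDocument();
    // Suppress parser warnings; no entity substitution is requested.
    if (!@$dom->loadXML($xml, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    // Assumption: the document uses the transitional OOXML namespace.
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $p) {
        $text = '';
        // Only look inside runs (w:r), so tab-stop definitions in the
        // paragraph properties are not mistaken for real tabs.
        foreach ($xpath->query('.//w:r/w:t | .//w:r/w:tab | .//w:r/w:br', $p) as $node) {
            switch ($node->localName) {
                case 't':
                    $text .= $node->textContent;
                    break;
                case 'tab':
                    $text .= "\t";
                    break;
                case 'br':
                    $text .= "\n";
                    break;
            }
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}

// Sketch: read word/document.xml out of the archive (needs the zip extension).
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return '';
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? '' : documentXmlToText($xml);
}
```

Because every paragraph (including those inside table cells) ends up on its own line, words from neighbouring paragraphs or cells no longer get glued together.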

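I suggest using Apache Tika to extract the text; it can extract content from many file types, such as .doc/.docx and PDF.

Try Apache POI. It is for Java. I suppose you won't have any trouble installing Java on Linux.

You can use Apache Tika as a complete solution; it provides a REST API.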
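
For what it's worth, talking to a Tika server from PHP might look roughly like this. It is a sketch under assumptions: a tika-server instance is already running on localhost at its default port 9998, allow_url_fopen is enabled, and the function name tikaExtractText is made up. The request itself (PUT to /tika with Accept: text/plain) follows the Tika server documentation.

```php
<?php
// Sketch: send a file to a running Tika server and get plain text back.
// Returns null if the file cannot be read or the request fails.
function tikaExtractText(string $path, string $baseUrl = 'http://localhost:9998'): ?string
{
    $body = @file_get_contents($path);
    if ($body === false) {
        return null;
    }

    $context = stream_context_create([
        'http' => [
            'method'  => 'PUT',
            // Ask for plain text; let Tika detect the real type from the bytes.
            'header'  => "Accept: text/plain\r\n"
                       . "Content-Type: application/octet-stream\r\n",
            'content' => $body,
            'timeout' => 60,
        ],
    ]);

    $text = @file_get_contents(rtrim($baseUrl, '/') . '/tika', false, $context);
    return $text === false ? null : $text;
}
```

The same function handles .doc, .docx and PDF, since the type detection happens on the Tika side.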

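Another good library is RawText, since it can OCR images and extract text from any document. It is not free, and it is used through a REST API.

Sample code for extracting a file with RawText:

$result = $rawText->extract($your_file);

I added a small improvement to the doc-to-txt converter function: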

private function read_doc() {
    $line_array = array();
    $fileHandle = fopen( $this->filename, "rb" );
    $line = @fread( $fileHandle, filesize( $this->filename ) );
    fclose( $fileHandle );
    $lines = explode( chr( 0x0D ), $line );
    foreach ( $lines as $thisline ) {
        // Skip chunks containing NUL bytes; keep the rest, including empty lines
        $pos = strpos( $thisline, chr( 0x00 ) );
        if ( $pos === false ) {
            $line_array[] = preg_replace( "/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $thisline );
        }
    }

    return implode( "\n", $line_array );
}

It now preserves empty lines, so the resulting txt file reads line by line.

I use docx2txt to extract docx file content. My code is as follows:

if ($extention == "docx") {
    $docxFilePath = "/var/www/vhosts/abc.com/httpdocs/writers/filename.docx";
    // Keep the command on one logical line; a newline inside the string would split it
    $content = shell_exec('/var/www/vhosts/abc.com/httpdocs/docx2txt/docx2txt.pl '
        . escapeshellarg($docxFilePath) . ' -');
}
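
Since uploads are sometimes misnamed, the extension check in the snippet above could be replaced or backed up by a look at the first bytes of the file. A minimal sketch, with a made-up function name: legacy .doc files are OLE2 compound documents and .docx files are ZIP archives, so the signatures below only tell the two containers apart. An .xls file would also match the first one and an .odt or .xlsx the second.

```php
<?php
// Sketch: guess the Word container format from the leading bytes of a file.
// Returns 'doc', 'docx' or null. Other OLE2/ZIP based formats match too.
function detectWordFormat(string $header): ?string
{
    // OLE2 compound document signature (legacy .doc, .xls, .ppt)
    if (strncmp($header, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'doc';
    }
    // ZIP local file header signature (.docx and other zipped formats)
    if (strncmp($header, "PK\x03\x04", 4) === 0) {
        return 'docx';
    }
    return null;
}

// Usage, with $path being the uploaded file:
// $format = detectWordFormat((string) file_get_contents($path, false, null, 0, 8));
```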