Comparing PDF File Contents with Python: A Comparison of Some Methods for Extracting PDF Data with Python

After researching which Python library is best suited to extracting text and tables from PDFs for NLP work, I tested four: Pdfminer3K, Pdfplumber, PyPDF and tabula. This report uses one example document, LPE-thesmallletter.pdf. Some of the libraries have difficulty recognising the PDF's contents. The four methods and their code are shown below:

1. Pdfminer3K

First, I used Pdfminer3K to extract the contents of the PDF. It is more complex than the other methods, but it can extract all the relevant data from the table as well as the paragraphs about each stock. (Recommended)

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams

path = r'E:\yangdi2\VT\NLP\LPE-thesmallletter.pdf'

fp = open(path, 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
# In pdfminer3k the parser and the document are linked in both directions
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize()

if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed

rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# One search string per text box we want, in the order x1 .. x10
keywords = [
    "7.87",        # x1: the data of all stock prices
    "24/10/2018",  # x2: Date of Conviction
    "+27.8%",      # x3: One-month change
    "+1.3%",       # x4: change since conviction
    "Claranova",   # x5: the text of stock Claranova
    "Eurobio",     # x6: the text of stock Eurobio
    "G.E.A",       # x7: the text of stock G.E.A
    "SII",         # x8: the text of stock SII
    "Solocal",     # x9: the text of stock Solocal
    "BUY",         # x10
]

def find_box(layout, keyword):
    # Return the first horizontal text box whose text contains keyword, or None
    for box in layout:
        if isinstance(box, LTTextBoxHorizontal) and keyword in box.get_text():
            return box
    return None

boxes = [None] * len(keywords)
page_num = 0
for page in doc.get_pages():
    page_num = page_num + 1
    interpreter.process_page(page)
    layout = device.get_result()
    boxes = [find_box(layout, k) for k in keywords]
    if any(box is not None for box in boxes):
        break  # stop at the first page that contains a match
fp.close()

# Keep the names x1 .. x10 for the matched text boxes
x1, x2, x3, x4, x5, x6, x7, x8, x9, x10 = boxes
for box in boxes:
    if box is not None:
        print(box.get_text())

We then use the text boxes found above to extract each stock's name and its conviction (BUY or SELL).

a1 = x5.get_text()
a2 = x6.get_text()
a3 = x7.get_text()
a4 = x8.get_text()
a5 = x9.get_text()

stocknames = ['Claranova', 'Eurobio', 'G.E.A', 'SII', 'Solocal']

# str.replace returns a new string, so the cleaned text has to be stored
samples = [s.replace('\n', ' ') for s in (a1, a2, a3, a4, a5)]

zhilin = ['SELL', 'BUY']  # the conviction keywords to look for

results = {}
for x in samples:
    for y in zhilin:
        nn = x.find(y)
        if nn == -1:
            continue  # this keyword does not occur in the paragraph
        # Look for a stock name within 40 characters on either side of the keyword
        beg = max(nn - 40, 0)
        fini = min(nn + 40, len(x))
        for z in stocknames:
            if z in x[beg:fini]:
                print(y, z)
                results[z] = y

print(results)

Finally, we obtain the stocks and their convictions: {'G.E.A': 'SELL', 'Solocal': 'SELL', 'Claranova': 'BUY', 'Eurobio': 'BUY', 'SII': 'BUY'}.

To apply this method to other PDFs, I recommend finding a package that includes all stock names. With such a list, we could extract information about any stock from a PDF.

2. Pdfplumber

This method is an easy and useful way to extract text and tables from a PDF. It can extract a whole page or a specific table on a page.

I used it to extract the table on page 2 of the PDF (LPE-thesmallletter). More specifically, I slice the page text with rows[1948:] to select the table directly, which may be difficult to apply to other documents. In the end we extract the table with the name, stock price, date, conviction, etc.

import pdfplumber
from docx import Document  # provided by the python-docx package

pathh = r'C:\LPEthesmallletter.pdf'

with pdfplumber.open(pathh) as pdff:
    pages = pdff.pages[1]  # second page (the index starts at 0)
    rows = pages.extract_text()

# The table starts at character 1948 of the page text; this offset is specific to this PDF
resu = rows[1948:]
print(resu)

resulist = resu.split('\n')  # convert resu to a list of lines
print(resulist)

# Remove the spaces inside each line, then drop the empty lines
newresulist = [a.replace(' ', '') for a in resulist]
newresulist = [a for a in newresulist if a != '']
print(newresulist)

# Split every line on '+' and '-'
L = []
for aa in newresulist:
    for aaa in aa.split('+'):
        L += aaa.split('-')
print(L)

# Drop the percentage-change fields
L = [item for item in L if '%' not in item]
print(L)

# Write the remaining fields to a Word document, one paragraph each
document = Document()
for ccc in L:
    document.add_paragraph(ccc)
document.save('results.docx')
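The fixed offset 1948 used above is tied to this one PDF. One possible way to make the slicing less brittle, sketched below, is to search the page text for a word known to start the table and slice from there. The helper name `slice_from_marker`, the marker word, and the sample text are assumptions for illustration; the real header wording has to be checked against the extracted text.

```python
def slice_from_marker(page_text, marker):
    """Return page_text from the first occurrence of marker onward, or '' if absent."""
    start = page_text.find(marker)
    if start == -1:
        return ""
    return page_text[start:]


# Hypothetical page text standing in for the result of extract_text()
sample_text = (
    "Intro paragraph about the market.\n"
    "Name Price Date Conviction\n"
    "StockA 1.00 01/01/2018 BUY"
)
table_text = slice_from_marker(sample_text, "Name")
table_lines = table_text.split("\n")
print(table_lines)
```

Returning an empty string when the marker is missing lets the caller detect a page without the table instead of silently processing unrelated text.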

3. PyPDF

from PyPDF2 import PdfFileReader, PdfFileWriter

# Note: these are the older PyPDF2 names; PyPDF2 3.0 replaced them with PdfReader / PdfWriter

path = r'E:\yangdi2\VT\NLP\LPE-thesmallletter.pdf'

reader = PdfFileReader(path)
if reader.isEncrypted:
    reader.decrypt('')  # try the empty password

page = reader.getNumPages()
print(page)

def pdfCrap(path, start_page, save_path):
    # Copy one page of the PDF at path into a new file at save_path
    # the start page (converted from 1-based to 0-based)
    start_page = start_page - 1
    # the end page (exclusive)
    end_page = start_page + 1  # we can set 1 page
    output = PdfFileWriter()
    # The source file has to stay open until the output has been written
    with open(path, "rb") as f:
        pdf_file = PdfFileReader(f, strict=False)
        for i in range(start_page, end_page):
            output.addPage(pdf_file.getPage(i))
        with open(save_path, "wb") as outputStream:
            output.write(outputStream)

4. Tabula

Tabula requires a Java runtime environment, and it sometimes fails to recognise the PDF content properly.

In conclusion, I did not use tabula or PyPDF to extract the useful information from the PDF. The library I recommend most is Pdfminer3K for Python 3. However, other documents may contain many stock names and layouts that differ from the example document, LPE-thesmallletter.pdf. We need more general code for extracting content from PDFs, for example by finding a package that includes all stock names, which would help us extract information about more stocks easily.