To find the most suitable Python library for extracting text and tables from PDF files for NLP, I tested four libraries: Pdfminer3K, Pdfplumber, PyPDF and tabula. This report mainly uses one example article, LPE-thesmallletter.pdf. Some of the libraries have difficulty identifying the PDF contents. The four methods and their code are shown below:
1. Pdfminer3K
First, I used Pdfminer3K to extract the contents of the PDF. It is more complex than the other methods, but it can extract all of the relevant data from the table as well as the paragraphs about each stock. (Recommended)
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams

path = r'E:\yangdi2\VT\NLP\LPE-thesmallletter.pdf'
fp = open(path, 'rb')
# Link the parser and the document object to each other.
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize()
# Stop if the PDF does not allow text extraction.
if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed
# Set up the objects that convert each page into layout elements.
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
page_num = 0
key_flag = False
# Process the pages one by one and stop at the page that contains the stock table.
for page in doc.get_pages():
    if key_flag:
        break
    page_num = page_num + 1
    interpreter.process_page(page)
    layout = device.get_result()
    for x1 in layout:
        if isinstance(x1, LTTextBoxHorizontal):
            results = x1.get_text()
            if "7.87" in results:
                key_flag = True
                break
print(x1.get_text())  # get the data of all stock prices
def find_box(layout, keyword):
    # Return the first horizontal text box on the page whose text contains keyword.
    for box in layout:
        if isinstance(box, LTTextBoxHorizontal) and keyword in box.get_text():
            return box
    raise ValueError('no text box contains ' + repr(keyword))

# Search the same page for the remaining pieces of information.
x2 = find_box(layout, "24/10/2018")
print(x2.get_text())  # get the data: Date of Conviction
x3 = find_box(layout, "+27.8%")
print(x3.get_text())  # get the data: One-month change
x4 = find_box(layout, "+1.3%")
print(x4.get_text())  # get the data: change since conviction
x5 = find_box(layout, "Claranova")
print(x5.get_text())  # get the text of stock Claranova
x6 = find_box(layout, "Eurobio")
print(x6.get_text())  # get the text of stock Eurobio
x7 = find_box(layout, "G.E.A")
print(x7.get_text())  # get the text of stock G.E.A
x8 = find_box(layout, "SII")
print(x8.get_text())  # get the text of stock SII
x9 = find_box(layout, "Solocal")
print(x9.get_text())  # get the text of stock Solocal
x10 = find_box(layout, "BUY")
print(x10.get_text())
Next, we use the text boxes found above to extract the name of each stock and its conviction (BUY or SELL).
a1 = x5.get_text()
a2 = x6.get_text()
a3 = x7.get_text()
a4 = x8.get_text()
a5 = x9.get_text()
stocknames = ['Claranova', 'Eurobio', 'G.E.A', 'SII', 'Solocal']
samples = [a1, a2, a3, a4, a5]
zhilin = ['SELL', 'BUY']  # the possible convictions
# Strings are immutable, so the result of replace() must be stored.
samples = [x.replace('\n', ' ') for x in samples]
results = {}
for x in samples:
    for y in zhilin:
        nn = x.find(y)
        if nn == -1:
            continue  # this conviction does not appear in the text
        # Look at up to 40 characters on each side of the conviction.
        beg = nn - 40
        if beg < 0:
            beg = 0
        fini = nn + 40
        if fini > len(x):
            fini = len(x)
        for z in stocknames:
            nnnn = x[beg:fini].find(z)
            if nnnn == -1:
                continue
            else:
                print(y, z)
                results[z] = y
print(results)
Finally, we obtain the stocks and their convictions: {'G.E.A': 'SELL', 'Solocal': 'SELL', 'Claranova': 'BUY', 'Eurobio': 'BUY', 'SII': 'BUY'}.
To apply this method to other PDFs, I recommend finding a package that includes all stock names. With such a list, we could extract information on any stock from a PDF.
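As a rough sketch of how this could generalise, the proximity search above can be wrapped in a function that accepts any list of stock names. The function name `find_convictions`, the `window` parameter and the sample text are illustrative assumptions, not part of the original script; the list of names would come from whatever source of stock names is eventually chosen.

```python
def find_convictions(text, stock_names, labels=("SELL", "BUY"), window=40):
    """Map each stock name to the label found within `window` characters of it.

    Hypothetical helper: it mirrors the proximity search used above,
    but checks every occurrence of each label rather than only the first.
    """
    text = text.replace("\n", " ")
    results = {}
    for label in labels:
        start = text.find(label)
        while start != -1:
            # Look at the text surrounding this occurrence of the label.
            beg = max(start - window, 0)
            end = min(start + window, len(text))
            context = text[beg:end]
            for name in stock_names:
                if name in context:
                    results[name] = label
            start = text.find(label, start + 1)
    return results


# Made-up sample text, for illustration only.
sample = "Claranova\nConviction: BUY since the last letter."
print(find_convictions(sample, ["Claranova", "Eurobio"]))
```

Names that never appear near a label are simply left out of the result, so a long list of stock names can be passed in without producing false entries for stocks the article does not mention.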
2. Pdfplumber
This method is an easy and useful way to extract text and tables from a PDF. It can extract a whole page, and it can also extract a specific table from a page.
I used it to extract the table on page 2 of the PDF (LPE-thesmallletter). More specifically, I used the slice rows[1948:] to select the table directly, which may be difficult to apply to other articles. The result is the table with the name, stock price, date, conviction, etc.
import pdfplumber

pathh = r'C:\LPEthesmallletter.pdf'
with pdfplumber.open(pathh) as pdff:
    pages = pdff.pages[1]  # page 2 (pages are indexed from 0)
    rows = pages.extract_text()
resu = rows[1948:]  # the table starts at character 1948 of the page text
print(resu)
resulist = resu.split('\n')  # convert resu to a list of lines
print(resulist)
# Remove the spaces inside each line, then drop the empty lines.
newresulist = [a.replace(' ', '') for a in resulist]
newresulist = [b for b in newresulist if b != '']
print(newresulist)
# Split every line on '+' and '-' to separate the fields.
L = []
for aa in newresulist:
    for aaa in aa.split('+'):
        L += aaa.split('-')
print(L)
# Drop the fields that contain a percentage.
L = [item for item in L if '%' not in item]
print(L)
# Write each remaining field to a Word document (requires python-docx).
from docx import Document
document = Document()
for ccc in L:
    document.add_paragraph(ccc)
document.save('results.docx')
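The fixed offset 1948 only works for this particular page. One possible way to make the slice less fragile is to search for a piece of text that marks the start of the table. This is a minimal sketch: the function name `text_from_marker` and the marker string are assumptions, and the real marker would need to be taken from the table's header row in the PDF.

```python
def text_from_marker(page_text, marker):
    """Return the page text starting at the first occurrence of `marker`.

    Hypothetical helper; raises ValueError if the marker is absent so
    that a layout change is noticed rather than silently ignored.
    """
    start = page_text.find(marker)
    if start == -1:
        raise ValueError(f"marker {marker!r} not found in page text")
    return page_text[start:]


# Made-up page text, for illustration only.
page_text = "Some introduction.\nName Price Date Conviction\nrow one\nrow two"
table_text = text_from_marker(page_text, "Name Price")
print(table_text.split("\n"))
```

With pdfplumber, `page_text` would be the string returned by `extract_text()` for the page in question.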
3. PyPDF
from PyPDF2 import PdfFileReader, PdfFileWriter

path = r'E:\yangdi2\VT\NLP\LPE-thesmallletter.pdf'
reader = PdfFileReader(path)
if reader.isEncrypted:
    reader.decrypt('')  # try an empty password
page = reader.getNumPages()
print(page)

def pdfCrap(path, start_page, save_path):
    # Copy one page (numbered from 1) of the PDF into a new file.
    start_page = start_page - 1  # convert to a 0-based index
    end_page = start_page + 1  # copy a single page
    output = PdfFileWriter()
    with open(path, "rb") as source:
        pdf_file = PdfFileReader(source, strict=False)
        for i in range(start_page, end_page):
            output.addPage(pdf_file.getPage(i))
        # Write while the source file is still open.
        with open(save_path, "wb") as outputStream:
            output.write(outputStream)
4. Tabula
Tabula requires a Java runtime environment, and it sometimes fails to identify the PDF contents correctly.
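Because of the Java dependency, it may help to check for a Java runtime before calling tabula. The sketch below assumes the tabula-py package (imported as `tabula`) and its `read_pdf` function; the helper names `java_available` and `read_tables` are made up for illustration, and the code was not run against the example PDF.

```python
import shutil


def java_available():
    """Return True if a `java` executable is found on the PATH."""
    return shutil.which("java") is not None


def read_tables(path, page):
    """Read the tables on one page with tabula-py (assumed to be installed)."""
    if not java_available():
        raise RuntimeError("tabula needs a Java runtime, but none was found")
    import tabula  # imported here so that the Java check runs first

    # With multiple_tables=True, read_pdf returns a list of DataFrames.
    return tabula.read_pdf(path, pages=page, multiple_tables=True)


print("Java found:", java_available())
```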
In conclusion, I did not use tabula or PyPDF to extract the useful information from the PDF. The library I recommend most is Pdfminer3K for Python 3. However, other articles may contain many stock names and situations that differ from the example article, LPE-thesmallletter.pdf. We therefore need more general code for extracting PDF contents, for example by finding a package that includes all stock names, which would make it easier to extract information on more stocks.