Hey there, everyone! In this blog post I will walk you through a way to extract text from a PDF with Python.
Installation
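For the extraction we need a Python module named ‘pdf2image’. The code in this post also imports pytesseract and PIL (the Pillow package), so install them as well if you do not already have them. Run this command in your terminal:
pip install pdf2image pytesseract Pillow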
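Installing the Python packages is only part of the setup: pdf2image relies on Poppler's pdftoppm program, and pytesseract relies on the Tesseract executable, both of which are installed separately from pip. The helper below is a small sketch for checking what is missing before you run anything else. The name missing_dependencies is made up for this example and is not part of any library.

```python
import importlib.util
import shutil


def missing_dependencies(modules=("pdf2image", "pytesseract", "PIL"),
                         binaries=("pdftoppm", "tesseract")):
    """Return the names of modules and executables that cannot be found."""
    missing = []
    for module in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(module) is None:
            missing.append(module)
    for binary in binaries:
        # shutil.which returns None when the program is not on the system PATH
        if shutil.which(binary) is None:
            missing.append(binary)
    return missing


print("missing:", missing_dependencies() or "nothing")
```

If the list is not empty, install the missing pieces with pip or with your operating system's package manager before continuing.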
Import the Module
First, we import all the packages. pdf2image converts the pages of a PDF into PPM image files.
We will also join paths and rename the generated image files, so we import the os and sys packages. The last part imports Image from PIL, which opens the page images, and pytesseract, which runs OCR on them:
import pdf2image
import os, sys
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
In the next step we initialize the path to your documents and a counter that will be used later.
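# folder that holds your documents; it must end with a path separator
# (for example '/') because the code below builds paths with PATH + file
PATH = 'Enter your path'
# initialize the counter that you will use later in your pdf extraction function
i = 1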
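Because the rest of the code builds paths with PATH + file, a missing trailing separator silently produces a wrong path. As a small illustration, the sketch below compares plain concatenation with os.path.join. The helper name in_folder is invented for this example.

```python
import os


def in_folder(folder, name):
    """Join a folder and a file name, with or without a trailing separator."""
    return os.path.join(folder, name)


# plain concatenation glues the two names together when the separator is missing
broken = "docs" + "report.pdf"

# os.path.join inserts the separator only where it is needed
joined = in_folder("docs", "report.pdf")
joined_with_separator = in_folder("docs" + os.sep, "report.pdf")
```

Both joined paths point to report.pdf inside docs, while the concatenated one points to a file called docsreport.pdf.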
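The conversion step writes .ppm image files into the folder, and we do not need them once the text has been extracted. This function deletes those leftover files (and macOS .DS_Store files) so they do not get mixed up with the next PDF.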
def delete_ppms():
    for file in os.listdir(PATH):
        if '.ppm' in file or '.DS_Store' in file:
            try:
                os.remove(PATH + file)
            except FileNotFoundError:
                pass
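To try the cleanup idea without touching your real documents, here is a sketch that takes the folder as an argument and runs against a temporary directory. It is a variation, not the function above: the name delete_by_suffix is invented for this example, and it matches on the end of the file name, so a file like notes.ppm.txt is kept, whereas the '.ppm' in file check above would delete it.

```python
import os
import tempfile


def delete_by_suffix(folder, suffixes=(".ppm",), exact_names=(".DS_Store",)):
    """Delete files by suffix or exact name and return the removed names."""
    removed = []
    for name in os.listdir(folder):
        if name.endswith(tuple(suffixes)) or name in exact_names:
            try:
                os.remove(os.path.join(folder, name))
                removed.append(name)
            except FileNotFoundError:
                # the file vanished between listing and deleting; ignore it
                pass
    return sorted(removed)


# demo in a throwaway folder so no real files are touched
with tempfile.TemporaryDirectory() as demo:
    for name in ("page-1.ppm", "notes.ppm.txt", "report.pdf"):
        open(os.path.join(demo, name), "w").close()
    removed = delete_by_suffix(demo)
    remaining = sorted(os.listdir(demo))
```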
Sorting Files
Now we will sort the files in the folder by type, collecting the PDF files in one list and the .docx files in another.
pdf_files = []
docx_files = []
for f in os.listdir(PATH):
    full_name = os.path.join(PATH, f)
    if os.path.isfile(full_name):
        name = os.path.basename(f)
        filename, ext = os.path.splitext(name)
        if ext == '.pdf':
            pdf_files.append(name)
        elif ext == '.docx':
            docx_files.append(name)
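The same sorting idea can be written as a function that groups every file by its extension. This is only a sketch: the name group_by_extension is invented here, and unlike the loop above it lower-cases the extension, so a file named B.PDF lands in the '.pdf' group too.

```python
import os
import tempfile
from collections import defaultdict


def group_by_extension(folder):
    """Map each lower-cased extension to the sorted file names that use it."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        # skip sub-folders, even when their name looks like a file name
        if os.path.isfile(os.path.join(folder, name)):
            ext = os.path.splitext(name)[1].lower()
            groups[ext].append(name)
    return dict(groups)


# demo in a throwaway folder
with tempfile.TemporaryDirectory() as demo:
    for name in ("a.pdf", "B.PDF", "c.docx", "notes.txt"):
        open(os.path.join(demo, name), "w").close()
    os.mkdir(os.path.join(demo, "sub.pdf"))  # a folder, so it is ignored
    groups = group_by_extension(demo)
```

With this version, groups.get('.pdf', []) plays the role of pdf_files and groups.get('.docx', []) the role of docx_files.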
With the files sorted, we can extract the text from each PDF using the pdf_extract function.
Last Step
The print call at the top of the function shows you which file is currently being processed:
def pdf_extract(file, i):
    print("extracting from file:", file)
    # remove .ppm files left over from a previous PDF
    delete_ppms()
    # convert every page of the PDF to a .ppm image saved in PATH
    images = pdf2image.convert_from_path(PATH + file, output_folder=PATH)
    # rename the page images to image<i>-<page>.ppm
    j = 0
    for file in sorted(os.listdir(PATH)):
        if '.ppm' in file and 'image' not in file:
            os.rename(PATH + file, PATH + 'image' + str(i) + '-' + str(j) + '.ppm')
            j += 1
    files = [f for f in os.listdir(PATH) if '.ppm' in f]
    # run OCR on the pages in page order and write the text to result<i>.txt
    with open(PATH + 'result{}.txt'.format(i), 'w') as f:
        for file in sorted(files, key=lambda x: int(x[x.index('-') + 1: x.index('.')])):
            temp = pytesseract.image_to_string(Image.open(PATH + file))
            f.write(temp)
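The sort key in the function above matters because file names sort as text by default, which puts page 10 before page 2. The sketch below shows the difference on three made-up file names. The helper page_index is invented for this example and uses a regular expression instead of the index arithmetic above.

```python
import re


def page_index(name):
    """Return the page number from a name such as 'image1-12.ppm'."""
    match = re.search(r"-(\d+)\.ppm$", name)
    if match is None:
        raise ValueError("unexpected file name: " + name)
    return int(match.group(1))


names = ["image1-10.ppm", "image1-2.ppm", "image1-0.ppm"]

# text order: 'image1-10.ppm' comes before 'image1-2.ppm'
lexicographic = sorted(names)

# page order: 0, 2, 10
numeric = sorted(names, key=page_index)
```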
Finally, run the loop below. When it finishes, open the folder and you will find one text file per PDF: result1.txt holds all the text extracted from the first PDF, result2.txt the text from the second, and so on.
Code:
# start counting at 1 so the first PDF is written to result1.txt
for i, pdf_file in enumerate(pdf_files, start=1):
    pdf_extract(pdf_file, i)
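If you would rather not create and rename .ppm files at all, one alternative is to keep the page images in memory. This is a different approach from the one in this post, shown only as a sketch: it assumes that convert_from_path, called without an output folder, returns the pages as a list of PIL images in page order, and that image_to_string accepts such an image directly. The function name extract_text and its convert and ocr parameters are invented here so the logic can be tried without a real PDF.

```python
def extract_text(pdf_path, convert=None, ocr=None):
    """OCR every page of a PDF and return the text as one string."""
    # import lazily so the function can be tested with stand-in callables
    if convert is None:
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr
    pages = convert(pdf_path)  # one image per page, in page order
    return "".join(ocr(page) for page in pages)
```

Keep in mind that every page is held in memory at once, so for very large PDFs the file-based approach above may be the safer choice.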
That’s all for today’s blog. See you in the next one! Follow me on Twitter to get notified about new posts on the Python world.
https://twitter.com/thegeekyb0y