PyPDF2 is a Python library for reading and manipulating PDF files. It can extract information from PDF documents, merge multiple PDFs, split PDFs, encrypt and decrypt PDFs, and more.
The snippet below is an illustrative example of extracting text from a PDF file and collecting it as a list of strings, one per page (or, optionally, as a single string).
import PyPDF2

def extract(file_path):
    # Open the PDF at the given file path
    pdf_reader = PyPDF2.PdfReader(file_path)
    # Count the number of pages to be processed
    num_pages = len(pdf_reader.pages)
    # Create an empty list to store the text of each page
    # (use full_text = '' instead to build a single string)
    full_text = []
    # Loop over the pages; PyPDF2 extracts text one page at a time
    for i in range(num_pages):
        page = pdf_reader.pages[i]
        full_text.append(page.extract_text())
    return full_text

file_path = "Path of file to extract the data"
extracted_data = extract(file_path)
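
The commented-out alternative in the snippet (full_text = '') hints at returning one string instead of a list. A minimal sketch of that step, assuming the per-page texts have already been collected in a list; join_pages and its separator parameter are hypothetical names used for illustration, not part of PyPDF2:

```python
def join_pages(page_texts, separator="\n"):
    """Combine per-page text into one string.

    Hypothetical helper: entries that are empty or None are skipped,
    which may occur for pages with no extractable text.
    """
    return separator.join(text for text in page_texts if text)


# Example with stand-in page texts instead of a real PDF
pages = ["First page", "", "Third page"]
combined = join_pages(pages)
```

With the function from the article, this could be called as join_pages(extract(file_path)).
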
Line by line, the code works as follows:

def extract(file_path): This line defines a function named extract that takes one argument, file_path.

pdf_reader = PyPDF2.PdfReader(file_path): This line creates an instance of the PdfReader class from the PyPDF2 library, which is used for reading PDF files. It takes file_path as input, indicating the location of the PDF file to be processed.

num_pages = len(pdf_reader.pages): This line determines the total number of pages in the PDF document by taking the length of the pages attribute of the pdf_reader object.

full_text = []: This creates an empty list called full_text, which will store the extracted text from each page of the PDF.

for i in range(num_pages): This starts a loop that iterates over each page in the PDF document.

page = pdf_reader.pages[i]: Within the loop, this line selects the i-th page for processing.

full_text.append(page.extract_text()): This extracts the text content from the selected page with the extract_text() method and appends it to the full_text list.

return full_text: The function returns the full_text list, which now contains the extracted text from all pages of the PDF.

file_path = "Path of file to extract the data": This assigns the file path (as a string) to the variable file_path to specify the location of the PDF file to be processed.

extracted_data = extract(file_path): This calls the extract function, passing file_path as an argument, and assigns the returned value (the extracted text) to the variable extracted_data.

In summary, this code defines a function extract that uses the PyPDF2 library to read the PDF file specified by file_path, extracts the text content from each page, and returns the extracted text as a list with one entry per page. The file_path variable is then used to extract data from a specific PDF file.
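
As a variation on the loop described above, the pages can be iterated directly rather than by index. The sketch below assumes only that the reader object exposes a pages sequence whose items have an extract_text() method, as the PdfReader used in this article does. iter_page_text, StubPage, and StubReader are hypothetical names introduced here so the example runs without a PDF file.

```python
def iter_page_text(reader):
    """Yield (page_number, text) pairs, numbering pages from 1."""
    for number, page in enumerate(reader.pages, start=1):
        yield number, page.extract_text()


# Stand-in objects that mimic the two attributes used above
# (hypothetical; a real run would pass a PyPDF2.PdfReader instead).
class StubPage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class StubReader:
    def __init__(self, texts):
        self.pages = [StubPage(t) for t in texts]


result = list(iter_page_text(StubReader(["alpha", "beta"])))
```

With a real file, the same function would be called as iter_page_text(PyPDF2.PdfReader(file_path)). Keeping the page number next to each text makes it easier to trace a passage back to its page.
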