2.1 Content Extraction using PyPDF2

PyPDF2 is a Python library for reading and manipulating PDF files. It can extract information from PDF documents, merge multiple PDFs, split PDFs, encrypt and decrypt PDFs, and more.

The snippet below is an illustrative example that extracts the text from a PDF file and returns it as a list with one string per page.

import PyPDF2

def extract(file_path):
    # Open the PDF located at the given file path
    pdf_reader = PyPDF2.PdfReader(file_path)
    # Count the number of pages to be processed
    num_pages = len(pdf_reader.pages)
    # Create an empty list to store the text of each page
    # (alternatively, start from an empty string and concatenate)
    full_text = []
    # Loop over the pages; PyPDF2 extracts the text one page at a time
    for i in range(num_pages):
        page = pdf_reader.pages[i]
        full_text.append(page.extract_text())
    return full_text

file_path = "Path of file to extract the data"
extracted_data = extract(file_path)
  1. def extract(file_path): This line defines a function named extract that takes one argument file_path.
  2. pdf_reader = PyPDF2.PdfReader(file_path): This line creates an instance of the PdfReader class from the PyPDF2 library, which is used for reading PDF files. It takes file_path as input, indicating the location of the PDF file to be processed.
  3. num_pages = len(pdf_reader.pages): This line determines the total number of pages in the PDF document by accessing the pages attribute of the pdf_reader object.
  4. full_text = [] : This creates an empty list called full_text which will be used to store the extracted text from each page of the PDF.
  5. for i in range(num_pages): This starts a loop that will iterate over each page in the PDF document.
  6. page = pdf_reader.pages[i] : Within the loop, this line selects the i-th page for processing.
  7. full_text.append(page.extract_text()) : This extracts the text content from the selected page and appends it to the full_text list. The extract_text() method is used to obtain the text from the page.
  8. After processing all pages, the loop completes, and the function extract returns the full_text list, which now contains the extracted text from all pages of the PDF.
  9. file_path = "Path of file to extract the data" : This assigns the file path (in string format) to the variable file_path to specify the location of the PDF file to be processed.
  10. extracted_data = extract(file_path) : This calls the extract function, passing file_path as an argument, and assigns the returned value (the extracted text) to the variable extracted_data.
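
The index-based loop described in steps 5 to 7 can also be written by iterating over the pages directly. The sketch below is a minimal variant and not part of the original snippet: the name extract_pages is hypothetical, and it accepts any object that exposes a pages sequence whose items have an extract_text() method, which is how a PyPDF2.PdfReader instance behaves. The `or ""` guard is an assumption that a page yielding no text (for example a scanned image) should contribute an empty string.

```python
def extract_pages(reader):
    """Return a list with one text string per page.

    `reader` is assumed to expose `.pages`, an iterable of objects that
    each provide an `extract_text()` method (e.g. a PyPDF2.PdfReader).
    """
    texts = []
    for page in reader.pages:
        # Fall back to "" when a page yields no text at all.
        texts.append(page.extract_text() or "")
    return texts
```

With PyPDF2 installed, this could be called as extract_pages(PyPDF2.PdfReader(file_path)).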

In summary, this code defines a function extract that uses the PyPDF2 library to read a PDF file specified by file_path, extracts the text content from each page, and returns the extracted text as a list. The provided file_path variable is then used to extract data from a specific PDF file.
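
Because extract returns a list with one entry per page, one more step is needed when a single string of content is wanted. A minimal sketch of that step follows; the helper name pages_to_string and the blank-line separator are illustrative choices, not part of the original code.

```python
def pages_to_string(page_texts, separator="\n\n"):
    """Join per-page text into one string, skipping empty pages."""
    # Drop entries that are None, empty, or whitespace only,
    # and trim surrounding whitespace from the rest.
    cleaned = [text.strip() for text in page_texts if text and text.strip()]
    return separator.join(cleaned)
```

For example, content = pages_to_string(extract(file_path)) would give the whole document as one string, with pages separated by a blank line.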