2.3 Universal Content Extraction using subprocess

This method uses Python's built-in subprocess module to call an external converter that turns DOC files into either PDF or DOCX format. The converted file can then be processed with the usual PDF or DOCX extraction techniques. This combination is useful for projects that need to pull text out of many documents in different formats. Because subprocess lets a script launch an external program and check whether it succeeded, the conversion step can be automated instead of being done by hand.
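As a rough sketch of the conversion step described above, the helper below builds a unoconv command for either target format and runs it through subprocess.run. The names build_conversion_command and convert_document are hypothetical, and the sketch assumes unoconv is installed and available on the PATH; -f (format) and -o (output) are unoconv's own options.

```python
import subprocess
from pathlib import Path

SUPPORTED_TARGETS = ("pdf", "docx")


def build_conversion_command(in_file_path, output_dir, target="pdf"):
    """Return the unoconv argument list and the expected output path."""
    if target not in SUPPORTED_TARGETS:
        raise ValueError(f"unsupported target format: {target}")
    # Output file keeps the input's base name and takes the target extension.
    output_path = Path(output_dir) / f"{Path(in_file_path).stem}.{target}"
    command = ["unoconv", "-f", target, "-o", str(output_path), str(in_file_path)]
    return command, output_path


def convert_document(in_file_path, output_dir, target="pdf"):
    """Convert a DOC file; return the output path, or None on failure."""
    command, output_path = build_conversion_command(in_file_path, output_dir, target)
    try:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, check=True)
    except (subprocess.CalledProcessError, FileNotFoundError) as e:
        print(f"Conversion failed with error: {e}")
        return None
    return output_path
```

Keeping the command construction separate from the call makes it possible to inspect the command without running the converter.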

The following snippet illustrates converting a DOC file to a PDF file and then extracting the content from the generated PDF. The extracted text is returned as a list of strings, one per page.

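import os
import shlex
import subprocess

import PyPDF2


def the_extract():
  # path of the DOC file to convert (placeholder)
  in_file_path = 'Give your file path'
  # input file name without its extension
  base = os.path.splitext(os.path.basename(in_file_path))[0]

  # setting output path
  output_pdf_path = f'/folder_path/{base}.pdf'

  # setting the conversion command; shlex.quote protects paths containing spaces
  conversion_command = (
    f'unoconv -f pdf -o {shlex.quote(output_pdf_path)} {shlex.quote(in_file_path)}'
  )
  try:
    # running the conversion command
    subprocess.run(conversion_command, shell=True, check=True)
  except subprocess.CalledProcessError as e:
    print(f"Conversion failed with error: {e}")
    # no PDF was produced, so there is nothing to extract
    return []

  # fetching the converted file for extraction
  file_pa = output_pdf_path
  pdf_reader = PyPDF2.PdfReader(file_pa)
  num_pages = len(pdf_reader.pages)
  full_text = []
  for i in range(num_pages):
    page = pdf_reader.pages[i]
    full_text.append(page.extract_text())

  return full_text


the_extract()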
  1. Setting Output Path:
    • It creates a string output_pdf_path containing a file path for the output PDF. This path is constructed using the value of a variable base.
  2. Setting Conversion Command:
    • It creates a string conversion_command which is a command-line instruction for converting a file to PDF using the unoconv tool. The input file path (in_file_path) is used in this command.
  3. Conversion Attempt:
    • It attempts to run the conversion command using the subprocess.run function. This executes a shell command, which in this case is the unoconv command for PDF conversion. Because check=True is set, a non-zero exit status raises subprocess.CalledProcessError.
  4. Error Handling:
    • If an error occurs during the conversion (indicated by subprocess.CalledProcessError), it catches the error and prints a message indicating the conversion failure along with the error message.
  5. Fetching and Extracting Content from PDF:
    • It opens the PDF file located at the file_pa path using PyPDF2.PdfReader. The placeholder 'Give your file path' in the snippet must be replaced with an actual file path.
    • It retrieves the number of pages in the PDF and initializes an empty list full_text to store the extracted text.
    • It then iterates through each page in the PDF, extracts the text content using page.extract_text(), and appends it to the full_text list.
  6. Returning Extracted Text:
    • Finally, it returns the list full_text containing the extracted text from all the pages.
  7. Function Call:
    • The function the_extract is called at the end. The placeholder path in the snippet must be replaced with a real file path before the call can succeed.
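
The error-handling and text-collection steps above can be sketched in a slightly more defensive form. This is an illustrative variant, not the original snippet: run_conversion and join_pages are hypothetical helper names, the 120-second timeout is an arbitrary assumption, and the command is passed as an argument list so that no shell is involved.

```python
import subprocess


def run_conversion(command, timeout=120):
    """Run a conversion command given as an argument list; return True on success."""
    try:
        subprocess.run(command, check=True, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        # The executable itself is missing from the PATH.
        print(f"Conversion failed: {command[0]!r} was not found")
        return False
    except subprocess.TimeoutExpired:
        print(f"Conversion failed: no result after {timeout} seconds")
        return False
    except subprocess.CalledProcessError as e:
        # The tool ran but reported a non-zero exit status.
        print(f"Conversion failed with exit code {e.returncode}: {e.stderr}")
        return False
    return True


def join_pages(full_text):
    """Combine the per-page list into one string, skipping empty pages."""
    return "\n".join(text for text in full_text if text)
```

Returning False instead of raising lets the caller skip the extraction step when no PDF was produced, and join_pages gives a single string when a per-page list is not needed.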

Please note that for this code to work, the unoconv tool must be installed and properly configured on your system (it relies on a LibreOffice installation), and the PyPDF2 package must be available in your Python environment. Additionally, replace 'Give your file path' with the actual path of the file you want to extract content from.
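
A quick way to confirm these prerequisites before running the extraction is sketched below. The helper name check_prerequisites is hypothetical; it only reports whether a unoconv executable is on the PATH and whether the PyPDF2 package can be found, and it does not verify that unoconv is configured correctly.

```python
import importlib.util
import shutil


def check_prerequisites(tool="unoconv", package="PyPDF2"):
    """Report whether the command-line tool and the Python package are available."""
    return {
        tool: shutil.which(tool) is not None,
        package: importlib.util.find_spec(package) is not None,
    }


for name, available in check_prerequisites().items():
    print(f"{name}: {'found' if available else 'missing'}")
```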