This method uses Python's 'subprocess' module to call an external command-line converter that turns DOC files into either PDF or DOCX. Once the file is converted, you can read it with an ordinary PDF or DOCX library and extract its content. This combination makes it practical to pull data out of legacy DOC files, which is useful for projects that have to process many documents. Because 'subprocess' can run any command-line tool, the same pattern also works with other converters and other file types.
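For the DOCX route, a minimal sketch of the extraction step is shown here. It assumes the DOC file has already been converted to DOCX, and it relies only on the standard library: a DOCX file is a ZIP archive whose main text lives in word/document.xml. The function name extract_docx_paragraphs is a hypothetical helper introduced for illustration, not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_paragraphs(docx_file):
    """Return the text of each paragraph in a DOCX file (path or file-like object)."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # a paragraph's text is split across one or more <w:t> elements
        text = "".join(node.text or "" for node in para.iter(f"{W_NS}t"))
        paragraphs.append(text)
    return paragraphs
```

This sketch only reads the main document body; headers, footers and footnotes are stored in other parts of the archive and are ignored. A dedicated library such as python-docx is a reasonable alternative when you need more than plain paragraph text.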
The snippet below is an illustrative example: it converts a DOC file to a PDF file and then extracts the content of the generated PDF. The extracted text is returned as a list of strings, one per page.
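import os
import subprocess

import PyPDF2

# placeholder: path of the DOC file to convert
in_file_path = 'Give your input file path'
# file name without its extension, used to name the output PDF
base = os.path.splitext(os.path.basename(in_file_path))[0]


def the_extract():
    # setting output path
    output_pdf_path = f'/folder_path/{base}.pdf'
    # setting the conversion command (-o tells unoconv where to write the PDF)
    conversion_command = f'unoconv -f pdf -o "{output_pdf_path}" "{in_file_path}"'
    try:
        # running the conversion command
        subprocess.run(conversion_command, shell=True, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Conversion failed with error: {e}")
        # nothing to extract if the conversion failed
        return []
    # fetching the converted file for extraction
    # (replace the placeholder with the path of the converted PDF)
    file_pa = 'Give your file path'
    pdf_reader = PyPDF2.PdfReader(file_pa)
    full_text = []
    for page in pdf_reader.pages:
        # extract the text of each page in order
        full_text.append(page.extract_text())
    return full_text


the_extract()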
Here is what the code does, step by step:
1. It defines a variable output_pdf_path containing the file path for the output PDF. This path is built from the value of a variable base, which must be defined before the function is called.
2. It defines conversion_command, a command-line instruction for converting a file to PDF using the unoconv tool. The input file path (in_file_path) is used in this command.
3. It runs the conversion command with the subprocess.run function. This executes a shell command, which in this case is the unoconv command for PDF conversion.
4. If the conversion fails (subprocess.CalledProcessError), it catches the error and prints a message indicating the conversion failure along with the error details.
5. It opens the PDF at the file_pa path using PyPDF2.PdfReader. Replace 'Give your file path' with the actual path of the converted PDF.
6. It initializes an empty list, full_text, to store the extracted text.
7. It goes through the pages of the PDF, extracts the text of each page with page.extract_text(), and appends it to the full_text list.
8. It returns the full_text list containing the extracted text from all the pages.
9. The function the_extract is called at the end, but file_pa needs to be properly defined before running the function.
Please note that for this code to work, the unoconv tool (which relies on LibreOffice) must be installed on your system and properly configured, and the PyPDF2 package must be available in your Python environment.
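Since the walkthrough depends on unoconv being installed and on the paths being filled in correctly, a slightly more defensive version of the conversion step can be sketched as follows. This is one possible arrangement under stated assumptions, not the only way to do it: the helper names build_conversion_command and convert are hypothetical, the command is passed as a list of arguments instead of a shell string, and unoconv's -o option is used to choose the output file.

```python
import os
import shutil
import subprocess


def build_conversion_command(in_file_path, output_dir, fmt="pdf"):
    """Build the unoconv argument list and the path the output is expected at."""
    base = os.path.splitext(os.path.basename(in_file_path))[0]
    output_path = os.path.join(output_dir, f"{base}.{fmt}")
    command = ["unoconv", "-f", fmt, "-o", output_path, in_file_path]
    return command, output_path


def convert(in_file_path, output_dir, fmt="pdf"):
    """Convert a document with unoconv; return the output path, or None on failure."""
    # check up front that the external tool is available on PATH
    if shutil.which("unoconv") is None:
        print("unoconv was not found on PATH")
        return None
    command, output_path = build_conversion_command(in_file_path, output_dir, fmt)
    try:
        # an argument list (no shell=True) avoids quoting problems with
        # spaces or special characters in file names
        subprocess.run(command, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Conversion failed with error: {e}")
        return None
    return output_path
```

The path returned by convert can then be passed straight to PyPDF2.PdfReader, so the input path, the output path and the path that is read back cannot drift apart.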