2.2 Content Extraction using python-docx

python-docx is a Python library for reading and manipulating .docx files. It can extract information from existing documents as well as create and modify them.

The snippet below is an illustrative example that extracts the text content of a .docx file and returns it as a list of strings.

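from docx import Document
from docx.oxml.ns import qn


def extract(file_path):
    # Open the .docx file located at file_path
    doc = Document(file_path)
    full_text = []

    # Append the text of each body paragraph
    for paragraph in doc.paragraphs:
        full_text.append(paragraph.text)

    # Append the text of every table cell, row by row
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                full_text.append(cell.text)

    # Append the header and footer text of each section
    for section in doc.sections:
        for header in section.header.paragraphs:
            full_text.append(header.text)
        for footer in section.footer.paragraphs:
            full_text.append(footer.text)

    # Text boxes: python-docx has no TEXT_BOX inline shape type and inline
    # shapes have no text_frame, so a doc.inline_shapes check cannot run.
    # Read the w:txbxContent elements from the document XML instead (Word may
    # store a fallback copy, so the same text can appear twice).
    for textbox in doc.element.body.iter(qn("w:txbxContent")):
        textbox_text = "\n".join(
            "".join(node.text or "" for node in para.iter(qn("w:t")))
            for para in textbox.iter(qn("w:p"))
        )
        full_text.append(textbox_text)

    return full_text


file_path = "Path of file to extract the data"
extracted_data = extract(file_path)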
  1. def extract(file_path):: This line defines a function named extract that takes a single argument file_path.
  2. doc = Document(file_path): This line creates a document object doc by opening the file located at file_path. The Document class reads .docx files only; the legacy .doc format is not supported.
  3. full_text = []: This initializes an empty list called full_text which will be used to store the extracted content.
  4. for paragraph in doc.paragraphs:: This initiates a loop that iterates over each paragraph in the document.
    • full_text.append(paragraph.text): For each paragraph, the text content is extracted and appended to the full_text list.
  5. for table in doc.tables:: This starts another loop that iterates over each table in the document.
    • for row in table.rows:: For each table, it iterates over each row.
      • for cell in row.cells:: For each row, it iterates over each cell.
        • full_text.append(cell.text): The text content of each cell is extracted and appended to the full_text list.
  6. The next block of code extracts the header and footer text of each section in the document.
    • for section in doc.sections:: This loop iterates over each section in the document.
      • for header in section.header.paragraphs:: For each section, it iterates over the paragraphs in the header.
        • full_text.append(header.text): The text content of each header paragraph is appended to the full_text list.
      • for footer in section.footer.paragraphs:: Similarly, for each section, it iterates over the paragraphs in the footer.
        • full_text.append(footer.text): The text content of each footer paragraph is appended to the full_text list.
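  7. Text boxes: a doc.inline_shapes loop is meant to collect text-box content, but python-docx does not support it in that form. Inline shapes cover pictures, charts and SmartArt; the WD_INLINE_SHAPE enumeration (found in docx.enum.shape, not docx.enum.text) has no TEXT_BOX member, and inline shapes have no text_frame attribute.
    • Text-box text is stored in w:txbxContent elements of the document XML, so it has to be read from there, for example by iterating over doc.element.body for that tag.
      • textbox_text: This holds the text of the paragraphs inside one text box.
      • full_text.append(textbox_text): The text from the text box is appended to the full_text list.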
  8. Finally, return full_text sends back the list containing all the extracted text.
  9. file_path = "Path of file to extract the data": This line assigns the path of the file you want to extract data from to the variable file_path.
  10. extracted_data = extract(file_path): This line calls the extract function with the specified file_path and assigns the returned list of extracted data to the variable extracted_data.
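
Because extract returns a list of fragments rather than one string, a small post-processing step is often useful before the text is passed on. The following is a minimal sketch, not part of python-docx: the helper name join_extracted and its separator parameter are hypothetical, and the sample list only stands in for the value returned by extract(file_path). It skips blank entries and collapses immediate repeats, on the assumption that such repeats come from sources like merged table cells, which can be reported more than once.

```python
def join_extracted(parts, separator="\n"):
    """Join extracted fragments into a single string.

    Blank entries are skipped and a fragment that immediately repeats
    the previous one is dropped.
    """
    cleaned = []
    for part in parts:
        text = part.strip()
        if not text:
            continue  # empty paragraphs and cells carry no content
        if cleaned and cleaned[-1] == text:
            continue  # same fragment twice in a row
        cleaned.append(text)
    return separator.join(cleaned)


# Stand-in for the list returned by extract(file_path)
extracted_data = ["Title", "", "Cell A", "Cell A", "  Footer text  "]
document_text = join_extracted(extracted_data)
```

Collapsing immediate repeats is a trade-off: it also removes a paragraph that the author deliberately wrote twice in a row, so the check can be dropped when exact fidelity matters more than clean output.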