Error while loading the docx(KeyError: "There is no item named 'word/#_top' in the archive") #1351

akash97715 · 2024-02-29T16:11:59Z

Hello team, we are using the code below to load the document:

from docx import Document

# Path to your DOCX file
docx_file_path = 'myfile.docx'

# Load the DOCX file
document = Document(docx_file_path)

# Example: Print all the text in the document
for para in document.paragraphs:
    print(para.text)

We get the error below:

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_13540\1987488543.py in <module>
      5 
      6 # Load the DOCX file
----> 7 document = Document(docx_file_path)
      8 
      9 # Example: Print all the text in the document

~\Anaconda3\lib\site-packages\docx\api.py in Document(docx)
     21     """
     22     docx = _default_docx_path() if docx is None else docx
---> 23     document_part = Package.open(docx).main_document_part
     24     if document_part.content_type != CT.WML_DOCUMENT_MAIN:
     25         tmpl = "file '%s' is not a Word file, content type is '%s'"

~\Anaconda3\lib\site-packages\docx\opc\package.py in open(cls, pkg_file)
    114     def open(cls, pkg_file):
    115         """Return an |OpcPackage| instance loaded with the contents of `pkg_file`."""
--> 116         pkg_reader = PackageReader.from_file(pkg_file)
    117         package = cls()
    118         Unmarshaller.unmarshal(pkg_reader, package, PartFactory)

~\Anaconda3\lib\site-packages\docx\opc\pkgreader.py in from_file(pkg_file)
     23         content_types = _ContentTypeMap.from_xml(phys_reader.content_types_xml)
     24         pkg_srels = PackageReader._srels_for(phys_reader, PACKAGE_URI)
---> 25         sparts = PackageReader._load_serialized_parts(
     26             phys_reader, pkg_srels, content_types
     27         )

~\Anaconda3\lib\site-packages\docx\opc\pkgreader.py in _load_serialized_parts(phys_reader, pkg_srels, content_types)
     51         sparts = []
     52         part_walker = PackageReader._walk_phys_parts(phys_reader, pkg_srels)
---> 53         for partname, blob, reltype, srels in part_walker:
     54             content_type = content_types[partname]
     55             spart = _SerializedPart(partname, content_type, reltype, blob, srels)

~\Anaconda3\lib\site-packages\docx\opc\pkgreader.py in _walk_phys_parts(phys_reader, srels, visited_partnames)
     84                 phys_reader, part_srels, visited_partnames
     85             )
---> 86             for partname, blob, reltype, srels in next_walker:
     87                 yield (partname, blob, reltype, srels)
     88 

~\Anaconda3\lib\site-packages\docx\opc\pkgreader.py in _walk_phys_parts(phys_reader, srels, visited_partnames)
     84                 phys_reader, part_srels, visited_partnames
     85             )
---> 86             for partname, blob, reltype, srels in next_walker:
     87                 yield (partname, blob, reltype, srels)
     88 

~\Anaconda3\lib\site-packages\docx\opc\pkgreader.py in _walk_phys_parts(phys_reader, srels, visited_partnames)
     79             reltype = srel.reltype
     80             part_srels = PackageReader._srels_for(phys_reader, partname)
---> 81             blob = phys_reader.blob_for(partname)
     82             yield (partname, blob, reltype, part_srels)
     83             next_walker = PackageReader._walk_phys_parts(

~\Anaconda3\lib\site-packages\docx\opc\phys_pkg.py in blob_for(self, pack_uri)
     81         Raises |ValueError| if no matching member is present in zip archive.
     82         """
---> 83         return self._zipf.read(pack_uri.membername)
     84 
     85     def close(self):

~\Anaconda3\lib\zipfile.py in read(self, name, pwd)
   1470     def read(self, name, pwd=None):
   1471         """Return file bytes for name."""
-> 1472         with self.open(name, "r", pwd) as fp:
   1473             return fp.read()
   1474 

~\Anaconda3\lib\zipfile.py in open(self, name, mode, pwd, force_zip64)
   1509         else:
   1510             # Get info object for name
-> 1511             zinfo = self.getinfo(name)
   1512 
   1513         if mode == 'w':

~\Anaconda3\lib\zipfile.py in getinfo(self, name)
   1436         info = self.NameToInfo.get(name)
   1437         if info is None:
-> 1438             raise KeyError(
   1439                 'There is no item named %r in the archive' % name)
   1440 

KeyError: "There is no item named 'word/#_top' in the archive"

Please let me know if I am doing anything wrong. Any suggestions for resolving this issue would also be helpful.


scanny · 2024-02-29T18:37:20Z

Sounds like a corrupted docx file. Maybe open it with Word or LibreOffice and save as a new name so it rewrites the file.

akash97715 · 2024-02-29T19:16:54Z

Hello, thanks for your response. I tried saving the file under a new name but still got the same error. I revalidated the document and it is not corrupted.

scanny · 2024-03-02T20:35:07Z

@akash97715 if you can send the file I'll take a look at it. Otherwise I just don't have enough to go on. I've never seen this error before and I've been at it for over a decade, so this is something of an edge case.

Do you know the provenance of the document? Was it generated by some package rather than being authored using Word or LibreOffice?
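If sharing the file is not possible, one way to narrow this down is to list the relationships inside the package whose targets do not exist in the zip archive. The following is a minimal sketch using only the standard library; the function name `find_dangling_relationships` is hypothetical, and the sketch assumes that relationship targets are resolved relative to the directory of the part that owns the `.rels` file, and that relationships marked `TargetMode="External"` are not expected to be archive members.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the *.rels files inside an OPC package (.docx, .xlsx, ...)
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def find_dangling_relationships(docx_path):
    """Return (rels_file, rId, target) for internal targets missing from the zip.

    `docx_path` may be a filesystem path or a binary file-like object.
    """
    dangling = []
    with zipfile.ZipFile(docx_path) as zf:
        members = set(zf.namelist())
        for name in sorted(members):
            if not name.endswith(".rels"):
                continue
            # "word/_rels/document.xml.rels" describes a part that lives in "word/"
            base_dir = posixpath.dirname(posixpath.dirname(name))
            root = ET.fromstring(zf.read(name))
            for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # external links are not archive members
                target = rel.get("Target", "")
                # Resolve the target against the owning part's directory
                resolved = posixpath.normpath(
                    posixpath.join("/", base_dir, target)
                ).lstrip("/")
                if resolved not in members:
                    dangling.append((name, rel.get("Id"), target))
    return dangling
```

Calling `find_dangling_relationships('myfile.docx')` on an affected file should, under these assumptions, report an entry whose target is `#_top`, which would point to the relationship behind the `KeyError`.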

msr22 · 2024-03-18T15:55:49Z

@scanny I am facing the same issue. I tried renaming the file and saving it again, but I still get the error. I am able to open the file correctly in MS Word.

JiuJiu998 · 2024-10-14T03:10:41Z

I ran into the same problem and unfortunately couldn't find the cause. As a workaround, I read the file with the pypandoc library and saved it again, after which it loaded normally, although this loses file metadata that I need.

michaelromagne · 2025-02-21T14:52:11Z

The fix below was inspired by other issues in this repo (here and here). Just add the code snippet below before running the document = Document(docx_file_path) line, @akash97715.

It seems to be caused by styled headers or headers with bookmarks. @scanny, I would be interested to know whether that makes sense to you. In my use case, the header name was still present in the final list after the change.

from docx.opc.oxml import parse_xml
from docx.opc.pkgreader import _SerializedRelationship, _SerializedRelationships


def load_from_xml_v2(baseURI, rels_item_xml):
    """
    Return |_SerializedRelationships| instance loaded with the
    relationships contained in *rels_item_xml*. Returns an empty
    collection if *rels_item_xml* is |None|.
    """
    srels = _SerializedRelationships()
    if rels_item_xml is not None:
        rels_elm = parse_xml(rels_item_xml)
        for rel_elm in rels_elm.Relationship_lst:
            target_ref = rel_elm.target_ref
            # Skip relationships that do not point at a real part in the package:
            # "NULL" placeholders and "#_..." anchors (styled headers).
            if target_ref in ("../NULL", "NULL") or target_ref.startswith("#_"):
                continue
            srels._srels.append(_SerializedRelationship(baseURI, rel_elm))
    return srels


# Monkey-patch python-docx; this must run before Document(...) is called.
_SerializedRelationships.load_from_xml = staticmethod(load_from_xml_v2)
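To illustrate why skipping `#_` targets avoids the `KeyError`, here is a small standard-library sketch of the path resolution involved. It is an approximation and not python-docx's actual code: it assumes a relative target is joined onto the base URI of the owning part (`/word` for the main document) and normalized, and that the zip member name is that path without the leading slash. The helper names `resolve_member_name` and `is_skipped_by_patch` are hypothetical.

```python
import posixpath


def resolve_member_name(base_uri, target_ref):
    """Approximate how a relative relationship target becomes a zip member name."""
    abs_uri = posixpath.normpath(posixpath.join(base_uri, target_ref))
    return abs_uri.lstrip("/")


def is_skipped_by_patch(target_ref):
    """Mirror of the filter condition used in load_from_xml_v2."""
    return target_ref in ("../NULL", "NULL") or target_ref.startswith("#_")


# A bookmark anchor is treated like a part name and looked up in the archive,
# where no such member exists:
print(resolve_member_name("/word", "#_top"))       # word/#_top
# A genuine part resolves to a member that is normally present:
print(resolve_member_name("/word", "styles.xml"))  # word/styles.xml
```

Under that assumption, an internal relationship with the target `#_top` resolves to the member name `word/#_top`, which matches the name in the error message. The patch drops such relationships before any lookup happens.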


scanny closed this as completed Feb 29, 2024

snopoke mentioned this issue Mar 6, 2025

PipelineNodeRunError: "There is no item named 'word/#_top' in the archive" dimagi/open-chat-studio#1286

Closed

snopoke added a commit to dimagi/open-chat-studio that referenced this issue Mar 6, 2025

patch docx reader

5b41862

This is to workaround issue with loading relationships from XML. See python-openxml/python-docx#1351

snopoke mentioned this issue Mar 6, 2025

patch docx reader dimagi/open-chat-studio#1288

Merged
