Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error while loading the docx(KeyError: "There is no item named 'word/#_top' in the archive") #1351

Closed
akash97715 opened this issue Feb 29, 2024 · 6 comments

Comments

@akash97715
Copy link

Hello Team we are using below code to load the document

from docx import Document

# Path to your DOCX file
docx_file_path = 'myfile.docx'

# Load the DOCX file
document = Document(docx_file_path)

# Example: Print all the text in the document
for para in document.paragraphs:
    print(para.text)

Getting below error:

KeyError                                  Traceback (most recent call last)~\AppData\Local\Temp\ipykernel_13540\1987488543.py in <module>      5       6 # Load the DOCX file----> 7 document = Document(docx_file_path)      8       9 # Example: Print all the text in the document~\Anaconda3\lib\site-packages\docx\api.py in Document(docx)     21     """     22     docx = _default_docx_path() if docx is None else docx---> 23     document_part = Package.open(docx).main_document_part     24     if document_part.content_type != CT.WML_DOCUMENT_MAIN:     25         tmpl = "file '%s' is not a Word file, content type is '%s'"~\Anaconda3\lib\site-packages\docx\opc\package.py in open(cls, pkg_file)    114     def open(cls, pkg_file):    115         """Return an |OpcPackage| instance loaded with the contents of `pkg_file`."""--> 116         pkg_reader = PackageReader.from_file(pkg_file)    117         package = cls()    118         Unmarshaller.unmarshal(pkg_reader, package, PartFactory)~\Anaconda3\lib\site-packages\docx\opc\pkgreader.py in from_file(pkg_file)     23         content_types = _ContentTypeMap.from_xml(phys_reader.content_types_xml)     24         pkg_srels = PackageReader._srels_for(phys_reader, PACKAGE_URI)---> 25         sparts = PackageReader._load_serialized_parts(     26             phys_reader, pkg_srels, content_types     27         )~\Anaconda3\lib\site-packages\docx\opc\pkgreader.py in _load_serialized_parts(phys_reader, pkg_srels, content_types)     51         sparts = []     52         part_walker = PackageReader._walk_phys_parts(phys_reader, pkg_srels)---> 53         for partname, blob, reltype, srels in part_walker:     54             content_type = content_types[partname]     55             spart = _SerializedPart(partname, content_type, reltype, blob, srels)~\Anaconda3\lib\site-packages\docx\opc\pkgreader.py in _walk_phys_parts(phys_reader, srels, visited_partnames)     84                 phys_reader, part_srels, visited_partnames     85             )---> 86             for partname, blob, reltype, srels in next_walker:     87                 yield (partname, blob, reltype, srels)     88 
~\Anaconda3\lib\site-packages\docx\opc\pkgreader.py in _walk_phys_parts(phys_reader, srels, visited_partnames)     84                 phys_reader, part_srels, visited_partnames     85             )---> 86             for partname, blob, reltype, srels in next_walker:     87                 yield (partname, blob, reltype, srels)     88 
~\Anaconda3\lib\site-packages\docx\opc\pkgreader.py in _walk_phys_parts(phys_reader, srels, visited_partnames)     79             reltype = srel.reltype     80             part_srels = PackageReader._srels_for(phys_reader, partname)---> 81             blob = phys_reader.blob_for(partname)     82             yield (partname, blob, reltype, part_srels)     83             next_walker = PackageReader._walk_phys_parts(~\Anaconda3\lib\site-packages\docx\opc\phys_pkg.py in blob_for(self, pack_uri)     81         Raises |ValueError| if no matching member is present in zip archive.     82         """---> 83         return self._zipf.read(pack_uri.membername)     84      85     def close(self):~\Anaconda3\lib\zipfile.py in read(self, name, pwd)   1470     def read(self, name, pwd=None):   1471         """Return file bytes for name."""-> 1472         with self.open(name, "r", pwd) as fp:   1473             return fp.read()   1474 
~\Anaconda3\lib\zipfile.py in open(self, name, mode, pwd, force_zip64)   1509         else:   1510             # Get info object for name-> 1511             zinfo = self.getinfo(name)   1512    1513         if mode == 'w':~\Anaconda3\lib\zipfile.py in getinfo(self, name)   1436         info = self.NameToInfo.get(name)
   1437         if info is None:
-> 1438             raise KeyError(
   1439                 'There is no item named %r in the archive' % name)
   1440 

KeyError: "There is no item named 'word/#_top' in the archive"

Let me know am i doing anything wrong, also it will be helpful if u provide some suggestion to resolve this issue

@scanny
Copy link
Contributor

scanny commented Feb 29, 2024

Sounds like a corrupted docx file. Maybe open it with Word or LibreOffice and save as a new name so it rewrites the file.

@scanny scanny closed this as completed Feb 29, 2024
@akash97715
Copy link
Author

Hello, Thanks for you response. I tried saving with new name but still got the same error. I revalidated the docs it’s not corrupted

@scanny
Copy link
Contributor

scanny commented Mar 2, 2024

@akash97715 if you can send the file I'll take a look at it. Otherwise I just don't have enough to go on. I've never seen this error before and I've been at it for over a decade, so this is something of an edge case.

Do you know the provenance of the document? Was it generated by some package rather than being authored using Word or LibreOffice?

@msr22
Copy link

msr22 commented Mar 18, 2024

@scanny I am facing the same issue. I tried renaming the file & saving it again but still getting the issue. I am able to open the file correctly in MS Word.

@JiuJiu998
Copy link

I also encountered the same problem, unfortunately I couldn't find the cause, but I used the pypandoc library to read it and saved it again to read it normally, although it would lose my necessary file metadata.

@michaelromagne
Copy link

michaelromagne commented Feb 21, 2025

The fix below was inspired by other issues in this repo (here and here). You just have to add the below code snippet before running the document = Document(docx_file_path) line @akash97715

It seems to be due to styled headers or headers with bookmarks. @scanny , I would be interested to know if that makes sense for you. On my use case, the header name was still present in the end list after the change.

from docx.opc.oxml import parse_xml
from docx.opc.pkgreader import _SerializedRelationship, _SerializedRelationships

def load_from_xml_v2(baseURI, rels_item_xml):
    """
    Return |_SerializedRelationships| instance loaded with the
    relationships contained in *rels_item_xml*. Returns an empty
    collection if *rels_item_xml* is |None|.
    """
    srels = _SerializedRelationships()
    if rels_item_xml is not None:
        rels_elm = parse_xml(rels_item_xml)
        for rel_elm in rels_elm.Relationship_lst:
            print(rel_elm.target_ref)
            if (
                rel_elm.target_ref in ("../NULL", "NULL")
                or rel_elm.target_ref.startswith("#_")  # Styled headers
            ):
                continue
            srels._srels.append(_SerializedRelationship(baseURI, rel_elm))
    return srels


_SerializedRelationships.load_from_xml = load_from_xml_v2

snopoke added a commit to dimagi/open-chat-studio that referenced this issue Mar 6, 2025
This is to workaround issue with loading relationships from XML.

See python-openxml/python-docx#1351
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants