XML file processing with Python lxml Module

This article explains the general functionality of lxml and demonstrates how it parses and reads XML content efficiently, with the aim of making a programmer's life easier. lxml is considered one of the most feature-rich and easy-to-use libraries for processing XML and HTML in Python. We will walk through some of the core features lxml provides. The lxml package represents documents as trees in a way that differs from the DOM. In the DOM, trees are built out of nodes represented as Node instances, and some of those nodes are Element instances representing whole elements. In lxml, an Element behaves like a list of its child elements, exposes its attributes through the dictionary-like attrib, and holds its textual content in text.

The examples below use the sample XML file sample.xml.

At its most fundamental, an XML file needs to be parsed and processed. The parse function quickly converts an XML file into an ElementTree.

The usual way is to import etree from lxml and pass the XML file name or path as the source:

from lxml import etree
tree = etree.parse('sample.xml', parser=etree.XMLParser())
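Because sample.xml is not reproduced in full here, the following minimal sketch parses a small in-memory document instead. The XML snippet is a hypothetical, trimmed-down stand-in for sample.xml, and the fallback import is an assumption added only so the snippet still runs where lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # the calls used below also exist in the standard library
    import xml.etree.ElementTree as etree

# Hypothetical, trimmed-down stand-in for sample.xml
XML = b"""<PurchaseOrders>
  <PurchaseOrder PurchaseOrderNumber="99504" OrderDate="2001-10-20">
    <DeliveryNotes>Please leave packages in shed by driveway.</DeliveryNotes>
  </PurchaseOrder>
</PurchaseOrders>"""

# parse() accepts a file name or a file-like object and returns an ElementTree
tree = etree.parse(io.BytesIO(XML))
root = tree.getroot()

# fromstring() parses the same bytes but returns the root Element directly
same_root = etree.fromstring(XML)

print(root.tag, len(root))  # root tag and number of direct children
print(same_root.tag)
```

parse is the natural choice when the XML lives in a file, while fromstring is convenient for content that is already in memory.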

Top Level Element

print(tree.getroot())

Output:

<Element PurchaseOrders at 0x1fb8c81ed40>

Element nodes for its element children.

# list(element) replaces getchildren(), which lxml marks as deprecated
print(list(tree.getroot()))

Output:

[<Element PurchaseOrder at 0x1dae421eec0>, <Element PurchaseOrder at 0x1dae421f200>, <Element PurchaseOrder at 0x1dae421f2c0>]

Attribute nodes for its attributes.

for e in tree.getroot():
    print(e.attrib)

Output:

{'PurchaseOrderNumber': '99504', 'OrderDate': '2001-10-20'}
{'PurchaseOrderNumber': '99505', 'OrderDate': '2001-10-22'}
{'PurchaseOrderNumber': '99503', 'OrderDate': '2001-10-22'}

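Text nodes for textual content. Among the children of the first PurchaseOrder, only DeliveryNotes holds text directly; the other children contain only nested elements, so their text is whitespace.

for e in tree.getroot()[0]:
    print(e.text)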

Output:

Please leave packages in shed by driveway.
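The three kinds of access shown above (children, attributes, text) can be combined in one pass. The sketch below is one possible way to do that, assuming a hypothetical, shortened version of sample.xml; the variable names are illustrative and the fallback import is only there so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # the calls used below also exist in the standard library
    import xml.etree.ElementTree as etree

# Hypothetical, shortened version of sample.xml
XML = """<PurchaseOrders>
  <PurchaseOrder PurchaseOrderNumber="99504" OrderDate="2001-10-20">
    <Address Type="Shipping"><Name>Amy Adams</Name></Address>
    <DeliveryNotes>Please leave packages in shed by driveway.</DeliveryNotes>
  </PurchaseOrder>
  <PurchaseOrder PurchaseOrderNumber="99505" OrderDate="2001-10-22">
    <Address Type="Shipping"><Name>anna kendrick</Name></Address>
    <DeliveryNotes>Please notify me before shipping.</DeliveryNotes>
  </PurchaseOrder>
</PurchaseOrders>"""

root = etree.fromstring(XML)

summary = []
for order in root:  # iterating an Element yields its child elements
    number = order.get("PurchaseOrderNumber")  # attribute lookup
    name = order.findtext("Address/Name")      # text of a nested element
    notes = order.findtext("DeliveryNotes")    # text of a direct child
    summary.append((number, name, notes))

for row in summary:
    print(row)
```

findtext returns the text of the first matching element, which avoids printing the whitespace-only text of container elements such as Address.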

XML schema structure

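Each Element has an assortment of child nodes of various types: element nodes for its element children, attribute nodes for its attributes, and text nodes for its textual content, as the examples above show.

For the supported XML schema formats, refer to the link below: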

Serialise XML element objects as string type

tostring serializes an element, together with its subtree, to an encoded byte string of XML. Decode the result to obtain a regular string.

print(etree.tostring(tree.getroot()[0]).decode("utf-8"))

Output:

<PurchaseOrder PurchaseOrderNumber="99504" OrderDate="2001-10-20">
    <Address Type="Shipping">
      <Name>Amy Adams</Name>
      <Street>123 Maple Street</Street>
      <City>Mill Valley</City>
      <State>CA</State>
      <Zip>10999</Zip>
      <Country>USA</Country>
    </Address>
    <Address Type="Billing">
      <Name>Chong Wei</Name>
      <Street>8 Oak Avenue</Street>
      <City>Old Town</City>
      <State>PA</State>
      <Zip>95819</Zip>
      <Country>USA</Country>
    </Address>
    <DeliveryNotes>Please leave packages in shed by driveway.</DeliveryNotes>
    <Items>
      <Item PartNumber="872-AC">
        <ProductName>Lawnmower</ProductName>
        <Quantity>1</Quantity>
        <USPrice>148.95</USPrice>
        <Comment>Confirm this is electric</Comment>
      </Item>
      <Item PartNumber="926-AD">
        <ProductName>Dell Monitor</ProductName>
        <Quantity>2</Quantity>
        <USPrice>39.98</USPrice>
        <ShipDate>1999-05-21</ShipDate>
      </Item>
    </Items>
  </PurchaseOrder>
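As a small illustration of the serialization step, the sketch below contrasts the default byte output of tostring with the encoding="unicode" option, which returns a str directly. The Item element is a hypothetical one-element document, and the fallback import is only there so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # the calls used below also exist in the standard library
    import xml.etree.ElementTree as etree

# Hypothetical one-element document
item = etree.fromstring(
    '<Item PartNumber="872-AC"><ProductName>Lawnmower</ProductName></Item>'
)

as_bytes = etree.tostring(item)                     # bytes by default
as_text = etree.tostring(item, encoding="unicode")  # str, no decode() needed

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```

Either form works for printing; the str form simply skips the explicit decode("utf-8") call.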

XML Content search

lxml provides multiple functions for locating elements within an ElementTree (ET). For this demonstration, findall is a good fit: it returns the matching child elements as a list, so we can check which child contains the keyword and report its index number.

Set search element path

roottree = tree.getroot()
subelement = roottree[0].tag           # PurchaseOrder
findalltree = tree.findall(subelement)

print(findalltree)

Output:

[<Element PurchaseOrder at 0x16675a1f040>, <Element PurchaseOrder at 0x16675a1f380>, <Element PurchaseOrder at 0x16675a1f440>]

Set up the search keyword, the enumeration loop, and the match condition.
In this use case, the objective is to identify which sub-element contains the information of interest. For demonstration purposes, "PartNumber" is used as a unique keyword to identify the index and the content of the matching sub-element.

keyword = 'PartNumber="456-NF"'
for h, i in enumerate(findalltree):
    # Match the keyword against the serialized text of each sub-element
    if keyword in etree.tostring(i).decode("utf-8"):
        print(f'index: {h} \n{etree.tostring(i, pretty_print=True).decode("utf-8")}')

Output:

index: 1
<PurchaseOrder PurchaseOrderNumber="99505" OrderDate="2001-10-22">
    <Address Type="Shipping">
      <Name>anna kendrick</Name>
      <Street>456 Main Street</Street>
      <City>Buffalo</City>
      <State>NY</State>
      <Zip>98112</Zip>
      <Country>USA</Country>
    </Address>
    <Address Type="Billing">
      <Name>anna kendrick</Name>
      <Street>456 Main Street</Street>
      <City>Buffalo</City>
      <State>NY</State>
      <Zip>98112</Zip>
      <Country>USA</Country>
    </Address>
    <DeliveryNotes>Please notify me before shipping.</DeliveryNotes>
    <Items>
      <Item PartNumber="456-NF">
        <ProductName>Power Supply</ProductName>
        <Quantity>1</Quantity>
        <USPrice>45.99</USPrice>
      </Item>
    </Items>
  </PurchaseOrder>
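The substring match above works on the serialized text. As an alternative sketch, not the approach used by this tool, the same lookup can be expressed with an ElementPath attribute predicate, which matches on the parsed structure instead. The XML below is a hypothetical, shortened version of sample.xml, and the fallback import is only there so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # the calls used below also exist in the standard library
    import xml.etree.ElementTree as etree

# Hypothetical, shortened version of sample.xml
XML = """<PurchaseOrders>
  <PurchaseOrder PurchaseOrderNumber="99504" OrderDate="2001-10-20">
    <Items>
      <Item PartNumber="872-AC"><ProductName>Lawnmower</ProductName></Item>
      <Item PartNumber="926-AD"><ProductName>Dell Monitor</ProductName></Item>
    </Items>
  </PurchaseOrder>
  <PurchaseOrder PurchaseOrderNumber="99505" OrderDate="2001-10-22">
    <Items>
      <Item PartNumber="456-NF"><ProductName>Power Supply</ProductName></Item>
    </Items>
  </PurchaseOrder>
</PurchaseOrders>"""

root = etree.fromstring(XML)
part_number = "456-NF"

# Keep the index of every PurchaseOrder that has a descendant Item
# whose PartNumber attribute equals the requested value
matches = [
    (index, order)
    for index, order in enumerate(root.findall("PurchaseOrder"))
    if order.find(f".//Item[@PartNumber='{part_number}']") is not None
]

for index, order in matches:
    print(index, order.get("PurchaseOrderNumber"))
```

Matching on the attribute value avoids false positives when the keyword happens to appear in free text, at the cost of needing to know the element and attribute names in advance.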

Element removal action:

The result lets us work on the sub-element of interest as we wish. For example, we can use the index number to remove an unwanted element.

print(list(roottree))

roottree.remove(findalltree[1])

print(list(roottree))

The Result:

Available sub-elements:

[<Element PurchaseOrder at 0x23fee01f140>, <Element PurchaseOrder at 0x23fee01f000>, <Element PurchaseOrder at 0x23fee01f3c0>]

Remaining sub-elements after removing the sub-element at index 1:

[<Element PurchaseOrder at 0x23fee01f140>, <Element PurchaseOrder at 0x23fee01f3c0>]
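The removal step can be reproduced in isolation with the minimal sketch below. It assumes a hypothetical document that keeps only the three PurchaseOrderNumber values seen earlier, and the fallback import is only there so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # the calls used below also exist in the standard library
    import xml.etree.ElementTree as etree

# Hypothetical document holding only the order numbers from sample.xml
XML = """<PurchaseOrders>
  <PurchaseOrder PurchaseOrderNumber="99504"/>
  <PurchaseOrder PurchaseOrderNumber="99505"/>
  <PurchaseOrder PurchaseOrderNumber="99503"/>
</PurchaseOrders>"""

root = etree.fromstring(XML)
orders = root.findall("PurchaseOrder")

before = [order.get("PurchaseOrderNumber") for order in root]

# remove() only accepts a direct child of the element it is called on
root.remove(orders[1])

after = [order.get("PurchaseOrderNumber") for order in root]

print(before)
print(after)
```

Note that the list returned by findall is a snapshot: removing an element from the tree does not shrink that list, so indexes taken from it stay valid for the remaining removals.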

Activate virtual environment

python -m venv .venv

Windows:
.\.venv\Scripts\activate

Linux & Unix:
source .venv/bin/activate

pip install -r requirements.txt

Clone repository

git clone https://github.com/scheehan/xml_parse_and_remove_tool.git
cd xml_parse_and_remove_tool

How to use the tool

This tool provides two features: searching for existing element text and removing an element.
'usage: [--filename "xml filename"][--check keyword]'
'usage: [--filename "xml filename"][--remove <idx no>]'

python xmlsearchandremove.py --filename sample.xml --check "keyword"
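For readers curious how a command line like this could be wired up, here is a minimal argparse sketch. It is not the actual implementation of xmlsearchandremove.py: the function name build_parser, the help texts, and the choice to make --check and --remove mutually exclusive are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the two usage lines above."""
    parser = argparse.ArgumentParser(
        description="Search for or remove XML sub-elements"
    )
    parser.add_argument("--filename", required=True, help="XML file to process")

    # Assumption: exactly one of the two actions is given per run
    action = parser.add_mutually_exclusive_group(required=True)
    action.add_argument("--check", metavar="KEYWORD",
                        help="keyword to search for in the sub-elements")
    action.add_argument("--remove", metavar="IDX", type=int,
                        help="index number of the sub-element to remove")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(
    ["--filename", "sample.xml", "--check", "456-NF"]
)
print(args.filename, args.check, args.remove)
```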
