Python lxml
last modified January 29, 2024
In this article we show how to parse and generate XML and HTML data in Python using the lxml library.
The lxml library provides Python bindings for the C libraries libxml2 and libxslt.
The following file is used in the examples.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Words</title>
</head>
<body>
<ul>
<li>sky</li>
<li>cup</li>
<li>water</li>
<li>cloud</li>
<li>bear</li>
<li>wolf</li>
</ul>
<div id="output">
...
</div>
</body>
</html>
This is a simple HTML document.
Python lxml iterate tags
In the first example, we iterate over the tags of the document.
#!/usr/bin/python
from lxml import html
fname = 'words.html'
tree = html.parse(fname)
for e in tree.iter():
print(e.tag)
The program lists all available HTML tags.
from lxml import html
We import the html module.
fname = 'words.html' tree = html.parse(fname)
We parse the document from the given file with parse.
for e in tree.iter():
print(e.tag)
We iterate over the elements utilizing iter.
$ ./tags.py html head meta title body ul li li li li li li div
Python lxml root element
The root element is retrieved with getroot.
#!/usr/bin/python
from lxml import html
import re
fname = 'words.html'
tree = html.parse(fname)
root = tree.getroot()
print(root.tag)
print('----------------')
print(root.head.tag)
print(root.head.text_content().strip())
print('----------------')
print(root.body.tag)
print(re.sub('\s+', ' ', root.body.text_content()).strip())
In the program, we get the root element. We print the head, body tags and their text content.
tree = html.parse(fname) root = tree.getroot()
From the document tree, we get the root using the getroot method.
print(root.tag)
We print the tag name (html) of the root element.
print(root.head.tag) print(root.head.text_content().strip())
We print the head tag and its text content.
print(root.body.tag)
print(re.sub('\s+', ' ', root.body.text_content()).strip())
Similarly, we print the body tag and its text content. To remove excessive space, we use a regular expression.
$ ./root.py html ---------------- head Words ---------------- body sky cup water cloud bear wolf ...
Python lxml create document
The lxml module allows to create HTML documents.
#!/usr/bin/python
from lxml import etree
root = etree.Element('html', lang='en')
head = etree.SubElement(root, 'head')
title = etree.SubElement(head, 'title')
title.text = 'HTML document'
body = etree.SubElement(root, 'body')
p = etree.SubElement(body, 'p')
p.text = 'A simple HTML document'
with open('new.html', 'wb') as f:
f.write(etree.tostring(root, pretty_print=True))
We use the etree module for generating the document.
root = etree.Element('html', lang='en')
We create the root element.
head = etree.SubElement(root, 'head') title = etree.SubElement(head, 'title')
Inside the root element, we create two children.
title.text = 'HTML document'
We insert text via the text attribute.
with open('new.html', 'wb') as f:
f.write(etree.tostring(root, pretty_print=True))
Finally, we write the document to a file.
Python lxml findall
The findall method is used to find all specified elements.
#!/usr/bin/python
from lxml import html
fname = 'words.html'
root = html.parse(fname)
els = root.findall('body/ul/li')
for e in els:
print(e.text)
The program finds all li tags and prints their content.
els = root.findall('body/ul/li')
We find all elements with findall. We pass the exact path to the
elements.
for e in els:
print(e.text)
We iterate over the tags and print their text content.
$ ./find_all.py sky cup water cloud bear wolf
Python lxml find by id
A specific element can be found by get_element_by_id.
#!/usr/bin/python
from lxml import html
fname = 'words.html'
tree = html.parse(fname)
root = tree.getroot()
e = root.get_element_by_id('output')
print(e.tag)
print(e.text.strip())
The program finds the div element by its id and prints it tag name
and text content.
$ ./find_by_id.py div ...
Python lxml web scrape
The lxml module can be used for web scraping.
#!/usr/bin/python
import urllib3
import re
from lxml import html
http = urllib3.PoolManager()
url = 'http://webcode.me/countries.html'
resp = http.request('GET', url)
content = resp.data.decode('utf-8')
doc = html.fromstring(content)
els = doc.findall('body/table/tbody/tr')
for e in els[:10]:
row = e.text_content().strip()
row2 = re.sub('\s+', ' ', row)
print(row2)
The program fetches an HTML document that contains a list of most populated countries. It prints the top ten countries from the table.
import urllib3
To fetch the web page, we use the urllib3 library.
http = urllib3.PoolManager()
url = 'http://webcode.me/countries.html'
resp = http.request('GET', url)
We generate a GET request to the resource.
content = resp.data.decode('utf-8')
doc = html.fromstring(content)
We decode the content and parse the document.
els = doc.findall('body/table/tbody/tr')
We find all tr tags which contain the data.
for e in els[:10]:
row = e.text_content().strip()
row2 = re.sub('\s+', ' ', row)
print(row2)
We go over the list of rows and print the top ten rows.
$ ./scrape.py 1 China 1382050000 2 India 1313210000 3 USA 324666000 4 Indonesia 260581000 5 Brazil 207221000 6 Pakistan 196626000 7 Nigeria 186988000 8 Bangladesh 162099000 9 Russia 146838000 10 Japan 126830000
Source
lxml - XML and HTML with Python
In this article we have processed XML/HTML data in Python with lxml.
Author
List all Python tutorials.