Pyquery
last modified January 29, 2024
Pyquery tutorial shows how to make jquery queries on XML documents in Python.
jQuery is a JavaScript library which is used to manipulate DOM. With jQuery, we can find, select, traverse, and manipulate parts of an HTML document.
Pyquery
Pyquery is a Python library which has similar API to jQuery. It uses
lxml module for fast XML and HTML manipulation. The API is as much
as possible similar to jQuery.
Installing pyquery
Pyquery is installed with the following command:
$ sudo pip3 install pyquery
We use the pip3 command to install pyquery module.
The HTML file
In the examples, we will use the following HTML file:
<!DOCTYPE html>
<html>
<head>
<title>Header</title>
<meta charset="utf-8">
</head>
<body>
<h2>Operating systems</h2>
<ul id="mylist" style="width:150px">
<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
<li>Windows</li>
</ul>
</body>
</html>
Simple example
In the first example, we use pyquery module to get the text of a header.
#!/usr/bin/python
from pyquery import PyQuery as pq
with open("index.html", "r") as f:
contents = f.read()
doc = pq(contents)
text = doc("h2").text()
print(text)
The code example prints the text of the h2 element.
from pyquery import PyQuery as pq
We import the PyQuery class from the pyquery
module. The PyQuery is the main class for doing work.
with open("index.html", "r") as f:
contents = f.read()
We open the index.html file and read its contents
with the read method.
doc = pq(contents)
A PyQuery object is created; the HTML data is passed to the
constructor.
text = doc("h2").text()
We select the h2 tag and get its text with the text
method.
$ ./header.py Operating systems
The text and html methods
The text method retrieves the text of an element while
the html method retrieves the HTML data of the element.
#!/usr/bin/python
from pyquery import PyQuery as pq
with open("index.html", "r") as f:
contents = f.read()
doc = pq(contents)
text = doc("ul").text()
print("\n".join(text.split()))
text = doc("ul").html()
print("\n".join(text.split()))
We get the text data and the HTML data of the ul element.
$ ./get_list.py Solaris FreeBSD Debian NetBSD Windows <li>Solaris</li> <li>FreeBSD</li> <li>Debian</li> <li>NetBSD</li> <li>Windows</li>
Attributes
Element attributes can be retrieved with the attr method.
#!/usr/bin/python
from pyquery import PyQuery as pq
with open("index.html", "r") as f:
contents = f.read()
doc = pq(contents)
tag = doc("ul")
print(tag.attr("id"))
print(tag.attr("style"))
In the code example, we retrieve and print two attributes of the ul
element: id and style.
$ ./attributes.py mylist width:150px
Web scraping
Requests is a simple Python HTTP library. It provides methods for accessing Web resources via HTTP.
#!/usr/bin/python
from pyquery import PyQuery as pq
import requests as req
resp = req.get("http://www.webcode.me")
doc = pq(resp.text)
title = doc('title').text()
print(title)
The example retrieves the title of a simple web page.
resp = req.get("http://www.webcode.me")
doc = pq(resp.text)
We get the HTML data of the page.
title = doc('title').text()
print(title)
We retrieve its title.
$ ./scraping.py My html page
Selecting tags
The selectors are used to select elements in an HTML document that meet certain criteria. The criteria can be their name, id, class name, attributes or a combination of them.
#!/usr/bin/python
from pyquery import PyQuery as pq
def print_item(self, item):
print("Tag: {0}, Text: {1}".format(item.tag, item.text))
with open("index.html", "r") as f:
contents = f.read()
doc = pq(contents)
first_li = doc("li:first")
print(first_li.text())
last_li = doc("li:last")
print(last_li.text())
odd_lis = doc("li:odd")
odd_lis.each(print_item)
The example selects various li tags from the HTML document.
def print_item(self, item):
print("Tag: {0}, Text: {1}".format(item.tag, item.text))
In this function, we print the tag name and its text.
first_li = doc("li:first")
print(first_li.text())
We select the first li tag and print its content with
the text method.
last_li = doc("li:last")
print(last_li.text())
Here we get the last li tag.
odd_lis = doc("li:odd")
odd_lis.each(print_item)
With the help of the each method, we print the tag and its content
of the every odd li element.
$ ./selecting.py Solaris Windows Tag: li, Text: FreeBSD Tag: li, Text: NetBSD
Removing elements
The remove method deletes a tag.
#!/usr/bin/python
from pyquery import PyQuery as pq
with open("index.html", "r") as f:
contents = f.read()
doc = pq(contents)
removed_item = doc('li:last').remove()
print(removed_item)
print(doc)
In the example, we remove the last li tag.
removed_item = doc('li:last').remove()
We select the last li tag and remove it with remove.
The removed element is returned.
print(removed_item) print(doc)
We print the deleted item and the document, which has the element removed.
$ ./removing.py
<li>Windows</li>
<html>
<head>
<title>Header</title>
<meta charset="utf-8"/>
</head>
<body>
<h2>Operating systems</h2>
<ul id="mylist" style="width:150px">
<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
</ul>
</body>
</html>
The items method
The items method allows to iterate over elements.
#!/usr/bin/python
from pyquery import PyQuery as pq
with open("index.html", "r") as f:
contents = f.read()
doc = pq(contents)
items = [item.text() for item in doc.items('li')]
print(items)
The example iterates over the li elements of the
document.
items = [item.text() for item in doc.items('li')]
The items method is used to create a Python list of
li elements in a list comprehension.
$ ./iterate.py ['Solaris', 'FreeBSD', 'Debian', 'NetBSD', 'Windows']
Appending and prepending elements
The append method adds an element at the end
of a node and the prepend method inserts the
element at the beginning of a node.
#!/usr/bin/python
from pyquery import PyQuery as pq
with open("index.html", "r") as f:
contents = f.read()
doc = pq(contents)
mylist = doc("#mylist")
mylist.prepend("<li>DragonFly</li>")
mylist.append("<li>OpenBSD</li>")
print(mylist)
The code example inserts two li elements with
the prepend and append methods.
The filter method
The filter method is used to filter elements.
#!/usr/bin/python
from pyquery import PyQuery as pq
with open("index.html", "r") as f:
contents = f.read()
doc = pq(contents)
filtered = doc('li').filter(lambda i: pq(this).text().startswith(('F', 'D', 'N')))
print(filtered.text())
The example displays operating systems that start with F, D, or N. We use a filter
method and an anonymous function.
$ ./filtering.py FreeBSD Debian NetBSD
Source
In this article we have worked with the Python pyquery library.
Author
List all Python tutorials.