It lets us parse the DOM and extract the data we want.
In this article, we'll look at how to scrape HTML documents with Beautiful Soup.
Comparing Objects for Equality
We can compare objects for equality.
For example, we can write:
from bs4 import BeautifulSoup
markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
first_b, second_b = soup.find_all('b')
print(first_b == second_b)
print(first_b.previous_element == second_b.previous_element)
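The first print call prints True, since the two b elements have the same structure and content, even though they are different objects at different positions in the tree.
The second print call prints False, because the element before each b element is different: the string "I want " precedes the first one and " and more " precedes the second.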
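Equality here means "represents the same markup", not "is the same object". As a minimal sketch reusing the markup above (the variable names same_markup and same_object are only illustrative), we can contrast == with Python's is operator:

```python
from bs4 import BeautifulSoup

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, "html.parser")
first_b, second_b = soup.find_all("b")

# == compares the markup the two tags represent
same_markup = first_b == second_b
# is checks whether both names refer to the very same object
same_object = first_b is second_b

print(same_markup)  # True
print(same_object)  # False
```

So when we need to know whether two variables point at the same node in the tree, is should be the safer check.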
Copying Beautiful Soup Objects
We can copy Beautiful Soup objects.
We can use the copy module from the standard library to do this:
from bs4 import BeautifulSoup
import copy
markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
p_copy = copy.copy(soup.p)
print(p_copy)
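The copy is considered to be equal to the original, since it represents the same markup, but it is a separate object that is detached from the original tree.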
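To make the relationship between a copy and its original concrete, here is a small sketch based on the same markup. The names equal, identical, and detached are only illustrative:

```python
import copy

from bs4 import BeautifulSoup

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, "html.parser")
p_copy = copy.copy(soup.p)

# The copy represents the same markup as the original
equal = p_copy == soup.p
# but it is a different object
identical = p_copy is soup.p
# and it has no parent, because it is not part of the original tree
detached = p_copy.parent is None

print(equal, identical, detached)  # True False True
```

Because the copy has no parent, changing it should leave the original soup untouched.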
Parsing Only Part of a Document
To do this, we create a SoupStrainer object that describes the elements we want.
For example, we can write:
from bs4 import BeautifulSoup, SoupStrainer
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
only_a_tags = SoupStrainer("a")
only_tags_with_id_link2 = SoupStrainer(id="link2")
def is_short_string(string):
    return string is not None and len(string) < 10
only_short_strings = SoupStrainer(string=is_short_string)
print(only_a_tags)
print(only_tags_with_id_link2)
print(only_short_strings)
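A SoupStrainer describes which parts of the document we want to keep.
We can select by tag name, by an attribute such as id, or by passing in a function that does the selection.
The strainer only takes effect when we pass it to the BeautifulSoup constructor as the parse_only argument; the code above only creates and prints the strainers.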
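Then we see output like the following. The exact representation depends on the Beautiful Soup and Python versions, and the u prefix here comes from Python 2:
a|{}
None|{'id': u'link2'}
None|{'string': <function is_short_string at 0x00000000036FC908>}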
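To actually parse only part of a document, the strainer is passed to the BeautifulSoup constructor through the parse_only argument. The following is a minimal sketch that assumes a shortened version of the document above; the names links_soup, link2_soup, and hrefs are only illustrative:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened version of the sample document (an assumption for brevity)
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

# Keep only the a elements while parsing
only_a_tags = SoupStrainer("a")
links_soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)
hrefs = [a["href"] for a in links_soup.find_all("a")]
print(hrefs)

# Keep only the element whose id is link2
only_link2 = SoupStrainer(id="link2")
link2_soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_link2)
print(link2_soup.a.get_text())  # Lacie
```

Everything that the strainer rejects is skipped, so links_soup contains no p or title elements. Note that parse_only is not supported by the html5lib parser, so this sketch sticks with html.parser.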
Conclusion
We can parse part of a document, compare parsed objects for equality, and copy objects with Beautiful Soup.