
Parsing HTML with BeautifulSoup

Raw HTML is just text. BeautifulSoup turns it into a tree structure you can navigate and search.

from bs4 import BeautifulSoup

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

Now soup is a searchable object. You can find elements by tag name:

heading = soup.find("h1")
print(heading.text)  # Hello

The .text property returns the text of the element and of everything nested inside it, with the HTML tags stripped away. This is usually what you want: the actual data, not the markup.
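To see how searching and text extraction work on a slightly larger page, here is a minimal sketch. The markup, the "item" class, and the variable names are made up for illustration; find_all, get_text, and .parent are standard BeautifulSoup calls.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from a real page
html = """
<html><body>
  <h1>Fruit <em>prices</em></h1>
  <ul>
    <li class="item">Apple</li>
    <li class="item">Banana</li>
  </ul>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# .text also collects the text of nested tags such as <em>
heading_text = soup.find("h1").text

# find_all returns every matching element, not just the first one;
# get_text(strip=True) trims surrounding whitespace
items = [li.get_text(strip=True) for li in soup.find_all("li", class_="item")]

# Every element knows its place in the tree, so you can walk upward too
parent_name = soup.find("li").parent.name

print(heading_text)  # Fruit prices
print(items)         # ['Apple', 'Banana']
print(parent_name)   # ul
```

Use find when you expect a single element and find_all when you want a list to loop over.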

BeautifulSoup handles messy, broken HTML gracefully. Real-world web pages often have unclosed or misnested tags, and BeautifulSoup still builds a usable tree from them instead of raising an error.
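As a small illustration, the sketch below feeds BeautifulSoup a fragment in which neither tag is ever closed. The fragment is invented for this example. Keep in mind that different parsers (html.parser, lxml, html5lib) may repair the same broken markup in different ways, so results on trickier input can vary.

```python
from bs4 import BeautifulSoup

# Invented fragment: <p> and <b> are opened but never closed
broken = "<p>Unclosed paragraph <b>bold text"
soup = BeautifulSoup(broken, "html.parser")

# The open tags are closed for us when parsing finishes,
# so the usual search methods still work
bold_text = soup.find("b").get_text()
para_text = soup.find("p").get_text()

print(bold_text)  # bold text
print(para_text)  # Unclosed paragraph bold text
```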

I teach BeautifulSoup from basics to advanced in my Web Scraping course.