This post is about fetching and crawling HTML pages using requests and BeautifulSoup.

I came across an interesting forum and I was looking for posts with some keywords. The forum did have a search form but didn’t support any kind of regex. I think it uses some kind of OR’ing logic for the search keywords and returns any post that has any of the words. Lucky for me, it was built in the early 00’s using PHP. So, the approach was:

  • Format the search URL based on the given keywords (a quick sketch of this below)
  • Fetch the search results, then crawl and fetch the hit pages
  • Search them for the given patterns (including regex)
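The first step never shows up in the snippets below, so here’s a minimal sketch of it. The parameter name q and the helper name build_search_url are my own placeholders; the real form may use different parameters entirely.

from urllib.parse import urlencode

def build_search_url(keywords):
    # join the keywords into one query string; the forum ORs them anyway
    query = urlencode({'q': ' '.join(keywords)})
    return 'https://www.forum.com/search?' + query

print(build_search_url(['foo', 'bar']))
# https://www.forum.com/search?q=foo+bar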

Getting and parsing pages

requests is a very easy way to get raw HTML pages:

url = "www.forum.com/search?q=something"
import requests

r = requests.get(url)
print(r.content)
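One thing worth adding: requests doesn’t raise on a 404 by itself, so it’s a good idea to check the response before parsing anything:

r = requests.get(url)
r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses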

Now we have the HTML in r.content. BeautifulSoup is used to parse the HTML into something I can traverse with the find and findAll methods. Below is an example of parsing a list with CSS class search1 and getting all the links.

from bs4 import BeautifulSoup
import requests

# new_page is the search results URL built earlier
r = requests.get(new_page)

soup = BeautifulSoup(r.content, 'html5lib')

# grab the <li class="search1"> block, then every result link inside it
results = soup.find('li', attrs={'class': 'search1'})
threads = results.findAll('a', attrs={'title': 'View This Message'})
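(html5lib isn’t bundled with BeautifulSoup, so it needs its own pip install html5lib; the built-in 'html.parser' works too, html5lib is just more forgiving of the kind of markup an early-00’s PHP forum produces.)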

At this point, I have links to the posts (in the href attributes), so I can fetch the individual post pages, parse them and extract the raw text. Well, the next step is just re.search and I am done.

for thread in threads:

    r = requests.get(thread['href'])
    soup = BeautifulSoup(r.content, 'html5lib')

    # the post text is in the first <div> of each am_body_left block
    posts = [p.findAll('div')[0].text
             for p in soup.findAll('div', attrs={'class': 'am_body_left'})]
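The grep itself would sit inside that loop, right after posts is built. A minimal sketch, where the pattern is just an example I made up:

import re

pattern = re.compile(r'keyword1|keyword2')  # whatever regex I’m grepping for

for post in posts:
    match = pattern.search(post)
    if match:
        # print where the hit came from and what matched
        print(thread['href'], '->', match.group(0))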

Useful libraries

  • requests_cache to cache requests. Very useful as this website is static.
  • argparse to parse command line options and provide grep-like features.
  • logging to control verbosity.
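Here’s a rough sketch of wiring these three together; the cache name and CLI flags are placeholders I picked for illustration:

import argparse
import logging

import requests_cache

# transparently cache every requests call in a local sqlite file,
# so re-runs don’t hammer the (static) forum
requests_cache.install_cache('forum_cache')

parser = argparse.ArgumentParser(description='grep an old PHP forum')
parser.add_argument('pattern', help='regex to search posts for')
parser.add_argument('keywords', nargs='+', help='keywords for the search form')
parser.add_argument('-v', '--verbose', action='store_true', help='chatty output')
args = parser.parse_args()

logging.basicConfig(level=logging.DEBUG if args.verbose else logging.WARNING)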