Crawling a website with Python


Crawling a website with Python is an easy task, provided we use the libraries below 😉

Libraries:

  • BeautifulSoup – A Python library that eases HTML parsing. It gives you an object through which accessing HTML tags is a matter of a few minutes.
  • Requests – An HTTP library for Python.
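
To show how the two libraries fit together, here is a minimal sketch, not a full crawler. The function names `extract_links` and `fetch_links` are made up for this illustration, and it uses Python's built-in "html.parser" so nothing beyond the two libraries needs to be installed.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # urljoin turns relative links such as "/about" into absolute URLs
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def fetch_links(url):
    """Download one page with Requests and return the links found on it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return extract_links(response.text, url)
```

Calling `fetch_links` on a start URL and then on each returned link (while remembering which URLs were already visited) is the basic loop of a crawler.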

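BeautifulSoup can be plugged into any of the HTML/XML parsers it supports. If you do not name one, it picks the best parser installed: lxml when available, otherwise a fallback such as Python's built-in html.parser. lxml supports XPath, although BeautifulSoup itself does not expose it. The supported parsers are listed in the BeautifulSoup documentation.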
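
The parser is chosen by the second argument to `BeautifulSoup`. The sketch below asks for lxml and falls back to the built-in parser if lxml is not installed; the helper name `make_soup` is an assumption for this example, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="lxml"):
    """Parse html with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<html><body><h1 id='title'>Hello</h1></body></html>")
print(soup.h1.get_text())
```

Naming the parser explicitly is usually worth it, because different parsers can build different trees from the same broken HTML.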

Notes:
One important aspect of scraping web pages is detecting their encoding.
BeautifulSoup converts the HTML to Unicode by default.
It sometimes misdetects the encoding.
UnicodeDammit in BeautifulSoup 4 is said to handle encoding detection well. This needs to be explored more.
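
As a starting point for that exploration, here is a small sketch of using UnicodeDammit directly on raw bytes. The helper name `decode_markup` and the candidate encodings are assumptions chosen for illustration; the candidates are tried in the order given.

```python
from bs4 import UnicodeDammit


def decode_markup(raw_bytes, candidates):
    """Decode raw bytes to text, trying the candidate encodings first.

    Returns the decoded text and the encoding UnicodeDammit settled on.
    """
    dammit = UnicodeDammit(raw_bytes, candidates)
    return dammit.unicode_markup, dammit.original_encoding


# Bytes encoded as Latin-1; they are not valid UTF-8, so the first guess fails
raw = "café".encode("latin-1")
text, encoding = decode_markup(raw, ["utf-8", "latin-1"])
print(text, encoding)
```

When fetching pages with Requests, the raw bytes are available as `response.content`, which can be passed to a helper like this instead of relying on `response.text`.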

If you have any questions, feel free to ask in the comment box below. Happy to help :)

 
