Crawling website with Python

Crawling a website with Python is an easy task, provided we use the libraries below 😉

Libraries:

BeautifulSoup – A Python library that eases HTML parsing. It gives you an object through which accessing HTML tags takes just a few minutes.
Requests – An HTTP library for Python.
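
To show how the two libraries fit together, here is a minimal sketch of one crawling step: download a page with Requests, then collect its links with BeautifulSoup. The function names fetch_html and extract_links and the 10-second timeout are my own choices for illustration, not something either library prescribes.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its body as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_links(html, base_url):
    """Return the absolute URLs of every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    # urljoin turns relative links such as "/about" into absolute URLs.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
```

A crawl would call extract_links(fetch_html(start_url), start_url) and then repeat the same step for each link it has not visited yet, where start_url is whatever page you want to begin from.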

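BeautifulSoup can be plugged into the HTML/XML parser of your choice: Python's built-in html.parser, lxml, or html5lib. If you do not name one, it picks the best parser that is installed, preferring lxml. (lxml supports XPath queries of its own, but BeautifulSoup does not expose them.) The supported parsers are listed in the BeautifulSoup documentation.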
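
Naming the parser explicitly keeps the results the same from one machine to the next. The sketch below tries a preferred parser and falls back to the built-in one when it is not installed. The helper make_soup is a hypothetical name of mine; FeatureNotFound is the exception BeautifulSoup raises when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="lxml"):
    """Parse html with the preferred parser, else with html.parser."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard library one.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>Hello <b>world</b></p>")
print(soup.b.get_text())
```

Different parsers can build slightly different trees from broken markup, so it is worth sticking to one parser for a given crawl.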

Notes:
One important aspect of scraping webpages is detecting their encoding.
BeautifulSoup by default converts the HTML to Unicode.
It sometimes guesses the encoding wrong.
UnicodeDammit in BeautifulSoup 4 is said to handle encoding detection well. I still need to explore it further.
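
As a starting point for that exploration, here is a small sketch of UnicodeDammit used on raw bytes. It assumes you already have a shortlist of encodings the site is likely to use. The helper decode_html and its default candidate list are my own illustration, not part of BeautifulSoup.

```python
from bs4 import UnicodeDammit


def decode_html(raw_bytes, candidate_encodings=("utf-8", "latin-1")):
    """Decode raw page bytes, trying the candidate encodings in order.

    Returns the decoded text and the encoding UnicodeDammit settled on.
    """
    dammit = UnicodeDammit(raw_bytes, list(candidate_encodings))
    return dammit.unicode_markup, dammit.original_encoding


# Bytes as a server using Latin-1 might send them.
text, encoding = decode_html("Sacré bleu!".encode("latin-1"))
print(text, encoding)
```

With Requests, the raw bytes are available as response.content, so they can be passed to decode_html (or straight to BeautifulSoup) instead of relying on the guess behind response.text.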

If you have any questions, feel free to ask in the comment box below. Happy to help.

2 thoughts on “Crawling website with Python”

recess episodes says:

January 27, 2015 at 5:37 am

Your style is very unique compared to other folks I’ve read stuff
from. I appreciate you for posting when you’ve got the opportunity,
Guess I’ll just book mark this site.

1. Amit Onkare says:
  
  July 15, 2015 at 4:27 pm
  
  Thanks.
  
  I am glad that you liked the post.

Jobpedia Blog

Showcase your talent
