Crawling a website with Python is easy, provided we use the libraries below 😉
- BeautifulSoup – a Python library that eases HTML parsing. It gives you an object through which accessing HTML tags takes just a few minutes.
- Requests – an HTTP library for Python.
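Here is a minimal sketch of how the two libraries fit together: Requests downloads the page and BeautifulSoup pulls the links out of it. The function names (`extract_links`, `crawl_page`), the sample markup and the `example.com` URLs are made up for illustration, so treat this as one possible starting point rather than the only way to do it.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> found in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative links against the page they came from.
        links.append(urljoin(base_url, anchor["href"]))
    return links


def crawl_page(url, timeout=10):
    """Download one page with Requests and return the links on it."""
    # Imported here so extract_links can be tried without Requests installed.
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return extract_links(response.text, url)


# Hypothetical sample page, so the parsing step can be tried offline.
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="post-1.html">First post</a>
  <a name="top">No href here</a>
</body></html>
"""

sample_links = extract_links(SAMPLE_HTML, "https://example.com/blog/")
print(sample_links)
```

To crawl a real page, call `crawl_page("https://...")` with a URL you are allowed to fetch, then feed the returned links back into the same function to go one level deeper.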
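BeautifulSoup can be plugged into several HTML/XML parsers. If you do not name one, it picks the best parser installed: lxml if available, then html5lib, then Python's built-in html.parser. lxml itself supports XPath, but BeautifulSoup does not expose it. The supported parsers are listed in the BeautifulSoup documentation.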
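To make the parser choice explicit, you can pass the parser name as the second argument. The sketch below tries lxml first and falls back to the built-in html.parser when lxml is not installed. `make_soup` and its `preferred` argument are hypothetical names I picked for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # BeautifulSoup raises this when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, parser_used = make_soup(
    "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
)
print(parser_used, soup.title.string)
```

Naming the parser also keeps results consistent across machines, since different parsers can build slightly different trees from broken markup.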
One important aspect of scraping web pages is detecting their encoding.
BeautifulSoup by default converts the HTML to Unicode.
It does sometimes misdetect the encoding.
UnicodeDammit in BeautifulSoup 4 is said to handle encoding detection well. This needs to be explored more.
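As a first step in that exploration, here is a small sketch of UnicodeDammit used on its own. It takes raw bytes plus a list of encodings to try first and returns the decoded text. The helper name `decode_html` and the sample bytes are my own assumptions, and I have not verified how well the guessing works on real-world pages.

```python
from bs4 import UnicodeDammit


def decode_html(raw_bytes, candidate_encodings=()):
    """Decode raw bytes to text, trying the candidate encodings first."""
    dammit = UnicodeDammit(raw_bytes, list(candidate_encodings))
    # unicode_markup is the decoded text; original_encoding is the guess used.
    return dammit.unicode_markup, dammit.original_encoding


# Sample bytes encoded as Latin-1, which would not decode as UTF-8.
raw = "Sacré bleu!".encode("latin-1")
text, detected = decode_html(raw, ["latin-1"])
print(text, detected)
```

If you already know the encoding of a page, you can also pass it straight to BeautifulSoup with the `from_encoding` argument and skip the guessing.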
If you have any questions, feel free to ask in the comment box below. Happy to help.