Crawling a website with Python is easy, provided we use the libraries below 😉
- BeautifulSoup – A Python library that eases HTML parsing. It gives you an object through which accessing HTML tags takes just a few minutes.
- Requests – An HTTP library for Python.
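BeautifulSoup works on top of an HTML/XML parser of your choice: Python's built-in html.parser, lxml, or html5lib. If you don't name a parser, it picks the best one installed, preferring lxml. lxml itself supports XPath, although BeautifulSoup does not expose XPath queries. The supported parsers are listed in the BeautifulSoup documentation.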
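Here is a minimal sketch of how the two libraries fit together. The helper names `fetch_html` and `extract_links` and the sample HTML are made up for illustration, and the parser is set to `html.parser` only because it ships with Python; swap in `"lxml"` if you have it installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    # Return bytes rather than response.text so that BeautifulSoup
    # can do its own encoding detection.
    return response.content


def extract_links(html, parser="html.parser"):
    """Return the href of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(html, parser)
    return [a["href"] for a in soup.find_all("a", href=True)]


# Inline sample (an assumption for this sketch) so it runs without network access.
SAMPLE_HTML = b"""
<html><head><title>Demo</title></head>
<body>
  <a href="/about">About</a>
  <a href="https://example.com/contact">Contact</a>
  <a name="no-href">Skipped</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
print(links)
```

For a real page you would call `extract_links(fetch_html(url))` with a URL of your own.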
One important aspect of scraping web pages is detecting their encoding.
BeautifulSoup converts the HTML to Unicode by default.
It sometimes guesses the encoding incorrectly.
UnicodeDammit in BeautifulSoup 4 is said to handle encoding detection well. This needs to be explored more.
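As a starting point for that exploration, here is a small sketch of two ways to give BeautifulSoup a hint when you already suspect the encoding. The sample bytes and the choice of Latin-1 are assumptions made for the example; on real pages the right encoding depends on the site.

```python
from bs4 import BeautifulSoup
from bs4.dammit import UnicodeDammit

# Sample bytes encoded as Latin-1 (an assumption for this sketch).
raw = "<p>Café crème</p>".encode("latin-1")

# Option 1: use UnicodeDammit directly, passing the encodings to try first.
dammit = UnicodeDammit(raw, ["latin-1"])
print(dammit.original_encoding)  # the encoding UnicodeDammit settled on
print(dammit.unicode_markup)     # the markup decoded to a Unicode string

# Option 2: pass the hint straight to BeautifulSoup via from_encoding.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
text = soup.p.get_text()
print(text)
```

Without any hint, the result depends on what BeautifulSoup can sniff from the document and on whether a character detection library is installed, so it is worth checking `original_encoding` on pages that come out garbled.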
If you have any questions, feel free to ask in the comment box below. Happy to help!