Web Scraping with Beautiful Soup and Requests
1. Introduction
#********* Source Code From Website - Mangadaku - visit us at -http://mangadaku.com/ ***** 
# Introduction - Before we start working with Beautiful Soup, let's install a few modules:
# 1. Install the beautifulsoup4 library - pip install beautifulsoup4
# 2. Install the lxml parser - pip install lxml
# 3. Install the html5lib parser - pip install html5lib
# 4. Install the requests library - pip install requests
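
Once the four installs above have run, a quick check like the following can confirm that Python is able to find each library. This is only a sketch: the REQUIRED mapping and the missing_packages helper are names made up for this illustration, and the check only looks for the modules without importing them.

```python
from importlib.util import find_spec

# pip name -> import name (beautifulsoup4 is imported as bs4).
# REQUIRED and missing_packages are hypothetical names for this sketch.
REQUIRED = {
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "html5lib": "html5lib",
    "requests": "requests",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Still to install: pip install " + " ".join(missing))
else:
    print("All four libraries are installed.")
```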
       
2. Program - Simple example of how to get information from a web page
# Program - Simple example of how to get information from a web page

# In this example we will work on the local simple.htm file.

from bs4 import BeautifulSoup
import requests

with open('simple.htm') as html_file:
	soup = BeautifulSoup(html_file,'lxml')

match = soup.title
print(match)
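
Since the contents of simple.htm are not shown here, the sketch below uses a small made-up HTML string (SAMPLE_HTML, an assumption) to show what soup.title gives back: the whole tag, while .text gives only the text inside it. It also assumes Python's built-in 'html.parser' instead of 'lxml', so it runs without the lxml install.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for simple.htm; the real file's contents are not shown.
SAMPLE_HTML = """
<html>
  <head><title>Test - A Sample Website</title></head>
  <body><h1>Hello</h1></body>
</html>
"""

# 'html.parser' ships with Python; the program above uses 'lxml' instead.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title_tag = soup.title        # the whole tag, markup included
title_text = soup.title.text  # only the text inside the tag

print(title_tag)   # <title>Test - A Sample Website</title>
print(title_text)  # Test - A Sample Website
```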
       
3. Program - Here we will try to work on div tags
# Program - Here we will try to work on div tags

# soup.div matches the first div tag in the web page, and we print that tag
# together with its contents

from bs4 import BeautifulSoup
import requests

with open('simple.htm') as html_file:
	soup = BeautifulSoup(html_file,'lxml')

match = soup.div
print(match)
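
To see the "first div only" behaviour on its own, here is a small sketch on a made-up page with two div tags (SAMPLE_HTML is an assumption, as is the use of the built-in 'html.parser'). It also shows that asking for a tag that does not exist gives None rather than an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real simple.htm is not shown in this tutorial.
SAMPLE_HTML = """
<body>
  <div class="article"><h2>First</h2></div>
  <div class="article"><h2>Second</h2></div>
</body>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_div = soup.div      # shorthand for soup.find('div')
print(first_div.h2.text)  # First

# A tag name that is not in the page returns None
print(soup.span)          # None
```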
    
4. Program - Finding the exact tag you need using the id or class name given in the tag
# Program - Finding the exact tag you need using the id or class name given in the tag,
# and then printing the contents within the tag.
# Here we will use the find method.
# IMPORTANT - Since the div tags in our simple web page don't have an id attribute,
# we will use the class attribute.
# When passing the class attribute to find, we must write it as class_.
# The reason is that class is a reserved keyword in Python, so Beautiful Soup
# accepts class_ instead; all other attributes are passed under their own names.

from bs4 import BeautifulSoup
import requests

with open('simple.htm') as html_file:
	soup = BeautifulSoup(html_file,'lxml')

# find returns the first matching tag, or None if nothing matches
match = soup.find('div',class_='article-2')
print(match)
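
The class_ versus id point can be tried out on a small made-up page. In this sketch the markup, the class names and the id are all assumptions, and the built-in 'html.parser' is used instead of 'lxml'. It also shows the attrs dictionary, which is another way Beautiful Soup accepts the class attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div with an id, one with only a class.
SAMPLE_HTML = """
<div id="intro" class="article"><p>Intro text</p></div>
<div class="footer"><p>Footer text</p></div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

by_class = soup.find("div", class_="footer")            # class needs the underscore
by_id = soup.find("div", id="intro")                    # id is passed as it is
by_attrs = soup.find("div", attrs={"class": "footer"})  # dictionary form
not_found = soup.find("div", class_="sidebar")          # no such class

print(by_class.p.text)  # Footer text
print(by_id.p.text)     # Intro text
print(by_attrs.p.text)  # Footer text
print(not_found)        # None
```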
       
5. Program - How to find the text within the tags
# Program - How to find the text within the tags

# i.e. when the text and links sit inside several nested levels of tags, how do we get the text?
# Chain the tag names (match.p.a) to walk down the hierarchy, then read .text

from bs4 import BeautifulSoup
import requests

with open('simple.htm') as html_file:
	soup = BeautifulSoup(html_file,'lxml')

match = soup.find('div',class_='article-2')
print(match.p.a.text)
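
Chained access such as match.p.a.text raises AttributeError as soon as one level is missing, because a missing tag comes back as None. The sketch below shows the chain on made-up markup, together with a small guard. SAMPLE_HTML and the link_text helper are hypothetical names for this illustration, and the built-in 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a link nested inside a paragraph inside a div.
SAMPLE_HTML = """
<div class="article"><p><a href="article_1.html">Read article 1</a> for details</p></div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
article = soup.find("div", class_="article")

link = article.p.a     # walk down: div -> p -> a
print(link.text)       # Read article 1
print(link["href"])    # article_1.html
print(article.p.text)  # Read article 1 for details


def link_text(tag):
    """Return tag.p.a.text, or None if any level of the chain is missing."""
    if tag is None or tag.p is None or tag.p.a is None:
        return None
    return tag.p.a.text


print(link_text(soup.find("div", class_="no-such-class")))  # None
```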
       
6. Program - find_all method
# Program - find_all method
# If multiple tags share the same attribute and you need the contents
# within each of them, use find_all; it returns a list of every matching tag

from bs4 import BeautifulSoup
import requests

with open('simple.htm') as html_file:
	soup = BeautifulSoup(html_file,'lxml')

for article in soup.find_all('div',class_='article'):
	print(article.p.a.text)
	print()
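
The same loop can be tried without simple.htm. The sketch below runs find_all on a made-up page (SAMPLE_HTML, the class names and the file names in href are assumptions; 'html.parser' is used instead of 'lxml') and collects the link text and address from every matching div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two article divs and one div of another class.
SAMPLE_HTML = """
<div class="article"><p><a href="a1.html">Article 1</a></p></div>
<div class="article"><p><a href="a2.html">Article 2</a></p></div>
<div class="footer"><p>No link here</p></div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every div with class "article", in document order
articles = soup.find_all("div", class_="article")
links = [(article.p.a.text, article.p.a["href"]) for article in articles]

print(len(articles))  # 2
print(links)          # [('Article 1', 'a1.html'), ('Article 2', 'a2.html')]

# When nothing matches, find_all gives an empty list, so a loop simply does nothing
print(soup.find_all("div", class_="sidebar"))  # []
```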
       