Web Scraping Using Python

DS - VRP
Jan 30, 2022


I know there are various sources to learn this; I have learned from several of them myself and am recording my learnings in this blog. You are welcome to suggest improvements.

As data scientists, we need to collect data from different sources. Data plays a vital role in analytics and data science: all data science activity revolves around gathering, analyzing, and interpreting data.

Web scraping is the process of extracting data from the web. The extracted data can be stored in different formats, such as CSV or XML. We will use the Python programming language for scraping, so we need some knowledge of Python syntax and the basics of HTML.
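As a quick illustration of the storage step, here is a minimal sketch using Python's built-in csv module. The rows and filename are made up for illustration; in a real scraper the rows would come from parsing HTML.

```python
import csv

# Hypothetical scraped rows; the first row is the header
rows = [
    ['headline', 'summary', 'video_link'],
    ['Python Tutorial', 'Intro to Python.', 'https://youtube.com/watch?v=abc'],
]

# newline='' prevents blank lines between rows on Windows
with open('scraped.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# Read the file back to confirm the round trip
with open('scraped.csv', newline='') as f:
    print(list(csv.reader(f)))
```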

Web scraping, also called web data mining or web harvesting, is the process of building an agent that can automatically extract, parse, download, and organize useful information from the web. This link lists popular websites that allow scraping.

Web scraping is a convenient way of gathering data, but it can cause problems for both the scraper and the website owner. For the scraper: even when a website allows scraping, the scraper may get blocked. We can follow this link to learn how to scrape a website without getting blocked. For the website: heavy scraping can overload its server.
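One basic courtesy before scraping is to check the site's robots.txt, which states what crawlers may fetch. Here is a minimal sketch using Python's standard-library urllib.robotparser; the robots.txt content below is hypothetical (a real scraper would fetch it from the site itself):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only
robots_txt = """User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether our scraper may fetch a given URL
print(parser.can_fetch('*', 'https://example.com/blog/'))      # True
print(parser.can_fetch('*', 'https://example.com/private/x'))  # False
```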

Well, I am not using a new example. I read a few blogs and watched one video to learn web scraping. I used this YouTube video by Corey Schafer, in which he demonstrates the technique on his own website and explains the process.

The code:

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')

csv_file = open('cms_scrape.csv', 'w')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['headline', 'summary', 'video_link'])

for article in soup.find_all('article'):
    # The headline is the link text inside each article's h2 tag
    headline = article.h2.a.text
    print(headline)

    # The summary is the first paragraph of the entry content
    summary = article.find('div', class_='entry-content').p.text
    print(summary)

    # Not every article embeds a video, so guard against missing tags
    try:
        vid_src = article.find('iframe', class_='youtube-player')['src']

        vid_id = vid_src.split('/')[4]
        vid_id = vid_id.split('?')[0]

        yt_link = f'https://youtube.com/watch?v={vid_id}'

    except Exception:
        yt_link = None

    print(yt_link)
    print()

    csv_writer.writerow([headline, summary, yt_link])

csv_file.close()

He used the bs4 and requests libraries for scraping; other methods can be found in the references.

We need to change the code as per our requirements: different websites have different structures, and we need to be careful with tags while scraping.
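As a sketch of an alternative to find()/find_all(), BeautifulSoup's select() method accepts CSS selectors, which can be easier to adapt when a site's structure changes. The HTML below is a made-up snippet that loosely mirrors the article layout used above:

```python
from bs4 import BeautifulSoup

# A small inline HTML snippet standing in for a real page
html = """
<article>
  <h2><a href="/post-1">First headline</a></h2>
  <div class="entry-content"><p>First summary.</p></div>
</article>
<article>
  <h2><a href="/post-2">Second headline</a></h2>
  <div class="entry-content"><p>Second summary.</p></div>
</article>
"""

soup = BeautifulSoup(html, 'html.parser')

# select() takes CSS selectors; select_one() returns the first match
for article in soup.select('article'):
    headline = article.select_one('h2 a').text
    summary = article.select_one('div.entry-content p').text
    print(headline, '-', summary)
```

Using CSS selectors keeps the tag path in one string, so adjusting to a new page layout usually means changing only the selector.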

Thank you. Please suggest any other resources that could be useful.

References

  1. https://www.youtube.com/watch?v=ng2o98k983k
  2. https://www.geeksforgeeks.org/python-web-scraping-tutorial/
  3. https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_quick_guide.htm
  4. https://www.datacamp.com/community/tutorials/web-scraping-using-python
