Web Scraping: Pythonic Techniques with Beautiful Soup

Aastha Thakker
Oct 28, 2025

Hey everyone!

If you’re reading this, you’re probably serious about learning more about the cybersecurity field and keeping up with what’s new. Let’s say you’re researching specific topics across different websites and want to document your findings. Instead of manually copying and pasting and scrolling through long pages, what if you could automate the search and extract the relevant data? This is where web scraping comes in as a savior: you can automate how you collect or extract data from a website.

What is Web Scraping?

Web scraping is a technique for automating the collection and parsing of raw data from websites. It is also known as web harvesting or web data extraction. Web scraping needs two main components: a crawler and a scraper. The crawler is a program that browses the web over HTTP, following links to find the pages that hold the data you want. The scraper is the tool that extracts the data from those pages.
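
To make the crawler and scraper split a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the in-memory FAKE_SITE dictionary stands in for real HTTP requests, the example.test URLs are made up, and crawl and scrape_title are hypothetical helper names, not part of any library.

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical in-memory "website": URL -> HTML.
# A real crawler would fetch these pages over HTTP instead.
FAKE_SITE = {
    "http://example.test/": '<title>Home</title><a href="about.html">About</a> <a href="/blog/">Blog</a>',
    "http://example.test/about.html": '<title>About</title><a href="/">Home</a>',
    "http://example.test/blog/": '<title>Blog</title><a href="../about.html">About</a>',
}


def scrape_title(html):
    """Scraper: extract one piece of data (the page title) from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text() if soup.title else None


def crawl(start_url, fetch):
    """Crawler: visit pages breadth-first by following links, scraping each one."""
    seen = {start_url}
    queue = deque([start_url])
    results = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # unknown page, skip it
            continue
        results[url] = scrape_title(html)
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


titles = crawl("http://example.test/", FAKE_SITE.get)
print(titles)
```

The crawler only decides which pages to visit, and the scraper only decides what to pull out of each page, so either part can be swapped without touching the other.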

Types of web scrapers based on building method:

a. Pre-built: These are ready-made tools that are easy to download and run. Many also come with advanced options that you can customize.

b. Self-built: These require in-depth programming knowledge. The more customization you want, the more knowledge you need.

Uses of web scraping:

Lead Generation: Businesses scrape social media platforms and professional networking sites to gather contact information of potential clients or leads, facilitating targeted marketing campaigns.
Academic Research: Web scraping can be used to collect data from academic journals and databases for studies and trend analysis.
Healthcare Data Collection: Web scrapers can be used in the medical industry to gather data on diseases, treatments, clinical trials, and patient outcomes, which can then be used for research, diagnosis, and treatment planning.
Government Transparency: Activists and journalists can use this technique to access and analyze publicly available government data, such as public spending records, crime statistics or criminal records. This can help with information gathering for various types of cases.
Social Media Monitoring: To monitor customer demand, understand their sentiments and trends, brands and marketers scrape social media platforms.

Which language is most commonly used for web scraping, and why?

Python is one of the most loved and popular languages for many tasks, and web scraping is one of them. Why? Because of its simple syntax, its vast range of libraries, its integration capabilities and, of course, how easy it is to learn.

Now, let’s get some hands-on experience scraping a website. For that we will use the BeautifulSoup library, which pulls data out of HTML and XML files.

To install it, open VS Code and install the following modules from the terminal:

pip3 install requests
pip3 install html5lib
pip3 install bs4

A) Getting the raw HTML content from the website

import requests
from bs4 import BeautifulSoup
url = "http://testphp.vulnweb.com/"
# Step 1: Get the raw data from HTML page.
r = requests.get(url)
htmlcontent = r.content
print(htmlcontent)

B) Parsing the HTML data with BeautifulSoup and using the prettify() method to beautify the output.

import requests
from bs4 import BeautifulSoup
url = "http://testphp.vulnweb.com/"
# Step 1: Get the raw HTML, as in A)
htmlcontent = requests.get(url).content
# Step 2: Parse the HTML
soup = BeautifulSoup(htmlcontent, "html.parser")
print(soup.prettify())

C) Types of Objects:

Tag: Represents a specific element of the HTML document, like a headline or paragraph.
NavigableString: Represents the text within a tag.
BeautifulSoup: Represents the entire parsed HTML document.
Comment: A special type of NavigableString that represents a comment within the HTML.

title = soup.title
# Types of objects:
# 1. Tag
print("Title:", type(title))
# 2. NavigableString
print("NavigableString:", type(title.string))
# 3. BeautifulSoup
print("Soup:", type(soup))
# 4. Comment
markup = "<p><!-- This is a comment --></p>"
soup2 = BeautifulSoup(markup, "html.parser")
print("Printing p:", soup2.p)
print("Printing the string inside p:", soup2.p.string)
print("Printing type of content in p:", type(soup2.p.string))

D) Getting the title of HTML page
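
A minimal sketch of this step. To keep it runnable without network access, it parses a small inline HTML string, which is my assumption and not the test site’s real markup. With the soup object built in step B, the same two attribute lookups apply.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page fetched in step A.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.title      # the whole <title> Tag
print(title)            # <title>Demo page</title>
print(title.string)     # Demo page
```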

E) Getting elements (paragraph or anchor tags) from the HTML page

paras = soup.find_all('p')
print(paras)

F) Printing the URL (href attribute) of each anchor. Note that this prints the href values exactly as they appear in the page, so relative links are not turned into complete, navigable URLs.
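
Here is one way this step might look. The inline HTML is a made-up sample (an assumption so the sketch runs offline), and the hrefs list is just a convenience for checking the result.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page has many more anchors.
html = """
<a href="index.php">home</a>
<a href="/cart.php">cart</a>
<a name="top">no href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for anchor in soup.find_all("a"):
    href = anchor.get("href")  # None when the attribute is missing
    if href:
        hrefs.append(href)
        print(href)
```

Using anchor.get("href") instead of anchor["href"] avoids a KeyError on anchors that have no href attribute.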

G) Getting navigable links as output. Each href is combined with the base URL http://testphp.vulnweb.com/ to build a complete URL, which is then printed.
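
A possible sketch of this step. It uses urljoin from the standard library to combine the base URL with each href, which is my choice and not necessarily how the original script did it (plain string concatenation also works for simple relative links). The inline HTML is a hypothetical sample.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://testphp.vulnweb.com/"
# Hypothetical sample markup with relative, root-relative and absolute links.
html = (
    '<a href="index.php">home</a>'
    '<a href="/cart.php">cart</a>'
    '<a href="http://example.com/x">external</a>'
)
soup = BeautifulSoup(html, "html.parser")

# A set drops duplicate links; urljoin leaves absolute URLs untouched.
links = {urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)}
for link in sorted(links):
    print(link)
```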

H) Finding the first matching tag in the HTML content, here the first ‘p’ tag.

print(soup.find('p'))

I) Getting the text from the specified tag, here it is ‘title’.

print(soup.find('title').get_text())

J) This will extract all the text from the HTML content, excluding the tags or attributes.

print(soup.get_text())
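
The raw output of get_text() can run words together or keep stray whitespace. As a small sketch on a made-up HTML string (an assumption for illustration), the separator and strip arguments of get_text() help tidy it up:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = "<div><h1> Welcome </h1><p>First paragraph.</p><p>Second paragraph.</p></div>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.get_text()                             # text pieces joined as-is
tidy = soup.get_text(separator=" ", strip=True)   # each piece stripped, joined by a space

print(repr(raw))
print(repr(tidy))
```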

Disadvantages of web scraping

1. Legal Issues: Scraping data without permission may violate a website’s terms of service, which raises ethical questions and can lead to legal consequences.

2. Data Quality: Scraping does not guarantee clean, normalized data. The results may contain inaccurate, incomplete or outdated data, which can lead to unreliable analysis.
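
As a rough illustration of the kind of cleanup scraped data often needs, here is a small sketch using only the standard library. The clean_records helper and the sample values are hypothetical, and real projects usually need more checks than this.

```python
import re


def clean_records(raw_values):
    """Normalize whitespace, drop empty values, and remove case-insensitive duplicates."""
    cleaned = []
    seen = set()
    for value in raw_values:
        text = re.sub(r"\s+", " ", value).strip()  # collapse runs of whitespace
        key = text.casefold()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


# Hypothetical scraped values with stray whitespace, blanks and a duplicate.
scraped = ["  Acme  Corp\n", "acme corp", "", "Globex\u00a0Inc", "   "]
result = clean_records(scraped)
print(result)
```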

3. IP Blocking: Websites may detect and block scraping activities. This can disrupt our process of analysis.

4. Maintenance Overhead: Scraping scripts need regular updates to keep up with changes in website structure or new anti-scraping measures.
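
One way to soften this is to write the scraper so that a layout change shows up as a missing value instead of a crash. The sketch below is only an illustration: extract_price, the "price" class name and both HTML snippets are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_price(html):
    """Return the price text, or None if the expected element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("span", class_="price")  # hypothetical selector
    if tag is None:  # layout changed or element removed
        return None
    return tag.get_text(strip=True)


old_layout = '<div><span class="price"> $10 </span></div>'
new_layout = '<div><span class="cost"> $10 </span></div>'  # class was renamed

print(extract_price(old_layout))
print(extract_price(new_layout))
```

Checking for None lets the script log or flag the page for review when the site changes, which makes the breakage easier to spot and fix.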

5. Data Privacy Concerns: Scrapers may collect personal data from websites, intentionally or unintentionally. This raises privacy concerns and can create regulatory compliance challenges.

So, this was all about web scraping: its uses, code and disadvantages. It has very real-life use cases, but it’s important to respect ethical and legal boundaries. Python is a well-suited language for this work, and libraries like BeautifulSoup and Scrapy can be used for this purpose.

Got questions or cool project ideas about this topic? Slide into my LinkedIn DMs and share your experiences & opinions!

See you next Thursday!!