WebpageCrawler Class Documentation
Overview
The WebpageCrawler class provides tools for crawling a given webpage and recursively exploring its links to fetch all child pages under the same base URL. It validates the accessibility of links and categorizes them as either webpages or non-webpage resources based on their MIME type.
Class Reference
WebpageCrawler
Attributes
dict_href_links (dict): Stores and tracks discovered links during the crawl.
Methods
__init__()
Initializes a new instance of the WebpageCrawler class.
Usage:
crawler = WebpageCrawler()
async fetch(session: ClientSession, url: str) -> str
Asynchronously fetches the HTML content of the specified URL.
Parameters:
- session (ClientSession): The session used to make HTTP requests.
- url (str): The URL to fetch.
Returns:
- (str): The HTML content of the page.
Raises:
- aiohttp.ClientError: If the HTTP request fails.
Usage:
try:
    html_content = await crawler.fetch(session, url)
except aiohttp.ClientError as e:
    print(f"Failed to fetch {url}: {e}")
url_exists(url: str) -> bool
Checks if a given URL is accessible by performing a HEAD request.
Parameters:
- url (str): The URL to check.
Returns:
- (bool):
True
if the URL is accessible (status code 200), otherwiseFalse
.
Usage:
if crawler.url_exists(url):
    print("URL exists")
else:
    print("URL does not exist")
async get_links(session: ClientSession, website_link: str, base_url: str) -> List[str]
Extracts and normalizes valid links from the HTML content of a webpage.
Parameters:
- session (ClientSession): The session used for making HTTP requests.
- website_link (str): The URL of the webpage to extract links from.
- base_url (str): The base URL to filter out external links.
Returns:
- (List[str]): A list of normalized URLs that are valid and fall under the base URL.
Usage:
links = await crawler.get_links(session, website_link, base_url)
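The kind of extraction and normalization get_links performs can be pictured with BeautifulSoup and urljoin; the sketch below illustrates that processing under assumed tooling, not the method's actual implementation:
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_child_links(html: str, page_url: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs against the page URL, then keep only
        # links that stay under the base URL.
        absolute = urljoin(page_url, anchor["href"])
        if absolute.startswith(base_url):
            links.append(absolute)
    return links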
async get_subpage_links(session: ClientSession, urls: List[str], base_url: str) -> List[str]
Asynchronously gathers links from multiple webpages.
Parameters:
- session (ClientSession): The session used for making HTTP requests.
- urls (List[str]): A list of URLs to fetch links from.
- base_url (str): The base URL to filter out external links.
Returns:
- (List[str]): A combined list of all child URLs discovered from the provided list of URLs.
Usage:
all_links = await crawler.get_subpage_links(session, urls, base_url)
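Conceptually, gathering links from many pages at once follows the asyncio.gather pattern sketched below; this illustrates the concurrency model rather than the method's internals, and the crawl_level helper is hypothetical:
import asyncio
from aiohttp import ClientSession

async def crawl_level(crawler: WebpageCrawler, urls: list[str], base_url: str) -> list[str]:
    async with ClientSession() as session:
        # Fetch each page's link list concurrently.
        results = await asyncio.gather(
            *(crawler.get_links(session, url, base_url) for url in urls)
        )
    # Flatten the per-page lists into a single list of child URLs.
    return [link for page_links in results for link in page_links]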
async get_all_pages(url: str, base_url: str) -> List[str]
Recursively crawls a website to gather all valid URLs under the same base URL.
Parameters:
url
(str): The initial URL to start crawling from.base_url
(str): The base URL to filter out external links.
Returns:
- (List[str]): A complete list of all URLs discovered under the base URL.
Usage:
all_pages = await crawler.get_all_pages(url, base_url)
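A complete, minimal crawl might look like the sketch below; the site URL is a placeholder:
import asyncio

async def main() -> None:
    crawler = WebpageCrawler()
    base_url = "https://example.com/docs/"
    pages = await crawler.get_all_pages(base_url, base_url)
    # Deduplicate and print the discovered URLs.
    for page in sorted(set(pages)):
        print(page)

asyncio.run(main())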