WebpageCrawler Class Documentation

Overview

The WebpageCrawler class provides tools for crawling a given webpage and recursively exploring links to fetch all child pages under the same base URL. It validates the accessibility of links and categorizes them as either webpages or non-webpage resources based on their MIME type.

Class Reference

WebpageCrawler

Attributes

  • dict_href_links (dict): Tracks the links discovered during the crawl.

Methods


__init__()

Initializes a new instance of the WebpageCrawler class.

Usage:

crawler = WebpageCrawler()

async fetch(session: ClientSession, url: str) -> str

Asynchronously fetches the HTML content of the specified URL.

Parameters:

  • session (ClientSession): The session used to make HTTP requests.
  • url (str): The URL to fetch.

Returns:

  • (str): The HTML content of the page.

Raises:

  • aiohttp.ClientError: If the HTTP request fails.

Usage:

async with aiohttp.ClientSession() as session:
    try:
        html_content = await crawler.fetch(session, url)
    except aiohttp.ClientError as e:
        print(f"Failed to fetch {url}: {e}")

url_exists(url: str) -> bool

Checks if a given URL is accessible by performing a HEAD request.

Parameters:

  • url (str): The URL to check.

Returns:

  • (bool): True if the URL is accessible (status code 200), otherwise False.

Usage:

if crawler.url_exists(url):
    print("URL exists")
else:
    print("URL does not exist")

async get_links(session: ClientSession, website_link: str, base_url: str) -> List[str]

Extracts and normalizes valid links from the HTML content of a webpage.

Parameters:

  • session (ClientSession): The session used for making HTTP requests.
  • website_link (str): The URL of the webpage to extract links from.
  • base_url (str): The base URL to filter out external links.

Returns:

  • (List[str]): A list of normalized URLs that are valid and fall under the base URL.

Usage:

links = await crawler.get_links(session, website_link, base_url)
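
One way to perform this extraction, sketched here with BeautifulSoup, is to resolve every href against the page URL, strip URL fragments, and keep only links that start with the base URL. The parser choice and fragment handling below are illustrative assumptions, not the class's confirmed implementation:

from typing import List
from urllib.parse import urldefrag, urljoin
from aiohttp import ClientSession
from bs4 import BeautifulSoup

async def get_links(session: ClientSession, website_link: str, base_url: str) -> List[str]:
    # Download the page and parse its anchor tags
    async with session.get(website_link) as response:
        html_content = await response.text()
    soup = BeautifulSoup(html_content, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs and drop any "#fragment" suffix
        absolute_url, _ = urldefrag(urljoin(website_link, anchor["href"]))
        # Keep only links that fall under the base URL
        if absolute_url.startswith(base_url):
            links.append(absolute_url)
    return links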

async get_subpage_links(session: ClientSession, urls: List[str], base_url: str) -> List[str]

Asynchronously gathers links from multiple webpages.

Parameters:

  • session (ClientSession): The session used for making HTTP requests.
  • urls (List[str]): A list of URLs to fetch links from.
  • base_url (str): The base URL to filter out external links.

Returns:

  • (List[str]): A combined list of all child URLs discovered from the provided list of URLs.

Usage:

all_links = await crawler.get_subpage_links(session, urls, base_url)
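
Because each page can be fetched independently, the same result can be reproduced by a caller with asyncio.gather over the documented get_links method. The gather-and-flatten pattern below is a sketch of that idea, not necessarily how the method is implemented internally:

import asyncio

# Gather the per-page link lists concurrently, then flatten them into one list
per_page_links = await asyncio.gather(
    *(crawler.get_links(session, url, base_url) for url in urls)
)
all_links = [link for page_links in per_page_links for link in page_links]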

async get_all_pages(url: str, base_url: str) -> List[str]

Recursively crawls a website to gather all valid URLs under the same base URL.

Parameters:

  • url (str): The initial URL to start crawling from.
  • base_url (str): The base URL to filter out external links.

Returns:

  • (List[str]): A complete list of all URLs discovered under the base URL.

Usage:

all_pages = await crawler.get_all_pages(url, base_url)
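
Since get_all_pages is a coroutine, it must be driven by an event loop. A minimal end-to-end sketch, assuming WebpageCrawler has already been imported and using a placeholder base URL:

import asyncio

async def main() -> None:
    crawler = WebpageCrawler()
    base_url = "https://example.com/docs/"  # placeholder; substitute the site to crawl
    all_pages = await crawler.get_all_pages(base_url, base_url)
    for page in all_pages:
        print(page)

asyncio.run(main())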