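Finding All Classes on a Web Page with BeautifulSoup
The goal is to write a program that finds every class used on the page at a given URL. Beautiful Soup has no built-in method that lists all classes, so we have to collect them ourselves. This post from the 晓得 blog shows how to do that with BeautifulSoup.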
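Required modules:
bs4: Beautiful Soup (bs4) is a Python library for extracting data from HTML and XML files. It is not part of the Python standard library. To install it, run the following command in a terminal.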
pip install beautifulsoup4
requests: Requests makes it very easy to send HTTP/1.1 requests. It is not part of the Python standard library either. To install it, run the following command in a terminal.
pip install requests
1. Finding a class in a given HTML document
- Create the HTML document.
- Import the module.
- Parse the content with BeautifulSoup.
- Look up the data by class name.
# html code
html_doc = """<html><head><title>Welcome to geeksforgeeks</title></head>
<body>
<p class="title"><b>Geeks</b></p>
<p class="body">geeksforgeeks a computer science portal for geeks
</body>
"""
# import module
from bs4 import BeautifulSoup
# parse html content
soup = BeautifulSoup( html_doc , 'html.parser')
# Find the first element whose class is "body" and print it
print(soup.find(class_="body"))
Output:
<p class="body">geeksforgeeks a computer science portal for geeks
</p>
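Note that find() returns only the first match, and that class is a multi-valued attribute in Beautiful Soup. The short sketch below illustrates both points; the sample HTML and the variable names are made up for this illustration and are not part of the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for this illustration
html_doc = """<html><body>
<p class="title">Geeks</p>
<p class="body intro">First paragraph</p>
<p class="body">Second paragraph</p>
<p>No class here</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all() returns every matching element, while find() returns only the first.
# class_="body" also matches elements that carry additional classes.
body_paragraphs = soup.find_all("p", class_="body")
body_texts = [p.get_text() for p in body_paragraphs]

# Indexing the class attribute yields a list of class names, not a single string
first_classes = body_paragraphs[0]["class"]

# .get() returns None instead of raising KeyError when the attribute is absent
missing = soup.find_all("p")[-1].get("class")

print(body_texts)
print(first_classes)
print(missing)
```

Because the class attribute comes back as a list, the names have to be joined or added one by one when they are collected, which is what the next section does.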
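2. Finding all classes at a URL
- Import the modules.
- Send a request to the URL.
- Pass the response content to the BeautifulSoup() constructor.
- Iterate over all tags and collect their class names.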
# Import Module
from bs4 import BeautifulSoup
import requests
# Website URL
URL = 'https://learnpython.com/blog/'
# class list set
class_list = set()
# Page content from Website URL
page = requests.get( URL )
# parse html content
soup = BeautifulSoup( page.content , 'html.parser')
# get all tags
tags = {tag.name for tag in soup.find_all()}
# iterate all tags
for tag in tags:
    # find all elements with this tag name
    for i in soup.find_all(tag):
        # check whether the element has a class attribute
        if i.has_attr("class"):
            if len(i['class']) != 0:
                # join the element's classes into one space-separated string
                class_list.add(" ".join(i['class']))
print(class_list)
Output:
{'main-menu', 'main-menu__item main-menu__item--logout hide', 'site-header-home-navigation-hamburger-link', 'page-item active', 'blog-list-summary-info', 'main-menu__item main-menu__item--create-free-account', 'footer__bottom-text', 'footer__social-share', 'site-navigation',
'footer__hr', 'blog-post-date', 'site-header-home-navigation-below showOnLogged hide', 'logout__full-name user-name-element', 'footer__main-section', 'footer__social-share-item', 'site-header-home-navigation-layer-menu-icon middle', 'main-menu__item main-menu__item--courses',
'summary-blog-post-meta-list-author-name', 'footer__policies-list', 'site-header-home-navigation-layer-menu-icon top', 'logout-modal__link',
'blog-post-featured-image blog-list-feature-image tall', 'site-header-home-navigation-hamburger-wrapper pages hide', 'site-header-home-navigation-hamburger-item showOnLogged hide', 'page-link', 'footer__assistance', 'library-modal__layer', 'summary-read-more-link button--link',
'logout-modal__avatar avatar',
'site-header-home-navigation-hamburger-layout', 'library-modal modal', 'learnpy-blog-navigation', 'site-header-home-navigation-layer-menu-cover-active', 'page-item disabled', 'footer__wrapper', 'footer__quick-link-list-item', 'summary-read-more blog-list',
'footer__assistance-content', 'footer__quick-links', 'site-header-home-navigation-hamburger-item site-header-home-navigation-hamburger-item--articles', 'site-header', 'site-header-home-navigation-hamburger', 'button--primary',
'site-header-home-navigation-layer-menu pages', 'site-header-home-navigation-below-item-button button--primary', 'blog-list-summary', 'logout-modal modal', 'logout-modal__separator-border',
'blog-list-feature-image-link', 'footer__header', 'pagination', 'learnpy-blog-navigation-wrapper', 'to-top home', 'logout-modal__intro', 'blog-list-first-article', 'site-header-home-navigation-hamburger-item site-header-home-navigation-hamburger-item--courses', 'site-header-home-navigation-layer-menu-icon bottom', 'site-header-home-navigation-below-item-link',
'site-header-home-navigation-layer-menu-icon pages',
'button--footer', 'logout-modal__layer modal__layer', 'logout-modal__window modal__window', 'footer__follow-us', 'lazyload', 'learnpy-blog-navigation-item active', 'summary__content', 'footer__quick-link-list', 'summary-blog-post-meta-author-link', 'main-menu__item main-menu__item--library', 'blog-list-summary-title', 'logout__avatar avatar', 'footer',
'site-header-home-navigation-below-item-link button--link',
'footer__policies-list-item', 'page-item', 'blog-list-header-gradient', 'logout-modal__link logout-modal__link--logout', 'footer__vertabelo-link', 'blog-list-content tab-content', 'blog-list-header-background', 'main-menu__item main-menu__item--log-in', 'logout-modal__name user-name-element', 'button--ghost', 'footer__copyright', 'site-header-home-navigation-below hideOnLogged',
'site-header-home-navigation-below-item',
'blog-list-container', 'library-modal__window modal__window', 'site-header-home-navigation-layer-menu-cover', 'site-logo', 'footer__logo'}
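In the output above, an element with several classes shows up as one combined entry such as 'page-item active'. If individual class names are wanted instead, one possible variation is sketched below. The helper name collect_classes and the sample HTML are assumptions made for this illustration; the sketch works on an HTML string so that it runs without network access.

```python
from bs4 import BeautifulSoup


def collect_classes(html):
    """Return the set of individual class names found in an HTML string.

    collect_classes is a hypothetical helper for this sketch,
    not a Beautiful Soup function.
    """
    soup = BeautifulSoup(html, "html.parser")
    class_names = set()
    # class_=True selects tags that have a class attribute,
    # so a single pass over the document is enough
    for tag in soup.find_all(class_=True):
        # tag["class"] is a list, so update() adds each name separately
        class_names.update(tag["class"])
    return class_names


# Hypothetical sample markup standing in for a downloaded page
sample_html = """<div class="card card--wide">
  <p class="title">Hello</p>
  <p class="title muted">World</p>
  <span>no class</span>
</div>"""

found = collect_classes(sample_html)
print(sorted(found))
```

To apply this to a live page, the text of a response such as requests.get(URL, timeout=10).text could be passed to collect_classes in place of sample_html. The result will differ from the listing above because each class name is stored on its own.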