Extracting JSON from HTML with BeautifulSoup

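JSON is a lightweight data-interchange format. Because it uses little bandwidth and is easy for clients to read and for servers to parse, it is widely used for saving data scraped from web pages. An earlier article covered using BeautifulSoup to check where a keyword ranks in Google search results. This article from the Xiaode blog (晓得博客) shows how to use BeautifulSoup to extract data from HTML and save it as JSON.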

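Required Python libraries

bs4: Beautiful Soup (bs4) is a Python library for extracting data from HTML and XML files. It is not part of the Python standard library. To install it, run the following command in a terminal.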

pip install bs4

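requests: Requests lets you send HTTP/1.1 requests with very little code. It is not part of the Python standard library either. To install it, run the following command in a terminal.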

pip install requests

Approach

Import all the required modules.
Pass the URL to requests.get() so that it sends a GET request to that URL and returns the response.

Syntax: requests.get(url, params=None, **kwargs)
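To see what the optional arguments do without touching the network, the sketch below builds a request and prepares it without sending it. The URL, query parameters, and User-Agent value are made up for illustration. requests.get() accepts the same params and headers keyword arguments.

```python
import requests

# Hypothetical URL and values, used only to show how the arguments are applied
request = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "python", "page": 2},
    headers={"User-Agent": "demo-client/0.1"},
)

# prepare() builds the final URL and headers but does not send anything
prepared = request.prepare()
final_url = prepared.url
user_agent = prepared.headers["User-Agent"]
print(final_url)
```

In a real script you would call requests.get(url, params=..., headers=..., timeout=...) directly and then check the response status before parsing.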

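Next, parse the HTML content with bs4.

Syntax: BeautifulSoup(page.text, 'html.parser')

page.text: the HTML content of the response, decoded to a string.
html.parser: the name of the HTML parser to use, here the one built into Python.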
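As a minimal sketch of the parsing step, the example below feeds a small inline HTML string to BeautifulSoup instead of a downloaded page. The HTML is invented for illustration and stands in for page.text.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for page.text
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='note'>Hello</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string        # text inside the <title> tag
first_paragraph = soup.p.get_text()   # text inside the first <p> tag
print(page_title, first_paragraph)
```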

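Use find() and find_all() to pick out the data you need: locate the li, a, and p tags that hold each list entry (books in this example) by their distinctive class or id attributes. To discover those attributes, open the page in a browser, right-click the element of interest, and inspect it.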
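The lookups described above can be sketched on a small inline snippet. The markup below is a simplified, hypothetical imitation of a book listing, not the real page, and the class name "book" is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that imitates a list of book entries
html = """
<ul>
  <li class="book"><a href="a/index.html"><img alt="First Book"></a>
      <p class="star-rating Three"></p><p class="price_color">£10.00</p></li>
  <li class="book"><a href="b/index.html"><img alt="Second Book"></a>
      <p class="star-rating Five"></p><p class="price_color">£20.00</p></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, find_all() returns every match
first = soup.find("li", attrs={"class": "book"})
titles = [li.find("img")["alt"] for li in soup.find_all("li", class_="book")]

# Attribute values are read with dictionary-style access, text with .text
link = first.find("a")["href"]
price = first.find("p", class_="price_color").text
rating_classes = first.find("p", class_="star-rating")["class"]
print(titles, link, price, rating_classes)
```

Note that the class attribute comes back as a list of class names, which is one way to read the rating word without testing each possible value in turn.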

Create a JSON file and use json.dump() to serialize the Python object into it as JSON.
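A minimal sketch of the serialization step, using only the standard library. The record and the file name demo_books.json are made up for illustration, and the file is written to a temporary directory so nothing is left behind.

```python
import json
import os
import tempfile

# Hypothetical record containing non-ASCII characters
records = [{"title": "Café Stories", "price": "£10.00"}]

escaped = json.dumps(records)                        # non-ASCII written as \uXXXX escapes
readable = json.dumps(records, ensure_ascii=False)   # characters kept as they are

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_books.json")

    # With ensure_ascii=False the file encoding must be able to hold every character
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4, ensure_ascii=False)

    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(readable)
```

Opening the file as UTF-8 is the safer choice when ensure_ascii=False is set, because a narrower encoding raises an error on any character it cannot represent.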


The complete Python implementation is below:

import json
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def json_from_html_using_bs4(base_url):
    # Fetch the page and stop early on an HTTP error status
    page = requests.get(base_url, timeout=10)
    page.raise_for_status()

    # Pass the raw bytes so BeautifulSoup detects the encoding itself;
    # page.text can be mis-decoded when the server sends no charset
    soup = BeautifulSoup(page.content, "html.parser")
    books = soup.find_all('li', attrs={'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})
    star = ['One', 'Two', 'Three', 'Four', 'Five']
    res = []

    # Iterate over the book entries and read the required tags from each
    for book_no, book in enumerate(books, start=1):
        title = book.find('img')['alt']
        # Resolve the relative href against the page URL
        link = urljoin(base_url, book.find('a')['href'])

        # The rating is encoded in the class name, e.g. "star-rating Three"
        stars = None  # stays None if no rating class is found
        for name in star:
            if book.find('p', attrs={'class': 'star-rating ' + name}) is not None:
                stars = name + " out of 5"
                break

        price = book.find('p', attrs={'class': 'price_color'}).text
        instock = book.find('p', attrs={'class': 'instock availability'}).text.strip()

        data = {'book no': str(book_no), 'title': title, 'rating': stars,
                'price': price, 'link': link, 'stock': instock}

        # Append the dictionary to the list
        res.append(data)
    return res


# Main function
if __name__ == "__main__":
    # Enter the URL of the website
    base_url = "https://books.toscrape.com/catalogue/page-1.html"
    res = json_from_html_using_bs4(base_url)
    # Write the result to the books.json file; UTF-8 can hold any character
    # that ensure_ascii=False leaves unescaped
    with open('books.json', 'w', encoding='utf-8') as f:
        json.dump(res, f, indent=8, ensure_ascii=False)
    print("Created Json File")

Output: the script writes the scraped records to books.json and prints "Created Json File".