beautifulsoup模块简介 · CoderFAN 资料库

`beautifulsoup`模块简介

原文：https://www.studytonight.com/python/web-scraping/introduction-to-beautifulsoup

在本教程中，我们将学习如何使用 python 的模块解析网页的源代码(我们可以使用 requests模块获得)，并从源代码中找到各种有用的信息，如所有的 HTML 表格标题，或网页上的所有链接等。

**如果我们向它提供了关于 HTML 标签的所有信息，美化程序可以搜索并返回所有出现的 HTML 标签。

在我们开始搜索 HTML 标签和从网页访问信息之前，让我们看看如何格式化收到的 HTTP 响应内容，使其更具可读性。

美化组:美化内容

在beautifulsoup模块中可用的方法prettify可用于格式化使用requests模块接收的 HTTP 响应。

下面是代码示例，它是对上一个教程的扩展:

## import modules
import requests
from fake_useragent import UserAgent
## importing the beautifulsoup module
import bs4

## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

## creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")

## using 'prettify' method to print the content
print(soup.prettify())

在上面的代码中，我们执行了以下操作:

导入模块:请求、假用户代理和 bs4 。
从任何你喜欢的网址获取回复。
使用BeautifulSoup类创建一个美化组对象。
使用prettify方法打印响应，方法是使用美丽的输出对象。

如果您在阅读了前面的教程后来到这里，您一定已经看到了使用requests模块做出的 GET 请求的响应是什么样子的。

当我们使用prettify方法格式化该响应时，它看起来像是这个 (点击这个下载文件)。

现在响应已经格式化了，让我们学习如何使用美丽的输出来访问 HTTP 响应中的各种 HTML 标签和相关信息(源代码)。

BeautifulSoap:访问 HTML 标签

使用“美化”模块，我们可以轻松找到并访问各种 HTML 标签的内容，如标题、标题、 div 、 p 、 h1 等。让我们看一个简单的例子，我们将打印网页的标题标签。

## import modules
import requests
from fake_useragent import UserAgent
## importing the beautifulsoup module
import bs4

## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

## creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")

## getting 'title' tag from the google BeautifulSoup -> 'soup'
title_tag = soup.title
print(title_tag)

谷歌

我们还可以只获得包含在开始和结束标题标签中的文本:

## import modules
import requests
from fake_useragent import UserAgent
## importing the beautifulsoup module
import bs4

## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

## creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")

## getting 'title' tag from the google BeautifulSoup -> 'soup'
title_text = soup.title.text
print(title_text)

谷歌

这是所有 HTML 标签的标准，比如获取头标签，我们可以这样使用soup.head，

## import modules
import requests
from fake_useragent import UserAgent
## importing the beautifulsoup module
import bs4

## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

## creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")

## getting 'head' tag from the google BeautifulSoup -> 'soup'
print(soup.head)

这将从页面的源代码中返回完整的头标签。

< meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/> 谷歌

我们没有在输出中添加完整的代码，因为它非常庞大。但是可以看到标题标签在头标签里面，里面也有风格标签。

我们还可以通过头标签获取标题标签内容:

## getting 'title' tag from the google BeautifulSoup -> 'soup'
print(soup.head.title.text)

谷歌

这只是为了向您展示，当美丽组遵循树遍历技术来解析 HTML 代码时，我们也可以通过遵循标签的层次结构来访问它们。

同样，让我们访问样式标签:

## getting 'title' tag from the google BeautifulSoup -> 'soup'
print(soup.head.style.text)

gbar，# guser { font-size:13px；填充-顶部:1px！重要；} # gbar {高度:22px } # guser {填充-底部:7px！重要；文本对齐:向右}。gbh，。gbd {边框-顶部:1px 实心# c9d 7 f1；字体大小:1px}。gbh {高度:0；位置:绝对；top:24px；宽度:100% } @ media all { . gb1 {高度:22px 右边距:. 5em 垂直对齐:top}#gbar{float:left}}a.gb1，a.gb4{text-decoration:下划线！重要}a.gb1，a.gb4{color:#00c！重要}。gbi .gb4{color:#dd8e27！重要}。gbf .gb4{color:#900！重要}

到目前为止，我们已经介绍了基本的 HTML 解析和访问标签。在下一个教程中，我们将看到更多的方法和一些方法来浏览任何网页的 HTML 源代码来收集有用的数据。