Web page code is mostly made up of paired (opening and closing) tags. The basic structure looks like this:
<!DOCTYPE html>
<html lang="en">
<head>
<!-- page head information -->
<title>网页名</title>
</head>
<body>
<!-- the page body starts below -->
<div>
div-text
</div>
</body>
</html>
Nearly every web page follows this structure, and the data worth extracting usually sits inside body.
html_str = """
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>网页名</title>
</head>
<body>
<div>
div-text
<span>span-text</span>
<a>a-text</a>
<p>p-text</p>
</div>
<table>
<tr>
<th>Heading</th>
<th>Another Heading</th>
</tr>
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
table-text-2
</table>
</body>
</html>
"""
from lxml import etree
html = etree.HTML(html_str)
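This code imports etree from lxml, then passes the made-up html_str string above to the etree.HTML function, which parses it into an element tree that supports xpath and stores the result in the variable html.
Put the two pieces of code together and run them. If they run without errors, the lxml library is working.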
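A quick sanity check along those lines might look like the sketch below. The short `sample` string is a hypothetical stand-in for html_str, used only to keep the example self-contained.

```python
from lxml import etree

# Hypothetical stand-in for html_str, kept short for the check.
sample = "<html><head><title>t</title></head><body><div>div-text</div></body></html>"
html = etree.HTML(sample)

# The installed lxml version, as a tuple of integers.
print(etree.LXML_VERSION)

# etree.HTML returns the root <html> element of the parsed tree.
print(html.tag)

# The root element supports xpath queries relative to itself.
print(html.xpath('head/title/text()'))
```

If all three print calls run without an exception, the import and the parser are both working.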
Task 1: extract the value of the title tag inside head (that is, '网页名', the page name)
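A tag's value is obtained with text(), so the title value is title/text(). The title tag is a child of head, and head sits directly under the root html element that etree.HTML returns, so the full path is head/title/text().
print(html.xpath('head/title/text()'))
This prints a list: ['网页名']
Task 2: extract the values of the span, a, and p tags under the div tag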
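The xpath path for the span value is body/div/span/text(), result ['span-text']
The xpath path for the a value is body/div/a/text(), result ['a-text']
The xpath path for the p value is body/div/p/text(), result ['p-text']
Task 3: try parsing the text() of the div tag itself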
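We have already fetched the values of div's child tags, and fetching the parent's value is just as simple.
The xpath path for div is body/div/text()
Result: ['\n div-text\n ', '\n ', '\n ', '\n ']
The child tags split the parent's text(). With no children, the result list has one element; with two children, it has three. It is like a noodle cut twice: you end up with three noodles. [Not cut lengthwise (ˉ▽ ̄~) ~~] Here div has three children, so the list has four elements.
\n is a newline. The exact whitespace in each element depends on how html_str is indented.
Task 4: extract the text values of th and td
There are two th tags and two td tags. Write the path the same way as before: start at body, go to table, then tr, and finally th or td.
The paths are body/table/tr/th/text() and body/table/tr/td/text()
Results: ['Heading', 'Another Heading'] and ['row 1, cell 1', 'row 1, cell 2']
The full code:
html_str = """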
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>网页名</title>
</head>
<body>
<div>
div-text
<span>span-text</span>
<a>a-text</a>
<p>p-text</p>
</div>
<table class="2">
<tr>
<th>Heading</th>
<th>Another Heading</th>
</tr>
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
table-text-2
</table>
</body>
</html>
"""
from lxml import etree
html = etree.HTML(html_str)
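# Task 1: text of the title tag
print(html.xpath('head/title/text()'))
# Task 2: text of the span, a, and p tags under div
print(html.xpath('body/div/span/text()'))
print(html.xpath('body/div/a/text()'))
print(html.xpath('body/div/p/text()'))
# Task 3: text pieces that belong to div itself
print(html.xpath('body/div/text()'))
# Task 4: text of the th and td cells in the table
print(html.xpath('body/table/tr/th/text()'))
print(html.xpath('body/table/tr/td/text()'))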
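The noodle picture from Task 3 can be checked on a smaller input. The sketch below uses a hypothetical one-line `snippet` (not part of html_str) with two children inside a div, and shows that lxml stores the pieces in the `.text` of the parent and the `.tail` of each child. The last lines show one possible way to drop whitespace-only pieces, applied to the Task 3 result as quoted above.

```python
from lxml import etree

# Hypothetical snippet: a div with two children and text around them.
snippet = "<div>one<span>s</span>two<a>a</a>three</div>"
root = etree.HTML(snippet)
div = root.xpath('body/div')[0]

# Two children cut the div's text into three pieces.
pieces = div.xpath('text()')
print(pieces)

# The first piece is div.text; each later piece is the .tail of a child.
print(div.text)
for child in div:
    print(child.tag, repr(child.tail))

# Dropping whitespace-only pieces from the Task 3 result quoted above.
task3_result = ['\n div-text\n ', '\n ', '\n ', '\n ']
cleaned = [t.strip() for t in task3_result if t.strip()]
print(cleaned)
```

Whether to strip the whitespace depends on the data; for scraped values it is usually the text without the surrounding newlines that matters.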