[Original] Python 3 Basic Web Scraping (Static Pages)

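Python is often used to write web scrapers. Before starting on web scraping with Python, it is worth getting a solid grasp of HTML (HyperText Markup Language).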

I. Fetching a page's source code

Before a page can be parsed, its source code has to be downloaded.

In Python 3 there are two main ways to download a page:

1. requests

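import requests

url = "http://xxxxx.com"  # the page address (placeholder); it must include the scheme
html = requests.get(url).text  # send a GET request and take the response body as text
print(html)  # print the downloaded source code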
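The one-liner above assumes the request succeeds and that the encoding is detected correctly. As a minimal sketch (the function name fetch_text and the 10-second timeout are illustrative assumptions, not part of the original post), a slightly more defensive version might look like this:

```python
import requests


def fetch_text(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    # Without a timeout, requests may wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    # If the server declares no charset, requests assumes ISO-8859-1 for text;
    # in that case let it guess the encoding from the body instead.
    if response.encoding is None or response.encoding.lower() == "iso-8859-1":
        response.encoding = response.apparent_encoding
    return response.text
```

It would be called as fetch_text(url) in place of requests.get(url).text.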

2. urllib

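import urllib.request

url = "http://xxxxx.com"  # the page address (placeholder); it must include the scheme
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")  # read() returns bytes; decode them (assuming the page is UTF-8)
print(html)  # print the source code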
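Since urllib returns raw bytes, the text encoding has to be chosen by hand. The following is only a sketch, assuming a hypothetical helper named fetch_html and a fallback of UTF-8 when the server does not declare a charset:

```python
import urllib.request


def fetch_html(url, timeout=10, default_charset="utf-8"):
    """Download a page with urllib and return it as str (hypothetical helper)."""
    request = urllib.request.Request(url)
    # The with-block closes the connection once the body has been read.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when the server sends one.
        charset = response.headers.get_content_charset() or default_charset
    # errors="replace" avoids an exception on stray bytes that do not decode.
    return raw.decode(charset, errors="replace")
```

It would be called as fetch_html(url) with the same page address as above.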

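II. Using headers

First, what are headers? They are the "Name: value" lines that a client sends along with an HTTP request. A browser request looks something like this: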

GET xxxx HTTP/1.1
Host: xxx
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.6.5 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: xxxx
Pragma: no-cache
Cache-Control: no-cache

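Sometimes fetching a page fails (or the content comes back incomplete). A likely cause is that the site has "anti-scraping" measures in place; in that case we need to use headers to make the request look like it comes from a browser.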
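One reason a site can tell a script from a browser is the default User-Agent. As a small illustration using only the standard library, this prints the headers urllib adds when none are given:

```python
import urllib.request

# build_opener() returns the same kind of opener that urlopen() uses internally.
opener = urllib.request.build_opener()
# addheaders is a list of (name, value) pairs sent with every request.
default_headers = dict(opener.addheaders)
print(default_headers)  # a single 'User-agent' entry that identifies Python-urllib
```

A server that checks for a browser-like User-Agent can easily reject a value like this.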

1. Adding headers with requests

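import requests

url = "http://xxxxx.com"  # the page address (placeholder); it must include the scheme
headers = {}  # the headers to send (a dictionary); fill in as needed (left empty here)
html = requests.get(url, headers=headers).text  # GET request, now sent with the headers
print(html)  # print the downloaded source code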

2. Adding headers with urllib

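import urllib.request

url = "http://xxxxx.com"  # the page address (placeholder); it must include the scheme
headers = {}  # the headers to send (a dictionary); fill in as needed (left empty here)
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")  # read() returns bytes; decode them (assuming the page is UTF-8)
print(html)  # print the source code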
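Both examples leave the headers dictionary empty. As a sketch of what a filled-in one could look like, the values below are copied from the sample request earlier in this section; the names BROWSER_HEADERS and build_request are illustrative assumptions. Nothing is sent over the network here, the code only builds the request and prints its headers:

```python
import urllib.request

# Header values taken from the sample browser request shown earlier.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) "
                  "Gecko/20091102 Firefox/3.6.5 (.NET CLR 3.5.30729)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-us,en;q=0.5",
}


def build_request(url, extra_headers=None):
    """Return a urllib Request carrying browser-like headers (hypothetical helper)."""
    headers = dict(BROWSER_HEADERS)  # copy so the shared defaults are not modified
    if extra_headers:
        headers.update(extra_headers)
    return urllib.request.Request(url, headers=headers)


request = build_request("http://xxxxx.com", {"Cookie": "xxxx"})
for name, value in request.header_items():
    print(f"{name}: {value}")
```

The same dictionary can be passed to requests.get(url, headers=...) unchanged.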

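III. Analyzing the page source

Now that the page source has been downloaded, the next step is to extract the useful information from the static page's source code.

This step requires some knowledge of HTML.

For parsing the source, I recommend a library called BeautifulSoup.

Installing BeautifulSoup

With the pip installer, we can simply run the following on the command line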

pip install beautifulsoup4

to install the BeautifulSoup library. Once you see "Successfully installed", the installation is complete.
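As a quick check that the package can be imported by the Python interpreter you are actually using, something like the following can be run (a minimal sketch; the variable name installed_version is just illustrative):

```python
import bs4  # the import name differs from the pip package name beautifulsoup4

installed_version = bs4.__version__
print("beautifulsoup4", installed_version)
```

If this raises ModuleNotFoundError, pip most likely installed the package for a different Python installation.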

Using BeautifulSoup

1> Creating a BeautifulSoup object

First, import the class from the bs4 package

from bs4 import BeautifulSoup

Create a BeautifulSoup object

soup = BeautifulSoup(html)

When creating the object you can also pass a second argument to choose a specific parser:

Parser	Usage
Python standard library	BeautifulSoup(xxx, "html.parser")
lxml parser	BeautifulSoup(xxx, "lxml")
html5lib	BeautifulSoup(xxx, "html5lib")
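Only html.parser ships with Python; lxml and html5lib have to be installed separately. The sketch below (the helper name make_soup and the order of preference are assumptions for illustration) tries each parser in turn and falls back to the next when one is not installed:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is available."""
    for parser in preferred:
        try:
            return BeautifulSoup(html, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<p>hello</p>")
print(parser_used, soup.p.get_text())
```

Naming the parser explicitly also keeps results consistent between machines, since different parsers can build slightly different trees from malformed HTML.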

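2> The BeautifulSoup tree structure

BeautifulSoup converts an HTML document into a tree in which every node is a Python object, so data can be read out of it easily. The objects in the tree fall into four kinds: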

1.Tag
2.NavigableString
3.BeautifulSoup
4.Comment
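To make the four kinds concrete, here is a small sketch that parses a made-up snippet (the HTML string is an assumption for illustration) and prints the type of each object:

```python
from bs4 import BeautifulSoup

html = "<p class='title'>hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# The document itself, the <p> tag, then the tag's two children:
# the text "hello" and the HTML comment.
kinds = [type(soup).__name__, type(p).__name__]
kinds += [type(child).__name__ for child in p.children]
print(kinds)  # ['BeautifulSoup', 'Tag', 'NavigableString', 'Comment']
```

Comment is a subclass of NavigableString, so code that wants only visible text may need to filter comments out.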

3> Commonly used BeautifulSoup functions

find_all - returns a list of the matching elements

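find_all(name, attrs, recursive, string, limit, **kwargs)

Parameters:
name: an HTML tag name, such as 'li', 'a', 'div', 'span', 'h1'
attrs: a dictionary of HTML attributes to match, with keys such as 'id', 'class', 'href'
recursive: whether to search all descendants (the default) or only direct children
string: the text content to search for (called text in older versions of BeautifulSoup)
limit: the maximum number of results to return
Code example: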

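soup.find_all('li')  # returns a list; here we search for li tags
soup.find_all(attrs={'class': 'title'})  # search by attrs; for class there is another way, shown next
soup.find_all(class_="title")  # the other way, without attrs
# note that this is class_ and not class (with a trailing underscore), because class is a Python keyword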
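The calls above need a soup object to run against. The following sketch uses a small made-up list (the HTML and variable names are assumptions for illustration) to show what each form returns:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="title">first</li>
  <li class="title">second</li>
  <li>third</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

all_items = [li.get_text() for li in soup.find_all("li")]
by_attrs = [li.get_text() for li in soup.find_all(attrs={"class": "title"})]
by_class = [li.get_text() for li in soup.find_all(class_="title")]
first_only = [li.get_text() for li in soup.find_all("li", limit=1)]

print(all_items)   # ['first', 'second', 'third']
print(by_attrs)    # ['first', 'second']
print(by_class)    # ['first', 'second']
print(first_only)  # ['first']
```

The attrs and class_ forms select the same elements; class_ is simply shorter to write.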

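find
This function works much like find_all above, except that it returns only the first matching element (or None if nothing matches) rather than a list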
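Because find can return None, it is safer to check the result before using it. A minimal sketch with made-up HTML (the links and variable names are illustrative assumptions):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><a href='/a'>A</a><a href='/b'>B</a></div>", "html.parser")

first_link = soup.find("a")   # the first <a> tag in document order
missing = soup.find("table")  # there is no <table>, so this is None

# Guard against None before reading attributes from the result.
href = first_link["href"] if first_link is not None else None
print(href)     # /a
print(missing)  # None
```

Skipping the check and writing soup.find("table")["href"] would raise a TypeError here.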
get_text - gets all the text under a tag
Example:
Suppose the HTML contains a span tag
<span>
abc
<p>123</p>
</span>

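span = soup.find('span')
print(span.get_text(strip=True))  # prints abc123

This code retrieves the text content inside the span. strip=True trims the whitespace around each piece of text; a plain get_text() would also return the line breaks inside the span.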
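For reference, the sketch below runs get_text on the same span with a few different arguments (the variable names are illustrative), to show how whitespace is handled:

```python
from bs4 import BeautifulSoup

html = "<span>\nabc\n<p>123</p>\n</span>"
span = BeautifulSoup(html, "html.parser").find("span")

raw_text = span.get_text()                  # keeps the line breaks from the source
stripped = span.get_text(strip=True)        # strips each piece, then joins them
spaced = span.get_text(" ", strip=True)     # same, but joined with a space

print(repr(raw_text))  # '\nabc\n123\n'
print(stripped)        # abc123
print(spaced)          # abc 123
```

Passing a separator is often the most readable choice when a tag contains several nested elements.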
