Question

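So my brother wants me to write a web crawler in Python (I'm self-taught). I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library docs, but I have a few problems: 1. httplib.HTTPConnection and the request concept are new to me, and I don't understand whether it downloads an HTML script, like a cookie, or an instance.

Just for background, I need to download a page and replace any img with ones I have.

It would also be nice if you could tell me your opinion of 2.7 versus 3.1.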

Answer 1

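~~Use Python 2.7; it has more third-party libraries at the moment.~~ (Edit: see below.)

I recommend the stdlib module urllib2, which lets you fetch web resources comfortably. For example: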

import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()

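To parse the page source, have a look at BeautifulSoup.

BTW, what exactly are you trying to do here:

Just for background, I need to download a page and replace any img with ones I have.

Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library that is easier to use than urllib2.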
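For the "replace any img" part, here is a minimal Python 3 sketch. It uses the standard library's html.parser instead of BeautifulSoup, only so that it runs without installing anything. The names ImgSrcReplacer and replace_images, and the file name mine.png, are made up for illustration. It only rewrites src attributes that already exist, and it does not try to preserve every HTML quirk.

```python
import html
from html.parser import HTMLParser


class ImgSrcReplacer(HTMLParser):
    """Re-emit an HTML document, pointing every <img src=...> at new_src.

    Hypothetical helper for illustration, not part of any library.
    """

    def __init__(self, new_src):
        # convert_charrefs=False leaves entities such as &amp; in text untouched
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _img_tag(self, attrs, end):
        # Rebuild the <img> tag, swapping in the replacement src
        parts = ["<img"]
        for name, value in attrs:
            if name == "src":
                value = self.new_src
            if value is None:  # attribute without a value
                parts.append(" " + name)
            else:
                parts.append(' %s="%s"' % (name, html.escape(value, quote=True)))
        return "".join(parts) + end

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._img_tag(attrs, ">"))
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._img_tag(attrs, " />"))
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_images(page_source, new_src):
    """Return page_source with every existing img src set to new_src."""
    parser = ImgSrcReplacer(new_src)
    parser.feed(page_source)
    parser.close()
    return "".join(parser.out)
```

Calling replace_images(page_source, "mine.png") on the downloaded text returns the rewritten page. With BeautifulSoup the same idea is usually shorter, so treat this as a fallback for when you want to stay within the standard library.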

Answer 2

An example with python3 and the requests library, as mentioned by @leoluk:

pip install requests

The script req.py:

import requests
url='http://localhost'
# in case you need a session
cd = { 'sessionid': '123..'}
r = requests.get(url, cookies=cd)
# or without a session: r = requests.get(url)
# r.content holds the raw bytes; r.text is the decoded HTML
print(r.text)

Now run it, and you will get the HTML source of localhost:

python3 req.py

Answer 3

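If you are using Python 3.x, you don't need to install any libraries; this is built directly into the Python standard library. The functionality of the old urllib2 module now lives in urllib (mainly urllib.request):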

from urllib import request
response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)
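
If you would rather not hard-code the charset, one option is to read it from the response's Content-Type header. The sketch below assumes that falling back to UTF-8 is acceptable when the server declares nothing. The function names decode_body and fetch_text are made up for illustration.

```python
from urllib import request


def decode_body(raw_bytes, headers, default_charset="utf-8"):
    """Decode raw_bytes using the charset declared in the Content-Type header.

    headers is the response.headers object from urlopen (an email-style
    message). Falling back to default_charset is an assumption of this sketch.
    """
    charset = headers.get_content_charset() or default_charset
    return raw_bytes.decode(charset)


def fetch_text(url):
    """Download url and return its body as text (hypothetical helper)."""
    with request.urlopen(url) as response:
        return decode_body(response.read(), response.headers)


# Usage, with the URL from the snippet above:
# page_source = fetch_text("https://www.google.com")
```

Pages that declare their encoding only in an HTML meta tag are not covered by this; in that case you still have to pick the charset yourself.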