You can use a web crawler to download web novels. The main library used here is BeautifulSoup, which makes it very convenient to extract content from web pages.
Beautiful Soup is a Python library for extracting data from HTML and XML files. It lets you navigate, search, and modify the parse tree in an idiomatic way through your parser of choice, and it can save you hours or even days of work.
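As a minimal sketch of the pattern used below (the HTML snippet here is made up purely for illustration):

import requests
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, just for illustration
html = '<div id="content">Chapter text goes here...</div>'
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', id='content')  # locate a tag by its id attribute
print(div.text)                       # the tag's text with markup stripped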
Let's take downloading the novel 诡秘之主 as an example, using the tqdm library to display a download progress bar.
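tqdm only needs to wrap an iterable to render a progress bar; a minimal sketch:

import time
from tqdm import tqdm

for _ in tqdm(range(100)):  # tqdm wraps any iterable and prints progress
    time.sleep(0.01)        # stand-in for real work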
The full code is as follows:
import requests
import time
import numpy as np
from tqdm import tqdm
from bs4 import BeautifulSoup


def get_content(target):
    # Fetch a chapter page and extract the body text
    req = requests.get(url=target, verify=False)
    req.encoding = 'utf-8'
    html = req.text
    bf = BeautifulSoup(html, 'lxml')
    texts = bf.find('div', id='content')
    # Paragraphs on this site are separated by four non-breaking spaces
    content = texts.text.strip().split('\xa0' * 4)
    return content


if __name__ == '__main__':
    server = 'https://www.xsbiquge.com'
    book_name = '诡秘之主.txt'
    target = 'https://www.xsbiquge.com/15_15338/'
    # Fetch the table of contents and collect the chapter links
    req = requests.get(url=target)
    req.encoding = 'utf-8'
    html = req.text
    chapter_bs = BeautifulSoup(html, 'lxml')
    chapters = chapter_bs.find('div', id='list')
    chapters = chapters.find_all('a')
    for chapter in tqdm(chapters):  # e.g. chapters[1239:] to resume from chapter 1239
        chapter_name = chapter.string
        url = server + chapter.get('href')
        time.sleep(np.random.rand() * 5)  # random delay to avoid being blocked
        content = get_content(url)
        # Append each chapter title and body to the output file
        with open(book_name, 'a', encoding='utf-8') as f:
            f.write(chapter_name)
            f.write('\n')
            f.write('\n'.join(content))
            f.write('\n')
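Two small notes on the code above. First, verify=False disables TLS certificate verification, so urllib3 emits an InsecureRequestWarning on every request; if the output gets noisy, the warning can be silenced explicitly:

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

Second, numpy is imported only for np.random.rand(); the standard library's random.uniform(0, 5) would give the same random delay without the extra dependency.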