Web Scraping with Python, 2nd Edition (Reprint Edition)

Publisher: Southeast University Press

Author: Ryan Mitchell

Publication date: 2018-11

Price: 89.00 yuan

Binding: Paperback

ISBN: 9787564179779

Tags:

Python
Data methods
Data analysis
tech-network
Web crawlers
Data collection
Web Scraping
Network programming
Hands-on
Second edition
Reprint edition
Technical books

Description

About the Author

Ryan Mitchell

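Ryan Mitchell is a data scientist and software engineer. She currently develops the company's API and data analysis tools at LinkeDrive in Boston. Before that, she built web crawlers and bots at Abine. She regularly consults on web scraping projects, mainly for the financial and retail industries. She is also the author of Instant Web Scraping with Java.

Table of Contents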

Preface
Part I. Building Scrapers
1. Your First Web Scraper
Connecting
An Introduction to BeautifulSoup
Installing BeautifulSoup
Running BeautifulSoup
Connecting Reliably and Handling Exceptions
2. Advanced HTML Parsing
You Don't Always Need a Hammer
Another Serving of BeautifulSoup
find() and find_all() with BeautifulSoup
Other BeautifulSoup Objects
Navigating Trees
Regular Expressions
Regular Expressions and BeautifulSoup
Accessing Attributes
Lambda Expressions
3. Writing Web Crawlers
Traversing a Single Domain
Crawling an Entire Site
Collecting Data Across an Entire Site
Crawling Across the Internet
4. Web Crawling Models
Planning and Defining Objects
Dealing with Different Website Layouts
Structuring Crawlers
Crawling Sites Through Search
Crawling Sites Through Links
Crawling Multiple Page Types
Thinking About Web Crawler Models
5. Scrapy
Installing Scrapy
Initializing a New Spider
Writing a Simple Scraper
Spidering with Rules
Creating Items
Outputting Items
The Item Pipeline
Logging with Scrapy
More Resources
6. Storing Data
Media Files
Storing Data to CSV
MySQL
Installing MySQL
Some Basic Commands
Integrating with Python
Database Techniques and Good Practice
"Six Degrees" in MySQL

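Reader Reviews

Rating: ☆☆☆☆☆

Section 5.3.2, "Basic Commands", first sentence of the second paragraph: "Except for user-defined variable names (which are case-insensitive in MySQL 5.x, and case-insensitive in versions before MySQL 5.0), MySQL statements are case-insensitive." (wtf???????) In section 5.4, "Email", the code that checks for Christmas is indented incorrectly: both the sendMail function and the while loop are wrong, which causes an infinite loop! 8.2...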

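Rating: ☆☆☆☆☆

1. You can try using the Google API.
2. For sites that ban scrapers easily, use Tor for anonymity.
3. Use Tesseract to recognize CAPTCHAs; training it on special fonts improves the recognition rate.
4. Crawling every external link of an entire site is easy.
5. Use Selenium as a framework for testing websites.
6. Pay attention to cookies and request headers, and do what you can so the site does not treat you as a crawler.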
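Point 6 of this review can be illustrated with a small sketch. It is not taken from the book: it uses only the standard library, and the header values and the helper name `make_browser_like_opener` are assumptions chosen for illustration. The idea is to reuse one opener that keeps cookies between requests and sends browser-like headers instead of the default `Python-urllib` user agent.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# Hypothetical header values for illustration only; they are not from the book.
BROWSER_HEADERS = [
    ("User-Agent",
     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"),
    ("Accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-US,en;q=0.9"),
]


def make_browser_like_opener():
    """Return (opener, jar): an opener that stores cookies in `jar`
    and sends BROWSER_HEADERS with every request it makes."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # Replace the default "Python-urllib/x.y" User-agent header.
    opener.addheaders = list(BROWSER_HEADERS)
    return opener, jar


opener, jar = make_browser_like_opener()
# A later call such as opener.open(url) would send these headers and
# keep any cookies the server sets in `jar` for subsequent requests.
```

Whether this is enough to avoid being blocked depends on the site; it only removes the most obvious signs of a script.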

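Rating: ☆☆☆☆☆

The author is clearly an expert in this field, someone who has hit plenty of pitfalls and passes on that experience directly. The code in the book is elegant, well-formed, and concise, and makes heavy use of recursion and regular expressions. The translator does get some things wrong, though. On page 31, sixth line from the bottom, a colon was translated as a semicolon, which you would only notice after running the source code and comparing it against the wiki site. The author's source code also has errors...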

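Rating: ☆☆☆☆☆

The code on page 177 is logically wrong: pytesseract is imported but never used, and Tesseract is called through subprocess instead. That is probably the first edition's approach, though I can't tell whether the author or the translator is to blame. Changing the code as follows makes more sense: import time from urllib.request import urlretrieve from PIL import Image import pytesseract from...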

Rating: ☆☆☆☆☆