Table of Contents
Preface
Part I. Setup
Chapter 1. Theory
  Introduction
  Definition
    Methodology
    Agile Data Science Manifesto
  The Problem with the Waterfall Model
    Research vs. Application Development
  The Problem with Agile Software Development
    Eventual Quality: Paying Down Technical Debt
    The Pull of the Waterfall
  The Data Science Process
    Setting Expectations
    Data Science Team Roles
    Recognizing Opportunities and Challenges
    Adapting to Change
  Notes on Process
    Code Review and Pair Programming
    Agile Environments: Improving Productivity
    Realizing Ideas with Large-Format Printing
Chapter 2. Agile Tools
  Scalability = Ease of Use
  Data Processing in Agile Data Science
  Setting Up a Local Environment
    System Requirements
    Configuring Vagrant
    Downloading the Data
  Setting Up an EC2 Environment
    Downloading the Data
  Downloading and Running the Code
    Downloading the Code
    Running the Code
    Jupyter Notebooks
  Touring the Toolset
    Agile Stack Requirements
    Python 3
    Serializing Events with JSON Lines and Parquet
    Collecting Data
    Processing Data with Spark
    Publishing Data with MongoDB
    Searching Data with Elasticsearch
    Distributing Streams with Apache Kafka
    Processing Streams with PySpark Streaming
    Machine Learning with scikit-learn and Spark MLlib
    Scheduling with Apache Airflow (Incubating)
    Reflecting on Our Workflow
    Lightweight Web Applications
    Presenting Data
  Chapter Summary
Chapter 3. Data
  Air Travel Data
    Flight On-Time Performance Data
    OpenFlights Database
  Weather Data
  Data Processing in Agile Data Science
    Structured vs. Semi-structured Data
  SQL vs. NoSQL
    SQL
    NoSQL and Dataflow Programming
    Spark: SQL + NoSQL
    Schemas in NoSQL
    Data Serialization
    Extracting and Exposing Features in Evolving Schemas
  Chapter Summary
Part II. Climbing the Pyramid
Chapter 4. Collecting and Displaying Records
  Putting It All Together
  Collecting and Serializing Flight Data
  Processing and Publishing Flight Records
    Publishing Flight Records to MongoDB
  Presenting Flight Records in a Browser
    Serving Flight Information with Flask and pymongo
    Rendering HTML5 Pages with Jinja2
  Agile Checkpoint
  Listing Flight Records
    Listing Flight Records with MongoDB
    Paginating Data
  Searching Flight Data
    Creating an Index
    Publishing Flight Data to Elasticsearch
    Searching Flight Data on the Web
  Chapter Summary
Chapter 5. Visualizing Data with Charts
  Chart Quality: Iteration Is Essential
  Scaling a Database with the Publish/Decorate Model
    First-Order Form
    Second-Order Form
    Third-Order Form
    Choosing a Form
  Exploring Seasonality
    Querying and Presenting Total Flight Counts
  Extracting "Metal" (Airplanes [Entities])
    Extracting Tail Numbers
    Assessing Airplane Records
  Data Enrichment
    Reverse Engineering a Web Form
    Gathering Tail Numbers
    Automating Form Submission
    Extracting Data from HTML
    Evaluating the Enriched Data
  Chapter Summary
Chapter 6. Exploring Data with Reports
  Extracting Airlines as Entities
    Defining Airlines as Groups of Airplanes with PySpark
    Querying Airline Data in MongoDB
    Building an Airline Page in Flask
    Adding Links Back to the Airline Page
    Creating a Home Page Listing All Airlines
  Curating Ontologies of Semi-structured Data
  Improving the Airline Page
    Adding Names to Airline Codes
    Incorporating Wikipedia Content
    Publishing the Enriched Airlines Table to MongoDB
    Enriched Airline Information on the Web
  Investigating Airplanes (Entities)
    SQL Nested Queries vs. Dataflow Programming
    Dataflow Programming Without Nested Queries
    Subqueries in Spark SQL
    Creating an Airplanes Home Page
    Adding Search to the Airplanes Page
    Creating a Bar Chart of Airplane Manufacturers
    Iterating on the Manufacturers Bar Chart
    Entity Resolution: Another Round of Chart Iteration
  Chapter Summary
Chapter 7. Making Predictions
  The Role of Predictions
  Predict What?
  Introduction to Predictive Analytics
    Making Predictions
  Exploring Flight Delays
  Extracting Features with PySpark
  Building a Regression Model with scikit-learn
    Loading the Data
    Sampling the Data
    Vectorizing the Results
    Preparing the Training Data
    Vectorizing the Features
    Sparse vs. Dense Matrices
    Preparing an Experiment
    Training the Model
    Testing the Model
    Summary
  Building a Classifier with Spark MLlib
    Loading Training Data with a Specified Schema
    Handling Nulls
    Replacing FlightNum (Flight Number) with Route
    Bucketizing a Continuous Variable for Classification
    Vectorizing Features with pyspark.ml.feature
    Classification with Spark ML
  Chapter Summary
Chapter 8. Deploying Predictive Systems
  Deploying a scikit-learn Application as a Web Service
    Saving and Loading scikit-learn Models
    Groundwork for Serving the Prediction Model
    Creating an API for the Flight Delay Regression
    Testing the API
    Using the API in Our Product
  Deploying Spark ML Applications in Batch Mode with Airflow
    Gathering Training Data in Production
    Training, Storing, and Loading Spark ML Models
    Creating Prediction Requests in MongoDB
    Fetching Prediction Requests from MongoDB
    Making Predictions in Batch Mode with Spark ML
    Storing Predictions in MongoDB
    Displaying Batch Prediction Results in the Web Application
    Automating the Workflow with Apache Airflow (Incubating)
    Summary
  Deploying Spark ML Applications in Streaming Mode with Spark Streaming
    Gathering Training Data in Production
    Training, Storing, and Loading Spark ML Models
    Sending Prediction Requests to Kafka
    Making Predictions with Spark Streaming
    Testing the Entire System
  Chapter Summary
Chapter 9. Improving Predictions
  Fixing Our Prediction Problem
  When to Improve Predictions
  Improving Prediction Performance
    Experimental Adhesion Method: Find What Sticks
    Establishing Rigorous Metrics for Experiments
    Time of Day as a Feature
    Incorporating Airplane Data
    Extracting Airplane Features
    Incorporating Airplane Features into the Classifier Model
  Incorporating Flight Time
  Chapter Summary
Appendix A. Installation Manual
  Installing Hadoop
  Installing Spark
  Installing MongoDB
  Installing the MongoDB Java Driver
  Installing mongo-hadoop
    Building mongo-hadoop
    Installing pymongo_spark
  Installing Elasticsearch
  Installing the Elasticsearch Hadoop Support Library
  Configuring Our Spark Environment
  Installing Kafka
  Installing scikit-learn
  Installing Zeppelin