Streaming Systems

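Publisher: O'Reilly Media
Author: Tyler Akidau
Producer:
Pages: 352
Translator:
Publication date: 2017-10-25
Price: USD 39.99
Binding: Paperback
ISBN: 9781491983874
Series:
Tags:
  • Streaming computation
  • Big data
  • Distributed
  • Stream computing
  • Computers
  • Databases
  • Software engineering
  • Data mining
  • Streaming Systems
  • Real-time processing
  • Distributed systems
  • Streaming data
  • Microservices
  • Message queues
  • Event-driven
  • High performance
  • Scalable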

Description

Streaming data is a big deal in big data these days. As more and more businesses seek to tame the massive unbounded data sets that pervade our world, streaming systems have finally reached a level of maturity sufficient for mainstream adoption. With this practical guide, data engineers, data scientists, and developers will learn how to work with streaming data in a conceptual and platform-agnostic way.

Expanded from Tyler Akidau’s popular blog posts "Streaming 101" and "Streaming 102", this book takes you from an introductory level to a nuanced understanding of the what, where, when, and how of processing real-time data streams. You’ll also dive deep into watermarks and exactly-once processing with co-authors Slava Chernyak and Reuven Lax.

You’ll explore:

How streaming and batch data processing patterns compare

The core principles and concepts behind robust out-of-order data processing

How watermarks track progress and completeness in infinite datasets

How exactly-once data processing techniques ensure correctness

How the concepts of streams and tables form the foundations of both batch and streaming data processing

The practical motivations behind a powerful persistent state mechanism, driven by a real-world example

How time-varying relations provide a link between stream processing and the world of SQL and relational algebra

About the Authors

Tyler Akidau is a senior staff software engineer at Google, where he is the technical lead for the Data Processing Languages & Systems group, responsible for Google's Apache Beam efforts, Google Cloud Dataflow, and internal data processing tools like Google Flume, MapReduce, and MillWheel. He is also a founding member of the Apache Beam PMC. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the author of the 2015 Dataflow Model paper and the Streaming 101 and Streaming 102 articles on the O’Reilly website. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Slava Chernyak is a senior software engineer at Google Seattle. Slava spent over five years working on Google’s internal massive-scale streaming data processing systems and has since become involved with designing and building Windmill, Google Cloud Dataflow's next-generation streaming backend, from the ground up. Slava is passionate about making massive-scale stream processing available and useful to a broader audience. When he is not working on streaming systems, Slava is out enjoying the natural beauty of the Pacific Northwest.

Reuven Lax is a senior staff software engineer at Google Seattle, and has spent the past nine years helping to shape Google's data processing and analysis strategy. For much of that time he has focused on Google's low-latency, streaming data processing efforts, first as a long-time member and lead of the MillWheel team, and more recently founding and leading the team responsible for Windmill, the next-generation stream processing engine powering Google Cloud Dataflow. He's very excited to bring Google's data-processing experience to the world at large, and proud to have been a part of publishing both the MillWheel paper in 2013 and the Dataflow Model paper in 2015. When not at work, Reuven enjoys swing dancing, rock climbing, and exploring new parts of the world.

Table of Contents
Preface Or: What Are You Getting Yourself Into Here?
Part I The Beam Model
1 Streaming 101
Terminology: What Is Streaming?
On the Greatly Exaggerated Limitations of Streaming
Event Time Versus Processing Time
Data Processing Patterns
Bounded Data
Unbounded Data: Batch
Unbounded Data: Streaming
Summary
2 The What, Where, When, and How of Data Processing
Roadmap
Batch Foundations: What and Where
What: Transformations
Where: Windowing
Going Streaming: When and How
When: The Wonderful Thing About Triggers Is Triggers Are Wonderful Things!
When: Watermarks
When: Early/On-Time/Late Triggers FTW!
When: Allowed Lateness (i.e., Garbage Collection)
How: Accumulation
Summary
3 Watermarks
Definition
Source Watermark Creation
Perfect Watermark Creation
Heuristic Watermark Creation
Watermark Propagation
Understanding Watermark Propagation
Watermark Propagation and Output Timestamps
The Tricky Case of Overlapping Windows
Percentile Watermarks
Processing-Time Watermarks
Case Studies
Case Study: Watermarks in Google Cloud Dataflow
Case Study: Watermarks in Apache Flink
Case Study: Source Watermarks for Google Cloud Pub/Sub
Summary
4 Advanced Windowing
When/Where: Processing-Time Windows
Event-Time Windowing
Processing-Time Windowing via Triggers
Processing-Time Windowing via Ingress Time
Where: Session Windows
Where: Custom Windowing
Variations on Fixed Windows
Variations on Session Windows
One Size Does Not Fit All
Summary
5 Exactly-Once and Side Effects
Why Exactly Once Matters
Accuracy Versus Completeness
Side Effects
Problem Definition
Ensuring Exactly Once in Shuffle
Addressing Determinism
Performance
Graph Optimization
Bloom Filters
Garbage Collection
Exactly Once in Sources
Exactly Once in Sinks
Use Cases
Example Source: Cloud Pub/Sub
Example Sink: Files
Example Sink: Google BigQuery
Other Systems
Apache Spark Streaming
Apache Flink
Summary
Part II Streams and Tables
6 Streams and Tables
Stream-and-Table Basics Or: a Special Theory of Stream and Table Relativity
Toward a General Theory of Stream and Table Relativity
Batch Processing Versus Streams and Tables
A Streams and Tables Analysis of MapReduce
Reconciling with Batch Processing
What, Where, When, and How in a Streams and Tables World
What: Transformations
Where: Windowing
When: Triggers
How: Accumulation
A Holistic View of Streams and Tables in the Beam Model
A General Theory of Stream and Table Relativity
Summary
7 The Practicalities of Persistent State
Motivation
The Inevitability of Failure
Correctness and Efficiency
Implicit State
Raw Grouping
Incremental Combining
Generalized State
Case Study: Conversion Attribution
Conversion Attribution with Apache Beam
Summary
8 Streaming SQL
What Is Streaming SQL?
Relational Algebra
Time-Varying Relations
Streams and Tables
Looking Backward: Stream and Table Biases
The Beam Model: A Stream-Biased Approach
The SQL Model: A Table-Biased Approach
Looking Forward: Toward Robust Streaming SQL
Stream and Table Selection
Temporal Operators
Summary
9 Streaming Joins
All Your Joins Are Belong to Streaming
Unwindowed Joins
Full Outer
Left Outer
Right Outer
Inner
Anti
Semi
Windowed Joins
Fixed Windows
Temporal Validity
Summary
10 The Evolution of Large-Scale Data Processing
MapReduce
Hadoop
Flume
Storm
Spark
MillWheel
Kafka
Cloud Dataflow
Flink
Beam
Summary
Index

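Reader Reviews

I did not read the Streaming SQL chapter closely and will come back to study it later. As for stream processing, this book explains it very thoroughly, moving steadily from data (bounded data vs. unbounded data, stream vs. table) to computation (batch vs. streaming, window/trigger/accumulation), at times to the point of feeling long-winded, haha. After finishing it, learning stream processing frameworks will be very...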

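User Comments

Long-winded and not rich in content, but at least it is fairly new. An approachable book.

I seem to be getting slower at digesting new material. I strongly suggest the author reorder the chapters to cover the system first and streaming second. While reading the first few chapters I had a strong feeling that the author didn't assume that I know nothing.

Mediocre and rather boring.

From five stars while reading to four stars after finishing; it really took me a long time to get through. From a stream processing perspective, it is an introductory, popular-science style textbook that introduces the important concepts of stream processing. For people who study stream processing, it offers excellent abstraction and summary. For ordinary readers, it is somewhat too highbrow to find a wide audience.

I don't know why this book suddenly became popular, but for the vast majority of programmers, this book https://book.douban.com/subject/25971366/ is more useful (even if it is a bit dated).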
