The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing | Akidau et al. (Google)

Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, Sam Whittle; The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing; In Proceedings of the Conference on Very Large Data Bases (VLDB), Volume 8, Number 12; 2015-08-31; 12 pages; Google, paywall

Abstract

Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves, in addition to an insatiable hunger for faster answers. Meanwhile, practicality dictates that one can never fully optimize along all dimensions of correctness, latency, and cost for these types of input. As a result, data processing practitioners are left with the quandary of how to reconcile the tensions between these seemingly competing propositions, often resulting in disparate implementations and systems.

We propose that a fundamental shift of approach is necessary to deal with these evolved requirements in modern data processing. We as a field must stop trying to groom unbounded datasets into finite pools of information that eventually become complete, and instead live and breathe under the assumption that we will never know if or when we have seen all of our data, only that new data will arrive, old data may be retracted, and the only way to make this problem tractable is via principled abstractions that allow the practitioner the choice of appropriate tradeoffs along the axes of interest: correctness, latency, and cost.

In this paper, we present one such approach, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.

References

  1. Daniel J. Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, Stan Zdonik. Aurora: a new model and architecture for data stream management, In The VLDB Journal — The International Journal on Very Large Data Bases, v.12 n.2, p.120-139, 2003-08.[doi:10.1007/s00778-003-0095-z]
  2. Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle, MillWheel: fault-tolerant stream processing at internet scale, In Proceedings of the VLDB Endowment, v.6 n.11, p.1033-1044, 2013-08.[doi:10.14778/2536222.2536229]
  3. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, Daniel Warneke, The Stratosphere platform for big data analytics, The VLDB Journal — The International Journal on Very Large Data Bases, v.23 n.6, p.939-964, 2014-12.[doi:10.1007/s00778-014-0357-y]
  4. Apache. Apache Hadoop, 2012.
  5. Apache. Apache Storm, 2013.
  6. Apache. Apache Flink, 2014.
  7. Apache. Apache Samza, 2014.
  8. R. S. Barga et al. Consistent Streaming Through Time: A Vision for Event Stream Processing. In Proceedings of the Third Biennial Conference on Innovative Data Systems Research (CIDR), pages 363–374, 2007.
  9. Irina Botan, Roozbeh Derakhshan, Nihal Dindar, Laura Haas, Renée J. Miller, Nesime Tatbul, SECRET: a model for analysis of the execution semantics of stream processing systems, In Proceedings of the VLDB Endowment, v.3 n.1-2, 2010-09.[doi:10.14778/1920841.1920874]
  10. Oscar Boykin, Sam Ritchie, Ian O’Connell, Jimmy Lin, Summingbird: a framework for integrating batch and online MapReduce computations, In Proceedings of the VLDB Endowment, v.7 n.13, p.1441-1451, 2014-08.[doi:10.14778/2733004.2733016]
  11. Cask. , 2015.
  12. Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, Nathan Weizenbaum, FlumeJava: easy, efficient data-parallel pipelines, In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, 2010-06-05 → 2010-06-10, Toronto, Ontario, Canada. [doi:10.1145/1806596.1806638]
  13. B. Chandramouli et al. Trill: A High-Performance Incremental Query Processor for Diverse Analytics. In Proceedings of the 41st International Conference on Very Large Data Bases (VLDB), 2015.
  14. Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R. Madden, Fred Reiss, Mehul A. Shah, TelegraphCQ: continuous dataflow processing, In Proceedings of the 2003 ACM International Conference on Management of Data (SIGMOD), 2003-06-09 → 2003-06-12, San Diego, California. [doi:10.1145/872757.872857]
  15. Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang, NiagaraCQ: a scalable continuous query system for Internet databases, In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD), p.379-390, 2000-05-15 → 2000-05-18, Dallas, Texas, USA. [doi:10.1145/342009.335432]
  16. Jeffrey Dean, Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI), p.10-10, 2004-12-06 → 2004-12-08, San Francisco, CA
  17. EsperTech. Esper, 2006.
  18. Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava, Building a high-level dataflow system on top of Map-Reduce: the Pig experience, In Proceedings of the VLDB Endowment, v.2 n.2, 2009-08. [doi:10.14778/1687553.1687568]
  19. Google. Dataflow SDK, 2015.
  20. Google. Google Cloud Dataflow. 2015.
  21. Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk, Oliver Spatscheck, A heartbeat mechanism and its application in gigascope, In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), 2005-08-30 → 2005-09-02, Trondheim, Norway
  22. Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker, Semantics and evaluation techniques for window aggregates in data streams, In Proceedings of the 2005 ACM International Conference on Management of Data (SIGMOD), 2005-06-14 → 2005-06-16, Baltimore, Maryland. [doi:10.1145/1066157.1066193]
  23. Jin Li, Kristin Tufte, Vladislav Shkapenyuk, Vassilis Papadimos, Theodore Johnson, David Maier, Out-of-order processing: a new architecture for high-performance stream systems, In Proceedings of the VLDB Endowment, v.1 n.1, 2008-08. [doi:10.14778/1453856.1453890]
  24. David Maier, Jin Li, Peter Tucker, Kristin Tufte, Vassilis Papadimos, Semantics of Data streams and operators, In Proceedings of the 10th International Conference on Database Theory, 2005-01-05 → 2005-01-07, Edinburgh, UK. [doi:10.1007/978-3-540-30570-5_3]
  25. N. Marz. How to beat the CAP theorem, In His Blog. 2011.
  26. S. Murthy et al. Pulsar — Real-Time Analytics at Scale. Technical report, eBay, 2015.
  27. SQLStream, 2015.
  28. Utkarsh Srivastava, Jennifer Widom, Flexible time management in data stream systems, In Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2004-06-14 → 2004-06-16, Paris, France. [doi:10.1145/1055558.1055596]
  29. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy, Hive: a warehousing solution over a map-reduce framework, In Proceedings of the VLDB Endowment, v.2 n.2, 2009-08. [doi:10.14778/1687553.1687609]
  30. Peter A. Tucker, David Maier, Tim Sheard, Leonidas Fegaras, Exploiting Punctuation Semantics in Continuous Data Streams, In IEEE Transactions on Knowledge and Data Engineering, v.15 n.3, p.555-568, 2003-03. [doi:10.1109/TKDE.2003.1198390]
  31. James Whiteneck, Kristin Tufte, Amit Bhat, David Maier, Rafael J. Fernández-Moctezuma, Framing the question: detecting and filling spatial-temporal windows, In Proceedings of the ACM SIGSPATIAL International Workshop on GeoStreaming, p.19-22, 2010-11-02, San Jose, California. [doi:10.1145/1878500.1878506]
  32. F. Yang et al. Sonora: A Platform for Continuous Mobile-Cloud Computing. Technical Report MSR-TR-2012-34, Microsoft Research Asia.
  33. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing, In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2012-03-25 → 2012-03-27, San Jose, CA
  34. Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica, Discretized streams: fault-tolerant streaming computation at scale, In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP), 2013-11-03 → 2013-11-06, Farmington, Pennsylvania. [doi:10.1145/2517349.2522737]
