The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing | Akidau et al. (Google)

Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, Sam Whittle; The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing; In Proceedings of the Conference on Very Large Data Bases (VLDB), Volume 8, Number 12; 2015-08-31; 12 pages; Google, paywall


Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves, in addition to an insatiable hunger for faster answers. Meanwhile, practicality dictates that one can never fully optimize along all dimensions of correctness, latency, and cost for these types of input. As a result, data processing practitioners are left with the quandary of how to reconcile the tensions between these seemingly competing propositions, often resulting in disparate implementations and systems.

We propose that a fundamental shift of approach is necessary to deal with these evolved requirements in modern data processing. We as a field must stop trying to groom unbounded datasets into finite pools of information that eventually become complete, and instead live and breathe under the assumption that we will never know if or when we have seen all of our data, only that new data will arrive, old data may be retracted, and the only way to make this problem tractable is via principled abstractions that allow the practitioner the choice of appropriate tradeoffs along the axes of interest: correctness, latency, and cost.

In this paper, we present one such approach, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.
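
One way to picture the abstract's two central ideas, windowing by event time and never treating a result as final, is the small sketch below. It is an illustration only, not the Dataflow SDK API: it has no watermarks or triggers, and the names `WINDOW_SECONDS`, `window_start`, and `process` are hypothetical. Records arrive out of order; each one is assigned to a fixed window by its own timestamp, and every arrival emits a refined sum for that window.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed-window size, chosen for the example


def window_start(event_time: int) -> int:
    """Map an event timestamp (seconds) to the start of its fixed window."""
    return event_time - (event_time % WINDOW_SECONDS)


def process(arrivals):
    """Consume (event_time, value) pairs in ARRIVAL order.

    After every record, yield (window_start, running_sum) so that a late
    record refines a window result that was already emitted earlier.
    """
    sums = defaultdict(int)
    for event_time, value in arrivals:
        w = window_start(event_time)
        sums[w] += value
        yield (w, sums[w])


# Arrival order differs from event-time order: the records stamped 50 and 10
# show up after records that belong to the later window.
arrivals = [(5, 1), (65, 10), (50, 2), (70, 20), (10, 4)]
updates = list(process(arrivals))
for window, running_sum in updates:
    print(f"window [{window}, {window + WINDOW_SECONDS}) -> {running_sum}")
```

Emitting on every record is the lowest-latency, highest-cost end of the tradeoff the abstract names; a real system would let the practitioner decide when to emit and how long to keep accepting late data.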


  1. Daniel J. Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, Stan Zdonik. Aurora: a new model and architecture for data stream management, In The VLDB Journal — The International Journal on Very Large Data Bases, v.12 n.2, p.120-139, 2003-08.[doi:10.1007/s00778-003-0095-z]
  2. Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle, MillWheel: fault-tolerant stream processing at internet scale, In Proceedings of the VLDB Endowment, v.6 n.11, p.1033-1044, 2013-08.[doi:10.14778/2536222.2536229]
  3. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, Daniel Warneke, The Stratosphere platform for big data analytics, The VLDB Journal — The International Journal on Very Large Data Bases, v.23 n.6, p.939-964, 2014-12.[doi:10.1007/s00778-014-0357-y]
  4. Apache. Apache Hadoop, 2012.
  5. Apache. Apache Storm, 2013.
  6. Apache. Apache Flink, 2014.
  7. Apache. Apache Samza, 2014.
  8. R. S. Barga et al. Consistent Streaming Through Time: A Vision for Event Stream Processing. In Proceedings of the Third Biennial Conference on Innovative Data Systems Research (CIDR), pages 363–374, 2007.
  9. Irina Botan, Roozbeh Derakhshan, Nihal Dindar, Laura Haas, Renée J. Miller, Nesime Tatbul, SECRET: a model for analysis of the execution semantics of stream processing systems, In Proceedings of the VLDB Endowment, v.3 n.1-2, 2010-09.[doi:10.14778/1920841.1920874]
  10. Oscar Boykin, Sam Ritchie, Ian O’Connell, Jimmy Lin, Summingbird: a framework for integrating batch and online MapReduce computations, In Proceedings of the VLDB Endowment, v.7 n.13, p.1441-1451, 2014-08.[doi:10.14778/2733004.2733016]
  11. Cask. , 2015.
  12. Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, Nathan Weizenbaum, FlumeJava: easy, efficient data-parallel pipelines, In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, 2010-06-05 → 2010-06-10 (five days!!!), Toronto, Ontario, Canada. [doi:10.1145/1806596.1806638]
  13. B. Chandramouli et al. Trill: A High-Performance Incremental Query Processor for Diverse Analytics. In Proceedings of the 41st International Conference on Very Large Data Bases (VLDB), 2015.
  14. Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R. Madden, Fred Reiss, Mehul A. Shah, TelegraphCQ: continuous dataflow processing, In Proceedings of the 2003 ACM International Conference on Management of Data (SIGMOD), 2003-06-09 → 2003-06-12, San Diego, California. [doi:10.1145/872757.872857]
  15. Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang, NiagaraCQ: a scalable continuous query system for Internet databases, In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD), p.379-390, 2000-05-15 → 2000-05-18, Dallas, Texas, USA. [doi:10.1145/342009.335432]
  16. Jeffrey Dean, Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI), p.10-10, 2004-12-06 → 2004-12-08, San Francisco, CA
  17. EsperTech. Esper, 2006.
  18. Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava, Building a high-level dataflow system on top of Map-Reduce: the Pig experience, In Proceedings of the VLDB Endowment, v.2 n.2, 2009-08. [doi:10.14778/1687553.1687568]
  19. Google. Dataflow SDK, 2015.
  20. Google. Google Cloud Dataflow. 2015.
  21. Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk, Oliver Spatscheck, A heartbeat mechanism and its application in gigascope, In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), 2005-08-30 → 2005-09-02, Trondheim, Norway
  22. Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker, Semantics and evaluation techniques for window aggregates in data streams, In Proceedings of the 2005 ACM International Conference on Management of Data (SIGMOD), 2005-06-14 → 2005-06-16, Baltimore, Maryland. [doi:10.1145/1066157.1066193]
  23. Jin Li, Kristin Tufte, Vladislav Shkapenyuk, Vassilis Papadimos, Theodore Johnson, David Maier, Out-of-order processing: a new architecture for high-performance stream systems, In Proceedings of the VLDB Endowment, v.1 n.1, 2008-08. [doi:10.14778/1453856.1453890]
  24. David Maier, Jin Li, Peter Tucker, Kristin Tufte, Vassilis Papadimos, Semantics of Data streams and operators, In Proceedings of the 10th International Conference on Database Theory, 2005-01-05 → 2005-01-07, Edinburgh, UK. [doi:10.1007/978-3-540-30570-5_3]
  25. N. Marz. How to beat the CAP theorem, In His Blog. 2011.
  26. S. Murthy et al. Pulsar — Real-Time Analytics at Scale. Technical report, eBay, 2015.
  27. SQLStream, 2015.
  28. Utkarsh Srivastava, Jennifer Widom, Flexible time management in data stream systems, In Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2004-06-14 → 2004-06-16, Paris, France. [doi:10.1145/1055558.1055596]
  29. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy, Hive: a warehousing solution over a map-reduce framework, In Proceedings of the VLDB Endowment, v.2 n.2, 2009-08. [doi:10.14778/1687553.1687609]
  30. Peter A. Tucker, David Maier, Tim Sheard, Leonidas Fegaras, Exploiting Punctuation Semantics in Continuous Data Streams, In IEEE Transactions on Knowledge and Data Engineering, v.15 n.3, p.555-568, 2003-03. [doi:10.1109/TKDE.2003.1198390]
  31. James Whiteneck, Kristin Tufte, Amit Bhat, David Maier, Rafael J. Fernández-Moctezuma, Framing the question: detecting and filling spatial-temporal windows, In Proceedings of the ACM SIGSPATIAL International Workshop on GeoStreaming, p.19-22, 2010-11-02 → 2010-11-02, San Jose, California. [doi:10.1145/1878500.1878506]
  32. F. Yang and others. Sonora: A Platform for Continuous Mobile-Cloud Computing. Technical Report MSR-TR-2012-34, Microsoft Research Asia.
  33. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing, In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2012-03-25 → 2012-03-27, San Jose, CA
  34. Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica, Discretized streams: fault-tolerant streaming computation at scale, In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP), 2013-11-03 → 2013-11-06, Farmington, Pennsylvania. [doi:10.1145/2517349.2522737]

Previously filled.

Living on Fumes: Digital Footprints, Data Fumes, and the Limitations of Spatial Big Data | Jim Thatcher

Jim Thatcher (Clark University); Living on Fumes: Digital Footprints, Data Fumes, and the Limitations of Spatial Big Data; In International Journal of Communications (IJC); Volume 8; 2014; 19 pages; landing; previously in Proceedings of the 26th International Cartographic Conference (ICC), 2014.

tl;dr → whereas capitalism is bad, the critical theory: sociotechnical, epistemic project, abductive processes, epistemic limits, epistemic and ontological commitments, capitalist profit motives, private corporations; frameworks of Marcuse, Pickles. You get the idea.


Amid the continued rise of big data in both the public and private sectors, spatial information has come to play an increasingly prominent role. This article defines big data as both a sociotechnical and epistemic project with regard to spatial information. Through interviews, job shadowing, and a review of current literature, both academic researchers and private companies are shown to approach spatial big data sets in analogous ways. Digital footprints and data fumes, respectively, describe a process that inscribes certain meaning into quantified spatial information. Social and economic limitations of this data are presented. Finally, the field of geographic information science is presented as a useful guide in dealing with the “hard work of theory” necessary in the big data movement.


  • In the introductory paragraph, cites opinements in Fast Company and Mashable as authoritative directional indicators.
  • Two problems
    1. <quote>On the one hand, rather than fully capturing life as researchers hope, end-user interactions within big data are necessarily the result of decisions made by an extremely small group of programmers working for private corporations that have [been] promulgated through the mobile application ecosystem.
    2. On the other hand, in accepting that the data gathered through mobile applications reveal meaningful information about the world, researchers are tacitly accepting a commodification and quantification of knowledge.</quote>
  • Big Data is
    • (wait for it …) very big, “large” even.
    • <quote>data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time</quote>, Adam Jacobs.
    • Contrarianism
      • Something vague about Taylorism, Max Weber, etc.
      • Something vague about how having more data is better, or is not better.
    • The Fourth Paradigm
      1. empiricism
      2. analysis
      3. simulation.
      4. explore & exploit
    • Sources
      <quote>Most current studies describing themselves as “big data” with a spatial component revolve around two mobile software platforms [Foursquare, Twitter]</quote>

      • Foursquare
      • Twitter
      • Facebook
      • Flickr
  • Types of Data [plural of types of Datum(s)]
    • Checkin
    • Tweet
  • Livehoods
  • 25% of Foursquare users link their Twitter accounts (75% don’t)
  • <quote>Finally, the reliance upon data generated with an explicit motive for profit — both for the end user and the corporation—results in epistemological commitments not dissimilar to concerns raised with regard to the knowledges and approaches privileged by GIS use. </quote>
  • <quote>This hard work of theory opens new knowledge projects within the realm of big data. For example, if the check-in is viewed as a form of disciplining technology — one that reports location to enmesh it more fully in capitalist exchange — then purposeful location fraud takes on new meaning as a potential form of resistance or protest.</quote>


  • private companies
  • profit motives
  • capitalism


  • Digital footprints
  • Data fumes


  • PostgreSQL
  • R
  • Mac (OS)


  • Anderson, C. (2008-06-23). The end of theory: The data deluge makes the scientific method obsolete. Wired.
  • Baker, S. (2012-01-05). Can social media sell soap? The New York Times.
  • Batty, M. (2012). Smart cities, big data. Environment and Planning B, 39, 191–193.
  • Benner, J., & Robles, C. (2012). Trending on Foursquare: Examining the location and categories of venues that trend in three cities. In Proceedings of the Workshop on GIScience in the Big Data Age 2012 (pp. 27–35). Columbus, Ohio.
  • Berry, D. M. (2011). The philosophy of software: Code and mediation in the digital age. London, UK: Palgrave Macmillan.
  • boyd, d., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679.
  • Brownlee, J. (2012-03-30). This creepy app isn’t just stalking women without their knowledge, it’s a wake-up call about Facebook privacy. In Cult of Mac.
  • Burgess, J., & Bruns, A. (2012). Twitter archives and the challenges of “big social data” for media and communication research. M/C Journal, 15(5).
  • Carbunar, B., & Potharaju, R. (2012). You unlocked the Mt. Everest badge on Foursquare! Countering location fraud in geosocial networks. In Proceedings of the 2012 IEEE 9th International Conference on Mobile Ad-Hoc and Sensor Systems (MASS), pages 182-190. IEEE Computer Society, Washington, DC.
  • Cerrato, P. (2012-11-01). Big data analytics: Where’s the ROI? InformationWeek: Healthcare.
  • Cheng, Z., Caverlee, J., Lee, K., & Sui, D. (2011). Exploring millions of footprints in location sharing services. In Proceedings of the Fifth International AAAI Conference on WSM. Barcelona, Spain.
  • Crampton, J. (2003). The political mapping of cyberspace. Edinburgh, Scotland: Edinburgh University Press.
  • Crampton, J. (2013). Commentary: Is security sustainable? Environment and Planning D, 31, 571–577.
  • Cranshaw, J., Schwartz, R., Hong, J., & Sadeh, N. (2012). The Livehoods Project: Utilizing social media to understand the dynamics of a city. In Proceedings of the Sixth International AAAI Conference on WSM. Dublin, Ireland.
  • Curry, M. (1997). The digital individual and the private realm. Annals of the AAG, 87, 681–699.
  • DeLyser, D., & Sui, D. (2013). Crossing the qualitative-quantitative divide II: Inventive approaches to big data, mobile methods, and rhythmanalysis. Progress in Human Geography, 37(2), 293–305.
  • Eckert, J., & Hemsley, J. (2013-04-11). Occupied Geographies, Relational or Otherwise. Presentation to the American Association of Geographers, Los Angeles, CA.
  • Exner, J., Zeile, P., & Streich, B. (2011). Urban monitoring laboratory: New benefits and potential for urban planning through the use of urban sensing, geo- and mobile-web. In Proceedings of Real CORP 2011. pages 1087–1096. Wien, Austria.
  • Farmer, C., & Pozdnoukhov, A. (2012). Building streaming GIScience from context, theory, and intelligence. In Proceedings of the Workshop on GIScience in the Big Data Age 2012. pages 5–10. Columbus, Ohio.
  • Goodchild, M. (1992). Geographical information science. In International Journal of Geographical Information Systems, 6, 31–45.
  • Goodchild, M. (2007). Citizens as sensors: The world of volunteered geography. In GeoJournal, 69(4), 211–221.
  • Goodchild, M., & Glennon, J. A. (2010). Crowdsourcing geographic information for disaster response: A research frontier. In International Journal of Digital Earth, 3(3), 231–241.
  • Harley, J. (1989). Deconstructing the map. In Cartographica, 26, 1–20.
  • Hecht, B., Hong, L., Suh, B., & Chi, E. (2011). Tweets from Justin Bieber’s heart. In Proceedings of the ACM CHI Conference 2011. pages 237–246. Vancouver, BC.
  • Heidegger, M. (1977). The question concerning technology and other essays. W. Lovitt, Translator. New York, NY: Harper Perennial.
  • Hey, T., Tansley, S., & Tolle, K. (Eds.). (2009). The fourth paradigm: Data-intensive scientific discovery. Redmond, WA: Microsoft Research.
  • Horvath, I. (2012). Beyond advanced mechatronics: New design challenges of social-cyber systems. (Draft paper.) In Proceedings of the ACM Workshop on Mechatronic Design, Linz 2012. Linz, Austria
  • Jacobs, A. (2009). The pathologies of big data. In ACM Queue, 7(6), pages 1–12.
  • Joseph, K., Tan, C., & Carley, K. (2012). Beyond “local,” “categories” and “friends”: Clustering Foursquare users with latent “topics.” In Proceedings of ACM Ubicomp 2012. pages 919–926. Pittsburgh, PA.
  • Kingsbury, P., & Jones III, J. P. (2009). Walter Benjamin’s Dionysian adventures on Google earth. In Geoforum, 40, 502–513.
  • Kitchin, R., & Dodge, M. (2007). Rethinking maps. In Progress in Human Geography, 31, 331–344.
  • Kling, F., & Pozdnoukhov, A. (2012). When a city tells a story: Urban topic analysis. In Proceedings of ACM SIGSPATIAL 2012. pages 482–485. Redondo Beach, CA.
  • Lathia, N., Quercia, D., & Crowcroft, J. (2012). The hidden image of the city: Sensing community well-being from urban mobility. In Pervasive Computing, 7319, 91–98.
  • Laurila, J., Gatica-Perez, D., Aad, I., Blom, J., Bornet, O., Do, T., Dousse, O., Eberle, J., & Miettinen, M. (2012). The mobile big data challenge. Nokia Research.
  • Livehoods, demonstrator & promotional site. (2012).
  • Lohr, S. (2012-12-29). Sure, big data is great. But so is intuition. The New York Times.
  • Long, X., Jin, L., & Joshi, J. (2012). Exploring trajectory-driven local geographic topics in Foursquare. In Proceedings of ACM Ubicomp 2012. pages 927–934. Pittsburgh, PA.
  • Marcuse, H. (1982 [1941]). Some social implications of modern technology. In A. Arato & E. Gebhardt (Eds.), The essential Frankfurt School reader. pages 138–162. New York, NY: Continuum.
  • Martino, M., Britter, R., Outram, C., Zacharias, C., Biderman, A., & Ratti, C. (2010). Senseable city. Cambridge, MA: MIT Senseable City Lab.
  • Mayer-Schonberger, V., & Cukier, K. (2013). Big Data: A revolution that will transform how we live, work, and think. New York, NY: Houghton Mifflin Harcourt.
  • Mitchell, J. (2012-04-10). Life after death of the check-in. In ReadWrite.
  • National Science Foundation. (2012-10-03). NSF announces interagency progress on administration’s big data initiative. Press release.
  • Noulas, A., Scellato, S., Mascolo, C., & Pontil, M. (2011). An empirical study of geographic user activity patterns in Foursquare. In Proceedings of the Fifth International AAAI Conference on WSM. pages 570–573. Barcelona, Spain.
  • Obermeyer, N. (1995). The hidden GIS technocracy. In Cartography and Geographic Information Science, 22(1), 78–83.
  • O’Sullivan, D. (2006). Geographical information science: Critical GIS. Progress in Human Geography, 30(6), 783–791.
  • Paulos, E., Honicky, R. J., & Hooker, B. (2008). Citizen science: Enabling Participatory Urbanism. In M. Foth (Ed.), Handbook of Research on Urban Informatics: The Practice and Promise of the Real-Time City. pages 414–436. Hershey, PA: Information Science Reference.
  • Pickles, J. (1993). Discourse on method and the History of Discipline: Reflections on Jerome Dobson’s 1993 “Automated geography.” In Professional Geographer, 45, 451–455.
  • Pickles, J. (1995). Ground Truth. New York, NY: Guilford Press.
  • Pickles, J. (1997). Tool or Science? GIS, Technoscience and the Theoretical Turn. In Annals of the AAG, 87, pages 363–372.
  • Presley, S. (2011). Mapping out #LondonRiots. In NFPvoice.
  • Rasmus, D. (2012-01-27). Why big data won’t make you smart, rich, or pretty. In Fast Company.
  • Sakaki, T., Okazaki, M., & Matsuo, Y. (2010). Earthquake shakes Twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web (WWW). pages 851–860. Raleigh, NC.
  • A sea of sensors. staff; (2010-11-04). In The Economist.
  • Sheppard, E. (1993). Automated geography: What kind of geography for what kind of society? In Professional Geographer, 45, 457–460.
  • Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or die. Hoboken, NJ: John Wiley.
  • Sui, D. (2008). The wikification of GIS and its consequences: Or Angelina Jolie’s new tattoo and the future of GIS. In Computers, Environment and Urban Systems, 32, 1–5.
  • Taylor, C. (2012-11-07). Triumph of the nerds: Nate Silver wins in 50 states. In Mashable.
  • Thatcher, J. (2013-12). Avoiding the Ghetto through hope and fear: An analysis of immanent technology using ideal types. In GeoJournal. Volume 78, Issue 6, pages 967-980. paywall.
  • Thompson, C. (2012-05-10). Foursquare alters API to eliminate apps like Girls Around Me. In AboutFoursquare.
  • Twitter. (2012). Streaming API request parameters. API Documentation.
  • Weber, M. (1973 [1946]). From Max Weber (C. Mills & H. Gerth, Eds.). New York, NY: Oxford University Press.
  • Wilson, M. (2012). Location-based services, conspicuous mobility, and the location-aware future. In Geoforum, 43(6), 1266–1275.
  • Wright, D., Goodchild, M., & Proctor, D. (1997). Still hoping to turn that theoretical corner. In Annals of the AAG, 87(2), 373.
  • Xu, S., Flexner, S., & Carvalho, V. (2012). Geocoding billions of addresses: Towards a spatial record linkage system with big data. In Proceedings of the Workshop on GIScience in the Big Data Age 2012. pages 17–26. Columbus, Ohio.


Via: backfill.

GraphX: Graph Processing in a Distributed Dataflow Framework | Gonzalez, Xin, Dave, Crankshaw, Franklin, Stoica

Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, Ion Stoica; GraphX: Graph Processing in a Distributed Dataflow Framework; In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI); 2014-10-06; 16 pages; landing.


In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distributed dataflow frameworks, GraphX brings low-cost fault tolerance to graph processing. We evaluate GraphX on real workloads and demonstrate that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.
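
The abstract's claim that graph computation can be expressed with a few basic dataflow operators (join, map, group-by) can be sketched with PageRank, one of the applications noted for this paper. This is plain Python over an edge list, not the GraphX or Spark API; the function name `pagerank`, the default parameters, and the toy graph are assumptions made for the example, and it presumes every vertex has at least one outgoing edge.

```python
from collections import defaultdict


def pagerank(edges, num_iters=20, damping=0.85):
    """Unnormalized PageRank written as join / map / group-by steps.

    edges: list of (src, dst) pairs. Assumes every vertex has an out-edge.
    """
    vertices = {v for edge in edges for v in edge}
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        # join: pair each edge with the current rank of its source vertex
        # map:  turn the pair into a (dst, contribution) record
        contribs = [(dst, ranks[src] / out_degree[src]) for src, dst in edges]

        # group-by: sum the contributions arriving at each destination
        sums = defaultdict(float)
        for dst, contribution in contribs:
            sums[dst] += contribution

        # map: apply the damping formula to every vertex
        ranks = {v: (1 - damping) + damping * sums[v] for v in vertices}
    return ranks


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(edges)
for vertex in sorted(ranks):
    print(vertex, round(ranks[vertex], 3))
```

Each iteration touches the whole edge list; the graph-specific optimizations the abstract mentions (recast as join optimizations and materialized view maintenance) are what a real system would add on top of this naive form.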


  • Alternates
    • Pregel
    • PowerGraph
    • MapReduce
    • Spark
    • Dryad
    • Naiad; Microsoft
    • DryadLINQ; Microsoft
    • Pig
    • GraphLINQ, within Naiad; Microsoft
    • Apache Spark
    • Apache Giraph
    • GraphLab (PowerGraph)
  • Applications
    • PageRank
    • community detection
    • latent factor analysis


  • property graph
  • power-law distributions
  • |E| >> |V|
  • vertex program (message passing to other vertices)
  • Bulk Synchronous Parallel (BSP)
  • Gather Apply Scatter (GAS) – decomposition, optimization
    • pull-based model
    • no direct communication between unconnected vertices
  • Compressed Sparse Row (CSR)
  • Map Reduce Triplets (MRT)
  • Graph Partitioning
  • Mirror Vertices
  • Active Vertices
  • Resilient Distributed Datasets (RDD)
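
As a reminder of what the Compressed Sparse Row (CSR) entry above refers to, here is a minimal sketch of building a CSR adjacency structure from an edge list. It is a generic illustration, not code from GraphX; the names `build_csr` and `neighbors` and the integer vertex ids are assumptions for the example.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays from (src, dst) pairs with ids in 0..num_vertices-1.

    Returns (row_offsets, col_indices): the out-neighbors of vertex v are
    col_indices[row_offsets[v]:row_offsets[v + 1]].
    """
    # Count out-edges per source vertex.
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1

    # A prefix sum of the counts gives the start of each vertex's slice.
    row_offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row_offsets[v + 1] = row_offsets[v] + counts[v]

    # Fill each slice, tracking the next free slot per source vertex.
    col_indices = [0] * len(edges)
    cursor = row_offsets[:-1].copy()
    for src, dst in edges:
        col_indices[cursor[src]] = dst
        cursor[src] += 1
    return row_offsets, col_indices


def neighbors(row_offsets, col_indices, v):
    """Return the out-neighbors of vertex v as a list."""
    return col_indices[row_offsets[v]:row_offsets[v + 1]]


edges = [(0, 1), (0, 2), (2, 0), (1, 2)]
row_offsets, col_indices = build_csr(3, edges)
print(row_offsets, col_indices)
print(neighbors(row_offsets, col_indices, 0))
```

The layout stores one offset per vertex plus one entry per edge, which suits graphs where |E| >> |V|, and a vertex's neighbors sit in one contiguous slice.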


  • GraphChi
  • X-Stream
  • CombBLAS
  • Resource Description Framework (RDF)


  • D. J. Abadi, A. Marcus, S. R. Madden, K. Hollenbach; SW-Store: A Vertically Partitioned DBMS for Semantic Web Data Management; In Proceedings of the Conference on Very Large Data Bases (VLDB); Volume 18, Number 2; 2009; pages 385–406.
  • F. N. Afrati, J. D. Ullman; Optimizing Joins in a Map-Reduce Environment; In Proceedings of the International Conference on Extending Database Technology (EDBT); 2010; pages 99–110.
  • S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, Y. A. Tian; A Comparison of Join Algorithms for Log Processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD); ACM; 2010; pages 975–986.
  • P. Boldi, M. Rosa, M. Santini, S. Vigna; Layered Label Propagation: A Multiresolution Coordinate-Free Ordering for Compressing Social Networks; In Proceedings of the Conference on the World Wide Web (WWW); 2011; pages 587–596.
  • P. Boldi, S. Vigna; The WebGraph Framework I: Compression Techniques. In Proceedings of the Conference on the World Wide Web (WWW); 2004.
  • J. Broekstra, A. Kampman, F.V. Harmelen; Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In Proceedings of the First International Semantic Web Conference (ISWC); 2002; pages 54–68.
  • A. Buluç, J.R. Gilbert, The Combinatorial BLAS: Design, Implementation and Applications. In International Journal of High Performance Computing Applications (IJHPCA); Volume 25, Number 4; 2011; pages 496–509.
  • Ü. V. Çatalyürek, C. Aykanat, B. Uçar; On Two-Dimensional Sparse Matrix Partitioning: Models, Methods and a Recipe. In SIAM Journal on Scientific Computing; Volume 32, Number 2; 2010; pages 656–683.
  • R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, E. Chen; Kineograph: Taking the Pulse of a Fast-Changing and Connected World. In Proceedings of EuroSys; 2012; pages 85–98.
  • J. Dean, S. Ghemawat; MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of Operating Systems Design and Implementation (OSDI); 2004.
  • S. Ewen, K. Tzoumas, M. Kaufmann, V. Markl; Spinning Fast Iterative Data Flows; In Proceedings of the Conference on Very Large Data Bases (VLDB); Volume 5, Number 11; 2012-07; pages 1268–1279.
  • U. Feige, M. Hajiaghayi, J.R. Lee; Improved Approximation Algorithms for Minimum-Weight Vertex Separators; In Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing (STOC); 2005; ACM; pages 563–572.
  • J.E. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin; Powergraph: Distributed Graph-Parallel Computation on Natural Graphs; In Proceedings of the Operating Systems Design and Implementation (OSDI); USENIX Association; 2012; pages 17–30.
  • A. Huang, W. Wu; Mining eCommerce Graph Data with Spark at Alibaba Taobao. 2014.
  • M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly; Dryad: Distributed Data-Parallel Programs From Sequential Building Blocks; In Proceedings of EuroSys; 2007; pages 59–72.
  • G. Karypis, V. Kumar; Multilevel K-Way Partitioning Scheme for Irregular Graphs; In Journal of Parallel and Distributed Computing; Volume 48, Issue 1; 1998; pages 96–129.
  • A. Kyrola, G. Blelloch, C. Guestrin; GraphChi: Large-Scale Graph Computation On Just a PC; In Proceedings of Operating Systems Design and Implementation (OSDI); 2012.
  • J. Leskovec, K. J. Lang, A. Dasgupta, M. W. Mahoney; Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters; In Internet Mathematics; Volume 6, Number 1; 2008; pages 29–123.
  • Y. Low, et al.; GraphLab: A New Parallel Framework for Machine Learning; In Proceedings of UAI; 2010; pages 340–349.
  • Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. M. Hellerstein, Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. In Proceedings of the Conference on Very Large Data Bases (VLDB); 2012.
  • L. F. Mackert, G. M. Lohman; R*-Optimizer Validation and Performance Evaluation for Distributed Queries. In Proceedings of the Conference on Very Large Data Bases (VLDB); 1986; pages 149–159.
  • G. Malewicz, M. H. Austern, A. J. Bik, J. Dehnert, I. Horn, N. Leiser, G. Czajkowski; Pregel: a System for Large-Scale Graph Processing; In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD); 2010; pages 135–146.
  • F. Manola, E. Miller; RDF Primer; W3C Recommendation 10; 2004; pages 1–107.
  • J. Mondal, A. Deshpande; Managing Large Dynamic Graphs Efficiently; In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD); 2012; ACM; pages 145–156.
  • D. Murray; Building new frameworks on Naiad; In His Blog; 2014-04.
  • D.G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, M. Abadi; Naiad: A Timely Dataflow System; In Proceedings of the Symposium on Operating Systems Principles (SOSP); 2013.
  • M. Najork, D. Fetterly, A. Halverson, K. Kenthapadi, S. Gollapudi; Of Hammers and Nails: An Empirical Comparison of Three Paradigms for Processing Large Graphs; In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM); 2012; ACM, pages 103–112.
  • T. Neumann, G. Weikum; RDF-3X: A RISC-style Engine for RDF. In Proceedings of the Conference on Very Large Data Bases (VLDB); 2008.
  • C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins; Pig Latin: A Not-so-Foreign Language for Data Processing; In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD); 2008.
  • L. Page, S. Brin, R. Motwani, T. Winograd; The Pagerank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66; InfoLab; Stanford University; 1999.
  • E. Prud’Hommeaux, A. Seaborne; SPARQL Query Language for RDF; 2008-01.
  • J.M. Pujol, V. Erramilli, G. Siganos, X. Yang, N. Laoutaris, P. Chhabra, P. Rodriguez; The Little Engine(s) That Could: Scaling Online Social Networks; In Proceedings of the Conference of the Special Interest Group on Communications (SIGCOMM); 2010; pages 375–386.
  • I. Robinson, J. Webber, E. Eifrem; Graph Databases; O’Reilly Media; 2013.
  • A. Roy, I. Mihailovic, W. Zwaenepoel; X-Stream: Edge-Centric Graph Processing Using Streaming Partitions; In Proceedings of the Symposium on Operating Systems Principles (SOSP); 2013; ACM; pages 472–488.
  • Y. Saad, Iterative Methods for Sparse Linear Systems; 2nd edition; Society for Industrial and Applied Mathematics; 2003.
  • I. Stanton, G. Kliot; Streaming Graph Partitioning for Large Distributed Graphs; Tech. Rep. MSR-TR-2011-121, Microsoft; 2011-11.
  • P. Stutz, A. Bernstein, W. Cohen; Signal/Collect: Graph Algorithms for the (Semantic) Web; In Proceedings of the International Semantic Web Conference (ISWC); 2010.
  • J. Ugander, L. Backstrom; Balanced Label Propagation for Partitioning Massive Graphs; In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM); 2013; ACM; pages 507–516.
  • M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica; Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing; In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI); 2012.

Via: backfill.

Big Data and Privacy: A Technological Perspective | PCAST

Big Data and Privacy: A Technological Perspective; Executive Office of the President, President’s Council of Advisors on Science and Technology (PCAST); 2014-05-01; 76 pages; landing.



  • White House / UC Berkeley School of Information / Berkeley Center for Law and Technology; John Podesta; 2014-04-01; transcript, video.
  • White House / Data & Society Research Institute / NYU Information Law Institute; John Podesta; 2014-03-17; video.
  • White House / MIT; John Podesta; 2014-03-04; transcript, video.


PCAST Big Data and Privacy Working Group.
  • Susan L. Graham, co-chair.
  • William Press, co-chair.
  • S. James Gates, Jr.,
  • Mark Gorenberg,
  • John Holdren,
  • Eric S. Lander,
  • Craig Mundie,
  • Maxine Savitz,
  • Eric Schmidt.
  • Marjory S. Blumenthal, Executive Director of PCAST; coordination & framing.


  • John P Holdren, co-chair, OSTP
  • Eric S. Lander, co-chair, Broad Institute (Harvard & MIT)
  • William Press, co-vice chair, U. Texas
  • Maxine Savitz, co-vice chair, National Academy of Engineering
  • Rosina Bierbaum, U. Michigan
  • Christine Cassel, National Quality Forum
  • Christopher Chyba, Princeton
  • S. James Gates, Jr., U. Maryland
  • Mark Gorenberg, Zetta Venture Partners
  • Susan L. Graham, UCB
  • Shirley Ann Jackson, Rensselaer Polytechnic
  • Richard C. Levin, Yale
  • Chad Mirkin, Northwestern
  • Mario Molina, UCSD
  • Craig Mundie, Microsoft
  • Ed Penhoet, UCB
  • Barbara Schaal, Washington University
  • Eric Schmidt, Google
  • Daniel Schrag, Harvard


  • Marjory S. Blumenthal
  • Michael Johnson


From the Executive Summary [page xiii], and also from Section 5.2 [page 49]

  • Recommendation 1 [consider uses over collection activities]
    Policy attention should focus more on the actual uses of big data and less on its collection and analysis.
  • Recommendation 2 [no Microsoft lockin; no national champion]
    Policies and regulation, at all levels of government, should not embed particular technological solutions, but rather should be stated in terms of intended outcomes.
  • Recommendation 3 [fund]
    With coordination and encouragement from [The White House Office of Science and Technology Policy] OSTP, the [Networking and Information Technology Research and Development] NITRD agencies should strengthen U.S. research in privacy‐related technologies and in the relevant areas of social science that inform the successful application of those technologies.
  • Recommendation 4 [talk]
    OSTP, together with the appropriate educational institutions and professional societies, should encourage increased education and training opportunities concerning privacy protection, including career paths for professionals.
  • Recommendation 5 [talk & buy]
    The United States should take the lead both in the international arena and at home by adopting policies that stimulate the use of practical privacy‐protecting technologies that exist today. It can exhibit leadership both by its convening power (for instance, by promoting the creation and adoption of standards) and also by its own procurement practices (such as its own use of privacy‐preserving cloud services).

Table of Contents

  1. Executive Summary
  2. Introduction
    1. Context and outline of this report
    2. Technology has long driven the meaning of privacy
    3. What is different today?
    4. Values, harms, and rights
  3. Examples and Scenarios
    1. Things happening today or very soon
    2. Scenarios of the near future in healthcare and education
      1. Healthcare: personalized medicine
      2. Healthcare: detection of symptoms by mobile devices
      3. Education
    3. Challenges to the home’s special status
    4. Tradeoffs among privacy, security, and convenience
  4. Collection, Analytics, and Supporting Infrastructure
    1. Electronic sources of personal data
      1. “Born digital” data
      2. Data from sensors
    2. Big data analytics
      1. Data mining
      2. Data fusion and information integration
      3. Image and speech recognition
      4. Social‐network analysis
    3. The infrastructure behind big data
      1. Data centers
      2. The cloud
  5. Technologies and Strategies for Privacy Protection
    1. The relationship between cybersecurity and privacy
    2. Cryptography and encryption
      1. Well‐established encryption technology
      2. Encryption frontiers
    3. Notice and consent
    4. Other strategies and techniques
      1. Anonymization or de‐identification
      2. Deletion and non‐retention
    5. Robust technologies going forward
      1. A Successor to Notice and Consent
      2. Context and Use
      3. Enforcement and deterrence
      4. Operationalizing the Consumer Privacy Bill of Rights
  6. PCAST Perspectives and Conclusions
    1. Technical feasibility of policy interventions
    2. Recommendations
    3. Final Remarks
  7. Appendix A. Additional Experts Providing Input
  8. Special Acknowledgment


  • The President’s Council of Advisors on Science and Technology (PCAST)
  • PCAST Big Data and Privacy Working Group
  • Enabling Event
    • President Barack Obama
    • Remarks, 2014-01-17
    • Counselor John Podesta
  • New Concerns
    • Born digital vs born analog
    • standardized components
    • particular limited purpose vs repurposed, reused.
    • data fusion
    • algorithms
    • inferences
  • Provenance of data, recording and tracing the provenance of data
  • Trusted Data Format (TDF)


  • Right to forget, right to be forgotten is unenforceable and infeasible [page 48].
  • Prior redress of prospective harms is a reasonable framework [page 49]
    • Conceptualized as vulnerable groups who are stipulated as harmed a priori or are harmed sunt constitua.
  • Government may be forbidden from certain classes of uses, despite their being available in the private sector.

    • Government is allowed some activities and powers
    • Private industry is allowed some activities and powers
    • It is feasible in practice to mix & match
      • government coercion => private privilege => result
      • private privilege => private coercion => result

Consumer Privacy Bill of Rights (CPBR)

Obligations [of service providers, as powerful organizations]

  • Respect for Context => use consistent with collection context.
  • Focused Collection => limited collection.
  • Security => handling techniques
  • Accountability => handling techniques.

Empowerments [of consumers, as individuals]

  • Individual Control => control of collection, control of use.
  • Transparency => of practices [by service providers]
  • Access and Accuracy => right to review & edit [something about proportionality]

Definition of Privacy

The definition is unclear and evolving. It is frequently defined in terms of the harms incurred when it is lost.

Privacy Framework via Harms

The Prosser Harms, quoted from page 6:

  1. Intrusion upon seclusion. A person who intentionally intrudes, physically or otherwise (now including electronically), upon the solitude or seclusion of another person or her private affairs or concerns, can be subject to liability for the invasion of her privacy, but only if the intrusion would be highly offensive to a reasonable person.
  2. Public disclosure of private facts. Similarly, a person can be sued for publishing private facts about another person, even if those facts are true. Private facts are those about someone’s personal life that have not previously been made public, that are not of legitimate public concern, and that would be offensive to a reasonable person.
  3. “False light” or publicity. Closely related to defamation, this harm results when false facts are widely published about an individual. In some states, false light includes untrue implications, not just untrue facts as such.
  4. Misappropriation of name or likeness. Individuals have a “right of publicity” to control the use of their name or likeness in commercial settings.



<quote>One perspective informed by new technologies and technology‐mediated communication suggests that privacy is about the “continual management of boundaries between different spheres of action and degrees of disclosure within those spheres,” with privacy and one’s public face being balanced in different ways at different times. See: Leysia Palen, Paul Dourish; Unpacking ‘Privacy’ for a Networked World; In Proceedings of CHI 2003, Association for Computing Machinery, 2003-04-05.</quote>, footnote, page 7.

Adjacency Theory

An oppositional framework wherein harms are “adjacent to” benefits:

  • Invasion of private communications
  • Invasion of privacy in a person’s virtual home.
  • Public disclosure of inferred private facts
  • Tracking, stalking and violations of locational privacy.
  • Harm arising from false conclusions about individuals, based on personal profiles from big‐data analytics.
  • Foreclosure of individual autonomy or self‐determination
  • Loss of anonymity and private association.

Mosaic Theory

Obliquely referenced via a quote from Sotomayor.
<quote>“I would ask whether people reasonably expect that their movements will be recorded and aggregated in a manner that enables the Government to ascertain, more or less at will, their political and religious beliefs, sexual habits, and so on.” United States v. Jones (10‐1259), Sotomayor concurrence.</quote>

Yet, not cited, but related (at least):

Definition of Roles [of data processors]

  • data collectors
  • data analyzers
  • data users

The data generators or producers in this roles framework are substantially only customers or consumers (sic).


  • Definition of analysis versus use
    • <quote>Analysis, per se, does not directly touch the individual (it is neither collection nor, without additional action, use) and may have no external visibility.
    • & by contrast, it is the use of a product of analysis, whether in commerce, by government, by the press, or by individuals, that can cause adverse consequences to individuals.</quote>
  • Big Data => definitions
    • [comprises data with] “high‐volume, high‐velocity and high‐variety information assets that demand cost‐effective, innovative forms of information processing for enhanced insight and decision making,” attributed to Gartner Inc.
    • “a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to, NoSQL, MapReduce, and machine learning.” attributed to “computer scientists” on arXiv.


The strong, direct, unequivocal, un-nuanced, provocative language…

<quote>For a variety of reasons, PCAST judges anonymization, data deletion, and distinguishing data from metadata (defined below) to be in this category. The framework of notice and consent is also becoming unworkable as a useful foundation for policy.</quote>

<quote>Anonymization is increasingly easily defeated by the very techniques that are being developed for many legitimate applications of big data. In general, as the size and diversity of available data grows, the likelihood of being able to re‐identify individuals (that is, re‐associate their records with their names) grows substantially. While anonymization may remain somewhat useful as an added safeguard in some situations, approaches that deem it, by itself, a sufficient safeguard need updating. </quote>

<quote>Notice and consent is the practice of requiring individuals to give positive consent to the personal data collection practices of each individual app, program, or web service. Only in some fantasy world do users actually read these notices and understand their implications before clicking to indicate their consent. <snip/>The conceptual problem with notice and consent is that it fundamentally places the burden of privacy protection on the individual. Notice and consent creates a non‐level playing field in the implicit privacy negotiation between provider and user. The provider offers a complex, take‐it‐or‐leave‐it set of terms, while the user, in practice, can allocate only a few seconds to evaluating the offer. This is a kind of market failure. </quote>

<quote>Also rapidly changing are the distinctions between government and the private sector as potential threats to individual privacy. Government is not just a “giant corporation.” It has a monopoly in the use of force; it has no direct competitors who seek market advantage over it and may thus motivate it to correct missteps. Governments have checks and balances, which can contribute to self‐imposed limits on what they may do with people’s information. Companies decide how they will use such information in the context of such factors as competitive advantages and risks, government regulation, and perceived threats and consequences of lawsuits. It is thus appropriate that there are different sets of constraints on the public and private sectors. But government has a set of authorities – particularly in the areas of law enforcement and national security – that place it in a uniquely powerful position, and therefore the restraints placed on its collection and use of data deserve special attention. Indeed, the need for such attention is heightened because of the increasingly blurry line between public and private data. While these differences are real, big data is to some extent a leveler of the differences between government and companies. Both governments and companies have potential access to the same sources of data and the same analytic tools. Current rules may allow government to purchase or otherwise obtain data from the private sector that, in some cases, it could not legally collect itself, or to outsource to the private sector analyses it could not itself legally perform. [emphasis here] The possibility of government exercising, without proper safeguards, its own monopoly powers and also having unfettered access to the private information marketplace is unsettling.</quote>


Substantially in order of appearance in the footnotes, without repeats.

Via: backfill, backfill


And yet even with all the letters and professional editing and techwriting staff available to this national- and historical-level enterprise we still see [Footnote 101, page 31]

Qi, H. and A. Gani, “Research on mobile cloud computing: Review, trend and perspectives,” Digital Information and Communication Technology and it’s Applications (DICTAP), 2012 Second International Conference on, 2012.

The correct listing is at Springer

Digital Information and Communication Technology and Its Applications; International Conference, DICTAP 2011, Dijon, France, June 21-23, 2011. Proceedings, Part I, Series: Communications in Computer and Information Science, Vol. 166 Cherifi, Hocine, Zain, Jasni Mohamad, El-Qawasmeh, Eyas (Eds.) 2011, XIV, 806 p.


  • it’s → is a contraction for it is
  • its → is a possessive

Ergo: s/it's/its/g;
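
As a minimal sketch, the substitution above can be written out in Python. The helper name `fix_possessive` is hypothetical, and the pattern is widened (an assumption beyond the one-liner) to catch the typographic apostrophe that appears in the cited title. A blanket replacement like this is only safe on text where every "it's" is a mistaken possessive, since it would also rewrite genuine contractions.

```python
import re

# Matches "it's" written with either a straight (') or typographic (’) apostrophe.
ITS_CONTRACTION = re.compile(r"\b([Ii])t['’]s\b")


def fix_possessive(text: str) -> str:
    """Replace every "it's" with "its", keeping the case of the leading letter."""
    return ITS_CONTRACTION.sub(r"\1ts", text)


# The conference name as it is printed in the footnote.
cited = "Digital Information and Communication Technology and it’s Applications (DICTAP)"
print(fix_possessive(cited))
```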

Handbook of Data Analytics | Leada


  • Brian Liou
  • Tristan Tao
  • Elizabeth Lin

Shop: Leada, a consultancy




  • Not a “handbook” in the sense that it’s not recipes for HOWTO at all.
  • Motivational interviews in a Q&A style; ~5 pages each
  • Career Advice.
  • Career Attractor.
  • No Math, Algos, Results.


  • What exactly is a data scientist anyway, and how is it different than a data analyst?
  • Who buys this stuff anyway?
  • What skills do such people need? [someone answers: PowerPoint 2-pager]
  • How does interviewing work in this area?
  • They interview for specific task-level skills; show passion, show “hunger to learn”


  • <bzzzz>Big Data</bzzzz>
  • <zzzz>B2B</zzzz>
  • <zzzz>BI</zzzz>
  • <zzzz>CRM</zzzz>


  • Cassandra
  • Excel
  • Hadoop™ MapReduce
  • Hive
  • Java
  • Kaggle
  • MongoDB
  • PDF
  • SAS
  • Storm
  • NoSQL
  • Pig
  • Python
  • R
  • Word


  • regression
  • t-test
  • Algorithm complexity (Big O notation)
  • Machine Learning (general)
  • Natural Language Processing (NLP)


  • BigML
  • Cloudera
  • Facebook
  • Flurry
  • HG Data
  • LinkedIn
  • Mode Analytics
  • Persontyle
  • Smarter Remarketer Inc.
  • Stylistics
  • Yelp
  • Yhat


  • C++ programming, “low level systems programming”
  • Quality Control
  • Computer Science
  • Economics
  • Humanities, generally
  • Parasitology
  • Philosophy

Via: backfill

A Very Short History Of Big Data | Gil Press, Forbes

Gil Press; A Very Short History Of Big Data; In Forbes; 2013-05-09.

Via: A Very Short History of Big Data on


Indirect Sources

i.e. not listed directly, but cited.

  • Istvan Dienes; National Accounting of Information; Reference Manual of SNIA, Version v1.1; 1994; 291 pages.

    • SNA vs SNIA
      • S-something N-something Accounting
      • S-something N-something Information Accounting
    • SNA92 is authoritative
  • Alistair D. Duff; The Information Society Studies; Routledge; 2000-06-01; 216 pages; $200.
  • Andrew Odlyzko (started) Minnesota Internet Traffic Studies (MINTS); 2002-2009; tracking the growth in Internet traffic.
  • Martin Hilbert; How to Measure “How Much Information”? Theoretical, Methodological, and Statistical Challenges for the Social Sciences; In International Journal of Communication (IJOC); Vol 6; 2012; 14 pages.
    • This is an introduction to a ‘special section’ issue of the IJOC on information & measurement studies.
    • Conclusions (in the article and the subsequent articles of the special section)
      1. It is not only statistically feasible, but also analytically insightful to quantify the amount of information handled by society.
      2. However, many of the available sources are not very solid, and the methodologies are still maturing.
      3. The research question and its theoretical framework have defined the methodology, including the choice of the indicator.
      4. There is still no consensus on how to define the most fundamental measures for data and information.
      5. Information quantity is not equal to information quality or information value, but the second requires the first.
      6. Will it be possible and/or useful to harmonize information accounts?

Altiscale, Raymie Stata



  • Concept: Big Data Dial Tone, meaning Hadoop SaaS
  • Implementation: Hadoop on Amazon AWS
  • Staff
    • Raymie Stata, CEO
    • Ricardo Jenez, VPE
    • Bill Coughran, Sequoia, Board of Directors; who ran much of engineering at Google for years.
  • Funding
    • “recently”
    • Series A, $12 million
    • Sequoia Capital
    • General Catalyst Partners
    • Accel Partners
    • AME Ventures (Jerry Yang)
    • “and a few individual investors.”
  • Others
    • MapR
    • Cloudera
    • Hortonworks


Also see backfill

Knack & Evolv as examples of “Big Data meets Human Resources”

Via: Big Data, Trying to Build Better Workers; Steve Lohr; In The New York Times; 2013-04-20.
Also: backfill


  • Something about “Big Data Meets Human Resources”
  • Quotes from Éminences Grises
  • Companies (users)
    • I.B.M.
    • Kenexa (of IBM), a recruiting, hiring and training company, $1.3B acquisition
    • “companies like” IBM, Oracle, SAP
    • eHarmony
    • Google
    • Transcom, a global operator of customer-service call centers
  • Highlighted
    • Knack
      • Tests emotional intelligence
      • Uses games to test
      • Customers (pilot testers)
        • NYU Langone Medical Center,
        • Bain & Company
        • An unnamed unit of Shell.
    • Evolv
      • San Francisco
      • Lotsa “researchers” from name-brand schools (Wharton, Yale, Stanford)
  • Quotes
    • Tim Geisert, chief marketing officer for I.B.M.’s Kenexa unit.
    • Guy Halfteck, C.E.O., Knack.
    • Prasad Setty, vice president for people analytics, Google.
    • Michael Housman, an economist and managing director of analytics at Evolv.
    • Neil Rae, an executive vice president of Transcom.