Resources for Getting Started with Distributed Systems | Caitie McCaffrey

Caitie McCaffrey (Microsoft); Resources for Getting Started with Distributed Systems; In Her Blog; 2017-09-07.

tl;dr → Distributed Sagas, within the .NET culture of Microsoft.


  • Distributed SAGA
  • Simple API for Grid Applications (SAGA); In Jimmy Wales’ Wiki.
  • Tao
  • Espresso
  • Transaction Processing Performance Council (TPPC, TPC)
  • Pre-materialized aggregates, a technique.

The Canon (A Canon)

Exemplars (Bloggists)

Post Mortems (After Action Reports)

Exemplars (NoSQL)

  • Bigtable, Google
  • Cassandra
  • CouchDB
  • Dynamo, Amazon
  • HBase of Apache
  • MongoDB
  • Neo4j
  • Redis
  • Riak
  • SimpleDB, Amazon

Exemplars (Full SQL)

  • MySQL
  • Oracle
  • … and so on.




The Suitcase Words
  • 2-Phase Commit (2PC)
  • Available Continuous Impressive Dancing (ACID)
    Atomic, Consistent, Isolated, Durable (ACID)
  • Basically-Available, Slow Soft State, Eventually-Consistent (BASE, BASSEC)
    BASE (i.e., not ACID)
  • BLOOM, a programming language, the CALM programming language
  • Consistency As Logical Monotonicity (CALM)
  • Conflict-free Replicated Data Type (CRDT)
  • Consistency, Availability, Partition-Tolerance (CAP), (Folk-) Theorem
  • Fischer, Lynch, Paterson (FLP) Theorem
  • Liveness
  • Lots of Labor (LOL)
  • Safety
  • Serializability
  • Single System Image (SSI)
  • Read Atomic Multi-Partition (RAMP) Transactions
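
To make the CRDT entry above concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest state-based CRDTs. The class name `GCounter` and its method names are illustrative assumptions, not taken from any of the listed sources. Each replica increments only its own slot, and merging takes the per-slot maximum, so replicas converge regardless of the order or number of merges.

```python
class GCounter:
    """Grow-only counter: one slot per replica; merge takes the per-slot max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments originating at that replica

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-slot max is commutative, associative, and idempotent,
        # which is what lets replicas merge in any order, any number of times.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # repeating a merge changes nothing
```

Both replicas end up reporting the same total without coordinating, which is the property the CALM and CRDT entries point at.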

Previously filed.

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing | Akidau et al. (Google)

Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, Sam Whittle; The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing; In Proceedings of the Conference on Very Large Data Bases (VLDB), Volume 8, Number 12; 2015-08-31; 12 pages; Google, paywall


Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves, in addition to an insatiable hunger for faster answers. Meanwhile, practicality dictates that one can never fully optimize along all dimensions of correctness, latency, and cost for these types of input. As a result, data processing practitioners are left with the quandary of how to reconcile the tensions between these seemingly competing propositions, often resulting in disparate implementations and systems.

We propose that a fundamental shift of approach is necessary to deal with these evolved requirements in modern data processing. We as a field must stop trying to groom unbounded datasets into finite pools of information that eventually become complete, and instead live and breathe under the assumption that we will never know if or when we have seen all of our data, only that new data will arrive, old data may be retracted, and the only way to make this problem tractable is via principled abstractions that allow the practitioner the choice of appropriate tradeoffs along the axes of interest: correctness, latency, and cost.

In this paper, we present one such approach, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.
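
The event-time windowing mentioned in the abstract can be illustrated with a minimal sketch. This is not the Dataflow SDK API: the function name `fixed_windows`, the `(event_time, value)` tuple layout, and the 60-second width are assumptions made for illustration. Each record is assigned to a fixed window by its own event timestamp, so the result does not depend on arrival order.

```python
from collections import defaultdict


def fixed_windows(events, width):
    """Sum values per fixed event-time window.

    events: iterable of (event_time_seconds, value), in arbitrary arrival order.
    width:  window width in seconds.
    Returns {window_start: sum_of_values}.
    """
    totals = defaultdict(int)
    for event_time, value in events:
        # The window is chosen from the record's event time, not its arrival time.
        window_start = (event_time // width) * width
        totals[window_start] += value
    return dict(totals)


# Records arrive out of order with respect to their event times.
arrivals = [(65, 1), (10, 2), (130, 5), (59, 3), (61, 4)]
result = fixed_windows(arrivals, 60)
in_order = fixed_windows(sorted(arrivals), 60)
```

The sketch works on a finite list. For unbounded input, deciding when a window's result may be emitted is a separate question, which this example leaves out.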


  1. Daniel J. Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, Stan Zdonik. Aurora: a new model and architecture for data stream management, In The VLDB Journal — The International Journal on Very Large Data Bases, v.12 n.2, p.120-139, 2003-08.[doi:10.1007/s00778-003-0095-z]
  2. Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle, MillWheel: fault-tolerant stream processing at internet scale, In Proceedings of the VLDB Endowment, v.6 n.11, p.1033-1044, 2013-08.[doi:10.14778/2536222.2536229]
  3. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, Daniel Warneke, The Stratosphere platform for big data analytics, The VLDB Journal — The International Journal on Very Large Data Bases, v.23 n.6, p.939-964, 2014-12.[doi:10.1007/s00778-014-0357-y]
  4. Apache. Apache Hadoop, 2012.
  5. Apache. Apache Storm, 2013.
  6. Apache. Apache Flink, 2014.
  7. Apache. Apache Samza, 2014.
  8. R. S. Barga et al. Consistent Streaming Through Time: A Vision for Event Stream Processing. In Proceedings of the Third Biennial Conference on Innovative Data Systems Research (CIDR), pages 363–374, 2007.
  9. Irina Botan, Roozbeh Derakhshan, Nihal Dindar, Laura Haas, Renée J. Miller, Nesime Tatbul, SECRET: a model for analysis of the execution semantics of stream processing systems, In Proceedings of the VLDB Endowment, v.3 n.1-2, 2010-09.[doi:10.14778/1920841.1920874]
  10. Oscar Boykin, Sam Ritchie, Ian O’Connell, Jimmy Lin, Summingbird: a framework for integrating batch and online MapReduce computations, In Proceedings of the VLDB Endowment, v.7 n.13, p.1441-1451, 2014-08.[doi:10.14778/2733004.2733016]
  11. Cask. , 2015.
  12. Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, Nathan Weizenbaum, FlumeJava: easy, efficient data-parallel pipelines, In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, 2010-06-05 → 2010-06-10 (five days!!!), Toronto, Ontario, Canada. [doi:10.1145/1806596.1806638]
  13. B. Chandramouli et al. Trill: A High-Performance Incremental Query Processor for Diverse Analytics. In Proceedings of the 41st International Conference on Very Large Data Bases (VLDB), 2015.
  14. Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R. Madden, Fred Reiss, Mehul A. Shah, TelegraphCQ: continuous dataflow processing, In Proceedings of the 2003 ACM International Conference on Management of Data (SIGMOD), 2003-06-09 → 2003-06-12, San Diego, California. [doi:10.1145/872757.872857]
  15. Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang, NiagaraCQ: a scalable continuous query system for Internet databases, In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD), p.379-390, 2000-05-15 → 2000-05-18, Dallas, Texas, USA. [doi:10.1145/342009.335432]
  16. Jeffrey Dean, Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI), p.10-10, 2004-12-06 → 2004-12-08, San Francisco, CA
  17. EsperTech. Esper, 2006.
  18. Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava, Building a high-level dataflow system on top of Map-Reduce: the Pig experience, In Proceedings of the VLDB Endowment, v.2 n.2, 2009-08. [doi:10.14778/1687553.1687568]
  19. Google. Dataflow SDK, 2015.
  20. Google. Google Cloud Dataflow. 2015.
  21. Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk, Oliver Spatscheck, A heartbeat mechanism and its application in gigascope, In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), 2005-08-30 → 2005-09-02, Trondheim, Norway
  22. Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker, Semantics and evaluation techniques for window aggregates in data streams, In Proceedings of the 2005 ACM International Conference on Management of Data (SIGMOD), 2005-06-14 → 2005-06-16, Baltimore, Maryland. [doi:10.1145/1066157.1066193]
  23. Jin Li, Kristin Tufte, Vladislav Shkapenyuk, Vassilis Papadimos, Theodore Johnson, David Maier, Out-of-order processing: a new architecture for high-performance stream systems, In Proceedings of the VLDB Endowment, v.1 n.1, 2008-08. [doi:10.14778/1453856.1453890]
  24. David Maier, Jin Li, Peter Tucker, Kristin Tufte, Vassilis Papadimos, Semantics of Data streams and operators, In Proceedings of the 10th International Conference on Database Theory, 2005-01-05 → 2005-01-07, Edinburgh, UK. [doi:10.1007/978-3-540-30570-5_3]
  25. N. Marz. How to beat the CAP theorem, In His Blog. 2011.
  26. S. Murthy et al. Pulsar — Real-Time Analytics at Scale. Technical report, eBay, 2015.
  27. SQLStream, 2015.
  28. Utkarsh Srivastava, Jennifer Widom, Flexible time management in data stream systems, In Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2004-06-14 → 2004-06-16, Paris, France. [doi:10.1145/1055558.1055596]
  29. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy, Hive: a warehousing solution over a map-reduce framework, In Proceedings of the VLDB Endowment, v.2 n.2, 2009-08. [doi:10.14778/1687553.1687609]
  30. Peter A. Tucker, David Maier, Tim Sheard, Leonidas Fegaras, Exploiting Punctuation Semantics in Continuous Data Streams, In IEEE Transactions on Knowledge and Data Engineering, v.15 n.3, p.555-568, 2003-03. [doi:10.1109/TKDE.2003.1198390]
  31. James Whiteneck, Kristin Tufte, Amit Bhat, David Maier, Rafael J. Fernández-Moctezuma, Framing the question: detecting and filling spatial-temporal windows, In Proceedings of the ACM SIGSPATIAL International Workshop on GeoStreaming, p.19-22, 2010-11-02 → 2010-11-02, San Jose, California. [doi:10.1145/1878500.1878506]
  32. F. Yang and others. Sonora: A Platform for Continuous Mobile-Cloud Computing. Technical Report MSR-TR-2012-34, Microsoft Research Asia.
  33. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing, In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2012-03-25 → 2012-03-27, San Jose, CA
  34. Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica, Discretized streams: fault-tolerant streaming computation at scale, In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP), 2013-11-03 → 2013-11-06, Farmington, Pennsylvania. [doi:10.1145/2517349.2522737]

Previously filed.

We Can Hear You With WiFi | Wang, Zou, Zhou, Wu, Ni

Guanhua Wang, Yongpan Zou, Zimu Zhou, Kaishun Wu, Lionel M. Ni; We Can Hear You with Wi-Fi!; In Proceedings of MobiCom; 2014-09-11; 12 pages; library.


Recent literature advances Wi-Fi signals to “see” people’s motions and locations. This paper asks the following question: Can Wi-Fi “hear” our talks? We present WiHear, which enables Wi-Fi signals to “hear” our talks without deploying any devices. To achieve this, WiHear needs to detect and analyze fine-grained radio reflections from mouth movements. WiHear solves this micro-movement detection problem by introducing Mouth Motion Profile that leverages partial multipath effects and wavelet packet transformation. Since Wi-Fi signals do not require line-of-sight, WiHear can “hear” people talks within the radio range. Further, WiHear can simultaneously “hear” multiple people’s talks leveraging MIMO technology. We implement WiHear on both USRP N210 platform and commercial Wi-Fi infrastructure. Results show that within our pre-defined vocabulary, WiHear can achieve detection accuracy of 91% on average for single individual speaking no more than 6 words and up to 74% for no more than 3 people talking simultaneously. Moreover, the detection accuracy can be further improved by deploying multiple receivers from different angle.

Living on Fumes: Digital Footprints, Data Fumes, and the Limitations of Spatial Big Data | Jim Thatcher

Jim Thatcher (Clark University); Living on Fumes: Digital Footprints, Data Fumes, and the Limitations of Spatial Big Data; In International Journal of Communication (IJoC); Volume 8; 2014; 19 pages; landing; previously in Proceedings of the 26th International Cartographic Conference (ICC), 2014.

tl;dr → whereas capitalism is bad, the critical theory: sociotechnical, epistemic project, abductive processes, epistemic limits, epistemic and ontological commitments, capitalist profit motives, private corporations; frameworks of Marcuse, Pickles. You get the idea.


Amid the continued rise of big data in both the public and private sectors, spatial information has come to play an increasingly prominent role. This article defines big data as both a sociotechnical and epistemic project with regard to spatial information. Through interviews, job shadowing, and a review of current literature, both academic researchers and private companies are shown to approach spatial big data sets in analogous ways. Digital footprints and data fumes, respectively, describe a process that inscribes certain meaning into quantified spatial information. Social and economic limitations of this data are presented. Finally, the field of geographic information science is presented as a useful guide in dealing with the “hard work of theory” necessary in the big data movement.


  • In the introductory paragraph, cites opinements in Fast Company and Mashable as authoritative directional indicators.
  • Two problems
    1. <quote>On the one hand, rather than fully capturing life as researchers hope, end-user interactions within big data are necessarily the result of decisions made by an extremely small group of programmers working for private corporations that have [been] promulgated through the mobile application ecosystem.
    2. On the other hand, in accepting that the data gathered through mobile applications reveal meaningful information about the world, researchers are tacitly accepting a commodification and quantification of knowledge.</quote>
  • Big Data is
    • (wait for it …) very big, “large” even.
    • <quote>data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time</quote>, Adam Jacobs.
    • Contrarianism
      • Something vague about Taylorism, Max Weber, etc.
      • Something vague about how having more data is better, or is not better.
    • The Fourth Paradigm
      1. empiricism
      2. analysis
      3. simulation.
      4. explore & exploit
    • Sources
      <quote>Most current studies describing themselves as “big data” with a spatial component revolve around two mobile software platforms [Foursquare, Twitter]</quote>

      • Foursquare
      • Twitter
      • Facebook
      • Flickr
  • Types of Data [plural of types of Datum(s)]
    • Checkin
    • Tweet
  • Livehoods
  • 25% of Foursquare users link their Twitter accounts (75% don’t)
  • <quote>Finally, the reliance upon data generated with an explicit motive for profit — both for the end user and the corporation—results in epistemological commitments not dissimilar to concerns raised with regard to the knowledges and approaches privileged by GIS use. </quote>
  • <quote>This hard work of theory opens new knowledge projects within the realm of big data. For example, if the check-in is viewed as a form of disciplining technology — one that reports location to enmesh it more fully in capitalist exchange — then purposeful location fraud takes on new meaning as a potential form of resistance or protest.</quote>


  • private companies
  • profit motives
  • capitalism


  • Digital footprints
  • Digital fumes


  • PostgreSQL
  • R
  • Mac (OS)


  • Anderson, C. (2008-06-23). The end of theory: The data deluge makes the scientific method obsolete. Wired.
  • Baker, S. (2012-01-05). Can social media sell soap? The New York Times.
  • Batty, M. (2012). Smart cities, big data. Environment and Planning B, 39, 191–193.
  • Benner, J., & Robles, C. (2012). Trending on Foursquare: Examining the location and categories of venues that trend in three cities. In Proceedings of the Workshop on GIScience in the Big Data Age 2012 (pp. 27–35). Columbus, Ohio.
  • Berry, D. M. (2011). The philosophy of software: Code and mediation in the digital age. London, UK: Palgrave Macmillan.
  • boyd, d., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679.
  • Brownlee, J. (2012-03-30). This creepy app isn’t just stalking women without their knowledge, it’s a wake-up call about Facebook privacy. In Cult of Mac.
  • Burgess, J., & Bruns, A. (2012). Twitter archives and the challenges of “big social data” for media and communication research. M/C Journal, 15(5).
  • Carbunar, B., & Potharaju, R. (2012). You unlocked the Mt. Everest badge on Foursquare! Countering location fraud in geosocial networks. In Proceedings of the 2012 IEEE 9th International Conference on Mobile Ad-Hoc and Sensor Systems (MASS), pages 182-190. IEEE Computer Society, Washington, DC.
  • Cerrato, P. (2012-11-01). Big data analytics: Where’s the ROI? InformationWeek: Healthcare.
  • Cheng, Z., Caverlee, J., Lee, K., & Sui, D. (2011). Exploring millions of footprints in location sharing services. In Proceedings of the Fifth International AAAI Conference on WSM. Barcelona, Spain.
  • Crampton, J. (2003). The political mapping of cyberspace. Edinburgh, Scotland: Edinburgh University Press.
  • Crampton, J. (2013). Commentary: Is security sustainable? Environment and Planning D, 31, 571–577.
  • Cranshaw, J., Schwartz, R., Hong, J., & Sadeh, N. (2012). The Livehoods Project: Utilizing social media to understand the dynamics of a city. In Proceedings of the Sixth International AAAI Conference on WSM. Dublin, Ireland.
  • Curry, M. (1997). The digital individual and the private realm. Annals of the AAG, 87, 681–699.
  • DeLyser, D., & Sui, D. (2013). Crossing the qualitative-quantitative divide II: Inventive approaches to big data, mobile methods, and rhythmanalysis. Progress in Human Geography, 37(2), 293–305.
  • Eckert, J., & Hemsley, J. (2013-04-11). Occupied Geographies, Relational or Otherwise. Presentation to the American Association of Geographers, Los Angeles, CA.
  • Exner, J., Zeile, P., & Streich, B. (2011). Urban monitoring laboratory: New benefits and potential for urban planning through the use of urban sensing, geo- and mobile-web. In Proceedings of Real CORP 2011. pages 1087–1096. Wien, Austria.
  • Farmer, C., & Pozdnoukhov, A. (2012). Building streaming GIScience from context, theory, and intelligence. In Proceedings of the Workshop on GIScience in the Big Data Age 2012. pages 5–10. Columbus, Ohio.
  • Goodchild, M. (1992). Geographical information science. In International Journal of Geographical Information Systems, 6, 31–45.
  • Goodchild, M. (2007). Citizens as sensors: The world of volunteered geography. In GeoJournal, 69(4), 211–221.
  • Goodchild, M., & Glennon, J. A. (2010). Crowdsourcing geographic information for disaster response: A research frontier. In International Journal of Digital Earth, 3(3), 231–241.
  • Harley, J. (1989). Deconstructing the map. In Cartographica, 26, 1–20.
  • Hecht, B., Hong, L., Suh, B., & Chi, E. (2011). Tweets from Justin Bieber’s heart. In Proceedings of the ACM CHI Conference 2011. pages 237–246. Vancouver, BC.
  • Heidegger, M. (1977). The question concerning technology and other essays. W. Lovitt, Translator. New York, NY: Harper Perennial.
  • Hey, T., Tansley, S., & Tolle, K. (Eds.). (2009). The fourth paradigm: Data-intensive scientific discovery. Redmond, WA: Microsoft Research.
  • Horvath, I. (2012). Beyond advanced mechatronics: New design challenges of social-cyber systems. (Draft paper.) In Proceedings of the ACM Workshop on Mechatronic Design, Linz 2012. Linz, Austria
  • Jacobs, A. (2009). The pathologies of big data. In ACM Queue, 7(6), pages 1–12.
  • Joseph, K., Tan, C., & Carley, K. (2012). Beyond “local,” “categories” and “friends”: Clustering Foursquare users with latent “topics.” In Proceedings of ACM Ubicomp 2012. pages 919–926. Pittsburgh, PA.
  • Kingsbury, P., & Jones III, J. P. (2009). Walter Benjamin’s Dionysian adventures on Google earth. In Geoforum, 40, 502–513.
  • Kitchin, R., & Dodge, M. (2007). Rethinking maps. In Progress in Human Geography, 31, 331–344.
  • Kling, F., & Pozdnoukhov, A. (2012). When a city tells a story: Urban topic analysis. In Proceedings of ACM SIGSPATIAL 2012. pages 482–485. Redondo Beach, CA.
  • Lathia, N., Quercia, D., & Crowcroft, J. (2012). The hidden image of the city: Sensing community well-being from urban mobility. In Pervasive Computing, 7319, 91–98.
  • Laurila, J., Gatica-Perez, D., Aad, I., Blom, J., Bornet, O., Do, T., Dousse, O., Eberle, J., & Miettinen, M. (2012). The mobile big data challenge. Nokia Research.
  • Livehoods, demonstrator & promotional site. (2012).
  • Lohr, S. (2012-12-29). Sure, big data is great. But so is intuition. The New York Times.
  • Long, X., Jin, L., & Joshi, J. (2012). Exploring trajectory-driven local geographic topics in Foursquare. In Proceedings of ACM Ubicomp 2012. pages 927–934. Pittsburgh, PA.
  • Marcuse, H. (1982 [1941]). Some social implications of modern technology. In A. Arato & E. Gebhardt (Eds.), The essential Frankfurt School reader. pages 138–162. New York, NY: Continuum.
  • Martino, M., Britter, R., Outram, C., Zacharias, C., Biderman, A., & Ratti, C. (2010). Senseable city. Cambridge, MA: MIT Senseable City Lab.
  • Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A revolution that will transform how we live, work, and think. New York, NY: Houghton Mifflin Harcourt.
  • Mitchell, J. (2012-04-10). Life after death of the check-in. In ReadWrite.
  • National Science Foundation. (2012-10-03). NSF announces interagency progress on administration’s big data initiative. Press release.
  • Noulas, A., Scellato, S., Mascolo, C., & Pontil, M. (2011). An empirical study of geographic user activity patterns in Foursquare. In Proceedings of the Fifth International AAAI Conference on WSM. pages 570–573. Barcelona, Spain.
  • Obermeyer, N. (1995). The hidden GIS technocracy. In Cartography and Geographic Information Science, 22(1), 78–83.
  • O’Sullivan, D. (2006). Geographical information science: Critical GIS. Progress in Human Geography, 30(6), 783–791.
  • Paulos, E., Honicky, R. J., & Hooker, B. (2008). Citizen science: Enabling Participatory Urbanism. In M. Foth (Ed.), Handbook of Research on Urban Informatics: The Practice and Promise of the Real-Time City. pages 414–436. Hershey, PA: Information Science Reference.
  • Pickles, J. (1993). Discourse on method and the History of Discipline: Reflections on Jerome Dobson’s 1993 “Automated geography.” In Professional Geographer, 45, 451–455.
  • Pickles, J. (1995). Ground Truth. New York, NY: Guilford Press.
  • Pickles, J. (1997). Tool or Science? GIS, Technoscience and the Theoretical Turn. In Annals of the AAG, 87, pages 363–372.
  • Presley, S. (2011). Mapping out #LondonRiots. In NFPvoice.
  • Rasmus, D. (2012-01-27). Why big data won’t make you smart, rich, or pretty. In Fast Company.
  • Sakaki, T., Okazaki, M., & Matsuo, Y. (2010). Earthquake shakes Twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web (WWW). pages 851–860. Raleigh, NC.
  • A sea of sensors. staff; (2010-11-04). In The Economist.
  • Sheppard, E. (1993). Automated geography: What kind of geography for what kind of society? In Professional Geographer, 45, 457–460.
  • Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or die. Hoboken, NJ: John Wiley.
  • Sui, D. (2008). The wikification of GIS and its consequences: Or Angelina Jolie’s new tattoo and the future of GIS. In Computers, Environment and Urban Systems, 32, 1–5.
  • Taylor, C. (2012-11-07). Triumph of the nerds: Nate Silver wins in 50 states. In Mashable.
  • Thatcher, J. (2013-12). Avoiding the Ghetto through hope and fear: An analysis of immanent technology using ideal types. In GeoJournal. Volume 78, Issue 6, pages 967-980. paywall.
  • Thompson, C. (2012-05-10). Foursquare alters API to eliminate apps like Girls Around Me. In AboutFoursquare.
  • Twitter. (2012). Streaming API request parameters. API Documentation.
  • Weber, M. (1973 [1946]). From Max Weber (C. Mills & H. Gerth, Eds.). New York, NY: Oxford University Press.
  • Wilson, M. (2012). Location-based services, conspicuous mobility, and the location-aware future. In Geoforum, 43(6), 1266–1275.
  • Wright, D., Goodchild, M., & Proctor, D. (1997). Still hoping to turn that theoretical corner. In Annals of the AAG, 87(2), 373.
  • Xu, S., Flexner, S., & Carvalho, V. (2012). Geocoding billions of addresses: Towards a spatial record linkage system with big data. In Proceedings of the Workshop on GIScience in the Big Data Age 2012. pages 17–26. Columbus, Ohio.


Via: backfill.

Locality-Sensitive Hashing for Search in High Dimensional Spaces






Some search queries

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (DBSCAN) | Ester, Kriegel, Sander, Xu

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu; A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise; In Proceedings of Knowledge Discovery in Databases (KDD); 1996; 6 pages.


Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that

  1. DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that
  2. DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
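
The density-based notion of clusters can be sketched in a few lines. This is a naive illustration with a quadratic-time neighbor scan, not the paper's implementation or benchmark setup. The names `region_query`, `dbscan`, `eps`, and `min_pts` are choices of this sketch, and it takes both the radius and the minimum neighbor count as explicit arguments. A point with at least `min_pts` neighbors within `eps` is treated as a core point; clusters grow outward from core points, and points reachable from no core point are labelled noise.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

On this toy input the two squares form two clusters and the isolated point is labelled noise. The efficiency the abstract refers to would require an indexed neighbor lookup rather than the linear scan used here.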

MySQL++ v3.2.2 User Manual



  • Specialized SQL Structures (SSQLS)


Attention Decay in Science | Della Briotta Parolo, Pan, Ghosh, Huberman, Kaski, Fortunato

Pietro Della Briotta Parolo, Raj Kumar Pan, Rumi Ghosh, Bernardo A. Huberman, Kimmo Kaski, Santo Fortunato; Attention Decay in Science; preprint; Elsevier (submitted to some journal of theirs); submitted: 2015-03-09; 12 pages; arXiv:1503.01881.


The exponential growth in the number of scientific papers makes it increasingly difficult for researchers to keep track of all the publications relevant to their work. Consequently, the attention that can be devoted to individual papers, measured by their citation counts, is bound to decay rapidly. In this work we make a thorough study of the life-cycle of papers in different disciplines. Typically, the citation rate of a paper increases up to a few years after its publication, reaches a peak and then decreases rapidly. This decay can be described by an exponential or a power law behavior, as in ultradiffusive processes, with exponential fitting better than power law for the majority of cases. The decay is also becoming faster over the years, signaling that nowadays papers are forgotten more quickly. However, when time is counted in terms of the number of published papers, the rate of decay of citations is fairly independent of the period considered. This indicates that the attention of scholars depends on the number of published items, and not on real time.
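
One simple way to compare the two decay shapes named in the abstract is to fit each as a straight line in a transformed space and compare residuals. The abstract does not describe the authors' fitting procedure, so the log-space least-squares approach below is an assumption of this sketch, as are the function names and the synthetic data. An exponential c(t) = A·exp(−λt) is linear in (t, log c); a power law c(t) = A·t^(−α) is linear in (log t, log c).

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sum of squared residuals)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def compare_decay(times, rates):
    """Fit log(rate) against t (exponential) and against log(t) (power law)."""
    log_rates = [math.log(r) for r in rates]
    _, exp_slope, exp_sse = linear_fit(times, log_rates)
    _, pow_slope, pow_sse = linear_fit([math.log(t) for t in times], log_rates)
    better = "exponential" if exp_sse < pow_sse else "power law"
    return {"exp_slope": exp_slope, "pow_slope": pow_slope, "better": better}


# Synthetic citation rates (not data from the paper): c(t) = 40 * exp(-0.3 t)
years = list(range(1, 11))
rates = [40 * math.exp(-0.3 * t) for t in years]
fit = compare_decay(years, rates)
```

Times must be strictly positive for the power-law fit, since it takes log(t). Counting time in number of published papers instead of years, as the abstract suggests, would only change what is passed as `times`.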

Via: backfill

GraphX: Graph Processing in a Distributed Dataflow Framework | Gonzalez, Xin, Dave, Crankshaw, Franklin, Stoica

Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, Ion Stoica; GraphX: Graph Processing in a Distributed Dataflow Framework; In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI); 2014-10-06; 16 pages; landing.


In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distributed dataflow frameworks, GraphX brings low-cost fault tolerance to graph processing. We evaluate GraphX on real workloads and demonstrate that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.
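
The claim that a graph computation can be expressed with a few dataflow operators (join, map, group-by) can be illustrated with PageRank over plain Python collections. This is a single-machine sketch, not the GraphX API: the function name `pagerank`, the damping value, the iteration count, and the unnormalized update rule are assumptions of this example. The vertex ranks play the role of one table and the edge list the other.

```python
from collections import defaultdict


def pagerank(vertices, edges, iterations=20, damping=0.85):
    """vertices: list of ids; edges: list of (src, dst) pairs.

    Uses the unnormalized update rank = (1 - damping) + damping * sum(incoming).
    """
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}  # the vertex "table"
    for _ in range(iterations):
        # join: look up the source vertex's rank for each edge row
        # map:  turn each joined row into a (dst, contribution) message
        messages = [(dst, ranks[src] / out_degree[src]) for src, dst in edges]
        # group-by: sum the messages per destination vertex
        incoming = defaultdict(float)
        for dst, contribution in messages:
            incoming[dst] += contribution
        # map: apply the update to every vertex
        ranks = {v: (1 - damping) + damping * incoming[v] for v in vertices}
    return ranks


cycle = pagerank(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")])
star = pagerank(["a", "b", "c"], [("b", "a"), ("c", "a")])
```

In a symmetric cycle every vertex keeps the same rank, while in the star the vertex with two incoming edges ends up ranked above the others.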


  • Alternates
    • Pregel
    • PowerGraph
    • MapReduce
    • Spark
    • Dryad
    • Naiad; Microsoft
    • DryadLINQ; Microsoft
    • Pig
    • GraphLINQ, within Naiad; Microsoft
    • Apache Spark
    • Apache Giraph
    • GraphLab (PowerGraph)
  • Applications
    • PageRank
    • community detection
    • latent factor analysis


  • property graph
  • power-law distributions
  • |E| >> |V|
  • vertex program (message passing to other vertices)
  • Bulk Synchronous Parallel (BSP)
  • Gather Apply Scatter (GAS) – decomposition, optimization
    • pull-based model
    • no direct communication between unconnected vertices
  • Compressed Sparse Row (CSR)
  • Map Reduce Triplets (MRT)
  • Graph Partitioning
  • Mirror Vertices
  • Active Vertices
  • Resilient Distributed Datasets (RDD)
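
As an illustration of the Compressed Sparse Row (CSR) entry above, here is a minimal sketch of building a CSR adjacency structure from an edge list. The names `build_csr` and `neighbors` are hypothetical, and vertex ids are assumed to be the integers 0 to n−1. All neighbor lists are packed into one array, with an offsets array marking where each vertex's slice begins.

```python
def build_csr(num_vertices, edges):
    """Return (offsets, targets) for a directed graph given as (src, dst) pairs.

    The out-neighbors of vertex v are targets[offsets[v]:offsets[v + 1]].
    """
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1             # count out-edges per vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]      # prefix sum turns counts into row starts
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()          # next free slot within each row
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(offsets, targets, v):
    """Out-neighbors of v as a contiguous slice of the packed targets array."""
    return targets[offsets[v]:offsets[v + 1]]


offsets, targets = build_csr(4, [(0, 1), (0, 2), (2, 3), (1, 2), (0, 3)])
```

Because each vertex's neighbors are contiguous, scanning them is a sequential read, which suits the |E| >> |V| regime noted above.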


  • GraphChi
  • X-Stream
  • CombBLAS
  • Resource Description Framework (RDF)


  • D. J. Abadi, A. Marcus, S. R. Madden, K. Hollenbach; SW-Store: A Vertically Partitioned DBMS for Semantic Web Data Management; In Proceedings of the Conference on Very Large Data Bases (VLDB); Volume 18, Number 2; 2009; pages 385–406.
  • F. N. Afrati, J. D. Ullman; Optimizing Joins in a Map-Reduce Environment; In Proceedings of the International Conference on Extending Database Technology (EDBT); 2010; pages 99–110.
  • S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, Y. A. Tian; A Comparison of Join Algorithms for Log Processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD); ACM; 2010; pages 975–986.
  • P. Boldi, M. Rosa, M. Santini, S. Vigna; Layered Label Propagation: A Multiresolution Coordinate-Free Ordering for Compressing Social Networks; In Proceedings of the Conference on the World Wide Web (WWW); 2011; pages 587–596.
  • P. Boldi, S. Vigna; The WebGraph Framework I: Compression Techniques. In Proceedings of the Conference on the World Wide Web (WWW); 2004.
  • J. Broekstra, A. Kampman, F.V. Harmelen; Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In Proceedings of the First International Semantic Web Conference (ISWC); 2002; pages 54–68.
  • A. Buluç, J.R. Gilbert; The Combinatorial BLAS: Design, Implementation and Applications; In International Journal of High Performance Computing Applications (IJHPCA); Volume 25, Number 4; 2011; pages 496–509.
  • Ü. V. Çatalyürek, C. Aykanat, B. Uçar; On Two-Dimensional Sparse Matrix Partitioning: Models, Methods and a Recipe; In SIAM Journal on Scientific Computing; Volume 32, Number 2; 2010; pages 656–683.
  • R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, E. Chen; Kineograph: Taking the Pulse of a Fast-Changing and Connected World. In Proceedings of EuroSys; 2012; pages 85–98.
  • J. Dean, S. Ghemawat; MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of Operating Systems Design and Implementation (OSDI); 2004.
  • S. Ewen, K. Tzoumas, M. Kaufmann, V. Markl; Spinning Fast Iterative Data Flows; In Proceedings of the Conference on Very Large Data Bases (VLDB); Volume 5, Number 11; 2012-07; pages 1268–1279.
  • U. Feige, M. Hajiaghayi, J.R. Lee; Improved Approximation Algorithms for Minimum-Weight Vertex Separators; In Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing (STOC); 2005; ACM; pages 563–572.
  • J.E. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin; Powergraph: Distributed Graph-Parallel Computation on Natural Graphs; In Proceedings of the Operating Systems Design and Implementation (OSDI); USENIX Association; 2012; pages 17–30.
  • A. Huang, W. Wu; Mining eCommerce Graph Data with Spark at Alibaba Taobao. 2014.
  • M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly; Dryad: Distributed Data-Parallel Programs From Sequential Building Blocks; In Proceedings of EuroSys; 2007; pages 59–72.
  • G. Karypis, V. Kumar; Multilevel K-Way Partitioning Scheme for Irregular Graphs; In Journal of Parallel and Distributed Computing; Volume 48, Issue 1; 1998; pages 96–129.
  • A. Kyrola, G. Blelloch, C. Guestrin; GraphChi: Large-Scale Graph Computation On Just a PC; In Proceedings of Operating Systems Design and Implementation (OSDI); 2012.
  • J. Leskovec, K. J. Lang, A. Dasgupta, M. W. Mahoney; Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters; In Internet Mathematics; Volume 6, Number 1; 2008; pages 29–123.
  • Y. Low, et al.; GraphLab: A New Parallel Framework for Machine Learning; In Proceedings of UAI; 2010; pages 340–349.
  • Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. M. Hellerstein; Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. In Proceedings of the Conference on Very Large Data Bases (VLDB); 2012.
  • L. F. Mackert, G. M. Lohman; R*-Optimizer Validation and Performance Evaluation for Distributed Queries. In Proceedings of the Conference on Very Large Data Bases (VLDB); 1986; pages 149–159.
  • G. Malewicz, M. H. Austern, A. J. Bik, J. Dehnert, I. Horn, N. Leiser, G. Czajkowski; Pregel: a System for Large-Scale Graph Processing; In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD); 2010; pages 135–146.
  • F. Manola, E. Miller; RDF Primer; W3C Recommendation 10; 2004; pages 1–107.
  • J. Mondal, A. Deshpande; Managing Large Dynamic Graphs Efficiently; In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD); 2012; ACM; pages 145–156.
  • D. Murray; Building new frameworks on Naiad; In His Blog; 2014-04.
  • D.G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, M. Abadi; Naiad: A Timely Dataflow System; In Proceedings of the Symposium on Operating Systems Principles (SOSP); 2013.
  • M. Najork, D. Fetterly, A. Halverson, K. Kenthapadi, S. Gollapudi; Of Hammers and Nails: An Empirical Comparison of Three Paradigms for Processing Large Graphs; In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM); 2012; ACM, pages 103–112.
  • T. Neumann, G. Weikum; RDF-3X: A RISC-style Engine for RDF. In Proceedings of the Conference on Very Large Data Bases (VLDB); 2008.
  • C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins; Pig Latin: A Not-so-Foreign Language for Data Processing; In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD); 2008.
  • L. Page, S. Brin, R. Motwani, T. Winograd; The Pagerank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66; InfoLab; Stanford University; 1999.
  • E. Prud’Hommeaux, A. Seaborne; SPARQL Query Language for RDF; 2008-01.
  • J.M. Pujol, V. Erramilli, G. Siganos, X. Yang, N. Laoutaris, P. Chhabra, P. Rodriguez; The Little Engine(s) That Could: Scaling Online Social Networks; In Proceedings of the Conference of the Special Interest Group on Communications (SIGCOMM); 2010; pages 375–386.
  • I. Robinson, J. Webber, E. Eifrem; Graph Databases; O’Reilly Media; 2013.
  • A. Roy, I. Mihailovic, W. Zwaenepoel; X-Stream: Edge-Centric Graph Processing Using Streaming Partitions; In Proceedings of the Symposium on Operating Systems Principles (SOSP); 2013; ACM; pages 472–488.
  • Y. Saad, Iterative Methods for Sparse Linear Systems; 2nd edition; Society for Industrial and Applied Mathematics; 2003.
  • I. Stanton, G. Kliot; Streaming Graph Partitioning for Large Distributed Graphs; Tech. Rep. MSR-TR-2011-121, Microsoft; 2011-11.
  • P. Stutz, A. Bernstein, W. Cohen; Signal/Collect: Graph Algorithms for the (Semantic) Web; In Proceedings of the International Semantic Web Conference (ISWC); 2010.
  • J. Ugander, L. Backstrom; Balanced Label Propagation for Partitioning Massive Graphs; In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM); 2013; ACM; pages 507–516.
  • M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica; Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing; In Proceedings of Networked Systems Design and Implementation (NSDI); 2012.

Via: backfill.

Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach | Schwartz, Eichstaedt, Kern, Dziurzynski, Ramones, Agrawal, Shah, Kosinski, Stillwell, Seligman, Ungar

H. Andrew Schwartz, Johannes C. Eichstaedt, Margaret L. Kern, Lukasz Dziurzynski, Stephanie M. Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell, Martin E. P. Seligman, Lyle H. Ungar; Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach; In PLoS One; 2013-09-23; 16 pages; landing.


We analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests, and found striking variations in language with personality, gender, and age. In our open-vocabulary technique, the data itself drives a comprehensive exploration of language that distinguishes people, finding connections that are not captured with traditional closed-vocabulary word-category analyses. Our analyses shed new light on psychosocial processes yielding results that are face valid (e.g., subjects living in high elevations talk about the mountains), tie in with other research (e.g., neurotic people disproportionately use the phrase ‘sick of’ and the word ‘depressed’), suggest new hypotheses (e.g., an active life implies emotional stability), and give detailed insights (males use the possessive ‘my’ when mentioning their ‘wife’ or ‘girlfriend’ more often than females use ‘my’ with ‘husband’ or ‘boyfriend’). To date, this represents the largest study, by an order of magnitude, of language and personality.


  • Differential Language Analysis (DLA)
  • Five Factor Model (FFM), Big Five
  • Linguistic Inquiry and Word Count (LIWC)
  • Open Vocabulary, Closed Vocabulary
  • Regression
    • L0 Norm
    • L1 Norm
  • multi-predictor to multi-output regression
  • World Well-Being Program
  • Method
    • Linguistic Feature Extraction
    • Correlational Analysis
    • Visualization
  • Pointwise Mutual Information (PMI)
  • Potts’ happyfuntokenizer for <3 and :-)
  • Personality Tests
    • My Personality, an app
    • International Personality Item Pool
    • NEO Personality Inventory Revised (NEO-PI-R)
  • Vocabularies
    • Language of Gender
    • Language of Age
    • Language of Personality
  • Latent Dirichlet Allocation (LDA)
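
Pointwise Mutual Information, listed above, scores how much more often two words occur together than chance would predict, which is one way to decide whether a word pair deserves to be kept as a phrase feature. The following is a minimal sketch of the standard bigram PMI definition, not the paper's exact feature-extraction pipeline; the function name `bigram_pmi` and the whitespace-style token list are assumptions for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in tokens.

    PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) ), with probabilities
    estimated from raw counts (no smoothing, no frequency cutoff).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_joint = count / n_bigrams
        p_independent = (unigrams[w1] / n_unigrams) * (unigrams[w2] / n_unigrams)
        scores[(w1, w2)] = math.log(p_joint / p_independent)
    return scores
```

A high score means the pair co-occurs far more than its parts would by chance; in practice a minimum count is usually required as well, since PMI overrates rare pairs.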


Via: backfill

Big Data and Privacy: A Technological Perspective | PCAST

Big Data and Privacy: A Technological Perspective; Executive Office of the President, President’s Council of Advisors on Science and Technology (PCAST); 2014-05-01; 76 pages; landing.



  • White House / UC Berkeley School of Information / Berkeley Center for Law and Technology; John Podesta; 2014-04-01; transcript, video.
  • White House / Data & Society Research Institute / NYU Information Law Institute; John Podesta; 2014-03-17; video.
  • White House / MIT; John Podesta; 2014-03-04; transcript, video.


PCAST Big Data and Privacy Working Group.
  • Susan L. Graham, co-chair.
  • William Press, co-chair.
  • S. James Gates, Jr.,
  • Mark Gorenberg,
  • John Holdren,
  • Eric S. Lander,
  • Craig Mundie,
  • Maxine Savitz,
  • Eric Schmidt.
  • Marjory S. Blumenthal, Executive Director of PCAST; coordination & framing.


  • John P Holdren, co-chair, OSTP
  • Eric S. Lander, co-chair, Broad Institute (Harvard&MIT)
  • William Press, co-vice chair, U. Texas
  • Maxine Savitz, co-vice chair, National Academy of Engineering
  • Rosina Bierbaum, U. Michigan
  • Christine Cassel, National Quality Forum
  • Christopher Chyba, Princeton
  • S. James Gates, Jr., U. Maryland
  • Mark Gorenberg, Zetta Venture Partners
  • Susan L. Graham, UCB
  • Shirley Ann Jackson, Rensselaer Polytechnic
  • Richard C. Levin, Yale
  • Chad Mirkin, Northwestern
  • Mario Molina, UCSD
  • Craig Mundie, Microsoft
  • Ed Penhoet, UCB
  • Barbara Schaal, Washington University
  • Eric Schmidt, Google
  • Daniel Schrag, Harvard


  • Marjory S. Blumenthal
  • Michael Johnson


From the Executive Summary [page xiii], and also from Section 5.2 [page 49]

  • Recommendation 1 [consider uses over collections activities]
    Policy attention should focus more on the actual uses of big data and less on its collection and analysis.
  • Recommendation 2 [no Microsoft lockin; no national champion]
    Policies and regulation, at all levels of government, should not embed particular technological solutions, but rather should be stated in terms of intended outcomes.
  • Recommendation 3 [fund]
    With coordination and encouragement from [The White House Office of Science and Technology Policy] OSTP, the [Networking and Information Technology Research and Development] NITRD agencies should strengthen U.S. research in privacy‐related technologies and in the relevant areas of social science that inform the successful application of those technologies.
  • Recommendation 4 [talk]
    OSTP, together with the appropriate educational institutions and professional societies, should encourage increased education and training opportunities concerning privacy protection, including career paths for professionals.
  • Recommendation 5 [talk & buy]
    The United States should take the lead both in the international arena and at home by adopting policies that stimulate the use of practical privacy‐protecting technologies that exist today. It can exhibit leadership both by its convening power (for instance, by promoting the creation and adoption of standards) and also by its own procurement practices (such as its own use of privacy‐preserving cloud services).

Table of Contents

  1. Executive Summary
  2. Introduction
    1. Context and outline of this report
    2. Technology has long driven the meaning of privacy
    3. What is different today?
    4. Values, harms, and rights
  3. Examples and Scenarios
    1. Things happening today or very soon
    2. Scenarios of the near future in healthcare and education
    3. Healthcare: personalized medicine
    4. Healthcare: detection of symptoms by mobile devices
    5. Education
    6. Challenges to the home’s special status
    7. Tradeoffs among privacy, security, and convenience
  4. Collection, Analytics, and Supporting Infrastructure
    1. Electronic sources of personal data
      1. “Born digital” data
      2. Data from sensors
    2. Big data analytics
      1. Data mining
      2. Data fusion and information integration
      3. Image and speech recognition
      4. Social‐network analysis
    3. The infrastructure behind big data
      1. Data centers
      2. The cloud
  5. Technologies and Strategies for Privacy Protection
    1. The relationship between cybersecurity and privacy
    2. Cryptography and encryption
      1. Well Established encryption technology
      2. Encryption frontiers
    3. Notice and consent
    4. Other strategies and techniques
      1. Anonymization or de‐identification
      2. Deletion and non‐retention
    5. Robust technologies going forward
      1. A Successor to Notice and Consent
      2. Context and Use
      3. Enforcement and deterrence
      4. Operationalizing the Consumer Privacy Bill of Rights
  6. PCAST Perspectives and Conclusions
    1. Technical feasibility of policy interventions
    2. Recommendations
    3. Final Remarks
  7. Appendix A. Additional Experts Providing Input
  8. Special Acknowledgment


  • The President’s Council of Advisors on Science and Technology (PCAST)
  • PCAST Big Data and Privacy Working Group
  • Enabling Event
    • President Barack Obama
    • Remarks, 2014-01-17
    • Counselor John Podesta
  • New Concerns
    • Born digital vs born analog
    • standardized components
    • particular limited purpose vs repurposed, reused.
    • data fusion
    • algorithms
    • inferences
  • Provenance of data, recording and tracing the provenance of data
  • Trusted Data Format (TDF)


  • Right to forget, right to be forgotten is unenforceable and infeasible [page 48].
  • Prior redress of prospective harms is a reasonable framework [page 49]
    • Conceptualized as vulnerable groups who are stipulated as harmed a priori or are harmed sunt constitua.
  • Government may be forbidden from certain classes of uses, despite their being available in the private sector.

    • Government is allowed some activities and powers
    • Private industry is allowed some activities and powers
    • It is feasible in practice to mix & match
      • government coercion => private privilege => result
      • private privilege => private coercion => result

Consumer Privacy Bill of Rights (CPBR)

Obligations [of service providers, as powerful organizations]

  • Respect for Context => use consistent with collection context.
  • Focused Collection => limited collection.
  • Security => handling techniques
  • Accountability => handling techniques.

Empowerments [of consumers, as individuals]

  • Individual Control => control of collection, control of use.
  • Transparency => of practices [by service providers]
  • Access and Accuracy => right to review & edit [something about proportionality]

Definition of Privacy

The definition is unclear and evolving. It is frequently defined in terms of the harms incurred when it is lost.

Privacy Framework via Harms

The Prosser Harms, <quote> page 6.

  1. Intrusion upon seclusion. A person who intentionally intrudes, physically or otherwise (now including electronically), upon the solitude or seclusion of another person or her private affairs or concerns, can be subject to liability for the invasion of her privacy, but only if the intrusion would be highly offensive to a reasonable person.
  2. Public disclosure of private facts. Similarly, a person can be sued for publishing private facts about another person, even if those facts are true. Private facts are those about someone’s personal life that have not previously been made public, that are not of legitimate public concern, and that would be offensive to a reasonable person.
  3. “False light” or publicity. Closely related to defamation, this harm results when false facts are widely published about an individual. In some states, false light includes untrue implications, not just untrue facts as such.
  4. Misappropriation of name or likeness. Individuals have a “right of publicity” to control the use of their name or likeness in commercial settings.



<quote>One perspective informed by new technologies and technology‐mediated communication suggests that privacy is about the “continual management of boundaries between different spheres of action and degrees of disclosure within those spheres,” with privacy and one’s public face being balanced in different ways at different times. See: Leysia Palen, Paul Dourish; Unpacking ‘Privacy’ for a Networked World; In Proceedings of CHI 2003, Association for Computing Machinery, 2003-04-05.</quote>, footnote, page 7.

Adjacency Theory

An oppositional framework wherein harms are “adjacent to” benefits:

  • Invasion of private communications
  • Invasion of privacy in a person’s virtual home.
  • Public disclosure of inferred private facts
  • Tracking, stalking and violations of locational privacy.
  • Harm arising from false conclusions about individuals, based on personal profiles from big‐data analytics.
  • Foreclosure of individual autonomy or self‐determination
  • Loss of anonymity and private association.

Mosaic Theory

Obliquely referenced via a quote from Sotomayor.
<quote>“I would ask whether people reasonably expect that their movements will be recorded and aggregated in a manner that enables the Government to ascertain, more or less at will, their political and religious beliefs, sexual habits, and so on.” United States v. Jones (10‐1259), Sotomayor concurrence.</quote>

Yet, not cited, but related (at least):

Definition of Roles [of data processors]

  • data collectors
  • data analyzers
  • data users

The data generators or producers in this roles framework are substantially only customers or consumers (sic).


  • Definition of analysis versus use
    • <quote>Analysis, per se, does not directly touch the individual (it is neither collection nor, without additional action, use) and may have no external visibility.
    • & by contrast, it is the use of a product of analysis, whether in commerce, by government, by the press, or by individuals, that can cause adverse consequences to individuals.</quote>
  • Big Data => definitions
    • [comprises data with] “high‐volume, high‐velocity and high‐variety information assets that demand cost‐effective, innovative forms of information processing for enhanced insight and decision making,” attributed to Gartner Inc.
    • “a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to, NoSQL, MapReduce, and machine learning.” attributed to “computer scientists” on arXiv.


The strong, direct, unequivocal, un-nuanced, provocative language…

<quote>For a variety of reasons, PCAST judges anonymization, data deletion, and distinguishing data from metadata (defined below) to be in this category. The framework of notice and consent is also becoming unworkable as a useful foundation for policy.</quote>

<quote>Anonymization is increasingly easily defeated by the very techniques that are being developed for many legitimate applications of big data. In general, as the size and diversity of available data grows, the likelihood of being able to re‐identify individuals (that is, re‐associate their records with their names) grows substantially. While anonymization may remain somewhat useful as an added safeguard in some situations, approaches that deem it, by itself, a sufficient safeguard need updating. </quote>

<quote>Notice and consent is the practice of requiring individuals to give positive consent to the personal data collection practices of each individual app, program, or web service. Only in some fantasy world do users actually read these notices and understand their implications before clicking to indicate their consent. <snip/>The conceptual problem with notice and consent is that it fundamentally places the burden of privacy protection on the individual. Notice and consent creates a non‐level playing field in the implicit privacy negotiation between provider and user. The provider offers a complex, take‐it‐or‐leave‐it set of terms, while the user, in practice, can allocate only a few seconds to evaluating the offer. This is a kind of market failure. </quote>

<quote>Also rapidly changing are the distinctions between government and the private sector as potential threats to individual privacy. Government is not just a “giant corporation.” It has a monopoly in the use of force; it has no direct competitors who seek market advantage over it and may thus motivate it to correct missteps. Governments have checks and balances, which can contribute to self‐imposed limits on what they may do with people’s information. Companies decide how they will use such information in the context of such factors as competitive advantages and risks, government regulation, and perceived threats and consequences of lawsuits. It is thus appropriate that there are different sets of constraints on the public and private sectors. But government has a set of authorities – particularly in the areas of law enforcement and national security – that place it in a uniquely powerful position, and therefore the restraints placed on its collection and use of data deserve special attention. Indeed, the need for such attention is heightened because of the increasingly blurry line between public and private data. While these differences are real, big data is to some extent a leveler of the differences between government and companies. Both governments and companies have potential access to the same sources of data and the same analytic tools. Current rules may allow government to purchase or otherwise obtain data from the private sector that, in some cases, it could not legally collect itself, or to outsource to the private sector analyses it could not itself legally perform. [emphasis here] The possibility of government exercising, without proper safeguards, its own monopoly powers and also having unfettered access to the private information marketplace is unsettling.</quote>


Substantially in order of appearance in the footnotes, without repeats.

Via: backfill, backfill


And yet even with all the letters and professional editing and techwriting staff available to this national- and historical-level enterprise we still see [Footnote 101, page 31]

Qi, H. and A. Gani, “Research on mobile cloud computing: Review, trend and perspectives,” Digital Information and Communication Technology and it’s Applications (DICTAP), 2012 Second International Conference on, 2012.

The correct listing is at Springer

Digital Information and Communication Technology and Its Applications; International Conference, DICTAP 2011, Dijon, France, June 21-23, 2011. Proceedings, Part I, Series: Communications in Computer and Information Science, Vol. 166 Cherifi, Hocine, Zain, Jasni Mohamad, El-Qawasmeh, Eyas (Eds.) 2011, XIV, 806 p.


  • it’s → is a contraction for it is
  • its → is a possessive

Ergo: s/it's/its/g;

Handbook of Data Analytics | Leada


  • Brian Liou
  • Tristan Tao
  • Elizabeth Lin

Shop: Leada, a consultancy




  • Not a “handbook” in the sense that it’s not recipes for HOWTO at all.
  • Motivational interviews in a Q&A style; ~5 pages each
  • Career Advice.
  • Career Attractor.
  • No Math, Algos, Results.


  • What exactly is a data scientist anyway, and how is it different than a data analyst?
  • Who buys this stuff anyway?
  • What skills do such people need? [someone answers: PowerPoint 2-pager]
  • How does interviewing work in this area?
  • They interview for specific task-level skills; show passion, show “hunger to learn”


  • <bzzzz>Big Data</bzzzz>
  • <zzzz>B2B</zzzz>
  • <zzzz>BI</zzzz>
  • <zzzz>CRM</zzzz>


  • Cassandra
  • Excel
  • Hadoop™ MapReduce
  • Hive
  • Java
  • Kaggle
  • MongoDB
  • PDF
  • SAS
  • Storm
  • NoSQL
  • Pig
  • Python
  • R
  • Word


  • regression
  • t-test
  • Algorithm complexity (Big O notation)
  • Machine Learning (general)
  • Natural Language Processing (NLP)


  • BigML
  • Cloudera
  • Facebook
  • Flurry
  • HG Data
  • LinkedIn
  • Mode Analytics
  • Persontyle
  • Smarter Remarketer Inc.
  • Stylistics
  • Yelp
  • Yhat


  • C++ programming, “low-level systems programming”
  • Quality Control
  • Computer Science
  • Economics
  • Humanities, generally
  • Parasitology
  • Philosophy

Via: backfill

Private traits and attributes are predictable from digital records of human behavior | Kosinski, Stillwell, Graepel

Michal Kosinski, David Stillwell, Thore Graepel; Private traits and attributes are predictable from digital records of human behavior; In Proceedings of the National Academy of Sciences of the United States of America (PNAS); 2013-02-12; 4 pages; landing.


We show that easily accessible digital records of behavior, Facebook Likes, can be used to automatically and accurately predict a range of highly sensitive personal attributes including: sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender. The analysis presented is based on a dataset of over 58,000 volunteers who provided their Facebook Likes, detailed demographic profiles, and the results of several psychometric tests. The proposed model uses dimensionality reduction for preprocessing the Likes data, which are then entered into logistic/linear regression to predict individual psychodemographic profiles from Likes. The model correctly discriminates between homosexual and heterosexual men in 88% of cases, African Americans and Caucasian Americans in 95% of cases, and between Democrat and Republican in 85% of cases. For the personality trait “Openness,” prediction accuracy is close to the test–retest accuracy of a standard personality test. We give examples of associations between attributes and Likes and discuss implications for online personalization and privacy.


  • You Are What You Like, promotional site.
  • Singular Value Decomposition (SVD)
  • Pseudo-Inverse of a Matrix
  • Five Factor Model (FFM)
    • Dimensions
      1. Openness to Experience
      2. Conscientiousness
      3. Extraversion
      4. Agreeableness
      5. Emotional Stability
    • Instruments
      • NEO Personality Inventory (NEO-PI-R)
      • NEO Five-Factor Inventory (NEO-FFI)
  • Intelligence
    • Raven’s Standard Progressive Matrices (SPM)
    • Spearman’s Theory of General Ability
  • International Personality Item Pool (IPIP)
  • Satisfaction With Life (SWL)
  • myPersonality Project
  • Receiver-Operating Characteristic (ROC)
  • Area Under [the] Curve (AUC)
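
The "correctly discriminates ... in 88% of cases" phrasing in the abstract matches the usual reading of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the function name `auc` and the 0/1 label encoding are assumptions for illustration, and the quadratic loop is meant for clarity rather than for large datasets.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels; scores: matching model scores.
    Counts the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as half a correct ranking.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random scoring and 1.0 to perfect separation of the two classes.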


  • Lazer D, et al. (2009) Computational social science. In Science 323(5915):721–723.
  • Koren Y, Bell R, Volinsky C (2009) Matrix factorization techniques for recommender systems. In Computer 42(8):30–37.
  • Chen Y, Pavlov D, Canny JF (2009) Large-scale behavioral targeting. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), pp 209–218.
  • Butler D (2007) Data sharing threatens privacy. In Nature 449(7163):644–645.
  • Narayanan A, Shmatikov V (2008) Robust de-anonymization of large sparse datasets. In Proceedings of the IEEE Symposium on Security and Privacy, pp 111–125.
  • Duhigg C (2012) The Power of Habit: Why We Do What We Do in Life and Business (Random House, New York).
  • Ince HO, Yarali A, Özsel D (2009) Customary killings in Turkey and Turkish modernization. In Middle East Studies 45(4):537–551.
  • Fast LA, Funder DC (2008) Personality as manifest in word use: Correlations with self-report, acquaintance report, and behavior. In Journal of Personality and Social Psychology 94(2):334–346.
  • Costa PT, McCrae RR (1992) Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) Manual (Psychological Assessment Resources, Odessa, FL).
  • Gosling SD, Ko SJ, Mannarelli T, Morris ME (2002) A room with a cue: Personality judgments based on offices and bedrooms. In Journal of Personality and Social Psychology 82(3):379–398.
  • Hu J, Zeng H-J, Li H, Niu C, Chen Z (2007) Demographic prediction based on user’s browsing behavior. In Proceedings of the International World Wide Web Conference (WWW), pp 151–160.
  • Murray D, Durrell K (1999) Inferring demographic attributes of anonymous Internet users. In Revised Papers from the International Workshop on Web Usage Analysis and User Profiling, eds Masand BM, Spiliopoulou M (Springer, London), pp 7–20.
  • De Bock K, Van Den Poel D (2010) Predicting website audience demographics for Web advertising targeting using multi-website clickstream data. In Fundamenta Informaticae 98(1):49–70.
  • Goel S, Hofman JM, Sirer MI (2012) Who does what on the Web: Studying Web browsing behavior at scale. In International Conference on Weblogs and Social Media, pp 130–137.
  • Kosinski M, Kohli P, Stillwell DJ, Bachrach Y, Graepel T (2012) Personality and website choice. In Proceedings of the ACM Web Science Conference, pp 251–254.
  • Marcus B, Machilek F, Schütz A (2006) Personality in cyberspace: Personal Web sites as media for personality expressions and impressions. In Journal of Personality and Social Psychology 90(6):1014–1031.
  • Rentfrow PJ, Gosling SD (2003) The do re mi’s of everyday life: The structure and personality correlates of music preferences. In Journal of Personality and Social Psychology 84(6):1236–1256.
  • Quercia D, Lambiotte R, Kosinski M, Stillwell D, Crowcroft J (2012) The Personality of popular Facebook users. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (CSCW), 2012, pp 955–964.
  • Bachrach Y, Kohli P, Graepel T, Stillwell DJ, Kosinski M (2012) Personality and patterns of Facebook usage. In Proceedings of the ACM Web Science Conference, pp 36–44.
  • Quercia D, Kosinski M, Stillwell DJ, Crowcroft J (2011) Our Twitter profiles, our selves: Predicting personality with Twitter. In Proceedings of the 2011 IEEE International Conference on Privacy, Security, Risk, and Trust, or maybe in Proceedings of the IEEE International Conference on Social Computing, pp 180–185.
  • Golbeck J, Robles C, Edmondson M, Turner K (2011) Predicting personality from Twitter. In Proceedings of the IEEE International Conference on Social Computing, pp 149–156.
  • Golbeck J, Robles C, Turner K (2011) Predicting personality with social media. In Proceedings of the Conference on Human Factors in Computing Systems (CHI), pp 253–262.
  • Jernigan C, Mistree BF (2009) Gaydar: Facebook friendships expose sexual orientation. First Monday 14(10).
  • Golub GH, Kahan W (1965) Calculating the singular values and pseudo-inverse of a matrix. In Journal Society for Industrial & Applied Math (SIAM) 2(2):205–224; also as Journal of SIAM Numerical Analysis, B 2(2).
  • Goldberg LR, et al. (2006) The international personality item pool and the future of public-domain personality measures. In Journal of Research in Personality 40(1):84–96.
  • Raven JC (2000) The Raven’s progressive matrices: Change and stability over culture and time. In Cognitive Psychology 41(1):1–48.
  • Diener E, Emmons RA, Larsen RJ, Griffin S (1985) The satisfaction with life scale. In Journal of Personality Assessment 49(1):71–75.
  • Musick K, Meier A (2010) Are both parents always better than one? Parental conflict and young adult well-being. In Social Science Research 39(5):814–830.
  • Schimmack U, Diener E, Oishi S (2002) Life-satisfaction is a momentary judgment and a stable personality characteristic: The use of chronically accessible and stable sources. In Journal of Personality 70(3):345–384.
  • Nass C, Lee KM (2000) Does computer-generated speech manifest personality? An experimental test of similarity-attraction. In Journal of Experimental Psychology 7(3):171–181.


  • Costa PT, McCrae RR (1992) Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) Manual (Psychological Assessment Resources, Odessa, FL).
  • Goldberg LR, et al. (2006) The international personality item pool and the future of public-domain personality measures. In Journal of Research in Personality 40(1):84–96.
  • Raven JC (2000) The Raven’s progressive matrices: change and stability over culture and time. In Cognitive Psychology 41(1):1–48.
  • Lubinski D (2004) Introduction to the special section on cognitive abilities: 100 years after Spearman’s (1904) “‘General intelligence,’ objectively determined and measured”. In Journal of Personality and Social Psychology 86(1):96–111.
  • Diener E, Emmons RA, Larsen RJ, Griffin S (1985) The satisfaction with life scale. In Journal of Personality Assessment 49(1):71–75.
  • Golub GH, Kahan W (1965) Calculating the singular values and pseudo-inverse of a matrix. In Journal Society for Industrial & Applied Math (SIAM) 2(2):205–224.


Fast Unfolding of Communities in Large Networks | Blondel, Guillaume, Lambiotte, Lefebvre

Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre; Fast unfolding of communities in large networks; In Journal of Statistical Mechanics: Theory and Experiment; Volume 10; 2008; 12 pages; landing


We propose a simple method to extract the community structure of large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection methods in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity. This is shown first by identifying language communities in a Belgian mobile phone network of 2.6 million customers and by analyzing a web graph of 118 million nodes and more than one billion links. The accuracy of our algorithm is also verified on ad-hoc modular networks.


The Louvain Method
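
The abstract measures community quality by modularity. As a rough illustration, here is a minimal sketch of the standard modularity score Q for an undirected, unweighted graph, written as Q = Σ_c [L_c/m − (d_c/2m)²], where L_c is the number of edges inside community c, d_c is the total degree of its nodes, and m is the number of edges. This is only the quality function, not the paper's heuristic; the function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to a community label.
    """
    m = 0
    degree_sum = defaultdict(int)  # total degree of the nodes in each community
    internal = defaultdict(int)    # edges with both endpoints in the community
    for u, v in edges:
        m += 1
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    if m == 0:
        return 0.0
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Hypothetical toy graph: two triangles joined by a single bridge edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
TWO_GROUPS = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
ONE_GROUP = {node: "all" for node in range(6)}
```

On this toy graph, splitting along the bridge gives Q = 5/14 (about 0.357), while putting every node in one community gives Q = 0, which is why a modularity-optimizing method would prefer the split.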


Via: backfill

Inferring Trip Destinations From Driving Habits Data | Dewri, Annadata, Eltarjaman, Thurimella

Rinku Dewri, Prasad Annadata, Wisam Eltarjaman, Ramakrishna Thurimella; Inferring Trip Destinations From Driving Habits Data; In Proceedings of Workshop on Privacy in the Electronic Society (WPES); 2013; 9 pages.


The collection of driving habits data is gaining momentum as vehicle telematics based solutions become popular in consumer markets such as auto-insurance and driver assistance services. These solutions rely on driving features such as time of travel, speed, and braking to assess accident risk and driver safety. Given the privacy issues surrounding the geographic tracking of individuals, many solutions explicitly claim that the customer’s GPS coordinates are not recorded. Although revealing driving habits can give us access to a number of innovative products, we believe that the disclosure of this data only offers a false sense of privacy. Using speed and time data from real world driving trips, we show that the destinations of trips may also be determined without having to record GPS coordinates. Based on this, we argue that customer privacy expectations in non-tracking telematics applications need to be reset, and new policies need to be implemented to inform customers of possible risks.


  • Products
    • Progressive’s Snapshot,
    • Allstate’s Drivewise,
    • State Farm’s In-Drive,
    • National General Insurance’s Low-Mileage Discount,
    • Travelers’ Intellidrive,
    • Esurance’s Drivesense,
    • Safeco’s Rewind,
    • Aviva’s Drive,
    • Amaguiz PAYD,
    • Insure The Box,
    • Cover-box,
    • Ingenie,
    • MyDrive.
  • Quasi-identifiers
  • Telematics
  • OnStar
  • OBD-II
  • LandAirSea GPS Tracking Key
  • OpenStreetMap
  • Stop Points
  • Depth-First Search (DFS)
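
To make the "Stop Points" idea concrete: the abstract says destinations can be inferred from speed and time data alone. A minimal sketch of the first step one might take, assuming samples are (seconds, metres-per-second) pairs, is to find intervals where the vehicle stands still and to integrate speed between them to get the distance driven. This is an illustration under those assumptions, not the authors' procedure; the function names, the threshold, and the sample trace are hypothetical.

```python
def stop_points(samples, min_stop_s=3.0):
    """Return (start, end) time intervals where speed stays at zero
    for at least min_stop_s seconds.

    samples: list of (time_s, speed_m_per_s) pairs in time order.
    """
    stops = []
    start = None       # time of the first zero-speed sample in the current run
    last_zero = None   # time of the latest zero-speed sample in the current run
    for t, speed in samples:
        if speed == 0:
            if start is None:
                start = t
            last_zero = t
        else:
            if start is not None and last_zero - start >= min_stop_s:
                stops.append((start, last_zero))
            start = None
    # Close a run that lasts until the end of the trace.
    if start is not None and last_zero - start >= min_stop_s:
        stops.append((start, last_zero))
    return stops


def distance_between(samples, t0, t1):
    """Trapezoidal integration of speed over [t0, t1], in metres."""
    total = 0.0
    for (ta, va), (tb, vb) in zip(samples, samples[1:]):
        if ta >= t0 and tb <= t1:
            total += 0.5 * (va + vb) * (tb - ta)
    return total


# Hypothetical one-sample-per-second trace: stop, short drive, stop, move off.
SPEEDS = [0, 0, 0, 0, 5, 10, 10, 5, 0, 0, 0, 0, 0, 3]
SAMPLES = list(enumerate(SPEEDS))
```

With this trace, the two stops span seconds 0 to 3 and 8 to 12, and the distance driven between them is 30 metres. A sequence of such inter-stop distances is the kind of quasi-identifier that could then be matched against a road map.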

Via: backfill, backfill

Personal Data Project at the World Economic Forum

Rethinking Personal Data
Rethinking Personal Data Project

Unlocking the Value of Personal Data: From Collection to Usage; 2013-02; 36 pages; landing.

Annual Meeting of the New Champions: Unlocking the Value of Data; Conference Summary; Tianjin, CN; 2012-09-12; 6 pages; landing.

Unlocking the Economic Value of Personal Data: Balancing Growth and Protection; Workshop Summary; Brussels, BE; 2012-10-08; 14 pages.

Rethinking Personal Data: Strengthening Trust; 2012-05-14; 36 pages; landing.

Personal Data: The Emergence of a New Asset Class; 2011-02-17; 40 pages; landing.

Tracking Sentiment in Mail: How Genders Differ on Emotional Axes | Mohammad, Yang

Saif M. Mohammad, Tony (Wenda) Yang; Tracking Sentiment in Mail: How Genders Differ on Emotional Axes; In Proceedings of the ACL Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA); 2011-06; 10 pages; Also available at arXiv; 2013-09-24.


With the widespread use of email, we now have access to unprecedented amounts of text that we ourselves have written. In this paper, we show how sentiment analysis can be used in tandem with effective visualizations to quantify and track emotions in many types of mail. We create a large word–emotion association lexicon by crowdsourcing, and use it to compare emotions in love letters, hate mail, and suicide notes. We show that there are marked differences across genders in how they use emotion words in work-place email. For example, women use many words from the joy–sadness axis, whereas men prefer terms from the fear–trust axis. Finally, we show visualizations that can help people track emotions in their emails.

From Once Upon a Time to Happily Ever After: Tracking Emotions in Novels and Fairy Tales | Mohammad

Saif Mohammad; From Once Upon a Time to Happily Ever After: Tracking Emotions in Novels and Fairy Tales; In Proceedings of the ACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH); 2011; Also available at arXiv; 2013-09-23; 10 pages.


Today we have access to unprecedented amounts of literary texts. However, search still relies heavily on keywords. In this paper, we show how sentiment analysis can be used in tandem with effective visualizations to quantify and track emotions in both individual books and across very large collections. We introduce the concept of emotion word density, and using the Brothers Grimm fairy tales as an example, we show how collections of text can be organized for better search. Using the Google Books Corpus we show how to determine an entity’s emotion associations from co-occurring words. Finally, we compare emotion words in fairy tales and novels, to show that fairy tales have a much wider range of emotion word densities than novels.
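
As a rough illustration of emotion word density, here is a minimal sketch that assumes density means the number of words associated with an emotion per fixed number of tokens. The tiny lexicon, the tokenizer, and the function name are hypothetical stand-ins; the paper relies on a crowdsourced word–emotion association lexicon.

```python
import re
from collections import Counter

# Hypothetical toy lexicon mapping a word to the emotions it is associated with.
TOY_LEXICON = {
    "happy": {"joy"},
    "joyful": {"joy"},
    "afraid": {"fear"},
    "wolf": {"fear"},
    "friend": {"joy", "trust"},
}


def emotion_word_density(text, lexicon, per=10_000):
    """Count emotion-associated words per `per` tokens of text.

    Returns a dict mapping each emotion seen in the text to its density.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    counts = Counter()
    for token in tokens:
        for emotion in lexicon.get(token, ()):
            counts[emotion] += 1
    return {emotion: n * per / len(tokens) for emotion, n in counts.items()}


SENTENCE = "The happy girl was afraid of the wolf"
```

For the eight-token sentence above and `per=100`, the sketch reports a joy density of 12.5 and a fear density of 25.0. Computing the same figure per text is what allows collections to be compared or sorted by emotion.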

Via: backfill

A Very Short History Of Big Data | Gil Press, Forbes

Gil Press; A Very Short History Of Big Data; In Forbes; 2013-05-09.

Via: A Very Short History of Big Data on


Indirect Sources

i.e. not listed directly, but cited.

  • Istvan Dienes; National Accounting of Information; Reference Manual of SNIA, Version v1.1; 1994; 291 pages.

    • SNA vs SNIA
      • S-something N-something Accounting
      • S-something N-something Information Accounting
    • SNA92 is authoritative
  • Alistair D. Duff; The Information Society Studies; Routledge; 2000-06-01; 216 pages; $200.
  • Andrew Odlyzko (started) Minnesota Internet Traffic Studies (MINTS); 2002-2009; tracking the growth in Internet traffic.
  • Martin Hilbert; How to Measure “How Much Information”? Theoretical, Methodological, and Statistical Challenges for the Social Sciences; In International Journal of Communication (IJOC); Vol 6; 2012; 14 pages.
    • This is an introduction to a ‘special section’ issue of the IJOC on information & measurement studies.
    • Conclusions (in the article and the subsequent articles of the special section)
      1. It is not only statistically feasible, but also analytically insightful to quantify the amount of information handled by society.
      2. However, many of the available sources are not very solid, and the methodologies are still maturing.
      3. The research question and its theoretical framework have defined the methodology, including the choice of the indicator.
      4. There is still no consensus on how to define the most fundamental measures for data and information.
      5. Information quantity is not equal to information quality or information value, but the second requires the first.
      6. Will it be possible and/or useful to harmonize information accounts?

openPDS – The privacy-preserving Personal Data Store

Buzzy Terms

  • A full Trust Network reference platform.
  • Privacy-preserving group computation.


  • A Personal Data Store (PDS) is a service (a server) that answers questions, rather than aggregating and (re-)serving a profile.
  • It responds to questions about claims; e.g. is over 18, is right-handed, has a driver’s license.
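
The "answers questions" idea can be sketched as follows: a store keeps the raw record private and only returns yes/no answers to a fixed set of questions. This is a minimal sketch under that assumption, not the openPDS API; the class, the question names, and the record fields are all hypothetical.

```python
from datetime import date


class PersonalDataStore:
    """Toy PDS: holds a private record and answers whitelisted yes/no questions."""

    def __init__(self, record):
        self._record = dict(record)  # raw data is never returned to callers
        self._questions = {
            "is_over_18": self._is_over_18,
            "is_right_handed": self._is_right_handed,
            "has_driver_license": self._has_driver_license,
        }

    def _is_over_18(self, today):
        born = self._record["birth_date"]
        # Subtract one year if this year's birthday has not happened yet.
        had_birthday = (today.month, today.day) >= (born.month, born.day)
        age = today.year - born.year - (0 if had_birthday else 1)
        return age >= 18

    def _is_right_handed(self, today):
        return self._record.get("handedness") == "right"

    def _has_driver_license(self, today):
        return bool(self._record.get("driver_license"))

    def answer(self, question, today=None):
        """Return True or False for a supported question, else raise KeyError."""
        if question not in self._questions:
            raise KeyError(f"unsupported question: {question}")
        return self._questions[question](today or date.today())
```

A caller asking `answer("is_over_18")` learns a single bit, while asking for `"birth_date"` raises `KeyError`, so the underlying profile is never aggregated or re-served.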


Via: backfill

ID3 Popularizations



Implementations of a PDS to hold personal data, and provide answers to questions about that data.


  • 16624 LOC overall
  • 2801 LOC Python
  • 4255 LOC JavaScript
$ find openPDS -name .git -prune -o -print | sort