Exploratory data analysis: Our free online course

Moira Burke, Solomon Messing, Chris Saden, and I have created a new online course on exploratory data analysis (EDA) as part of Udacity’s “Data Science” track. It is designed to teach students how to explore data sets. Students learn how to do EDA using R and the visualization package ggplot.

We emphasize the value of EDA for building and testing intuitions about a data set, identifying problems or surprises in data, summarizing variables and relationships, and supporting other data analysis tasks. The course materials are all free, and you can also sign up for tutoring, grading (especially useful for the final project), and certification.

In between providing general advice on data analysis and visualization, stepping students through exactly how to produce particular plots, and reasoning about how the data can answer questions of interest, the course includes interviews with four of our amazing colleagues on the Facebook Data Science team.

One unique feature of this course is that one of the data sets we use is a “pseudo-Facebook” data set that Moira and I created to share many features with real Facebook data, but to not describe any particular real Facebook users or reveal certain kinds of information about aggregate behavior. Other data sets used in the course include two different data sets giving sale prices for diamonds and panel “scanner” data describing yogurt purchases.

It was a fascinating and novel process putting together this course. We scripted almost everything in detail in advance, before any filming started, using first outlines, then drafts using Markdown in R with knitr, and then more detailed scripts with Udacity-specific notation for all the different shots and interspersed quizzes. I think this is part of what leads Kaiser Fung to write:

The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality.

Check out the course and let me know what you think — we’re still incorporating feedback.

Interpreting discrete-choice models

Are individuals random-utility maximizers? Or do individuals have private knowledge of shocks to their utility?

“McFadden (1974) observed that the logit, probit, and similar discrete-choice models have two interpretations. The first interpretation is that of individual random utility. A decisionmaker draws a utility function at random to evaluate a choice situation. The distribution of choices then reflects the distribution of utility, which is the object of econometric investigation. The second interpretation is that of a population of decision makers. Each individual in the population has a deterministic utility function. The distribution of choices in the population reflects the population distribution of preferences. … One interpretation of this game theoretic approach is that the econometrician confronts a population of random-utility maximizers whose decisions are coupled. These models extend the notion of Nash equilibrium to random-utility choice. The other interpretation views an individual’s shock as known to the individual but not to others in the population (or to the econometrician). In this interpretation, the Brock-Durlauf model is a Bayes-Nash equilibrium of a game with independent types, where the type of individual i is the pair (x_i, e_i). Information is such that the first component of each player i’s type is common knowledge, while the second is known only to player i.” — Blume, Brock, Durlauf & Ioannides. 2011. Identification of Social Interactions. Handbook of Social Economics, Volume 1B.
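To make the two readings concrete, here is a minimal simulation sketch in Python. It assumes the standard logit setup, in which each alternative's utility is a systematic part plus an independent standard Gumbel shock; the function names and the example utilities are illustrative and not taken from the handbook chapter. Each simulated draw can be read either as one decision maker re-drawing a utility function or as a different member of a population with fixed preferences. The resulting choice shares are the same under both readings, which is why choice data alone cannot tell the two interpretations apart.

```python
import math
import random


def logit_probabilities(v):
    """Closed-form multinomial logit probabilities for systematic utilities v."""
    m = max(v)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in v]
    total = sum(weights)
    return [w / total for w in weights]


def gumbel_draw(rng):
    """One standard Gumbel (type I extreme value) draw via the inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() can return exactly 0.0
        u = rng.random()
    return -math.log(-math.log(u))


def simulate_choice_shares(v, n_draws, seed=0):
    """Share of draws in which each alternative has the highest utility.

    Each draw adds a fresh Gumbel shock to every alternative. Whether a draw
    is "the same person again" or "another person" is a matter of
    interpretation only; the code is identical either way.
    """
    rng = random.Random(seed)
    counts = [0] * len(v)
    for _ in range(n_draws):
        utilities = [x + gumbel_draw(rng) for x in v]
        counts[utilities.index(max(utilities))] += 1
    return [c / n_draws for c in counts]


if __name__ == "__main__":
    v = [0.0, 1.0, -0.5]  # illustrative systematic utilities (an assumption)
    print("closed form:", logit_probabilities(v))
    print("simulated:  ", simulate_choice_shares(v, 50_000, seed=1))
```

With enough draws, the simulated shares should approach the closed-form logit probabilities.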

Do what the virtuous person would do?

In the film The Descendants, George Clooney’s character Matt King wrestles, sometimes comically, with new and old choices involving his family and Hawaii. In one case, King decides he wants to meet a rival, both just to meet him and to give him some news; that is, he has (at least explicitly) a generally good reason to meet him. Perhaps he even ought to meet him. But when he actually does meet him, he cannot do just these things; he also argues with his rival, and so on. King’s unplanned behaviors end up causing his rival considerable trouble.[1]

This struck me as related to some challenges in formulating what one should do — that is, in the “practical reasoning” side of ethics.

One way of getting practical advice out of virtue ethics is to say that one should do what the virtuous person would do in this situation. On its face, this seems right. But there are also some apparent counterexamples. Consider a short-tempered tennis player who has just lost a match.[2] In this situation, the virtuous person would walk over to his opponent, shake his hand, and say something like “Good match.” But if this player does that, he is likely to become enraged and even assault his victorious opponent. So it seems better for him to walk off the court without attempting any of this, even though this is clearly rude.

The simple advice to do what the virtuous person would do in the present situation is, then, either not right or not so simple. It might be right, but not so simple to implement, if part of “the present situation” is one’s own psychological weaknesses. Aspects of the agent’s psychology — including character flaws — seem to license bad behavior and to remove reasons for taking the “best” actions.

King and other characters in The Descendants face this problem, both in the example above and at some other points in the movie. King begins a course of action (at least in part) because this is what the virtuous person would do. But then he is unable to really follow through because he lacks the necessary virtues.[3] We might take this as a reminder of the ethical value of being humble, of accounting for our faults, when reasoning about what we ought to do.[4] It is also a reminder of how frustrating this can be, especially when one can imagine following through on what the virtuous person would do (and might actually be able to).

One way to cope with these weaknesses is to leverage other aspects of one’s situation. We can make public commitments to do the virtuous thing. We can change our environment, sometimes by binding our future selves, like Ulysses, from acting on our vices once we’ve begun our (hopefully) virtuous course of action. Perhaps new mobile technologies will be a substantial help here — helping us intervene in our own lives in this way.

  1. Perhaps deserved trouble. But this certainly didn’t play a stated role in the reasoning justifying King’s decision to meet him.
  2. This example was first used by Gary Watson (“Free Agency”, 1975) and put to this use by Michael Smith in his “Internalism” (1995). Smith introduces it as a clear problem for the “example” model of how what a virtuous person would do matters for what we should each do.
  3. Another reading of some of these events in The Descendants is that these characters actually want to do the “bad behaviors”, and they (perhaps unconsciously) use their good intentions to justify the course of action that leads to the bad behavior.
  4. Of course, the other side of such humility is being short on self-efficacy.

A deluge of experiments

The Atlantic reports on the data deluge and its value for innovation.[1] I particularly liked how Erik Brynjolfsson and Andrew McAfee, who wrote the Atlantic piece, highlight the value of experimentation for addressing causal questions, and the fact that many of the questions we care about are causal.[2]

In writing about experimentation, they report that Hal Varian, Google’s Chief Economist, estimates that Google runs “100-200 experiments on any given day”. This struck me as incredibly low! I would have guessed more like 10,000 or maybe more like 100,000.

The trick of course is how one individuates experiments. Say Google has an automatic procedure whereby each ad has a (small) random set of users who are prevented from seeing it and are shown the next best ad instead. Is this one giant experiment? Or one experiment for each ad?

This is a bit of a silly question.[3]

But when most people, even statisticians and scientists, think of an experiment in this context, they think of something like Google or Amazon making a particular button bigger. (Maybe somebody thought making that button bigger would improve a particular metric.) They likely don’t think of automatically generating an experiment for every button, such that a random sample of users sees that particular button slightly bigger. It’s these latter kinds of procedures that lead to thinking about tens of thousands of experiments.

That’s the real deluge of experiments.
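As a rough illustration of how such a procedure multiplies the count, here is a minimal Python sketch of automatically generated per-button experiments. It is not a description of how Google or Amazon actually assign users; the function names, the experiment ID format, and the use of a hash for assignment are all assumptions made for the example. Each (user, experiment) pair is hashed to a number in [0, 1), and the user is treated if that number falls below the treated fraction. Because the experiment ID is part of the hash input, assignment is effectively independent across experiments.

```python
import hashlib


def unit_interval_hash(user_id, experiment_id):
    """Map (user, experiment) to a deterministic pseudo-random number in [0, 1)."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 48 bits so the ratio is exact in floating point and < 1.
    return int.from_bytes(digest[:6], "big") / 2**48


def in_treatment(user_id, experiment_id, treated_fraction=0.01):
    """True if this user sees the modified version in this experiment."""
    return unit_interval_hash(user_id, experiment_id) < treated_fraction


def generate_experiments(button_ids):
    """One hypothetical 'slightly bigger button' experiment per button."""
    return [f"bigger-button:{button_id}" for button_id in button_ids]


if __name__ == "__main__":
    buttons = [f"button-{i}" for i in range(10_000)]  # hypothetical inventory
    experiments = generate_experiments(buttons)
    print(len(experiments), "experiments generated")

    # How many of these experiments is one particular user treated in?
    treated = [e for e in experiments if in_treatment("user-42", e)]
    print("user-42 is treated in", len(treated), "of them")
```

Counted per button, a single automated procedure like this already yields ten thousand experiments; counted per procedure, it is just one.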

  1. I don’t know that I would call much of it ‘innovation’. There is some outright innovation, but a lot of that is in the general strategies for using the data. There is much more gained in minor tweaking and optimization of products and services.
  2. Perhaps they even overstate the power of simple experiments. For example, they do not mention the fact that the results of these kinds of experiments often change over time, so that what you learned 2 months ago is no longer true.
  3. Note that two single-factor experiments over the same population with independent random assignment can be regarded as a single experiment with two factors.

Frege’s judgment stroke

Are the conditions required to assert something conventions? Can they be formalized? Donald Davidson on whether convention is foundational to communication:

But Frege was surely right when he said, “There is no word or sign in language whose function is simply to assert something.” Frege, as we know, set out to rectify matters by inventing such a sign, the turnstile ‘⊢’ [sometimes called Frege's 'judgment stroke' or 'assertion sign']. And here Frege was operating on the basis of a sound principle: if there is a conventional feature of language, it can be made manifest in the symbolism. However, before Frege invented the assertion sign he ought to have asked himself why no such sign existed before. Imagine this: the actor is acting a scene in which there is supposed to be a fire. (Albee’s Tiny Alice, for example.) It is his role to imitate as persuasively as he can a man who is trying to warn others of a fire. “Fire!” he screams. And perhaps he adds, at the behest of the author, “I mean it! Look at the smoke!” etc. And now a real fire breaks out, and the actor tries vainly to warn the real audience. “Fire!” he screams, “I mean it! Look at the smoke!” etc. If only he had Frege’s assertion sign.

It should be obvious that the assertion sign would do no good, for the actor would have used it in the first place, when he was only acting. Similar reasoning should convince us that it is no help to say that the stage, or the proscenium arch, creates a conventional setting that negates the convention of assertion. For if that were so, the acting convention could be put into symbols also; and of course no actor or director would use it. The plight of the actor is always with us. There is no known, agreed upon, publically recognizable convention for making assertions. Or, for that matter, giving orders, asking questions, or making promises. These are all things we do, often successfully, and our success depends in part on our having made public our intention to do them. But it was not thanks to a convention that we succeeded.[1]

  1. Davidson, Donald. (1984). Communication and convention. Synthese 59 (1), 3-17.