18 Mar 2014
Moira Burke, Solomon Messing, Chris Saden, and I have created a new online course on exploratory data analysis (EDA) as part of Udacity’s “Data Science” track. It is designed to teach students how to explore data sets. Students learn how to do EDA using R and the visualization package ggplot2.
We emphasize the value of EDA for building and testing intuitions about a data set, identifying problems or surprises in data, summarizing variables and relationships, and supporting other data analysis tasks. The course materials are all free, and you can also sign up for tutoring, grading (especially useful for the final project), and certification.
In between general advice on data analysis and visualization, step-by-step walkthroughs of how to produce particular plots, and reasoning about how the data can answer questions of interest, the course includes interviews with four of our amazing colleagues on the Facebook Data Science team:
- Aude Hofleitner shares the process behind research on coordinated migration using hometown and “current city” Facebook data. (Udacity, YouTube)
- Lada Adamic gives an example of the importance of considering transformations of both x- and y-axes in an analysis from our forthcoming paper on the spread of rumors, memes, and urban legends on Facebook. (Udacity, YouTube)
- Sean Taylor illustrates the bias–variance tradeoff and other modeling decisions in his work on sentiment expressed by NFL (American football) fans. (Udacity, YouTube)
- Eytan Bakshy provides advice and encouragement to people working to become a “data scientist” (whatever that is). (Udacity, YouTube)
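To make the point about transforming both axes concrete, here is a minimal sketch. It is not the analysis from the paper, and it uses Python's standard library rather than the R and ggplot2 workflow taught in the course. The data are a hypothetical, noise-free power law (y = 3 * x^0.5), and the helper names `pearson` and `ols_fit` are made up for this illustration. On the raw axes the relationship is curved, so a straight line describes it poorly. After taking logs of both x and y it becomes exactly linear, and the fitted slope recovers the exponent.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def ols_fit(xs, ys):
    """Ordinary least squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical data: x spans five orders of magnitude, y = 3 * sqrt(x).
xs = [10 ** (k / 4) for k in range(21)]
ys = [3.0 * x ** 0.5 for x in xs]

# Transform both axes to a log10 scale.
log_xs = [math.log10(x) for x in xs]
log_ys = [math.log10(y) for y in ys]

r_raw = pearson(xs, ys)          # linear correlation on the raw axes
r_log = pearson(log_xs, log_ys)  # linear correlation on the log-log axes
slope, intercept = ols_fit(log_xs, log_ys)

print(f"correlation on raw axes:     {r_raw:.3f}")
print(f"correlation on log-log axes: {r_log:.3f}")
print(f"log-log slope (exponent):    {slope:.3f}")
print(f"10**intercept (multiplier):  {10 ** intercept:.3f}")
```

With real, noisy data the log-log fit would only approximate the exponent, and whether a log scale is appropriate at all is itself something to explore, for example by plotting the data both ways.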
One unique feature of this course is that one of the data sets we use is a “pseudo-Facebook” data set that Moira and I created to share many features with real Facebook data without describing any particular real Facebook users or revealing certain kinds of information about aggregate behavior. Other data sets used in the course include two data sets giving sale prices for diamonds and panel “scanner” data describing yogurt purchases.
It was a fascinating and novel process putting together this course. We scripted almost everything in detail in advance, before any filming started: first as outlines, then as drafts written in R Markdown with knitr, and finally as more detailed scripts with Udacity-specific notation for all the different shots and interspersed quizzes. I think this is part of what leads Kaiser Fung to write:
“The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality.”
Check out the course and let me know what you think — we’re still incorporating feedback.