After open data, we need open data analysis: a first look at OpenCPU

Posted on Sat 16 June 2012 in misc


OpenCPU is the early-stage PhD project of Jeroen Ooms, under supervision of Jan de Leeuw at UCLA. Its buzzwordy tagline "Scientific computing in the cloud" captures it partially, but does not do it justice.

Most notable is that the entire architecture is truly open, not only on the software level but also on the API and practical levels. That is, you can download the AGPL-licensed server as a Debian/Ubuntu package and run your own instance. In addition, all interactions with the server take place through a cleanly designed REST interface, and you can execute all standard R code, including your own.

This level of practical openness is very relevant, as a FOSS-licensed product can still generate a stale software ecosystem around it, and practical lock-in due to specific protocols, customized commands, etc.

Betting on complete openness should earn mindshare in the FOSS community, sticking to standard R allows a sizable group of data analysts to leverage their existing skills, and utilizing familiar elements such as REST and JSON reaches out to the web developer community. If OpenCPU gets enough traction in the intersection of these communities, it would move R a big step forward as a language for open and collaborative data analysis on the web.

Example: multidimensional scaling

As a basic demonstration, I reproduced the steps of this simple multidimensional scaling example. All interactions are basic HTTP requests, so any client or platform that can perform a GET and a POST will work: a command-line tool, a software library, your browser (recommended), etc.

Store data and functions

We start by saving a dataset, executing an HTTP POST request to a URL referencing the built-in R read.csv2() function, which has a "file" argument accepting a filename or a URL to a CSV file. Because we append /save, the CSV file is stored on the server, and an identifier hash is returned in a standard HTTP 200 response.

POST /R/pub/utils/read.csv2/save
file | ""
=> x874d4d3cfe
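As a sketch of what this first call looks like from a generic HTTP client, here it is in Python. The hostname and the CSV location are placeholders (the post elides the actual file argument); the request is only built, not sent.

```python
# Build (but do not send) the dataset-storing POST shown above, using
# only the standard library. BASE and the CSV URL are assumptions.
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://public.opencpu.org"  # hypothetical OpenCPU instance

def opencpu_post(path, **args):
    """Form-encode args and attach them, which makes this a POST."""
    return Request(BASE + path, data=urlencode(args).encode("utf-8"))

req = opencpu_post("/R/pub/utils/read.csv2/save",
                   file="http://example.org/data.csv")  # placeholder CSV
print(req.get_method(), req.full_url)
```

Sending this request with any HTTP library would return the identifier hash as the response body.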

This resulting identifier hash can be used to access the stored dataset, again using standard HTTP. We can execute a GET request and, by appending /csv, /json or /ascii, get the output in the format we want.

GET /R/tmp/x874d4d3cfe/csv
GET /R/tmp/x874d4d3cfe/json
GET /R/tmp/x874d4d3cfe/ascii
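On the retrieval side the output format is simply the last path segment. A minimal sketch (the hostname is again an assumption; the hash is the one returned above):

```python
# Build the URLs under which one stored object is exposed in several
# output formats on a hypothetical OpenCPU instance.
BASE = "http://public.opencpu.org"

def object_url(obj_hash, fmt):
    """URL of a stored temporary object, rendered as csv/json/ascii."""
    return f"{BASE}/R/tmp/{obj_hash}/{fmt}"

for fmt in ("csv", "json", "ascii"):
    print(object_url("x874d4d3cfe", fmt))
```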

This POST-and-GET pattern also works for storing custom functions, which can later be used in the same way as the built-in read.csv2() function. We store a basic cleaning function by POSTing a snippet of custom R code to the built-in R identity() function.

POST /R/pub/base/identity/save
x | function(x) { x[x == 9] <- NA; x[x == 8] <- 1; x[x == 2] <- 0; 
    x <- na.omit(x); return(x) }
=> x775426e3ae
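The same request can be sketched in Python. The server name is an assumption; the R source string is the cleaning function from the snippet above, travelling as an ordinary form field.

```python
# Store a snippet of R code by POSTing it to the built-in identity()
# function with /save appended. Nothing is sent here; we only build the
# request to show what goes over the wire.
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://public.opencpu.org"  # hypothetical server instance
r_code = ("function(x) { x[x == 9] <- NA; x[x == 8] <- 1; "
          "x[x == 2] <- 0; x <- na.omit(x); return(x) }")

req = Request(BASE + "/R/pub/base/identity/save",
              data=urlencode({"x": r_code}).encode("utf-8"))
print(req.get_method())  # POST, because a request body is attached
```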

The stored function is similarly identified with a hash. We can take a look at this function by GETting a plain-text representation.

GET /R/tmp/x775426e3ae/ascii

Before applying the cleaning function, we can get a summary of the dataset by POSTing the dataset's hash identifier as an argument to the R summary() function. We can also get the help page of a function by appending /help/ and a content type, e.g. /text for plain text, HTML, or a nicely laid-out PDF.

GET /R/pub/base/summary/help/text

POST /R/pub/base/summary/print 
object | x874d4d3cfe

Applying custom functions

We can clean the dataset we posted in the first step with the custom cleaning function we stored, by using both hash identifiers. If we execute a POST request with the hash of the saved dataset as the argument, we can either just print the result using /print, or directly save the cleaned dataset using /save. This last step returns a new hash, identifying the cleaned dataset.

POST /R/tmp/x775426e3ae/print
x | x874d4d3cfe

POST /R/tmp/x775426e3ae/save
x | x874d4d3cfe
=> xc8644750f2
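This hash-chaining step, a stored function (one hash) applied to a stored dataset (another hash), can be sketched the same way. BASE is a hypothetical instance; the hashes are the ones from the walkthrough.

```python
# Apply the stored cleaning function to the stored dataset by POSTing
# the dataset hash as the function's argument. The request is built
# but not sent; the server would answer with a new hash.
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://public.opencpu.org"  # hypothetical server instance

def apply_stored(fn_hash, **args):
    """Build the POST that calls stored function fn_hash with args."""
    return Request(f"{BASE}/R/tmp/{fn_hash}/save",
                   data=urlencode(args).encode("utf-8"))

req = apply_stored("x775426e3ae", x="x874d4d3cfe")
print(req.get_method(), req.full_url)
```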

Again, we store a custom function using POST, this time containing some basic matrix algebra to convert the data frame to a distance matrix.

POST /R/pub/base/identity/save
x | function(data) {X <- as.matrix(data); C <- t(X) %*% X; 
    D <- max(C) - C; diag(D) <- 0; Ds <- as.dist(D, diag = T); return(Ds) }
=> x4de5033631

Applying that function to the cleaned dataset, we get a new stored object, which we can GET to see how it looks.

POST /R/tmp/x4de5033631/print
data | xc8644750f2

POST /R/tmp/x4de5033631/save
data | xc8644750f2
=> x7b75c1174a

Calculate and visualise MDS-solution

We use the cmdscale() function to calculate our MDS solution. GETting the help page of the function shows that we need two arguments: the dataset (d) and the number of dimensions (k).

GET /R/pub/stats/cmdscale/help/text

We can thus apply this cmdscale() function by POSTing two arguments: the hash of the previously saved distance matrix, and 2 for the number of dimensions. Again, appending /save stores the result (an object containing the MDS solution) and returns the identifying hash.

POST /R/pub/stats/cmdscale/save
d | x7b75c1174a
k | 2
=> x6c61708c39
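On the wire this is just a form-encoded body with two fields; a minimal sketch, with the hash taken from the step above:

```python
# The two cmdscale() arguments as they travel in the POST body. The
# server coerces the string "2" back to a number on the R side.
from urllib.parse import urlencode

payload = urlencode({"d": "x7b75c1174a", "k": 2})
print(payload)  # d=x7b75c1174a&k=2
```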

Finally, we can pass the object holding the MDS solution to the plot() function, which gives us a nice 2D visualisation (for the labeled and commented version, see the original analysis). As we save the plot into a new object with a hash, we can easily GET and embed it, even choosing between /svg, /png, etc.

POST /R/pub/graphics/plot/save
x | x6c61708c39
=> x3c1615b975

GET /R/tmp/x3c1615b975/svg
GET /R/tmp/x3c1615b975/png

[Image: reproduced MDS visualisation]

This reproduces (after rotation, and with labels) the original, final MDS-visualisation (interpretation), as depicted below.

[Image: original MDS visualisation]


The example is likely a bit of an overwhelming string of hashes and HTTP verbs, but it demonstrates that you can access all the R commands needed to go from basic description over data cleaning to data analysis and visualization. And all these steps follow the same predictable pattern, leading to web-friendly, reproducible data analysis: all code, data and output can be easily shared, embedded and linked to.
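To make that uniform pattern explicit, the whole walkthrough can be condensed into a chain of verb/path/argument triples. The hashes are the ones produced above; the CSV location remains a placeholder, and the two function-storing POSTs are omitted for brevity.

```python
# The complete analysis as a flat list of HTTP steps. Each /save call
# returns a hash that a later step consumes as an argument.
steps = [
    ("POST", "/R/pub/utils/read.csv2/save", {"file": "<csv-url>"}),
    ("POST", "/R/tmp/x775426e3ae/save",     {"x": "x874d4d3cfe"}),
    ("POST", "/R/tmp/x4de5033631/save",     {"data": "xc8644750f2"}),
    ("POST", "/R/pub/stats/cmdscale/save",  {"d": "x7b75c1174a", "k": "2"}),
    ("POST", "/R/pub/graphics/plot/save",   {"x": "x6c61708c39"}),
    ("GET",  "/R/tmp/x3c1615b975/svg",      {}),
]
for verb, path, args in steps:
    print(verb, path, args)
```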

Now that the basic architecture is in place, interest in the platform will hopefully translate into user-friendly tools to work with it, and into innovative applications. The first external demo application has already appeared: an online editor that lets you generate a nice report based on Markdown syntax and R code.

Given the boom of open data, there is a growing need for truly open and collaborative data analysis, and OpenCPU should be a good fit.