R bloggers

R news and tutorials contributed by (573) R bloggers

Comparing the execution time between foverlaps and findOverlaps

Mon, 2015-04-13 05:49

(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)



Both of these functions find overlaps between genomic intervals. The findOverlaps function is from the Bioconductor package GenomicRanges (or IRanges if you don't need to compare intervals with an associated chromosome and strand). foverlaps is from the data.table package and is inspired by findOverlaps.

In genomics, we often have one large data set X with small interval ranges (usually sequenced reads) and another smaller data set Y with larger interval spans (usually exons, introns etc.). Generally, we are tasked with finding which intervals in X overlap with which intervals in Y.

In the foverlaps function Y has to be indexed using the setkey function (we don't have to do it on X); the key is what speeds up finding the overlaps.
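For concreteness, here is a minimal sketch of the same overlap query done both ways (this is not the author's code; the intervals are made up):

library(data.table)
library(GenomicRanges)

# smaller data set Y with wider spans (e.g. exons); foverlaps needs a key on it
y <- data.table(chr = "chr1", start = c(100L, 500L), end = c(200L, 800L))
setkey(y, chr, start, end)

# larger data set X with short ranges (e.g. reads); no key required
x <- data.table(chr = "chr1",
                start = c(150L, 700L, 900L),
                end   = c(160L, 710L, 910L))

# data.table: returns the joined table of overlapping pairs
ov_dt <- foverlaps(x, y, type = "any", nomatch = 0L)

# GenomicRanges: the same query on GRanges objects
gx <- GRanges("chr1", IRanges(x$start, x$end))
gy <- GRanges("chr1", IRanges(y$start, y$end))
ov_gr <- findOverlaps(gx, gy)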


Which one is faster?

To check this we used the benchmark function from the rbenchmark package. It's a simple wrapper around the system.time function.


The code below plots the execution time of both functions for increasing numbers of rows of data set X.
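The original plotting code is not reproduced here; a hedged sketch of that kind of timing loop (reusing the toy y and gy objects from the sketch above, with hypothetical row counts) could look like this:

library(rbenchmark)

nrows <- c(1e4, 1e5, 2e5, 5e5)
timings <- lapply(nrows, function(n) {
  # grow the large data set X
  xi <- data.table(chr = "chr1", start = sample(1e6, n, replace = TRUE))
  xi[, end := start + 50L]
  gxi <- GRanges("chr1", IRanges(xi$start, xi$end))
  benchmark(
    foverlaps    = foverlaps(xi, y, type = "any", nomatch = 0L),
    findOverlaps = findOverlaps(gxi, gy),
    replications = 10,
    columns      = c("test", "elapsed")
  )
})
# the elapsed times can then be collected and plotted against nrows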

Interestingly, foverlaps is the fastest way to solve the problem of finding overlaps, but only when the large data set has fewer than 200k rows.






We also plotted the situation where X and Y are swapped in the arguments of both functions. In this case you can see that foverlaps is much slower than findOverlaps almost from the beginning.


Information about my R session:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.4 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] data.table_1.9.4     rbenchmark_1.0.0     GenomicRanges_1.18.4
[4] GenomeInfoDb_1.2.4   IRanges_2.0.1        S4Vectors_0.4.0
[7] BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] chron_2.3-45   plyr_1.8.1     Rcpp_0.11.5    reshape2_1.4.1 stringr_0.6.2
[6] tools_3.1.3    XVector_0.6.0




JSON serialization now even faster and prettier

Sun, 2015-04-12 20:00

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

The jsonlite package implements a robust, high performance JSON parser and generator for R, optimized for statistical data and the web. This week version 0.9.16 appeared on CRAN which has a new prettifying system, improved performance and some additional tweaks for the new mongolite package.

Improved Performance

Everyone's favorite feature of jsonlite: performance. We found a way to significantly speed up toJSON for data frames for the cases of dataframe="rows" (the default) or dataframe="values". On my MacBook I now get these results:

data(diamonds, package="ggplot2")
system.time(toJSON(diamonds, dataframe = "rows"))
#   user  system elapsed
#  0.133   0.003   0.136
system.time(toJSON(diamonds, dataframe = "columns"))
#   user  system elapsed
#  0.070   0.003   0.072
system.time(toJSON(diamonds, dataframe = "values"))
#   user  system elapsed
#  0.094   0.005   0.099

A somewhat larger dataset:

data(flights, package="nycflights13")
system.time(toJSON(flights, dataframe = "rows"))
#   user  system elapsed
#  1.506   0.072   1.578
system.time(toJSON(flights, dataframe = "columns"))
#   user  system elapsed
#  0.585   0.024   0.608
system.time(toJSON(flights, dataframe = "values"))
#   user  system elapsed
#  0.873   0.039   0.912

That is pretty darn fast for a text-based serialization format. By comparison, we easily beat write.csv, which is actually a much simpler output format:

system.time(write.csv(diamonds, file="/dev/null"))
#   user  system elapsed
#  0.361   0.003   0.364
system.time(write.csv(flights, file="/dev/null"))
#   user  system elapsed
#  3.284   0.033   3.318

Pretty even prettier

Yihui has pushed for a new prettifying system that inserts indentation directly in the R code rather than making yajl prettify the entire JSON blob at the end. As a result we can use different indentation rules for different R classes. See the PR for details. The main difference is that atomic vectors are now prettified without line breaks:

x <- list(foo = 1:3, bar = head(cars, 2))
toJSON(x, pretty=TRUE)
{
  "foo": [1, 2, 3],
  "bar": [
    {
      "speed": 4,
      "dist": 2
    },
    {
      "speed": 4,
      "dist": 10
    }
  ]
}

toJSON(x, pretty=T, dataframe = "col")
{
  "foo": [1, 2, 3],
  "bar": {
    "speed": [4, 4],
    "dist": [2, 10]
  }
}

This can be helpful for manually inspecting or debugging your JSON data. The prettify function still uses yajl, so if you prefer this style, simply use prettify(toJSON(x)).

New mongolite package

There were some additional internal enhancements to support the new mongolite package, which will be announced later this month. This package will extend the concepts and power of jsonlite to in-database JSON documents. Have a look at the git repository for a sneak preview.


a vignette on Metropolis

Sun, 2015-04-12 18:15

(This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers)

Over the past week, I wrote a short introduction to the Metropolis-Hastings algorithm, mostly in the style of our Introduction to Monte Carlo with R book, that is, with very little theory and worked-out illustrations on simple examples. (And partly over the Atlantic on my flight to New York and Columbia.) This vignette is intended for the Wiley StatsRef: Statistics Reference Online Series, modulo possible revision. Again, nothing novel therein, except for new examples.
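For readers who have not seen the algorithm, here is a minimal random-walk Metropolis sketch in R, in the spirit of such worked-out illustrations (it is not taken from the vignette itself; the mixture target is made up):

# target: a two-component normal mixture
target <- function(x) 0.3 * dnorm(x, -2, 1) + 0.7 * dnorm(x, 3, 1)

n <- 1e4
x <- numeric(n)
x[1] <- 0
for (t in 2:n) {
  prop  <- x[t - 1] + rnorm(1, sd = 2)              # symmetric random-walk proposal
  alpha <- min(1, target(prop) / target(x[t - 1]))  # acceptance probability
  x[t]  <- if (runif(1) < alpha) prop else x[t - 1]
}

hist(x, breaks = 50, freq = FALSE)
curve(target(x), add = TRUE)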


Filed under: Books, Kids, R, Statistics, Travel, University life Tagged: Columbia University, Introducing Monte Carlo Methods with R, Metropolis-Hastings algorithm, mixture, New York city, testing as mixture estimation, vignette


A Couple of Handy ggplot Tricks – Using Environmental Variables and Saving Charts

Sun, 2015-04-12 17:18

(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

A couple of handy tricks when working with ggplot that had escaped my radar until today.

First up, I had a problem in a function I was using to generate a ggplot2 chart, in which I wanted to accept a couple of optional arguments into the function and then make use of them in a ggplot aes() element.

Normally, ggplot will try to dereference a variable as a column in the current ggplot data context, rather than as a variable in its own right. So what can we do? Hack around with aes_string()? In actual fact, the following trick identified via Stack Overflow did exactly what I needed – make a particular environment context available in ggplot directly:

core_qualifying_rank_slopegraph = function(qualiResults, qm, spacer=0.25, cutoff=c(16,10)){
  #http://stackoverflow.com/questions/10659133/local-variables-within-aes
  .e = environment()
  # if we pass this context into ggplot, we can access both spacer and cutoff
  g = ggplot(qualiResults, aes(x=session, y=laptime), environment = .e)
  g = g + geom_text(data=qm[qm['session']=='q1time',],
                    aes(x=1, y=qspos, label=driverName,
                        colour=(qspos>cutoff[1])),
                    size=3)
  g = g + geom_text(data=qm[qm['session']=='q2time',],
                    aes(x=2, y=qspos, label=driverName,
                        colour=(qspos>cutoff[2])),
                    size=3)
  ...
  g = g + geom_segment(data=qualiResults[!is.na(qualiResults['q2time']),],
                       x=1+spacer, xend=2-spacer,
                       aes(y=q1pos, yend=q2pos, group=driverName),
                       colour='slategrey')
  ...
  g
}
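A stripped-down, self-contained version of the same trick (hypothetical data and variable names, not the F1 code) makes the mechanism clearer:

library(ggplot2)

plot_with_cutoff <- function(df, cutoff = 5) {
  .e <- environment()  # capture the function's own environment
  # passing it to ggplot() lets aes() see `cutoff` as well as df's columns
  ggplot(df, aes(x = x, y = y, colour = y > cutoff), environment = .e) +
    geom_point()
}

plot_with_cutoff(data.frame(x = 1:10, y = rnorm(10, 5, 2)), cutoff = 6)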

By the by, the complete version of the original fragment generates a chart like the one below, heavily influenced by Tufte-style slopegraphs, which shows progression through the F1 qualifying session in China this weekend:

Note that I use a discrete rank rather than continuous laptime scale for the y-axis, which would be more in keeping with the original slope graph idea. (The f1datajunkie.com post on F1 China 2015 – How They Qualified explores another chart type where continuous laptime scales are used, and a two column layout reminiscent of an F1 grid as a trick to try to minimise overlap of drivername labels, along with a 1-dimensional layout that shows all the qualifying session classification laptimes.)

The second useful trick I learned today was a recipe for saving chart objects. (With sales of the Wrangling F1 Data With R book (which provides the context for my learning these tricks) all but stalled, I need a new dream to live for that gives me hope of making enough from F1 related posts to cover the costs of seeing a race for real one weekend, so I’ve started wondering whether I could sell or license particular charts one day (if I can produce them quickly enough), either as PDFs or, perhaps, as actual chart objects, rather than having to give away all the code and wrangled data, for example….)

So in that respect, the ability to save ggplot chart objects and then share them in a way that others can use them (if they have a workflow that can accommodate R/grid grobs) could be quite attractive… and this particular trick seems to do the job…

g=ggplot(..etc..)
...
#Get the grob...
g_out = ggplotGrob(g)
#Save the grob
save(g_out, file='ggrobtest')
#Look - nothing up my sleeves
rm(g_out)
rm(g)
#> g
#Error: object 'g' not found
#> g_out
#Error: object 'g_out' not found
load("ggrobtest")
#The g_out grob is reinstated and can be plotted as follows:
library(grid)
grid.draw(g_out)

Handy…:-)



Cefamandole: PK after IV with six subjects in JAGS

Sun, 2015-04-12 11:09

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

After last week's post on Theoph I noticed the MEMSS library contained more PK data sets. The Cefamandole data is an IV data set with six subjects. It is somewhat challenging, since there seem to be several elimination rates.
Data

Data are displayed below. Notice that the lines seem somewhat curved; there is clearly more going on than just simple elimination.
Models

The first model fits concentration as a function of two eliminations. The problem here is that the eliminations need to be ordered: all faster eliminations should have one prior, all slower ones another. After much fiddling around I found that I cannot force the ordering at the subject level with priors or limits, so I chose a different way out. The model gets an extra component: it predicts 0 if the parameters are incorrectly ordered. This indeed gave the desired result in terms of model behavior.

The model itself, though, was less to my liking. It goes way too low at the later times. Upon consideration, since the model assumes homoscedastic error, it does not matter much to the likelihood if the fit at the lower end is not so good: an error of 1 is nothing compared to errors of 10 or 20 at the upper end of the scale. In addition, it can be argued that the error is not homoscedastic; if it were, the underlying data would show much more variation at the lower end of the scale. It would need confirmation from the measuring team, but for now I will assume a proportional error.
Model 2: Proportional error

The proportional error is implemented by using a log-normal distribution for concentration. This was pretty simple to code: the log of the expected concentration is used in combination with dlnorm(). The only remaining issue is that the expectation of 0 for incorrectly ordered parameters has to be shifted a bit, since log(0) is a problem.
This model looks pretty decent. There are still some things to do, similar to last week: translate the parameters c1star and c2star into parameters that can be interpreted from a PK perspective.
Model outputs

Model 1

Inference for Bugs model at "C:/Users/Kees/AppData/Local/Temp/RtmpSe5nAn/modele207c187661.txt", fit using jags,
 4 chains, each with 10000 iterations (first 5000 discarded), n.thin = 10
 n.sims = 2000 iterations saved
          mu.vect sd.vect    2.5%     25%     50%     75%    97.5%  Rhat n.eff
C[1]       91.473  24.508  51.690  76.444  88.708 103.600  144.987 1.016   160
C[2]      324.947 132.978 147.584 253.433 304.913 366.175  626.711 1.007   460
Ke[1]       0.020   0.002   0.016   0.018   0.019   0.021    0.024 1.045    76
Ke[2]       0.141   0.192   0.081   0.116   0.132   0.147    0.202 1.072   180
c1star[1]  58.783   8.381  43.892  53.222  58.514  63.406   76.601 1.047    64
c1star[2]  59.841  18.962  32.887  47.487  56.324  67.606  108.087 1.048    72
c1star[3]  92.337  10.569  70.647  85.720  92.805  99.469  112.338 1.041    74
c1star[4]  89.554  12.223  67.975  81.086  88.831  96.932  116.699 1.027   110
c1star[5] 146.419  21.504 106.395 132.153 145.199 159.149  193.558 1.032    84
c1star[6] 121.228  14.872  87.712 112.706 122.379 130.819  148.106 1.092    42
c2star[1] 426.241 107.777 278.572 355.121 406.845 471.583  703.121 1.038    92
c2star[2] 201.118  35.861 149.478 179.179 197.081 217.084  278.248 1.033   830
c2star[3] 489.294 218.162 268.493 349.208 419.178 537.325 1125.062 1.088    42
c2star[4] 449.402  68.164 340.627 402.410 440.666 487.652  603.535 1.038    75
c2star[5] 402.518  43.391 335.760 373.724 397.703 423.902  498.253 1.017   380
c2star[6] 140.893  46.737  79.548 114.343 133.240 157.967  237.080 1.024   650
ke1[1]      0.020   0.002   0.016   0.018   0.020   0.021    0.025 1.042    74
ke1[2]      0.021   0.004   0.016   0.019   0.020   0.023    0.032 1.060    57
ke1[3]      0.018   0.002   0.014   0.017   0.018   0.019    0.022 1.031    99
ke1[4]      0.020   0.002   0.017   0.019   0.020   0.021    0.025 1.023   120
ke1[5]      0.020   0.002   0.016   0.019   0.020   0.021    0.024 1.036    74
ke1[6]      0.019   0.002   0.015   0.018   0.019   0.020    0.022 1.071    46
ke2[1]      0.167   0.025   0.128   0.150   0.164   0.181    0.225 1.050    64
ke2[2]      0.101   0.026   0.067   0.085   0.096   0.111    0.164 1.050    91
ke2[3]      0.181   0.042   0.122   0.151   0.173   0.200    0.283 1.076    43
ke2[4]      0.143   0.019   0.110   0.130   0.141   0.155    0.181 1.040    74
ke2[5]      0.112   0.017   0.086   0.100   0.110   0.121    0.151 1.027   130
ke2[6]      0.118   0.038   0.062   0.093   0.114   0.137    0.205 1.051    61
deviance  464.708   7.996 451.264 458.919 464.029 469.608  482.206 1.006   550

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 31.8 and DIC = 496.6
DIC is an estimate of expected predictive error (lower deviance is better).
Model 2

Inference for Bugs model at "C:/Users/Kees/AppData/Local/Temp/RtmpSe5nAn/modele2045997fac.txt", fit using jags,
 4 chains, each with 10000 iterations (first 5000 discarded), n.thin = 10
 n.sims = 2000 iterations saved
          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
C[1]       23.273  10.942   8.344  16.933  21.737  27.259  48.399 1.004   710
C[2]      159.545  32.361 106.916 141.343 157.165 172.618 230.957 1.004  1200
Ke[1]       0.010   0.001   0.009   0.010   0.010   0.011   0.012 1.010   270
Ke[2]       0.040   0.005   0.032   0.036   0.039   0.042   0.050 1.006   660
c1star[1]  14.141   3.164   8.481  11.970  13.923  16.049  20.826 1.012   230
c1star[2]  13.177   4.777   6.178   9.622  12.276  16.064  24.394 1.020   210
c1star[3]  34.030   6.876  20.963  29.102  33.778  38.458  47.854 1.006   450
c1star[4]  12.399   4.038   6.525   9.482  11.712  14.524  22.656 1.019   130
c1star[5]  40.359   8.756  24.398  34.248  40.002  45.836  58.609 1.007   360
c1star[6]  38.415   9.194  21.683  32.043  37.885  44.595  57.469 1.006   440
c2star[1] 123.292  17.872  93.250 110.418 121.625 134.114 164.930 1.000  2000
c2star[2] 134.174  18.465 103.312 121.677 132.580 145.038 176.803 1.004   680
c2star[3] 138.541  24.539  97.522 122.254 135.654 151.958 195.143 1.000  2000
c2star[4] 184.399  20.868 148.465 170.310 183.027 197.048 229.782 1.002  1500
c2star[5] 247.359  38.113 178.403 222.054 244.317 269.412 330.096 1.003  2000
c2star[6] 150.684  22.359 111.345 135.290 149.545 164.626 198.568 1.002  2000
ke1[1]      0.010   0.001   0.008   0.010   0.010   0.011   0.012 1.013   210
ke1[2]      0.012   0.001   0.009   0.011   0.011   0.012   0.014 1.022   180
ke1[3]      0.010   0.001   0.008   0.009   0.010   0.010   0.011 1.005   540
ke1[4]      0.011   0.001   0.009   0.010   0.010   0.011   0.013 1.020   130
ke1[5]      0.010   0.001   0.008   0.010   0.010   0.011   0.012 1.007   350
ke1[6]      0.010   0.001   0.009   0.010   0.010   0.011   0.012 1.005   540
ke2[1]      0.042   0.005   0.034   0.038   0.041   0.044   0.055 1.004   690
ke2[2]      0.041   0.006   0.032   0.037   0.040   0.044   0.056 1.013   230
ke2[3]      0.042   0.008   0.032   0.037   0.040   0.045   0.062 1.006   440
ke2[4]      0.037   0.003   0.031   0.034   0.036   0.038   0.044 1.008   320
ke2[5]      0.040   0.005   0.032   0.037   0.040   0.043   0.053 1.003  1100
ke2[6]      0.036   0.006   0.027   0.033   0.036   0.040   0.048 1.003  1200
deviance  372.564   7.382 359.811 367.354 372.140 376.777 388.833 1.001  2000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 27.3 and DIC = 399.8
DIC is an estimate of expected predictive error (lower deviance is better).

Code used

library(MEMSS)
library(ggplot2)
library(R2jags)
png('Cefamandole.png')
qplot(y=conc,x=Time,col=Subject,data=Cefamandole) +
    scale_y_log10('cefamandole (mcg/ml)')+
    geom_line()+
    theme(legend.position='bottom')
dev.off()

library(R2jags)
datain <-  list(
    time=Cefamandole$Time,
    conc=Cefamandole$conc,
    n=nrow(Cefamandole),
    subject =c(1:nlevels(Cefamandole$Subject))[Cefamandole$Subject],
    nsubject = nlevels(Cefamandole$Subject)
)

model1 <- function() {
  tau <- 1/pow(sigma,2)
  sigma ~ dunif(0,100)
  for (i in 1:n) {
    pred[i] <- (ke1[subject[i]]<ke2[subject[i]])*
        (c1star[subject[i]]*exp(-ke1[subject[i]]*time[i])+
          c2star[subject[i]]*exp(-ke2[subject[i]]*time[i]))
    conc[i] ~ dnorm(pred[i],tau)
  }
  
  for(i in 1:nsubject) {
    ke1[i] ~ dlnorm(ke[1],kemtau[1])
    ke2[i] ~ dlnorm(ke[2],kemtau[2]) 
    c1star[i] ~ dlnorm(cm[1],ctau[1])
    c2star[i] ~ dlnorm(cm[2],ctau[2])
  }
  for (i in 1:2) {
    kem[i] ~ dunif(-10,10) 
    kemtau[i] <- 1/pow(kesd[i],2) 
    kesd[i] ~ dunif(0,10)
    Ke[i] <- exp(ke[i])
    cm[i] ~ dunif(-10,20)
    ctau[i] <- 1/pow(csd[i],2)
    csd[i] ~ dunif(0,100)
    C[i] <- exp(cm[i])    
  }
  ke <- sort(kem)
}

parameters <- c('Ke','ke1','ke2','c1star','c2star','C')
inits <- function() {
  list(ke1 = rnorm(6,0.13,.03),
      ke2=  rnorm(6,0.0180,.005), 
      kem=  log(c(rnorm(1,0.13,.03),rnorm(1,0.0180,.005))),
      kesd =runif(2,.1,.4),
      cm = log(rnorm(2,25,5)),
      c1star=rnorm(6,25,3),
      c2star=rnorm(6,25,3)
  )}
jagsfit1 <- jags(datain, model=model1, 
    inits=inits,
    parameters=parameters,progress.bar="gui",
    n.iter=10000,
    n.chains=4,n.thin=10)
jagsfit1
#plot(jagsfit1)
#traceplot(jagsfit1)

Time <- seq(0,max(Cefamandole$Time))
la1 <- sapply(1:nrow(jagsfit1$BUGSoutput$sims.matrix),
    function(i) {
      C1 <- jagsfit1$BUGSoutput$sims.matrix[i,'C[1]']
      C2 <- jagsfit1$BUGSoutput$sims.matrix[i,'C[2]']
      k1 <- jagsfit1$BUGSoutput$sims.matrix[i,'Ke[1]']
      k2 <- jagsfit1$BUGSoutput$sims.matrix[i,'Ke[2]']
      y=C1*exp(-k1*Time)+C2*exp(-k2*Time)
    })
res1 <- data.frame(
    Time=Time,
    Conc=apply(la1,1,mean),
    C05=apply(la1,1,FUN=function(x) quantile(x,0.05)),
    C95=apply(la1,1,FUN=function(x) quantile(x,0.95)))
png('cef1.png')
ggplot(res1, aes(x=Time)) +
    geom_line(aes(y=Conc))+
    scale_y_log10('cefamandole (mcg/ml)')+
    geom_line(aes(y=conc,x=Time,col=Subject),alpha=1,data=Cefamandole)+
    geom_ribbon(aes(ymin=C05, ymax=C95),alpha=.2)+
    theme(legend.position='bottom')
dev.off()
#########
#
# model 2: proportional error
#
#########
model2 <- function() {
  tau <- 1/pow(sigma,2)
  sigma ~ dunif(0,100)
  for (i in 1:n) {
    pred[i] <- log(
        (ke1[subject[i]]<ke2[subject[i]])*
            (c1star[subject[i]]*exp(-ke1[subject[i]]*time[i])+
              c2star[subject[i]]*exp(-ke2[subject[i]]*time[i]))
            +.001*(ke1[subject[i]]>ke2[subject[i]]))
    conc[i] ~ dlnorm(pred[i],tau)
  }
  
  
  for(i in 1:nsubject) {
    ke1[i] ~ dlnorm(ke[1],kemtau[1])
    ke2[i] ~ dlnorm(ke[2],kemtau[2]) 
    c1star[i] ~ dlnorm(cm[1],ctau[1])
    c2star[i] ~ dlnorm(cm[2],ctau[2])
  }
  for (i in 1:2) {
    kem[i] ~ dunif(-10,10) 
    kemtau[i] <- 1/pow(kesd[i],2) 
    kesd[i] ~ dunif(0,10)
    Ke[i] <- exp(ke[i])
    cm[i] ~ dunif(-10,20)
    ctau[i] <- 1/pow(csd[i],2)
    csd[i] ~ dunif(0,100)
    C[i] <- exp(cm[i])    
  }
  ke <- sort(kem)
}


parameters <- c('Ke','ke1','ke2','c1star','c2star','C')
inits <- function() {
  list(ke1 = rnorm(6,0.13,.03),
      ke2=  rnorm(6,0.0180,.005), 
      kem=  log(c(rnorm(1,0.13,.03),rnorm(1,0.0180,.005))),
      kesd =runif(2,.1,.4),
      cm = log(rnorm(2,25,5)),
      c1star=rnorm(6,25,3),
      c2star=rnorm(6,25,3)
  )}
jagsfit2 <- jags(datain, model=model2, 
    inits=inits,
    parameters=parameters,progress.bar="gui",
    n.iter=10000,
    n.chains=4,n.thin=10)
jagsfit2
la2 <- sapply(1:nrow(jagsfit2$BUGSoutput$sims.matrix),
    function(i) {
      C1 <- jagsfit2$BUGSoutput$sims.matrix[i,'C[1]']
      C2 <- jagsfit2$BUGSoutput$sims.matrix[i,'C[2]']
      k1 <- jagsfit2$BUGSoutput$sims.matrix[i,'Ke[1]']
      k2 <- jagsfit2$BUGSoutput$sims.matrix[i,'Ke[2]']
      y=C1*exp(-k1*Time)+C2*exp(-k2*Time)
    })


res2 <- data.frame(
    Time=Time,
    Conc=apply(la2,1,mean),
    C05=apply(la2,1,FUN=function(x) quantile(x,0.05)),
    C95=apply(la2,1,FUN=function(x) quantile(x,0.95)))

png('cef2.png')
ggplot(res2, aes(x=Time)) +
    geom_line(aes(y=Conc))+
    scale_y_log10('cefamandole (mcg/ml)')+
    geom_line(aes(y=conc,x=Time,col=Subject),alpha=1,data=Cefamandole)+
    geom_ribbon(aes(ymin=C05, ymax=C95),alpha=.2)+
    theme(legend.position='bottom')
dev.off()


The wings of a programmer

Sun, 2015-04-12 10:45

(This article was first published on Burns Statistics » R language, and kindly contributed to R-bloggers)

Programmers think programming is really hard.  Non-programmers think it’s even harder than that.

Figure 1: The perceived difficulty of programming. 

Why is programming so arduous?

There are a few reasons.  Here is one.

Wings

Programming is exacting, programming needs creativity.

These are absolutely at odds with each other.

One wing wants the programmer to respect every minuscule triviality.  The other wants the programmer to conceive a scintillating new creature that fits seamlessly into an existing ecosystem.  Getting either of these out of one particular homo sapiens brain is a lot to ask.  Here we are, wanting both at once.

Failure mode 1

Do the details, ignore the creativity.

This gets you code that runs, that passes the existing test suite.  But code that is a dead end.  Ask one more thing of it and it falls apart.  The code is likely to be verbose.

Failure mode 2

Do the creativity, ignore the details.

This gives you code that doesn’t do much of anything (if you get any code at all).

Failure mode 3

Do details and creativity simultaneously.

This is frustrating for the programmer and frustrating for the eventual users.  Trying to do both of these at the same time means that neither gets done.

Premature optimization is the root of all evil

– Donald Knuth

Simultaneously doing details and design is a form of premature optimization.  Plus, brain rule #4 suggests that trying to do them both at once is, at best, inefficient — gist should come before details.

Better

Better is to use both wings, but — unlike a bird — flap one for a while and then the other.  Not like a bird, more like Paul Bunyan’s dog Sport.  Sport’s back half was sewn on upside down after he was accidentally chopped in half.  This turned out to be a good thing: when he got tired of running on his front legs, he would just flip over and run on his back legs.

First mold your new baby so it fits into the current milieu.  Only once it seems to have a good shape should you worry about it doing everything correctly.  Worry whole hog.

Failure mode 0

Real programmers know that numbering starts at zero.  Indeed there is failure mode 0, which is:

Don’t program at all.

See also

Tao Te Programming is mostly about the things that make the line in Figure 1 slope back up.  It also suggests additional ways to make programming easier for beginners.

Epilogue

There, upside down (unlike a Bird),

— from “The Sloth” by Theodore Roethke

Appendix R

Figure 1 was produced in R with the function:

P.perceived_programming_difficulty <- function (filename = "perceived_programming_difficulty.png") {
  if (length(filename)) {
    png(file = filename, width = 512)
    par(mar = c(5, 4, 0, 2) + 0.1)
  }
  pdiff <- stats::splinefun(1:4, c(11, 2, 7, 8.8), method="natural")
  curve(pdiff, 0.5, 5, lwd=6, col="darkblue",
        xlab="Experience", ylab="Perceived difficulty",
        axes=FALSE, xlim=c(0.4, 5))
  axis(1, at=c(0.7, 2.6, 4.5), tck=0, c("none", "some", "lots"))
  axis(2, at=c(3, 9, 15), tck=0, c("easy", "really hard", "really really hard"))
  box()
  if (length(filename)) {
    dev.off()
  }
}

The post The wings of a programmer appeared first on Burns Statistics.


Modeling Categories with Breadth and Depth

Fri, 2015-04-10 20:15

(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

Religion is a categorical variable with followers differentiated by their degree of devotion. Liberals and conservatives check their respective boxes when surveyed, although moderates from each group sometimes seem more alike than their more extreme compatriots. All Smartphone users might be classified as belonging to the same segment, yet the infrequent user is distinct from the intense user who cannot take their eyes off their screen. Both of us have the flu, but you are really sick. My neighbor and I belong to a cluster called dog-owners. However, my dog is a pet on a strict allowance and theirs is a member of the family with no apparent limit on expenditures. There seems to be more structure underlying such categorizations than is expressed by a zero-one indicator of membership.

Plutchik's wheel of emotion offers a concrete example illustrating the breadth and depth of affective categories spanning the two dimensions defined by positive vs. negative and active vs. passive. The concentric circles in the diagram below show the depth of the emotion with the most intense toward the center. Loathing, disgust and boredom suggest an increasing activation of a similar type of negative response between contempt and remorse. Breadth is varied as one moves around the circle traveling through the entire range of emotions. When we speak of opposites such as flight (indicated in green as apprehension, fear and terror) or fight (the red emotions of annoyance, anger and rage), we are relating our personal experiences with two contrasting categories of feelings and behaviors. Yet, there is more to this categorization than is expressed by a set of exhaustive and mutually exclusive boxes, even if the boxes are called latent classes.



The R statistical language approaches such structured categories from a number of perspectives. I have written two posts on archetypal analysis. The first focused on the R package archetypes and demonstrated how to use those functions to model customer heterogeneity by identifying the extremes and representing everyone else as convex combinations of those end points. The second post argued that we tend to think in terms of contrasting ideals (e.g., liberal versus conservative) so that we perceive entities to be more different in juxtaposition than they would appear on their own.

Latent variable mixture models provide another approach to the modeling of unobserved constructs with both type and intensity, at least for those interested in psychometrics in R. The intensity measure for achievement tests is item difficulty, and differential item functioning (DIF) is the term used by psychometricians to describe items with different difficulty parameters in different groups of test takers. The same reasoning applies to customers who belong to different segments (types) seeking different features from the same products (preference intensity). These are all mixture models in the sense that we cannot estimate one set of parameters for the entire sample because the data are a mixture of hidden groups with different parameters. R packages with the prefix or suffix "mix" in their title (e.g., mixRasch or flexmix) suggest such a mixture modeling.

R can also capture the underlying organization shaping categories through matrix factorization. This blog is filled with posts demonstrating how the R package NMF (nonnegative matrix factorization) easily decomposes a data matrix into the product of two components: (1) something that looks like factor loadings of the measures in the columns and (2) something similar to a soft or fuzzy clustering of respondents in the rows. Both these components will be block diagonal matrices when we have the type of separation we have been discussing. You can find examples of such matrix decompositions in a listing of previous posts at the end of my discussion of brand and product representation.
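As a hedged sketch of that decomposition (simulated nonnegative data, not taken from any of the posts listed there):

library(NMF)

set.seed(1)
# 100 "respondents" by 8 "measures", all nonnegative
v <- matrix(rpois(100 * 8, lambda = 3), nrow = 100)

fit <- nmf(v, rank = 2)   # V is approximated by W %*% H
w <- basis(fit)           # 100 x 2: soft cluster weights for the rows (respondents)
h <- coef(fit)            # 2 x 8:  loading-like weights for the columns (measures)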

Consumer segments are structured by their use of common products and features in order to derive similar benefits. This is not an all-or-none process but a matter of degree specified by depth (e.g., usage frequency) and breadth (e.g., the variety and extent of feature usage). You can select almost any product category, and at one end you will find heavy users doing everything that can possibly be done with the product. As usage decreases, it falls off in clumps with clusters of features no longer wanted or needed. These are the latent features of NMF that simultaneously bind together consumers and the features they use. For product categories such as movies or music, the same process applies but now the columns are films seen or recordings heard. All of this may sound familiar to those of you who have studied recommendation systems or topic modeling, both of which can be run with NMF.


New packages for reading data into R — fast

Fri, 2015-04-10 13:18

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Hadley Wickham and the RStudio team have created some new packages for R, which will be very useful for anyone who needs to read data into R (that is, everyone). The readr package provides functions for reading text data into R, and the readxl package provides functions for reading Excel spreadsheet data into R. Both are much faster than the functions you're probably using now.

The readr package provides several functions for reading tabular text data into R. This is a task normally accomplished with the read.table family of functions in R, and readr provides a number of replacement functions that provide additional functionality and are much faster.

First, there's read_table which provides a near-replacement for read.table. Here's a comparison of using both functions on a file with 4 million rows of data (which I created by stacking copies of this file):

dat <- read_table("biggerfile.txt",
  col_names=c("DAY","MONTH","YEAR","TEMP"))

dat2 <- read.table("biggerfile.txt",
  col.names=c("DAY","MONTH","YEAR","TEMP"))

The commands look quite similar, but while read.table took just over 30 seconds to complete, readr's read_table accomplished the same task in less than a second. The trick is that read_table treats the data as a fixed-format file, and uses C++ to process the data quickly. (One small caveat is that read.table supports arbitrary amounts of whitespace between columns, while read_table requires the columns be lined up exactly. In practice, this isn't much of a restriction.)

Base R has a function for reading fixed-width data too, and here readr really shines:

dat <- read_fwf("biggerfile.txt",
  fwf_widths(c(3,15,16,12),
    col_names=c("DAY","MONTH","YEAR","TEMP")))

dat2 <- read.fwf("biggerfile.txt", c(3,15,16,12),
  col.names=c("DAY","MONTH","YEAR","TEMP"))

While readr's read_fwf again accomplished the task in about a second, the standard read.fwf took over 3 minutes — almost 200 times as long.

Other functions in the package include read_csv (and a European-friendly variant read_csv2) for comma-separated data, read_tsv for tab-separated data, and read_lines for line-by-line file extraction (great for complicated post-processing). The package also makes it much easier to read columns of dates in various formats, and sensibly always handles text data as strings (no more stringsAsFactors = FALSE).

For data in Excel format, there's also the new readxl package. This package provides functions to read Excel worksheets in both .xls and .xlsx formats. I haven't benchmarked the read_excel function myself, but like the readr functions it's based on a C++ library so should be quite snappy. And best of all, it has no external dependencies, so you can use it to read Excel data on just about any platform — there's no requirement that Excel itself be installed.
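A minimal, hedged sketch (the workbook name here is hypothetical):

library(readxl)
dat <- read_excel("workbook.xlsx", sheet = 1)  # the same call handles .xls and .xlsx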

The readr package is on CRAN now, and readxl can be installed from GitHub. If you try them yourself, let us know how it goes in the comments.

RStudio blog: readr 0.1.0


R User Group Recap: Heatmaps and Using the caret Package

Fri, 2015-04-10 11:01

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)

At our most recent R user group meeting we were delighted to have presentations from Mark Lawson and Steve Hoang, both bioinformaticians at Hemoshear. All of the code used in both demos is in our Meetup’s GitHub repo.

Making heatmaps in R

Steve started with an overview of making heatmaps in R. Using the iris dataset, Steve demonstrated making heatmaps of the continuous iris data using the heatmap.2 function from the gplots package, the aheatmap function from NMF, and the hard way using ggplot2. The “best in class” method used aheatmap to draw an annotated heatmap plotting z-scores of columns and annotated rows instead of raw values, using the Pearson correlation instead of Euclidean distance as the distance metric.

library(dplyr)
library(NMF)
library(RColorBrewer)
iris2 = iris # prep iris data for plotting
rownames(iris2) = make.names(iris2$Species, unique = T)
iris2 = iris2 %>% select(-Species) %>% as.matrix()
aheatmap(iris2, color = "-RdBu:50", scale = "col", breaks = 0,
annRow = iris["Species"], annColors = "Set2",
distfun = "pearson", treeheight=c(200, 50),
fontsize=13, cexCol=.7,
filename="heatmap.png", width=8, height=16)

Classification and regression using caret

Mark wrapped up with a gentle introduction to the caret package for classification and regression training. This demonstration used the caret package to split data into training and testing sets, and run repeated cross-validation to train random forest and penalized logistic regression models for classifying Fisher’s iris data.

First, get a look at the data with the featurePlot function in the caret package:

library(caret)
set.seed(42)
data(iris)
featurePlot(x = iris[, 1:4],
y = iris$Species,
plot = "pairs",
auto.key = list(columns = 3))

Next, after splitting the data into training and testing sets and using the caret package to automate training and testing both random forest and partial least squares models using repeated 10-fold cross-validation (see the code), it turns out random forest outperforms PLS in this case, and performs fairly well overall:

                      setosa  versicolor  virginica
Sensitivity             1.00        1.00       0.00
Specificity             1.00        0.50       1.00
Pos Pred Value          1.00        0.50        NaN
Neg Pred Value          1.00        1.00       0.67
Prevalence              0.33        0.33       0.33
Detection Rate          0.33        0.33       0.00
Detection Prevalence    0.33        0.67       0.00
Balanced Accuracy       1.00        0.75       0.50

A big thanks to Mark and Steve at Hemoshear for putting this together!

Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
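For reference, here is a hedged sketch of the kind of caret workflow described above (the real code is in the Meetup repo; training a random forest this way also needs the randomForest package installed):

library(caret)
set.seed(42)
data(iris)

# split into training and testing sets
in_train <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# repeated 10-fold cross-validation
ctrl   <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
rf_fit <- train(Species ~ ., data = training, method = "rf", trControl = ctrl)

# evaluate on the held-out data
confusionMatrix(predict(rf_fit, testing), testing$Species)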


drat 0.0.3: More features, more fixes, more documentation

Fri, 2015-04-10 08:14

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Several weeks ago we introduced the drat package. Its name stands for drat R Archive Template, and it helps with easy-to-create and easy-to-use repositories for R packages. Two early blog posts describe drat: First Steps Towards Lightweight Repositories, and Publishing a Package, and since the previous release, a guest post on drat was also added to the blog.

Several people have started to use drat to publish their packages, and this is a very encouraging sign. I also created a new repository and may have more to say about this in another post once I get time to write something up.

Today version 0.0.3 arrived on CRAN. It rounds out functionality, adds some fixes and brings more documentation:

  • git support via git2r is improved; it is still optional (ie commit=TRUE is needed when adding packages) but we plan to expand it; see the sketch after this list
  • several small bugs got fixed, including issues #9 and #7,
  • four new vignettes got added, including two guest posts by Steven and Colin as well as two focusing, respectively, on drat for authors and drat for users.
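A hedged sketch of that optional git support (the package filename and repository path here are hypothetical):

library(drat)
insertPackage("myPackage_0.1.0.tar.gz",   # a built source package
              repodir = "~/git/drat",     # local checkout of the drat repository
              commit  = TRUE)             # commit the addition via git2r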

The work on the vignettes is clearly in progress as Colin's guest post isn't really finished yet, and I am not too impressed with how the markdown is rendered at CRAN so some may still become pdf files instead.

Courtesy of CRANberries, there is a comparison to the previous release. More detailed information is on the drat page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Scale back or transform back multiple linear regression coefficients: Arbitrary case with ridge regression

Fri, 2015-04-10 07:25

(This article was first published on Memo's Island, and kindly contributed to R-bloggers)

Summary

In data science and machine learning applications, different features or predictors commonly manifest themselves on different scales. This can make the resulting linear regression coefficients difficult to interpret, for example when one feature has very large or small values compared to the other predictors, or when the predictors are in different units in the first place. The common approach to overcome this is to use z-scores for each feature, i.e. centring and scaling.

This approach allows us to interpret the effects on a comparable scale. A natural question, however, is how to map the regression coefficients obtained with the scaled data back to the original scale. In the context of ridge regression, this question was posed by Mark Seeto on the R mailing list, where he provided a solution for the two-predictor case with R code. In this post we formalize his approach. Note that, on the question of how to scale, Professor Gelman suggests dividing by two standard deviations; we won't cover that approach here and use the usual z-scores.

Algebraic Solution: No error term

An arbitrary linear regression with $n$ predictors reads as follows
$$y = \left( \sum_{i=1}^{n} \beta_{i} x_{i} \right) + \beta_{0}$$
where $y$ is the response variable and the $x_{i}$, $i=1,\dots,n$, are the predictors. Let's use primes for the scaled regression equation with $n$ predictors.
$$y' = \left( \sum_{i=1}^{n} \beta_{i}' x_{i}' \right) + \beta_{0}'$$
We would like to express $\beta_{i}$ using only $\beta_{i}'$ and two statistics from the data, namely the means and standard deviations $\mu_{x_{i}}$, $\mu_{y}$, $\sigma_{x_{i}}$ and $\sigma_{y}$.

The following transformation can be shown by using the z-scores and some algebra,

$$\beta_{0} = \beta_{0}' \sigma_{y} + \mu_{y} - \sum_{i=1}^{n} \frac{\sigma_{y}}{\sigma_{x_{i}}} \beta_{i}' \mu_{x_{i}}$$
$$\beta_{i} = \beta_{i}' \frac{\sigma_{y}}{\sigma_{x_{i}}}$$

Ridge regression in R

There are many packages and tools in R to perform ridge regression; one of the prominent ones is glmnet. Following Mark Seeto's example, here we extend it to the $n$-variate case with the helper function scale.back.lm from the R1magic package, which provides the transform utility for the $n$-variate case. We demo this using 6 predictors; the code is also available as a gist.


rm(list=ls())
library(glmnet)
library(R1magic) # https://github.com/msuzen/R1magic
set.seed(4242)
n <- 100 # observations
X <- model.matrix(~., data.frame(x1 = rnorm(n, 1, 1),
x2 = rnorm(n, 2, 2),
x3 = rnorm(n, 3,2),
x4 = rnorm(n, 4,2),
x5 = rnorm(n, 5,1),
x6 = rnorm(n, 6,1)
))[,-1] # glmnet adds the intercept
Y <- matrix(rnorm(n, 1, 2),n,1)
# Now apply scaling
X.s <- scale(X)
Y.s <- scale(Y)
# Ridge regression & coefficients with scaled data
glm.fit.s <- glmnet(X.s, Y.s, alpha=0)
betas.scaled <- as.matrix(as.vector(coef(glm.fit.s)[,80]), 1, 7)
# transform the coefficients
betas.transformed <- scale.back.lm(X, Y, betas.scaled)
# Now verify the correctness of scaled coefficients:
# ridge regression & coefficients
glm.fit <- glmnet(X, Y, alpha=0)
betas.fit <- as.matrix(as.vector(coef(glm.fit)[,80]), 1, 7)
# Verify correctness: Difference is smaller than 1e-12
sum(betas.fit-betas.transformed) < 1e-12 # TRUE

Conclusion

Multiple regression is used by many practitioners. In this post we have shown how to scale continuous predictors and transform the regression coefficients back to the original scale. Scaled coefficients help us to better interpret the results. The question of when to standardize the data is a different issue.


sparklines for R

Fri, 2015-04-10 06:50

(This article was first published on FishyOperations, and kindly contributed to R-bloggers)

I’ve always liked the jQuery Sparklines library and today I had a use case for implementing these in one of my Rmarkdown reports.

While it wouldn’t be too difficult to statically include a javascript-based chart, ideally I would simply want to dynamically generate the sparkline using values computed in R. Luckily we now have htmlwidgets for R. This library makes it stupidly simple to integrate javascript libraries in R code; it is simply up to anyone with a use case for integrating a javascript library with R to insert some glue code. And so sparklines was born.

Usage is quite straightforward. I’ll retrieve some weather data for Brussels to populate a few sparkline charts.

library(weatherData)
weatherset <- getWeatherForDate("BRU", start_date = "2015-03-15", end_date = "2015-04-09", opt_verbose=F)
weather_values <- as.vector(weatherset$Mean_TemperatureC)

Let’s have a look at the temperature data.

weather_values
##  [1]  4  5  8  6  6  6  6  4  4  5  4  3  6  7 10  7  9  7  6  4  6  5  4
## [24]  6  8 10

Now for the sparklines:

library(sparklines)
sparkline(weather_values, "line")

{ "x": { "values": [ 4, 5, 8, 6, 6, 6, 6, 4, 4, 5, 4, 3, 6, 7, 10, 7, 9, 7, 6, 4, 6, 5, 4, 6, 8, 10 ], "config": { "type": "line", "width": "100px", "height": "20px" } },"evals": [ ] }

sparkline(weather_values, "bar")

{ "x": { "values": [ 4, 5, 8, 6, 6, 6, 6, 4, 4, 5, 4, 3, 6, 7, 10, 7, 9, 7, 6, 4, 6, 5, 4, 6, 8, 10 ], "config": { "type": "bar", "width": "100px", "height": "20px" } },"evals": [ ] }

sparkline(weather_values, "tristate")

{ "x": { "values": [ 4, 5, 8, 6, 6, 6, 6, 4, 4, 5, 4, 3, 6, 7, 10, 7, 9, 7, 6, 4, 6, 5, 4, 6, 8, 10 ], "config": { "type": "tristate", "width": "100px", "height": "20px" } },"evals": [ ] }

sparkline(weather_values, "box")

{ "x": { "values": [ 4, 5, 8, 6, 6, 6, 6, 4, 4, 5, 4, 3, 6, 7, 10, 7, 9, 7, 6, 4, 6, 5, 4, 6, 8, 10 ], "config": { "type": "box", "width": "100px", "height": "20px" } },"evals": [ ] }

Almost all configuration options and chart types as specified at http://omnipotent.net/jquery.sparkline/#s-docs are available. And, as a nice side note, it is also possible to use it in combination with Shiny :)

Installation instructions, configuration options, sourcecode and other documentation is available at https://github.com/Bart6114/sparklines.


Feeling the FPP love

Fri, 2015-04-10 00:43

(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

It is now exactly 12 months since the print version of my forecasting textbook with George Athanasopoulos was released on Amazon.com. Although the book is freely available online, it seems that a lot of people still like to buy print books.

It’s nice to see that it has been getting some good reviews. It is rated 4.6 stars on Amazon.com with 6 out of 8 reviewers giving it 5 stars (the 3 reviewers on Amazon.co.uk all gave it 5 stars).

My favourite Amazon review is this one:

The book is well written and up to date — the online edition is likely to continue to be updated frequently. Hyndman is an inspiration. His blog is very interesting if you are a statistician, and written in a very clear style. His research group is the author of the R forecast package used in this book. He and his collaborators have made great strides in systematizing smoothing methods. You are not only reading a clear introductory textbook, you are reading one that’s up to date with modern forecasting practice (excluding the more exotic data mining methods, which clearly go beyond introductory texts). (I’m sure George Athanasopoulos has many fine qualities, but I’m less familiar with him.)

This book is ideal for self-study because the associated website has the answers to the exercises.

Isn’t he nice? Yes, George is a great guy!

However, the review is not entirely accurate. The website does not contain answers to the exercises. We do have solutions that we can give to instructors using the book, but at this stage they are not available more widely.

The only bad review (3 stars) was this one:

Online version much better than printed version.

Seriously? The only difference between the online and print versions is that a small number of typos have been corrected in the online version.

A few reviewers have called for an index. The reason it doesn’t have one is that the online version can be searched. Look up anything you like using the search box at the top of every page. Surely that’s better than me constructing an index by hand.

Another review appeared yesterday on the Information Management site, also very positive.

It’s nice to know that our work is appreciated, but we are well aware that there are areas for improvement. George and I hope to work on a revision to the book this year. If anyone has any suggestions for the next edition, please let us know in the comments below.

 


Recreating the vaccination heatmaps in R

Thu, 2015-04-09 17:14

(This article was first published on Benomics » R, and kindly contributed to R-bloggers)

In February the WSJ graphics team put together a series of interactive visualisations on the impact of vaccination that blew up on twitter and facebook, and were roundly lauded as great-looking and effective dataviz. Some of these had enough data available to look particularly good, such as for the measles vaccine:

Credit to the WSJ and creators: Tynan DeBold and Dov Friedman

How hard would it be to recreate an R version?

Base R version

Quite recently Mick Watson, a computational biologist based here in Edinburgh, put together a base R version of this figure using heatmap.2 from the gplots package.

If you’re interested in the code for this, I suggest you check out his blog post where he walks the reader through creating the figure, beginning from heatmap defaults.

However, it didn’t take long for someone to pipe up asking for a ggplot2 version (3 minutes in fact…) and that’s my preference too, so I decided to have a go at putting one together.

ggplot2 version

Thankfully the hard work of tracking down the data had already been done for me, to get at it follow these steps:

  1. Register and login to “Project Tycho
  2. Go to level 1 data, then Search and retrieve data
  3. Now change a couple of options: geographic level := state; disease outcome := incidence
  4. Add all states (highlight all at once with Ctrl+A, or Cmd+A on Macs)
  5. Hit submit and scroll down to Click here to download results to excel
  6. Open in excel and export to CSV

Simple right!

Now all that’s left to do is a bit of tidying. The data comes in wide format, so can be melted to our ggplot2-friendly long format with:

library(reshape2)  # assumed here for melt(); the full script is on github

measles <- melt(measles, id.var=c("YEAR", "WEEK"))

After that we can clean up the column names and use dplyr to aggregate weekly incidence rates into an annual measure:

library(dplyr)  # assumed here for the pipe, group_by and summarise

colnames(measles) <- c("year", "week", "state", "cases")
mdf <- measles %>%
  group_by(state, year) %>%
  summarise(c = if (all(is.na(cases))) NA else sum(cases, na.rm=T))
mdf$state <- factor(mdf$state, levels=rev(levels(mdf$state)))

It’s a bit crude but what I’m doing is summing the weekly incidence rates and leaving NAs if there’s no data for a whole year. This seems to match what’s been done in the WSJ article, though a more interpretable method could be something like average weekly incidence, as used by Robert Allison in his SAS version.

After trying to match colours via the OS X utility “digital colour meter” without much success, I instead grabbed the colours and breaks from the original plot’s javascript to make them as close as possible.

In full, the actual ggplot2 command took a fair bit of tweaking:

ggplot(mdf, aes(y=state, x=year, fill=c)) +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  theme_minimal() +
  scale_fill_gradientn(colours=cols, limits=c(0, 4000),
                       breaks=seq(0, 4e3, by=1e3),
                       na.value=rgb(246, 246, 246, max=255),
                       labels=c("0k", "1k", "2k", "3k", "4k"),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5,
                                             label=T, barwidth=10)) +
  scale_x_continuous(expand=c(0,0), breaks=seq(1930, 2010, by=10)) +
  geom_segment(x=1963, xend=1963, y=0, yend=51.5, size=.9) +
  labs(x="", y="", fill="") +
  ggtitle("Measles") +
  theme(legend.position=c(.5, -.13),
        legend.direction="horizontal",
        legend.text=element_text(colour="grey20"),
        plot.margin=grid::unit(c(.5,.5,1.5,.5), "cm"),
        axis.text.y=element_text(size=6, family="Helvetica", hjust=1),
        axis.text.x=element_text(size=8),
        axis.ticks.y=element_blank(),
        panel.grid=element_blank(),
        title=element_text(hjust=-.07, face="bold", vjust=1, family="Helvetica"),
        text=element_text(family="URWHelvetica")) +
  annotate("text", label="Vaccine introduced", x=1963, y=53,
           vjust=1, hjust=0, size=I(3), family="Helvetica")

Result

I’m pretty happy with the outcome but there are a few differences: the ordering is out (someone pointed out the original is ordered by two letter code rather than full state name) and the fonts are off (as far as I can tell they use “Whitney ScreenSmart” among others).

Obviously the original is an interactive chart which works great with this data. It turns out it was built with the highcharts library, which actually has R bindings via the rCharts package, so in theory the original chart could be entirely recreated in R! However, for now at least, that’ll be left as an exercise for the reader…

Full code to reproduce this graphic is on github.


readr 0.1.0

Thu, 2015-04-09 16:18

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

I’m pleased to announce that readr is now available on CRAN. Readr makes it easy to read many types of tabular data:

  • Delimited files with read_delim(), read_csv(), read_tsv(), and read_csv2().
  • Fixed width files with read_fwf(), and read_table().
  • Web log files with read_log().

You can install it by running:

install.packages("readr")

Compared to the equivalent base functions, readr functions are around 10x faster. They’re also easier to use because they’re more consistent, they produce data frames that are easier to use (no more stringsAsFactors = FALSE!), they have a more flexible column specification, and any parsing problems are recorded in a data frame. Each of these features is described in more detail below.
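As a rough way to see the speed difference on your own data (this is not a benchmark from the announcement, and the file name is a placeholder):

library(readr)
# Time base R and readr on the same file; expect roughly an order of
# magnitude difference on large files, though results vary by machine.
system.time(base_df  <- read.csv("mydata.csv", stringsAsFactors = FALSE))
system.time(readr_df <- read_csv("mydata.csv"))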

Input

All readr functions work the same way. There are four important arguments:

  • file gives the file to read; a URL or local path. A local path can point to a zipped, bzipped, xzipped, or gzipped file – it’ll be automatically uncompressed in memory before reading. You can also pass in a connection or a raw vector.

    For small examples, you can also supply literal data: if file contains a new line, then the data will be read directly from the string. Thanks to data.table for this great idea!

    library(readr)
    read_csv("x,y\n1,2\n3,4")
    #>   x y
    #> 1 1 2
    #> 2 3 4
  • col_names: describes the column names (equivalent to header in base R). It has three possible values:
    • TRUE will use the first row of data as column names.
    • FALSE will number the columns sequentially.
    • A character vector to use as column names.
  • col_types: overrides the default column types (equivalent to colClasses in base R). More on that below.
  • progress: By default, readr will display a progress bar if the estimated loading time is greater than 5 seconds. Use progress = FALSE to suppress the progress indicator.
Output

The output has been designed to make your life easier:

  • Characters are never automatically converted to factors (i.e. no more stringsAsFactors = FALSE!).
  • Column names are left as is, not munged into valid R identifiers (i.e. there is no check.names = TRUE). Use backticks to refer to variables with unusual names, e.g. df$`Income ($000)`.
  • The output has class c("tbl_df", "tbl", "data.frame") so if you also use dplyr you’ll get an enhanced print method (i.e. you’ll see just the first ten rows, not the first 10,000!).
  • Row names are never set.
Column types

Readr heuristically inspects the first 100 rows to guess the type of each column. This is not perfect, but it’s fast and it’s a reasonable start. Readr can automatically detect these column types:

  • col_logical() [l], contains only T, F, TRUE or FALSE.
  • col_integer() [i], integers.
  • col_double() [d], doubles.
  • col_euro_double() [e], “Euro” doubles that use , as the decimal separator.
  • col_date() [D]: Y-m-d dates.
  • col_datetime() [T]: ISO8601 date times
  • col_character() [c], everything else.

You can manually specify other column types:

  • col_skip() [_], don’t import this column.
  • col_date(format) and col_datetime(format, tz), dates or date times parsed with given format string. Dates and times are rather complex, so they’re described in more detail in the next section.
  • col_numeric() [n], a sloppy numeric parser that ignores everything apart from 0-9, - and . (this is useful for parsing currency data).
  • col_factor(levels, ordered), parse a fixed set of known values into an (optionally ordered) factor.

There are two ways to override the default choices with the col_types argument:

  • Use a compact string: "dc__d". Each letter corresponds to a column so this specification means: read first column as double, second as character, skip the next two and read the last column as a double. (There’s no way to use this form with column types that need parameters.)
  • With a (named) list of col objects:

    read_csv("iris.csv", col_types = list(
      Sepal.Length = col_double(),
      Sepal.Width = col_double(),
      Petal.Length = col_double(),
      Petal.Width = col_double(),
      Species = col_factor(c("setosa", "versicolor", "virginica"))
    ))

    Any omitted columns will be parsed automatically, so the previous call is equivalent to:

    read_csv("iris.csv", col_types = list( Species = col_factor(c("setosa", "versicolor", "virginica")) )
Dates and times

One of the most helpful features of readr is its ability to import dates and date times. It can automatically recognise the following formats:

  • Dates in year-month-day form: 2001-10-20 or 2010/15/10 (or any non-numeric separator). It can’t automatically recognise dates in m/d/y or d/m/y format because they’re ambiguous: is 02/01/2015 the 2nd of January or the 1st of February?
  • Date times in ISO8601 form: e.g. 2001-02-03 04:05:06.07 -0800, 20010203 040506, 20010203 etc. I don’t support every possible variant yet, so please let me know if it doesn’t work for your data (more details in ?parse_datetime).

If your dates are in another format, don’t despair. You can use col_date() and col_datetime() to explicitly specify a format string. Readr implements its own strptime() equivalent which supports the following format strings:

  • Year: %Y (4 digits). %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
  • Month: %m (2 digits), %b (abbreviated name in current locale), %B (full name in current locale).
  • Day: %d (2 digits), %e (optional leading space)
  • Hour: %H
  • Minutes: %M
  • Seconds: %S (integer seconds), %OS (partial seconds)
  • Time zone: %Z (as name, e.g. America/Chicago), %z (as offset from UTC, e.g. +0800)
  • Non-digits: %. skips one non-digit character, %* skips any number of non-digit characters.
  • Shortcuts: %D = %m/%d/%y, %F = %Y-%m-%d, %R = %H:%M, %T = %H:%M:%S, %x = %y/%m/%d.

To practice parsing date times without having to load the file each time, you can use parse_datetime() and parse_date():

parse_date("2015-10-10") #> [1] "2015-10-10" parse_datetime("2015-10-10 15:14") #> [1] "2015-10-10 15:14:00 UTC" parse_date("02/01/2015", "%m/%d/%Y") #> [1] "2015-02-01" parse_date("02/01/2015", "%d/%m/%Y") #> [1] "2015-01-02" Problems

If there are any problems parsing the file, the read_ function will throw a warning telling you how many problems there are. You can then use the problems() function to access a data frame that gives information about each problem:

csv <- "x,y 1,a b,2 " df <- read_csv(csv, col_types = "ii") #> Warning: 2 problems parsing literal data. See problems(...) for more #> details. problems(df) #> row col expected actual #> 1 1 2 an integer a #> 2 2 1 an integer b df #> x y #> 1 1 NA #> 2 NA 2 Helper functions

Readr also provides a handful of other useful functions:

  • read_lines() works the same way as readLines(), but is a lot faster.
  • read_file() reads a complete file into a string.
  • type_convert() attempts to coerce all character columns to their appropriate type. This is useful if you need to do some manual munging (e.g. with regular expressions) to turn strings into numbers. It uses the same rules as the read_* functions.
  • write_csv() writes a data frame out to a csv file. It’s quite a bit faster than write.csv() and it never writes row.names. It also escapes " embedded in strings in a way that read_csv() can read. (A short sketch of type_convert() and write_csv() follows this list.)
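A minimal sketch of those two helpers (the data frame and file name below are made up for illustration):

library(readr)
# A data frame whose columns arrived as character (e.g. after manual munging).
df <- data.frame(price = c("1.99", "12.50"), qty = c("3", "7"),
                 stringsAsFactors = FALSE)
# type_convert() applies the same type-guessing rules as the read_* functions.
df <- type_convert(df)
# write_csv() writes it back out, with no row names.
write_csv(df, "prices.csv")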
Development

Readr is still under very active development. If you have problems loading a dataset, please try the development version, and if that doesn’t work, file an issue.


To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

What can be in an R data.frame column?

Thu, 2015-04-09 14:29

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

As an R programmer have you ever wondered what can be in a data.frame column?

The documentation is a bit vague, help(data.frame) returns some comforting text including:

Value

A data frame, a matrix-like structure whose columns may be of differing types (numeric, logical, factor and character and so on).

If you ask an R programmer, the commonly depended-upon properties of data.frame columns are:

  1. All columns in a data frame have the same length. (true, but with an asterisk)
  2. All columns in a data frame are vectors with type (see help(typeof)) and class (see help(class)) deriving from one of the primitive types (such as: numeric, logical, factor and character and so on). (FALSE!)
  3. (weakening from 2) All items in a single column in a data frame have the same number of items per row index. (FALSE)
  4. (weakening from 3) All items in a single column in a data frame are at least homogeneous and have the type/class per row index. (FALSE, or that class/type can be list hiding heterogeneous nested entries)

Unfortunately, only the first item is actually true. The data.frame() and as.data.frame() methods try to do some conversions so that more of the items in the above list are usually true. We know data.frame is implemented as a list of columns, but the idea is that the class data.frame overrides a lot of operators and should be able to maintain some useful invariants for us.

data.frame is one of R’s flagship types. You would like it to have fairly regular and teachable observable behavior. (Though given the existence of the reviled attach() command, I am beginning to wonder if data.frame was a late addition to S, the language R is based on.)

But if you are writing library code (like vtreat) you end up working with the data frames as they are, and not as you would like them to be.

Here is an example of the problem:

d <- data.frame(a=1:3)
d$v <- tapply(X=1:6, INDEX=c('a','a','b','b','c','c'), FUN=sum, simplify=TRUE)
print(class(d$a))
## [1] "integer"
print(class(d$v))
## [1] "array"

Even with the simplify=TRUE argument set, tapply() returns an array, and that array type survives when added to a data.frame. There is no implicit as.numeric() conversion to change from an array to a primitive vector class. Any code written under the assumption the columns of the data frame restrict themselves to simple classes and types will fail.

Case in point: earlier versions of vtreat would fail to recognize such a column as numeric (because the library was checking the class name; I had falsely assumed the is.numeric() check was as fragile as the is.vector() checks) and treat the column as strings. And this is the cost of not having type strictness: there is no way to write concise correct code for dealing with other people’s data. vtreat already had special case code for POSIXlt types (one way nested lists can get into data frames!), but I didn’t have special code to check for lists and arrays in general. It isn’t so much that we used the wrong type-check (looking at class() instead of using is.numeric(), which can be debated); it is that we failed to put in enough special case code to catch (or at least warn) on all the unexpected corner cases.
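To make the distinction concrete, here is a small illustration continuing the d from the example above (the explicit as.numeric() conversion is my own suggestion, not vtreat code):

class(d$v)       ## [1] "array"   -- a class-name check misses this column
is.numeric(d$v)  ## [1] TRUE      -- the type check still sees it as numeric
# A defensive fix: strip the array structure before downstream processing.
d$v <- as.numeric(d$v)
class(d$v)       ## [1] "numeric"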

This is why I like type systems: they let you document (in a machine readable way, so you can also enforce!) the level of diversity of input you expect. If the inputs are not that diverse, then you have some chance that simple concise code can be correct. If the inputs are a diverse set of unrelated types that don’t share common interfaces, then no concise code can be correct.

Many people say there is no great cost to R’s loose type system, and I say there is. It isn’t just my code. The loose types are why things like ifelse() are 30 lines of code instead of 5 lines of code (try print(ifelse); you will notice the majority of the code is trying to strip off attributes and defend against types that are almost, but not quite, what one would expect; only a minority of the code is doing the actual work). This drives up the expense of writing a fitter (such as: lm, glm, randomForest, gbm, rpart, …) as, to be correct, the fitter may have to convert a number of odd types into primitives. And it drives up the cost of using fitters, as you have to double-check that the authors anticipated all the types you end up sending. And you may not even know which types you are sending, due to odd types entering through the use of other libraries and functions (such as tapply()).

If your rule of code composition is Postel’s law (instead of checkable types and behavior contracts) you are going to have very bloated code, as each module is forced to enumerate and correct a large number of “almost the same” behaviors and encodings. You will also have a large number of “rare” bugs, as there is no way every library checks all corner cases, and each new programmer accidentally injects a different unexpected type into their work. When there are a large number of rare bugs lurking, bugs are encountered often and diagnosing them is expensive (as each one feels unique).

When you work with systems that are full of special cases your code becomes infested with the need to handle special cases. Elegance and correctness become opposing goals instead of synergistic achievements.

Okay, I admit arrays are not that big a deal. But arrays are the least of your worries.

Columns of a data frame can be any of the following types:

  • POSIXlt a complicated list structure, making the column a nested list.
  • Arbitrary lists (including ragged lists, and lists with different types in each row).
  • Matrices.
  • Data frames.

Below is an example of a pretty nasty data frame. Try class() and typeof() on various columns; try str() on various entries; and definitely try print(unclass(d[1,'xPOSIXlt'])), as it looks like str() hides the awful details in this case (perhaps it or something it depends on is overridden).

d <- data.frame(xInteger=1:3,
                xNumeric=0,
                xCharacter='a',
                xFactor=as.factor('b'),
                xPOSIXct=Sys.time(),
                xRaw=raw(3),
                xLogical=TRUE,
                xArrayNull=as.array(list(NULL,NULL,NULL)),
                stringsAsFactors=FALSE)
d$xPOSIXlt <- as.POSIXlt(Sys.time())
d$xArray <- as.array(c(7,7,7))
d$xMatrix <- matrix(data=-1,nrow=3,ncol=2)
d$xListH <- list(10,20,'thirty')
d$xListR <- list(list(),list('a'),list('a','b'))
d$xData.Frame <- data.frame(xData.FrameA=6:8,xData.FrameB=11:13)

print(colnames(d))
##  [1] "xInteger"    "xNumeric"    "xCharacter"  "xFactor"     "xPOSIXct"
##  [6] "xRaw"        "xLogical"    "xArrayNull"  "xPOSIXlt"    "xArray"
## [11] "xMatrix"     "xListH"      "xListR"      "xData.Frame"

print(d)
##   xInteger xNumeric xCharacter xFactor            xPOSIXct xRaw xLogical
## 1        1        0          a       b 2015-04-09 10:40:26   00     TRUE
## 2        2        0          a       b 2015-04-09 10:40:26   00     TRUE
## 3        3        0          a       b 2015-04-09 10:40:26   00     TRUE
##   xArrayNull            xPOSIXlt xArray xMatrix.1 xMatrix.2 xListH xListR
## 1       NULL 2015-04-09 10:40:26      7        -1        -1     10   NULL
## 2       NULL 2015-04-09 10:40:26      7        -1        -1     20      a
## 3       NULL 2015-04-09 10:40:26      7        -1        -1 thirty   a, b
##   xData.Frame.xData.FrameA xData.Frame.xData.FrameB
## 1                        6                       11
## 2                        7                       12
## 3                        8                       13

print(unclass(d[1,'xPOSIXct']))
## [1] 1428601226

print(unclass(d[1,'xPOSIXlt']))
...

(Note: neither is.numeric(d$xPOSIXct) nor is.numeric(d$xPOSIXlt) is true, though both pass nicely through as.numeric(). So even is.numeric() doesn’t signal everything we need to know about the ability to use a column as a numeric quantity.)

(Also notice length(d$xData.Frame) is 2: the number of columns of the sub-data frame. And it is not 3 or nrow(d$xData.Frame). So even the statement “all columns have the same length” needs a bit of an asterisk by it. The columns all have the same length, but not the length returned by the length() method. Also note nrow(c(1,2,3)) returns NULL, so you can’t use that function everywhere either.)
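To make the point concrete, a library that wants to treat a column as a simple numeric vector ends up needing a compound check along these lines (my own sketch for illustration, not code from vtreat):

# TRUE only for columns that behave like a plain numeric vector:
# numeric-typed, not a matrix, and exactly one value per row of the frame.
isPlainNumericColumn <- function(column, nRows) {
  is.numeric(column) && !is.matrix(column) && length(column) == nRows
}
isPlainNumericColumn(d$xNumeric, nrow(d))  ## TRUE
isPlainNumericColumn(d$xArray, nrow(d))    ## TRUE  (1-d arrays are still usable)
isPlainNumericColumn(d$xMatrix, nrow(d))   ## FALSE (two values per row)
isPlainNumericColumn(d$xPOSIXct, nrow(d))  ## FALSE (is.numeric() is FALSE here)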

Related posts:

  1. R style tip: prefer functions that return data frames
  2. Check your return types when modeling in R
  3. R annoyances

To leave a comment for the author, please follow the link and comment on his blog: Win-Vector Blog » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Stata embraces Bayesian statistics

Thu, 2015-04-09 13:50

(This article was first published on eKonometrics, and kindly contributed to R-bloggers)

Stata 14 has just been released. The new and big thing with version 14 is the introduction of Bayesian Statistics. A wide variety of new models can now be estimated with Stata by combining 10 likelihood models, 18 prior distributions, different types of outcomes, and multiple equation models. Stata has also made available a 255-page reference manual for free to illustrate Bayesian statistical analysis.

Of course R already offered numerous options for Bayesian Inference. It will be interesting to hear from colleagues proficient in Bayesian statistics to compare Stata’s newly added functionality with what has already been available from R.

Given the hype with big data and the newly generated demand for data mining and advanced analytics, it would have been timely for Stata to also add data mining and machine learning algorithms. My two cents: data mining algorithms are in greater demand than Bayesian statistics. Stata users will have to wait for a year or more to see such capabilities. In the meanwhile, R offers several options for data mining and machine learning algorithms.

To leave a comment for the author, please follow the link and comment on his blog: eKonometrics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

A small neuroimage interactive plotter

Thu, 2015-04-09 11:39

(This article was first published on A HopStat and Jump Away » Rbloggers, and kindly contributed to R-bloggers)

Manipulate Package

The manipulate package from RStudio allows you to create simple Tcl/Tk operators for interactive visualization. I will use it for a simple slider to view different slices of an image.

library(manipulate)

fslr package

I'm calling the fslr package because I know that if you have it installed, you will likely have FSL and have a 1mm T1 template from MNI in a specific location. fslr also loads the oro.nifti package so that readNIfTI is accessible after loading fslr. You can download a test NIfTI image here if you don't have access to any and don't have FSL downloaded.

Here I will read in the template image:

library(fslr)
options(fsl.path='/usr/local/fsl')
template = file.path(fsldir(), "data/standard", "MNI152_T1_1mm_brain.nii.gz")
img = readNIfTI(template)

The iplot function

The iplot function defined below takes in a nifti object, the specific plane to be plotted and additional options to be passed to oro.nifti::image. The function is located on my GitHub here.

iplot = function(img, plane = c("axial", "coronal", "sagittal"), ...){
  ## pick the plane
  plane = match.arg(plane, c("axial", "coronal", "sagittal"))
  # Get the max number of slices in that plane for the slider
  ns = switch(plane,
              "axial" = dim(img)[3],
              "coronal" = dim(img)[2],
              "sagittal" = dim(img)[1])
  ## run the manipulate command
  manipulate({
    image(img, z = z, plot.type = "single", plane = plane, ...)
    # this will return mouse clicks (future experimental work)
    pos <- manipulatorMouseClick()
    if (!is.null(pos)) {
      print(pos)
    }
  },
  ## make the slider
  z = slider(1, ns, step = 1, initial = ceiling(ns/2))
  )
}

Example plots

Here are some examples of how this iplot function would be used:

iplot(img)
iplot(img, plane = "coronal")
iplot(img, plane = "sagittal")

The result will be a plotted image of the slice with a slider. This is most useful if you run it within RStudio.

Below are 2 example outputs of what you see in RStudio:

Slice 91:

Slice 145:

Conclusions

The iplot function allows users to interactively explore neuroimages. The plotting is not as fast as I'd like; I may try to speed up the oro.nifti::image command or implement some subsampling. It does, however, show a proof of concept of how interactive neuroimaging visualization can be done in R.
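For what the subsampling idea might look like, here is a rough sketch (my own, not from the post; it assumes ordinary array-style subsetting of the nifti object and rebuilds a bare nifti around the result, discarding header details):

# Keep every `by`-th voxel in each dimension so image() has far less to draw,
# at the cost of resolution.
subsample = function(img, by = 2) {
  idx = lapply(dim(img)[1:3], function(n) seq(1, n, by = by))
  oro.nifti::nifti(img[idx[[1]], idx[[2]], idx[[3]]])
}
iplot(subsample(img))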

Note

manipulate must be run in RStudio for manipulation. The fslr function fslview will call FSLView from FSL for interactive visualization. This is an option of interactive neuroimaging “in R”, but not a real or satisfactory implementation for me (even though I use it frequently). If anyone has implemented such a solution in R, I'd love to hear about it.


To leave a comment for the author, please follow the link and comment on his blog: A HopStat and Jump Away » Rbloggers. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Where are the R users?

Thu, 2015-04-09 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

A recent post by David Smith included a map that shows the locations of R user groups around the world. While it is exhilarating to see how R user groups span the globe, the map does not give any idea about the size of the community at each location. The following plot, constructed from information on the websites of the groups listed in Revolution Analytics' Local R User Group Directory (the same source for the map), shows the membership size for the largest 25 groups.

There are 11 groups with over a thousand members and a couple more who are close to achieving that milestone. With the possible exception of Groupe des utilisateurs du logicel R, I believe all of the groups in the top 25 hold regular, face-to-face meetings: so the plot gives some idea of the size, location and density of the "social" R community. 
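(The author's data and code are linked at the end of the post; purely as an illustration, a bar chart along these lines could be drawn with something like the sketch below, where the data frame rugs and its columns group and members are hypothetical names.)

library(ggplot2)
# rugs: hypothetical data frame with one row per user group,
# columns `group` (name) and `members` (membership count).
top25 <- head(rugs[order(-rugs$members), ], 25)
ggplot(top25, aes(x = reorder(group, members), y = members)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "", y = "Members")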

There are, however, quite a few problems with the data that make it less than optimal for even the limited goal of characterizing the number of R users who regularly participate in face-to-face R events.

  • Only 102 of the 159 groups make it easy to find membership information. (Most of these are groups that have webpages through meetup.com which prominently display membership information.)
  • It is not clear that membership information is up to date. As an R user group organizer, I know that it is difficult to keep track of active members.
  • People join multiple groups. It is very likely, for example, there is a significant overlap between the members of the Berkeley R Language Beginner Study Group and the Bay Area useR Group.

Nevertheless, I think this plot along with the map mentioned above do give some idea of where the action is in the R world. 

Note that the data used to build the plot may be obtained here:  Download RUGS_WW_4_7_15 and the code is here:  Download RUG_Bar_Chart

Also note that it is still a good time to start a new R user group. The deadline for funding for large user groups through Revolution Analytics 2015 R User Group Sponsorship Program has passed. However, we will be taking applications for new groups until the end of September. The $120 grant for Vector level groups should be enough to finance a site on meetup.com. If you are thinking of starting a new group have a look at the link above as well as our tips for getting started.

 

To leave a comment for the author, please follow the link and comment on his blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Monitoring Price Fluctuations of Book Trade-In Values on Amazon

Wed, 2015-04-08 20:00

(This article was first published on Category: R | Statistically Significant, and kindly contributed to R-bloggers)

I am planning to finish school soon and I would like to shed some weight before moving on. I have collected a fair number of books that I will likely never use again and it would be nice to get some money for them. Sites like Amazon and eBay let you sell your book to other customers, but Amazon will also buy some of your books directly (Trade-Ins), saving you the hassle of waiting for a buyer.

Before selling, I remembered listening to a Planet Money episode about a couple of guys that tried to make money off of buying and selling used textbooks on Amazon. Their strategy was to buy books at the end of a semester when students are itching to get rid of them, and sell them to other students at the beginning of the next semester. To back up their business, they have been scraping Amazon’s website for years, keeping track of prices in order to find the optimal times to buy and sell.

I collected a few books I was willing to part with and set up a scraper in R. I am primarily interested in selling my books to Amazon, so I tracked Amazon’s Trade-In prices for these books. This was done fairly easily with Hadley Wickham’s package rvest and the Chrome extension Selector Gadget. Selector Gadget tells me that the node I’m interested in is #tradeInButton_tradeInValue. The code to do the scraping is below.

Amazon_scrape_prices.R

library(rvest)

urls_df = read.csv("AmazonBookURLs.csv", stringsAsFactors = FALSE, comment.char = "")
load(file = "AmazonBookPrices.RData")
price_df_temp = data.frame(Title = urls_df$Title, Date = Sys.time(),
                           Price = NA_real_, stringsAsFactors = FALSE)
for (i in 1:nrow(urls_df)) {
  tradein_html = urls_df$URL[i] %>%
    html() %>%
    html_node("#tradeInButton_tradeInValue")
  if (is.null(tradein_html)) {
    next
  }
  price = tradein_html %>%
    html_text() %>%
    gsub("(^[[:space:]]+$|[[:space:]]+$)", "", .) %>%
    as.numeric()
  price_df_temp$Price[i] = price
}
price_df = rbind(price_df, price_df_temp)

save(price_df, file = "AmazonBookPrices.RData")

After manually collecting this data for less than a week, I am able to plot the trends for the eight books I am interested in selling. The plot and code are below.

plot_amazon_prices.R

library(ggplot2); theme_set(theme_bw())
library(scales)

price_df$TitleTrunc = paste0(substring(price_df$Title, 1, 30),
                             ifelse(nchar(price_df$Title) > 30, "…", ""))
ggplot(price_df, aes(Date, Price)) +
  geom_step() +
  geom_point() +
  facet_wrap(~ TitleTrunc, scales = "free_y") +
  scale_y_continuous(labels = dollar) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0))

I am surprised how much the prices fluctuate. I expected them to be constant for the most part, with large changes once a week or less often. Apparently Amazon is doing quite a bit of tweaking to determine optimal price points. I would also guess that $2 is their minimum trade-in price. It looks like I missed out on my best chance to sell A First Course in Stochastic Processes, but that the price of A Primer on Linear Models will keep rising forever.

To leave a comment for the author, please follow the link and comment on his blog: Category: R | Statistically Significant. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs