R bloggers

R news and tutorials contributed by (563) R bloggers

Visualizing Website Pathing With Network Graphs

Mon, 2014-09-08 06:40

(This article was first published on randyzwitch.com » R, and kindly contributed to R-bloggers)

Last week, version 1.4 of RSiteCatalyst was released, and now it’s possible to get site pathing information directly within R. Now, it’s easy to create impressive looking network graphs from your Adobe Analytics data using RSiteCatalyst and d3Network. In this blog post, I will cover simple and force-directed network graphs, which show the pairwise representation between pages. In a follow-up blog post, I will show how to visualize longer paths using Sankey diagrams, also from the d3Network package.

Obtaining Pathing Data With QueuePathing

Although the QueuePathing() function is new to RSiteCatalyst, its syntax should feel familiar (even with all of the breaking changes we made!). In the case of creating our network graphs, we want to download all pairwise combinations of pages, which is easy to do using the ::anything:: operator. Because we are using a pathing pattern of (“::anything::”, “::anything::”), the data frame that is returned from this function will have three columns: step.1, step.2 and count, which is the number of occurrences of the path.
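The code block showing this call did not survive syndication, so here is a hedged sketch of roughly what it looks like (the credentials, report suite ID and date range are placeholders; check ?QueuePathing for the exact argument names and defaults):

library(RSiteCatalyst)

# Placeholder credentials for the Adobe Analytics API
SCAuth("user:company", "shared_secret")

# Download all pairwise page combinations ("every page to every next page")
queue_pathing_pages <- QueuePathing("report-suite-id",
                                    "2014-01-01", "2014-08-31",
                                    metric = "pageviews",
                                    element = "page",
                                    pattern = c("::anything::", "::anything::"))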

Plotting Graph Using d3SimpleNetwork

Before jumping into the plotting, we need to do some quick data cleaning. The first few lines below are optional; I don’t set the Adobe Analytics s.pageName on each of my blog pages (a worst practice if there ever was one!), so I use the sub() function in base R to strip the domain name from the beginning of each page name. The other data frame modification is to remove the ‘Entered Site’ and ‘Exited Site’ rows from the page-name pairs. Although this is important information in general, these behaviors aren’t needed to show the pairwise relationship between pages.
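The cleaning and plotting snippet was also lost in syndication; a minimal sketch along the lines described above might look like this (the domain string is a placeholder, and the column names follow the three-column data frame returned by QueuePathing()):

# Strip the domain name from the beginning of both page columns
queue_pathing_pages$step.1 <- sub("http://randyzwitch.com", "",
                                  queue_pathing_pages$step.1, fixed = TRUE)
queue_pathing_pages$step.2 <- sub("http://randyzwitch.com", "",
                                  queue_pathing_pages$step.2, fixed = TRUE)

# Drop the 'Entered Site' / 'Exited Site' placeholder pages
graph_data <- subset(queue_pathing_pages,
                     !(step.1 %in% c("Entered Site", "Exited Site")) &
                     !(step.2 %in% c("Entered Site", "Exited Site")))

# Simple pairwise network graph written to a standalone HTML file
library(d3Network)
d3SimpleNetwork(graph_data, Source = "step.1", Target = "step.2",
                file = "simple_network.html")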

Running the above code results in the following graph:

Hmmm…looks like a blob of spaghetti, a common occurrence when creating graphs. We can do better.

Pruning Edges From The Graph

There are many complex algorithms for determining how to prune edges/nodes from a network. For the sake of simplicity, I’m going to use a very simple rule: each path has to occur more than 5 times to be included in the network. This prunes roughly 80% of the pairwise page combinations while keeping ~75% of the occurrences, and it is simple to do using the subset() function in R. The result of pruning the edges is a much less cluttered graph:
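As a sketch of that pruning step (continuing the hypothetical objects from the snippet above):

# Keep only page pairs that occur more than 5 times
graph_data_pruned <- subset(graph_data, count > 5)

d3SimpleNetwork(graph_data_pruned, Source = "step.1", Target = "step.2",
                file = "pruned_network.html")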
Even with fewer edges in the graph, we still lose some of the information about the pages, since we don’t know what topics/groups the pages represent. We can fix that using a slightly more complex version of the d3Network graph code.

Force-directed graphs

The graphs above outline the structure of randyzwitch.com, but they can be improved by color-coding the nodes to represent the topic of the post, as well as making the edges thicker or thinner based on how frequently the path occurs. This can be done using the d3ForceNetwork() function, like so:
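This code block was stripped as well. d3ForceNetwork() takes a links data frame with zero-based node indices plus a nodes data frame whose group column drives the color coding, so a hedged sketch might look like this (the topic assignment via grepl() is purely illustrative):

# Node list with a topic/group used for colour coding
nodes <- data.frame(name = unique(c(graph_data_pruned$step.1,
                                    graph_data_pruned$step.2)),
                    stringsAsFactors = FALSE)
nodes$group <- ifelse(grepl("rsitecatalyst", nodes$name), "RSiteCatalyst", "Other")

# Links with zero-based indices; the count column controls edge width
links <- data.frame(source = match(graph_data_pruned$step.1, nodes$name) - 1,
                    target = match(graph_data_pruned$step.2, nodes$name) - 1,
                    value  = graph_data_pruned$count)

d3ForceNetwork(Links = links, Nodes = nodes,
               Source = "source", Target = "target", Value = "value",
               NodeID = "name", Group = "group",
               opacity = 0.85, file = "force_network.html")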
Running the code results in the following force-directed graph:

Interpretation

I’m not going to lie: all three of these diagrams are hard to interpret. Like word clouds, network graphs are often visually interesting, yet it is difficult to extract any concrete information from them. Network graphs also have a tendency to reinforce what you already know (you or someone you know designed your website, so you should already have a feel for its structure!).

However, in the case of the force-directed graph above, I do see some interesting patterns. Specifically, there are a considerable number of nodes that aren’t attached to the main network structure. This may be due to my method of pruning the network edges. More likely, these disconnected nodes represent “dead-ends” in my blog: few pages link to them, they have technical errors, they are high bounce-rate pages, or they cover one-off topics that satiate the reader.

In terms of action I can take, I can certainly look up the bounce rate for these disconnected pages/nodes and re-write the content to make it more ‘sticky’. There’s also the case of the way my “Related Posts” plugin determines related pages. As far as I know, it’s quite naive, using the existing words on the page to determine relationships between posts. So one follow-up could be to create an actual recommender system to better suggest content to my readers. Perhaps that’s a topic for a different blog post.

Regardless of the actions I end up taking from this information, hopefully this blog post has sparked some ideas about how to use RSiteCatalyst in a non-standard way, extending the standard digital analytics information you are capturing with Adobe Analytics into interesting visualizations and potentially new insights.

Example Data

For those of you who aren’t Adobe Analytics customers (or are, but don’t have API access), here are the data from the queue_pathing_pages data frame above. Just read this data into R, then you should be able to follow along with the d3Network code.

Related posts:
  1. RSiteCatalyst Version 1.4 Release Notes
  2. RSiteCatalyst Version 1.1 Release Notes
  3. RSiteCatalyst Version 1.2 Release Notes


Categories: Methodology Blogs

Generating quantile forecasts in R

Mon, 2014-09-08 01:36

(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

From today’s email:

I have just finished reading a copy of ‘Forecasting: Principles and Practice’ and I have found the book really interesting. I have particularly enjoyed the case studies and focus on practical applications.

After finishing the book I have joined a forecasting competition to put what I’ve learnt to the test. I do have a couple of queries about the forecasting outputs required. The output required is a quantile forecast, is this the same as prediction intervals? Is there any R function to produce quantiles from 0 to 99?

If you were able to point me in the right direction regarding the above it would be greatly appreciated.

Many Thanks,

Presumably the competition is GEFCOM2014 which I’ve posted about before.

The future value of a time series is unknown, so you can think of it as a random variable, and its distribution is the “forecast distribution”. A “quantile forecast” is a quantile of the forecast distribution. The usual point forecast is often the mean or the median of the forecast distribution. A prediction interval is a range of specified coverage probability under that distribution. For example, if we assume the forecast distribution is normal, then the 95% prediction interval is defined by the 2.5% and 97.5% quantiles of the forecast distribution.

Still assuming normality, we could generate the forecast quantiles from 1% to 99% in R using

qnorm((1:99)/100, m, s)

where m and s are the estimated mean and standard deviation of the forecast distribution. So if you are using the forecast package in R, you can do something like this:

library(forecast)
fit <- auto.arima(WWWusage)
fc <- forecast(fit, h=20, level=95)

# Back out the forecast standard deviation from the 95% interval
qf <- matrix(0, nrow=99, ncol=20)
m <- fc$mean
s <- (fc$upper-fc$lower)/1.96/2
for(h in 1:20)
  qf[,h] <- qnorm((1:99)/100, m[h], s[h])

plot(fc)
matlines(101:120, t(qf), col=rainbow(120), lty=1)

Of course, assuming a normal distribution is rather restrictive and not very interesting. For a more interesting but much more complicated approach to generating quantiles, see my 2010 paper on Density forecasting for long-term peak electricity demand.


Categories: Methodology Blogs

An exercise in non-linear modeling

Sun, 2014-09-07 04:05

(This article was first published on G-Forge » R, and kindly contributed to R-bloggers)

Finding the right curve can be tricky. The image is CC by Martin Gommel.

In my previous post I wrote about the importance of age and why it is a good idea to avoid modeling it as a linear variable. In this post I will (1) go through multiple options for modeling non-linear effects in a linear regression setting, (2) benchmark the methods on a real dataset, and (3) look at how the non-linearities actually look. The post is based on the supplement to my article on age and health-related quality of life (HRQoL).

Background

What is linearity?

Wikipedia has an excellent explanation of linearity:

linearity refers to a mathematical relationship or function that can be graphically represented as a straight line

Why do we assume linearity?

Linearity is a common assumption that is made when building a linear regression model. In a linear relation, a continuous variable has the same impact throughout the variable’s span. This makes the estimate easy to interpret; an increase of one unit gives the corresponding coefficient’s change in the outcome. While this generates simple models with their advantages, it is difficult to believe that nature follows a simple straight line. With today’s large data sets I believe that our models should permit non-linear effects.

How do we do non-linearity?

There are plenty of non-linear alternatives that can be used to better find the actual relationships. Most of them rely on converting the single continuous variable into several. The simplest form is when we transform the variable into a polynomial, e.g. instead of having the model:

HRQoL ~ β0 + βage * Age + βsex * Sex

We expand age to also include the age squared:

HRQoL ~ β0 + βage * Age + βage² * Age² + βsex * Sex

This allows the line to bend. Unfortunately, as we add the squared term the coefficients become more difficult to interpret, and after adding a cubic term, i.e. Age³, it is almost impossible to interpret the coefficients. Due to this difficulty in interpretation, I use either the rms::contrast function or stats::predict in order to illustrate how the variable behaves.
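As a small illustration of that approach (my own sketch, not from the original post; a data frame df with HRQoL, Age and Sex columns is assumed):

# Fit a quadratic age effect and plot the predicted curve with stats::predict
fit <- lm(HRQoL ~ poly(Age, 2) + Sex, data = df)
new_data <- data.frame(Age = 40:90, Sex = "Female")
plot(new_data$Age, predict(fit, newdata = new_data), type = "l",
     xlab = "Age (years)", ylab = "Predicted HRQoL")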

Splines

A frequently used alternative to polynomials are splines. The most basic form of a spline consists of lines that are connected at different “knots”. For instance, a linear spline with 1 knot can assume a V-shape, while 2 knots allow for an N-shaped relationship. The number of knots decides the flexibility of a spline, i.e. more knots allow a more detailed description of the relationship.
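As a toy illustration (mine, not from the post) of how a linear spline with one knot is just an extra constructed regressor:

# The term pmax(age - 60, 0) is zero below the knot at 60 and lets the slope change there
age <- 40:90
y <- 0.5 * age - 0.8 * pmax(age - 60, 0) + rnorm(length(age))
fit <- lm(y ~ age + pmax(age - 60, 0))
coef(fit)  # slope below the knot and the change in slope above it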

The models

The dataset comes from the Swedish Hip Arthroplasty Register’s excellent PROM database and consists of more than 30,000 patients who have filled out the EQ-5D form both before and after surgery. We will focus on age and its impact on the EQ-5D index and the EQ VAS one year after total hip replacement surgery. We will control for sex, Charnley class, pre-operative HRQoL, and pre-operative pain.

The restricted cubic spline

I’m a big fan of Frank Harrell’s rms package, so we will start there, although we first split the dataset using caret::createDataPartition(). We then need to set up rms::datadist in order for the rms functions to work as expected:

library(caret)
train_no <- createDataPartition(1:nrow(my_dataset), list = FALSE, p = .7)
train <- my_dataset[train_no, ]
test <- my_dataset[-train_no, ]

# A list with all the fits that are later to be benchmarked
fits <- list(eq5d = list(), eq_vas = list())

# The rms-setup
library(rms)
ddist <- datadist(train)
options("datadist" = "ddist")

Frank Harrell is a proponent of restricted cubic splines, also known as natural cubic splines. This is a type of spline that uses cubic terms in the center of the data and restricts the ends to a straight line, preventing the center from distorting the ends (i.e. curling). His rcs() also integrates nicely with anova() in order to check whether non-linearity actually exists:

idx_model <- ols(eq5d1 ~ rcs(Age, 3) + Sex * Charnley_class +
                   rcs(eq5d0, 3) + rcs(pain0, 3),
                 data=train, x=TRUE, y=TRUE)
anova(idx_model, ss=FALSE)

gives:

Analysis of Variance          Response: eq5d1

Factor                                                    F d.f.      P
Age                                                  140.71    2 <.0001
  Nonlinear                                           41.07    1 <.0001
Sex (Factor+Higher Order Factors)                     26.94    3 <.0001
  All Interactions                                     6.80    2 0.0011
Charnley_class (Factor+Higher Order Factors)         275.08    4 <.0001
  All Interactions                                     6.80    2 0.0011
eq5d0                                                522.37    2 <.0001
  Nonlinear                                           36.54    1 <.0001
pain0                                                  3.85    2 0.0213
  Nonlinear                                            4.56    1 0.0328
Sex * Charnley_class (Factor+Higher Order Factors)     6.80    2 0.0011
TOTAL NONLINEAR                                       30.48    3 <.0001
TOTAL NONLINEAR + INTERACTION                         21.02    5 <.0001
TOTAL                                                330.16   11 <.0001

Error d.f.: 19096

As we can see, the model seems OK for the EQ-5D index, supporting both the non-linearity and the interaction between Charnley class and sex. If we for some reason cannot use the rms package, we can check for linearity by using splines::ns() with two regular models, as suggested below:

lm1 <- lm(eq5d1 ~ Age + Sex * Charnley_class +
            ns(eq5d0, 3) + ns(pain0, 3), data=train)
lm2 <- lm(eq5d1 ~ ns(Age, 3) + Sex * Charnley_class +
            ns(eq5d0, 3) + ns(pain0, 3), data=train)
anova(lm1, lm2)

gives:

Analysis of Variance Table

Model 1: eq5d1 ~ Age + Sex * Charnley_class + ns(eq5d0, 3) + ns(pain0, 3)
Model 2: eq5d1 ~ ns(Age, 3) + Sex * Charnley_class + ns(eq5d0, 3) + ns(pain0, 3)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1  19095 193.01
2  19093 192.66  2   0.35112 17.398 2.825e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In order to avoid overfitting we will try to select models based upon the AIC/BIC criteria. The selection simply amounts to finding the lowest value; in general, AIC allows slightly more complex models than BIC. We will start by finding the optimal number of knots for the EQ-5D index using the AIC:

#' A simple function for updating the formula and extracting
#' the information criteria
#'
#' @param no A number that is used together with the add_var_str
#' @param fit A regression fit that is used for the update
#' @param rm_var The variable that is to be substituted
#' @param add_var_str A sprintf() string that accepts the no
#'   parameter for each update
#' @param ic_fn The information criteria function (AIC/BIC)
getInfCrit <- function(no, fit, rm_var, add_var_str, ic_fn) {
  new_var <- sprintf(add_var_str, no)
  updt_frml <- as.formula(sprintf(".~.-%s+%s", rm_var, new_var))
  ret <- ic_fn(update(fit, updt_frml))
  names(ret) <- new_var
  return(ret)
}

# We start with a model where the other splines
# have been AIC-selected
idx_model <- ols(eq5d1 ~ rcs(Age, 3) + Sex * Charnley_class +
                   rcs(eq5d0, 8) + rcs(pain0, 6),
                 data=train, x=TRUE, y=TRUE)

sapply(3:9, getInfCrit, fit = idx_model, rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)", ic_fn=AIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#   -33678.89   -33674.68   -33686.30   -33686.93   -33685.95   -33686.73   -33699.37

fits$eq5d[["RCS with AIC"]] <- update(idx_model, .~.-rcs(Age, 3)+rcs(Age, 5))

It could be argued that the model should stop at 3 knots, but I chose to continue a little higher as the drop was relatively large between 4 and 5 knots. This is most likely due to an unfortunate fit at 4 knots. We could also have motivated a larger number of knots, but even with proper visualization these are difficult to interpret. When modeling confounders, such as the preoperative EQ-5D index (eq5d0) and the pre-operative pain (pain0), we may prefer a higher number of knots in order to avoid residual confounding, and we do not need to worry about visualizing or explaining those relations.

Now if we apply the same methodology to the EQ-VAS we get:

vas_model <- ols(eq_vas1 ~ Sex * Charnley_cat + Sex + rcs(Age, 3) +
                   rcs(eq_vas0, 3) + OpNr + rcs(pain0, 3),
                 data=train, x=TRUE, y=TRUE)
sapply(3:9, getInfCrit, fit = vas_model, rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)", ic_fn=AIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#    166615.6    166619.8    166600.2    166600.1    166602.0    166603.0    166596.7
fits$eq_vas[["RCS with AIC"]] <- update(vas_model, .~.-rcs(Age, 3)+rcs(Age, 5))

Now let's do the same for the BIC:

idx_model <- ols(eq5d1 ~ rcs(Age, 3) + Sex * Charnley_cat +
                   rcs(eq5d0, 3) + rcs(pain0, 3),
                 data=train, x=TRUE, y=TRUE)

sapply(3:9, getInfCrit, fit = idx_model, rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)", ic_fn=BIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#   -33486.35   -33474.16   -33477.95   -33470.69   -33462.17   -33455.28   -33459.98
fits$eq5d[["RCS with BIC"]] <- idx_model

vas_model <- ols(eq_vas1 ~ rcs(Age, 3) + Sex * Charnley_cat +
                   rcs(eq_vas0, 3) + OpNr + rcs(pain0, 3),
                 data=train, x=TRUE, y=TRUE)

sapply(3:9, getInfCrit, fit = vas_model, rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)", ic_fn=BIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#    166725.7    166737.8    166726.0    166733.8    166743.5    166752.4    166754.0
fits$eq_vas[["RCS with BIC"]] <- vas_model

B-splines

B-splines, also known as basis splines, are a common alternative to restricted cubic splines that also use knots to control flexibility. As they are not restricted at the ends, they have more flexible tails than restricted cubic splines. In order to get a comparison we will use the same model for the other variables:

# Use the same setting model as used in the RCS
vars <- attr(terms(fits$eq5d[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with AIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit, fit = idx_model, rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)", ic_fn=AIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#  -33669.07  -33683.35  -33680.55  -33683.44  -33683.03  -33681.79  -33685.55
# Chose 6 knots for illustration as it would otherwise be the
# same as the BIC model - not that interesting
fits$eq5d[["BS with AIC"]] <- update(idx_model, .~.-bs(Age, 3)+bs(Age, 6))

vars <- attr(terms(fits$eq5d[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with BIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit, fit = idx_model, rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)", ic_fn=BIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#  -33468.29  -33475.09  -33464.40  -33459.42  -33451.12  -33442.38  -33438.71
fits$eq5d[["BS with BIC"]] <- update(idx_model, .~.-bs(Age, 3)+bs(Age, 4))

vars <- attr(terms(fits$eq_vas[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with AIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit, fit = vas_model, rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)", ic_fn=AIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#   166640.3   166617.2   166621.1   166612.9   166613.2   166614.8   166615.0
fits$eq_vas[["BS with AIC"]] <- update(vas_model, .~.-bs(Age, 3)+bs(Age, 6))

vars <- attr(terms(fits$eq_vas[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with BIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit, fit = vas_model, rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)", ic_fn=BIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#   166750.4   166735.1   166746.9   166746.5   166754.7   166764.2   166772.2
fits$eq_vas[["BS with BIC"]] <- update(vas_model, .~.-bs(Age, 3)+bs(Age, 4))

Polynomials

Polynomials can be of varying degrees and have often been put forward as a simple approach to fitting a more flexible non-linear relationship. As the majority of the patients are located around the mean age of 69.1 years, it is important to remember that these patients will have the strongest influence on the shape of the curve. It is therefore possible that the tails are less reliable than the central portion.

# Use the same setting model as used in the RCS
vars <- attr(terms(fits$eq5d[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with AIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit, fit = idx_model, rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)", ic_fn=AIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#    -33669.58    -33669.07    -33680.10    -33678.83    -33681.82    -33681.89    -33680.35    -33680.17
fits$eq5d[["Polynomial with AIC"]] <- update(idx_model, .~.-poly(Age, 2)+poly(Age, 6))

vars <- attr(terms(fits$eq5d[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with BIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit, fit = idx_model, rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)", ic_fn=BIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#    -33476.79    -33468.29    -33471.83    -33462.59    -33457.76    -33450.37    -33440.97    -33432.99
fits$eq5d[["Polynomial with BIC"]] <- idx_model

vars <- attr(terms(fits$eq_vas[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with AIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit, fit = vas_model, rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)", ic_fn=AIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#     166638.4     166640.3     166622.3     166623.9     166615.7     166616.8     166617.2     166617.5
fits$eq_vas[["Polynomial with AIC"]] <- update(vas_model, .~.-poly(Age, 2)+poly(Age, 6))

vars <- attr(terms(fits$eq_vas[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with BIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit, fit = vas_model, rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)", ic_fn=BIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#     166740.6     166750.4     166740.2     166749.7     166749.3     166758.3     166766.6     166774.7
fits$eq_vas[["Polynomial with BIC"]] <- update(vas_model, .~.-poly(Age, 2)+poly(Age, 4))

Multiple Fractional Polynomial Models

Multiple fractional polynomials (MFP) have been proposed as an alternative to splines. These use a defined set of transformations of the variable and omit predictors according to their p-values. The mfp package has a built-in selection algorithm, so we do not need to rely on either AIC or BIC with MFP.

library(mfp)

# We will use the simple BIC models; instead of rcs(), which is not
# available for mfp, we use ns() from the splines package
fits$eq5d[["MFP"]] <- mfp(eq5d1 ~ fp(Age) + Sex * Charnley_class +
                            ns(eq5d0, 3) + ns(pain0, 3), data=train)
fits$eq_vas[["MFP"]] <- mfp(eq_vas1 ~ fp(Age) + Sex * Charnley_class +
                              ns(eq_vas0, 3) + ns(pain0, 3), data=train)

Generalized additive models

Generalized additive models (GAMs) are an extension of generalized linear models that specialize in non-linear relationships. Elements of Statistical Learning by Hastie et al. is an excellent source for diving deeper into these. The simplest way to explain GAM smoothers is that they penalize the flexibility in order to avoid over-fitting. There are plenty of options; the ones used here are:

Thin plate regression splines: This is generally the most common type of smoother in GAM models. The name refers to the physical analogy of bending a thin sheet of metal.
Cubic regression splines: The basis for the spline is cubic with evenly spread knots.
P-splines: P-splines are similar to B-splines in that they share the same basis; the main difference is that P-splines enforce a penalty on the coefficients.

library(mgcv)

fits$eq5d[["GAM thin plate"]] <-
  gam(eq5d1 ~ s(Age, bs="tp") + Sex * Charnley_class +
        ns(eq5d0, 3) + ns(pain0, 3), data=train)
fits$eq_vas[["GAM thin plate"]] <-
  gam(eq_vas1 ~ s(Age, bs="tp") + Sex * Charnley_class +
        ns(eq_vas0, 3) + ns(pain0, 3), data=train)

fits$eq5d[["GAM cubic regression"]] <-
  update(fits$eq5d[["GAM thin plate"]], .~.-s(Age, bs="tp")+s(Age, bs="cr"))
fits$eq_vas[["GAM cubic regression"]] <-
  update(fits$eq_vas[["GAM thin plate"]], .~.-s(Age, bs="tp")+s(Age, bs="cr"))

fits$eq5d[["GAM P-splines"]] <-
  update(fits$eq5d[["GAM thin plate"]], .~.-s(Age, bs="tp")+s(Age, bs="ps"))
fits$eq_vas[["GAM P-splines"]] <-
  update(fits$eq_vas[["GAM thin plate"]], .~.-s(Age, bs="tp")+s(Age, bs="ps"))

Benchmark

With all these fancy models in place, we will first evaluate how they perform under cross-validation and then on the test set that we set apart at the start. We will evaluate according to the root-mean-square error (RMSE) and the mean absolute percentage error (MAPE). RMSE tends to penalize outliers, while MAPE is more descriptive of the average performance. Our testing functions are therefore:

rmse <- function(fit, newdata, outcome_var){
  pred <- predict(fit, newdata=newdata)
  sqrt(mean((newdata[, outcome_var]-pred)^2, na.rm=TRUE))
}

mape <- function(fit, newdata, outcome_var){
  pred <- predict(fit, newdata=newdata)
  mean(abs(newdata[, outcome_var]-pred)/mean(newdata[, outcome_var], na.rm=TRUE)*100,
       na.rm=TRUE)
}

Furthermore, in this particular article I wanted to look into what happens at the tails, as almost 70% of the patients were between 60 and 80 years old while the variable spanned 40 to 100 years. I therefore defined a central versus a peripheral portion, where the central portion was defined as lying between the 15th and 85th percentiles.

The cross-validation is a straightforward implementation. As this can take a little time, it is useful to think about parallelization; here we will use the parallel package, although foreach is also an excellent option.

crossValidateInParallel <- function(fit, df, outcome_var, k=10, tail_prop=.15){
  df$central <- with(df, Age >= quantile(Age, probs=tail_prop) &
                         Age <= quantile(Age, probs=1-tail_prop))
  df$cv <- rep(1:k, length.out=nrow(df))

  cv_internal_fn <- function(x, cv_fit, df, outcome_var){
    lc_train <- subset(df, cv != x)
    lc_test <- subset(df, cv == x)

    # In the original version I had a BIC/AIC optimizing
    # function that worked with the cross-validation.
    # Here, we'll keep it simple and just use the previously
    # chosen functions
    cv_fit <- update(fit, data=lc_train)

    return(c(Main_RMSE = rmse(cv_fit, newdata=lc_test, outcome_var=outcome_var),
             Central_RMSE = rmse(cv_fit, newdata=subset(lc_test, central == TRUE),
                                 outcome_var=outcome_var),
             Peripheral_RMSE = rmse(cv_fit, newdata=subset(lc_test, central == FALSE),
                                    outcome_var=outcome_var),
             Main_MAPE = mape(cv_fit, newdata=lc_test, outcome_var=outcome_var),
             Central_MAPE = mape(cv_fit, newdata=subset(lc_test, central == TRUE),
                                 outcome_var=outcome_var),
             Peripheral_MAPE = mape(cv_fit, newdata=subset(lc_test, central == FALSE),
                                    outcome_var=outcome_var)))
  }

  # It is convenient to use the detectCores() function in
  # order to use the machine's full capacity. Subtracting
  # 1 is also preferable as it prevents the computer from
  # stopping other tasks, i.e. you can keep surfing the web :-)
  cl <- makeCluster(mc <- getOption("cl.cores",
                                    ifelse(detectCores() <= 1, 1, detectCores() - 1)))
  # In Windows each core starts out fresh and we therefore need
  # to export the functions, data etc so that they can access
  # it as expected or you will get nasty errors
  clusterEvalQ(cl, library(mfp))
  clusterEvalQ(cl, library(mgcv))
  clusterEvalQ(cl, library(rms))
  clusterEvalQ(cl, library(splines))
  clusterEvalQ(cl, library(stringr))
  clusterExport(cl, c("rmse", "mape"))

  res <- parallel::parLapply(1:k, cl=cl, fun=cv_internal_fn,
                             outcome_var=outcome_var, cv_fit = fit, df=df)

  stopCluster(cl)
  res <- as.data.frame(Gmisc::mergeLists(lapplyOutput=res))
  ret <- colMeans(res)
  attr(ret, "raw") <- res
  return(ret)
}

If we run all the models we get the following result:

                                      RMSE                        MAPE
Method                       Main  Central  Peripheral    Main  Central  Peripheral

The EQ-5D index
Restricted cubic splines
  AIC                       0.100    0.098       0.105    8.83     8.60        9.42
  BIC                       0.100    0.099       0.105    8.86     8.63        9.46
B-splines
  AIC                       0.100    0.099       0.105    8.85     8.80        9.48
  BIC                       0.101    0.099       0.105    8.91     8.79        9.48
Polynomials
  AIC                       0.103    0.105       0.117    8.94     9.09       10.21
  BIC                       0.103    0.105       0.116    8.96     9.07       10.09
MFP                         0.101    0.099       0.105    8.87     8.64        9.43
Generalized additive models
  Thin plate                0.100    0.098       0.105    8.86     8.62        9.47
  Cubic regression          0.100    0.098       0.105    8.86     8.62        9.47
  P-splines                 0.100    0.098       0.105    8.86     8.62        9.47

The EQ VAS
Restricted cubic splines
  AIC                       18.54    18.49       18.66    19.0     18.7        19.7
  BIC                       18.54    18.49       18.67    19.0     18.7        19.7
B-splines
  AIC                       18.54    18.66       18.65    19.0     19.1        19.7
  BIC                       18.55    18.66       18.69    19.1     19.0        19.7
Polynomials
  AIC                       19.27    20.06       21.81    19.5     20.1        22.9
  BIC                       19.24    19.96       21.67    19.5     20.0        22.6
MFP                         18.55    18.50       18.69    19.0     18.7        19.7
Generalized additive models
  Thin plate                18.54    18.48       18.67    19.0     18.7        19.7
  Cubic regression          18.54    18.48       18.67    19.0     18.7        19.7
  P-splines                 18.54    18.49       18.67    19.0     18.7        19.7

The difference between the methods is almost negligible, although the polynomials are clearly not performing as well as the other approaches. This was less pronounced in the test set, where the polynomials still had some issues with the tails:

                                      RMSE                        MAPE
Method                       Main  Central  Peripheral    Main  Central  Peripheral

The EQ-5D index
Restricted cubic splines
  AIC                       0.100    0.098       0.104    8.75     8.56        9.23
  BIC                       0.100    0.099       0.104    8.79     8.60        9.27
B-splines
  AIC                       0.100    0.099       0.104    8.77     8.77        9.30
  BIC                       0.100    0.099       0.105    8.85     8.78        9.34
Polynomials
  AIC                       0.100    0.099       0.106    8.73     8.66        9.18
  BIC                       0.101    0.099       0.106    8.78     8.69        9.15
MFP                         0.100    0.099       0.104    8.79     8.61        9.25
Generalized additive models
  Thin plate                0.100    0.098       0.105    8.79     8.59        9.29
  Cubic regression          0.100    0.098       0.104    8.79     8.60        9.27
  P-splines                 0.100    0.099       0.104    8.79     8.60        9.28

The EQ VAS
Restricted cubic splines
  AIC                       18.70    18.50       19.18    19.1     18.7        20.1
  BIC                       18.71    18.51       19.19    19.1     18.8        20.2
B-splines
  AIC                       18.70    18.71       19.18    19.2     19.2        20.2
  BIC                       18.71    18.71       19.17    19.2     19.1        20.2
Polynomials
  AIC                       18.74    18.80       19.67    19.1     19.0        20.4
  BIC                       18.74    18.75       19.64    19.1     19.0        20.4
MFP                         18.72    18.52       19.19    19.2     18.8        20.1
Generalized additive models
  Thin plate                18.69    18.50       19.17    19.1     18.7        20.1
  Cubic regression          18.69    18.50       19.17    19.1     18.7        20.1
  P-splines                 18.69    18.50       19.17    19.1     18.7        20.1

cv_results <- list(eq5d = NULL, eq_vas = NULL)
for (group in names(fits)){
  for (method in names(fits[[group]])){
    cv_results[[group]] <-
      rbind(cv_results[[group]],
            crossValidateInParallel(fits[[group]][[method]],
                                    df = train,
                                    outcome_var = sprintf("%s1", group)))
  }
  rownames(cv_results[[group]]) <- names(fits[[group]])
}

out <- rbind(t(apply(cv_results$eq5d, 1,
                     function(x) c(sprintf("%.3f", x[1:3]), sprintf("%.2f", x[4:6])))),
             t(apply(cv_results$eq_vas, 1,
                     function(x) c(sprintf("%.2f", x[1:3]), sprintf("%.1f", x[4:6])))))
cgroup <- unique(gsub("^.+_", "", colnames(cv_results$eq5d)))
n.cgroup <- as.numeric(table(gsub("^.+_", "", colnames(cv_results$eq5d))))
colnames(out) <- gsub("_.+$", "", colnames(cv_results$eq5d))
rownames(out) <- capitalize(gsub("(RCS|BS|GAM|Polynomial) (with |)", "",
                                 c(names(fits$eq5d), names(fits$eq_vas))))
htmlTable(out, rgroupCSSstyle = "", rowlabel = "Method",
          cgroup = cgroup, n.cgroup = n.cgroup,
          rgroup = rep(c("Restricted cubic splines", "B-splines",
                         "Polynomials", "", "Generalized additive models"), 2),
          n.rgroup = rep(c(2, 2, 2, 1, 3), 2),
          tspanner = c("The EQ-5D index", "The EQ VAS"),
          n.tspanner = sapply(cv_results, nrow),
          ctable = TRUE)

test$central <- with(test, Age >= quantile(Age, probs=.15) &
                           Age <= quantile(Age, probs=.85))

testset_results <- list(eq5d = NULL, eq_vas = NULL)
for (group in names(fits)){
  outcome_var <- sprintf("%s1", group)
  for (method in names(fits[[group]])){
    testset_results[[group]] <-
      rbind(testset_results[[group]],
            c(Main_RMSE = rmse(fits[[group]][[method]], newdata=test,
                               outcome_var=outcome_var),
              Central_RMSE = rmse(fits[[group]][[method]],
                                  newdata=subset(test, central == TRUE),
                                  outcome_var=outcome_var),
              Peripheral_RMSE = rmse(fits[[group]][[method]],
                                     newdata=subset(test, central == FALSE),
                                     outcome_var=outcome_var),
              Main_MAPE = mape(fits[[group]][[method]], newdata=test,
                               outcome_var=outcome_var),
              Central_MAPE = mape(fits[[group]][[method]],
                                  newdata=subset(test, central == TRUE),
                                  outcome_var=outcome_var),
              Peripheral_MAPE = mape(fits[[group]][[method]],
                                     newdata=subset(test, central == FALSE),
                                     outcome_var=outcome_var)))
  }
  rownames(testset_results[[group]]) <- names(fits[[group]])
}

out <- rbind(t(apply(testset_results$eq5d, 1,
                     function(x) c(sprintf("%.3f", x[1:3]), sprintf("%.2f", x[4:6])))),
             t(apply(testset_results$eq_vas, 1,
                     function(x) c(sprintf("%.2f", x[1:3]), sprintf("%.1f", x[4:6])))))
cgroup <- unique(gsub("^.+_", "", colnames(testset_results$eq5d)))
n.cgroup <- as.numeric(table(gsub("^.+_", "", colnames(testset_results$eq5d))))
colnames(out) <- gsub("_.+$", "", colnames(testset_results$eq5d))
rownames(out) <- capitalize(gsub("(RCS|BS|GAM|Polynomial) (with |)", "",
                                 c(names(fits$eq5d), names(fits$eq_vas))))
htmlTable(out, rgroupCSSstyle = "", rowlabel = "Method",
          cgroup = cgroup, n.cgroup = n.cgroup,
          rgroup = rep(c("Restricted cubic splines", "B-splines",
                         "Polynomials", "", "Generalized additive models"), 2),
          n.rgroup = rep(c(2, 2, 2, 1, 3), 2),
          tspanner = c("The EQ-5D index", "The EQ VAS"),
          n.tspanner = sapply(testset_results, nrow),
          ctable = TRUE)

Graphs

Now let's look at how the estimated relations actually look. In order to quickly style all graphs I use a common setup:

# The ggplot look and feel
my_theme <- theme_bw() +
  theme(legend.position = "bottom",
        legend.text = element_text(size=7),
        axis.title.x = element_text(size=9),
        axis.title.y = element_text(size=9),
        axis.text.x = element_text(size=8),
        axis.text.y = element_text(size=8))

getPredDf <- function(ds, eq_type, eq_values, Age){
  new_data <- data.frame(Age = Age)
  new_data$Sex = "Male"
  new_data[[sprintf("%s0", eq_type)]] = eq_values
  new_data$Charnley_class = "A"
  new_data$pain0 = median(train$pain0, na.rm=TRUE)
  new_data$OpNr = 1

  ret <- list("Best" = new_data)
  new_data$Sex = "Female"
  new_data$Charnley_class = "C"
  new_data$OpNr = 2
  ret[["Worst"]] <- new_data

  return(ret)
}

getAgePredDf <- function(ds, eq_type){
  mode_for_preop_eq <- as.numeric(names(which.max(table(ds[[sprintf("%s0", eq_type)]]))))
  Age <- seq(from=40, to=90, by=.1)
  return(getPredDf(ds, eq_type,
                   eq_values = rep(mode_for_preop_eq, length.out=length(Age)),
                   Age=Age))
}

getRmsPred <- function(nd, model, eq_type){
  new_data <- nd[["Best"]]
  best_age_spl <- predict(model, newdata=new_data, conf.type="mean", conf.int=.95)
  new_data <- nd[["Worst"]]
  worst_age_spl <- predict(model, newdata=new_data, conf.type="mean", conf.int=.95)

  getDf <- function(pred, newdata, eq_type){
    df <- data.frame(Age = newdata$Age,
                     Lower = pred$lower,
                     Upper = pred$upper)
    df[[sprintf("%s1", eq_type)]] = pred$linear.predictors
    df[[sprintf("%s0", eq_type)]] = newdata[[sprintf("%s0", eq_type)]]
    df$Pain = newdata$pain0
    df$OpNr = newdata$OpNr
    return(df)
  }

  df <- getDf(best_age_spl, new_data, eq_type)
  tmp <- getDf(worst_age_spl, new_data, eq_type)
  df$Cat = 1
  tmp$Cat = 2
  df <- rbind(df, tmp)
  rm(tmp)
  df$Cat <- factor(as.numeric(df$Cat),
                   labels=c(" Sex: Male\n Charnley class: A\n First THR ",
                            " Sex: Female\n Charnley class: C\n Previous contralateral THR "))
  return(df)
}

getGlmPred <- function(nd, model, eq_type){
  new_data <- nd[["Best"]]
  best_age_spl <- predict(model, newdata=new_data, se.fit = TRUE)
  new_data <- nd[["Worst"]]
  worst_age_spl <- predict(model, newdata=new_data, se.fit = TRUE)

  getDf <- function(pred, newdata, eq_type){
    df <- data.frame(Age = newdata$Age,
                     Lower = pred$fit - qnorm(0.975)*pred$se.fit,
                     Upper = pred$fit + qnorm(0.975)*pred$se.fit)
    df[[sprintf("%s1", eq_type)]] = pred$fit
    df[[sprintf("%s0", eq_type)]] = newdata[[sprintf("%s0", eq_type)]]
    df$Pain = newdata$pain0
    df$OpNr = newdata$OpNr
    return(df)
  }

  df <- getDf(best_age_spl, new_data, eq_type)
  tmp <- getDf(worst_age_spl, new_data, eq_type)
  df$Cat = 1
  tmp$Cat = 2
  df <- rbind(df, tmp)
  rm(tmp)
  df$Cat <- factor(as.numeric(df$Cat),
                   labels=c(" Sex: Male\n Charnley class: A\n First THR ",
                            " Sex: Female\n Charnley class: C\n Previous contralateral THR "))
  return(df)
}

g_legend <- function(a.gplot){
  tmp <- ggplot_gtable(ggplot_build(a.gplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)
}

plot2Lines <- function(eq5d_df, vas_df){
  clrs <- brewer.pal(3, "Pastel1")
  clrs <- GISTools::add.alpha(clrs, .5)
  clrs <- clrs[c(2,1,3)]

  gg_eq5d <- ggplot(data=eq5d_df, aes(x=Age, y=eq5d1, fill=Cat, linetype=Cat)) +
    geom_hline(yintercept=eq5d_df$eq5d0[1], lwd=1, lty=2, colour="#00640055") +
    geom_ribbon(aes(ymin=Lower, ymax=Upper)) +
    geom_line() +
    ylab("EQ-5D index one year after surgery") +
    xlab("Age (years)") +
    scale_x_continuous(expand=c(0,0)) +
    scale_linetype_manual(name = "", values=c(1,2)) +
    scale_fill_manual(name = "", values = clrs[1:2], guide="none") +
    my_theme

  gg_eqvas <- ggplot(data=vas_df, aes(x=Age, y=eq_vas1, fill=Cat, linetype=Cat)) +
    geom_hline(yintercept=vas_df$eq_vas0[1], lwd=1, lty=2, colour="#00640055") +
    geom_ribbon(aes(ymin=Lower, ymax=Upper)) +
    geom_line() +
    theme_bw() +
    ylab("EQ VAS one year after surgery") +
    xlab("Age (years)") +
    scale_x_continuous(expand=c(0,0)) +
    scale_linetype_manual(name = "", values=c(1,2)) +
    scale_fill_manual(name = "", values = clrs[1:2]) +
    theme(legend.position = "bottom") +
    my_theme

  mylegend <- g_legend(gg_eqvas)

  grid.arrange(arrangeGrob(gg_eq5d + theme(legend.position="none"),
                           gg_eqvas + theme(legend.position="none"),
                           nrow=1),
               mylegend, nrow=2, heights=c(10, 2))
}

It is important to remember that different algorithms will find different optima and may therefore look different to the eye even though they fit the data equally well. I think of each method as a kind of skeleton that defines how the curve can move; it will adapt as best it can, but only within its boundaries. We can, for instance, bend our elbow but not the forearm (unless you need my surgical skills).

Restricted cubic splines

AIC

BIC

B-splines

AIC

BIC

Polynomials

Note that the y-scale differs for the polynomials

AIC

BIC

MFP - multiple fractional polynomial

Generalized additive models

Thin plate regression splines

Cubic regression splines

P-splines

Summary

I hope you enjoyed the post and found something useful. If there is some model that you lack, write up some code and I can see if it runs. Note that there are differences from the original supplement that used a slightly more complex setup for choosing the number of knots and a slightly different dataset.


Categories: Methodology Blogs

Mapping products in a space

Sun, 2014-09-07 03:59

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

I have read about people doing Bayesian PCA at various points and always wondered how that would work. Then, at some point, I thought of a way to do it. As ideas evolved, my interest shifted from PCA as such to a prefmap. As a first step towards that, this post contains the mapping from a sensory space to a two-dimensional space. For prefmap, this step is commonly done via a PCA.
Data

Data are the cocktail data from the SensoMineR package.
Algorithm

The calculation is mostly inspired by the many PLS algorithms to which I was exposed when I was doing chemometrics. Scores and loadings may be obtained from each other by multiplying with the data matrix. In this case it means I just take a set of product scores and obtain the associated descriptors via a simple matrix multiplication. The resulting product and descriptor vectors can be used to reconstruct the original matrix; the best solution minimizes the difference between the reconstructed and original data. For dimension two, subtract the reconstructed data from the original data and repeat on the residuals.
Scaling

PCA has the possibility of having unit-length scores or loadings, or R and Q mode if that is your favorite jargon. From a more singular-value-decomposition point of view, it is just a question of where the eigenvalues go. At this point I made the choice to put them in the variable space.

Unique solution

PCA is known not to have one unique solution; each solution is equivalent to its mirror image. It seemed most elegant to handle this completely at the end. After inspection of the data, the location of product 12 seemed suitable for making the solution unique, since it was extreme on both dimensions. The final step (generated quantities) forces its location into the top-right quadrant for the data reported.
Code

library(rstan)
nprod <- 16
ndescr <- 13
sprofile <- as.matrix(scale(senso.cocktail,scale=FALSE))
datain <- list(
    nproduct=nprod,
    ndescriptor=ndescr,
    profile=sprofile
    )
   
model1 <- "
data {
        int<lower=0> ndescriptor;
        int<lower=0> nproduct;
        matrix[nproduct,ndescriptor] profile;
}
parameters {
    row_vector[nproduct] prodsp1;
    row_vector[nproduct] prodsp2;
    real<lower=0,upper=1> sigma1;
    real<lower=0,upper=1> sigma2;
}
transformed parameters {
   vector [ndescriptor] descrsp1;
   vector [ndescriptor] descrsp2;
   matrix[nproduct,ndescriptor] expected1;  
   matrix[nproduct,ndescriptor] expected2;  
   matrix[nproduct,ndescriptor] residual1;  

   descrsp1 <- profile'*prodsp1';
   expected1 <- (descrsp1*prodsp1)';
   residual1 <- profile-expected1;
   descrsp2 <- profile'*prodsp2';
   expected2 <- (descrsp2*prodsp2)';
}
model {  
     for (r in 1:nproduct) {
        prodsp1[r] ~ normal(0,1);
        prodsp2[r] ~ normal(0,1);
        for (c in 1:ndescriptor) {
           profile[r,c] ~ normal(expected1[r,c],sigma1);
           residual1[r,c] ~ normal(expected2[r,c],sigma2);
        }
     }
}
generated quantities {
   vector [ndescriptor] descrspace1;
   vector [ndescriptor] descrspace2;
   row_vector [nproduct] prodspace1;
   row_vector [nproduct] prodspace2;
   prodspace1 <-(
                     ((prodsp1[12]>0)*prodsp1)-
                     ((prodsp1[12]<0)*prodsp1)
                  );
   prodspace2 <-(
                     ((prodsp2[12]>0)*prodsp2)-
                     ((prodsp2[12]<0)*prodsp2)
                  ); 
   descrspace1 <-(
                     ((prodsp1[12]>0)*descrsp1)-
                     ((prodsp1[12]<0)*descrsp1)
                  );
   descrspace2 <-(
                     ((prodsp2[12]>0)*descrsp2)-
                     ((prodsp2[12]<0)*descrsp2)
                  ); 
}
"
pars <- c('prodspace1','prodspace2','descrspace1','descrspace2')
fit1 <- stan(model_code = model1,
    data = datain,
    pars=pars)
Results

For comparison, first a standard biplot.
Product space

It is not difficult to extract the samples and plot them; see the end of the post. One notable property of the plot is that the products sit in ellipses with the minor axis pointing towards the center. Apparently part of the variation between MCMC samples is rotational freedom between the dimensions. Other than that, the solution is actually pretty close to the PCA.

Descriptor space

The rotational freedom is even clearer here.

Additional code

data

library(SensoMineR)
data(cocktail)
biplot

pr <- prcomp(senso.cocktail)
plot(pr)
biplot(pr)
product plot

fit1samps <- as.data.frame(fit1)

prod <- reshape(fit1samps,
    drop=names(fit1samps)[33:59],
    direction='long',
    varying=list(names(fit1samps)[1:16],
        names(fit1samps)[17:32]),
    timevar='sample',
    times=1:16,
    v.names=c('PDim1','PDim2')
)
   
prod <- prod[order(prod$PDim1),]
plot(prod$PDim1,prod$PDim2,
    col=c(2,17,3,4,6,5,7:10,13,12,11,14:16)[prod$sample],
    pch=46,
    cex=2,
    xlim=c(-1,1)*.75,
    ylim=c(-1,1)*.75)
sa <- sapply(1:16,function(x)
        c(sample=x,
            Dim1=mean(prod$PDim1[prod$sample==x]),
            Dim2=mean(prod$PDim2[prod$sample==x])))
sa <- as.data.frame(t(sa))
text(x=sa$Dim1,y=sa$Dim2,labels=sa$sample,cex=1.5)
descriptor plot

descr <- reshape(fit1samps,
    drop=names(fit1samps)[c(1:32,59)],
    direction='long',
    varying=list(names(fit1samps)[33:45],
        names(fit1samps)[46:58]),
    timevar='sample',
    times=1:13,
    v.names=c('DDim1','DDim2')
)

descr <- descr[order(descr$DDim1),]
plot(descr$DDim1,descr$DDim2,
    col=c(2,1,3:13)[descr$sample],
    pch=46,
    cex=2,
    xlim=c(-1,1)*9,
    ylim=c(-1,1)*9)
sa <- sapply(1:13,function(x)
        c(sample=x,
            Dim1=mean(descr$DDim1[descr$sample==x]),
            Dim2=mean(descr$DDim2[descr$sample==x])))
sa <- as.data.frame(t(sa))
text(x=sa$Dim1,y=sa$Dim2,labels=names(senso.cocktail))


Categories: Methodology Blogs

Slides of 12 tutorials at ACM SIGKDD 2014

Sun, 2014-09-07 02:29

(This article was first published on blog.RDataMining.com, and kindly contributed to R-bloggers)

Slides of 12 tutorials taught by data science experts and thought leaders at ACM SIGKDD 2014 are provided at http://www.kdd.org/kdd2014/tutorials.html. Below is a list of them.

1. Scaling Up Deep Learning
Yoshua Bengio

2. Constructing and mining web-scale knowledge graphs
Antoine Bordes, Evgeniy Gabrilovich

3. Bringing Structure to Text: Mining Phrases, Entity Concepts, Topics, and Hierarchies
Jiawei Han, Chi Wang, Ahmed El-Kishky

4. Computational Epidemiology
Madhav Marathe, Naren Ramakrishnan, Anil Kumar S. Vullikanti

5. Management and Analytic of Biomedical Big Data with Cloud-based In-Memory Database and Dynamic Querying: A Hands-on Experience with Real-world Data
Roger Mark, John Ellenberger, Mengling Feng, Mohammad Ghassemi, Thomas Brennan, Ishrar Hussain

6. The Recommender Problem Revisited
Xavier Amatriain, Bamshad Mobasher

7. Correlation clustering: from theory to practice
Francesco Bonchi, David Garcia-Soriano, Edo Liberty

8. Deep Learning
Ruslan Salakhutdinov

9. Network Mining and Analysis for Social Applications
Feida Zhu, Huan Sun, Xifeng Yan

10. Sampling for Big Data
Graham Cormode, Nick Duffield

11. Statistically Sound Pattern Discovery
Geoff Webb, Wilhelmiina Hamalainen

12. Recommendation in Social Media
Jiliang Tang, Jie Tang, Huan Liu



Categories: Methodology Blogs

Good for TI, Good for Schools, Bad for Kids, Bad for Stat

Sat, 2014-09-06 02:33

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

In my last post, I agreed with Prof. Xiao-Li Meng that Advanced Placement (AP) Statistics courses turn off many students to the statistics field, by being structured in a manner that makes for a boring class.  I cited as one of the problems the fact that the course officially requires TI calculators.  This is a sad waste of resources, as the machines are expensive while R is free, and R is capable of doing things that are much more engaging for kids.

Interestingly, this week the Washington Post ran an article on the monopoly that TI calculators have in the schools.  This was picked up by a Slashdot poster, who connected it to my blog post on AP Stat.  The Post article has some interesting implications.

As the article notes, it’s not just an issue of calculators vs. R.  It’s an issue of calculators in general vs. the TI calculator.  Whether by shrewd business strategy or just luck, TI has attained a structural monopoly.  The textbooks and standardized exams make use of TI calculators, which forces all the teachers to use that particular brand.

Further reinforcing that monopoly are the kickbacks, er, donations to the schools.  When my daughter was in junior high school and was told by the school to buy a TI calculator, I noticed at the store that Casio calculators were both cheaper and had more capabilities.  I asked the teacher about this, and she explained that TI makes donations to the schools.

All this shows why Ms. Chow, the Casio rep quoted in the article, is facing an uphill battle in trying to get schools to use her brand. But there is also something very troubling about Chow’s comment, “That is one thing we do struggle with, teachers worried about how long it is going to take them to learn [Casio products].”  Math teachers would have trouble learning to use a calculator?  MATH teachers?!  I am usually NOT one to bash the U.S. school system, but if many math teachers are this technically challenged, one must question whether they should be teaching math in the first place.  This also goes to the point in my last blog post that kids generally are not getting college-level instruction in the nominally college-level AP Stat courses.

Chow’s comment also relates to my speculation that, if there were a serious proposal to switch from TI to R, the biggest source of resistance would be the AP Stat teachers themselves.  Yet I contend that even they would find that it is easy to learn R to the level needed, meaning being able to do what they currently do on TIs—and to go further, such as analyzing large data sets that engage kids, producing nice color graphics.  This is not hard at all; the teachers don’t need to become programmers.

The Post article also brings up the issue of logistics.  How would teachers give in-class tests in an R-based AP Stat curriculum?  How would the national AP Stat exam handle this?

Those who dismiss using R for AP Stat on such logistical grounds may be shocked to know that the AP Computer Science exam is not conducted with a live programmable computer at hand either. It’s all on paper, with the form of the questions being designed so that a computer is not needed.  (See the sample test here.)  My point is that, if even a test that is specifically about programming can be given without a live computer present, certainly the AP Stat course doesn’t need one either.  For that matter, most questions on the AP Stat exam  concentrate on concepts, not computation, anyway, which is the way it should be.

The teachers should demand a stop to this calculator scam, and demand that the textbooks, AP Stat exam etc. be based on R (or some other free software) rather than on expensive calculators. The kids would benefit, and so would the field of statistics.



Categories: Methodology Blogs

1.2 Millions Deaths by Ebola projected within Six Months?

Fri, 2014-09-05 20:32

(This article was first published on Econometrics by Simulation, and kindly contributed to R-bloggers)

The World Health Organization, Samaritan's Purse, Doctors Without Borders, and other international medical emergency relief programs are desperately calling for additional resources in the international fight against Ebola, which has already killed thousands and is likely to kill thousands more even if a full arsenal of aid were made available.
Figure 1: The lines are projected values while the points are data points.
The World Health Organization released a statement on August 28th that the epidemic could afflict more than 20,000 people before it could be brought under control. This, however, assumes full international backing for an intervention to control the deadly outbreak. Failure to fully support the WHO's plan presumably would cause the disease to continue to spread in a similar manner as it already has.

At first, a figure as high as 20,000 seems exaggerated, especially when looking at the roughly 3,000 cases reported the same day as the announcement. However, I believe that this estimate is far too low, and it is entirely contingent on an effective and well-funded international relief mission. Using a projection from all of the WHO reports to date, I calculate that if the disease continues to spread at its current rate, we will have more than 20,000 cases by October 24.

The report also states that it will likely take six to nine months to stop the epidemic. However, if nothing changes and the epidemic continues to rage as it currently does, my projections estimate that as many as 4.7 million people will have been infected and 1.2 million will have already died.

These are extremely dire predictions and I hope that they are purely extrapolations based on what will later be seen as the scariest month of this epidemic. However, the exponential growth model fits the data very well and at least in the short term should be expected to be fairly accurate.

All of this analysis is done in R and the code can be found on Github.

From 41 CDC Ebola reports I have assembled a small database of cases by country listing the number of 'Suspected or Confirmed Cases', the number of 'Deaths' suspected to be associated with Ebola, and the number of 'Laboratory Confirmed' cases of Ebola. You can find and add to this database as a google spreadsheet here. If running the code for yourself it will import the spreadsheet data directly.

Mapping these data by country and fitting a local polynomial regression to give a fitted line for each country reveals a very disturbing trend. Guinea, the country in which the current outbreak originated, continues to see new victims but is much less worrisome than Sierra Leone and Liberia, where the numbers of suspected cases and deaths are growing exponentially.

Figure 2: The increase of deaths in Liberia is much steeper than the other two heavily afflicted countries of Guinea and Sierra Leone.
Figure 3: The increase in laboratory-confirmed cases in Liberia is much less steep than the increase in deaths, indicating that the poor medical infrastructure is not able to keep up with the number of diagnoses demanded.
Figure 4: The increase in deaths in Liberia is much steeper than in the other two heavily afflicted countries, Guinea and Sierra Leone.
By exponential growth, we mean that whatever the current number of infected people is, we can expect them to infect some additional number of people proportional to the transmission rate. The problem with exponential growth is that, while the inclusion of new victims can initially start out small, the more victims there are, the more are likely to be added to the victim pool each day.
Figure 5: The total number of cases is rising extremely quickly.
When we look at the total numbers of each case type summed across countries, we arrive at the above graph.
From this graph it is clear that a direct linear model cannot fit well at all. Suspecting that the change over time might fit an exponential growth function, I take the natural log of the values mapped above.
Figure 6: A log transformation of the total number of cases creates a relatively linear relationship between time and number of cases reported.
This new transformed graph provides an extremely disturbing confirmation that an exponential growth model is an appropriate way of modelling the spread of Ebola. In order to estimate the spread of Ebola I define a simple model with a constant and a linear relationship between days since the outbreak was announced and the log of the measure we are concerned with:
$$\log(Y)=\alpha+\beta_1 \mathrm{Day}$$
I estimate the model using weights based on the number of days into the outbreak, so that more recent observations are considered more important. I also discard the observations from the first 21 days because we can expect the preliminary data at that time to have been less accurate. Using the above model gives:


            Intercept        Day
Suspected  4.38881946 0.02245505
Deaths     4.00491144 0.02096758
Laboratory 3.86052949 0.02314866

While the intercept estimates are generally less important, the coefficients on Day can be directly interpreted as approximate percent changes per day. Thus we can expect from the current data that each day we will have a little over 2% additional suspected cases, deaths, and laboratory confirmations.
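To make the estimation step concrete, a fit along those lines could look roughly like the following (a minimal sketch: the data frame totals, its column names, and the exact weighting scheme are assumptions; the author's actual code is on Github):

# Assumed data frame 'totals' with one row per report:
# day (days since the outbreak was announced), Susp, Deaths, Lab.
fit_deaths <- lm(log(Deaths) ~ day,
                 data    = subset(totals, day > 21),  # discard the first 21 days
                 weights = day)                       # weight recent reports more heavily
coef(fit_deaths)  # the day coefficient is roughly the daily percent growth (~2%)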

In order to allow the model to be a little more flexible in my projections, I include a slightly more complex model with a squared term for the days since the announcement.
$$\log(Y)=\alpha+\beta_1 \mathrm{Day}+\beta_2 \mathrm{Day}^2$$
I use this model to project suspected cases, deaths, and laboratory results for the next three weeks. The values up until today show the comparison between the expected values estimated from the model (EDeaths, ESusp, and ELab) and those from the data (Deaths, Susp, and Lab). We can see the model fits the data quite well, with all estimates within 100 of the observed values and most much closer. Using this model we can see that the total number of deaths is expected to be around 3,500 by September 24th, with around 7,200 suspected cases. Things just get worse from there.
 

date day Deaths EDeaths Susp ESusp Lab ELab
1 2014-03-25 1 59 89 86 140 0 49
2 2014-03-26 2 60 90 86 141 1 50
3 2014-03-27 3 66 90 103 143 4 51
7 2014-03-31 7 70 94 112 149 24 56
8 2014-04-01 8 80 95 122 151 24 57
9 2014-04-02 9 83 97 127 152 35 59
14 2014-04-07 14 95 102 151 161 52 66
17 2014-04-10 17 101 106 157 167 66 71
24 2014-04-17 24 122 115 197 182 101 83
28 2014-04-21 28 129 122 203 192 109 91
30 2014-04-23 30 136 125 208 197 112 95
37 2014-04-30 37 146 137 221 218 126 112
42 2014-05-05 42 155 148 231 235 127 126
51 2014-05-14 51 157 169 233 270 129 155
60 2014-05-23 60 174 196 258 315 146 190
65 2014-05-28 65 191 213 290 344 170 214
70 2014-06-02 70 199 232 341 377 186 240
73 2014-06-05 73 222 245 425 399 238 257
78 2014-06-10 78 244 269 462 440 253 289
79 2014-06-11 79 261 274 494 449 277 296
86 2014-06-18 86 337 313 528 517 364 348
92 2014-06-24 92 338 353 599 587 441 399
100 2014-07-02 100 467 416 759 700 544 481
105 2014-07-07 105 481 462 779 784 557 540
106 2014-07-08 106 518 472 844 803 626 552
112 2014-07-14 112 539 539 888 925 664 634
114 2014-07-16 114 601 564 964 971 706 665
122 2014-07-24 122 632 677 1048 1183 745 800
126 2014-07-28 126 672 744 1201 1311 814 877
129 2014-07-31 129 728 800 1323 1417 909 941
132 2014-08-03 132 826 860 1439 1533 953 1008
133 2014-08-04 133 887 882 1603 1574 1009 1032
137 2014-08-08 137 961 974 1779 1753 1134 1132
141 2014-08-12 141 1013 1077 1848 1956 1176 1242
142 2014-08-13 142 1069 1105 1975 2011 1251 1271
144 2014-08-15 144 1145 1163 2127 2127 1310 1332
148 2014-08-19 148 1229 1290 2240 2381 1383 1461
150 2014-08-21 150 1350 1360 2473 2522 1460 1530
151 2014-08-22 151 1427 1397 2561 2596 1528 1566
157 2014-08-28 157 1552 1641 3069 3094 1752 1800
166 2014-09-06 166 2106 4062 2218 Today
167 2014-09-07 167 2166 4189 2270
168 2014-09-08 168 2228 4321 2323
169 2014-09-09 169 2292 4457 2378
170 2014-09-10 170 2359 4599 2433
171 2014-09-11 171 2427 4745 2491
172 2014-09-12 172 2498 4897 2549
173 2014-09-13 173 2572 5055 2609
174 2014-09-14 174 2647 5218 2670
175 2014-09-15 175 2725 5386 2733
176 2014-09-16 176 2806 5562 2797
177 2014-09-17 177 2890 5743 2863
178 2014-09-18 178 2976 5931 2930
179 2014-09-19 179 3065 6126 2998
180 2014-09-20 180 3157 6329 3069
181 2014-09-21 181 3253 6539 3141
182 2014-09-22 182 3351 6756 3215
183 2014-09-23 183 3453 6982 3290
184 2014-09-24 184 3559 7217 3367

Falseness of my Model
By definition, this model cannot be true globally (into the distant future). This is obvious when we use the model to project out to one year: at one year the number of infected cases is estimated at 436 billion. Since the entire population of the Earth is only around 7 billion, we know that this cannot be true.

However, this kind of model can be a good approximation locally (in the near future). If it is a good approximation locally then the next WHO report is going to list around 2100 deaths and 4060 suspected cases as of today.

So, I ask the question: is 1.2 million deaths a local or a global projection? I cannot answer that, but it is certainly within the realm of feasibility, since Liberia alone has over 4 million people, Guinea 10 million, and Sierra Leone 6 million. The real question becomes: do we think the ability of Liberia and other afflicted nations to control the spread of Ebola will increase, decrease, or remain the same over time?

From Figure 3 we can see that Liberia is significantly behind other nations in its ability to diagnose Ebola. This, and the well-known lack of medical facilities, suggests to me that as the crisis escalates, Liberia's ability to maintain any sense of order, and with it any hope of controlling the spread of the disease, is likely to degrade. If this is the case, then it is quite possible that even this horrifying projection is an underestimate of the pain and carnage likely to result from this outbreak.


What to Do

News reports and the governments they are reporting on seem to have been placing a good deal of emphasis on investing in vaccines and treatment options. However, while all of these options are good, they are long term options (6 to 12 months).  In the meantime, every resource available must be used to contain and restrict the spread of this outbreak.

It is extremely foolish to think that any nation is immune to this disease. So far, in the entire history of Ebola outbreaks up to the present, fewer than 10 thousand people have been infected. This relatively low infection count, coupled with rapid mortality, makes it unlikely that the disease will significantly mutate among the human population.

However, if my projections are anywhere close to accurate, then the number of infected people is going to be much higher than has ever occurred previously. This will create many more habitats in which the virus can possibly mutate new traits that could increase its transmission rate. These mutations could take the form of longer incubation periods, which might lead to a greater time between being infectious and being detectable.

Another possible trait might be the ability to go airborne, which would significantly increase its ability to be transmitted. Some scientists consider it very unlikely to become airborne because it is too heavy. This may be the case. However, since the possibility of it becoming airborne could result in a global spread of the disease and an unprecedented number of deaths worldwide, it is more than prudent to invest heavily in controlling the number of new patients infected by this disease.


In addition, even if the disease does not mutate from its current state to a new one, it has shown itself to be extremely effective at being transmitted, with a large number of health workers becoming infected and dying from the disease. These health workers should have known how to control the spread of the disease and prevent infection. Do we really expect that, if the disease were to enter any other nation on Earth, the general population is going to be better prepared to protect themselves than the specialists who have already fallen victim to this disease?

Thus, it is imperative that we do everything within our power to control the spread of this terrible disease. Even if my model only has a ten percent chance of being accurate over the next six months, we would be extremely foolish to risk not responding to this outbreak with every resource within reason humanity can muster.

To leave a comment for the author, please follow the link and comment on his blog: Econometrics by Simulation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

jsonlite gets a triple mushroom boost!

Fri, 2014-09-05 20:00

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

The jsonlite package is a JSON parser/generator optimized for the web. It implements a bidirectional mapping between JSON data and the most important R data types, which allows for converting objects to JSON and back without manual data restructuring. This is ideal for interacting with web APIs, or to build pipelines where data seamlessly flow in and out of R through JSON. The quickstart vignette gives a brief introduction, or just try:

fromJSON(toJSON(mtcars))

Or use some data from the web:

# Latest commits in r-base
r_source <- fromJSON("https://api.github.com/repos/wch/r-source/commits")

# Pretty print:
committer <- format(r_source$commit$author$name)
date <- as.Date(r_source$commit$committer$date)
message <- sub("\n\n.*", "", r_source$commit$message)
paste(date, committer, message)

New in 0.9.11: performance!

Version 0.9.11 has a few minor bugfixes, but most of the work of this release has gone into improving performance. The implementation of toJSON has been optimized in many ways, and with a little help from Winston Chang, the most CPU intensive bottleneck has been ported to C code. The result is quite impressive: encoding dataframes to row-based JSON format is about 3x faster, and encoding dataframes to column-based JSON format is nearly 10x faster in comparison with the previous release.

The diamonds dataset from the ggplot2 package has about 0.5 million values which makes a nice benchmark. On my macbook it takes jsonlite on average 1.18s to encode it to row-based JSON, and 0.34s for column-based json:

library(jsonlite)
library(microbenchmark)
data("diamonds", package = "ggplot2")

microbenchmark(json_rows <- toJSON(diamonds), times = 10)
# Unit: seconds
#              expr     min       lq   median       uq     max neval
#  toJSON(diamonds) 1.12773 1.140724 1.175872 1.180354 1.21786    10

microbenchmark(json_columns <- toJSON(diamonds, dataframe = "col"), times = 10)
# Unit: milliseconds
#                                 expr      min      lq   median       uq      max neval
#  toJSON(diamonds, dataframe = "col") 333.9494 334.799 338.0843 340.0929 350.3026    10

Parsing and simplification performance

The performance of fromJSON has been improved as well. The parser itself was already a high-performance C++ library borrowed from RJSONIO, and it has not changed. However, the simplification code used to reduce deeply nested lists into nice vectors and data frames has been tweaked in many places and is on average 3 to 5 times faster than before (depending on what the JSON data look like). For the diamonds example, the row-based data gets parsed in about 2.32s and the column-based data in 1.25s.

microbenchmark(fromJSON(json_rows), times = 10)
# Unit: seconds
#                 expr      min       lq   median       uq      max neval
#  fromJSON(json_rows) 2.178211 2.278337 2.319519 2.376085 2.423627    10

microbenchmark(fromJSON(json_columns), times = 10)
# Unit: seconds
#                    expr     min       lq   median       uq      max neval
#  fromJSON(json_columns) 1.17289 1.252284 1.253999 1.265763 1.306357    10

For comparison, we can also disable simplification in which case parsing takes respectively 0.70 and 0.39 seconds for these data. However without simplification we end up with a big nested list of lists which is often not very useful.

microbenchmark(fromJSON(json_rows, simplifyVector = F), times = 10)
# Unit: milliseconds
#                                     expr      min       lq   median       uq      max neval
#  fromJSON(json_rows, simplifyVector = F) 635.5767 648.4693 704.6996 720.0335 727.8869    10

microbenchmark(fromJSON(json_columns, simplifyVector = F), times = 10)
# Unit: milliseconds
#                                        expr      min       lq   median       uq      max neval
#  fromJSON(json_columns, simplifyVector = F) 385.3224 388.4772 395.1916 409.3432 463.9695    10

To leave a comment for the author, please follow the link and comment on his blog: OpenCPU. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Packrat on CRAN

Fri, 2014-09-05 16:24

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Packrat is now available on CRAN, with version 0.4.1-1! Packrat is an R package that helps you manage your project’s R package dependencies in an isolated, reproducible and portable way.

Install packrat from CRAN with:

install.packages("packrat")

In particular, this release provides better support for local repositories. Local repositories are just folders containing package sources (currently as folders themselves).

One can now specify local repositories on a per-project basis by using:

packrat::set_opts(local.repos = <pathToRepo>)

and then using

packrat::install_local(<pkgName>)

to install that package from the local repository.

There is also experimental support for a global ‘cache’ of packages, which can be used across projects. If you wish to enable this feature, you can use (note that it is disabled by default):

packrat::set_opts(use.cache = TRUE)

in each project where you would utilize the cache.

By doing so, if one project installs or uses, e.g., Shiny 0.10.1 from CRAN, and another project uses that same package version, packrat will look up the installed version of that package in the cache. This should greatly speed up project initialization for new packrat projects whose large dependency chains overlap with those of other packrat projects.
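As a rough sketch of how opting in might look in practice (package choice and workflow are illustrative, not from the post):

# Inside an existing packrat project:
packrat::set_opts(use.cache = TRUE)   # opt this project into the global cache
install.packages("shiny")             # the installed copy can be reused via the cache
packrat::snapshot()                   # record the project's dependency versions

# A second packrat project that also sets use.cache = TRUE and depends on the
# same version of shiny should initialize noticeably faster.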

In addition, this release provides a number of usability improvements and bug fixes for Windows.

Please visit rstudio.github.io/packrat for more information and a guide to getting started with Packrat.


To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

EM Algorithm for Bayesian Lasso R Cpp Code

Fri, 2014-09-05 12:56

(This article was first published on Lindons Log » R, and kindly contributed to R-bloggers)

Bayesian Lasso

$$\begin{align*}
p(Y_{o}|\beta,\phi)&=N(Y_{o}|1\alpha+X_{o}\beta,\phi^{-1} I_{n_{o}})\\
\pi(\beta_{i}|\phi,\tau_{i}^{2})&=N(\beta_{i}|0, \phi^{-1}\tau_{i}^{2})\\
\pi(\tau_{i}^{2})&=\mathrm{Exp}\left( \frac{\lambda}{2} \right)\\
\pi(\phi)&\propto \phi^{-1}\\
\pi(\alpha)&\propto 1\\
\end{align*}$$

Marginalizing over \(\alpha\) equates to centering the observations, losing a degree of freedom, and working with the centered \(Y_{o}\).
Mixing over \(\tau_{i}^{2}\) leads to a Laplace or double-exponential prior on \(\beta_{i}\) with rate parameter \(\sqrt{\phi\lambda}\), as seen by considering the characteristic function
$$\begin{align*}
\varphi_{\beta_{i}|\phi}(t)&=\int e^{jt\beta_{i}}\pi(\beta_{i}|\phi)d\beta_{i}\\
&=\int\int e^{jt\beta_{i}}\pi(\beta_{i}|\phi,\tau_{i}^{2})\pi(\tau_{i}^{2})d\tau_{i}\,d\beta_{i}\\
&=\frac{\lambda}{2}\int e^{-\frac{1}{2}\frac{t^{2}}{\phi}\tau_{i}^{2}}e^{-\frac{\lambda}{2}\tau_{i}^{2}}d\tau_{i}\\
&=\frac{\lambda}{\frac{t^{2}}{\phi}+\lambda}=\frac{\lambda\phi}{t^{2}+\lambda\phi}.
\end{align*}$$

EM Algorithm

The objective is to find the mode of the joint posterior \(\pi(\beta,\phi|Y_{o})\). It is easier, however, to find the joint mode of \(\pi(\beta,\phi|Y_{o},\tau^{2})\) and use EM to exploit the scale mixture representation.

$$\begin{align*}
\log \pi(\beta,\phi|Y_{o},\tau^{2})=c+ \frac{n_o+p-3}{2}\log \phi -\frac{\phi}{2}\|Y_{o}-X_{o}\beta\|^{2}-\sum_{i=1}^{p}\frac{\phi}{2}\frac{1}{\tau_{i}^{2}}\beta^{2}_{i}
\end{align*}$$

Expectation

The expectation w.r.t. \(\tau_{i}^{2}\) is handled as follows:
$$\begin{align*}
&\frac{\lambda}{2}\int \frac{1}{\tau_{i}^{2}}\left( \frac{\phi}{2\pi\tau_{i}^{2}} \right)^{\frac{1}{2}}e^{-\frac{1}{2}\phi\beta_{i}^{2}\frac{1}{\tau_{i}^{2}}}e^{-\frac{\lambda}{2}\tau_{i}^{2}}d\tau_{i}^{2}\\
&\frac{\lambda}{2}\int \left( \frac{\phi}{2\pi[\tau_{i}^{2}]^{3}} \right)^{\frac{1}{2}}e^{-\frac{1}{2}\phi\beta_{i}^{2}\frac{1}{\tau_{i}^{2}}}e^{-\frac{\lambda}{2}\tau_{i}^{2}}d\tau_{i}^{2}\\
\end{align*}$$

This has the kernel of an Inverse Gaussian distribution with shape parameter \(\phi \beta_{i}^{2}\) and mean \(\sqrt{\frac{\phi}{\lambda}}|\beta_{i}|\)

$$\begin{align*}
&\frac{\lambda}{2|\beta_{i}|}\int \left( \frac{\beta_{i}^{2}\phi}{2\pi[\tau_{i}^{2}]^{3}} \right)^{\frac{1}{2}}e^{-\frac{1}{2}\phi\beta_{i}^{2}\frac{1}{\tau_{i}^{2}}}e^{-\frac{\lambda}{2}\tau_{i}^{2}}d\tau_{i}^{2}\\
&\frac{\lambda}{2|\beta_{i}|}e^{-\sqrt{\lambda\phi\beta_{i}^{2}}}\int \left( \frac{\beta_{i}^{2}\phi}{2\pi[\tau_{i}^{2}]^{3}} \right)^{\frac{1}{2}}e^{-\frac{1}{2}\phi\beta_{i}^{2}\frac{1}{\tau_{i}^{2}}}e^{-\frac{\lambda}{2}\tau_{i}^{2}}e^{\sqrt{\lambda\phi\beta_{i}^{2}}}d\tau_{i}^{2}\\
&\frac{\lambda}{2|\beta_{i}|}e^{-\sqrt{\lambda\phi\beta_{i}^{2}}}\\
\end{align*}$$

The normalization is obtained as follows:

$$\begin{align*}
&\frac{\lambda}{2}\int \left( \frac{\phi}{2\pi\tau_{i}^{2}} \right)^{\frac{1}{2}}e^{-\frac{1}{2}\phi\beta_{i}^{2}\frac{1}{\tau_{i}^{2}}}e^{-\frac{\lambda}{2}\tau_{i}^{2}}d\tau_{i}^{2}\\
&\frac{\lambda}{2}\int \tau_{i}^{2}\left( \frac{\phi}{2\pi[\tau_{i}^{2}]^{3}} \right)^{\frac{1}{2}}e^{-\frac{1}{2}\phi\beta_{i}^{2}\frac{1}{\tau_{i}^{2}}}e^{-\frac{\lambda}{2}\tau_{i}^{2}}d\tau_{i}^{2}\\
\end{align*}$$
$$\begin{align*}
&\frac{\lambda}{2|\beta_{i}|}e^{-\sqrt{\lambda\phi\beta_{i}^{2}}}\sqrt{\frac{\phi}{\lambda}}|\beta_{i}|\\
\end{align*}$$

\( \Rightarrow \mathbb{E}\left[ \frac{1}{\tau_{i}^{2}} \right]=\sqrt{\frac{\lambda}{\phi^{t}}}\frac{1}{|\beta_{i}^{t}|}\).
Let \(\Lambda^{t}=\mathrm{diag}\left(\sqrt{\frac{\lambda}{\phi^{t}}}\frac{1}{|\beta_{1}^{t}|}, \dots, \sqrt{\frac{\lambda}{\phi^{t}}}\frac{1}{|\beta_{p}^{t}|}\right)\).

Maximization

$$\begin{align*}
&Q(\beta,\phi|\beta^{t},\phi^{t})=c+ \frac{n_o+p-3}{2}\log \phi -\frac{\phi}{2}\|Y_{o}-X_{o}\beta\|^{2} - \frac{\phi}{2}\beta^{T}\Lambda^{t}\beta\\
&=c+ \frac{n_o+p-3}{2}\log \phi -\frac{\phi}{2}\|\beta-(X_{o}^{T}X_{o}+\Lambda^{t})^{-1}X_{o}^{T}Y_{o}\|^{2}_{(X_{o}^{T}X_{o}+\Lambda^{t})}-\frac{\phi}{2}Y_{o}^{T}(I_{n_{o}}-X_{o}(X_{o}^{T}X_{o}+\Lambda^{t})^{-1}X_{o}^{T})Y_{o}\\
\end{align*}$$

$$\begin{align*}
\beta^{t+1}&=(X_{o}^{T}X_{o}+\Lambda^{t})^{-1}X_{o}^{T}Y_{o}\\
\end{align*}$$

$$\begin{align*}
\phi^{t+1}=\frac{n_{o}+p-3}{Y_{o}^{T}(I_{n_{o}}-X_{o}(X_{o}^{T}X_{o}+\Lambda^{t})^{-1}X_{o}^{T})Y_{o}}
\end{align*}$$

RCpp C++ Code

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;

double or_log_posterior_density(int no, int p, double lasso, const Col<double>& yo, const Mat<double>& xo, const Col<double>& B, double phi);

// [[Rcpp::export]]
List or_lasso_em(NumericVector ryo, NumericMatrix rxo, SEXP rlasso){

	//Define Variables//
	int p=rxo.ncol();
	int no=rxo.nrow();
	double lasso=Rcpp::as<double >(rlasso);

	//Create Data//
	arma::mat xo(rxo.begin(), no, p, false);
	arma::colvec yo(ryo.begin(), ryo.size(), false);
	yo-=mean(yo);

	//Pre-Processing//
	Col<double> xoyo=xo.t()*yo;
	Col<double> B=xoyo/no;
	Col<double> Babs=abs(B);
	Mat<double> xoxo=xo.t()*xo;
	Mat<double> D=eye(p,p);
	Mat<double> Ip=eye(p,p);
	double yoyo=dot(yo,yo);
	double deltaB;
	double deltaphi;
	double phi=no/dot(yo-xo*B,yo-xo*B);
	double lp;

	//Create Trace Matrices//
	Mat<double> B_trace(p,20000);
	Col<double> phi_trace(20000);
	Col<double> lpd_trace(20000);

	//Run EM Algorithm//
	cout << "Beginning EM Algorithm" << endl;
	int t=0;
	B_trace.col(t)=B;
	phi_trace(t)=phi;
	lpd_trace(t)=or_log_posterior_density(no,p,lasso,yo,xo,B,phi);
	do{
		t=t+1;
		lp=sqrt(lasso/phi);
		Babs=abs(B);
		D=diagmat(sqrt(Babs));
		B=D*solve(D*xoxo*D+lp*Ip,D*xoyo);
		phi=(no+p-3)/(yoyo-dot(xoyo,B));

		//Store Values//
		B_trace.col(t)=B;
		phi_trace(t)=phi;
		lpd_trace(t)=or_log_posterior_density(no,p,lasso,yo,xo,B,phi);
		deltaB=dot(B_trace.col(t)-B_trace.col(t-1),B_trace.col(t)-B_trace.col(t-1));
		deltaphi=phi_trace(t)-phi_trace(t-1);
	} while((deltaB>0.00001 || deltaphi>0.00001) && t<19999);
	cout << "EM Algorithm Converged in " << t << " Iterations" << endl;

	//Resize Trace Matrices//
	B_trace.resize(p,t);
	phi_trace.resize(t);
	lpd_trace.resize(t);

	return Rcpp::List::create(
		Rcpp::Named("B") = B,
		Rcpp::Named("B_trace") = B_trace,
		Rcpp::Named("phi") = phi,
		Rcpp::Named("phi_trace") = phi_trace,
		Rcpp::Named("lpd_trace") = lpd_trace
		) ;
}

An Example in R

rm(list=ls())

#Generate Design Matrix
set.seed(3)
no=100
foo=rnorm(no,0,1)
sd=4
xo=cbind(foo+rnorm(no,0,sd),foo+rnorm(no,0,sd),foo+rnorm(no,0,sd),foo+rnorm(no,0,sd),foo+rnorm(no,0,sd),foo+rnorm(no,0,sd),foo+rnorm(no,0,sd),foo+rnorm(no,0,sd))
for(i in 1:40) xo=cbind(xo,foo+rnorm(no,0,sd),foo+rnorm(no,0,sd),foo+rnorm(no,0,sd),foo+rnorm(no,0,sd),foo+rnorm(no,0,sd),foo+rnorm(no,0,sd),foo+rnorm(no,0,sd))

#Scale and Center Design Matrix
xo=scale(xo,center=T,scale=F)
var=apply(xo^2,2,sum)
xo=scale(xo,center=F,scale=sqrt(var/no))

#Generate Data under True Model
p=dim(xo)[2]
b=rep(0,p)
b[1]=1
b[2]=2
b[3]=3
b[4]=4
b[5]=5
yo=xo%*%b+rnorm(no,0,1)
yo=yo-mean(yo)

#Run Lasso
or_lasso=or_lasso_em(yo,xo,100)

#Posterior Density Increasing at Every Iteration?
or_lasso$lpd_trace[2:dim(or_lasso$lpd_trace)[1],1]-or_lasso$lpd_trace[1:(dim(or_lasso$lpd_trace)[1]-1),1]>=0
mean(or_lasso$lpd_trace[2:dim(or_lasso$lpd_trace)[1],1]-or_lasso$lpd_trace[1:(dim(or_lasso$lpd_trace)[1]-1),1]>=0)

#Plot Results
plot(or_lasso$B,ylab=expression(beta[lasso]),main="Lasso MAP Estimate of Regression Coefficients")

Park, T., & Casella, G. (2008). The Bayesian Lasso Journal of the American Statistical Association, 103 (482), 681-686 DOI: 10.1198/016214508000000337
Figueiredo M.A.T. (2003). Adaptive sparseness for supervised learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25 (9) 1150-1159. DOI: http://dx.doi.org/10.1109/tpami.2003.1227989
Better Shrinkage Priors:
Armagan, A., Dunson, D. B., & Lee, J. Generalized double Pareto shrinkage. Statistica Sinica.

The post EM Algorithm for Bayesian Lasso R Cpp Code appeared first on Lindons Log.

To leave a comment for the author, please follow the link and comment on his blog: Lindons Log » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

In case you missed it: August 2014 Roundup

Fri, 2014-09-05 09:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from August of particular interest to R users:  

R is the most popular software in the KDNuggets poll for the 4th year running.

The frequency of R user group meetings continues to rise, and there are now 147 R user groups worldwide.

A video interview with David Smith, Chief Community Officer at Revolution Analytics, at the useR! 2014 conference.

In a provocative op-ed, Norm Matloff worries that Statistics is losing ground to Computer Science.

A new certification program for Revolution R Enterprise.

An interactive map of R user groups around the world, created with R and Shiny.

Using R to generate calendar entries (and create photo opportunities).

Integrating R with production systems with Domino.

The New York Times compares data science to janitorial work.

Rdocumentation.org provides search for CRAN, GitHub and BioConductor packages and publishes a top-10 list of packages by downloads.

An update to the "airlines" data set (the "iris" of Big Data) with flights through the end of 2012. 

A consultant compares the statistical capabilities of R, Matlab, SAS, Stata and SPSS.

Using heatmaps to explore correlations in financial portfolios.

Video of John Chambers' keynote at the useR! 2014 conference on the interfaces, efficiency, big data and the history of R.

CIO magazine says the open source R language is becoming pervasive.

Reviews of some presentations at the JSM 2014 conference that used R.

GRAN is a new R package to manage package repositories to support reproducibility.

The ASA launches a PR campaign to promote the role of statisticians in society.

Video replay of the webinar Applications in R, featuring examples from several companies using R.

General interest stories (not related to R) in the past month included: dance moves from Japan, an earthquake's signal in personal sensors, a 3-minute movie in less than 4k, smooth time-lapse videos, representing mazes as trees and the view from inside a fireworks display.

As always, thanks for the comments and please send any suggestions to me at david@revolutionanalytics.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

R: Image Analysis using EBImage

Fri, 2014-09-05 08:55

(This article was first published on Analysis with Programming, and kindly contributed to R-bloggers)

Currently, I am taking Statistics for Image Analysis in my master's program, and have been exploring this topic in R. One package with capabilities in this field is EBImage from Bioconductor, which will be showcased in this post.

Installation
For those using Ubuntu, you are likely to encounter this error:

It has something to do with the tiff.h C header file, but it's not that serious, since mytechscribblings has an effective solution for this; do check that out.

Importing Data
To import a raw image, consider the following code:
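A minimal sketch of that import step (the file name is a placeholder; substitute your own image):

library(EBImage)
Image <- readImage("photo.jpg")  # read the raw image into an Image object
display(Image)                   # view it interactively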

Output of display(Image).
Yes, this is the photo that we are going to use for our analysis. Needless to say, that's me and my friends. In the succeeding sections we will do image manipulation and other processing.

Image Properties
So what do we get from our raw image? To answer that, simply run print(Image). This will return the properties of the image, including the array of pixel values. With this information, we can apply mathematical and statistical operations to enhance the image.

There are two sections (Summary and array of the pixels) in the above output, with the following entries for the first section:

Table 1: Information from the 1st section of print(Image).

Code              Value        Description
colormode         Color        The type (Color/Grayscale) of the image.
storage.mode      double       Type of values in the array.
dim               1984 1488 3  Dimension of the array, (x, y, z).
nb.total.frames   3            Number of channels in each pixel (the z entry in dim).
nb.render.frames  1            Number of channels rendered.
The second section contains the values obtained by mapping the pixels of the image to the real line between 0 and 1 (inclusive). The two extremes of this interval, 0 and 1, correspond to black and white, respectively. Hence, pixels with values closer to either of these end points are expected to be darker or lighter, respectively. And because the pixels are contained in a large array, we can use all the matrix manipulations available in R for processing.

Adjusting Brightness
It is better to start with the basics, one of which is brightness. As discussed above, brightness can be manipulated using + or -:
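Something along these lines (the offsets are arbitrary choices; Image1 and Image2 match the captions below):

Image1 <- Image + 0.2   # lighter: shift pixel values toward 1 (white)
Image2 <- Image - 0.2   # darker: shift pixel values toward 0 (black)
display(Image1)
display(Image2)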

Table 2: Adjusting brightness. Left: output of display(Image1) (lighter); right: output of display(Image2) (darker).
Adjusting Contrast
Contrast can be manipulated using the multiplication operator (*):
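For example (the factors are arbitrary; Image3 and Image4 correspond to the low- and high-contrast panels below):

Image3 <- Image * 0.5   # low: compress pixel values
Image4 <- Image * 2     # high: stretch pixel values
display(Image3)
display(Image4)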

Table 3: Adjusting contrast. Left: output of display(Image3) (low); right: output of display(Image4) (high).
Gamma Correction
Gamma correction is the name of a nonlinear operation used to code and decode luminance or tristimulus values in video or still image systems, defined by the following power-law expression: \begin{equation}\nonumber V_{\mathrm{out}} = AV_{\mathrm{in}}^{\gamma} \end{equation} where $A$ is a constant and the input and output values are non-negative real values; in the common case of $A = 1$, inputs and outputs are typically in the range 0-1. A gamma value $\gamma < 1$ is sometimes called an encoding gamma (Wikipedia, Ref. 1).
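In R this is just exponentiation of the pixel array, here using the two gamma values shown in Table 4 (Image5 and Image6 match the captions below):

Image5 <- Image ^ 2     # gamma = 2
Image6 <- Image ^ 0.7   # gamma = 0.7
display(Image5)
display(Image6)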

Table 4: Adjusting gamma correction. Left: output of display(Image5) ($\gamma = 2$); right: output of display(Image6) ($\gamma = 0.7$).
Cropping
Slicing the array of pixels simply means cropping the image.
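A sketch of a crop via array slicing (the index ranges are arbitrary):

ImageCrop <- Image[400:800, 200:600, ]  # keep a rectangular region, all channels
display(ImageCrop)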

Output of the above code.
Spatial Transformation
Spatial manipulations like rotation (rotate), flipping (flip), and translation (translate) are also available in the package. Check this out:
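A sketch of those three operations (the angle and offsets are arbitrary):

ImageRot  <- rotate(Image, 30)              # rotate by 30 degrees
ImageFlip <- flip(Image)                    # flip the image
ImageTran <- translate(Image, c(100, 50))   # shift by 100 and 50 pixels
display(ImageRot); display(ImageFlip); display(ImageTran)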


Color Management
The array of pixels has three axes in its dimension; in our case it is 1984 x 1488 x 3. The third axis is the slot for the three channels: Red, Green and Blue, or RGB. Hence, transforming the color mode from Color to Grayscale means disjoining the three channels from a single rendered frame (three channels for each pixel) into three separate arrays of pixels for the red, green, and blue frames.
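The conversion itself can be sketched as:

colorMode(Image) <- Grayscale   # reinterpret the three channels as three grayscale frames
display(Image)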

Table 5: Color mode transformation. Panels: original, red channel, green channel, blue channel.
To revert the color mode, simply run
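which, in EBImage terms, is presumably just the reverse assignment (a sketch, since the original code chunk did not survive extraction):

colorMode(Image) <- Color   # back to a single rendered RGB frame
display(Image)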

Filtering
In this section, we will do smoothing/blurring using a low-pass filter, and edge detection using a high-pass filter. In addition, we will also investigate the median filter for removing noise.
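The filters themselves can be sketched roughly as follows (kernel sizes are arbitrary choices; medFltr matches the caption in Table 7, where the post applies it to a separate noisy image from Google):

# Low-pass (blur): average over a disc-shaped neighborhood
fLow <- makeBrush(21, shape = "disc", step = FALSE)
fLow <- fLow / sum(fLow)
imgBlur <- filter2(Image, fLow)

# High-pass (edge detection): a simple Laplacian-style kernel
fHigh <- matrix(1, nrow = 3, ncol = 3)
fHigh[2, 2] <- -8
imgEdge <- filter2(Image, fHigh)

# Median filter to remove noise
medFltr <- medianFilter(Image, size = 3)
display(imgBlur); display(imgEdge); display(medFltr)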

Table 6: Image filtering. Left: low-pass (blur); right: high-pass.
Table 7: Median filter. Left: original (from Google, link here); right: filtered, output of display(medFltr).
For comparison, I ran the median filter on the first neighborhood in Mathematica, and I got this:
Clearly, Mathematica does better enhancement than R for this particular filter. But R has a good foundation already, as we have witnessed with EBImage. There are still lots of interesting functions in the package that are worth exploring; I suggest you check them out.

For the meantime, we will stop here, hoping we can play more with this topic in a succeeding post.

References
  1. Gamma Correction. Wikipedia. Retrieved August 31, 2014.
  2. Gregoire Pau, Oleg Sklyar, Wolfgang Huber (2014). Introduction to EBImage, an image processing and analysis toolkit for R.

To leave a comment for the author, please follow the link and comment on his blog: Analysis with Programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Keep your team informed with “slackr”

Fri, 2014-09-05 08:30

(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

Karl Broman did a spiffy job summarizing a good number of the options available to R folk if they want to get notifications from R. You can generate OS X notifications as well. If you're using Slack for team coordination and communications, you've got a new option - slackr - which also enables you to go a bit deeper than just notifications, letting you push R output to the service for sharing results or observations.

What is Slack?

Slack (@SlackHQ) is a cloud-hosted, team messaging platform that lets you set up public & private internal channels for various types of communications, including markdown, text, pictures, video, links and more. They also offer connectivity with many outside services (e.g. github, twitter, etc.). The service is painless to set up, and there are desktop and mobile applications for most platforms (as well as the web interface). It has the utility of e-mail, twitter and skype (and more) combined, and their API makes it possible to create your own integrations.

While their full API affords more flexibility, they have simple "webhook"-type integrations that are more lightweight and make quick work of connecting to Slack. The slackr package takes advantage of the webhook API to connect R with the service. To use it, you'll first need to sign up for the service, get your teammates to join, and then set up the webhook integration.

Why slackr?

If you’ve ever used a plaintext messaging tool (e.g. Skype) to try to share R code snippets or output, you know the drill: select what was sent to you; copy, then paste into a text editor so it’s actually readable. The slackr package eliminates those steps by letting you execute one function - slackr() - and send any R output/expression to any Slack team channel or team member. Here’s an example, using the code from the lm stats package function example code (hit up ?lm in R to see that directly):

# setup the slackr API. this example assumes a .slackr config file in ~/
slackrSetup()

# run the lm() example
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
lm.D90 <- lm(weight ~ group - 1) # omitting intercept

# share the results with Jay
slackr(anova(lm.D9), summary(lm.D90), channel="@jayjacobs")

Here’s what will be seen in the slack channel:

The slackr() function can also be setup to do trivial notifications (and the various Slack apps and integrations can notify you anywhere in an out of slack, if that’s your cup of tea):

performSomeLongClassificationTask()

# notify me directly
slackr("Classification complete", channel="@hrbrmstr")

# or notify the default channel
slackr("Classification complete")

With slackrSetup(), you can choose the default channel and username, as well as select the icon being used (overriding the default one during the initial webhook setup). The config file (mentioned earlier) is pretty straightforward:

token: YOUR_SLACK_API_TOKEN
channel: #general
username: slackr
icon_emoji: :information_source:
incoming_webhook_url: https://YOURTEAM.slack.com/services/hooks/incoming-webhook?

and definitely beats passing all those in as parameters (and, doesn’t have to live in ~/.slackr if you want to use an alternate location or have multiple profiles for multiple teams).

The webhook API is text/rich-text-only, but the full API lets you send anything. For that, full OAuth setup is required, and since it's super-simple to just drag graphics from an RStudio window to Slack, the extra functionality hasn't made it to the slackr project TODO list yet, but I can definitely see a ggslack() or even a dev.slack() graphics device (ala png()) function or two making their way to the package in the not-too-distant future.

How to get slackr

The slackr package is up on github and may make it to CRAN next week. Over the coming weeks we’ll probably add the ability to consume output from slack channels and definitely welcome any issues, comments or feature requests.

To leave a comment for the author, please follow the link and comment on his blog: Data Driven Security. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Error notifications from R

Fri, 2014-09-05 00:44

(This article was first published on The stupidest thing... » R, and kindly contributed to R-bloggers)

I’m enthusiastic about having R notify me when my script is done.

But among my early uses of this, my script threw an error, and I never got a text or pushbullet about that. And really, I’m even more interested in being notified about such errors than anything else.

It’s relatively easy to get notified of errors. At the top of your script, include code like options(error = function() { } )

Fill in the function with your notification code. If there’s an error, the error message will be printed and then that function will be called. (And then the script will halt.)

You can use geterrmessage() to grab the error message to include in your notification.

For example, if you want to use RPushbullet for the notification, you could put, at the top of your script, something like this:

options(error = function() {
  library(RPushbullet)
  pbPost("note", "Error", geterrmessage())
})

Then if the script gives an error, you’ll get a note with title “Error” and with the error message as the body of the note.

Update: I knew I’d heard about this sort of thing somewhere, but I couldn’t remember where. Duh; Rasmus mentioned it on twitter just a couple of days ago! Fortunately, he reminded me of that in the comments below.


To leave a comment for the author, please follow the link and comment on his blog: The stupidest thing... » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Spell Checker for R…qdap::check_spelling

Thu, 2014-09-04 18:43

(This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers)

I often have had requests for a spell checker for R character vectors. The utils::aspell function can be used to check spelling but many Windows users have reported difficulty with the function.

I came across an article on spelling in R entitled “Watch Your Spelling!” by Kurt Hornik and Duncan Murdoch. The paper walks us through definitions of spell checking, history, and a suggested spell checker implementation for R. A terrific read. Hornik & Murdoch (2010) end with the following call:

Clearly, more work will be needed: modern statistics needs better lexical resources, and a dictionary based on the most frequent spell check false alarms can only be a start. We hope that this article will foster community interest in contributing to the development of such resources, and that refined domain specific dictionaries can be made available and used for improved text analysis with R in the near future (p. 28).

I answered a question on stackoverflow.com a few months back that led to creating a suite of spell checking functions. The original functions used an agrep approach that was slow and inaccurate. I discovered Mark van der Loo's terrific stringdist package to do the heavy lifting. It calculates string distances very quickly with various methods.

The rest of this blog post is meant as a minimal introduction to qdap‘s spell checking functions. A video will lead you through most of the process and accompanying scripts are provided.

Primitive Spell Checking Function

The which_misspelled function is a low level function that basically determines if each word of a single string is in a dictionary. It optionally gives suggested corrections.

library(qdap)

x <- "Robots are evl creatres and deserv exterimanitation."
which_misspelled(x, suggest=FALSE)
which_misspelled(x, suggest=TRUE)

Interactive Spell Checking

Typically a user will want to use the interactive spell checker (spell_checker_interactive) as it is more flexible and accurate.

dat <- DATA$state
dat[1] <- "Jasperita I likedd the cokie icekream"
dat

## [1] "Jasperita I likedd the cokie icekream"
## [2] "No it's not, it's dumb."
## [3] "What should we do?"
## [4] "You liar, it stinks!"
## [5] "I am telling the truth!"
## [6] "How can we be certain?"
## [7] "There is no way."
## [8] "I distrust you."
## [9] "What are you talking about?"
## [10] "Shall we move on? Good then."
## [11] "I'm hungry. Let's eat. You already?"

(o <- check_spelling_interactive(dat))
preprocessed(o)
fixit <- attributes(o)$correct
fixit(dat)

A More Realistic Usage

m <- check_spelling_interactive(mraja1spl$dialogue[1:75])
preprocessed(m)
fixit <- attributes(m)$correct
fixit(mraja1spl$dialogue[1:75])

References

Hornik, K., & Murdoch, D. (2010). Watch Your Spelling!. The R Journal, 3(2), 22-28.


To leave a comment for the author, please follow the link and comment on his blog: TRinker's R Blog » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Bayesian First Aid: Poisson Test

Thu, 2014-09-04 17:21

(This article was first published on Publishable Stuff, and kindly contributed to R-bloggers)

As the normal distribution is sort of the default choice when modeling continuous data (but not necessarily the best choice), the Poisson distribution is the default when modeling counts of events. Indeed, when all you know is the number of events during a certain period it is hard to think of any other distribution, whether you are modeling the number of deaths in the Prussian army due to horse kicks or the number of goals scored in a football game. Like the t.test in R there is also a poisson.test that takes one or two samples of counts and spits out a p-value. But what if you have some counts, but don't significantly feel like testing a null hypothesis? Stay tuned!

Bayesian First Aid is an attempt at implementing reasonable Bayesian alternatives to the classical hypothesis tests in R. For the rationale behind Bayesian First Aid see the original announcement. The development of Bayesian First Aid can be followed on GitHub. Bayesian First Aid is a work in progress and I’m grateful for any suggestion on how to improve it!

The Model

The original poisson.test function that comes with R is rather limited, and that makes it fairly simple to construct the Bayesian alternative. However, at first sight poisson.test may look more limited than it actually is. The one sample version just takes one count of events $x$ and the number of periods $T$ during which the number of events were counted. If your ice cream truck sold 14 ice creams during one day you would call the function like poisson.test(x = 14, T = 1). This seems limited, what if you have a number of counts, say you sell ice cream during a whole week, what to do then? The trick here is that you can add up the counts and the number of time periods and this will be perfectly fine. The code below will still give you an estimate for the underlying rate of ice cream sales per day:

ice_cream_sales = c(14, 16, 9, 18, 10, 6, 13)
poisson.test(x = sum(ice_cream_sales), T = length(ice_cream_sales))

Note that this only works if the counts are well modeled by the same Poisson distribution. If the ice cream sales are much higher on the weekends, adding up the counts might not be a good idea. poisson.test is also limited in that it can only handle two counts; you can compare the performance of your ice cream truck with just one competitor's, no more. As the Bayesian alternative accepts the same input as poisson.test it inherits some of its limitations (but it can easily be extended, read on!). The model for the Bayesian First Aid alternative to the one sample poisson.test is:

Here $x$ is again the count of events, $T$ is the number of periods, and $\lambda$ is the parameter of interest, the underlying rate at which the events occur. In the two sample case the one sample model is just fitted separately to each sample.

As $x$ is assumed to be Poisson distributed, all that is required to turn this into a fully Bayesian model is a prior on $\lambda$. In the literature there are two common recommendations for an objective prior for the rate of a Poisson distribution. The first one is $\lambda \propto 1 / \lambda$, which is the same as $\log(\lambda) \propto \text{const}$ and is proposed, for example, by Villegas (1977). While it can be argued that this prior is as non-informative as possible, it is problematic in that it will result in an improper posterior when the number of events is zero ($x = 0$). I feel that seeing zero events should tell the model something and, at least, not cause it to blow up. The second proposal is Jeffreys prior $\lambda \propto 1 / \sqrt{\lambda}$ (as proposed by the great BUGS Book), which has a slight positive bias compared to the former prior but handles counts of zero just fine. The difference between these two priors is very small and will only matter when you have very few counts. Therefore the Bayesian First Aid alternative to poisson.test uses Jeffreys prior.

So if the model uses Jeffreys prior, what is the $\lambda \sim \text{Gamma}(0.5, 0.00001)$ doing in the model definition? Well, the computational framework underlying Bayesian First Aid is JAGS and in JAGS you build your model using probability distributions. The Jeffreys prior is not a proper probability distribution but it turns out that it can be reasonably well approximated by $\text{Gamma}(0.5, \epsilon)$ with $\epsilon \rightarrow 0$ (in the same way as $\lambda \propto 1 / \lambda$ can be approximated by $\text{Gamma}(\epsilon, \epsilon)$ with $\epsilon \rightarrow 0$).

The bayes.poisson.test Function

The bayes.poisson.test function accepts the same arguments as the original poisson.test function, you can give it one or two counts of events. If you just ran poisson.test(x=14, T=1), prepending bayes. runs the Bayesian First Aid alternative and prints out a summary of the model result (like bayes.poisson.test(x=14, T=1)). By saving the output, for example, like fit <- bayes.poisson.test(x=14, T=1) you can inspect it further using plot(fit), summary(fit) and diagnostics(fit).

To demonstrate the use of bayes.poisson.test I will use data from Boice and Monson (1977) on the number of women diagnosed with breast cancer in one group of 1,047 tuberculosis patients that had received on average 102 X-ray exams and one group of 717 tuberculosis patients whose treatment had not required a large number of X-ray exams. Here is the full data set:

Here WY stands for woman-years (as if woman-years would be different from man-years, or person-years…). While the data is from a relatively old article, we are going to replicate a more recent reanalysis of that data from the article Testing the Ratio of Two Poisson Rates by Gu et al. (2008). They tested the alternative hypothesis that the rate of breast cancer per person-year would be 1.5 times greater in the group that was X-rayed compared to the control group. They tested it like this:

no_cancer_cases <- c(41, 15)
# person-centuries rather than person-years to get the estimated rate
# on a more interpretable scale.
person_centuries <- c(28.011, 19.025)
poisson.test(no_cancer_cases, person_centuries, r = 1.5, alternative = "greater")

## 
##  Comparison of Poisson rates
## 
## data:  no_cancer_cases time base: person_centuries
## count1 = 41, expected count1 = 41.61, p-value = 0.291
## alternative hypothesis: true rate ratio is greater than 1.5
## 95 percent confidence interval:
##  1.098   Inf
## sample estimates:
## rate ratio 
##      1.856

and concluded that “There is not enough evidence that the incidence rate of breast cancer in the X-ray fluoroscopy group is 1.5 times to the incidence rate of breast cancer in control group”. It is oh-so-easy to interpret this as saying that there is no evidence that the incidence rate is more than 1.5 times higher, but this is wrong, and the Bayesian First Aid alternative makes this clear:

library(BayesianFirstAid)
bayes.poisson.test(no_cancer_cases, person_centuries, r = 1.5, alternative = "greater")

## Warning: The argument 'alternative' is ignored by bayes.poisson.test

## 
##  Bayesian Fist Aid poisson test - two sample
## 
## number of events: 41 and 15, time periods: 28.011 and 19.025
## 
##   Estimates [95% credible interval]
## Group 1 rate: 1.5 [1.1, 1.9]
## Group 2 rate: 0.80 [0.43, 1.2]
## Rate ratio (Group 1 rate / Group 2 rate):
##   1.8 [1.1, 3.4]
## 
## The event rate of group 1 is more than 1.5 times that of group 2 by a probability
## of 0.754 and less than 1.5 times that of group 2 by a probability of 0.246 .

The warning here is nothing to worry about, there is no need to specify what alternative is tested and bayes.poisson.test just tells you that. So sure, the evidence is far from conclusive, but given the data and the model there is a 75% probability that the incidence rate is more than 1.5 times higher in the X-rayed group. That is, rather than just saying that there is not enough evidence we have quantified how much evidence there is, and the evidence actually slightly favors the alternative hypothesis. This is also easily seen in the default plot of bayes.poisson.test:

plot( bayes.poisson.test(no_cancer_cases, person_centuries, r = 1.5) )

Comparing More than Two Groups

Back to the ice cream truck: say that you sold 14 ice creams in one day and your competitors Karl and Anna sold 22 and 7 ice creams, respectively. How would you estimate and compare the underlying rates of ice cream sales for these three trucks when bayes.poisson.test only accepts counts from two groups? When you want to go off the beaten path, the model.code function is your friend, as it takes the result from a Bayesian First Aid method and returns R and JAGS code that replicates the analysis you just ran. In this case, start by running the model with two counts and then print out the model code:

fit <- bayes.poisson.test(x = c(14, 22), T = c(1, 1))
model.code(fit)

### Model code for the Bayesian First Aid two sample Poisson test ###
require(rjags)

# Setting up the data
x <- c(14, 22)
t <- c(1, 1)

# The model string written in the JAGS language
model_string <- "model {
  for(group_i in 1:2) {
    x[group_i] ~ dpois(lambda[group_i] * t[group_i])
    lambda[group_i] ~ dgamma(0.5, 0.00001)
    x_pred[group_i] ~ dpois(lambda[group_i] * t[group_i])
  }
  rate_diff <- lambda[1] - lambda[2]
  rate_ratio <- lambda[1] / lambda[2]
}"

# Running the model
model <- jags.model(textConnection(model_string), data = list(x = x, t = t), n.chains = 3)
samples <- coda.samples(model, c("lambda", "x_pred", "rate_diff", "rate_ratio"), n.iter=5000)

# Inspecting the posterior
plot(samples)
summary(samples)

Just copy-n-paste this code directly into an R script and make the following changes:

  • Add the data for Anna's ice cream truck:
    • x <- c(14, 22) → x <- c(14, 22, 7)
    • t <- c(1, 1) → t <- c(1, 1, 1)
  • Change the JAGS code so that you iterate over three groups instead of two:
    • for(group_i in 1:2) { → for(group_i in 1:3) {

And that’s it! Now we can run the model script and take a look at the estimated rates of ice cream sales for the three trucks.

plot(samples)

If you want to compare many groups you should perhaps consider using a hierarchical Poisson model. (Pro tip: John K. Kruschke’s Doing Bayesian Data Analysis has a great chapter on hierarchical Poisson models.)

References

Boice, J. D., & Monson, R. R. (1977). Breast cancer in women after repeated fluoroscopic examinations of the chest. Journal of the National Cancer Institute, 59(3), 823-832. link to article (unfortunately behind paywall)

Gu, K., Ng, H. K. T., Tang, M. L., & Schucany, W. R. (2008). Testing the ratio of two poisson rates. Biometrical Journal, 50(2), 283-298. doi: 10.1002/bimj.200710403, pdf

Villegas, C. (1977). On the representation of ignorance. Journal of the American Statistical Association, 72(359), 651-654. doi: 10.2307/2286233

Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). The BUGS book: a practical introduction to Bayesian analysis. CRC Press. pdf of chapter 5 on Prior distributions

To leave a comment for the author, please follow the link and comment on his blog: Publishable Stuff. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

2014-02 Invertible Reproducible Documents

Thu, 2014-09-04 16:23

(This article was first published on Stat Tech » R, and kindly contributed to R-bloggers)

Reproducible documents provide an efficient way to produce reports by automatically generating content from code chunks within the report. The processing of a source document, that contains code chunks, to a final document, that contains automatically-generated content, is typically one way, with the resulting report being read-only. This report describes an experiment that attempts to make the final report document modifiable and attempts to invert the process from final document back to source document so that the modifications to the final document can be efficiently conveyed back to the original author of the report.

Eric Lim, Paul Murrell, and Finlay Thompson

Download.

To leave a comment for the author, please follow the link and comment on his blog: Stat Tech » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Shiny: R made interactive @ useR! 2014

Thu, 2014-09-04 14:13

(This article was first published on DataScience.LA » R, and kindly contributed to R-bloggers)

At useR! 2014, one of the most anticipated presentations was Joe Cheng's Shiny: R made interactive. It was but one of a fantastic series of talks by RStudio representatives; check out Winston Chang's ggvis: Interactive graphics in R for another great talk. Shiny begins with your R code and ends with the customer's view: an interactive, browser-accessible application. This enables you to place a great deal of power into your users' hands to customize their view and provides a variety of new and innovative ways to interact with the data, such as by tuning parameters and focusing on sub-populations. Essentially, Shiny functions by allowing you to "wire up" a reactive application, greatly speeding up the development of a data and analysis based web app.

Shiny comes out of the box with a number of standard industry UI widgets such as date inputs, file inputs, sliders, buttons – the standard widget toolbox. Shiny also leverages a simple panel-based layout manager which, in concert with the widgets, allows for a fluid development experience.
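As a flavor of what "wiring up" a reactive app looks like (a minimal sketch, not taken from the talk), a slider widget can drive a plot with just a few lines:

library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20)),
    mainPanel(plotOutput("hist"))
  )
)

server <- function(input, output) {
  output$hist <- renderPlot({
    # re-renders automatically whenever the slider changes
    hist(faithful$eruptions, breaks = input$bins, main = "Old Faithful eruptions")
  })
}

shinyApp(ui, server)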

Video Highlights

If you’re interested in bypassing Joe Cheng’s fantastic explanation and you want to cut right to his first demo, he presents a web application for k-means clustering on the iris data set:

Joe’s next demo involves a slick mapping component. Here you begin to see how Shiny apps have the potential to leverage intriguing data sources and provide first rate visualizations beginning only with ZIP codes and a limited amount of R code:

He then shows an amazing example of what an R developer with no previous web development experience can build using Shiny.

For his final demo, Joe embeds Shiny interactivity in documents written in R Markdown, combined with Yihui Xie’s knitr. This is demoed in the video below with an application that parses the logs of RStudio’s CRAN mirror and visualizes package download trends.
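As a rough sketch (not the CRAN-logs demo itself), an R Markdown file whose YAML header declares runtime: shiny becomes an interactive document, and Shiny inputs and render functions can then be used directly inside its R chunks:

# Inside the chunks of such a document you might write, for example:
#   sliderInput("n", "Sample size:", min = 10, max = 1000, value = 100)
#   renderPlot(hist(rnorm(input$n)))
# The document is launched with:
rmarkdown::run("interactive.Rmd")   # "interactive.Rmd" is a hypothetical file name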

Shiny is a fun and easy way to widen the audience interacting with your data analyses. It leverages fantastic modern programming paradigms to provide a system with sane defaults and powerful components to interface the web with R. Enjoy!

To leave a comment for the author, please follow the link and comment on his blog: DataScience.LA » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Long Memory and the Nile: Herodotus, Hurst and H

Thu, 2014-09-04 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

The ancient Egyptians were a people with long memories. The lists of their pharaohs went back thousands of years, and we still have the names and tax assessments of certain persons and institutions from the time of Ramesses II. When Herodotus began writing about Egypt and the Nile (~450 BC), the Egyptians, who knew that their prosperity depended on the river’s annual overflow, had been keeping records of the Nile’s high-water mark for more than three millennia. So it seems reasonable, and maybe even appropriate, that one of the first attempts to understand long memory in time series was motivated by the Nile.

The story of the British hydrologist and civil servant H.E. Hurst, who earned the nickname “Abu Nil”, Father of the Nile, for his 62-year career of measuring and studying the river, is now fairly well known. Pondering an 847-year record of Nile overflow data, Hurst noticed that the series was persistent, in the sense that heavy flood years tended to be followed by heavier-than-average flood years, while below-average flood years were typically followed by light flood years. Working from an ancient formula on optimal dam design, he devised the equation log(R/S) = K*log(N/2), where R is the range of the time series, S is the standard deviation of the year-to-year flood measurements, and N is the number of years in the series. Hurst noticed that the value of K for the Nile series, and for other series related to climate phenomena, tended to be about 0.73, consistently much higher than the 0.5 one would expect from independent observations and short autocorrelations.
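As a quick sketch (my own transcription, not code from the post), the relation can be solved for K directly; here R is taken as the range of the cumulative departures from the mean, the form used in rescaled range analysis and in the simpleHurst() function further down:

hurstK <- function(x) {
  y <- cumsum(x - mean(x))           # cumulative departures from the mean
  R <- max(y) - min(y)               # the range R
  S <- sd(x)                         # the standard deviation S
  log(R / S) / log(length(x) / 2)    # K from log(R/S) = K * log(N/2)
}
hurstK(rnorm(847))                   # independent noise, for comparison with the Nile's ~0.73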

Today, mostly due to the work of Benoit Mandelbrot, who rediscovered and popularized Hurst’s work in the early 1960s, Hurst’s Rescale/Range analysis and the calculation of the Hurst exponent (Mandelbrot renamed “K” to “H”) mark the demarcation point for the modern study of long memory time series. To investigate, let’s look at some monthly flow data taken at the Dongola measurement station, upstream from the high dam at Aswan. (Look here for the data, and here for a map and a nice Python-based analysis that covers some of the same ground as is presented below.) The data consists of average monthly flow measurements from January 1869 to December 1984.

head(nile_dat)
  Year  Jan  Feb  Mar Apr May  Jun  Jul   Aug   Sep  Oct  Nov  Dec
1 1869   NA   NA   NA  NA  NA   NA 2606  8885 10918 8699   NA   NA
2 1870   NA   NA 1146 682 545  540 3027 10304 10802 8288 5709 3576
3 1871 2606 1992 1485 933 731  760 2475  8960  9953 6571 3522 2419
4 1872 1672 1033  728 605 560  879 3121  8811 10532 7952 4976 3102
5 1873 2187 1851 1235 756 556 1392 2296  7093  8410 5675 3070 2049
6 1874 1340  847  664 516 466  964 3061 10790 11805 8064 4282 2904

To get a feel for the data, we plot a portion of the time series.

The pattern is very regular, and the short-term correlations are apparent. The following boxplots show the variation in monthly flow.
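Plots along these lines can be produced as follows (a sketch of my own, not the author's code, which is in the download linked at the end of the post):

# Build one long monthly series from the data frame shown above, then plot a
# decade of it and the month-by-month variation.
nile_ts <- ts(as.vector(t(as.matrix(nile_dat[, -1]))),
              start = c(1869, 1), frequency = 12)
plot(window(nile_ts, start = c(1871, 1), end = c(1880, 12)),
     ylab = "Mean monthly flow", main = "Nile flow at Dongola, 1871-1880")
boxplot(split(nile_ts, cycle(nile_ts)), names = month.abb,
        ylab = "Mean monthly flow", main = "Flow by month")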

Herodotus clearly knew what he was talking about when he wrote (The Histories: Book 2, 19):

I was particularly eager to find out from them (the Egyptian priests) why the Nile starts coming down in a flood at the summer solstice and continues flooding for a hundred days, but when the hundred days are over the water starts to recede and decreases in volume, with the result that it remains low for the whole winter, until the summer solstice comes round again.

To construct a long memory time series we aggregate the monthly flows to produce a yearly time series of total flow (dropping the years 1869 and 1870 because of the missing values).
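A sketch of this step, consistent with the nile.yr.ts name used below (not necessarily the author's exact code):

yearly <- rowSums(nile_dat[, -1])               # NA for any year with a missing month
nile.yr.ts <- ts(yearly[!is.na(yearly)], start = 1871)
acf(nile.yr.ts, lag.max = 30)                   # the ACF discussed next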

Plotting the ACF, we see that the autocorrelations persist for nearly 20 years!

So, let's compute the Hurst exponent. For our first try, we use a simple function suggested by an example in Bernhard Pfaff's classic text, Analysis of Integrated and Cointegrated Time Series with R.

simpleHurst <- function(y){
  sd.y <- sd(y)
  m <- mean(y)
  y <- y - m
  max.y <- max(cumsum(y))
  min.y <- min(cumsum(y))
  RS <- (max.y - min.y)/sd.y
  H <- log(RS) / log(length(y))
  return(H)
}

simpleHurst(x)
[1] 0.7348662

Bingo! 0.73 - just what we were expecting for a long memory time series. Unfortunately, things are not so simple. The function hurst() from the pracma package, which performs a much more careful calculation than simpleHurst(), yields:

hurst(nile.yr.ts)
[1] 1.041862

This is mildly distressing, since H is supposed to be bounded above by 1. The function hurstexp() from the same package, which is based on Weron's MATLAB code and implements the small-sample correction, seems to solve that problem.

> hurstexp(nile.yr.ts)
Corrected R over S Hurst exponent:   1.041862
Theoretical Hurst exponent:          0.5244148
Corrected empirical Hurst exponent:  0.7136607
Empirical Hurst exponent:            0.6975531

0.71 is a more reasonable result. However, as a post on the Timely Portfolio blog pointed out a few years ago, computing the Hurst exponent is an estimation problem, not merely a calculation. So, where are the error bars?

I am afraid that confidence intervals, and a look at several other methods available in R for estimating the Hurst exponent, will have to wait for another time. In the meantime, the following references may be of interest. The first two are light reading from early champions of applying Rescale/Range analysis and the Hurst exponent to financial time series. The book by Mandelbrot and Hudson is especially interesting for its sketch of the historical background. The last two are relatively early papers on the subject.

My code for this post may be downloaded here: Long_memory_Post.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Looking For Life

Thu, 2014-09-04 06:10

(This article was first published on Ripples, and kindly contributed to R-bloggers)

 

Machines take me by surprise with great frequency (Alan Turing)

Imagine an 8×8 grid in which cells can be alive (colored black) or empty (colored light gray). As with one-dimensional cellular automata, the next state of a cell is a function of the states of the cell’s nearest neighbors (the only neighbors we consider are the eight cells that form the immediate perimeter of a cell). To evolve this community of cells over time, the rules are quite simple:

  • If a live cell has fewer than two live neighbors, it dies (loneliness)
  • If a live cell has more than three live neighbors, it dies (overcrowding)
  • If a dead cell has exactly three live neighbors, it comes to life (reproduction)
  • Otherwise, a cell stays as it is (stasis)

These are the rules of Conway’s famous Game of Life, a cellular automaton invented in the late 1960s by John Conway while trying to refine the description of a cellular automaton to the simplest one that could support universal computation. In 1970 Martin Gardner described Conway’s work in his Scientific American column. Gardner’s article inspired many people around the world to experiment with Conway’s game, and this global collaborative effort eventually produced the final pieces showing how the Game of Life could support universal computation.

In this experiment I will try to find interesting objects that can appear in the Game of Life: static objects (which remain the same over time), periodic objects (which change but return to their initial configuration after a fixed number of iterations), or moving objects (which travel through the grid before disappearing). Why are they interesting? Because these are the kinds of objects required to perform computations.

The plan is simple: I will generate some random grids and evolve them over a significant number of iterations. After doing this, I will check the grids that still have some live cells. Will I find what I am looking for?

I generated 81 grids in which live cells are randomly located, using binomial random variables with probability i/80, for i from 0 (all cells empty) to 80 (all cells alive). This is a quick way to try populations with a wide range of live cells. I measure the percentage of live cells after each iteration, and I analyze the grids that still have live cells after all iterations. This is what happens after 150 iterations:

I find some interesting objects. Since I keep them in my script, I can list them with ls(pattern= "Conway", all.names = TRUE). Two of them are especially interesting because they are not static; they are the ones that produce the non-constant lines in the previous plot.

The first one is a periodic object which reproduces itself every 3 iterations:

The second one is a bit more complex: after 8 iterations it reappears rotated:

What kind of calculations can be done with these objects? I don’t know yet, but let’s give it time. Do you want to look for Life?

library(ggplot2)
library(scales)
library(gridExtra)

SumNeighbors = function (m) #Summarizes number of alive neighbors for each cell
{
  rbind(m[-1,],0)+rbind(0,m[-nrow(m),])+cbind(m[,-1],0)+cbind(0,m[,-ncol(m)])+
    cbind(rbind(m[-1,],0)[,-1],0)+cbind(0, rbind(0,m[-nrow(m),])[,-nrow(m)])+
    cbind(0,rbind(m[-1,],0)[,-nrow(m)])+cbind(rbind(0,m[-nrow(m),])[,-1],0)
}

NextNeighborhood = function (m) #Calculates next state for each cell according to Conway's rules
{
  (1-(SumNeighbors(m)<2 | SumNeighbors(m)>3)*1)-(SumNeighbors(m)==2 & m==0)*1
}

splits=80 #Number of different populations to simulate. Each population is initialized randomly
          #according to a binomial with probability i/splits with i from 0 to splits
iter=150
results=data.frame()
rm(list = ls(pattern = "Conway")) #Remove previous solutions (don't worry!)

for (i in 0:splits)
{
  z=matrix(rbinom(size=1, n=8^2, prob=i/splits), nrow=8); z0=z
  results=rbind(results, c(i/splits, 0, sum(z)/(nrow(z)*ncol(z))))
  for(j in 1:iter)
  {
    z=NextNeighborhood(z)
    results=rbind(results, c(i/splits, j, sum(z)/(nrow(z)*ncol(z))))
  }
  #Save interesting solutions
  if (sum(z)/(nrow(z)*ncol(z))>0) assign(paste("Conway", format(i/splits, nsmall = 4), sep=""), z)
}
colnames(results)=c("start", "iter", "sparsity")

#Plot results of simulation
opt1=theme(panel.background = element_rect(fill="gray85"),
           panel.grid.minor = element_blank(),
           panel.grid.major.x = element_blank(),
           panel.grid.major.y = element_line(color="white", size=.5, linetype=2),
           plot.title = element_text(size = 45, color="black"),
           axis.title = element_text(size = 24, color="black"),
           axis.text = element_text(size=20, color="black"),
           axis.ticks = element_blank(),
           axis.line = element_line(colour = "black", size=1))

ggplot(results, aes(iter, sparsity, group = start))+
  geom_path(size=.8, alpha = 0.5, colour="black")+
  scale_x_continuous("Iteration", expand = c(0, 0), limits=c(0, iter), breaks = seq(0,iter,10))+
  scale_y_continuous("Alive Cells", labels = percent, expand = c(0, 0), limits=c(0, 1), breaks = seq(0, 1,.1))+
  labs(title = "Conway's Game Of Life Simulation")+opt1

#List of all interesting solutions
ls(pattern= "Conway", all.names = TRUE)

#Example to plot the evolution of an interesting solution (in this case "Conway0.5500")
require(reshape)
opt=theme(legend.position="none",
          panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text =element_blank())
p1=ggplot(melt(Conway0.5500), aes(x=X1, y=X2))+geom_tile(aes(fill=value), colour="white", lwd=2)+
  scale_fill_gradientn(colours = c("gray85", "black"))+opt
p2=ggplot(melt(NextNeighborhood(Conway0.5500)), aes(x=X1, y=X2))+geom_tile(aes(fill=value), colour="white", lwd=2)+
  scale_fill_gradientn(colours = c("gray85", "black"))+opt
p3=ggplot(melt(NextNeighborhood(NextNeighborhood(Conway0.5500))), aes(x=X1, y=X2))+geom_tile(aes(fill=value), colour="white", lwd=2)+
  scale_fill_gradientn(colours = c("gray85", "black"))+opt
p4=ggplot(melt(NextNeighborhood(NextNeighborhood(NextNeighborhood(Conway0.5500)))), aes(x=X1, y=X2))+geom_tile(aes(fill=value), colour="white", lwd=2)+
  scale_fill_gradientn(colours = c("gray85", "black"))+opt

#Arrange four plots in a 2x2 grid
grid.arrange(p1, p2, p3, p4, ncol=2)
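As a quick sanity check of the rules (a toy example of my own, not one of the 81 random grids above), the classic three-cell “blinker” is a periodic object with period 2 under the NextNeighborhood() function defined in the script:

blinker <- matrix(0, nrow = 5, ncol = 5)
blinker[3, 2:4] <- 1                 # a horizontal line of three live cells
step1 <- NextNeighborhood(blinker)   # becomes a vertical line
step2 <- NextNeighborhood(step1)     # and flips back
all(step2 == blinker)                # TRUE: periodic with period 2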

To leave a comment for the author, please follow the link and comment on his blog: Ripples. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs